Data Elixir - Issue 492
07-09-2024
ISSUE 492 · July 9, 2024

In the News

AI Revolutionized Protein Science but Didn’t End It
Three years ago, Google’s AlphaFold pulled off the biggest artificial intelligence breakthrough in science to date. In doing so, it accelerated molecular research and kindled deep questions about why we do science. Here's the story of AlphaFold and what it means for the future of AI in science, and especially, biology.

Data Strategy

The Amazon Weekly Business Review
In the latest post in his Becoming Data Driven series, Cedric Chin explores the weekly meeting at Amazon where leaders look at 400-500 metrics in a single hour. It's a lot of data but that's the number of levers they have available to run the business. Here's a breakdown of the metrics, how the meeting works, and how it helps Amazon win.

PRESENTED BY Observable

See the benefits of building dashboards with code
Everyone who has used a traditional BI tool knows their limitations. When you build with code, you can create bespoke, expressive, and interactive dashboards and data apps. Learn more about how code improves data modeling, unlocks more precise layouts, and allows you to create unique chart types.

Posts & Tutorials

Boosting vs. semi-supervised learning
While gradient boosted algorithms are amazing, they aren't a silver bullet for everything - especially when you're dealing with a dataset that only has a small set of labels. For those use-cases, this video shows why semi-supervised learning techniques can be a better approach.

Python's Sparse Array Ecosystem
A sparse array is a large dataset that contains a significant number of zero or null values. An efficient way to work with sparse arrays is to only keep track of the non-zero elements but doing that is non-trivial. This post explores some options for working with sparse arrays in Python, including pros/cons of each.

How to Interview and Hire ML/AI Engineers
Great post for anyone interested in either hiring or being hired into an ML/AI Engineer role.
Covers technical and non-technical qualities to assess, phone screens, interview loops, debriefs, and finally, opinionated thoughts on what makes a good hire.

Tools & Code

Fun with Positron
Positron is a next-generation data science IDE that combines the best of RStudio and Visual Studio Code. It's a new IDE but already, it offers first-class support for R and/or Python, support for VS Code compatible extensions, and a focus on data science workflows. This post is a nice introduction to Positron's features, settings, key extensions, and more.

GraphRAG: New tool for complex data discovery
GraphRAG is a graph-based approach to retrieval-augmented generation (RAG) that enables question-answering over private or previously unseen datasets. This is a Microsoft Research project that was introduced a few months ago and was recently released on GitHub. There's a lot here, including a paper and extensive documentation.

mapgl
mapgl is a new R package that makes it easy to work with the latest versions of Mapbox GL JS and MapLibre GL JS using R. It has globe projections, fast interaction with large datasets, lots of tricks to use in Shiny, and more.

Open-Access Books

The Orange Book of Machine Learning
This new book is a practical introduction for anyone getting started with machine learning or working with tabular data. Covers a wide range of topics from statistics and EDA to regression and ensemble methods. It's well-organized with linked references throughout, making it easy to go deeper.

Last Issue's Top Links

Level up your data game.
LLMs are getting really good, fast. If you don't use LLMs on a regular basis yet, you're missing out. Check out these examples from Data Elixir's GPT-4o powered assistant.