Data Science Newsletter | Data Elixir

Data Elixir - Issue 492

07-09-2024

How Amazon wins with data. Boosting vs. semi-supervised learning. Science and LLMs. Sparse arrays. Practical intro to ML.

ISSUE 492 · July 9, 2024

In the News

AI Revolutionized Protein Science but Didn’t End It

Three years ago, Google’s AlphaFold pulled off the biggest artificial intelligence breakthrough in science to date. In doing so, it accelerated molecular research and kindled deep questions about why we do science. Here's the story of AlphaFold and what it means for the future of AI in science, especially in biology.
Quanta Magazine | Yasemin Saplakoglu

Data Strategy

The Amazon Weekly Business Review

In the latest post in his Becoming Data Driven series, Cedric Chin explores the weekly meeting at Amazon where leaders look at 400-500 metrics in a single hour. It's a lot of data, but that's the number of levers they have available to run the business. Here's a breakdown of the metrics, how the meeting works, and how it helps Amazon win.
Commoncog | Cedric Chin

PRESENTED BY Observable

The best dashboards are built with code.

See the benefits of building dashboards with code

Anyone who has used traditional BI tools knows their limitations. When you build with code, you can create bespoke, expressive, and interactive dashboards and data apps. Learn more about how code improves data modeling, unlocks more precise layouts, and allows you to create unique chart types.

Reach Data Elixir readers by sponsoring an issue. Click here for details.

Posts & Tutorials

Boosting vs. semi-supervised learning

While gradient-boosted algorithms are amazing, they aren't a silver bullet, especially when you're dealing with a dataset where only a small fraction of the examples are labeled. For those use cases, this video shows why semi-supervised learning techniques can be a better approach.
YouTube | :probabl. - 11.5 minutes

Python's Sparse Array Ecosystem

A sparse array is an array in which most elements are zero or null. An efficient way to work with sparse arrays is to keep track of only the non-zero elements, but doing that is non-trivial. This post explores some options for working with sparse arrays in Python, including the pros and cons of each.
Quansight Labs Blog | Hameer Abbasi
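
To make the "only keep track of the non-zero elements" idea concrete, here is a minimal pure-Python sketch of a dictionary-of-keys layout, one of several storage schemes used for sparse data. The class name SparseMatrix and its methods are hypothetical, written for illustration only; they are not taken from the linked post or from any library it covers.

```python
from typing import Dict, List, Tuple


class SparseMatrix:
    """Hypothetical dictionary-of-keys matrix: only non-zero entries are stored."""

    def __init__(self, shape: Tuple[int, int]) -> None:
        self.shape = shape
        # Maps (row, col) -> value for non-zero entries only.
        self._data: Dict[Tuple[int, int], float] = {}

    def _check(self, key: Tuple[int, int]) -> None:
        row, col = key
        n_rows, n_cols = self.shape
        if not (0 <= row < n_rows and 0 <= col < n_cols):
            raise IndexError(f"index {key} is out of bounds for shape {self.shape}")

    def __setitem__(self, key: Tuple[int, int], value: float) -> None:
        self._check(key)
        if value == 0:
            # Writing a zero removes the entry, so storage stays sparse.
            self._data.pop(key, None)
        else:
            self._data[key] = value

    def __getitem__(self, key: Tuple[int, int]) -> float:
        self._check(key)
        # Anything not stored is implicitly zero.
        return self._data.get(key, 0.0)

    @property
    def nnz(self) -> int:
        """Number of stored (non-zero) entries."""
        return len(self._data)

    def matvec(self, vector: List[float]) -> List[float]:
        """Matrix-vector product that visits only the stored entries."""
        if len(vector) != self.shape[1]:
            raise ValueError("vector length must match the number of columns")
        result = [0.0] * self.shape[0]
        for (row, col), value in self._data.items():
            result[row] += value * vector[col]
        return result


# A 1000 x 1000 matrix has a million cells, but only two are stored here.
matrix = SparseMatrix((1000, 1000))
matrix[0, 1] = 2.0
matrix[999, 999] = 3.0
print(matrix.nnz)
```

The trade-off this sketch hints at is part of why the problem is non-trivial: a dictionary makes single-element reads and writes cheap, but other layouts are usually better suited to fast arithmetic or slicing.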

How to Interview and Hire ML/AI Engineers

Great post for anyone interested in either hiring or being hired into an ML/AI Engineer role. Covers technical and non-technical qualities to assess, phone screens, interview loops, debriefs, and finally, opinionated thoughts on what makes a good hire.
Eugene Yan and Jason Liu

Tools & Code

Fun with Positron

Positron is a next-generation data science IDE that combines the best of RStudio and Visual Studio Code. Although it's new, it already offers first-class support for R and Python, support for VS Code-compatible extensions, and a focus on data science workflows. This post is a nice introduction to Positron's features, settings, key extensions, and more.
Andrew Heiss

GraphRAG: New tool for complex data discovery

GraphRAG is a graph-based approach to retrieval-augmented generation (RAG) that enables question-answering over private or previously unseen datasets. This is a Microsoft Research project that was introduced a few months ago and was recently released on GitHub. There's a lot here, including a paper and extensive documentation.
Microsoft Research

mapgl

mapgl is a new R package that makes it easy to work with the latest versions of Mapbox GL JS and MapLibre GL JS using R. It has globe projections, fast interaction with large datasets, lots of tricks to use in Shiny, and more.
Kyle Walker

Open-Access Books

The Orange Book of Machine Learning

This new book is a practical introduction for anyone getting started with machine learning or working with tabular data. Covers a wide range of topics from statistics and EDA to regression and ensemble methods. It's well-organized with linked references throughout, making it easy to go deeper.
Carl McBride Ellis

Last Issue's Top Links

What to do with age? Linear, Discrete, Both, or Spline - Vincent Arel-Bundock
The Strengths, Weaknesses and Blind Spots of Managers - Ben Wigert
Lessons Learned From Scaling to Multi-Terabyte Datasets - v2thegreat

Level up your data game.

LLMs are getting really good, fast. If you don't use LLMs on a regular basis yet, you're missing out. Check out these examples from Data Elixir's GPT-4o powered assistant.
