Data Engineering Weekly | Ananth Packkildurai | Substack

Data Engineering Weekly #177

06-24-2024

͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏ ͏

Forwarded this email? Subscribe here for more

Data Engineering Weekly #177

The Weekly Data Engineering Newsletter

Ananth Packkildurai

Jun 24

READ IN APP

Experience Enterprise-Grade Apache Airflow

Astro augments Airflow with enterprise-grade features to enhance productivity, meet scalability and availability demands across your data pipelines, and more.

Learn More →

Redpoint: The InfraRed Report

The impact of macroeconomic slowness results in increased focus on prioritizing reduced infrastructure spending. Redpoint’s InfraRed report shows a promising sign of a rebound in spending, where much of the money goes to GenAI. A few highlights from the report

Unstructured data goes mainstream.
AI-driven code development is going mainstream now.

https://www.redpoint.com/infrared/report/

Bessemer Venture Partners: Roadmap: AI Infrastructure

Companies are increasing their investment in AI but struggle to integrate it with their product experience or find the right talent and infrastructure to empower it. Bessemer publishes the Data + AI infrastructure market map to help companies understand the landscape and the key players.

https://www.bvp.com/atlas/roadmap-ai-infrastructure

Shreya Shankar: Short Musings on AI Engineering and "Failed AI Projects"

LLMs significantly reduce the entry barrier and are faster than ever for prototypes. The author highlights that while people ship initial deployments of LLM-based pipelines much faster than traditional ML pipelines, expectations become similarly tempered once they are in production. Should we term this as an AI product curve of disappointment?

https://www.sh-reya.com/blog/ai-engineering-short/

Sem Sinchenko: Unity Catalog - The First Look

A balanced take on the open source version of Unity Catalog, what is in the open source, and what is in the commercial version. While seeing the Unity Catalog code, I thought it might be a stripped-down version of the code used in production. But it seems a complete rewrite. It is still unclear if Databricks will use the open-source version in their prod.

I guess it is okay to build something from scratch, but I was surprised it was announced on the stage as if Databricks open-source their production version of the Unity catalog.

https://semyonsinchenko.github.io/ssinchenko/post/uniticatalog-first-look/

Netflix: A Recap of the Data Engineering Open Forum at Netflix

Netflix publishes a recap of all the talks in the first Data Engineering open forum tech meetups. The blog contains a summary of each talk and a link to the YouTube channel with all the talks.

https://netflixtechblog.com/a-recap-of-the-data-engineering-open-forum-at-netflix-6b4d4410b88f

Pinterest: Ray Infrastructure at Pinterest

Pinterest writes about the challenges of running Ray infrastructure in production and integration with the existing Pinterest ecosystem. The case study is a classic success of the platform engineering with a wrapers around the cluster to ease the usage and increases the adaptability.

https://medium.com/pinterest-engineering/ray-infrastructure-at-pinterest-0248efe4fd52

Lyft: ETA (Estimated Time of Arrival) Reliability at Lyft

Another platform engineering case study where Lyft writes about computing ETA to show the availability of the ride and its challenges in building real-time. The blog details the classification model, training approach and historical data analysis.

https://eng.lyft.com/eta-estimated-time-of-arrival-reliability-at-lyft-d4ca2720bda8

Allegro Tech: A Mission to Cost-Effectiveness: Reducing the cost of a single Google Cloud Dataflow Pipeline by Over 60%

The blog is an excellent case study of hyopoesis driven cost optimization with the detail analysis to verify the hypothesis. The author highlighted three hypothesis contributing cost in Google Cloud Dataflow pipeline.

Physical resources are underutilized.
Physical resources have not the best price-to-performance ratio.
Configuration of the Dataflow job is suboptimal and could be optimized.

https://blog.allegro.tech/2024/06/cost-optimization-data-pipeline-gcp.html

Apurva Mehta: Stream Processing is more ubiquitous than you think

It’s common to find in many data trends prediction that this year will be the year of stream processing, however Michael Drogalis highlighted there are not many usecases for stream processing.

Ref: https://x.com/MichaelDrogalis/status/1782439448996970946

The author highlights the low latency, reactive usecases more ubiquitous than what we think? Question to the readers, what do you think of the current state of real-time data processing engines? Are there enough usecases?

https://x.com/apurva1618/status/1803101801081975029?s=12&t=IJ8nTG5H26su2sWUh2bYSw

Influx Data: How Good is Parquet for Wide Tables (Machine Learning Workloads) Really?

Is parquet is still good enough for Machine Learning, Vector and Lake House workloads?

Influx data conclude that while technical concerns about Parquet metadata are valid, the actual overhead is smaller than generally recognized. In fact, optimizing writer settings and simple implementation tweaks can reduce overhead by 30-40%. With significant additional implementation optimization, decode speeds could improve by up to 4x.

https://www.influxdata.com/blog/how-good-parquet-wide-tables/

All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employer” opinions.