Data Engineering Weekly #179
07-07-2024
Experience Enterprise-Grade Apache Airflow
Astro augments Airflow with enterprise-grade features to enhance productivity, meet scalability and availability demands across your data pipelines, and more.

Notion: Building and scaling Notion’s data lake
Notion writes about scaling the data lake by bringing critical data ingestion operations in-house.
The architecture resulted in over a million dollars in cost savings!
https://www.notion.so/blog/building-and-scaling-notions-data-lake

DoorDash: Beyond the Click: Elevating DoorDash’s Personalized Notification Experience with GNN Recommendation
DoorDash writes about using Graph Neural Network (GNN) models to create personalized restaurant recommendations. The blog highlights the advantages of GNNs over traditional machine learning models, which struggle to discern relationships between entities, such as users and restaurants, and edges, such as orders.
https://doordash.engineering/2024/06/25/doordash-customize-notifications-how-gnn-work/

Jack Vanlightly: Understanding Apache Paimon's Consistency Model Part 1
Table formats are definitely a hot topic in the industry now. Apache Paimon is a relatively low-key format spun out of Flink’s table store. The author examines Paimon’s consistency model through its metadata model. Organizing the dataset as a set of LSM trees is an interesting approach, similar in spirit to Google’s Napa data warehouse system.
https://jack-vanlightly.com/analyses/2024/7/3/understanding-apache-paimon-consistency-model-part-1

Sponsored: Elevate Your Airflow DAG Debugging Skills With This Free Online Course
Are you spending too much time trying to debug your Airflow DAGs? In less than 30 minutes, this free comprehensive course from Astronomer Academy will guide you through a straightforward approach to recognizing and resolving the most common debugging issues.

Canva: How we build experiments in-house
Experimentation is a vital part of product design for any company at any scale. Canva writes about the evolution of its experimentation system design and the current state of the function.
https://www.canva.dev/blog/engineering/how-we-build-experiments-in-house/

Slack: Unlocking Efficiency and Performance: Navigating the Spark 3 and EMR 6 Upgrade Journey at Slack
Migrating a large-scale data infrastructure brings its own challenges.
Slack writes about migrating its data infrastructure to EMR 6 and Spark 3. The blog walks through the pre-migration and post-migration steps, highlighting the importance of structured execution in such large-scale migration projects.

Meta: Leveraging AI for efficient incident response
Incident analysis is a vital part of the incident management lifecycle, helping to uncover hidden flaws in system design and prevent future issues. Meta writes about its use of AI to understand the effectiveness of incident response and root-cause analysis.

Ling Huang: MLOps — Data Drift Monitoring Architecture
Drift detection is a popular feature supported by data observability tools. The essence of drift monitoring is running statistical tests on time windows of data. The blog gives an overview of the statistical techniques used in drift detection.
https://medium.com/@linghuang_76674/drift-monitoring-architecture-aa57fc26a19c

Lam Hoang: How Chunk Sizes Affect Semantic Retrieval Results
Chunking is a crucial technique that involves breaking down large pieces of text into smaller, more manageable segments. This process is particularly important when working with large language models (LLMs) and semantic retrieval systems. The X thread is a good reference for measuring the relevance of a vector search. The blog highlights how chunk size impacts vector search and how to choose the optimal size.
https://ai.plainenglish.io/investigating-chunk-size-on-semantic-results-b465867d8ca1

Diederik Jan Lemkes: How to choose between dbt clone and defer. And how we clone for all contributors
The article discusses strategies for using production data in dbt development environments without high costs or extra permissions. It compares two dbt features, defer and clone, for referencing production tables, ultimately favoring a nightly cloning approach for specific models.
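The drift-monitoring item above notes that drift detection boils down to running statistical tests on time windows of data. As a minimal, hedged sketch (not from the linked article), here is one widely used such test, the Population Stability Index (PSI), comparing a reference window against a current window; the binning scheme and the 0.2 alert threshold are illustrative assumptions:

```python
import math


def psi(reference, current, bins=4):
    """Population Stability Index between a reference window and a current window.

    Bins both windows on the reference range, then sums
    (actual_frac - expected_frac) * ln(actual_frac / expected_frac) per bin.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # guard against a constant reference window

    def bin_fractions(window):
        counts = [0] * bins
        for x in window:
            i = min(max(int((x - lo) / width), 0), bins - 1)  # clamp out-of-range values
            counts[i] += 1
        # floor fractions at a small epsilon to avoid log(0) on empty bins
        return [max(c / len(window), 1e-4) for c in counts]

    expected = bin_fractions(reference)
    actual = bin_fractions(current)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))


# Identical windows: PSI is zero (no drift).
print(round(psi([1, 2, 3, 4] * 25, [1, 2, 3, 4] * 25), 4))  # 0.0
# Shifted window: PSI lands well above the common 0.2 "significant drift" threshold.
print(psi([1, 2, 3, 4] * 25, [3, 4, 4, 4] * 25) > 0.2)  # True
```

Real observability tools layer window scheduling, baselining, and alerting on top of tests like this; Kolmogorov-Smirnov or chi-squared tests are common alternatives.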
The authors implement this using Airflow's dynamic task mapping to automatically clone tagged models into each contributor's development dataset, allowing seamless dbt builds with production-like data.

Guillermo Musumeci: How To Calculate the Real Cost of Azure Databricks SQL Warehouse Instances
Databricks billing on Azure is a bit confusing: most of it flows through the Azure system, with deductions applied via negotiated billing with Databricks. Software manufacturing and distribution are indeed different arts. The blog explains how to programmatically measure the real cost of Azure Databricks SQL warehouse instances.

All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent my current, former, or future employers’ opinions.