Data Engineering Weekly #178
07-01-2024
Experience Enterprise-Grade Apache Airflow
Astro augments Airflow with enterprise-grade features to enhance productivity, meet scalability and availability demands across your data pipelines, and more.

Ozge Demirci, Jonas Hannane & Xinrong Zhu: Who Is AI Replacing? The Impact of Generative AI on Online Freelancing Platforms
The economic impact of Gen AI is widely speculated about, and we are starting to see a few signs of it. The paper highlights the substantial impact of generative AI in reducing demand for certain freelance jobs while increasing the complexity and pay of the remaining jobs, leading to greater competition and shifts in required skills. The key highlights of the paper:
1. Decrease in Job Posts: Within eight months of ChatGPT's introduction, job posts for automation-prone jobs (such as writing and coding) fell 21% relative to jobs requiring manual-intensive skills. Image-generating AI technologies resulted in a 17% decrease in job posts related to image creation.
2. Increased Competition: The reduction in job posts increased competition among freelancers.
3. Job Complexity and Pay: Despite the decrease in job posts, the remaining automation-prone jobs were more complex and offered higher pay.
4. Specific Job Clusters Affected:
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4602944

Leopold Aschenbrenner: Situational Awareness - The Decade Ahead
Tracing the advancements from GPT-2 to GPT-4, the paper argues that AGI (Artificial General Intelligence) by 2027 is plausible. The paper highlights several challenges, including the need for massive industrial mobilization to support the growing demand for GPUs, data centers, and power infrastructure. Controlling AI systems that are much smarter than humans is an unsolved technical problem, and failure could lead to catastrophic outcomes. What do you all think? Do you think human society can handle human-level intelligent machines?
https://situational-awareness.ai/

Ben Lorica: Why Your Generative AI Projects Are Failing
Yes, I added this article as a logical sequel to the previous two articles 😂 Though the promise of LLMs is amazing, enterprises struggle to integrate them into existing workflows without disruption. Looming regulatory requirements, data quality and governance issues, and model accuracy keep tripping up enterprises.
https://gradientflow.substack.com/p/why-your-generative-ai-projects-are

Sponsored: Try Fully Managed Apache Airflow for FREE
Run Airflow without the hassle and management complexity. Take Astro (the fully managed Airflow solution) for a test drive today and unlock a suite of features designed to simplify, optimize, and scale your data pipelines. For a limited time, new sign-ups will receive a complimentary Airflow Fundamentals Certification exam (normally $150).

Astasia Myers & Eric Flaningam: The rise of AI data infrastructure
The article discusses the emergence of AI data infrastructure as a critical area for innovation. The authors emphasize the increasing need for high-quality data for training and inference, focusing on unstructured data pipelines, retrieval-augmented generation (RAG), data curation, and AI memory.
It is a good reminder to the data industry that we need to solve the fundamentals of data engineering to utilize AI better.
https://www.felicis.com/insight/ai-data-infrastructure

Chris Riccomini: Data Lakehouse Catalog Reality Check
I don’t think anyone can describe the catalog war better than this. Market pressure leads vendors to market products as something they are not and to announce things that are not ready yet. In all fairness, if the competition is over open source, we will take it any day.
https://materializedview.io/p/data-lakehouse-catalog-reality-check

Pedram Navid: The Rise of the Data Platform Engineer
The blog is a good summary of this ever-changing and confusing role. The question essentially is, are we back to building YAML frameworks?
https://databased.pedramnavid.com/p/the-rise-of-the-data-platform-engineer

Booking.com: Meta-experiments: Improving experimentation through experimentation
Can we experiment on the experimentation process? By implementing "meta-experiments," the team tested new features like low-power alerts, significantly boosting the quality of their A/B tests. This clever dogfooding enhanced their platform and gave the team a taste of their own medicine, fostering empathy for their users and uncovering pain points they hadn't experienced firsthand.
https://booking.ai/meta-experiments-improving-experimentation-through-experimentation-6bdee314c512

Instacart: Bandits for Marketing Optimization
Instacart discusses its adaptive experimentation system for optimizing paid marketing budgets. The system uses a two-step process:
https://tech.instacart.com/bandits-for-marketing-optimization-f5a63b9bfaa7

Lazaro Hurtado: Evaluating RAG capabilities of Small Language Models
In this article, the author evaluates Small Language Models (SLMs) for use in Retrieval Augmented Generation (RAG) systems, comparing their performance to larger models using the Needle-In-A-Haystack benchmark. Some fine-tuned SLMs, particularly Gemma 2B and Llama2 7B, perform well in tasks similar to those in RAG applications, suggesting the potential for more resource-efficient and environmentally friendly alternatives to Large Language Models. However, the author notes that further research is needed to fully assess SLMs' capabilities in the more complex scenarios typical of RAG systems.

Geico: Searchable field-level encrypted customer PII with k-anonymity
Field-level encryption is a data protection measure that encrypts individual sensitive fields within records, keeping data encrypted throughout its lifecycle and narrowing the data protection focus to key management. To enable searching of encrypted data, GEICO uses k-anonymization: it stores truncated hash digests alongside the encrypted values, which allows secure searches without knowing the encryption key. The approach balances security and performance, and it requires careful tuning of the hash truncation length: a shorter digest gives more protection against dictionary attacks but produces more false positives in search results.
https://www.geico.com/techblog/searchable-field-level-encrypted-customer-pii/

All rights reserved ProtoGrowth Inc, India. I have provided links for informational purposes and do not suggest endorsement. All views expressed in this newsletter are my own and do not represent current, former, or future employers' opinions.
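
A note on the GEICO item above: the truncated-digest search can be illustrated with a minimal Python sketch. This is not GEICO's implementation. The keyed HMAC-SHA256 digest, the separate index key, the lowercase normalization, and the names `truncated_digest` and `SearchableStore` are all assumptions made for illustration. The sketch covers only the index side: ciphertexts are assumed to live elsewhere, and whoever holds the encryption key decrypts the candidate rows and discards the false positives.

```python
import hashlib
import hmac
from collections import defaultdict

# Assumption: the index is keyed with a secret that is separate from the
# field-encryption key, so searching never needs the encryption key.
INDEX_KEY = b"demo-index-key"


def truncated_digest(value: str, hex_chars: int) -> str:
    """Return the first `hex_chars` hex characters of a keyed digest.

    Shorter prefixes put more distinct values in the same bucket (more
    anonymity, more false positives); longer prefixes do the opposite.
    """
    normalized = value.strip().lower().encode("utf-8")
    digest = hmac.new(INDEX_KEY, normalized, hashlib.sha256).hexdigest()
    return digest[:hex_chars]


class SearchableStore:
    """Hypothetical index mapping truncated digests to row ids."""

    def __init__(self, hex_chars: int) -> None:
        self.hex_chars = hex_chars
        self._buckets = defaultdict(list)

    def add(self, row_id: int, plaintext: str) -> None:
        # Only the truncated digest is stored here, never the plaintext.
        key = truncated_digest(plaintext, self.hex_chars)
        self._buckets[key].append(row_id)

    def candidates(self, query: str) -> list:
        # Every row whose value shares the query's digest prefix; the
        # caller must decrypt these rows to filter out false positives.
        key = truncated_digest(query, self.hex_chars)
        return list(self._buckets.get(key, []))


if __name__ == "__main__":
    store = SearchableStore(hex_chars=1)
    for i in range(50):
        store.add(i, f"user{i}@example.com")
    hits = store.candidates("user7@example.com")
    print(f"{len(hits)} candidate rows share the digest prefix")
```

With a one-character prefix there are only 16 possible buckets, so most lookups return several candidate rows. Raising `hex_chars` shrinks the candidate set but makes each stored digest more identifying, which is the tuning trade-off the article describes.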