3-step framework for scaling data quality in the age of generative AI
07-16-2024
Applying what we’ve learned from healthcare to data quality

I’ve found that data quality isn’t really about cleanliness, completeness, or accuracy. Instead, it’s about trust. A recent survey showed that though nearly every data team is diving headfirst into AI applications, 68% of companies aren’t confident in the data behind these applications.

Imagine this: someone looks at a dashboard and says, "That number doesn’t look right." Diagnosing it is a huge challenge. Are they even right? If so, where’s the problem coming from? It could be that the pipeline didn’t run, a data quality check failed, or the meaning of a metric changed and data consumers weren’t informed. Hours later, the company has lost trust in its own data and data team.

This is the data trust gap, which I’ve written about before. It stems from a disconnect between data producers, who aim to create high-quality data and data products, and data consumers, who care less about the quality and more about how usable those products actually are. Between these groups lies an ever-growing mess of diverse data tools, people, and information.

The data trust problem is only intensifying today. In the age of generative AI — where algorithms not only interpret but create data — trust is the foundation for every data product. If a human sees a weird number, they can stop and investigate. But an AI will just use that number, often for critical business decisions, without hesitation.

So what does it mean to “fix” data quality and build great data products in the age of AI? I think it comes down to shared culture, context, and collaboration. Let’s dive into why in today’s issue of Metadata Weekly.

🚀 3-step framework for scaling data quality in the age of generative AI

Just like maintaining your personal health, improving data quality involves three key steps: awareness, cure, and prevention.

1. Awareness: Is our data high-quality now?

When we talk about awareness in the context of data quality, we're really discussing the need to understand our current baseline. Where does our data stand right now? Are there any glaring issues we need to address? This involves pulling in context about what’s happening, detecting anomalies, and keeping users informed — for instance, notifying them if a pipeline didn’t run.

Improving awareness means making information from the world of data producers accessible and understandable for data consumers — ideally managed by someone who understands both the technical and human side of data. It's about breaking down silos and ensuring everyone is on the same page. For example, this could involve pushing alerts directly into a BI tool or Slack channel, or using common color schemas like green, yellow, and red. All of this context could even be used to create a data product score, which measures the quality, usability, and trustworthiness of data. I’ve actually been surprised to see how quickly this has gained traction and increased adoption among our customers.
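To make that concrete, here is a minimal sketch of what a data product score could look like in Python. The signals, weights, and thresholds below are my own illustrative assumptions, not a fixed formula; the point is simply that scattered producer-side context gets rolled up into one number and a familiar green/yellow/red status that consumers can read at a glance.

```python
from dataclasses import dataclass

# Hypothetical signals for a single data product; names, weights, and
# thresholds are illustrative assumptions, not a prescribed formula.
@dataclass
class ProductSignals:
    freshness: float       # 1.0 = pipeline ran on schedule, 0.0 = stale
    quality_checks: float  # share of data quality checks passing (0..1)
    usability: float       # e.g. share of documented columns (0..1)
    trust: float           # e.g. consumer ratings, normalized to 0..1

WEIGHTS = {"freshness": 0.3, "quality_checks": 0.3, "usability": 0.2, "trust": 0.2}

def data_product_score(s: ProductSignals) -> float:
    """Weighted average of the signals, on a 0-100 scale."""
    raw = (
        WEIGHTS["freshness"] * s.freshness
        + WEIGHTS["quality_checks"] * s.quality_checks
        + WEIGHTS["usability"] * s.usability
        + WEIGHTS["trust"] * s.trust
    )
    return round(raw * 100, 1)

def status_color(score: float) -> str:
    """Map the score onto the common green / yellow / red schema."""
    if score >= 85:
        return "green"
    if score >= 60:
        return "yellow"
    return "red"

if __name__ == "__main__":
    signals = ProductSignals(freshness=1.0, quality_checks=0.92, usability=0.7, trust=0.8)
    score = data_product_score(signals)
    # From here, the score and color could be pushed into a BI tool or a Slack channel.
    print(f"Data product score: {score} ({status_color(score)})")
```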
2. Cure: How can we make our data high-quality?

This step addresses the most broken flow in data management today. Most teams, such as sales or marketing, are fairly homogeneous — everyone on the team likely has a similar skill set and background. Meanwhile, data teams are incredibly diverse and involve people across different verticals and skill sets — data scientists, engineers, product managers, business analysts, stewards, and more. This diversity is why solving data quality isn't just a technical problem — it's a collaboration problem.

Curing data quality issues involves growing a shared understanding, awareness, and context across the entire ecosystem. This requires getting all of the people involved in data to agree on what needs to be done, then translating that agreement into the actual workflows of data producers and consumers. One effective strategy is to develop Service Level Agreements (SLAs) — mutual agreements on how data should be handled, considering each group's needs and constraints. These agreements should ideally be created and maintained by cross-functional teams made up of data scientists, analysts, business leaders, IT professionals, and anyone else who has a stake in data quality at the company.
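As a rough illustration of what "translating the agreement into workflows" can mean, here is a sketch of an SLA captured as a small, explicit object in code and checked automatically. The dataset name, thresholds, and owner are hypothetical; the useful part is that producers and consumers can both read exactly what was agreed, and a breach becomes a routable event rather than a surprise on a dashboard.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical SLA terms a producer and a consumer might agree on for one table.
@dataclass
class DataSLA:
    dataset: str
    max_staleness: timedelta   # how fresh the data must be
    min_completeness: float    # minimum share of non-null rows in key columns
    owner: str                 # who is accountable when the SLA is breached

@dataclass
class DatasetStats:
    last_loaded_at: datetime
    completeness: float

def sla_breaches(sla: DataSLA, stats: DatasetStats, now: datetime) -> list[str]:
    """Return human-readable descriptions of any SLA terms currently breached."""
    breaches = []
    if now - stats.last_loaded_at > sla.max_staleness:
        breaches.append(f"{sla.dataset}: data is older than the agreed {sla.max_staleness}")
    if stats.completeness < sla.min_completeness:
        breaches.append(
            f"{sla.dataset}: completeness {stats.completeness:.1%} "
            f"is below the agreed {sla.min_completeness:.1%}"
        )
    return breaches

if __name__ == "__main__":
    sla = DataSLA("orders_daily", timedelta(hours=6), 0.99, owner="data-platform-team")
    stats = DatasetStats(datetime.now(timezone.utc) - timedelta(hours=9), 0.995)
    for breach in sla_breaches(sla, stats, now=datetime.now(timezone.utc)):
        # Route breaches to the owner's Slack channel, ticket queue, or BI tool banner.
        print("SLA breach:", breach)
```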
3. Prevention: How can we ensure we always have high-quality data?

This step focuses on sustainability — how can we take what we’ve learned in the Awareness and Cure steps and implement it in a way that keeps the same issues from cropping up again and again? To be honest, I’d like to stop talking about data quality in the next few years. The more we can nail the prevention of data quality issues, the more we can stop getting bogged down in whether the number on a dashboard is right and focus on actually using it.

One powerful solution for preventing data quality issues is implementing data contracts. These establish agreements between different data stakeholders on how to handle quality checks and issues, ideally automating the process so people don’t have to focus on it constantly. The more you can automate data quality, the easier it will be for everyone.

Effectively scaling data quality initiatives also requires tools and technologies designed to streamline data management and improve quality. Automated data lineage tracking, anomaly detection, and data quality monitoring can significantly reduce errors and help teams resolve issues efficiently.
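For a sense of what that automation can look like, here is a minimal data contract sketch: a declared set of fields and constraints that records are validated against before they are published. The field names and rules are made up for illustration, and real contract tooling would be richer, but the shape is the same: the producer's promise is written down once, and the pipeline enforces it so bad data never reaches consumers (or an AI model) downstream.

```python
# Hypothetical data contract for an orders table; the field names, types, and
# constraints are illustrative, not taken from any specific contract spec.
CONTRACT = {
    "fields": {
        "order_id": str,
        "amount": float,
        "currency": str,
    },
    "constraints": {
        "amount": lambda v: v >= 0,
        "currency": lambda v: v in {"USD", "EUR", "INR"},
    },
}

def validate_record(record: dict) -> list[str]:
    """Check one record against the contract and return a list of violations."""
    violations = []
    for field, expected_type in CONTRACT["fields"].items():
        if field not in record:
            violations.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            violations.append(f"{field}: expected {expected_type.__name__}")
    for field, rule in CONTRACT["constraints"].items():
        if field in record and not rule(record[field]):
            violations.append(f"{field}: constraint failed for value {record[field]!r}")
    return violations

if __name__ == "__main__":
    bad = {"order_id": "A-1001", "amount": -5.0, "currency": "GBP"}
    problems = validate_record(bad)
    if problems:
        # In a pipeline, this is where the publish would be blocked and the
        # producer alerted, instead of letting the bad data flow downstream.
        print("Contract violations:", problems)
```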
I just discussed data quality in the age of generative AI on the DataFramed podcast with Barr Moses (Monte Carlo Data) and George Fraser (Fivetran)!

📚 More from my reading list
Top links from last issue: