Data Mesh and Data Privacy – You CAN Have Both
This post was contributed by Matthew Fleck, Founder & CEO at Anonomatic, partners of the Data Collaboration Alliance and data and privacy experts who are passionate about helping organizations solve for PII compliance.
[Editor's note: readers may also want to check out the Data Fabric concept for added context]
When I was first introduced to the idea of data meshes, I thought it was one of those incredibly brilliant ideas which simply took someone clever enough to look at a common problem from an uncommon perspective.
According to Starburst.io, data mesh is a new approach based on a modern, distributed architecture for analytical data management. It enables end users to easily access and query data where it lives without first transporting it to a data lake or data warehouse. The decentralized strategy of data mesh distributes data ownership to domain-specific teams that manage, own, and serve the data as a product.
There are many very good articles on data meshes such as this one by Barr Moses, CEO of Monte Carlo, titled What is a Data Mesh — and How Not to Mesh it Up. I won’t duplicate the details common in every article about data meshes. Rather I want to discuss a significant danger of data meshes that is not being addressed today.
The danger relates to how data privacy becomes an absolute nightmare if a data mesh is implemented without a corresponding evolution in how data privacy is implemented. So, let’s discuss how to avoid this pitfall.
To start, we’ll review the basic structure of a data mesh. The diagram below was recreated based on the Barr Moses’s post:
At a high level, a data mesh is composed of three separate components:
Domain-oriented data pipelines managed by functional owners
Underlying the data mesh architecture is a layer of universal interoperability, reflecting domain-agnostic standards, as well as observability and governance. (Original diagram courtesy of Monte Carlo Data.)
From the bottom up, it starts with a Universal Interoperability layer. Consider this the internal standards each domain must implement to deliver their data as a product.
The different domains appear on the next level up. Each domain holds responsibility for delivering their data as a product to other domains and to the central data repository. Within the domains, data is ETLed (Extracted, Transformed and Loaded). This basically takes data from one system, configures it so it can be loaded into another system, and then actually loads it into that other systems.
Data Infrastructure-as-a-Platform sits at the top of the architecture. It provides a standard, self-serve model by which data consumers and domains may either push their data to a warehouse or request data from a warehouse.
The beauty of a data mesh is its simplicity. It mirrors how individual business domains operate within an organization on a non-technical level. By using this same internal structure, a data mesh shifts the responsibility for providing data critical to an organization’s operations from large, centralized groups to small teams (domains). The efficiency and effectiveness come from the reality that the large, central teams often have only a shallow understanding of domain level data while the domains know their data intimately.
This clean, simple, and elegant solution deserves all the attention it currently receives.
Why is a Data Mesh so Risky for Data Privacy?
It’s because the very challenging task of implementing and enforcing data privacy grows exponentially for every domain in a de-centralized data mesh.