As Gartner outlined in its 2021 top trends for data analytics report, business leaders are beginning to understand the critical role of data and analytics in accelerating business decisions, shifting the discipline from a ‘nice to have’ to a core business function. The pandemic has accelerated digital transformation, and businesses are accumulating more data than ever before. Together, these trends mean that investment in data and analytics has skyrocketed, as organizations demand more valuable insights from their data that they can draw on to meet future challenges.
For many organizations, scaling and investing in data and analytics often means choosing the hottest new data platform that they perceive to be easy to integrate into operations to store, access and analyze everything in one place. But with data coming in thick and fast, in many different formats and from many different places, no such platform can overcome the human complexities that are involved in organizing all data in one place, and companies often aren’t seeing the transformational results they expect. This shows that companies must look beyond tools and platforms if they want to get more from their data.
Enter a new concept to alleviate the human complexities: Data Mesh. Coined by Zhamak Dehghani, principal technology consultant at ThoughtWorks, Data Mesh is a more holistic approach to managing scale and architectural change within global enterprises, addressing the flaws that data lake and data warehouse models show at scale. In this article, we’ll look at the principles behind the Data Mesh approach, how to establish a Data Mesh architecture, and a working example of how this concept has been put into action at Zalando.
But first, a quick background on Data Mesh
Unlike current data platforms, which are largely centralized, monolithic, and often built around complex pipelines, Data Mesh treats datasets as distributed products orientated around domains. Each domain-specific dataset has its own embedded engineers and product owners who manage that data and its availability to other teams, driving a level of data ownership and responsibility that is often lacking in previous approaches. Other mesh-related technologies, like service mesh, refer to a dedicated infrastructure layer that allows developers to mix and match different services into a coherent whole. A Data Mesh produces a similar outcome, but it is driven by best practices within the organization rather than by a specific technology. It allows teams of data analysts to pick and choose from curated datasets managed by domain experts and combine them into a more holistic understanding than any individual dataset could provide.
Data Mesh is about rethinking organizational, architectural, and technological assumptions to get the best out of your data team and your data. Instead of centralizing data, which can often cause inefficiencies in data ownership and sourcing, it gives the domain experts, who know the data best, the responsibility to control and scale their datasets, ultimately enabling bigger and better analytics that help move the business forward.
A global organization with ongoing change and complexity in its data landscape, a large number of sources and consumers, and diverse data transformations should consider a Data Mesh approach. It allows the organization to understand its customers, suppliers, and products at a deeper level than is possible with traditional centralized solutions.
The four principles of Data Mesh
Data Mesh is founded on four main principles:
- Domain-driven data ownership and architecture
- Data as a product
- Self-serve infrastructure as a platform
- Federated computational governance
To put this into action, let’s consider a music streaming app as an example. Data that is relevant to songs and albums, streams, podcast play rates, recommendations, or user behaviors and preferences would traditionally flow from media players into a centralized platform. But when a new feature or functionality is introduced, the entire monolithic data platform must adapt by transforming and cleansing the data each time in order to make that new dataset available for consumption by different teams, creating a roadblock for scalability.
But imagine instead that the team building the media player owns its dataset, while the team that tracks user behavior and preferences manages its own. Each group is then responsible for managing, hosting and serving the datasets specific to its domain, instead of pushing the domain data through a central pipeline and handing over ownership. This means that when a new feature or functionality is added, each team can adapt and scale its particular dataset in a way that suits its application. The teams could still share centralized data storage in the cloud, but that storage could consist of independent buckets, each owned by the relevant domain and operating as an autonomous node on an interconnected mesh. The data sources can then scale with the number of use cases and the diversity of access models by adding nodes to the mesh.
Treating each dataset as its own product, which is the Data as a Product principle, means that processes can be managed within each domain while helping to combat the data silo effects often associated with domain-orientated datasets. For this to work effectively, each dataset must be discoverable, trustworthy and self-describing so that other teams can understand and consume it easily. It’s also worth considering new roles in the Data Mesh, such as data product owners and data product developers, who are responsible for making data flow to consumers and work with the tools that data scientists and data analysts already use.
Distributed ownership and architecture mean that a Data Mesh approach may require new infrastructure. It’s important that this infrastructure is self-serve and generalized, so that it does not include any domain-specific characteristics that could render it less usable to another domain. This can be achieved through a central data infrastructure team that owns and provides the technology domains need for their data products, making it easier for a team to create a new data product and make it available to data scientists and wider business intelligence teams within the company.
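As a rough illustration of what discoverable, trustworthy and self-describing could look like in practice, the Python sketch below models a data product descriptor and a toy in-memory catalog. Every name in it (DataProduct, Catalog, the field names, the example team) is hypothetical and not taken from any particular Data Mesh tool, and a real catalog would be a shared service rather than a dictionary.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class DataProduct:
    """Hypothetical descriptor that a domain team publishes with its dataset."""
    name: str
    domain: str
    owner: str                # accountable data product owner
    description: str          # plain-language summary for consumers
    schema: Dict[str, str]    # column name -> type; makes the product self-describing
    freshness_sla_hours: int  # a simple, checkable trust signal


class Catalog:
    """Toy in-memory catalog; stands in for a shared discovery service."""

    def __init__(self) -> None:
        self._products: Dict[str, DataProduct] = {}

    def register(self, product: DataProduct) -> None:
        # Refuse products that consumers could not understand or trust.
        if not product.owner or not product.description or not product.schema:
            raise ValueError(f"{product.name}: owner, description and schema are required")
        self._products[f"{product.domain}.{product.name}"] = product

    def search(self, keyword: str) -> List[str]:
        # Discoverability: match the keyword against names and descriptions.
        needle = keyword.lower()
        return sorted(
            key for key, p in self._products.items()
            if needle in p.name.lower() or needle in p.description.lower()
        )


catalog = Catalog()
catalog.register(DataProduct(
    name="streams",
    domain="media_player",
    owner="media-player-team",
    description="One row per song or podcast play event",
    schema={"user_id": "bigint", "track_id": "bigint", "played_at": "timestamp"},
    freshness_sla_hours=24,
))
print(catalog.search("podcast"))
```

The design choice worth noting is that registration fails when the owner, description or schema is missing, so a product cannot become discoverable without also being self-describing.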
As a federated model, the Data Mesh approach is governed by data product owners, who agree on local and global rules to ensure that their data products comply with shared policies on access control, schema management, and more.
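One way to picture federated computational governance is as rules expressed in code: a set of global rules that every data product must pass, plus local rules that a single domain adds for itself. The Python sketch below is a minimal illustration under that assumption. The rule names, the metadata fields and the example product are all hypothetical, and real governance tooling would cover far more than these three checks.

```python
from typing import Callable, Dict, List, Optional

# A rule inspects a data product's metadata and returns a violation message,
# or None when the product complies.
Rule = Callable[[dict], Optional[str]]


def require_owner(product: dict) -> Optional[str]:
    return None if product.get("owner") else "missing owner"


def require_pii_flags(product: dict) -> Optional[str]:
    # Every column must state whether it holds personal data.
    unflagged = [c["name"] for c in product.get("columns", []) if "pii" not in c]
    return f"columns without PII flag: {unflagged}" if unflagged else None


def require_retention(product: dict) -> Optional[str]:
    return None if product.get("retention_days") else "missing retention_days"


# Global rules are agreed by all data product owners; local rules by one domain.
GLOBAL_RULES: List[Rule] = [require_owner, require_pii_flags]
LOCAL_RULES: Dict[str, List[Rule]] = {"user_behavior": [require_retention]}


def evaluate(product: dict) -> List[str]:
    """Return all violations for a product, global rules first."""
    rules = GLOBAL_RULES + LOCAL_RULES.get(product.get("domain", ""), [])
    violations = []
    for rule in rules:
        message = rule(product)
        if message is not None:
            violations.append(message)
    return violations


preferences = {
    "domain": "user_behavior",
    "owner": "behavior-team",
    "columns": [{"name": "user_id", "pii": True}, {"name": "genre"}],
}
print(evaluate(preferences))
```

Here the example product passes the ownership rule but fails one global rule (an unflagged column) and one local rule (no retention period), which is the kind of feedback a domain team could receive before publishing.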
Putting Data Mesh architecture into motion: The case of Zalando
Before adopting the Data Mesh approach and product-orientated thinking, it’s worth bearing in mind that having new data product teams manage different domains means there are likely to be different levels of expertise among data teams. Teams might also have their own preferences in data management tools, or prefer to work in Spark rather than Hadoop, resulting in a mix of datasets organized in different formats and siloed in entirely different systems.
To overcome this potential barrier, teams can implement technologies like Trino, which function as abstraction layers for accessing data in different places. These allow teams to point a single system at multiple datasets simultaneously and query data where it resides, at scale. Enabling technologies like Trino typically sit in a shared data infrastructure layer that can be used both by the teams creating data products and by the teams consuming them. Federated query engines, such as Starburst, enable data analysts to query the distributed set of data products, while tools that track data provenance, lineage, and versioning, along with tools for distributed data governance, data cleaning, and data integration, are integral to a Data Mesh architecture. The long-term success of Data Mesh ultimately relies on this shared layer, as it mitigates the risk of siloed data.
Zalando, Europe’s leading fashion and lifestyle platform and a Starburst customer, adopted a Data Mesh self-serve infrastructure when it moved from legacy data warehouses to a cloud data lake. Previously, the infrastructure team managed thousands of datasets that funneled through a central pipeline, which caused issues with ownership of the data and ultimately degraded its quality.
Switching to a Data Mesh approach, Zalando allowed its teams to build their own storage buckets, all within the central infrastructure layer of the AWS data lake. This meant that when data scientists and business intelligence teams needed to access the different datasets, the data remained accessible and available at scale.
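The idea of querying data where it resides can be sketched with a small stand-in. The Python example below does not use Trino or Starburst. It uses SQLite from the standard library, attaching two separate databases to one connection to play the role of two domain-owned stores, and then joins across them with a single SQL statement. The database names, tables and rows are hypothetical. In Trino itself, tables are addressed by catalog, schema and table name, and the engine reads from the underlying systems through connectors.

```python
import sqlite3

# Stand-in for a federated query engine: one connection, two separately
# owned stores. The names "player" and "behavior" are hypothetical domains.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS player")
conn.execute("ATTACH DATABASE ':memory:' AS behavior")

# Each domain team defines and fills its own dataset.
conn.execute("CREATE TABLE player.streams (user_id INTEGER, track TEXT)")
conn.execute("CREATE TABLE behavior.preferences (user_id INTEGER, genre TEXT)")
conn.executemany(
    "INSERT INTO player.streams VALUES (?, ?)",
    [(1, "Song A"), (2, "Song B"), (1, "Song C")],
)
conn.executemany(
    "INSERT INTO behavior.preferences VALUES (?, ?)",
    [(1, "jazz"), (2, "rock")],
)

# A consumer joins both data products through one SQL interface,
# using the shared identifier user_id, without copying either dataset.
rows = conn.execute(
    """
    SELECT p.genre, COUNT(*) AS stream_count
    FROM player.streams AS s
    JOIN behavior.preferences AS p ON p.user_id = s.user_id
    GROUP BY p.genre
    ORDER BY p.genre
    """
).fetchall()
print(rows)
```

The join only works because both datasets use the same user_id values, which is why shared identifiers matter so much once ownership is distributed.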
Instead of decentralising the infrastructure completely, Zalando ensured that the data teams were responsible for the data they stored themselves in their part of the system, while maintaining a central governance layer. This ensured that the end users or data product consumers could work through a standard, familiar SQL interface that abstracts away the backend infrastructure and distribution of data.
A forward-looking data management solution
Data Mesh is not necessarily about a specific type of technology or code that magically solves data problems at the touch of a button. Instead, it’s about the human side of technology: enabling teams to work independently to maximise the value of data within their organization. Instead of relying on a central bottleneck for data pipelines (i.e. a small team in charge of cleaning, transforming, integrating, and serving enterprise-wide data), these processes are distributed across many small teams of domain experts, where each team creates “data products” that are consumable by other teams. In many cases, these teams can use their domain knowledge to clean and prepare data for consumption better than a central team that lacks that knowledge. More importantly, this gives organizations agility and flexibility in incorporating new datasets into an analysis task without having to wait for a central team to make the data available.
One key best practice with Data Mesh is to enforce global standards within an organization, through shared identifiers and quality control techniques for data products, so that the teams consuming data products get a consistent and integrated experience across the different products they use. Standard techniques for describing and registering a data product are also important, and teams must be able to work in parallel as much as possible, with minimal need for direct interaction across teams.
Data collection and analysis are certainly not slowing down, and even though Data Mesh might sound like another buzzword in the big data and AI world, the concept can help enterprises become truly data-driven in their decisions. By combining a deep understanding of the technical necessities of data management with the organizational complexities of putting these solutions to work within large global organizations, Data Mesh enables enterprises to extract more value from their distributed data.
Rethinking data infrastructure and how companies manage their data is by no means a straightforward task, especially for large, global corporations. It requires a change in mindset on both a technological and an organizational level, which can be difficult for organizations to adapt to, even at the best of times. But with more data solutions on the market for enterprises to choose from, and data storage solutions continuing to grow and evolve, companies must evolve too. By adopting a Data Mesh approach to data management, organizations can confidently overcome the barriers and hiccups in accessing their data, ultimately generating transformative insights for the business and its future demands.
Justin Borgman, CEO, Starburst Data