Big data promises a huge revolution in the way we do business, but let's be honest: how much real value do our data strategies usually deliver?

Big Data's Promise

Managing and generating value through data has been a challenge for most companies for decades. In recent years, the phenomenon of Big Data has brought a lot of optimism because of its promise to revolutionize business decision making. However, the limitations inherent in data architectures often prevent them from providing the expected value, leaving companies frustrated after pouring time and money into data solutions and not realizing tangible results.

To help solve these challenges, a new architectural paradigm emerged: Data Mesh. Data Mesh aims to remove bottlenecks and allow a more optimized delivery of value through data.

Understanding the Context: Data Warehouses and Data Lakes

Before getting into the details of this new architectural paradigm, we need to understand the current context and identify the causes of failure in the data journey of many companies.

Let's begin by reviewing some of the key components that are used as big data repositories. The first of these is the Data Warehouse, which emerged in large corporations decades ago. In this context, the business had a slower and more predictable pace, and the architecture was essentially composed of large and complex systems with little or no integration between them. The challenge was to get a unified view across the systems.

More recently, starting in the 2010s, a model that gained popularity was the Data Lake, which emerged in a much more dynamic and less predictable business environment. This model has been used not only by large corporations but also by new companies with disruptive value propositions. The architecture has evolved to include a great number of applications that are simple and integrated, often built on new technologies and techniques such as cloud and microservices.

To better understand the concepts behind these two models, I suggest reading Modernizing Your Data Warehouse with BigQuery and Data Lakes: The Key to Data Modernization.

It's important to highlight that in these two architectural paradigms, the teams responsible for managing data have characteristics of high specialization and centralization.

Why It's So Hard to Create Value through Data

But data warehouses and data lakes don't always solve our problems. Even though investments in data continue to grow, confidence that those investments return real business value is falling: according to a study by NewVantage Partners, only 24% of companies have actually managed to adopt a data culture.

However, it is important to point out that the problem does not lie in the technology itself, as the great advances of the last decade have dealt very well with the problems arising from the large volume and processing of data. Instead, limitations in delivering value to the business result from processes and data models, such as:

  • Monolithic and centralized architecture: based on the assumption that we need to centralize data to obtain real value, monolithic architectures have historically been complex and concentrated in a single place. This kind of architecture makes it relatively simple to start a Data Warehouse or Data Lake project, but it also makes it very difficult to scale since these models have problems keeping up with rapid changes.

 

Image courtesy of "How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh" on martinFowler.com

  • Highly centralized and specialized responsibilities: The responsibility for complex architectures is in the hands of a highly specialized engineering team that often works in isolation from the rest of the company, that is, far from where the data is generated and used. Because of this, the data team can become a bottleneck when changes like new data pipeline processes are needed. It's also usually the case that the members of this team don't have a business vision and certainly don't have visibility to all business areas, making it difficult for them to respond at the ideal speed.

Image courtesy of "How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh" on martinFowler.com

Centralized structures (both in terms of staff and in terms of the data platform) cause major challenges for the real democratization of data; one common example is inadequate data quality, due to the lack of business expertise by the engineering team. Centralized structures also make it hard to scale, both because of engineering limitations and also because of the complexity and interdependence of steps in the data pipeline.

Data Mesh to the Rescue!

Faced with the aforementioned problems, Zhamak Dehghani presented a new approach to data architectures, covered in detail in two articles on Martin Fowler's blog: "How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh" and "Data Mesh Principles and Logical Architecture."

The primary objective of Data Mesh is the democratization of data. It challenges the assumptions that large volumes of data must always be centralized and that data must be managed by a single team. To reach its full potential, the Data Mesh architecture follows four basic principles:

  1. Domain-oriented data architecture.
  2. Data as a product.
  3. Infrastructure that makes data accessible as self-service.
  4. Federated governance.

Let's take a closer look at each principle.

 

Image courtesy of "How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh" on martinFowler.com

1. Domain-Oriented Data Architecture

The data architecture must be built and modeled in a way that is oriented to the different business domains instead of being centralized in a single team. This practice enables data to be used and managed close to its respective sources rather than having to be moved. This matters because moving data comes with a cost: for example, we might have to add more processing jobs to a generic workflow, and each job is a possible point of failure.

Another benefit of this Data Mesh principle is that data responsibility is balanced according to the domains involved so that new data sources can be implemented and coupled in a more agile manner. This makes it easier to scale at the same pace as business demands, which evolve rapidly.
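To make this principle concrete, here is a minimal sketch in Python of domain-oriented data ownership. The domain names, dataset names, and registry structure are purely illustrative assumptions, not part of any Data Mesh specification:

```python
from dataclasses import dataclass

@dataclass
class DomainDataset:
    domain: str         # owning business domain, e.g. "orders"
    name: str           # dataset name
    source_system: str  # operational system that produces the data

# Each domain registers and manages its own datasets close to the source,
# instead of shipping everything into one central repository.
registry = [
    DomainDataset("orders", "orders_events", "checkout-service"),
    DomainDataset("shipping", "delivery_status", "logistics-api"),
]

def datasets_owned_by(domain: str) -> list:
    """Return the datasets a given domain owns and maintains."""
    return [d.name for d in registry if d.domain == domain]

print(datasets_owned_by("orders"))  # ['orders_events']
```

The point of the sketch is the ownership boundary: adding a new shipping dataset only touches the shipping domain's entries, so domains can evolve independently.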

2. Data as a Product

Distributing the data architecture is attractive because it allows the architecture to scale more effectively, but it brings problems that did not exist in the centralized model, such as a lack of standardization in access and in data quality. To solve these problems, Data Mesh proposes thinking of data as a product, which means creating new roles, such as a Data Product Owner and a Data Developer, who are responsible for defining and developing data products.

Instead of looking at data as a service, the Data Product Owner must apply product thinking to create a better experience for customers or users, while the Data Developer focuses on developing the product itself. As part of these responsibilities, the Data Product Owner of each domain must make sure that the data is accessible and well documented, determine the best form of storage, and ensure the quality of the data. The purpose of this principle is to give users a good experience when performing analysis and to bring real value to the business.
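As an illustration of "data as a product," the sketch below models the metadata a Data Product Owner might maintain for a product. All field names, the example values, and the publishability rule are assumptions made for the example; they are not prescribed by Data Mesh:

```python
from dataclasses import dataclass

@dataclass
class DataProduct:
    name: str
    domain: str
    owner: str          # the domain's Data Product Owner
    description: str    # documentation for consumers
    schema: dict        # column name -> type
    quality_slo: float  # e.g. minimum fraction of valid rows

    def is_publishable(self) -> bool:
        # A product must be documented and carry an explicit quality target
        # before it is exposed to consumers.
        return bool(self.description) and 0.0 < self.quality_slo <= 1.0

orders = DataProduct(
    name="orders_daily",
    domain="sales",
    owner="sales-data-po@example.com",
    description="Daily order aggregates, refreshed at 06:00 UTC.",
    schema={"order_id": "STRING", "total": "NUMERIC", "day": "DATE"},
    quality_slo=0.99,
)
print(orders.is_publishable())  # True
```

Treating documentation and a quality target as preconditions for publishing is one simple way to encode the product-thinking mindset described above.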

3. Infrastructure that Makes Data Accessible as Self-Service

Another concern that arises in the decentralization scenario is the spreading of knowledge about technologies that were previously concentrated. There's a risk of overloading domain teams and generating rework on the data platform and its infrastructure, which must be built and constantly managed. Since the skills needed for this task are highly specialized and difficult to find, it would be impractical to require each domain to create its own infrastructure environment.

Thus, one of the Data Mesh principles is a self-service data platform that allows domain teams to operate autonomously. This infrastructure is intended to be a high-level abstraction that removes the complexity and challenge of provisioning and managing the lifecycle of data products. It is important to note that this platform must be domain agnostic. The self-service infrastructure must include features that reduce the cost and expertise required to build data products, including scalable data storage, data product schemas, data pipeline construction and orchestration, data lineage, and more. The objective of this principle is to ensure that domain teams can create and consume data products autonomously, using the platform's abstractions.
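The shape of such a self-service abstraction can be sketched as a small, domain-agnostic client. The class, its methods, and the in-memory catalog are hypothetical; in a real platform each call would hide the provisioning of storage, pipelines, lineage, and access policies:

```python
class SelfServePlatform:
    """Hypothetical platform client: domain teams call high-level
    operations, while infrastructure details stay with the platform team."""

    def __init__(self):
        self._products = {}  # catalog: "domain.name" -> metadata

    def provision_product(self, domain: str, name: str, schema: dict) -> str:
        # Behind this single call the platform would create storage,
        # pipelines, lineage tracking, access policies, etc.
        key = f"{domain}.{name}"
        self._products[key] = {"schema": schema, "status": "ready"}
        return key

    def discover(self, domain: str) -> list:
        """List the products a domain has published."""
        return [k for k in self._products if k.startswith(domain + ".")]

platform = SelfServePlatform()
platform.provision_product("shipping", "delivery_status", {"order_id": "STRING"})
print(platform.discover("shipping"))  # ['shipping.delivery_status']
```

The design choice to expose only `provision_product` and `discover` mirrors the principle: the domain team declares *what* it needs, never *how* it is built.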

 

Image courtesy of "How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh" on martinFowler.com

4. Federated Governance

One of the fundamental principles of Data Mesh is federated governance, which aims to balance centralized and decentralized governance models to capture the strengths of both. Federated governance has features such as domain decentralization, interoperability through global standardization, a dynamic topology, and, most importantly, the automated execution of decisions by the platform.

Traditionally, governance teams use a centralized model of rules and processes and hold full responsibility for ensuring global standards for data. In Data Mesh, the governance team shares responsibility through federation: it is responsible for defining, for example, the global (not local) rules for data quality and security, instead of being responsible for the quality and security of all company data. That is, each Data Product Owner has domain-local autonomy and decision-making power while creating and adhering to a set of global rules that ensure a healthy and interoperable ecosystem. Taking the LGPD (the Brazilian general data protection law) as an example: the global governance team remains legally responsible and can inspect domains to ensure adherence to global rules.

It is important to highlight that a domain's data only becomes a product after it has gone through the quality assurance process locally according to the expected data product quality metrics and global standardization rules. Data Product Owners in each domain are in the best position to decide how to measure data quality locally, knowing the details of the business operations that produce the data. Although such decision-making is localized and autonomous, it is necessary to ensure that the modeling is meeting the company's global quality standards as defined by the federated governance team.
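The split between globally defined rules and locally measured metrics can be sketched as follows. The specific rule names, thresholds, and metric keys are assumptions for the example; each company's federated governance team would define its own:

```python
# Global rules defined once by the federated governance team.
GLOBAL_RULES = {
    "min_completeness": 0.95,  # minimum fraction of non-null required fields
    "pii_encrypted": True,     # e.g. to comply with LGPD
}

def passes_global_rules(local_metrics: dict) -> bool:
    """Check a domain's locally measured metrics against the global rules.

    The domain decides *how* to measure quality; governance only checks
    that the reported outcome meets the global standard.
    """
    return (
        local_metrics.get("completeness", 0.0) >= GLOBAL_RULES["min_completeness"]
        and local_metrics.get("pii_encrypted", False) == GLOBAL_RULES["pii_encrypted"]
    )

print(passes_global_rules({"completeness": 0.98, "pii_encrypted": True}))  # True
print(passes_global_rules({"completeness": 0.90, "pii_encrypted": True}))  # False
```

A check like this is what "automated execution of decisions by the platform" looks like in miniature: the platform can gate publication of a data product on it, rather than relying on manual review by a central team.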

Data Mesh Architecture and Adoption

The image below explains the architecture at a high level, including its four principles. It starts with the data platform and passes through the domains responsible not only for applications and systems but also for data, all under the macro responsibility of federated governance, ensuring product interoperability.

Image courtesy of "Data Mesh Principles and Logical Architecture" on martinFowler.com

Thanks to the recent advances in data storage and processing technology, the technological factor is not a problem for the adoption of Data Mesh, since the tools used in Data Lake/Warehouse can be used in the new model. As a reference, this article presents the possibility of creating a Data Mesh architecture based on GCP (Google Cloud Platform). In addition, there is a wide variety of cloud data storage options, enabling each domain to choose the right storage solution for their needs.

It is important to point out that adopting Data Mesh requires a change of culture within your company, from business areas to engineering, which can be a barrier in the implementation of this model. To know if your company would really benefit from Data Mesh, you need to answer some questions, such as:

  1. How many data sources do we have?
  2. How many people are on our data team?
  3. How many possible business domains do we have?
  4. Is the data engineering team currently a bottleneck? If so, how often is this the case? 
  5. What is the current level of importance that the company gives to the subject of data governance?

In general, the greater the number of data sources, the larger the set of consumers, and the more numerous and complex the business rules and business domains, the more likely it is that your Data Lake/Data Warehouse will become a bottleneck in the delivery of quality solutions. This is a scenario that would likely benefit from adopting a Data Mesh architecture. Likewise, if you're discarding data sources that are valuable to business users because they're too complex to integrate into the current Data Lake/Data Warehouse structure, that's a good sign it might be time to migrate to this new architecture. It is also possible to start with specific projects that could make good use of Data Mesh and change the culture and architecture little by little.

Conclusion

Data Mesh may not be applicable in every environment, but it offers an alternative to current data architecture models, allowing greater synergy between technical teams and business areas, which are the big users of data. 

 

References

"How to Move Beyond a Monolithic Data Lake to a Distribute Data Mesh."  Zhamak Dehghani. martinFowler.com. 

"Data Mesh Principles and Logical Architecture." Zhamak Dehghani. martinFowler.com. 


Author

Gabriel Luz

Gabriel Luz is a Data Engineer at Avenue Code. He loves to learn new technologies and to work on challenging projects that impact people. In his free time, Gabriel likes to read about history and watch superhero movies and Flamengo soccer games.

