As a company grows, data-related problems can arise. These might include organisational silos, a lack of data sharing, and poor understanding of what data means outside the context of its business domain. Incompatible technologies can also be an issue, hindering actionable insights, and it can be difficult to push data through ‘extract, transform, load’ (ETL) pipelines. As demand for ad hoc data queries grows, the system struggles to cope, and the risk of shadow data analytics increases.
Any solution to these problems needs to start small, prove its worth and scale as the company grows. Ideally it would emphasise democratisation of data at the business domain level while accommodating different technologies and data analytic approaches. In many cases, a data mesh architecture may be the best option (see our earlier blog post, Driving Growth with Data Mesh Architectures).
Benefits of Data Mesh
A data mesh can start small and grow as needed, providing a budget-friendly option that delivers value and meets evolving needs. It’s a distributed approach to data management that views different datasets as domain-oriented ‘data products’.
Each set of data products is managed by product owners and engineers who have good knowledge of the relevant domain. The idea is to distribute data ownership and responsibility in a way that is often lacking in centralised, monolithic architectures such as data lakes. In many ways a data mesh architecture resembles a microservice architecture. The focus on domain-specific data products tends to avoid the tight coupling of ingestion, storage, transformation, and consumption of data typically found in traditional data architectures.
If your company is eager to explore data mesh while reducing the risk of beginner mistakes, read on.
Avoid These Pitfalls to Get the Most from a Data Mesh Architecture
Pitfall 1: Failure to Follow DATSIS Principles
The DATSIS acronym – Discoverable, Addressable, Trustworthy, Self-describing, Interoperable and Secure – is a good place to start. Failure to implement any part of DATSIS could doom your data mesh.
- Discoverable – consumers can research and identify data products from different domains. This is typically done with a centralised tool like a data catalogue.
- Addressable – as with microservices, data products are accessible via unique addresses and standard protocols (REST, AMQP, possibly SQL).
- Trustworthy – domain owners provide high quality data products that are useful and accurate.
- Self-describing – a data product's metadata tells consumers everything they need to use it, so they never have to chase down a data expert (a sketch of such a descriptor follows this list).
- Interoperable – data products must be consumable by other data products.
- Secure – access to data products is automatically regulated through access policies and security standards. This security is built into each data product.
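To make several of these properties concrete, here's a minimal, hypothetical descriptor a data product might publish to the mesh. Every field name and value below is illustrative rather than part of any standard; think of it as one possible shape for the metadata a catalogue would index.

```python
# Hypothetical metadata descriptor for a single data product. The field
# names are assumptions for illustration, not a standard.
data_product_descriptor = {
    "name": "orders.daily-summary",  # unique, domain-qualified name (discoverable)
    "address": "https://data.example.com/orders/daily-summary/v1",  # stable REST address (addressable)
    "owner": "orders-domain-team@example.com",  # accountable domain team (trustworthy)
    "schema": {  # field types readable without asking a data expert (self-describing)
        "order_date": "date",
        "total_orders": "integer",
        "total_revenue": "decimal(12,2)",
    },
    "format": "application/json",  # standard format other products can consume (interoperable)
    "access_policy": "role:analyst",  # access control baked into the product (secure)
}
```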
Pitfall 2: Failure to Update Data Catalogues
The discoverability aspect of DATSIS is key. A data catalogue can be used as an inventory of data products in a data mesh, most often using metadata to support data discovery and governance.
Any discoverability mechanism must be kept up to date to protect the usefulness of the data mesh. Out-of-date documentation is often more damaging than no documentation. For this reason, we recommend a docs-as-code scheme where updating the data catalogue is part of the code review checklist for every pull request. With each merged pull request, updated metadata enters the DevOps pipeline and automatically updates the data catalogue. Depending on the data catalogue, it may be updated directly through an API, by pulling JSON files from an S3 bucket, or by other methods.
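As a sketch of what that pipeline step might look like, the snippet below pushes a metadata file to a catalogue API after a merge. The endpoint, payload shape, and file name are all assumptions; the equivalent for your catalogue might instead drop the JSON into an S3 bucket that the catalogue polls.

```python
import json
import pathlib
import sys

import requests

# Hypothetical pipeline step: after a pull request merges, push the data
# product's metadata file to the catalogue's API. The endpoint and payload
# shape are assumptions; adapt them to your catalogue.
CATALOG_URL = "https://catalog.example.com/api/v1/data-products"

def publish_metadata(metadata_file: str) -> None:
    metadata = json.loads(pathlib.Path(metadata_file).read_text())
    response = requests.put(
        f"{CATALOG_URL}/{metadata['name']}",
        json=metadata,
        timeout=30,
    )
    response.raise_for_status()  # fail the pipeline if the catalogue rejects it

if __name__ == "__main__":
    publish_metadata(sys.argv[1] if len(sys.argv) > 1 else "data-product.json")
```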
Pitfall 3: Failure to Automate Testing
By definition, a data mesh is a decentralised collection of data. It's important to ensure consistent quality across data products owned by different teams that may not be aware of each other. Automated test frameworks that specialise in API testing can be used to achieve this. We recommend the Karate framework. Whichever tool you choose, it's useful to follow these five principles (a minimal test sketch follows the list):
- Ensure every domain team is responsible for the quality of its own data. Testing depends upon the nature of the data and is decided upon by the team.
- Take advantage of the fact that a data mesh is read-only: tests can run against mock data, and can also be run repeatedly against live data without side effects.
- Run tests on developer laptops, in CI/CD pipelines, or against live data accessed through specific data products or an orchestration layer. Test-driven design is another approach that can be used successfully in a data mesh.
- Include business domain subject matter experts (SMEs) when designing your tests.
- Involve data consumers when designing your tests to make sure data products meet their needs.
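Karate tests are written in its own Gherkin-style DSL; to keep the examples here in one language, the sketch below expresses the same kind of checks in Python with pytest and requests. The URL, fields, and thresholds are assumptions that each domain team would replace with its own contract.

```python
import requests

# Hypothetical smoke test for a data product's read API; the URL, fields,
# and checks are illustrative stand-ins for a team's own contract.
PRODUCT_URL = "https://data.example.com/orders/daily-summary/v1"

def test_data_product_is_available_and_well_formed():
    response = requests.get(PRODUCT_URL, timeout=30)
    assert response.status_code == 200

    rows = response.json()
    assert len(rows) > 0, "data product returned no rows"
    for row in rows:
        assert "order_date" in row        # advertised fields are present
        assert row["total_orders"] >= 0   # plausibility check agreed with SMEs
```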
Pitfall 4: Overuse of Familiar Tools
When people become very proficient with one set of tools, they may use them in situations where they are not appropriate. Many companies struggle to scale data analytics because they try to use the data infrastructure to solve every need for information. An architecture where ETL pipelines pump data into a data lake is in many ways monolithic and has a finite capacity to deliver value. It simply does not scale well. A data lake excels at ad hoc queries and computationally intensive operations, but its centralised nature can make it hard to include pipelines from every domain in the company.
The decentralised nature of a data mesh allows it to include data from an almost arbitrary number of domains. Nevertheless, computationally intensive operations are time consuming with a data mesh.
It’s important to use the right architecture to solve the right problems. You also need to recognise the important role of data engineers. If they don’t feel valued, or feel their jobs are threatened by a data mesh, they will act against it – even though data mesh and data lake architectures can be complementary.
Pitfall 5: Tight Coupling Between Data Products
As with microservices, tight coupling is the enemy of a highly functional data mesh. Apply the 'independently deployable' rule: every data product should be deployable at any time without corresponding changes to other data products on the mesh. Adhering to this rule will likely require a versioning scheme for data products, as sketched below.
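One common convention (an assumption here, not the only workable scheme) is to carry the major version in the data product's address and keep old major versions serving until their consumers migrate, so a breaking change never forces a lockstep deployment across the mesh:

```python
# Hypothetical handlers for two major versions of the same data product,
# served side by side so consumers migrate on their own schedule.

def serve_v1() -> dict:
    # Original contract: revenue exposed as a string.
    return {"order_date": "2024-01-31", "revenue": "1042.50"}

def serve_v2() -> dict:
    # Breaking change: field renamed and typed as a number.
    return {"order_date": "2024-01-31", "total_revenue": 1042.50}

ROUTES = {
    "/orders/daily-summary/v1": serve_v1,  # existing consumers keep working
    "/orders/daily-summary/v2": serve_v2,  # new consumers opt in explicitly
}
```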
Pitfall 6: Data Products Don't Evolve Correctly
Data evolves as a company evolves, often in unpredictable ways. This can result in changes to the domain structure of the company, and/or changes to the structure and nature of data within the domains. Data meshes should be built to adapt to these changes.
Adding domains to a data mesh is simple: add data products, ensure they are discoverable in a data catalogue or similar product, and build dashboards or other types of display as necessary.
Removing data products occurs less frequently and is a little more difficult. It is usually done manually. If another data product consumes the one being removed, it must be examined. Does it still make sense to expose the consuming data product to users? How are consumers of that data product notified about changes or complete removals of data products? The answers will be different for each company, and must be considered carefully.
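A small helper like the hypothetical one below can support that manual process by asking the data catalogue which products consume the one being retired, so their owners can be notified first. The lineage endpoint and response shape are assumptions; substitute whatever your catalogue exposes.

```python
import requests

# Hypothetical lineage lookup: ask the catalogue which products consume the
# one being retired. The endpoint and response shape are assumptions.
CATALOG_URL = "https://catalog.example.com/api/v1"

def downstream_consumers(product_name: str) -> list[str]:
    response = requests.get(
        f"{CATALOG_URL}/data-products/{product_name}/consumers",
        timeout=30,
    )
    response.raise_for_status()
    return [consumer["name"] for consumer in response.json()]

if __name__ == "__main__":
    blocked_by = downstream_consumers("orders.daily-summary")
    if blocked_by:
        print(f"Notify the owners of {blocked_by} before removing this product")
```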
Pitfall 7: Data Products Aren't Accurately Versioned
Data products must be versioned as data changes. Users of data products (including people who maintain dashboards) should be notified about changes, both breaking and non-breaking. Consumed data products need to be managed like dependencies, much as Helm charts declare the charts they depend on or Maven builds pin artifact versions in a repository.
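One hedged sketch of what that could look like: the consuming product declares the upstream versions it was built against, in the spirit of a Helm chart's dependency list or a Maven POM, and a deploy-time check flags anything no longer served. All names here are hypothetical.

```python
# Hypothetical consumer-side dependency manifest: which major versions of
# upstream data products this product was built against.
CONSUMED_PRODUCTS = {
    "orders.daily-summary": "v1",
    "customers.profile": "v2",
}

def incompatible_products(available: dict[str, list[str]]) -> list[str]:
    """Return consumed products whose pinned version is no longer served."""
    return [
        name for name, pinned in CONSUMED_PRODUCTS.items()
        if pinned not in available.get(name, [])
    ]

# Example: v1 of orders.daily-summary has been retired upstream.
print(incompatible_products({
    "orders.daily-summary": ["v2"],
    "customers.profile": ["v1", "v2"],
}))  # -> ['orders.daily-summary']
```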
Pitfall 8: Issues Related to Sync vs Async vs Pre-assembled Results
If a data mesh uses synchronous REST calls to package the output from a few data products, chances are the performance will be acceptable. But if the data mesh is used for more in-depth analytics combining a larger number of data products (such as the analysis typically done by a data lake), it is easy to see how synchronous communication might become a performance issue.
One way to resolve this is with a solution similar to Command Query Responsibility Segregation (CQRS): pre-build and cache data results on a regular cadence, then combine the cached results into a more complex data structure when the data product is run. This is very effective if you don't require up-to-the-moment results.
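As a rough illustration (the function and cache here are stand-ins, not a prescribed design), the write side pre-assembles the expensive result on a schedule while the read side serves only from the cache:

```python
import time

_cache: dict[str, dict] = {}

def build_expensive_report() -> dict:
    # Stands in for calling several data products and merging their results.
    return {"built_at": time.time(), "rows": []}

def refresh_cache() -> None:
    """Command side: run on a schedule (cron, Airflow, etc.), never per request."""
    _cache["report"] = build_expensive_report()

def read_report() -> dict:
    """Query side: cheap, returns the last pre-assembled result."""
    return _cache["report"]

refresh_cache()       # normally triggered by the scheduler
print(read_report())  # requests only ever touch the cache
```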
Another approach is to break the operation into separate pieces that can be run asynchronously using an Asynchronous Request-Reply pattern (a minimal sketch follows the list below). Using this pattern implies that:
- There are no ordering dependencies between the datasets you construct. In other words, if you concurrently build five datasets, the content of Dataset #2 cannot be dependent on the content of Dataset #1.
- The caller will probably not receive an immediate response to their request. Instead, some sort of polling technique returns successfully only when all datasets are built and combined. If the dataset is very large, it may be stored somewhere with users given a link to access it, with appropriate infrastructure and security in place.
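Here's a minimal, self-contained sketch of the pattern using Python's asyncio; the in-memory job store, names, and timings are assumptions. The caller submits a job, gets an id immediately, and polls until every dataset has been built and combined:

```python
import asyncio
import uuid

_jobs: dict[str, dict] = {}

async def build_dataset(name: str) -> list[str]:
    # Stands in for a slow call to a single data product.
    await asyncio.sleep(1)
    return [f"{name}-row"]

async def run_job(job_id: str, dataset_names: list[str]) -> None:
    # Datasets are built concurrently, so none may depend on another's output.
    results = await asyncio.gather(*(build_dataset(n) for n in dataset_names))
    _jobs[job_id] = {"status": "done",
                     "result": [row for rows in results for row in rows]}

async def main() -> None:
    job_id = str(uuid.uuid4())
    _jobs[job_id] = {"status": "running"}
    task = asyncio.create_task(run_job(job_id, ["orders", "customers", "shipments"]))
    while _jobs[job_id]["status"] != "done":  # the caller's polling loop
        await asyncio.sleep(0.2)
    print(_jobs[job_id]["result"])
    await task  # ensure the background task has fully finished

asyncio.run(main())
```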
Use the Right Tool for the Job
A big advantage of the data mesh architecture is that it can start small and grow as demand increases. Early mistakes tend to be small mistakes and teams learn through experience how to manage increased demand for data while avoiding the inherent political and technical pitfalls. Data lakes and meshes are excellent solutions for different problems; it’s important to understand which is best for your employees and their data needs. The following table provides a useful comparison:
| Architecture | Data Lake | Data Mesh |
| --- | --- | --- |
| Data distribution | Centralised | Distributed |
| Scalable? | Limited | Yes |
| Specialised staff required? | Yes, data engineers | No |
| Large up-front investment? | Yes | No, begin small and grow |
| Ad hoc queries? | Yes | No |
| Widely used tooling? | No, specialised tooling | Yes, standard microservice tooling |
A data lake or data warehouse architecture could be right for the job; it's just not the right tool for every job, any more than a data mesh is. In fact, it's easy to see scenarios where a data mesh and a data lake coexist and make each other stronger. Data lakes and warehouses typically require a large up-front investment, often hundreds of thousands of dollars, and experienced data engineers must be hired to ensure a return on that investment. Even so, they have a place in today's data architectures: they shine when housing large datasets that can be queried in computationally intensive ways. If that matches your needs, a data lake architecture may be right for you.
A data mesh can be an excellent tool for on-demand reporting, analytics, and streaming. However, performance can be limited by slow queries in any single node. Building the infrastructure to run ad hoc queries against a data mesh would be difficult, although new techniques show promising results.
If you’re considering how to evolve your company’s data analytics capabilities and would like to discuss the options, contact us here.
Mike is a Managing Principal Consultant at Sourced Group. Over the last 25 years Mike has worked as an engineer, architect, and leader of large engineering organizations.