Discovering, accessing, and integrating new datasets for use in data analytics, data science, and other data pipeline tasks is typically a slow process in large, complex organizations. These organizations typically have hundreds of thousands of datasets that are actively managed across a variety of internal data stores and have access to orders of magnitude more external datasets. Just finding relevant data for a particular process is an almost overwhelming task.
Even once the relevant data has been identified, going through the approval, governance, and staging processes necessary to actually use that data can take months in practice. This is often a major obstacle to organizational agility. Data scientists and analysts are pushed to use pre-approved and pre-curated data found in centralized repositories, such as data warehouses, instead of being encouraged to use a wider range of datasets in their analysis.
Moreover, even once data from new datasets becomes available for use in analytical tasks, the fact that they come from different data sources usually means that they have different data semantics, which makes it difficult to unify and integrate these datasets. For example, they may refer to the same real-world features using different identifiers than existing datasets, or may associate different attributes (and types for those attributes) with the real-world features modeled in existing datasets. Additionally, data on these features is likely to have been sampled in a different context than existing datasets. Semantic differences between datasets make it difficult to integrate them into the same analytical task, reducing the possibility of obtaining a holistic view of the data.
Overcoming data integration challenges
Nevertheless, despite all these challenges, this work of data discovery, integration, and staging must get done in order for data analysts and scientists in an organization to succeed. Today it is usually done with significant human effort: some on behalf of the person doing the analysis, but most by centralized teams, especially when it comes to onboarding, cleaning, and staging data. The problem, of course, is that centralized teams become organizational bottlenecks, further hampering agility. The status quo is not acceptable to anyone, and several proposals have emerged to address this issue.
Two of the best-known proposals are the “data fabric” and the “data mesh.” Rather than giving an overview of these ideas, this article focuses on how the data fabric and the data mesh apply specifically to the problem of data integration, and how each approaches the challenge of eliminating reliance on a centralized, enterprise-wide team to perform that integration.
Let’s take the example of an American automobile manufacturer that acquires another automobile manufacturer in Europe. The American automaker maintains a parts database detailing information on all the different parts needed to make a car: supplier, price, warranty, inventory, and more. This data is stored in a relational database, for example, PostgreSQL. The European automaker also maintains a parts database, stored as JSON in a MongoDB database. Obviously, integrating these two datasets would be very useful, since it is much easier to deal with a single parts database than two separate ones, but there are many challenges. They are stored in different formats (relational vs. nested) by different systems, and they use different terms, different identifiers, and even different units for various data attributes (e.g., feet vs. meters, dollars vs. euros). Executing this integration is a lot of work, and if done by a central enterprise-wide team, it can take years.
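To make these mismatches concrete, here is a minimal sketch of what one record from each automaker might look like, and what unifying them involves. All field names, identifiers, and conversion rates below are invented for illustration; they are not taken from any real schema.

```python
# A row from the American automaker's relational (PostgreSQL-style) parts table.
us_part = {
    "part_id": "US-10442",
    "supplier_name": "Acme Radiators Inc.",
    "price_usd": 129.99,
    "warranty_months": 24,
}

# The equivalent part as a nested JSON document from the European
# automaker's MongoDB database: different identifier scheme, nested
# supplier attributes, euros instead of dollars, years instead of months.
eu_part = {
    "partNumber": "EU-77-0931",
    "supplier": {"name": "Acme Radiators", "address": "Berlin, DE"},
    "priceEur": 118.50,
    "warrantyYears": 2,
}

# Unifying the two requires flattening the nesting, mapping field names,
# and converting units (assumed fixed rate, purely for illustration).
EUR_TO_USD = 1.08

unified = {
    "part_id": eu_part["partNumber"],
    "supplier_name": eu_part["supplier"]["name"],
    "price_usd": round(eu_part["priceEur"] * EUR_TO_USD, 2),
    "warranty_months": eu_part["warrantyYears"] * 12,
}
print(unified)
```

Even this toy version glosses over the hard parts, such as deciding that `US-10442` and `EU-77-0931` refer to the same physical part, which is exactly where the real integration effort goes.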
Automation with the data fabric approach
The data fabric approach attempts to automate the integration process as much as possible, with little or no human effort. For example, it uses machine learning (ML) techniques to discover overlapping attributes (e.g., both datasets contain supplier and warranty information) and overlapping values (e.g., many suppliers from one dataset also appear in the other dataset) in order to flag these two datasets as candidates for integration in the first place.
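One simple signal for this kind of overlap discovery is set similarity between the values of candidate attributes. The sketch below uses Jaccard similarity over supplier names; the supplier lists and the 0.3 threshold are invented for illustration, and a real data fabric would combine many such signals, often learned rather than hand-set.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of values."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Supplier names extracted from each automaker's parts dataset (hypothetical).
us_suppliers = {"Acme Radiators", "Bolt & Co", "Gasket World", "TireTech"}
eu_suppliers = {"Acme Radiators", "Bolt & Co", "EuroBrakes", "Gasket World"}

overlap = jaccard(us_suppliers, eu_suppliers)
print(f"supplier value overlap: {overlap:.2f}")

# Assumed threshold: enough shared values to flag the pair for integration.
if overlap > 0.3:
    print("flag these datasets as integration candidates")
```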
ML can also be used to convert the JSON dataset into a relational model: soft functional dependencies that exist in the JSON dataset are discovered (for example, whenever we see a value of X for supplier_name, we see a value of Y for supplier_address) and used to identify groups of attributes that likely correspond to an independent semantic entity (for example, a supplier entity), and to create tables for these entities along with the associated foreign keys in parent tables. Entities with overlapping domains can be merged, with the end result being a complete relational schema. (Much of this can actually be done without ML, such as with the algorithm described in this SIGMOD 2016 research paper.)
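The core of this idea can be sketched in a few lines: check whether a functional dependency holds over a sample of flattened JSON documents, and if so, pull the dependent attributes out into their own table with a foreign key left behind. This is a deliberately simplified, exact-match version of the technique; the records and field names are hypothetical, and real systems tolerate noise (hence “soft” dependencies) and score many candidate dependencies at once.

```python
# Flattened JSON part documents from the European dataset (hypothetical).
parts = [
    {"part": "EU-1", "supplier_name": "Acme", "supplier_address": "Berlin"},
    {"part": "EU-2", "supplier_name": "Bolt", "supplier_address": "Lyon"},
    {"part": "EU-3", "supplier_name": "Acme", "supplier_address": "Berlin"},
]

def holds_fd(rows, lhs, rhs):
    """Check whether the functional dependency lhs -> rhs holds over the sample."""
    seen = {}
    for r in rows:
        # setdefault records the first rhs value seen for this lhs value;
        # any later disagreement means the dependency is violated.
        if seen.setdefault(r[lhs], r[rhs]) != r[rhs]:
            return False
    return True

# supplier_name -> supplier_address suggests an independent supplier entity,
# so normalize: a supplier table plus a foreign key in the parts table.
if holds_fd(parts, "supplier_name", "supplier_address"):
    suppliers = {r["supplier_name"]: r["supplier_address"] for r in parts}
    supplier_table = [{"name": n, "address": a} for n, a in suppliers.items()]
    parts_table = [{"part": r["part"], "supplier_fk": r["supplier_name"]}
                   for r in parts]
```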
This relational schema produced from the European dataset can then be integrated into the existing relational schema of the American dataset, and ML can also be used in this process. For example, query history can be used to observe how analysts access these individual datasets relative to other datasets, and to discover similarities in access patterns. These similarities can be used to bootstrap the data integration process. Similarly, ML can be used for entity matching across datasets. At some point, humans need to be involved to finalize the data integration, but the more the data fabric’s techniques can automate key steps in the process, the less work humans have to do, making them less likely to become a bottleneck.
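A tiny illustration of the entity-matching step: decide whether supplier records from the two datasets refer to the same real-world supplier despite differing spellings. This sketch uses simple string similarity from the standard library as a stand-in for a learned matcher; the supplier names and the 0.6 threshold are invented for illustration.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Rough string similarity in [0, 1], case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Supplier names from each automaker's dataset (hypothetical).
us_suppliers = ["Acme Radiators Inc.", "Bolt & Co"]
eu_suppliers = ["Acme Radiators", "Bolt and Company"]

matches = []
for us in us_suppliers:
    # For each American supplier, find the most similar European supplier.
    best = max(eu_suppliers, key=lambda eu: similarity(us, eu))
    score = similarity(us, best)
    # Assumed threshold; a real system would learn when to trust a match
    # and route borderline cases to a human for review.
    if score > 0.6:
        matches.append((us, best, round(score, 2)))

print(matches)
```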
The human-centric data mesh approach
The data mesh takes a totally different approach to this same data integration problem. While ML and automated techniques are certainly not discouraged in the data mesh, fundamentally, humans still play a central role in the integration process. However, these humans are not a centralized team, but rather a collection of domain experts.
Each dataset is owned by a particular domain team that has expertise in that dataset. This team is responsible for making the dataset available to the rest of the business as a data product. If a new dataset arrives that, when integrated with an existing dataset, would increase the usefulness of the original dataset, then performing that integration increases the value of the original data product.
To the extent that these teams of domain experts are rewarded when the value of the data product they produce increases, they will be motivated to do the hard work of data integration themselves. Ultimately, the integration is done by subject matter experts who understand automotive parts data well, instead of a centralized team that doesn’t know the difference between a radiator and a grille.
Transforming the role of humans in data management
In summary, the data fabric still requires a central human team that performs critical functions for the overall orchestration of the fabric. Nevertheless, in theory, this team is unlikely to become an organizational bottleneck, because much of its work is automated by the AI processes in the fabric.
In contrast, in the data mesh, the central human team is never on the critical path of a task performed by data consumers or producers. However, the focus is much less on replacing humans with machines; instead, the focus is on shifting human effort to the distributed teams of domain experts who are most involved in its execution.
In other words, the data fabric is fundamentally about eliminating human effort, while the data mesh is about making smarter and more efficient use of human effort.
Of course, it would seem at first glance that eliminating human effort is always better than reallocating it. However, despite the incredible recent advances in ML, we are still not at the point today where we can fully trust machines to perform the key data management and integration activities that are currently performed by humans.
As long as humans are still involved in the process, it is important to ask how they can be used most effectively. Moreover, some ideas from the data fabric are quite complementary to the data mesh and can be used together (and vice versa). Thus, the question of which to use today (data mesh or data fabric), and whether it is even a question of one over the other in the first place, is not obvious. Ultimately, an optimal solution will likely take the best ideas from each of these approaches.
Daniel Abadi is the Darnell-Kanal Professor of Computer Science at the University of Maryland, College Park, and Chief Scientist at Starburst.