The Data Mesh—An Advanced Distributed Data Lake Architecture
Rapid advances in machine learning (ML) and artificial intelligence (AI) are extending the range of problems that can be solved and making building and training models easier. The primary barrier to advances in this field in manufacturing continues to be incomplete or inaccurate data sets.
In the March issue of Automation.com, we discussed the traditional centralized data lake architecture and the benefits of moving to a distributed data lake. We’re taking that one step further to create a flexible, high-performance, highly scalable data integration platform for manufacturing: a true “data mesh.”
In this architecture, data owners, individuals or teams with in-depth knowledge and understanding of specific data sets, are in charge of creating, processing, and providing their specialized data. This distributed, decentralized, loosely connected system treats data as a product, allowing the data owners, who have the most expertise and insight into the data’s importance, to assume responsibility for its management. This approach empowers them to collect and publish data without relying on a centralized team, which leads to a more complete data set. Data consumers, such as computer and data scientists, are then free to discover and use that information for AI/ML or other applications.
Data collection in the AI era
It’s clear that AI/ML is transformative. The substantial advancements in large language models, like ChatGPT, represent just one of numerous AI/ML techniques that have already begun to transform industries, including internet search, software development, and document generation. In the manufacturing sector, AI-driven advancements can optimize production processes, reduce waste, and improve quality control, opening up new possibilities for efficiency and innovation. Despite the impressive capabilities that AI/ML approaches currently demonstrate, they are still in their early stages of development and have immense potential for growth and improvement. In a detailed review of the semiconductor industry, McKinsey estimates that “AI/ML now contributes between $5 billion and $8 billion annually to earnings before interest and taxes,” but also that this “reflects only about 10 percent of AI/ML’s full potential within the industry.”
Artificial intelligence and machine learning applications have an immense appetite for data, and suboptimal results can stem from incomplete or inaccurate data input. Models such as ChatGPT are trained using vast amounts of data sourced from the internet. In the context of manufacturing, it is crucial to gather data that is specifically tailored to our environment and process to achieve the desired outcomes.
Existing approaches—data warehouses and monolithic data lakes
Manufacturing generates vast amounts of data from diverse sources such as machinery, sensors, and enterprise systems. Historically, data management in this sector has relied on “data warehouses,” which are often built on sizable relational databases. The objective of a data warehouse is to establish a centralized storage space for data, generally used for reporting and data analysis purposes. Data warehouses are adept at handling traditional “structured” manufacturing data, such as values obtained from PLCs or time-series data, which can be easily organized and comprehended.
As AI/ML models continue to advance, they are now capable of deriving valuable insights not only from traditional structured data but also from unstructured data. Unlike structured data, unstructured data lacks a specific format or organization, presenting a more complex landscape for analysis. Examples include text documents, waveforms, log files, blueprints, schematics, images, and videos. It is characterized by a high degree of variability and a lack of predictability.
Unstructured data offers engineers and data scientists the opportunity to gain significant insights that cannot be obtained solely from conventional, structured manufacturing data, thus delivering additional context and understanding. Reports and Markets suggests that “over 40% of the operational value of IoT is extracting and monetizing unstructured data.”
Data warehouses are poorly equipped to manage this unstructured data. Transformations to coerce it into a more structured format suitable for a data warehouse can limit the insights that organizations can derive from unstructured data, as they may not have the tools to perform advanced analytics, such as natural language processing and image recognition. Data lakes aim to overcome this limitation by using alternative data storage and processing solutions that are better suited for handling unstructured data, such as NoSQL databases and big data processing platforms like Hadoop and Spark, and by storing it in raw format or with very little modification.
In both the monolithic data lake and the data warehouse, the goal is to consolidate everything in a single storage system. This is wasteful in storage, in data collection resources, and in data management effort. It’s difficult to see a purpose for duplicating data that is already part of a working system, such as a manufacturing database, MES, ERP, or similar. Moreover, few organizations of any size have just one data warehouse, database, or manufacturing system.
Different departments and divisions build data storage for different use cases, or IT systems are adopted as part of an acquisition. These “silos” of information are not readily integrated, often having been created for different functionality and being managed by different organizational units. This is problematic for any monolithic approach, data warehouse or data lake.
Intuitively, a monolithic architecture does not match the nature of data sources in manufacturing. The number of data sources in manufacturing is huge and distributed, and the sources all produce data in different formats. Every piece of equipment, PLC, and subsystem, and even smart breakers and small sensors, generates data that’s potentially valuable for AI/ML applications. The collection and management architecture should match the real nature of the data: distributed and highly variable.
Each of these data sources has an internal domain expert, someone who really knows the tool or application and how and what data to extract. These domain experts should be empowered to manage their data “product” to easily add, remove, or upgrade sources.
What we need is not to put all this data into a centralized system, but to be able to query, view, and extract this data as if it were in a single system. Rather than try to copy all our existing sources of data into a central system, we need to “wrap-and-embrace” diverse data sources; i.e., integrate these distributed platforms so that they can be searched as a single platform.
With this concept, we move from a centralized system to a heterogeneous and distributed set of data platforms. This is possible with a distributed architecture using an industrial internet of things (IIoT) or edge approach whereby small, self-contained applications (“microservices”) close to the equipment can be managed by the data owner. These platforms are integrated by “data virtualization,” allowing users and applications to query the data without caring about where or how the data is stored—the “data mesh.”
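To make the idea of data virtualization concrete, here is a minimal sketch in Python. It is not how any particular data mesh product works; the class VirtualCatalog, the source functions, and the data set names are all hypothetical, and the two sources are in-memory stand-ins for an edge microservice and an existing MES database. The point is only that the consumer asks for a data set by name and never learns where it lives.

```python
from typing import Callable, Dict, List

# A "source" is any callable that returns rows for one data set.
# Every name below is hypothetical and used only for illustration.
Row = Dict[str, object]
Source = Callable[[], List[Row]]


class VirtualCatalog:
    """Maps logical data set names to the sources that own them."""

    def __init__(self) -> None:
        self._sources: Dict[str, Source] = {}

    def register(self, name: str, source: Source) -> None:
        # Called by the data owner; the data itself stays where it is.
        self._sources[name] = source

    def query(self, name: str, **filters: object) -> List[Row]:
        # The consumer names the data set, not the system that stores it.
        rows = self._sources[name]()
        return [r for r in rows
                if all(r.get(k) == v for k, v in filters.items())]


def press_plc() -> List[Row]:
    # Stand-in for an edge microservice reading a PLC.
    return [
        {"machine": "press-1", "temp_c": 71.5},
        {"machine": "press-2", "temp_c": 68.0},
    ]


def mes_orders() -> List[Row]:
    # Stand-in for an existing MES database that is left in place.
    return [{"machine": "press-1", "order": "A-100"}]


catalog = VirtualCatalog()
catalog.register("press_temperatures", press_plc)
catalog.register("work_orders", mes_orders)

# The caller does not know (or care) that this comes from a PLC.
hot = catalog.query("press_temperatures", machine="press-1")
```

Adding a new source is a single register call made by its owner, which is the property that lets the mesh grow without a central team.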
Data meshes focus on treating data as a product and emphasize the importance of cross-functional, domain-oriented teams that are responsible for their own data.
The main principles of data meshes are:
Domain-oriented ownership: Data is treated as a product, with individuals or teams taking ownership of their data and its quality. These teams are responsible for generating, processing, and serving their domain-specific data.
Self-serve data infrastructure: Data meshes encourage building a self-serve infrastructure platform to support domain teams. This platform should be easily accessible and empower teams to discover, access, and use data without relying on a centralized team.
Product thinking for data: Data is treated as a product with its own lifecycle, from creation to usage. This mindset helps organizations focus on the value of data and its usability for AI/ML or other consumers.
All existing data platforms, the data warehouses, databases, MES, ERP, IT systems, and shared drives, become a part of the data mesh, not by moving the data from these systems to a central system, but by providing a virtualization layer.
The approach is also hugely scalable. As the volume, variety, and velocity of data grow, a data mesh can scale to handle large amounts of information across multiple sources without compromising performance or user experience.
Solutions for data ingestion and integration
The data mesh revolutionizes data management and offers numerous benefits, including increased scalability, flexibility, and decentralization; however, it must address the issues of data ingestion and the seamless integration of disparate data sources. The data owners understand the data but are typically not trained software developers and so need easy-to-use tools to construct efficient ingestion pipelines. A data mesh must also provide a comprehensive solution for seamless data fusion and an intuitive and streamlined platform for data access.
Tools to empower data owners
Just as with a traditional data lake, the data mesh requires an ingestion pipeline. In a monolithic data lake, this is typically a software application created by the IT department. The complexity and management of this application increase as the amount and type of data being added to the data lake increase.
In a data mesh, this becomes much simpler. “Ingestion pipeline” now really just means “describe the data.” This task has also now moved from IT to the data owner or domain expert, and onto a single node. The process is hugely simplified, as we’re asking the domain expert to add just the data they know well without needing knowledge of any other data pipelines.
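As a rough illustration of what “describe the data” could look like for one node, the sketch below uses plain Python dataclasses. The classes FieldSpec and DataProduct, the oven example, and the field names are assumptions made for this example, not part of any specific data mesh tool. In practice a data owner would more likely fill in such a description through a graphical tool than by writing code.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FieldSpec:
    """One described field (hypothetical structure)."""
    name: str
    dtype: str      # e.g. "float", "string", "image"
    unit: str = ""  # optional engineering unit


@dataclass
class DataProduct:
    """A description a data owner might publish for a single node."""
    name: str
    owner: str
    fields: List[FieldSpec] = field(default_factory=list)

    def validate(self, record: Dict[str, object]) -> List[str]:
        # Return the names of described fields missing from a record.
        return [f.name for f in self.fields if f.name not in record]


# The domain expert describes only the data they know well.
oven = DataProduct(
    name="reflow_oven_3",
    owner="process-engineering",
    fields=[
        FieldSpec("zone_temp", "float", "degC"),
        FieldSpec("conveyor_speed", "float", "cm/min"),
        FieldSpec("board_image", "image"),  # unstructured data
    ],
)

missing = oven.validate({"zone_temp": 245.0, "board_image": "img_0001.png"})
```

Here missing lists the described fields that the sample record lacks, a small example of the quality checks a description makes possible without any knowledge of other pipelines.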
The remaining challenge is that the data mesh must provide user-friendly tools that allow non-programmers to easily define data, both structured (PLC, sensor, time-series, etc.) and unstructured (images, spreadsheets, text documents, and so on). A new class of drag-and-drop tools is emerging that allows not only this ingestion but also the ability to export data in any format required by its consumers.
These solutions empower data owners to streamline the process of creating ingestion pipelines, simplifying their tasks and allowing them to unlock the full potential of the data mesh infrastructure.
For the data mesh to be a practical solution, distributing data must not unduly increase the complexity for the consumers of the data, the clients. Client access should be abstracted from the data locations, and queries should be focused on what data is required, not how to access it; in other words, data virtualization. Approaches to data virtualization, commonly used with SQL databases and data warehouses, include a REST API, MQTT, and many other IoT protocols. While these all have value, and should be supported by a data mesh, the primary limitation of these approaches is that access is largely defined by the creator of the interface, not the user of the data. The data set available is rigid, and the client may receive more data than required (over-fetching) or need to make multiple calls to access and combine the required data (under-fetching).
Additionally, the data virtualization layer should be mostly invisible during the data ingestion process. Data owners should not be required to possess expertise in any virtualization technology; the data mesh software should make the information available in the necessary format, making the process seamless for the data owner.
For conventional manufacturing data (structured data), OPC UA, with support from small systems to the cloud, can be a valuable virtualization layer, but it is not well suited for unstructured data.
In 2012, to address the limitations and inefficiencies of their existing REST APIs, Facebook (now Meta) developed GraphQL. GraphQL was open-sourced in 2015 and has since been widely adopted by many organizations as an approach to developing APIs. GraphQL is a query language and runtime for building and executing client-server queries.
With GraphQL, the client makes a request, a query, specifying the fields it wants to retrieve. The server responds with the requested data. This allows the client to retrieve exactly the data needed, resolving the problem of under- and over-fetching. GraphQL supports data of all types, whether it be structured or unstructured.
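The field-selection idea can be illustrated without a GraphQL server. The Python sketch below is not GraphQL itself and uses no GraphQL library; the select function, the machine record, and its field names are hypothetical. It mimics the behavior described above: the client states which fields it wants, and the response contains those fields and nothing else.

```python
from typing import Any, Dict

# A selection is a nested dict: True means "return this field",
# a nested dict means "descend and select within it".
Selection = Dict[str, Any]


def select(data: Dict[str, Any], selection: Selection) -> Dict[str, Any]:
    """Return only the requested fields from a nested record."""
    result: Dict[str, Any] = {}
    for key, sub in selection.items():
        value = data[key]
        if isinstance(sub, dict):
            if isinstance(value, list):
                result[key] = [select(item, sub) for item in value]
            else:
                result[key] = select(value, sub)
        else:
            result[key] = value
    return result


# What the server holds (illustrative data only).
machine = {
    "id": "press-1",
    "vendor": "ExampleCo",
    "firmware": "2.4.1",
    "sensors": [
        {"name": "temp", "value": 71.5, "unit": "degC"},
        {"name": "pressure", "value": 3.2, "unit": "bar"},
    ],
}

# Roughly analogous to a GraphQL query such as
#   { id sensors { name value } }
# against a hypothetical schema.
wanted = {"id": True, "sensors": {"name": True, "value": True}}
response = select(machine, wanted)
```

A fixed REST endpoint would typically return the whole machine record (over-fetching) or require a second call for the sensors (under-fetching); here one request returns exactly the selected fields.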
GraphQL allows multiple services to be combined into a single, unified system. This “federation” is an essential concept for a scalable data mesh, providing the ability to independently create data sources while still delivering a consistent and unified API for clients to consume.
Distributing and federating the data in this fashion has the potential to improve performance by allowing the execution of complex queries to span multiple data sources and enabling each service to resolve its part of the query, minimizing unnecessary data transfer, and leading to faster response times and reduced bandwidth usage.
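The sketch below shows the federation idea in miniature, again in plain Python rather than with a real GraphQL federation gateway. The two services, the FIELD_OWNERS table, and the federated_query function are assumptions made for illustration. Each requested field is routed to the service that owns it, each service returns only its part, and services that own none of the requested fields are never contacted.

```python
from typing import Callable, Dict, List

Service = Callable[[str, List[str]], Dict[str, object]]

calls: List[str] = []  # records which services were contacted


def plc_service(machine_id: str, fields: List[str]) -> Dict[str, object]:
    # Hypothetical edge service owning live process values.
    calls.append("plc")
    data = {"temp_c": 71.5, "pressure_bar": 3.2}
    return {f: data[f] for f in fields}


def mes_service(machine_id: str, fields: List[str]) -> Dict[str, object]:
    # Hypothetical service in front of an existing MES.
    calls.append("mes")
    data = {"order": "A-100", "operator": "shift-2"}
    return {f: data[f] for f in fields}


# Which service owns which field (an assumed, static routing table).
FIELD_OWNERS: Dict[str, Service] = {
    "temp_c": plc_service,
    "pressure_bar": plc_service,
    "order": mes_service,
    "operator": mes_service,
}


def federated_query(machine_id: str, fields: List[str]) -> Dict[str, object]:
    # Group the requested fields by owning service so that each
    # service is called at most once and asked only for its part.
    by_service: Dict[Service, List[str]] = {}
    for name in fields:
        by_service.setdefault(FIELD_OWNERS[name], []).append(name)

    result: Dict[str, object] = {"machine": machine_id}
    for service, wanted in by_service.items():
        result.update(service(machine_id, wanted))
    return result


combined = federated_query("press-1", ["temp_c", "order"])
```

The client sees one combined answer, while each data owner can change or replace their own service independently as long as the fields it owns stay the same.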
Data mesh: empowering manufacturing
Implementing a data mesh in manufacturing holds the potential to significantly enhance operational efficiency and innovation. Utilizing a decentralized, domain-driven architecture, the data mesh facilitates seamless data discovery, encourages collaboration, and enables organizations to make real-time data-driven decisions to fully realize the potential of their digital manufacturing initiatives.
A primary advantage of the data mesh approach is that it empowers data owners to manage and provide the data themselves. This helps break down data silos, promote a culture of data-driven decision-making, and develop the notion of data as a product. As a result, organizations can foster innovation, increase agility, and sustain a competitive edge in an industry that is becoming more complex and fast-paced.
Images courtesy of ErgoTech
This feature comes from the ebook AUTOMATION 2023 Volume 3: IIoT & Industry 4.0.