
Metadata-driven data pipelines are a game changer for data processing in companies. Instead of manually revising each step every time a data source changes, these pipelines use metadata (data about data) to dynamically update processes. This can save time and reduce errors as the pipelines configure their operations themselves based on the current metadata. In short, metadata-driven pipelines increase the efficiency and flexibility of data processing and allow teams to focus on more important tasks than manual pipeline maintenance.

Much like classic data pipelines themselves, however, the metadata can become an error-prone, repetitive bottleneck in the upkeep and further development of a pipeline framework. In this blog post, I use practical examples to show how the Jsonnet template language makes metadata easier to maintain.

The cornerstones of classic metadata-driven pipeline frameworks

The architecture of classic metadata-driven pipeline frameworks centres on two main components: the ETL tool and the metadata repository.

The ETL tool - ETL stands for Extract, Transform, Load - is the workhorse of the framework and is responsible for carrying out the data transformations. It extracts data from various sources, transforms this data according to defined rules and finally loads it into target environments such as databases or data warehouses.

At the same time, the metadata repository plays an important role. It serves as a central repository for metadata that controls the process logic of the ETL tool. This metadata includes information about data sources, target structures, transformation rules and sequences as well as the mapping of data fields. The ETL tool interacts with this repository, reads the required metadata and can also update it after a successful run to reflect the status of data processing.
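What exactly these metadata records look like varies from framework to framework. Purely as an illustration (all field names here are invented, not a standard), a single entry combining source, target and field mapping could look like this:

	
	{
	    // Where the data comes from
	    source: { system: 'sqlserver', table: 'dbo.customer' },
	    // Where the data should end up
	    target: { system: 'warehouse', table: 'staging.customer' },
	    // How source fields map to target fields
	    mapping: [
	        { from: 'CustID', to: 'customer_id' },
	        { from: 'Name', to: 'customer_name' },
	    ],
	    // Controls the processing sequence in the ETL tool
	    loadOrder: 1,
	}
	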

Together, these two components form the centrepiece of any metadata-driven pipeline framework, enabling structured and efficient data processing and ensuring that data movement remains traceable and maintainable.

Common choices for metadata repositories are file-based formats such as JSON and YAML or tabular formats such as relational database tables.

You can find out more about Metadata Driven Pipelines for Microsoft Fabric in the Microsoft Community Hub. A metadata-based pipeline framework for Microsoft Fabric is presented there.

The challenges of manual metadata maintenance and the solution provided by Jsonnet

Although metadata can simplify loading processes, maintaining it can prove costly. Similar metadata often has to be repeated in different forms and in different places, which drives up maintenance effort and invites inconsistencies. Manually updating metadata every time the data structure or processing method changes is not only time-consuming but also prone to human error.

Enter Jsonnet

Jsonnet offers an efficient solution here. Jsonnet is an open-source template language that was specially developed for generating JSON data and is particularly suitable for automating configuration files. Jsonnet enables the programmatic definition of metadata and the creation of reusable templates.

You can find more information about the Jsonnet configuration language on the official Jsonnet website (jsonnet.org).

With Jsonnet, complex metadata structures can be defined through simple and clear abstractions, avoiding repetitive elements through functions and variables. If a change is required, it can be made in one central place, and Jsonnet propagates it to every instance of this metadata throughout the project the next time the files are generated. This not only eliminates redundancies but also saves valuable time and minimises the potential for errors.

Automating metadata maintenance with Jsonnet helps data teams focus more on optimising data processing and analysis instead of getting lost in the depths of manual metadata maintenance.

Let's look at an example

To illustrate this, we will create a simple example. In a JSON file, we define a list of JSON objects that represent database tables on an SQL server, which we consider to be data sources for a data pipeline. Each database table corresponds to a JSON object with the following attributes: server, database, schema, table name and tags. Tags can be used, for example, to start the loading process for a subset of SQL tables in the ETL tool.

In our example, we define three tables: customer, purchase and vendor.

In Jsonnet, the configuration file could look like this:

	
	// Table definitions
	local tables = ['customer', 'purchase', 'vendor'];
	// Data source definition
	local server = 'sql-server.company.com';
	local db = 'database';
	local schema = 'dbo';
	// Create a JSON array of table metadata by iterating over the tables
	[
	    {
	        server: server,
	        db: db,
	        schema: schema,
	        table: t,
	        tags: [
	            'server:' + server,
	            'db:' + server + '-' + db,
	            'schema:' + server + '-' + db + '-' + schema,
	            'table:' + server + '-' + db + '-' + schema + '-' + t,
	        ],
	    } for t in tables
	]
	

The tags are automatically generated from the table attributes as concatenated strings. The JSON array of sources is generated using a list comprehension notation similar to that of Python. Jsonnet allows inline comments to improve readability.
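The repeated concatenation in the tags can itself be factored out. As a small optional refactoring (not part of the original example), a user-defined local function together with the standard library function std.join produces exactly the same output:

	
	local tables = ['customer', 'purchase', 'vendor'];
	local server = 'sql-server.company.com';
	local db = 'database';
	local schema = 'dbo';
	
	// Helper: build a tag from a label and the name parts it covers
	local tag(label, parts) = label + ':' + std.join('-', parts);
	
	[
	    {
	        server: server,
	        db: db,
	        schema: schema,
	        table: t,
	        tags: [
	            tag('server', [server]),
	            tag('db', [server, db]),
	            tag('schema', [server, db, schema]),
	            tag('table', [server, db, schema, t]),
	        ],
	    } for t in tables
	]
	

If the tag format ever changes, only the tag function needs to be touched.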

After installing Jsonnet (on Ubuntu in Docker: `apt install jsonnet`), we "compile" the Jsonnet file to JSON with `jsonnet sql_sources.jsonnet > sql_sources.json`.

The generated JSON file looks like this:

	
	[
	   {
	      "db": "database",
	      "schema": "dbo",
	      "server": "sql-server.company.com",
	      "table": "customer",
	      "tags": [
	         "server:sql-server.company.com",
	         "db:sql-server.company.com-database",
	         "schema:sql-server.company.com-database-dbo",
	         "table:sql-server.company.com-database-dbo-customer"
	      ]
	   },
	   {
	      "db": "database",
	      "schema": "dbo",
	      "server": "sql-server.company.com",
	      "table": "purchase",
	      "tags": [
	         "server:sql-server.company.com",
	         "db:sql-server.company.com-database",
	         "schema:sql-server.company.com-database-dbo",
	         "table:sql-server.company.com-database-dbo-purchase"
	      ]
	   },
	   {
	      "db": "database",
	      "schema": "dbo",
	      "server": "sql-server.company.com",
	      "table": "vendor",
	      "tags": [
	         "server:sql-server.company.com",
	         "db:sql-server.company.com-database",
	         "schema:sql-server.company.com-database-dbo",
	         "table:sql-server.company.com-database-dbo-vendor"
	      ]
	   }
	]
	

The features used in this example represent only a fraction of the Jsonnet language. Other options include conditional logic, templates and user-defined functions.
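As a brief taste of the conditional logic, the following sketch (the table list and its isLarge flag are invented for illustration) chooses a load type per table with an if expression:

	
	// Hypothetical table list with a size flag
	local tables = [
	    { name: 'customer', isLarge: false },
	    { name: 'purchase', isLarge: true },
	];
	
	[
	    {
	        table: t.name,
	        // Conditional logic: large tables are loaded incrementally
	        loadType: if t.isLarge then 'incremental' else 'full',
	    } for t in tables
	]
	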

Jsonnet in a metadata deployment workflow

The following steps describe an example deployment workflow on Azure that uses Jsonnet to generate the metadata:

  1. A user edits the .jsonnet metadata templates (for example, adding new table definitions or creating a new version).
  2. The change is pushed to the metadata Git repository.
  3. An Azure Pipeline
    a. generates .json files from the .jsonnet templates and
    b. loads the generated .json files into an Azure Blob Storage.
  4. During the next trigger run, the Data Factory accesses the new version of the metadata in the Azure Blob Storage.
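How the version from step 1 finds its way into the generated files depends on the setup. One possibility (an assumption on my part, not part of the workflow above) is to let the Azure Pipeline pass a version string into the templates via Jsonnet's external variables, for example with `jsonnet --ext-str version=$(Build.BuildNumber) metadata.jsonnet`:

	
	// The version is supplied at build time via --ext-str
	local version = std.extVar('version');
	
	{
	    metadataVersion: version,
	    // Reuse the table definitions from the earlier example
	    sources: import 'sql_sources.jsonnet',
	}
	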

Conclusion

Metadata-driven data pipelines undoubtedly have the potential to revolutionise the way companies process data. By using metadata, these pipelines update their processes dynamically, eliminating the need for manual adjustments every time a data source changes. Maintaining this metadata, however, presents a challenge that can slow down the maintenance and evolution of pipeline frameworks. As I have shown, Jsonnet provides an efficient solution for exactly this metadata maintenance.

You can find more exciting topics from the world of adesso in our previous blog posts.


Author Dr. Stefan Klempnauer

Stefan Klempnauer is a data and analytics consultant with a strong focus on data platforms, cloud infrastructure and AI. At adesso he designs and implements data platform solutions in customer projects.

