What is data orchestration? Why do we need it? And how should your company go about it? Let's explore the best tools and practices for data orchestration.
In one of our previous data analytics series posts, we talked about data pipelines and how to transform and prepare our data to be consumed by our final customers. As stated in that article, companies usually have to deal with a considerable amount of diverse data sources, which may also have different granularity, formats, units, users, ingestion and update rates, and security and compliance needs. Because of this, many different data processes have to be created to handle all of them.
Why Do We Need Data Orchestration?
The problem is that these processes are usually interdependent, so they have to be executed in a given sequence. If one step in the sequence fails, we have to make sure that the downstream steps are not executed, or that an exception flow runs instead.
Data orchestration tools allow us to do so in a centralized place by:
- Managing the sequence in which each job is executed;
- Scheduling when to run or how to trigger them; and
- Managing the data sources and sinks.
In other words, data orchestration tools ensure that each step is executed in the right order, at the right time, and in the right place.
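To make the idea of sequencing concrete, here is a minimal sketch in plain Python. It is not how any particular orchestration tool is implemented. The `run_pipeline` helper and the `extract`, `transform`, and `load` task names are hypothetical and exist only for illustration. The sketch runs tasks in dependency order and skips every step downstream of a failure.

```python
from graphlib import TopologicalSorter


def run_pipeline(tasks, dependencies):
    """Run tasks in dependency order; skip anything downstream of a failure.

    tasks: dict mapping task name -> zero-argument callable
    dependencies: dict mapping task name -> set of task names it depends on
    Returns a dict mapping task name -> "success", "failed", or "skipped".
    """
    status = {}
    # static_order() yields each task only after all of its upstream tasks.
    for name in TopologicalSorter(dependencies).static_order():
        upstream = dependencies.get(name, set())
        if any(status[dep] != "success" for dep in upstream):
            status[name] = "skipped"
            continue
        try:
            tasks[name]()
            status[name] = "success"
        except Exception as exc:
            print(f"{name} failed: {exc}")
            status[name] = "failed"
    return status


# Hypothetical three-step pipeline in which the middle step fails.
def extract():
    print("extracting")


def transform():
    raise ValueError("bad schema")


def load():
    print("loading")


tasks = {"extract": extract, "transform": transform, "load": load}
dependencies = {"extract": set(), "transform": {"extract"}, "load": {"transform"}}
result = run_pipeline(tasks, dependencies)
print(result)
```

Here `load` is never called because `transform` failed upstream of it. Real orchestration tools add scheduling, triggers, and connections to sources and sinks on top of this basic ordering logic.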
These tools also make it easier to handle different data sources or data processes, providing operators or functions that allow us to easily connect and communicate with external tools, such as storage systems, databases, data warehouses, data processing tools, clusters, APIs, and so on.
The monitoring capabilities of these tools also alert us when there is any interruption or failure in our pipelines so that we can solve the issue in a timely manner. In addition, the ability to configure retries and exception handling lets us deal with many failures automatically, without human intervention.
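As an illustration of automated retries, the following sketch wraps a task in a retry loop with exponential backoff. The `run_with_retries` helper, its parameters, and the `flaky_extract` task are assumptions made for this example, not the API of any specific orchestration tool.

```python
import time


def run_with_retries(task, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call task(); on failure, wait and retry, doubling the delay each time."""
    for attempt in range(retries + 1):
        try:
            return task()
        except Exception as exc:
            if attempt == retries:
                # Out of retries: re-raise so monitoring can alert someone.
                raise
            delay = base_delay * 2 ** attempt
            print(f"attempt {attempt + 1} failed ({exc}); retrying in {delay}s")
            sleep(delay)


# Hypothetical task that fails twice before succeeding.
calls = {"count": 0}


def flaky_extract():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("source unavailable")
    return "rows loaded"


outcome = run_with_retries(flaky_extract, retries=3, base_delay=0.01)
print(outcome, "after", calls["count"], "attempts")
```

Only when the retries are exhausted does the error surface, which is the point at which an alert to a human makes sense.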
Data Orchestration Phases
In general, data orchestration consists of three phases:
1. Organization (also known as systemization): In this phase we try to understand the incoming data by analyzing its types, origins, formats, standards, and schemas, after which we ingest the data into the dataflow.
2. Transformation (also known as unification): Because data is ingested from many different sources, it may arrive in many different formats. For example, dates and numbers may be stored in different ways or using different conventions. In order to make the data easily accessible, we have to transform it so that each piece uses the same format for the same data types.
During the transformation phase, we also start modeling our data. Relational databases and APIs usually expose normalized data (that is, data split across multiple related tables or entities to avoid redundancy), so we often need to denormalize it when ingesting data from these sources. Normalized data makes it easier to store, update, transact, and retrieve individual records in a relational database, but in an analytics use case the joins it requires can hinder read performance or make things confusing for our end users. Denormalized data is usually better suited to analytics workloads and to end users, since it brings together all the information that is pertinent to their needs.
3. Activation: After our data is ingested, transformed, standardized, and unified, we want to make the data available for the other systems that will consume it. This step involves storing the data somewhere it can be read by the final application, which may be a storage system, data lake, data warehouse, queuing application, pub/sub application, or even an API or another data pipeline.
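The transformation phase described above can be sketched in a few lines of Python. The records, field names, and helper functions below are hypothetical, and the number parsing assumes that a comma, when present, is the decimal separator. The sketch first unifies dates and amounts coming from two sources with different conventions, then denormalizes each order by merging in its customer.

```python
from datetime import datetime

# Hypothetical records from two sources that use different conventions.
customers = {
    1: {"name": "Ada", "country": "BR"},
    2: {"name": "Grace", "country": "US"},
}
orders = [
    {"customer_id": 1, "date": "2023-01-31", "amount": "1234.50"},
    {"customer_id": 2, "date": "31/01/2023", "amount": "1.234,50"},
]

# Date formats we assume the sources may use.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def unify_date(text):
    """Return the date as ISO 8601 (YYYY-MM-DD), trying each known format."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")


def unify_amount(text):
    """Parse '1234.50' or '1.234,50' into a float.

    Assumes a comma, when present, is the decimal separator.
    """
    if "," in text:
        text = text.replace(".", "").replace(",", ".")
    return float(text)


def denormalize(order):
    """Flatten an order and its customer into one wide record."""
    customer = customers[order["customer_id"]]
    return {
        "order_date": unify_date(order["date"]),
        "amount": unify_amount(order["amount"]),
        "customer_name": customer["name"],
        "customer_country": customer["country"],
    }


unified = [denormalize(order) for order in orders]
for row in unified:
    print(row)
```

After this step both orders share the same date and number formats and carry the customer attributes inline, so a consumer in the activation phase can read them without joining back to the source tables.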
Data Orchestration Tools in Google Cloud Platform (GCP)
Even though you can deploy your favorite data orchestration tool on virtual machine instances in Compute Engine or on containers in Kubernetes Engine, GCP offers some managed or serverless data orchestration tools that you can use with minimal infrastructure and configuration:
Workflows: A serverless data orchestration tool that uses YAML files to create simple REST API-based data pipelines.
Example of a workflow. Image courtesy of Guillaume Laforge.
Cloud Composer: A fully managed Airflow service on GCP that provides a highly available, scalable, and cost-efficient Airflow cluster. It also offers many integrations with other GCP services, making it easier to handle different workloads based on the GCP stack.
Example of an Airflow DAG. Image courtesy of Airflow.
Workflows is usually recommended for simpler workloads. The use of a declarative YAML file and the limitation of only being able to integrate with REST APIs make it harder to use and customize for complex workloads, but it can still be very useful for simpler GCP-based workloads.
Cloud Composer, on the other hand, is recommended for more complex workloads. It is based on Airflow, which has a rich ecosystem and many built-in and community-provided tools and operators that can handle multiple scenarios. The downsides are that building the orchestration is not as straightforward as with Workflows: it requires knowledge of Python, of the Airflow libraries, and of how the tool works, given that misusing it may lead to performance issues. It also needs a cluster in order to run, which can be more expensive and harder to maintain.
For more information regarding when to use each tool, please refer to GCP's orchestration documentation.
In summary, GCP provides tools to help you implement your data orchestration with minimal infrastructure and operations overhead. In addition to its competitive cost, GCP also provides a complete ecosystem to support the most diverse scenarios.
Want to know more about how to make the most of your data? Check out the other blogs in our data analytics series.
Frederico Caram is a Data Architect at Avenue Code. He enjoys reading historical fantasy novels, ballroom dancing, and playing video games.