In our previous posts, we talked about how Data Warehouses and Data Lakes can help us manage our data in a more secure, cost-effective, reliable, and scalable way. But there is another critical concept we need to understand in order to work with these architectures: Data Pipelines.
Why Do We Need Data Pipelines?
Companies usually have to deal with a considerable amount of diverse data sources, which may also have different granularity, formats, units, users, ingestion and update rates, and security and compliance needs. Some sources may even overlap but have diverging data regarding the same information.
In order to address these issues, we have to do some operations on the data, including: ingest, clean, normalize, aggregate, and transform. Usually, these operations are inter-dependent and have to be executed in a given order. The combination of one or more of these operations is what we call a data pipeline.
What is a Data Pipeline?
A data pipeline has three main elements:
1. The Source: The data input, which may be a flat file, a database, a messaging queue, a pub/sub, an API, or any other format that may be needed.
2. The Processing Steps: Each step receives an input and performs an operation that returns an output. The output from step one is then used as an input for the next step, and so on until the pipeline is completed.
3. The Sink (or Destination): The data output, as in the source case, can be any system that will receive the data; it may even be equal to the source.
Regardless of whether data comes from batch or from streaming sources, a data pipeline should produce the same output. It divides the data into smaller chunks and processes it in parallel, enabling horizontal scalability and higher reliability. The last destination does not have to be a data warehouse or a data lake; it can also be any other application, such as a visualization tool, a REST API, or a device.
It is important to note that the data pipeline must also monitor each step to make sure it is running properly and that the data is consistent; if an error occurs, it must also take an action to correct it, usually by backtracking to the previous state.
Data Pipeline Architectures
The architecture style of a data pipeline depends on the goal you wish to accomplish, but there are some generic architectures and also best practices for common scenarios that we'll outline below:
One of the most common data pipeline architectures is built to handle batch data. This kind of scenario is straightforward: the source is usually a static file (a database or a data warehouse), and it usually also sinks to one of these formats. The time windows may vary from near-real-time (e. g. 15 min) to monthly workloads.
As companies evolve toward a more data-driven approach, it also becomes more important to have the data as up-to-date as possible. In streaming architectures, the data source is usually a messaging, queueing, or pub/sub tool to make sure that the data can be continuously ingested. But there are also some other details that you have to be aware of when dealing with this kind of architecture, such as how to handle late arriving data, duplicate data, missing data, failures, and delays. Usually a streaming architecture uses a windowing strategy combined with watermarks and IDs to deal with these issues:
Windows can be:
Fixed/tumbling: uses equally-sized intervals in non-overlapping chunks where each event can only belong to one window (similar to traditional batch processing).
Sliding: windows again have the same length, but the time windows overlap, so each event belongs to a number of windows ([window interval]/[window step]).
Session: windows have various sizes and intervals and are defined based on session identifiers present in the data.
Lambda architecture is a popular approach that is used to handle both batch and stream processing in the same pipeline. One important detail about this architecture is that it encourages storing data in a raw format so that inconsistencies in the real-time data can be handled afterward in the batch processing.
One of the downsides of Lambda architecture is that we need to have different pipelines for streaming and for batch processing. Delta architecture overcomes this by using the same pipeline for both. In order to accomplish this, the tools that accept delta pipelines have to ensure that the data is consistent across all the nodes before processing it. The diagram for this architecture is similar to the streaming diagram presented above.
Data Pipelines in GCP (Google Cloud Platform)
GCP provides many tools to work with data pipelines and can handle all of the previously explained architectures. Some of the most common tools used for this are:
Dataflow: Dataflow is a serverless Apache Beam service. One of the main advantages of using Dataflow is its ability to execute both batch and stream pipelines.
Dataproc: Dataproc is a managed Apache Hadoop cluster. It can run most of the Apache Hadoop ecosystem tools in an easy to manage and cost-efficient way, it can use preemptible instances to run non-critical workloads to save money, and it can be configured to scale up and down.
Dataform: Dataform is a data modeling tool that enables us to improve the way data is handled inside BigQuery, making it easier to share the code, keep track of changes, and ensure data quality in each step.
BigQuery Engine: BigQuery is also a strong option to run some of our data pipelines. Its architecture allows us to ingest streaming and batch data and can handle petabytes of data in its distributed processing queries.
DataFusion: DataFusion is a managed implementation of the open source project CDAP; it enables us to create pipelines that run on top of Dataproc in a drag and drop way.
Dataprep: Dataprep is a serverless service that enables us to create data cleaning and enhancement jobs using drag and drop.
For example, one way to handle a complex event processing pipeline in GCP would be:
Image courtesy of Cloud Architecture Center.
Getting Started with Data Pipelines
In summary, GCP provides tools to achieve the scalability your business needs with a complete ecosystem to support the most used business scenarios, all at a competitive cost. And you won't have to worry about the underlying infrastructure and operations.
Want to know more about how to make the most of your data? Check out the other blogs in our data analytics series:
Frederico Caram is a Data Architect at Avenue Code. He enjoys reading historical fantasy novels, ballroom dancing, and playing video games.