More and more businesses are trying to leverage big data analytics to innovate and build new solutions, but raw, unstructured data doesn't yield useful insights on its own. The answer? Intelligent cloud solutions.
In an increasingly competitive market, companies are seeking every opportunity to gain access to business insights through data. To gain trustworthy insights, they need a solution to refine information from unorganized data stored in their relational and non-relational databases.
This is where intelligent cloud solutions like Azure Data Factory (ADF) and Databricks come into the picture. These solutions allow businesses to create workflows to ingest data from multiple on-premises and/or cloud data sources, transform and shape the data, and gain insights to make business decisions.
To explain this concept, we'll begin by examining how ADF and Databricks work and then present a visual of a data pipeline built on these tools.
What is Azure Data Factory?
ADF is a fully managed, highly scalable, cost-effective serverless data integration service offered by Microsoft Azure.
ADF comes with pre-built connectors, allowing us to construct ETL (Extract, Transform, Load), ELT (Extract, Load, Transform), and data integration pipelines.
ADF is a code-free ETL tool that enables us to extract data from various sources and transform data for analytical use cases. It can be linked to a data store, a database, or a data warehouse. We can define datasets and build pipelines to connect data to our destination.
ADF can have one or more pipelines. A pipeline is a logical grouping of activities that together perform a task. In other words, it is a grouping of different steps that perform different actions to process our data, as depicted below:
Image courtesy of Microsoft's Azure Data Factory documentation.
A dataset is a named view of data that simply points to or references the data you want to use in your activities as inputs and outputs. Datasets identify data within different data stores, such as tables, files, folders, and documents. For example, an Azure Blob dataset specifies the blob container and folder in Blob Storage from which the activity should read the data.
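To make this concrete, here is a minimal sketch of the JSON shape an Azure Blob dataset definition takes in ADF, built as a Python dict. The dataset, linked service, container, and file names here are hypothetical placeholders, not values from any real pipeline.

```python
import json

# Hypothetical Azure Blob dataset definition in the JSON shape ADF uses.
# "SalesBlobDataset", "AzureStorageLinkedService", and the paths below
# are invented for illustration.
blob_dataset = {
    "name": "SalesBlobDataset",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": {
            "referenceName": "AzureStorageLinkedService",
            "type": "LinkedServiceReference",
        },
        "typeProperties": {
            # The blob container and folder the activity reads from.
            "folderPath": "sales-container/daily/",
            "fileName": "sales.csv",
            "format": {"type": "TextFormat"},
        },
    },
}

print(json.dumps(blob_dataset, indent=2))
```

Activities in a pipeline then reference this dataset by name as their input or output, rather than repeating the connection details inline.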
What is Azure Databricks?
Databricks is a SaaS-based data engineering tool optimized for the Azure platform. It offers SQL, data science, and machine learning environments to support the development of complex applications.
Using Databricks SQL, analysts can run queries on data lakes, create multiple visualizations to explore query results, and build and share dashboards. The outcomes of machine learning solutions can be visualized in Power BI using Databricks as a connector to derive valuable insights.
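Databricks SQL itself runs against a workspace, so it can't be demonstrated locally, but the kind of aggregation query an analyst would run translates directly to standard SQL. As a stand-in, the sketch below executes an equivalent GROUP BY query against an in-memory SQLite database; the table and column names are invented for illustration.

```python
import sqlite3

# Stand-in for a data-lake table: a hypothetical "orders" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("East", 120.0), ("East", 80.0), ("West", 200.0)],
)

# The same shape of aggregation an analyst might run in Databricks SQL
# before charting the result on a dashboard.
rows = conn.execute(
    "SELECT region, SUM(amount) AS total FROM orders "
    "GROUP BY region ORDER BY region"
).fetchall()

print(rows)  # → [('East', 200.0), ('West', 200.0)]
```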
Azure Databricks supports Python, Scala, R, Java, and SQL, as well as data science frameworks and libraries like TensorFlow, PyTorch, and scikit-learn.
This sample of a data factory pipeline shows data being ingested into Azure Blob Storage using Data Factory, then processed in Azure Databricks, and finally loaded into Azure SQL Data Warehouse.
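The ingest-then-transform flow described above can be sketched as an ADF pipeline with two chained activities: a Copy activity that lands data in Blob Storage and a Databricks Notebook activity that processes it. The Python dict below mirrors the JSON shape ADF uses; all names (pipeline, datasets, notebook path) are hypothetical.

```python
# Hypothetical ADF pipeline: copy data to Blob Storage, then run a
# Databricks notebook on the landed data.
pipeline = {
    "name": "IngestAndTransformPipeline",
    "properties": {
        "activities": [
            {
                "name": "CopyToBlob",
                "type": "Copy",
                "inputs": [{"referenceName": "SourceDataset",
                            "type": "DatasetReference"}],
                "outputs": [{"referenceName": "StagingBlobDataset",
                             "type": "DatasetReference"}],
            },
            {
                "name": "TransformInDatabricks",
                "type": "DatabricksNotebook",
                # Runs only after the copy activity succeeds.
                "dependsOn": [{"activity": "CopyToBlob",
                               "dependencyConditions": ["Succeeded"]}],
                "typeProperties": {
                    "notebookPath": "/Shared/transform_sales"
                },
            },
        ]
    },
}

# Execution order follows the dependency chain.
names = [a["name"] for a in pipeline["properties"]["activities"]]
print(names)  # → ['CopyToBlob', 'TransformInDatabricks']
```

The `dependsOn` entry is what chains the two steps: ADF triggers the notebook only once the copy reports `Succeeded`, so Databricks never reads a half-written staging file.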
Image courtesy of Gaurav Malhotra's "Ingest, prepare, and transform using Azure Databricks and Data Factory" on the Microsoft Azure blog.
Azure Databricks vs. Azure Synapse: A Quick Comparison
So far we've talked only about Azure Databricks as our data engineering tool, but it is also possible to use Azure Synapse. Both Azure Databricks and Azure Synapse are platforms for processing large amounts of data. The right option depends on the needs of the organization. Before we conclude this article, let's take a moment to compare and assess each option.
Azure Databricks is an Apache Spark-based analytics platform optimized for the Microsoft Azure cloud services platform. Designed in collaboration with the founders of Apache Spark, Databricks is integrated with Azure to provide one-click setup, streamlined workflows, and an interactive workspace that enables collaboration between data scientists, data engineers, and business analysts.
Azure Synapse (formerly Azure SQL Data Warehouse) is a limitless analytics service that brings together enterprise data warehousing and big data analytics. It gives you the freedom to query data on your terms—using either serverless or dedicated resources—at scale. It is a full data warehousing solution that supports a relational data model, stored procedures, and more. It also provides all the SQL features any BI professional might use, including a full standard T-SQL experience, bringing together the best SQL technologies.
Check out the comparison chart below for more information, and don't hesitate to comment on this blog with thoughts or questions!
Machine Learning Development
- Azure Databricks: ML-optimized Databricks runtime; GPU-enabled clusters; a managed and hosted version of MLflow is provided in Databricks.
- Azure Synapse: Built-in support for Azure ML; no multi-user collaboration on notebooks.

Ad-Hoc Data Lake Discovery
- Azure Databricks: The data lake can be mounted and queried; supports the use of Python, Scala, and R to read the data.
- Azure Synapse: Uses a SQL on-demand pool or Spark to query data from your data lake.

Streaming Data
- Azure Databricks: Spark Structured Streaming is proven to work seamlessly and gains extra features from the Databricks Runtime (e.g., Z-order clustering when using Delta, join optimizations).
- Azure Synapse: Since it's a data warehouse, real-time data can be ingested into Synapse using Stream Analytics, but this currently doesn't support Delta.

SQL Analysis and Data Warehousing
- Azure Databricks: A Delta Lake-based data warehouse is possible, but without the full breadth of SQL and data warehousing capabilities of a traditional data warehouse.
- Azure Synapse: Full data warehousing allowing a full relational data model, stored procedures, etc.
Abhishek Ramesh is an Intermediate Developer at Avenue Code who is interested in Java and cloud technology.