The Complete Guide to Modernizing Your Data Platform with GCP

Written by Frederico Caram | 5/18/22 5:00 PM

Did you know that only 32% of companies report realizing tangible and measurable value from their data, and only 27% say data and analytics projects produce insights and recommendations that are highly actionable?

What’s Going Wrong?

Given the fact that companies across industries are laser-focused on becoming data-driven, those numbers are shockingly low. In an effort to extract business value, enormous financial investments are being made in data-related technology and expertise. But most companies are failing. Why?

As Google Cloud’s partner specializing in data analytics, we’ve put together a nine-part blog series to help you modernize your data practices to gain actionable insights and transform your company to become truly data-driven.

In this post, we’ll give a high-level recap of each article, starting with the six pillars of data modernization success.

I. The 6 Pillars of Data Modernization Success

(Read the full article here.)

1. VALUE OVER DATA

Instead of asking: “Which data do we have that can bring value to our business?” companies should ask, “Which core business processes can be improved if we have the right data?" First define the problems your company needs to solve, then identify which data you should be gathering to solve those problems.

2. BUSINESS ALIGNMENT

Business strategy should drive technical decisions and not the other way around. So, when starting a new data project, we should always ask: “How can our data bring us closer to achieving our business objectives?”

3. ITERATIONS AND EVOLVABILITY

Business needs and data are constantly changing, and your data solutions should be too. Work in small iterations to test ideas quickly.

4. FROM SILOS TO INTEGRATION

Business areas often have their own data pools and processes, which may result in different data versions, values, and interpretations. Integrating those data sources into a data lake or data warehouse gives us a single source of truth, and it also allows each area to know which data exists to support them in improving their own analysis and decision making.

5. BUSINESS OWNERSHIP

One challenge with creating a data lake/data warehouse is that data ownership passes to the data team, which may not have enough business context to drive decision making. The best approach is to keep the ownership of the data under a business domain and include the data professional to support data access, discovery, exploration, and analysis.

6. CULTURE

Finally, using data as a business value driver requires a high-level culture change. It means asking the right questions, developing a solid process to answer those questions, and being open to exploration and experimentation. It also means failing and adapting fast.

II. 4 Strategies to Boost Sales with Data Mining

(Read the full article here.)

In the hands of skilled data analysts, your sales data can help your business grow to new levels. But the opposite is also true: misinterpreted data can be the bane of your business.

Therein lies the difference between converting a lead or losing it, and for any company trying to expand, every sale counts. So let’s look at four advanced sales analytics strategies to implement in your business.

DATA WAREHOUSE

A data warehouse can act as a unified repository for sales and marketing data, storing information in an organized manner to facilitate more effective sales strategies.

A properly built data warehouse also carries out processes like quality verification, standardization, and data integration, which in turn bring competitive advantages like confidence in data reliability and agility in decision making.

In terms of data warehousing solutions available on the market, Google Cloud Platform offers BigQuery, a serverless, highly scalable, and cost-effective multi-cloud data warehouse designed to promote business agility.

EXPLORATORY DATA ANALYSIS

Creating your sales analytics data warehouse paves the way for the application of statistical techniques that can answer business questions using data, saving time and effort while increasing decision-making maturity.

A variety of tools and technologies are available, but the important thing is to answer relevant questions to boost business results and to identify opportunities to initiate more complex and profitable projects.

PREDICTIVE ANALYSIS

Alongside machine learning techniques, data science applications enable your company to make very accurate predictions about what will happen in the future, training models based on past events.

This enables companies to answer questions like "How many clients will we acquire next month?" and "Which of my clients will cancel contracts next month (customer churn)?"

If increasing sales is your goal, the tools you need already exist in your database.

To help you accomplish this, Google Cloud Platform offers Vertex AI, where you can build, deploy, and scale ML models with pre-trained and custom tooling within a unified AI platform.

CLUSTERING ANALYSIS

How do you get the right message to the right customer at the right time? Customer segmentation. Using unsupervised machine learning techniques, you can understand and segment your audience with rich detail and confidence.

Clustering algorithms make it possible to group and understand your customers and leads through various features in your sales management software.

Clustering analysis enables you to group customers who exhibit the same behavior so that you can offer products they're mathematically more likely to buy.

HOW TO GET STARTED

Building a data warehouse/data lake is the first step of the journey.

To do this, your data analysts need to understand which data is available at your organization and ask detailed questions about the data volumetry, velocity, and veracity, as well as the current tools and resources your team uses. This information enables data professionals to design the right solution.

III. Modernizing Your Data Warehouse with BigQuery

(Read the full article here.)

DATA WAREHOUSES AND THE CLOUD

Data Warehouses (DWs) are critical for any company that wants to adopt a data-driven culture. The primary purpose of a data warehouse is to collect and store data from many different data sources and to make this data available for fast, reliable, secure, and easy retrieval, as well as subsequent analysis and insight.

With the rise of cloud computing, big cloud computer providers like Amazon, Azure, and Google (among others) also offer their own data warehouse solutions. These cloud providers make it easier to manage and horizontally scale data warehouses while also facilitating the integration of the data warehouse with the providers' other tools.

GCP ADVANTAGES

In comparison with its competitors, a big advantage that Google Cloud Platform (GCP) offers is its serverless data warehouse: BigQuery. With BigQuery, you don’t have to worry about managing, providing, or dimensioning any infrastructure; instead, you can focus on your data and on how you can use it to improve your company’s products, services, operations, and decision making.

Just like most modern Data Lakehouse tools, BigQuery separates storage from processing. This separation helps achieve better availability, scalability, and cost-efficiency.

GCP AND GOOGLE PRODUCT INTEGRATIONS

Another big advantage that GCP offers is easy integration with other GCP and Google products, which makes it the best choice for website analytics, enabling us to better understand our customer’s journey and behavior.

BI ENGINE

BigQuery also offers BI Engine, an engine that improves BigQuery integration with multiple data visualization tools, providing faster queries, simplified architecture, and smart tuning.

WHY CHOOSE BIGQUERY

In summary, BigQuery can help you achieve the data availability and scalability your business needs, without any worries about the underlying infrastructure or operations, all with a competitive cost and a complete ecosystem of support for the most common business scenarios.

IV. Data Lakes: The Key to Data Modernization

(Read the full article here.)

In today's environment, businesses face new challenges arising from the sheer volume of data, the increase in the generation and consumption of unstructured data, the need to have real or near real-time data, and the trend toward migrating data from on-premise servers to the cloud.

THE SWITCH TO CLOUD

Legacy data warehouses are unable to adequately handle unstructured data since they usually depend on fixed schemas to perform well, and that requires structured data formats. They also have limitations when it comes to scaling horizontally, making it more expensive to handle high volumes of data or high ingestion rates.

Ultimately, moving from on-premise servers to the cloud, alongside computational evolutions, made it much cheaper to store data than to process it. So it became easier to store multiple copies of transformed data than to process it all at once, resulting in new paradigms, such as moving from ETL (extract-transform-load) to ELT (extract-load-transform).

DATA LAKES AND DATA WAREHOUSES

To overcome some of these challenges, a new concept arose: Data Lakes. A Data Lake is a centralized place to store data, allowing us to ingest, store, transform, analyze, and model data in a secure, cost-effective, easily organized, and manageable way. Data Lakes aren’t supposed to replace Data Warehouses (check out our previous post on Data Warehouses), but they can integrate or incorporate Data Warehouses to make the best use of each solution.

It's important to note that a Data Lake is an architectural concept and not a tool. Even though it is very common for vendors to relate a Data Lake to a specific tool, Data Lakes usually utilize a variety of tools to load, store, transform, and expose data. This data concept is not only about moving and transforming your data through data pipelines, but it's also about doing it in a traceable way, managing the data lifecycle and its lineage, identifying sensitive information and making sure that the data only gets to people who are authorized to see it.

DATA LAKE ORGANIZATION

Data Lakes are usually organized into layers. As data moves from one layer to another through data pipelines, it gets cleaner, more informative, more trustworthy, better curated, and altogether more meaningful for the business.

WHY USE GCP FOR YOUR DATA LAKE

One easy and effective way to organize your Data Lake is through Google Cloud Platform (GCP). GCP provides tools to cover most of your Data Lake needs with easily scalable and manageable serverless components so that the team can focus on what brings value to the business instead of focusing on how the infrastructure will be managed. It provides tools for batch ingestion, stream/real-time ingestion, change data capture, landing zone and raw data, and data warehouse and data marts.

GCP Architecture provides the following reference as an example of what's possible:

Image courtesy of Google Cloud.

V. What You Need to Know About Data Pipelines

(Read the full article here.)

We’ve talked about how Data Warehouses and Data Lakes can help us manage our data in a more secure, cost-effective, reliable, and scalable way. But there is another critical concept we need to understand in order to work with these architectures: Data Pipelines.

WHY DO WE NEED DATA PIPELINES?

Companies usually have to deal with a considerable amount of diverse data sources, which may also have different granularity, formats, units, users, ingestion and update rates, and security and compliance needs. Some sources may even overlap but have diverging data regarding the same information.

To address these issues, we have to do some operations on the data, including: ingest, clean, normalize, aggregate, and transform. Usually, these operations are interdependent and have to be executed in a given order. The combination of one or more of these operations is what we call a data pipeline.

WHAT IS A DATA PIPELINE?

A data pipeline has three main elements:

The Source
The Processing Steps
The Sink (or Destination)

Regardless of whether data comes from batch or from streaming sources, a data pipeline should produce the same output. It divides the data into smaller chunks and processes it in parallel, enabling horizontal scalability and higher reliability. The last destination does not have to be a data warehouse or a data lake; it can also be any other application, such as a visualization tool, a REST API, or a device. Data pipelines must also monitor each step and correct errors along the way.

DATA PIPELINE ARCHITECTURES

The architecture style of a data pipeline depends on the goal you wish to accomplish, but there are some generic architectures. Read the full article for best practices on each architecture highlighted below.

BATCH

One of the most common data pipeline architectures is built to handle batch data: the source is usually a static file (a database or a data warehouse), and it usually also sinks to one of these formats:

STREAMING

As companies evolve toward a more data-driven approach, it becomes more important to have the data as up-to-date as possible. In streaming architectures, the data source is usually a messaging, queueing, or pub/sub tool to make sure that the data can be continuously ingested. But you may need to use a windowing strategy with watermarks and IDs to deal with issues like late arriving data, duplicate data, missing data, failures, and delays. Windows can be:

Fixed/tumbling:

Sliding:

Session:

LAMBDA ARCHITECTURE

Lambda architecture is a popular approach that is used to handle both batch and stream processing in the same pipeline. One important detail about this architecture is that it encourages storing data in a raw format so that inconsistencies in the real-time data can be handled afterward in the batch processing.

DELTA ARCHITECTURE

One of the downsides of Lambda architecture is that we need to have different pipelines for streaming and for batch processing. Delta architecture overcomes this by using the same pipeline for both. The diagram for this architecture is similar to the streaming diagram presented above.

DATA PIPELINES IN GCP

GCP provides many tools to work with data pipelines and can handle all of the previously explained architectures. Some of the most common tools used for this are: Dataflow, Dataproc, Dataform, BigQuery Engine, DataFusion, and Dataprep.

For example, one way to handle a complex event processing pipeline in GCP would be:

Image courtesy of Cloud Architecture Center.

VI. Data Orchestration in GCP

(Read the full article here.)

As stated above, companies usually have to deal with a considerable amount of diverse data sources, which may also have different granularity, formats, units, users, ingestion and update rates, and security and compliance needs. Because of this, many different data processes have to be created to handle each scenario.

WHY DO WE NEED DATA ORCHESTRATION?

The problem is that most of the time, different processes are interdependent, so that they have to be executed in a given sequence. So if one part of this sequence fails, we have to make sure that the other steps are not executed or that we run an exception flow.

Data orchestration tools allow us to do so in a centralized place by:

Managing the sequence in which each job is executed;
Scheduling when to run or how to trigger them; and
Managing the data sources and sinks.

In other words, data orchestration tools ensure that each step is executed in the right order, at the right time, and in the right place.

These tools also make it easier to handle different data sources or data processes, providing operators or functions that allow us to easily connect and communicate with external tools, such as storages, databases, data warehouses, data processing tools, clusters, APIs, and so on.

The monitoring capabilities of these tools also alert us when there is any interruption or failure in our pipelines so that we can solve the issue in a timely manner. And in fact, the ability to configure retries and exceptions makes it possible to easily handle failures in an automated way, without human interference.

DATA ORCHESTRATION PHASES

In general, data orchestration consists of three phases:

Organization (also known as systemization): In this phase we try to understand the incoming data by analyzing their types, origins, formats, standards, and schemas, after which we ingest the data into the dataflow.
Transformation (also known as unification): Given that data is ingested from many different sources, they may also have many different formats. In order to make the data easily accessible, we have to transform it so that each piece uses the same format for the same data types.
During this phase, we also start modeling our data. Relational databases and APIs contain normalized data (meaning data that has been split up and rearranged by category), so now we need to denormalize it when ingesting data from these sources.

Activation: After our data is ingested, transformed, standardized, and unified, we want to make the data available for the other systems that will consume it. This step involves storing the data somewhere where it can be read by the final application, which may be a storage, data lake, data warehouse, queuing application, pub/sub application, or even an API or another data pipeline.

DATA ORCHESTRATION TOOLS IN GCP

Even though you can use Virtual Machine instances in Compute Engine or Container in Kubernetes Engine to deploy your favorite data orchestration tool, GCP offers some managed or serverless data orchestration tools that you can use with minimal infrastructure and configuration.

Two great options are Workflows, which is usually recommended for simpler workloads, and Cloud Composer, which is recommended for more complex workloads. Using GCP will help you implement your data orchestration with minimal infrastructure and operations overhead.

VII. What Every Company Needs to Know about Data Governance and Security

(Read the full article here.)

As cloud migration growth rates increase, so do new concerns regarding information management. Three of the biggest concerns are related to data protection, compliance and regulations, and visibility and control. Let’s take a look at how to address each.

PROTECTING DATA (DIGITAL ASSETS)

This is the biggest concern of moving to cloud computing since it involves storing business data in a public cloud infrastructure, and in most cases, companies deploy the enterprise system together with the business data. With the rise of security threats and breaches, data security is a sensitive topic.

Protecting data against unauthorized access is a top priority, whether that data is personally identifiable information (PII), confidential corporate information, intellectual property, or trade secrets.

COMPLIANCE AND REGULATIONS

Different geographies have various sets of regulations that cover data management and security. Compliance teams are responsible for guaranteeing adherence to these regulations and standards, and they may have concerns about regulation oversights for data stored in the cloud.

LACK OF VISIBILITY AND CONTROL

Unfortunately, many data management professionals and data consumers lack visibility into their own data landscape and don't know which data assets are available, where they are located, how and if they can be used, who has access to which data, and whether or not they should have access to it. This uncertainty limits companies' ability to further leverage their own data to improve productivity or drive business value, and it also raises questions about the benefit-to-risk payoff of data storage in the cloud.

ESSENTIAL PROCESSES FOR DATA GOVERNANCE

These risk factors highlight critical processes that are essential for data governance:

Data Assessment
Metadata Cataloging
Data Quality
Access Control Management
Information Security

Addressing these risks while enjoying the benefits provided by cloud computing has increased the value of understanding data governance, as well as discovering what is important for business operations and decision making.

WHAT IS DATA GOVERNANCE (AND WHY DO WE NEED IT)?

Data governance is one part of the overall discipline of data management, albeit a very important one. Whereas data governance is about the roles, responsibilities, and processes that ensure accountability for and ownership of data assets, DAMA International defines data management as “an overarching term that describes the processes used to plan, specify, enable, create, acquire, maintain, use, archive, retrieve, control, and purge data.”

In terms of the practical approach to data governance, the core mission of data governance teams is generally to:

Optimize, organize, secure, govern, and regulate corporate data assets to ensure reliable and secure business insights;
Influence and inform the future state designs that result from the overarching business transformation; and
Build technologies, policies, and frameworks that make it easier and more intuitive for data consumers to do the right thing when it comes to protecting the corporation.

Data governance needs to be in place for the full data lifecycle, from the moment data is collected or ingested through the point at which that data is destroyed or archived. During the entire life cycle of the data, data governance focuses on making the data available to all data consumers in a form that they can readily access and understand in business terms.

This way, the data can be used to generate the desired business outcomes (analysis and insights) in addition to conforming to regulatory standards, if/where relevant. The final outcome of data governance is to enhance trust in the data.

Trustworthy data is a "must have" for using corporate data to support decision making, risk assessment, and management using key performance indicators (KPIs).

Primary Data Governance Topics. Image courtesy of Finextra.

DATA GOVERNANCE FRAMEWORK

The primary goal of a data governance framework is to support the creation of a single set of rules and processes for collecting, storing, and using data. In this way, the framework makes it easier to streamline and scale core governance processes, enabling you to maintain compliance and security standards, democratize data, and support decision making.

A data governance framework supports the execution of data governance by defining the essential process components of a data governance program.

Outcomes can be measured and monitored throughout the execution of established processes, then optimized for trust, privacy, and data protection. Key outcomes include: tracking processes covering data quality and data proliferation; monitoring for data privacy and risk exposure; alerts for anomalies and the creation of an audit trail; and issue management and workflow facilitation.

An overall data governance program framework covering core macro activities. Image courtesy of Data Governance: The Definitive Guide: People, Processes, and Tools to Operationalize Data Trustworthiness.

BUSINESS BENEFITS OF ROBUST DATA GOVERNANCE

Setting a data governance strategy is critical, as is designing an operational model to run the data governance framework in stages to support the evolution of the model in accordance with the level of data governance maturity to be achieved:

General overview of data governance maturity. Image adapted from IBM Maturity Model.

A good data governance strategy and a solid operational model allow companies to know that, whether the data they are accessing is current or historical data, it will be reliable and usable for analysis. The benefits of data governance can be summarized as follows:

Business benefits of a data governance program.

DATA GOVERNANCE WITH GCP

Google offers some of the most trusted tools to enable data governance at an organizational level. These include a Data Catalog that helps data discoverability, metadata management, and data class-level controls that allow for the separation of sensitive data from other data within containers, as well other tools like Data Loss Prevention and Identity Access Management.

Below is a GCP data governance infrastructure overview:

Data Catalog and DLP. Image courtesy of Google Cloud.

Data Catalog is a fully managed and scalable metadata management service from Google Cloud's Data Analytics family of products. Its focus is searching for insightful data, understanding data, and making data useful.

FINAL THOUGHTS ON DATA GOVERNANCE

Data governance helps organizations better manage the availability, usability, integrity, and security of their corporate data. With the right technology, data governance can also deliver tremendous business value and support a company's digital transformation journey.

At its most basic level, data governance is about bringing data under control and keeping it secure. Successful data governance requires knowing where data is located, how it originated, who has access to it, and what it contains. Effective data governance is a prerequisite for maintaining business compliance, whether that compliance is self-imposed or mandated by an industry or external regulatory body.

The quality, veracity, and availability of data to authorized personnel can also determine whether an organization meets or violates stringent regulatory requirements.

VIII. Data Visualization in GCP

(Read the full article here.)

Now that we understand how to create a rich and robust data structure, we need to look at how to leverage it to inform business decision making.

Data visualization is all about enabling our users to make data-based decisions in an easy and intuitive way, without requiring technical savviness or knowledge about the processes that are executed in the background.

HOW TO STRUCTURE DATA FOR YOUR USERS

Since our goal is to create friendly and easy-to-use interfaces for our users, our first consideration should be about who will use our tools and the context in which those tools will be used. In the book Storytelling with Data, Cole Knaflic proposes three important questions to ask regarding our users:

Who is our target audience?
What kind of decision do we expect our user(s) to make based on this data? What is their role in the decision making process?
How is our data supporting our audience in taking the required actions for their business context?

Another tool that is also very helpful in understanding our users and uncovering new and better ways to use their data is design thinking. Below is an example of a common process applied in design thinking:

DATA VISUALIZATION TOOLS

After understanding our user(s), their context, and their needs, we can start focusing on how we are going to deliver our information to them. There are plenty of data visualization tools in the market, such as: Tableau, Power BI, QlikView, Metabase, and IBM Cognos, among many others. Google Cloud Platform also offers us two fully managed tools for this job:

Data Studio: Data Studio is a free tool that turns your data into informative, easy-to-read, easy-to-share, and fully customizable dashboards and reports.

Looker: Looker is a flexible, multi-cloud platform that can scale effortlessly to meet data and query volumes to help future-proof your data strategy. Looker is the recommended choice for corporate cases, as it provides better performance, scalability, and flexibility than Data Studio.

Both tools integrate really well with Google Cloud Platform tools and are supported by BigQuery BI Engine, which significantly improves the querying speed and consequently their responsiveness, enabling you to make better decisions based on your data. Data Studio and Looker both help your users make informed, data-driven business decisions.

IX. Machine Learning and Artificial Intelligence in GCP

(Read the full article here.)

While data visualization can help us better understand what happened in the past through descriptive analysis and diagnosis, machine learning and artificial intelligence models can help us become more proactive with predictive and prescriptive analysis, and they can also help us extract information from unstructured data to enrich our analysis.

ML MODELS

There are several different machine learning models and architectures, but most of them fall into one or more of the following categories:

Supervised Learning: learns by examples, so we must provide the model with several inputs and the expected outputs so that the model can infer from the inputs and extrapolate the expected outputs.
Unsupervised Learning: uses the input data to extract patterns without relying on labeled outputs. Unsupervised learning models usually seek to organize data according to their features or characteristics.
Semi-Supervised Learning: is similar to supervised learning but can also use unlabeled data to increase the dataset used for inferring the expected value; it may also use some unsupervised learning techniques to enrich the data before the predictions.
Reinforcement Learning: uses a trial and error approach for learning. It tries a set of actions in an environment or in a model of the environment, and based on the result (which can be defined in the form of a reward), it can optimize its actions to better fit the problem.

Since some of these models are more broadly applicable, it's relatively easy to find some ready-to-use, pre-trained implementations of these models in open source libraries or provided by some vendors as Managed Services.

MACHINE LEARNING IN GCP

Google is one of the main players when it comes to machine learning, and Google Cloud Platform leverages this expertise to cover many different customer needs in an easy, cost-effective, and scalable way using Managed APIs, BigQuery ML, and Vertex AI:

Managed APIs: GCP provides Managed APIs that can be used to solve common Machine Learning problems without the need to train a new model or have deep knowledge regarding the underlying technology.

BigQuery ML: Allows businesses to build and deploy models based on SQL language inside BigQuery.

Vertex AI: Vertex AI is a Managed Machine Learning and AI platform that helps you manage the whole lifecycle of your Machine Learning product. It offers a single interface and an API to apply Machine Learning models to different scenarios, as well as MLOps tools to remove the complexity of model maintenance.

Vertex AI helps you train auto ML models with minimal code or create and manage custom models and their whole pipeline.

Vertex AI also integrates with other GCP data tools such as BigQuery, Dataproc, Dataflow, and Cloud Storage, among others, making it easier to integrate your models in your data pipeline without having to worry about the underlying infrastructure.

Modernize Your Data Platform With GCP Today

As a certified Google Partner specializing in Data Analytics, our Avenue Code team has several Google Cloud Platform experts who can help you create, manage, deploy and integrate your models, making use of the best tools for each scenario and enabling you to make better use of your data to support your business decisions. Get in touch with one of our specialists to create your data analytics strategy today!

*This guide was created by Frederico Caram with co-authors Tulio Souza and Andre Soares.

View full post