Nearly every business is facing new challenges in how to store and effectively utilize data. Today we'll look at how data lakes solve these challenges and why GCP is the available, scalable, cost-efficient data lake hosting solution your business needs.
Data Challenges in Today's Business Context
In today's environment, businesses face new challenges arising from the sheer volume of data, the growth in the generation and consumption of unstructured data, the need for real-time or near-real-time data, and the trend toward migrating data from on-premises servers to the cloud.
Legacy data warehouses are unable to adequately handle unstructured data since they usually depend on fixed schemas to perform well, and that requires structured data formats. They also have limitations when it comes to scaling horizontally, making it more expensive to handle high volumes of data or high ingestion rates.
Finally, the move from on-premises servers to the cloud, alongside advances in computing, made it much cheaper to store data than to process it. So it became easier to store multiple copies of transformed data than to process everything up front, resulting in new paradigms, such as moving from ETL (extract-transform-load) to ELT (extract-load-transform).
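To make the ELT idea concrete, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for a cloud data store. The event payloads, the table names (raw_events, spend_by_customer), and the function names are all hypothetical and chosen only for illustration. The point is the ordering: the raw payloads are loaded untouched first, and the transformation runs afterwards against that stored copy.

```python
import json
import sqlite3

# Hypothetical raw events; in ELT they are loaded as-is, without a fixed schema.
RAW_EVENTS = [
    '{"customer": "ana", "amount": "12.50", "currency": "USD"}',
    '{"customer": "bob", "amount": "7", "currency": "USD"}',
    '{"customer": "ana", "amount": "3.25"}',  # a missing field is still accepted
]


def load_raw(conn, events):
    """Extract + Load: store each payload untouched in a raw table."""
    conn.execute("CREATE TABLE IF NOT EXISTS raw_events (payload TEXT)")
    conn.executemany(
        "INSERT INTO raw_events (payload) VALUES (?)", [(e,) for e in events]
    )


def transform(conn):
    """Transform: derive a structured table from the raw copy after loading."""
    totals = {}
    for (payload,) in conn.execute("SELECT payload FROM raw_events").fetchall():
        record = json.loads(payload)
        name = record["customer"]
        totals[name] = totals.get(name, 0.0) + float(record["amount"])
    conn.execute("DROP TABLE IF EXISTS spend_by_customer")
    conn.execute("CREATE TABLE spend_by_customer (customer TEXT, total REAL)")
    conn.executemany(
        "INSERT INTO spend_by_customer VALUES (?, ?)", sorted(totals.items())
    )


def run_elt(events):
    """Run load then transform and return the derived rows."""
    conn = sqlite3.connect(":memory:")
    load_raw(conn, events)
    transform(conn)
    query = "SELECT customer, total FROM spend_by_customer ORDER BY customer"
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    print(run_elt(RAW_EVENTS))
```

Because the raw table is kept, the transform step can be rewritten and rerun later without ingesting the data again, which is the practical benefit of loading before transforming.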
Data Lakes and Data Warehouses
To overcome some of these challenges, a new concept arose: Data Lakes. A Data Lake is a centralized place to store data, allowing us to ingest, store, transform, analyze, and model data in a secure, cost-effective, easily organized, and manageable way. Data Lakes aren’t supposed to replace Data Warehouses (check out our previous post on Data Warehouses), but they can integrate or incorporate Data Warehouses to make the best use of each solution.
It's important to note that a Data Lake is an architectural concept and not a tool. Even though it is very common for vendors to relate a Data Lake to a specific tool, Data Lakes usually utilize a variety of tools to load, store, transform, and expose data. This data concept is not only about moving and transforming your data through data pipelines, but it's also about doing it in a traceable way, managing the data lifecycle and its lineage, identifying sensitive information and making sure that the data only gets to people who are authorized to see it.
Data Lake Organization
Data Lakes are usually organized into layers. As data moves from one layer to another through data pipelines, it gets cleaner, more informative, more trustworthy, better curated, and altogether more meaningful for the business. One common way to structure Data Lake layers is to categorize them as Bronze (raw data as ingested), Silver (cleaned and standardized data), and Gold (curated, business-ready data), as displayed in the following figure:
Image courtesy of Medium.
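The layered flow can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the sample order records, the cleaning rules, and the function names to_silver and to_gold are hypothetical, and a real pipeline would run on a processing engine rather than on in-memory lists.

```python
from collections import defaultdict

# Hypothetical raw records as they might land in the Bronze layer.
BRONZE = [
    {"order_id": "1", "country": " br ", "amount": "100.0"},
    {"order_id": "2", "country": "US", "amount": "50.5"},
    {"order_id": "2", "country": "US", "amount": "50.5"},  # duplicate
    {"order_id": "3", "country": "us", "amount": "n/a"},   # invalid amount
]


def to_silver(bronze_rows):
    """Bronze -> Silver: cast types, normalize values, drop duplicates and bad rows."""
    seen = set()
    silver = []
    for row in bronze_rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # a real pipeline would likely quarantine this row instead
        if row["order_id"] in seen:
            continue
        seen.add(row["order_id"])
        silver.append(
            {
                "order_id": int(row["order_id"]),
                "country": row["country"].strip().upper(),
                "amount": amount,
            }
        )
    return silver


def to_gold(silver_rows):
    """Silver -> Gold: aggregate into a business-level view (revenue per country)."""
    totals = defaultdict(float)
    for row in silver_rows:
        totals[row["country"]] += row["amount"]
    return dict(totals)


if __name__ == "__main__":
    silver = to_silver(BRONZE)
    print(silver)
    print(to_gold(silver))
```

Each function reads only from the layer before it, so any layer can be rebuilt from its predecessor if the rules change.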
Why Use GCP for Your Data Lake
One easy and effective way to organize your Data Lake is through Google Cloud Platform (GCP). GCP provides tools to cover most of your Data Lake needs with easily scalable and manageable serverless components, so the team can focus on what brings value to the business instead of on managing infrastructure. The most common scenarios are covered as follows:
Batch Ingestion: GCP provides many tools to copy and receive data from other cloud providers or on-premises infrastructure, such as Transfer Appliance, Storage Transfer Service, and gsutil. In scenarios where data must be fetched from an API or received through one, we can also use computing tools such as Cloud Functions, Cloud Run, App Engine, Compute Engine, or even Cloud Data Fusion.
Stream/Real-Time Ingestion: Usually the best tools for stream ingestion in GCP are Cloud Pub/Sub combined with Dataflow, where Pub/Sub is responsible for receiving the data and Dataflow is responsible for processing it and moving it to persistent storage.
Change Data Capture: If you are using MySQL or Oracle databases, Datastream allows you to stream your changing data to Cloud Storage.
Landing Zone and Raw Data: The most common use case is to use Cloud Storage for your landing data and raw data, but in certain scenarios, BigQuery, Bigtable, or Pub/Sub can also be used.
Data Warehouse and Data Marts: The most common tool for these scenarios is BigQuery, but GCP also offers alternatives, such as Dataproc and Databricks, for companies that prefer or need to use the Apache Hadoop stack.
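The division of labor in the streaming scenario above (one component receives messages, another processes them and writes them to persistent storage) can be sketched with the Python standard library alone. This is a local simulation, not the Pub/Sub or Dataflow API: queue.Queue stands in for a topic, a plain list stands in for persistent storage, and the names publish, process, and amount_cents are hypothetical.

```python
import json
import queue


def publish(topic, event):
    """Stand-in for the receiving role: accept an event and buffer it as a message."""
    topic.put(json.dumps(event))


def process(topic, sink):
    """Stand-in for the processing role: parse, enrich, and persist each message."""
    written = 0
    while True:
        try:
            message = topic.get_nowait()
        except queue.Empty:
            break  # nothing left to process in this run
        record = json.loads(message)
        # Example enrichment: store the amount as integer cents.
        record["amount_cents"] = round(record["amount"] * 100)
        sink.append(record)  # a real pipeline would write to durable storage
        written += 1
    return written


if __name__ == "__main__":
    topic = queue.Queue()  # simulated topic
    storage = []           # simulated persistent storage
    publish(topic, {"order_id": 1, "amount": 19.99})
    publish(topic, {"order_id": 2, "amount": 5.0})
    print(process(topic, storage), "records written")
    print(storage)
```

Keeping the receiver and the processor separate means producers are not blocked when processing slows down, which is the reason for pairing a messaging service with a processing service.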
GCP Architecture provides the following reference as an example of what's possible:
Image courtesy of Google Cloud.
The Available, Scalable, Cost-Efficient Data Solution
In summary, GCP can help you achieve the data availability and scalability your business needs at a competitive cost, with a complete ecosystem to support the most common business scenarios, and you won't have to worry about the underlying infrastructure or operations.
At Avenue Code, we have several Google Cloud Platform experts who can help you modernize your Data Warehouse to be highly available, scalable, and cost-efficient. Don't hesitate to reach out to discuss your project!
Learn More About Data Modernization
Want to learn more about data modernization and analysis? Be sure to check out the other blogs in our series: The 6 Pillars of Data Modernization Success, 4 Strategies to Boost Sales with Data Mining, and Modernizing Your Data Warehouse with Big Query.
Frederico Caram is a Data Architect at Avenue Code. He enjoys reading historical fantasy novels, ballroom dancing, and playing video games.