In this blog, I am going to explain some of the best practices for building a machine learning system on Google Cloud Platform (GCP). We'll start by showing how to understand and formulate the problem and end with tips for training and deploying the model. We'll also discuss the GCP tools you'll need, data processing in GCP, and data validation.
One of the first steps in tackling any ML problem is understanding what you'd like the ML model to produce. What are the inputs to the model and, more importantly, what is the desired output? A developer needs to determine and define all the metrics that will count as success for the ML system.
One thing that will help a lot is to review some heuristics for that problem. By this, I mean you should consider how you would solve the problem if you weren't using machine learning; then, think about to what extent and in which ways your model will be better than that heuristic.
The next step is to formulate the problem in the simplest way possible. A simple problem formulation is always easier to reason about and implement. Later, when you build the complete model, it's easy to add more complexity to increase accuracy. List the inputs you want the machine learning model to accept, identify the source each input comes from, and evaluate how much work it will take to build a data pipeline in GCP to construct each column of a row. It is better to concentrate on easily obtainable inputs first.
Before we go deeper, let's talk about the general form of a machine learning workflow. In general, any machine learning workflow in the cloud consists of the following steps: ingesting the data, storing it, processing and validating it, training and tuning the model, and deploying the trained model to serve predictions.
We will use this workflow and determine the best way to follow it in GCP.
At the beginning of any cloud project, you need to decide who will do what and how much access each user needs to resources. Google Cloud Platform provides resource containers, such as organizations, folders, and projects, that enable you to group and hierarchically organize other GCP resources.
For machine learning resource management, it's good to define a project for each particular ML model. This makes it easier to enable and use all GCP services, including managing APIs, enabling billing, adding and removing collaborators, and managing permissions for GCP resources. For example, you can then easily create a model version from a trained model you previously uploaded to Cloud Storage.
To ingest data for building machine learning models, there are a number of GCP and third-party tools available. Which tools you use depends entirely on the type and source of the data. Sometimes the data provider exposes APIs that can be used for data ingestion; in this case, one option is to spin up resources such as Compute Engine, Google Kubernetes Engine, or App Engine to pull the data.
If the data arrives in real time, Cloud Pub/Sub is a very good option. To move a huge amount of data in bulk, you can use Transfer Appliance. To import data from another cloud service provider, you can use Cloud Storage Transfer Service, and you can also perform an online transfer over your own network to move data into Google Cloud Storage.
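To make the streaming path concrete, here is a minimal sketch of publishing records to a Pub/Sub topic with the google-cloud-pubsub Python client; the project, topic, and payload below are hypothetical placeholders.

```python
# Minimal sketch: publish incoming records to a Cloud Pub/Sub topic.
# Assumes the google-cloud-pubsub client library is installed; the project
# and topic names are hypothetical.
import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-ml-project", "raw-events")

record = {"user_id": 42, "event": "click", "timestamp": "2020-01-01T00:00:00Z"}

# Pub/Sub messages are bytes, so JSON-encode the record before publishing.
future = publisher.publish(topic_path, data=json.dumps(record).encode("utf-8"))
print("Published message ID:", future.result())
```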
As part of the data life cycle for building a machine learning model, we need to think about where the data should be stored. Normally, the first place data lands is either Google Cloud Storage or BigQuery. For streaming data, the sink can be BigQuery as well, since it is able to work with this kind of data.
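As a small illustration of landing data in Cloud Storage, here is a minimal sketch using the google-cloud-storage Python client; the bucket and object names are placeholders.

```python
# Minimal sketch: land a raw data file in a Cloud Storage bucket.
# The bucket and object names are hypothetical placeholders.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-ml-project-raw-data")
blob = bucket.blob("ingest/2020-01-01/events.csv")

blob.upload_from_filename("events.csv")  # local file to upload
print("Uploaded to gs://{}/{}".format(bucket.name, blob.name))
```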
Raw data usually cannot be used directly for machine learning development; it must be processed first. In GCP, once we transfer data to Google Cloud Storage or BigQuery, it can be processed with tools such as Cloud Dataflow, Cloud Dataproc, Cloud Dataprep, and BigQuery itself.
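To make that processing step concrete, here is a minimal Apache Beam sketch of the kind of pipeline you could run on Cloud Dataflow; the bucket paths and the cleaning rule are hypothetical, and the runner options are omitted for brevity.

```python
# Minimal sketch: an Apache Beam pipeline that reads raw CSV lines from
# Cloud Storage, drops malformed rows, and writes the cleaned output back.
# Paths and the cleaning rule are hypothetical; pass DataflowRunner pipeline
# options to execute this on Cloud Dataflow instead of locally.
import apache_beam as beam

def is_valid(line):
    # Hypothetical rule: keep rows that have exactly 5 comma-separated fields.
    return len(line.split(",")) == 5

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Read raw CSV" >> beam.io.ReadFromText("gs://my-ml-project-raw-data/ingest/*.csv")
        | "Keep valid rows" >> beam.Filter(is_valid)
        | "Write cleaned CSV" >> beam.io.WriteToText("gs://my-ml-project-clean-data/events")
    )
```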
The data that you use for training must obey the following rules to run on AI Platform:
VERIFICATION AND VALIDATION
The process of cleansing, enriching, and transforming your data can introduce significant changes to it, some of which might not be expected. So you need techniques for verifying your dataset, from start to finish, throughout your data wrangling efforts. You need to think about checks such as comparing record counts, schemas, and summary statistics before and after each transformation.
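One way to automate checks like these is TensorFlow Data Validation; the sketch below is just one possible approach, and the file paths in it are hypothetical.

```python
# Minimal sketch: compare statistics of the raw and transformed datasets with
# TensorFlow Data Validation and report anomalies against the inferred schema.
# File paths are hypothetical placeholders.
import tensorflow_data_validation as tfdv

raw_stats = tfdv.generate_statistics_from_csv("gs://my-ml-project-raw-data/events.csv")
clean_stats = tfdv.generate_statistics_from_csv("gs://my-ml-project-clean-data/events.csv")

# Infer a schema from the raw data, then check the cleaned data against it.
schema = tfdv.infer_schema(raw_stats)
anomalies = tfdv.validate_statistics(clean_stats, schema=schema)
tfdv.display_anomalies(anomalies)
```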
SELECTING PLATFORM AND RUNTIME VERSIONS
After cleaning the data and placing it in proper storage, it's time to start building a machine learning model. AI Platform from GCP runs your training job on computing resources in the cloud. You can train a built-in algorithm against your dataset without writing a training application, or you can write your own training application to run on AI Platform.
Before I continue, I should note that Cloud ML Engine is now part of AI Platform, so you can scale up model training by using the AI Platform training service in a serverless environment within GCP. AI Platform supports popular ML frameworks and provides built-in tools to help you understand your models and efficiently explain them to business users. It brings the power and flexibility of TensorFlow, scikit-learn, and XGBoost to the cloud.
Along with native support for modern frameworks like TensorFlow, you can run any other framework on Cloud ML Engine. In this case, just upload a Docker container with your training program, and Cloud ML Engine will run it on Google's infrastructure. You can use AI Platform to train your machine learning models using the resources of Google Cloud Platform, and you can also host your trained models on AI Platform so that you can send prediction requests and manage your models and jobs using GCP services.
DEVELOP AND DEPLOY YOUR ML MODEL WITH AI PLATFORM
For training with huge datasets, the best practice is to run the model as a distributed TensorFlow job with AI Platform so you can designate multiple machines in a training cluster. To reduce training time, it's better to train with GPUs or TPUs, which are designed to perform mathematically intensive operations at high speed.
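As a sketch of what the distributed part of the training code might look like, the snippet below wraps a toy Keras model in TensorFlow's MultiWorkerMirroredStrategy; the model itself is a hypothetical stand-in, and AI Platform supplies the cluster configuration to each machine in the training cluster.

```python
# Minimal sketch: wrap model creation in a distribution strategy so the same
# training code can run across the multiple machines (and GPUs) that
# AI Platform provisions for a distributed job. The model is a placeholder.
import tensorflow as tf

strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# model.fit(...) would then train across all workers in the cluster.
```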
Before running your training application with AI Platform, you must package your application, along with any additional dependencies it requires, and upload the package to a Cloud Storage bucket that your Google Cloud Platform project can access. AI Platform then runs the model training. You must also give your training job a name. One good technique is to define a base name for all jobs associated with a given model and then append a date/time string. This makes it simple to sort lists of jobs by name, because all jobs for a model are then grouped together in ascending order.
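A minimal sketch of that naming convention in Python; the base name is a hypothetical placeholder.

```python
# Minimal sketch: build a job name from a fixed base name plus a date/time
# suffix so that jobs for the same model sort together chronologically.
import datetime

base_name = "census_wide_deep"  # hypothetical base name for this model's jobs
timestamp = datetime.datetime.utcnow().strftime("%Y%m%d_%H%M%S")
job_name = "{}_{}".format(base_name, timestamp)

print(job_name)  # e.g. census_wide_deep_20200101_120000
```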
When running a training job on AI Platform, you must also define the number and types of machines you need. To make this easier, you can choose from a set of predefined cluster specifications called scale tiers. Alternatively, you can pick a custom tier and specify the machine types yourself.
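For illustration, here is a hedged sketch of submitting a job with a custom tier through the AI Platform REST API via the google-api-python-client; the project ID, package path, machine types, and runtime version are all assumptions, not values from this post.

```python
# Minimal sketch: submit a training job with a custom cluster configuration
# through the AI Platform (ml.googleapis.com) REST API. All names, paths, and
# machine types below are hypothetical placeholders.
from googleapiclient import discovery

training_inputs = {
    "scaleTier": "CUSTOM",
    "masterType": "n1-standard-8",
    "workerType": "n1-standard-8",
    "workerCount": 4,
    "packageUris": ["gs://my-ml-project-staging/packages/trainer-0.1.tar.gz"],
    "pythonModule": "trainer.task",
    "region": "us-central1",
    "jobDir": "gs://my-ml-project-staging/jobs/census_wide_deep_20200101_120000",
    "runtimeVersion": "1.15",
}

ml = discovery.build("ml", "v1")
request = ml.projects().jobs().create(
    parent="projects/my-ml-project",
    body={"jobId": "census_wide_deep_20200101_120000", "trainingInput": training_inputs},
)
response = request.execute()
```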
One very significant step in developing a machine learning model is hyperparameter tuning. If you want to use hyperparameter tuning, you must add configuration details when you create your training job. The best practice is to use HyperTune, the AI Platform service that automatically tunes deep learning hyperparameters and helps you reach better outcomes more quickly. Data scientists frequently manage thousands of tuning experiments in the cloud, and HyperTune saves many hours of tedious and error-prone work.
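You define the search space in the job configuration; on the application side, each trial reports its metric back to the service. Below is a minimal sketch using the cloudml-hypertune helper package, with a hypothetical metric name and value.

```python
# Minimal sketch: report the evaluation metric of one trial back to
# AI Platform's hyperparameter tuning service (HyperTune) using the
# cloudml-hypertune helper package. Metric name and value are placeholders.
import hypertune

hpt = hypertune.HyperTune()
hpt.report_hyperparameter_tuning_metric(
    hyperparameter_metric_tag="val_rmse",  # must match the metric named in the job config
    metric_value=0.1234,
    global_step=1000,
)
```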
GCP uses regions to specify the location of computing resources. If you store your training dataset on Cloud Storage, you should run your training job in the same region as the Cloud Storage bucket you're using for the training data. If you run your job in a different region from your data bucket, it may take longer.
You can define the output directory for your job by setting a job directory when you configure the job. When you submit the job, AI Platform validates the directory so that you can fix any problems before the job runs. Your application must account for the --job-dir argument: capture its value when you parse your other parameters and use it when saving your application's output.
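A minimal sketch of handling --job-dir in a training application, assuming argparse and a hypothetical extra flag:

```python
# Minimal sketch: accept the --job-dir argument that AI Platform passes to the
# training application, alongside the application's own hyperparameter flags.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--job-dir", required=True,
                    help="Cloud Storage path for checkpoints and exported models")
parser.add_argument("--learning-rate", type=float, default=0.01)  # hypothetical flag
args = parser.parse_args()

# Use args.job_dir as the output location when saving checkpoints and models.
print("Writing job output to", args.job_dir)
```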
One good practice is for applications to output data, including checkpoints during training and a saved model at the end; you can output other data as required by your application. It's easiest to save your output files to a Cloud Storage bucket in the same GCP project as your training job. GCP VMs may be restarted occasionally, so you should ensure that your training job is resilient to these restarts by saving model checkpoints frequently and by configuring your job to restore the most recent checkpoint. You normally save model checkpoints in the Cloud Storage path that you specify with the --job-dir argument of the gcloud ai-platform jobs submit training command. The TensorFlow Estimator API implements checkpoint functionality for you, so if your model is wrapped in an Estimator, you do not need to worry about restart events on your VMs; using the Estimator API is therefore a much better approach.
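A hedged sketch of wiring those checkpoints through the Estimator API; the job directory path, feature column, and input function are hypothetical stand-ins.

```python
# Minimal sketch: point a premade Estimator at the job directory so checkpoints
# are written there regularly and restored automatically after a VM restart.
import tensorflow as tf

job_dir = "gs://my-ml-project-staging/jobs/census_wide_deep_20200101_120000"  # usually args.job_dir

def input_fn():
    # Hypothetical input function; replace with your real data pipeline.
    features = {"x": tf.constant([[1.0], [2.0], [3.0]])}
    labels = tf.constant([[2.0], [4.0], [6.0]])
    return tf.data.Dataset.from_tensor_slices((features, labels)).batch(3).repeat()

run_config = tf.estimator.RunConfig(
    model_dir=job_dir,
    save_checkpoints_steps=500,  # checkpoint frequently to survive restarts
    keep_checkpoint_max=5,
)

estimator = tf.estimator.LinearRegressor(
    feature_columns=[tf.feature_column.numeric_column("x")],
    config=run_config,
)

# If a checkpoint already exists in model_dir, training resumes from it.
estimator.train(input_fn=input_fn, max_steps=10000)
```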
There are also some good practices for deploying models. For instance, if you are deploying a custom prediction routine, upload any additional model artifacts to your model directory as well. The total file size of your model directory must be 250 MB or less, and if you create subsequent versions of your model, put each one into its own separate directory within your Cloud Storage bucket.
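As a sketch of that layout, each version can be deployed from its own Cloud Storage directory; the snippet below uses the AI Platform REST API via the google-api-python-client, and every name and path in it is a hypothetical placeholder.

```python
# Minimal sketch: create a new model version from its own Cloud Storage
# directory using the AI Platform (ml.googleapis.com) REST API. Project,
# model, version, and path names are hypothetical placeholders.
from googleapiclient import discovery

ml = discovery.build("ml", "v1")

version_body = {
    "name": "v2",
    "deploymentUri": "gs://my-ml-project-models/census_wide_deep/v2/",  # one directory per version
    "runtimeVersion": "1.15",
    "framework": "TENSORFLOW",
    "pythonVersion": "3.7",
}

request = ml.projects().models().versions().create(
    parent="projects/my-ml-project/models/census_wide_deep",
    body=version_body,
)
response = request.execute()
```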
I hope this blog gave you some great tips for building a machine learning model in GCP. Do you have any other recommendations? Let us know in the comments below how you're implementing your ML model in GCP!
For references and further reading, check out:
Cloud ML Engine is now a part of AI Platform