This article presents a demo use case for machine learning prediction using a publicly available Google Cloud dataset containing all Chicago taxi trips from 2013 through 2018. The primary business need in this demo is to provide accurate predictions of the total duration of taxi trips in Chicago. These predictions enable taxi companies to inform customers of expected trip durations via a mobile app, enhancing customer satisfaction and operational efficiency.

Machine Learning Use Case

The machine learning use case involves developing a complete workflow, from data exploration to model deployment, using Google Cloud services. The proposed solution is a Deep Learning Regression model, trained and deployed on Google Cloud, that predicts trip durations in minutes and can later serve those predictions in real time. The workflow includes:

  • Data exploration
  • Feature engineering
  • Model training and deployment

The ultimate goal is to provide reliable trip duration predictions, improving customer experience and operational planning.

Data Exploration

To get the best results from our model, we defined a data exploration methodology with the following goals and steps:

Exploratory Data Analysis (EDA) Goals:

  • Assess dataset size and suitable tools
  • Identify data types and necessary changes
  • Determine target variables for prediction
  • Identify and discard irrelevant columns
  • Analyze variable distributions and correlations
  • Handle missing values
  • Create new features from existing variables
  • Apply necessary data transformations and scaling

Steps Followed:

  1. Dataset Preparation:
    • Copy the Chicago Taxi Trips dataset to a new BigQuery schema.
    • Check data types and dataset size.
    • Use a sample dataset for initial EDA due to the large size (over 102 million rows).
  2. Identifying Key Variables:
    • Discard irrelevant columns like unique_key and taxi_id.
    • Select trip_seconds as the target variable, transformed to minutes for better interpretability.
  3. Correlation Analysis:
    • Analyze correlations to decide which variables to retain or discard (a code sketch of the sampling and correlation check follows this list).
  4. Handling Missing Values:
    • Identify columns with significant missing data and decide to remove rows with missing values for selected features.
  5. Additional Explorations:
    • Perform further EDA in BigQuery, such as checking unique values of categorical variables.
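
As an illustration, the following is a minimal sketch of how the sampling and correlation check could look in a Python notebook using the BigQuery client. The selected columns and the 1-percent sampling rate are illustrative assumptions, not the exact queries used in the project.

from google.cloud import bigquery

# Assumption: the notebook has access to the article's Google Cloud project
# and the google-cloud-bigquery and pandas packages are installed.
client = bigquery.Client(project="gcp-ml-especialization-2024")

# Pull a small random sample instead of the full ~102-million-row table;
# the 1 percent sampling rate is illustrative.
query = """
SELECT trip_seconds, trip_miles, fare, tolls, trip_total,
       pickup_latitude, pickup_longitude
FROM `gcp-ml-especialization-2024.spec_taxi_trips.taxi_trips` TABLESAMPLE SYSTEM (1 PERCENT)
"""
sample_df = client.query(query).to_dataframe()

# Basic profiling: size, data types, and pairwise correlations of numeric columns.
print(sample_df.shape)
print(sample_df.dtypes)
print(sample_df.corr(numeric_only=True))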

Key Findings

As a result of our data exploration, we found some interesting correlation patterns:

  • Strong correlations were identified between variables such as fare and trip_total, and between tolls and trip_total.
  • Decisions were made to retain or discard variables based on correlation strength and relevance.

Missing Values:

  • A significant number of missing values were found in several columns. In response, we decided to discard rows with missing values for the selected features.

Example Code Snippets

Check Missing Values:

SELECT count(*) FROM `gcp-ml-especialization-2024.spec_taxi_trips.taxi_trips`;

SELECT count(*) FROM `gcp-ml-especialization-2024.spec_taxi_trips.taxi_trips` WHERE fare IS NULL;


Key Findings on Initial Raw Variables and Data Transformations

During our analysis, we identified that some original variables in the dataset contained redundant information. Specifically, the dropoff_location variable, which held geolocation data as latitude/longitude pairs, duplicated information already present in the dropoff_latitude and dropoff_longitude variables. Consequently, we retained the latter two and discarded dropoff_location. These retained variables, along with pickup_latitude and pickup_longitude, will be used to create a new engineered feature called distance, representing the Euclidean distance between pickup and dropoff coordinates.
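
As a minimal sketch, the Euclidean distance feature can be derived as follows in pandas, assuming a DataFrame with the coordinate columns named as above; the project itself computes this feature in its BigQuery feature engineering script, and the sample coordinates are hypothetical.

import numpy as np
import pandas as pd

# Hypothetical sample rows; in the project this computation runs in BigQuery SQL.
trips = pd.DataFrame({
    "pickup_latitude":   [41.88, 41.97],
    "pickup_longitude":  [-87.63, -87.90],
    "dropoff_latitude":  [41.89, 41.88],
    "dropoff_longitude": [-87.62, -87.63],
})

# Euclidean distance between pickup and dropoff coordinates, as described above.
# Note: this operates directly on degrees (per the article), not on kilometres.
trips["distance"] = np.sqrt(
    (trips["dropoff_latitude"] - trips["pickup_latitude"]) ** 2
    + (trips["dropoff_longitude"] - trips["pickup_longitude"]) ** 2
)
print(trips[["distance"]])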

Example of dropoff_location Content

Figure: Snapshot of sample dropoff_location feature content

Similarly, the pickup_location variable was discarded from further analysis. Below is an example of its content:

Figure: Snapshot of sample pickup_location feature content

Key Findings on Trip Time-Related Information

Our dataset analysis revealed that only trip_start_timestamp and trip_end_timestamp variables contained time-related information in UTC. Duplicate timestamps across different trips were observed, prompting us to adopt a cross-sectional data analysis approach rather than a time-series/panel approach.

Using Pandas data frames in a Python notebook within Vertex Workbench, we created new time-related features such as minute, hour, day, day of the week, and month for each trip. These features were derived from trip_start_timestamp and trip_end_timestamp and will be detailed in the Feature Engineering section. Post-creation, the original timestamp variables were discarded.
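
A minimal pandas sketch of this derivation is shown below. The sample timestamps are hypothetical, and the exact output column names are assumptions based on the feature names mentioned later in this article (which uses both the day_of_the_week and day_of_week spellings).

import pandas as pd

# Hypothetical sample; in the project the timestamps come from the BigQuery table.
trips = pd.DataFrame({
    "trip_start_timestamp": pd.to_datetime(
        ["2017-05-03 08:15:00+00:00", "2017-05-03 23:45:00+00:00"]),
    "trip_end_timestamp": pd.to_datetime(
        ["2017-05-03 08:40:00+00:00", "2017-05-04 00:10:00+00:00"]),
})

for prefix, col in [("trip_start", "trip_start_timestamp"),
                    ("trip_end", "trip_end_timestamp")]:
    ts = trips[col].dt
    trips[f"minute_{prefix}"] = ts.minute
    trips[f"hour_{prefix}"] = ts.hour
    trips[f"day_{prefix}"] = ts.day
    trips[f"day_of_the_week_{prefix}"] = ts.dayofweek  # Monday=0 ... Sunday=6
    trips[f"month_{prefix}"] = ts.month

# The original timestamp columns are dropped once the new features exist.
trips = trips.drop(columns=["trip_start_timestamp", "trip_end_timestamp"])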

Example of Time-Related Variables Content

Figure: Sample content of trip_start_timestamp and trip_end_timestamp

Distribution of Variables in the Dataset

We analyzed the distribution of variables using histograms and bar charts. Our findings indicate that most variables exhibit asymmetric distributions. For instance, the original target variable trip_seconds showed a right-hand skew before being transformed into trip_minutes.
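
As a quick, hypothetical illustration, skewness can also be checked numerically in pandas; the sample values below are made up, and the project's actual checks relied on the histograms and bar charts described above.

import pandas as pd

# Hypothetical sample of trip durations in seconds.
trip_seconds = pd.Series([300, 420, 540, 600, 780, 900, 1500, 3600, 7200])

# A positive skew value indicates the long right tail described above.
print("skewness:", trip_seconds.skew())
print(trip_seconds.describe())

# In a notebook, a histogram makes the asymmetry visible:
# trip_seconds.hist(bins=50)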

Example of Trip Seconds Distribution

Figure: Sample distribution of trip_seconds

Some variables, like dropoff_community_area, displayed multimodal distributions:

Figure: Bar plot of dropoff_community_area

Others, such as pickup_latitude, showed a symmetric distribution:

Figure: Sample distribution of pickup_latitude

Data Transformation and Deep Learning Model Training

Given the plan to train a Deep Neural Network regressor on the trip_minutes target variable, no transformations were applied to reshape feature distributions, both because of time constraints and because deep learning models can generally accommodate such distributions. However, other data transformations were implemented to speed up model training.

Data Type Changes
We changed the data type of trip_seconds from integer to float64, enabling the division by 60 to create the trip_minutes feature.

Min-Max Scaling
To improve deep learning model training, we applied Min-Max Scaling to the continuous variables using BigQuery SQL scripts. This scaling improves numerical stability and helps training converge faster.
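
The sketch below shows equivalent logic in pandas for both the data type change above and the Min-Max scaling; the project itself applied these steps in BigQuery SQL, and the sample values are illustrative.

import pandas as pd

# Hypothetical sample; the project applies these steps in BigQuery SQL.
trips = pd.DataFrame({"trip_seconds": [300, 900, 1800],
                      "trip_miles": [1.2, 4.5, 9.8]})

# Data type change plus conversion of the target to minutes.
trips["trip_seconds"] = trips["trip_seconds"].astype("float64")
trips["trip_minutes"] = trips["trip_seconds"] / 60.0

# Min-Max scaling of a continuous feature to the [0, 1] range:
# x_scaled = (x - min) / (max - min)
col_min, col_max = trips["trip_miles"].min(), trips["trip_miles"].max()
trips["trip_miles_scaled"] = (trips["trip_miles"] - col_min) / (col_max - col_min)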

Feature Engineering and Selection

Feature engineering steps were executed in a BigQuery script for performance optimization. The steps included:

  1. Selecting desired fields and creating new features from timestamp variables.
  2. Discarding null values.
  3. Label encoding categorical variables (a code sketch follows this list).
  4. Creating a new distance feature (Euclidean distance between pickup and dropoff coordinates).
  5. Applying Min-Max scaling to continuous features.
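
As an illustration of step 3, the sketch below shows label encoding with pandas. The project performs this step in its BigQuery script, and the mapping of a company column to company_num is an assumption based on the feature name mentioned later in this article; the company values are illustrative.

import pandas as pd

# Hypothetical example; the project performs label encoding in BigQuery SQL.
trips = pd.DataFrame({"company": ["Flash Cab", "Taxi Affiliation Services",
                                  "Flash Cab", "Blue Diamond"]})

# Label encoding: each distinct category gets an integer code.
trips["company_num"] = trips["company"].astype("category").cat.codes
print(trips.drop_duplicates())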

The resulting dataset was stored in gcp-ml-especialization-2024.spec_taxi_trips.chicago_taxi_trips_final_dataset1.

Train, Validation, and Test Splits
The dataset was split into train (70%), validation (20%), and test (10%) sets based on trip_start_timestamp.

Train Dataset Creation

SELECT *
FROM `gcp-ml-especialization-2024.spec_taxi_trips.chicago_taxi_trips_final_dataset1`
WHERE trip_start_timestamp BETWEEN '2013-01-01 00:00:00 UTC' AND '2018-01-21 23:45:00 UTC';

Validation Dataset Creation

SELECT *
FROM `gcp-ml-especialization-2024.spec_taxi_trips.chicago_taxi_trips_final_dataset1`
WHERE trip_start_timestamp BETWEEN '2018-01-21 23:45:01 UTC' AND '2018-10-24 23:45:00 UTC';

Test Dataset Creation

SELECT *
FROM `gcp-ml-especialization-2024.spec_taxi_trips.chicago_taxi_trips_final_dataset1`
WHERE trip_start_timestamp > '2018-10-24 23:45:00 UTC';

By implementing these feature engineering and data preprocessing steps, we optimized the dataset for effective deep learning model training, validation, and testing.

Further Reduction of Variables in the Dataset for Model Training

Due to the substantial size of the datasets and the limitations of the infrastructure available for this demo, it was necessary to further reduce the dataset size. This meant minimizing the number of features used for Deep Learning model training while retaining as many rows as possible for each feature. Consequently, the following variables were discarded for model training, validation, and testing:

  • day_trip_start (redundant given the day_of_the_week_trip_start variable)
  • hour_trip_end
  • day_trip_end
  • day_of_the_week_trip_end
  • fare
  • dropoff_latitude
  • dropoff_longitude
  • company_num

The final datasets included these features:

  • hour_trip_start
  • day_of_week_trip_start
  • month_trip_start
  • trip_minutes (target variable)
  • trip_miles
  • pickup_latitude
  • pickup_longitude
  • distance

This resulted in a total of seven features and one target variable.

Dataset Sampling, Machine Learning Model Training, and Development

Proposed Machine Learning Model

Given the original dataset's size, containing millions of rows, and its tabular nature, a deep neural network was chosen to build a regression model for predicting taxi trip durations in Chicago. No special components (e.g., convolutional layers) were needed. Rectified Linear Unit (ReLU) activation functions were adopted in all layers because they help mitigate issues such as vanishing gradients in deep architectures, and the Adam optimization algorithm was selected for its adaptive, per-parameter learning rates.

Framework Used

The Keras framework (with a TensorFlow backend) was used, specifically version 2.12.0, because of version constraints in Google Cloud's model registry and deployment services. Keras was chosen for its abstraction over TensorFlow, which simplifies model specification and training.

Deep Neural Network Architecture Choice

The final architecture was influenced by several factors:

  • Available infrastructure resources (memory and computation time)
  • Size of the training dataset
  • Time constraints for development (data exploration, feature engineering, model training, evaluation, testing, and deployment)
  • Testing of the deployed model with batch predictions using Google Cloud solutions

After evaluating different architectures, the final neural network included:

  • Input layer: Densely connected with ReLU activation, seven nodes
  • Second layer: Densely connected with ReLU activation, five nodes
  • Third layer: Densely connected with ReLU activation, three nodes
  • Fourth layer: Densely connected with ReLU activation, two nodes
  • Output layer: One node for regression prediction

Hyperparameters and Training Configuration

Due to time and hardware limitations, cross-validation was not implemented. Initial training runs used a higher learning rate, but the final training used a learning rate of 0.01. The Mean Squared Error (MSE) loss function and the Adam optimization algorithm were used. The final training session ran for 10 epochs with the following parameters (a code sketch follows this list):

  • Beta_1: 0.9
  • Beta_2: 0.999
  • Epsilon: None
  • Decay: 0.0
  • Amsgrad: False
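
A minimal Keras sketch of the architecture and training configuration described above is shown below. The batch size, the commented-out training call, and the names X_train/y_train are assumptions, since the article does not specify the input pipeline; the linear output activation is also assumed, as is standard for regression.

import tensorflow as tf  # Keras 2.12.x ships with TensorFlow 2.12

# Layer stack described above: 7 -> 5 -> 3 -> 2 -> 1, ReLU in the hidden layers.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(7, activation="relu", input_shape=(7,)),
    tf.keras.layers.Dense(5, activation="relu"),
    tf.keras.layers.Dense(3, activation="relu"),
    tf.keras.layers.Dense(2, activation="relu"),
    tf.keras.layers.Dense(1),  # single output node (linear activation assumed)
])

# Adam settings listed above; epsilon and decay are left at the framework defaults
# (epsilon ~ 1e-7, no learning-rate decay), which match the values reported here.
optimizer = tf.keras.optimizers.Adam(
    learning_rate=0.01, beta_1=0.9, beta_2=0.999, amsgrad=False)
model.compile(optimizer=optimizer, loss="mse", metrics=["mae"])

# Final training run: 10 epochs on the engineered feature set.
# X_train / y_train and the validation arrays are assumed to come from the
# BigQuery splits described earlier; the batch size is illustrative.
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=10, batch_size=1024)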

Dataset Sampling for Model Training, Validation, and Test

The dataset was split into training, validation, and test sets using the trip_start_timestamp field. Random sampling was unnecessary as the training dataset included data from all years in the raw dataset.

Machine Learning Model Evaluation and Performance Assessment

The Mean Squared Error (MSE) metric was selected to assess model performance on the training, validation, and test sets, as it emphasizes large errors and is a standard metric in academia and industry. Other metrics included Mean Absolute Error (MAE), the coefficient of determination (R²), and Root Mean Squared Error (RMSE).
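
As a hypothetical illustration, these metrics can be computed with scikit-learn on held-out predictions; the library choice and the sample values below are assumptions, not taken from the project.

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Hypothetical true durations (minutes) and model predictions.
y_true = np.array([12.0, 8.5, 25.0, 40.0])
y_pred = np.array([10.5, 9.0, 22.0, 33.0])

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

print(f"MSE={mse:.2f}  RMSE={rmse:.2f}  MAE={mae:.2f}  R2={r2:.3f}")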

Final Obtained Results

The final trained model achieved the following results on the validation set:

Figure: Final Regression Model Performance on Validation Set

The deep learning regression model explained 27.3% of the variance in the validation set, which is considered suboptimal.

On the test set, the final regression model obtained the following results:

Figure: Final Regression Model Performance on Test Set

The model explained 23.98% of the variance in the test set, also considered suboptimal.

Given the constraints faced during development, these were the best achievable results. However, the main purpose of this use case was to demonstrate a complete workflow implementation of a machine learning solution using Google Cloud. This objective was successfully fulfilled.

Final Considerations and Recommendations for Future Developments

Suggested Actions for Improvement

To substantially improve the model's performance, consider the following next steps:

  1. Feature Selection: Due to constraints, the training sessions all used the same feature set, which may not have been optimal. Implement a feature selection algorithm, such as Mutual Information, on a dataset sample to better guide feature selection for future models (a code sketch follows this list). This should yield a more informative set of features for neural network models.
  2. Neural Network Architectures: The development tested a limited number of neural network architectures. Future work should explore a broader range of architectures, including more layers and varying numbers of neurons in each layer.
  3. Hyperparameter Optimization: Experiment with different combinations of hyperparameters, as they are crucial for deep learning model performance.
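
As a hypothetical illustration of suggestion 1, Mutual Information scores can be computed on a dataset sample with scikit-learn; the feature columns and synthetic target below are made up for the example.

import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

# Hypothetical sample of candidate features and the trip_minutes target.
rng = np.random.default_rng(42)
X = pd.DataFrame({
    "trip_miles": rng.uniform(0.5, 20.0, 1000),
    "hour_trip_start": rng.integers(0, 24, 1000),
    "pickup_latitude": rng.uniform(41.6, 42.1, 1000),
})
y = 3.0 * X["trip_miles"] + rng.normal(0, 2, 1000)  # synthetic target for the demo

# Mutual information scores; higher values indicate more informative features.
scores = mutual_info_regression(X, y, random_state=42)
print(pd.Series(scores, index=X.columns).sort_values(ascending=False))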

The main objective of this development was to present a complete workflow for developing a machine-learning model to meet specific business needs using Google Cloud. This project provided a detailed description of all steps involved (from data exploration to model deployment) while adhering to Google Machine Learning Best Practices. Recommendations for future developments were also provided.


Author

Alfredo Passos

Professional with solid experience in Operations Research, Advanced Analytics, Big Data, Machine Learning, Predictive Analytics, and quantitative methods to support decision-making in companies of all sizes. Experienced in applications across Financial, Actuarial, Fraud, Commercial, Retail, Telecom, Food, Livestock, and other fields.