Important note: this blog post is a brief summary of the development of the Chicago Taxi Trips use case. For more details and information, please refer to the complete official document of this work.
This article presents a demo use case for machine learning prediction, using a publicly available Google Cloud dataset containing all Chicago taxi trips from 2013 to 2018. The primary business need in this demo is to provide accurate predictions of the total duration of taxi trips in Chicago. These predictions enable taxi companies to inform customers of expected trip durations via a mobile app, enhancing customer satisfaction and operational efficiency.
The machine learning use case involves developing a complete workflow, from data exploration to model deployment, using Google Cloud services. The objective is to deliver trip duration predictions (that can later be provided in real-time) via a Deep Learning Regression model.
As a solution, we propose a Deep Learning Regression model, trained and deployed on Google Cloud, that predicts trip durations in minutes. The workflow spans data exploration, feature engineering, model training and evaluation, and model deployment.
The ultimate goal is to provide reliable trip duration predictions, improving customer experience and operational planning.
To make sure we get the best result out of our model, we followed a data exploration methodology with well-defined goals and steps.
As a result of our data exploration, we found some interesting correlation patterns involving trip_total, tolls, and other variables.

Check Missing Values:
-- Total row count
SELECT count(*) FROM `gcp-ml-especialization-2024.spec_taxi_trips.taxi_trips`;

-- Rows with missing fare values
SELECT count(*) FROM `gcp-ml-especialization-2024.spec_taxi_trips.taxi_trips`
WHERE fare IS NULL;
During our analysis, we identified that some original variables in the dataset contained redundant information. Specifically, the dropoff_location variable, which held geolocation data as latitude/longitude pairs, duplicated information already present in the dropoff_latitude and dropoff_longitude variables. Consequently, we retained the latter two and discarded dropoff_location. These retained variables, along with pickup_latitude and pickup_longitude, are used to create a new engineered feature called distance, representing the distance between the pickup and dropoff coordinates (computed with ST_DISTANCE in the feature engineering script below).
Snapshot of sample dropoff_location feature content
Similarly, the pickup_location variable was discarded from further analysis. Below is an example of its content:
Snapshot of sample pickup_location feature content
Our dataset analysis revealed that only the trip_start_timestamp and trip_end_timestamp variables contained time-related information, expressed in UTC. Duplicate timestamps across different trips were observed, prompting us to adopt a cross-sectional data analysis approach rather than a time-series/panel approach.
Using Pandas dataframes in a Python notebook within Vertex AI Workbench, we created new time-related features such as minute, hour, day, day of the week, and month for each trip. These features were derived from trip_start_timestamp and trip_end_timestamp and are detailed in the Feature Engineering section. After their creation, the original timestamp variables were discarded (a minimal Pandas sketch of this derivation follows the snapshot below).
Sample content of trip_start_timestamp and trip_end_timestamp
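The sketch below derives the start-time features with Pandas; the timestamps are illustrative, and the same transformations apply to trip_end_timestamp.

import pandas as pd

# Illustrative dataframe; the real data is read from BigQuery
df = pd.DataFrame({
    "trip_start_timestamp": pd.to_datetime(
        ["2013-01-01 00:15:00", "2017-06-15 08:30:00"], utc=True
    )
})

df["minute_trip_start"] = df["trip_start_timestamp"].dt.minute
df["hour_trip_start"] = df["trip_start_timestamp"].dt.hour
df["day_trip_start"] = df["trip_start_timestamp"].dt.day
# Note: pandas dayofweek is 0=Monday..6=Sunday, whereas BigQuery's
# DAYOFWEEK (used in the final script) is 1=Sunday..7=Saturday
df["day_of_week_trip_start"] = df["trip_start_timestamp"].dt.dayofweek
df["month_trip_start"] = df["trip_start_timestamp"].dt.month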
We analyzed the distribution of variables using histograms and bar charts. Our findings indicate that most variables exhibit asymmetric distributions. For instance, the original target variable trip_seconds showed a right-hand skew before being transformed into trip_minutes.
Trip_seconds sample distribution
Some variables, like dropoff_community_area, displayed multimodal distributions:
Barplot of dropoff_community_area
Others, such as pickup_latitude, showed a symmetric distribution:
Pickup latitude sample distribution
Given the plan to train a Deep Neural Network regressor on the trip_minutes target variable, no transformations were applied to alter feature distributions, both because of time constraints and because deep learning models naturally accommodate such distributions. However, other data transformations were implemented to expedite model training.
We changed the data type of trip_seconds from integer to float64, enabling the division by 60 that creates the trip_minutes feature.
To enhance deep learning model training, we applied Min-Max Scaling to continuous variables using BigQuery SQL scripts. This scaling improves numerical stability and training performance.
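For reference, Min-Max Scaling maps each raw feature value onto the [0, 1] interval:

x_scaled = (x - x_min) / (x_max - x_min)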
Feature engineering steps were executed in a BigQuery script for performance optimization. As the script below shows, the steps included extracting time-related features from the trip timestamps, creating the trip_minutes target, removing rows with NULL values, label encoding the company and payment_type categorical variables, and Min-Max Scaling the continuous variables.
The resulting dataset was stored in gcp-ml-especialization-2024.spec_taxi_trips.chicago_taxi_trips_final_dataset1.
By implementing these feature engineering and data preprocessing steps, we optimized the dataset for effective deep learning model training, validation, and testing.
Due to the substantial size of the datasets and the limitations of the available infrastructure for this demo, it was necessary to further reduce the dataset size. This required minimizing the number of features considered for Deep Learning Model training, and prioritizing the retention of most rows for each feature. Consequently, the following variables were discarded for model training, validation, and testing:
The final datasets included these features:
This resulted in a total of seven features and one target variable.
As described in the previous section, all data preprocessing was carried out using BigQuery scripts. The final tables used in model training, validation, and testing were read as Pandas dataframes through the BigQuery API, which is callable from within Jupyter notebooks in Vertex AI Workbench.
For the deployed model, the endpoint can receive requests in JSON format, from .csv files in Cloud Storage, or from BigQuery tables, for both batch and online predictions.
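As an illustration, an online prediction request could be sent to the deployed endpoint as sketched below, assuming the google-cloud-aiplatform client library; the project, region, endpoint ID, and feature values are hypothetical placeholders.

from google.cloud import aiplatform

# Hypothetical project, region, and endpoint ID
aiplatform.init(project="my-project", location="us-central1")
endpoint = aiplatform.Endpoint("1234567890")

# One instance per prediction: a scaled feature vector in the order the model expects
instances = [[0.12, 0.05, 8.0, 32.0, 0.03, 0.41, 0.37]]
response = endpoint.predict(instances=instances)
print(response.predictions)  # predicted trip_minutes (scaled)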
We now present the feature engineering Big Query script in the code snippet below:
WITH temp1 AS (
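-- temp1: extract time features, create the trip_minutes target, and compute the trip distance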
SELECT
--taxi_id,
--trip_start_timestamp,
--trip_end_timestamp,
EXTRACT(MINUTE FROM trip_start_timestamp AT TIME ZONE "UTC") as minute_trip_start,
EXTRACT(HOUR FROM trip_start_timestamp AT TIME ZONE "UTC") as hour_trip_start,
EXTRACT(DAY FROM trip_start_timestamp AT TIME ZONE "UTC") as day_trip_start,
EXTRACT(DAYOFWEEK FROM trip_start_timestamp AT TIME ZONE "UTC") as day_of_week_trip_start,
EXTRACT(MONTH FROM trip_start_timestamp AT TIME ZONE "UTC") as month_trip_start,
--EXTRACT(YEAR FROM trip_start_timestamp AT TIME ZONE "UTC") as year_trip_start,
EXTRACT(MINUTE FROM trip_end_timestamp AT TIME ZONE "UTC") as minute_trip_end,
EXTRACT(HOUR FROM trip_end_timestamp AT TIME ZONE "UTC") as hour_trip_end,
EXTRACT(DAY FROM trip_end_timestamp AT TIME ZONE "UTC") as day_trip_end,
EXTRACT(DAYOFWEEK FROM trip_end_timestamp AT TIME ZONE "UTC") as day_of_week_trip_end,
EXTRACT(MONTH FROM trip_end_timestamp AT TIME ZONE "UTC") as month_trip_end,
-- EXTRACT(YEAR FROM trip_end_timestamp AT TIME ZONE "UTC") as year_trip_end,
CAST(trip_seconds AS float64)/60 AS trip_minutes,
trip_miles,
--pickup_census_tract,
--dropoff_census_tract,
pickup_community_area,
dropoff_community_area,
fare,
tips,
--tolls,
extras,
trip_total,
payment_type,
company,
ROUND(pickup_latitude,3) as pickup_latitude,
ROUND(pickup_longitude,3) as pickup_longitude,
--pickup_location,
ROUND(dropoff_latitude,3) as dropoff_latitude,
ROUND(dropoff_longitude,3) as dropoff_longitude,
--dropoff_location
ST_DISTANCE(ST_GEOGPOINT(pickup_longitude,pickup_latitude),ST_GEOGPOINT(dropoff_longitude,dropoff_latitude)) as distance
FROM `gcp-ml-especialization-2024.spec_taxi_trips.taxi_trips`
--WHERE trip_start_timestamp ='2013-01-01T00:00:00.000Z'
)
-- Eliminating NULL VALUES in all fields
, temp2 as (
SELECT * FROM temp1
WHERE
minute_trip_start IS NOT NULL AND
hour_trip_start IS NOT NULL AND
day_trip_start IS NOT NULL AND
day_of_week_trip_start IS NOT NULL AND
month_trip_start IS NOT NULL AND
minute_trip_end IS NOT NULL AND
hour_trip_end IS NOT NULL AND
day_trip_end IS NOT NULL AND
day_of_week_trip_end IS NOT NULL AND
month_trip_end IS NOT NULL AND
trip_minutes IS NOT NULL AND
trip_miles IS NOT NULL AND
pickup_community_area IS NOT NULL AND
dropoff_community_area IS NOT NULL AND
fare IS NOT NULL AND
--tolls IS NOT NULL AND
extras IS NOT NULL AND
trip_total IS NOT NULL AND
payment_type IS NOT NULL AND
company IS NOT NULL AND
pickup_latitude IS NOT NULL AND
pickup_longitude IS NOT NULL AND
dropoff_latitude IS NOT NULL AND
dropoff_longitude IS NOT NULL AND
distance IS NOT NULL
)
-- Label encoding of company categorical variable
, temp3 as (
SELECT
company,
DENSE_RANK() OVER (ORDER BY company ASC) AS company_num
FROM
(
SELECT DISTINCT company FROM temp2
)
)
, temp4 as (
SELECT * FROM temp2
LEFT JOIN temp3 USING (company)
)
-- Label encoding of payment_type categorical variable
, temp5 as (
SELECT
payment_type,
DENSE_RANK() OVER (ORDER BY payment_type ASC) AS payment_type_num
FROM
(
SELECT DISTINCT payment_type FROM temp4
)
)
, temp6 as (
SELECT * FROM temp4
LEFT JOIN temp5 USING (payment_type)
)
-------
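-- Min-Max Scaling of continuous variables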
, temp7 as (
SELECT
--taxi_id,
--trip_start_timestamp,
--trip_end_timestamp,
payment_type,
company,
minute_trip_start,
hour_trip_start,
day_trip_start,
day_of_week_trip_start,
month_trip_start,
--year_trip_start,
minute_trip_end,
hour_trip_end,
day_trip_end,
day_of_week_trip_end,
month_trip_end,
-- year_trip_end,
ML.MIN_MAX_SCALER(trip_minutes) OVER() as trip_minutes,
ML.MIN_MAX_SCALER(trip_miles) OVER() as trip_miles,
--pickup_census_tract,
--dropoff_census_tract,
pickup_community_area,
dropoff_community_area,
ML.MIN_MAX_SCALER(fare) OVER() as fare,
ML.MIN_MAX_SCALER(tips) OVER() as tips,
--tolls,
ML.MIN_MAX_SCALER(extras) OVER() as extras,
ML.MIN_MAX_SCALER(trip_total) OVER() as trip_total,
ML.MIN_MAX_SCALER(pickup_latitude) OVER() as pickup_latitude,
ML.MIN_MAX_SCALER(pickup_longitude) OVER() as pickup_longitude,
--pickup_location,
ML.MIN_MAX_SCALER(dropoff_latitude) OVER() as dropoff_latitude,
ML.MIN_MAX_SCALER(dropoff_longitude) OVER() as dropoff_longitude,
--dropoff_location,
ML.MIN_MAX_SCALER(distance) OVER() as distance,
company_num,
payment_type_num
FROM temp6
)
-------
, final1 as (
SELECT * FROM temp7
)
SELECT * FROM final1
Below we present a code snippet for reading BigQuery tables as Pandas dataframes within Vertex AI Workbench:
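This is a minimal sketch assuming the google-cloud-bigquery client library, which is available by default in Vertex AI Workbench; here it reads the training table referenced later in this post.

from google.cloud import bigquery

# Credentials are resolved automatically inside Vertex AI Workbench
client = bigquery.Client(project="gcp-ml-especialization-2024")

query = """
SELECT *
FROM `gcp-ml-especialization-2024.spec_taxi_trips.chicago_taxi_trips_final_dataset_train`
"""

# Run the query and materialize the result as a Pandas dataframe
df = client.query(query).to_dataframe()
print(df.shape)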
The machine learning models considered for this demo are deep learning neural network models.
The model selection criterion was performance on the validation set, according to the adopted metrics (MSE, R2 coefficient, and explained variance). The final chosen model was the one with the highest explained variance and R2 that, at the same time, did not present a very high MSE.
The dataset was split into training, validation, and test sets using the trip_start_timestamp field; random sampling was unnecessary. Instead, we split the original dataset on this timestamp variable as follows:
- The train dataset contains observations with trip_start_timestamp between '2013-01-01 00:00:00 UTC' and '2018-01-21 23:45:00 UTC';
- The validation dataset contains observations with trip_start_timestamp between '2018-01-21 23:45:01 UTC' and '2018-10-24 23:45:00 UTC';
- The test dataset contains observations with trip_start_timestamp greater than '2018-10-24 23:45:00 UTC'.
These boundaries were chosen so that the train, validation, and test sets contain approximately 70%, 20%, and 10% of the data, respectively.
The following queries were run in BigQuery to create these datasets:
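-- Train dataset (~70% of the data)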
SELECT *
FROM `gcp-ml-especialization-2024.spec_taxi_trips.chicago_taxi_trips_final_dataset1`
WHERE trip_start_timestamp BETWEEN '2013-01-01 00:00:00 UTC' AND '2018-01-21 23:45:00 UTC';
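-- Validation dataset (~20% of the data)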
SELECT *
FROM `gcp-ml-especialization-2024.spec_taxi_trips.chicago_taxi_trips_final_dataset1`
WHERE trip_start_timestamp BETWEEN '2018-01-21 23:45:01 UTC' AND '2018-10-24 23:45:00 UTC';
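-- Test dataset (~10% of the data)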
SELECT *
FROM `gcp-ml-especialization-2024.spec_taxi_trips.chicago_taxi_trips_final_dataset1`
WHERE trip_start_timestamp > '2018-10-24 23:45:00 UTC';
We chose deep neural network regression models for predicting taxi trip durations in Chicago. No special components (e.g., convolutional layers) were needed in the architecture. Rectified Linear Unit (ReLU) activation functions were adopted in all layers for their beneficial properties in deep architectures. The Adam optimization algorithm was selected to address common issues such as vanishing gradients in deep neural networks.
The Keras framework (with TensorFlow backend) was used, specifically version 2.12.0, due to Google Cloud's constraints for model registry and deployment. Keras was chosen for its abstraction over TensorFlow, simplifying model specification and training.
The final architecture was influenced by several factors:
After evaluating different architectures, the final neural network included:
Regarding hyperparameter tuning, the main hyperparameter considered was the learning rate used in training. Performance optimization aimed at increasing R2 and explained variance while reducing MSE.
Three deep learning models were trained, each with a different architecture and learning rate. The first model used a learning rate of 0.8, the second 0.08, and the third (the chosen final model) 0.01. Hyperparameter tuning was thus carried out across these three training sessions.
The final training session ran for 10 epochs with the following additional parameters:
All models trained in this demo used the Adam optimization algorithm. As explained in the previous section, each model used a different set of hyperparameters (including the learning rate and the number of epochs). All training sessions were conducted on the same training dataset (the chicago_taxi_trips_final_dataset_train table).
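To make this concrete, the sketch below shows how such a model could be specified and trained with Keras 2.12. The layer sizes, batch size, and synthetic data are illustrative assumptions; the ReLU activations, Adam optimizer with learning rate 0.01, MSE loss, and 10 epochs follow the final training session described above.

import numpy as np
from tensorflow import keras

# Synthetic stand-ins; in the real workflow these come from the
# BigQuery-backed train and validation dataframes (seven scaled features)
X_train, y_train = np.random.rand(1024, 7), np.random.rand(1024)
X_val, y_val = np.random.rand(256, 7), np.random.rand(256)

model = keras.Sequential([
    keras.layers.Input(shape=(7,)),
    keras.layers.Dense(64, activation="relu"),  # illustrative layer sizes
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1, activation="relu"),   # non-negative output for the scaled target
])

model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.01), loss="mse")

history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=10,
    batch_size=256,  # illustrative
)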
Regarding adherence to Google's Machine Learning Best Practices, several best practices were followed throughout this development.
The MSE metric was selected to assess model performance on training, validation, and test sets, as it emphasizes large errors and is a standard metric in academia and industry. Other metrics used included Mean Absolute Error (MAE), determination coefficient (R2), and Root Mean Squared Error (RMSE).
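As an illustration, these metrics can be computed from a model's predictions with scikit-learn; the y_true and y_pred values below are stand-ins for observed and predicted trip durations.

import numpy as np
from sklearn.metrics import (
    mean_squared_error,
    mean_absolute_error,
    r2_score,
    explained_variance_score,
)

# Stand-in values; in practice these come from the validation or test set
y_true = np.array([10.0, 25.0, 8.0, 40.0])
y_pred = np.array([12.0, 22.0, 9.0, 35.0])

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
ev = explained_variance_score(y_true, y_pred)
print(f"MSE={mse:.2f} RMSE={rmse:.2f} MAE={mae:.2f} R2={r2:.3f} EV={ev:.3f}")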
The model bias/variance tradeoff was assessed by analyzing the models' performance metrics jointly on the validation dataset. The best model was considered the one with the highest explained variance and R2 that, at the same time, did not present a very high MSE.
The final trained model achieved the following results on the validation set:
The deep learning regression model explained 27.3% of the variance in the validation set, which is considered suboptimal.
On the test set, the final regression model obtained the following results:
The model explained 23.98% of the variance in the test set, also considered suboptimal.
Given the constraints faced during development, these were the best achievable results. However, the main purpose of this use case was to demonstrate a complete workflow implementation of a machine learning solution using Google Cloud. This objective was successfully fulfilled.
To substantially improve the model's performance, consider the following next steps:
The main objective of this development was to present a complete workflow for developing a machine learning model to meet specific business needs using Google Cloud. This project provided a detailed description of all steps involved (from data exploration to model deployment) while adhering to Google Machine Learning Best Practices. Recommendations for future developments were also provided.