NYC Taxi Trip Duration Prediction

Overview

An end-to-end machine learning pipeline trained on 3.4M+ NYC taxi records to predict trip duration. The project covers the full ML lifecycle: raw data ingestion, exploratory analysis, cyclical feature engineering, k-NN regression with hyperparameter tuning, and cross-validated evaluation — achieving R² near 1.0 and a 70% reduction in prediction error over a naive baseline.

Problem

01 Predicting trip duration requires handling complex temporal patterns: rush hour and day-of-week effects that linear encodings misrepresent
02 Raw taxi data contains 3.4M+ records with missing values, outliers, and features that require domain-informed engineering before modeling
03 Naive baseline models fail to capture cyclical time patterns, treating hour 23 and hour 0 as maximally different when they're temporally adjacent
04 Model evaluation without proper cross-validation leads to overfitting and misleading performance claims

Solution

Built a full ML pipeline from raw data ingestion through cross-validated evaluation. Applied sine/cosine encoding to time features to properly represent their cyclical nature. Trained a k-NN regressor with grid search over k values and distance metrics. Validated with k-fold cross-validation and compared against a naive mean-duration baseline — making the performance story concrete and honest.

Key Features

Cyclical Feature Engineering

Sine/cosine encoding of hour, day-of-week, and month — preserving the circular nature of time and fixing a core failure point of standard linear encodings.
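The encoding described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the column name `pickup_datetime` and the helper `add_cyclical_features` are assumptions made for the example. Each time component is mapped to an angle on the unit circle, so hour 23 and hour 0 end up close together.

```python
import numpy as np
import pandas as pd


def add_cyclical_features(df: pd.DataFrame, datetime_col: str = "pickup_datetime") -> pd.DataFrame:
    """Return a copy of df with sin/cos encodings of hour, day-of-week, and month.

    The column name is a hypothetical default, not taken from the project.
    """
    out = df.copy()
    ts = pd.to_datetime(out[datetime_col])
    # (feature name, integer values, period of the cycle)
    parts = [
        ("hour", ts.dt.hour, 24),
        ("dow", ts.dt.dayofweek, 7),
        ("month", ts.dt.month - 1, 12),  # shift months from 1..12 to 0..11
    ]
    for name, values, period in parts:
        angle = 2 * np.pi * values / period
        out[f"{name}_sin"] = np.sin(angle)
        out[f"{name}_cos"] = np.cos(angle)
    return out


# Three made-up pickups: 23:30, 00:15 the next day, and noon.
demo = pd.DataFrame(
    {"pickup_datetime": ["2016-01-01 23:30:00", "2016-01-02 00:15:00", "2016-01-02 12:00:00"]}
)
encoded = add_cyclical_features(demo)

# Euclidean distance between encoded hours: 23 vs 0 should be small, 0 vs 12 maximal.
pts = encoded[["hour_sin", "hour_cos"]].to_numpy()
dist_23_to_0 = np.linalg.norm(pts[0] - pts[1])
dist_0_to_12 = np.linalg.norm(pts[1] - pts[2])
print(f"23h vs 0h: {dist_23_to_0:.3f}, 0h vs 12h: {dist_0_to_12:.3f}")
```

Both the sine and the cosine column are needed: either one alone maps two different hours to the same value, while the pair identifies each hour uniquely.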

k-NN Regression + Tuning

Grid search hyperparameter tuning over k values and distance metrics — selecting the optimal model through systematic, cross-validated comparison.
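A grid search of this kind might look like the sketch below. It assumes scikit-learn's `GridSearchCV` with a scaling step in the pipeline; the synthetic data, the candidate k values, and the two metrics are placeholders chosen for the example, not the grid the project actually searched.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the real pipeline would use the engineered taxi features.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=300)

# Scaling lives inside the pipeline so it is re-fit on each training fold.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsRegressor()),
])

# Hypothetical search space over k and the distance metric.
param_grid = {
    "knn__n_neighbors": [3, 5, 11, 21],
    "knn__metric": ["euclidean", "manhattan"],
}

search = GridSearchCV(
    pipeline,
    param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",  # scikit-learn maximizes, hence the negated RMSE
)
search.fit(X, y)

best_rmse = -search.best_score_
print(search.best_params_, f"cv RMSE = {best_rmse:.3f}")
```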

3.4M Record Pipeline

Full data pipeline handling 3.4M+ records with missing value imputation, outlier detection, and memory-efficient processing.
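The cleaning steps named above could be sketched roughly as follows. The column names (`passenger_count`, `trip_duration`), the median imputation, and the 1.5 × IQR outlier rule are assumptions for illustration; the project's actual rules are not stated here.

```python
import numpy as np
import pandas as pd


def clean_trips(df: pd.DataFrame) -> pd.DataFrame:
    """Impute, filter outliers, and downcast. Column names are hypothetical."""
    out = df.copy()

    # Impute a missing feature with its median; drop rows with no target value.
    out["passenger_count"] = out["passenger_count"].fillna(out["passenger_count"].median())
    out = out.dropna(subset=["trip_duration"])

    # Flag outliers with the 1.5 * IQR rule on the target (an assumed rule).
    q1, q3 = out["trip_duration"].quantile([0.25, 0.75])
    iqr = q3 - q1
    in_range = out["trip_duration"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    out = out.loc[in_range].copy()

    # Downcast 64-bit floats to the smallest float dtype to reduce memory use.
    for col in out.select_dtypes(include="float64").columns:
        out[col] = pd.to_numeric(out[col], downcast="float")

    return out.reset_index(drop=True)


# Tiny made-up sample: one missing feature, one missing target, one extreme duration.
raw = pd.DataFrame({
    "passenger_count": [1.0, np.nan, 2.0, 1.0, 3.0, 1.0],
    "trip_duration": [600.0, 720.0, 540.0, np.nan, 660.0, 86400.0],
})
clean = clean_trips(raw)
print(clean)
print(clean.dtypes)
```

For files too large to load at once, `pandas.read_csv` accepts a `chunksize` argument that yields the data in pieces, so a function like this can be applied chunk by chunk. Quantile thresholds would then need to be computed once up front rather than per chunk.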

Visual Diagnostics

Residual plots, prediction vs. actual scatter plots, and feature visualizations — diagnosing model behavior, not just reporting metrics.
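As a rough sketch of the two core diagnostic plots, assuming Matplotlib and synthetic predictions in place of the model's real output:

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script also runs headless
import matplotlib.pyplot as plt
import numpy as np

# Synthetic actual/predicted durations in seconds, for illustration only.
rng = np.random.default_rng(0)
y_true = rng.uniform(60, 3600, size=500)
y_pred = y_true + rng.normal(scale=120, size=500)
residuals = y_true - y_pred

fig, (ax_scatter, ax_resid) = plt.subplots(1, 2, figsize=(10, 4))

# Predicted vs. actual: points on the diagonal are perfect predictions.
ax_scatter.scatter(y_true, y_pred, s=8, alpha=0.5)
lims = [y_true.min(), y_true.max()]
ax_scatter.plot(lims, lims, color="black", linewidth=1)
ax_scatter.set_xlabel("Actual duration (s)")
ax_scatter.set_ylabel("Predicted duration (s)")
ax_scatter.set_title("Predicted vs. actual")

# Residuals vs. predicted: structure around the zero line signals systematic error.
ax_resid.scatter(y_pred, residuals, s=8, alpha=0.5)
ax_resid.axhline(0, color="black", linewidth=1)
ax_resid.set_xlabel("Predicted duration (s)")
ax_resid.set_ylabel("Residual (actual - predicted, s)")
ax_resid.set_title("Residuals")

fig.tight_layout()

# Render to an in-memory PNG; pass a file path to savefig to write to disk instead.
buffer = io.BytesIO()
fig.savefig(buffer, format="png", dpi=120)
```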

Cross-Validated Evaluation

k-fold cross-validation with explicit baseline comparison: R² near 1.0 and a 70% error reduction over the naive mean-duration predictor.
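One way to set up such a comparison is sketched below, assuming scikit-learn's `DummyRegressor` as the mean-duration baseline and RMSE as the error metric. The data is synthetic and the metric choice is an assumption, so the numbers printed here say nothing about the project's reported results.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, not the taxi dataset.
rng = np.random.default_rng(42)
X = rng.normal(size=(400, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.2, size=400)

# The same folds are used for both models so the comparison is like-for-like.
cv = KFold(n_splits=5, shuffle=True, random_state=0)


def cv_rmse(model) -> float:
    """Mean RMSE across folds (scikit-learn returns the negated value)."""
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_root_mean_squared_error")
    return float(-scores.mean())


baseline_rmse = cv_rmse(DummyRegressor(strategy="mean"))  # always predicts the training mean
knn_rmse = cv_rmse(make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5)))

# Fractional error reduction relative to the baseline.
error_reduction = 1.0 - knn_rmse / baseline_rmse
print(f"baseline RMSE = {baseline_rmse:.3f}, k-NN RMSE = {knn_rmse:.3f}, "
      f"reduction = {error_reduction:.1%}")
```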

Tech Stack

Backend

Python

AI/ML

Scikit-learn, k-NN Regression, Feature Engineering

Data

Pandas, NumPy

Analytics

Matplotlib

Impact

Achieved R² near 1.0: near-perfect prediction performance validated with k-fold cross-validation, not just a single train/test split
70% reduction in prediction error compared to a naive mean-duration baseline — a quantified, meaningful improvement with an honest reference point
Cyclical time encoding was the single highest-impact decision — directly responsible for the majority of the error reduction over standard approaches

What I Learned

Technical

Cyclical feature encoding: why linear time representations fail and how sine/cosine transformation fixes it at the geometry level
k-NN behavior at scale — computational cost tradeoffs, distance metric sensitivity, and the importance of feature scaling before fitting
Baseline models are not optional: without a naive comparator, "R² near 1.0" is an incomplete and potentially misleading result
Cross-validation rigor — using k-fold correctly to prevent data leakage between train and validation splits
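The scaling and leakage points above can be illustrated with an explicit fold loop. This is a sketch on synthetic data, with one informative small-scale feature and one uninformative large-scale feature chosen deliberately for the demonstration. The scaler is fit on the training fold only, so no validation statistics leak into preprocessing.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

# Feature 0 drives the target; feature 1 is noise on a much larger scale.
rng = np.random.default_rng(1)
X = rng.normal(loc=[0, 1000], scale=[1, 500], size=(300, 2))
y = 5.0 * X[:, 0] + rng.normal(scale=0.5, size=300)


def cv_rmse(X: np.ndarray, y: np.ndarray, scale: bool) -> float:
    """Mean validation RMSE over 5 folds, optionally scaling inside each fold."""
    fold_rmse = []
    folds = KFold(n_splits=5, shuffle=True, random_state=0)
    for train_idx, val_idx in folds.split(X):
        X_train, X_val = X[train_idx], X[val_idx]
        if scale:
            # Fit on the training fold only, then apply the same transform to both.
            scaler = StandardScaler().fit(X_train)
            X_train, X_val = scaler.transform(X_train), scaler.transform(X_val)
        model = KNeighborsRegressor(n_neighbors=5).fit(X_train, y[train_idx])
        preds = model.predict(X_val)
        fold_rmse.append(np.sqrt(mean_squared_error(y[val_idx], preds)))
    return float(np.mean(fold_rmse))


unscaled_rmse = cv_rmse(X, y, scale=False)
scaled_rmse = cv_rmse(X, y, scale=True)
print(f"unscaled RMSE = {unscaled_rmse:.3f}, scaled RMSE = {scaled_rmse:.3f}")
```

Without scaling, the large-scale feature dominates the distance computation and the neighbors chosen are nearly unrelated to the target. Wrapping the scaler and model in a scikit-learn `Pipeline` achieves the same per-fold fitting with less code.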

Product Thinking

Visual diagnostics are as important as metrics — residual patterns revealed systematic over-prediction on short trips that aggregate metrics alone masked
The full ML lifecycle matters as much as the model — data quality, feature decisions, and evaluation rigor each contribute equally to a credible result
Communicating model performance requires context: the metric only means something next to the baseline it beat
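The first point, that aggregate metrics can hide a bias confined to short trips, can be checked numerically by bucketing errors by actual duration. The sketch below uses synthetic predictions that are deliberately constructed to over-predict trips under five minutes; the bucket edges and the size of the bias are made up for the example.

```python
import numpy as np
import pandas as pd

# Synthetic data: predictions are biased upward only for trips shorter than 300 s.
rng = np.random.default_rng(7)
y_true = rng.uniform(60, 3600, size=2000)
bias = np.where(y_true < 300, 90.0, 0.0)
y_pred = y_true + bias + rng.normal(scale=30, size=2000)

diag = pd.DataFrame({"actual": y_true, "error": y_pred - y_true})

# Bucket trips by actual duration (hypothetical edges, in seconds).
bins = [0, 300, 900, 1800, 3600]
diag["bucket"] = pd.cut(diag["actual"], bins=bins)

# Mean signed error per bucket: positive values mean over-prediction.
summary = diag.groupby("bucket", observed=True)["error"].agg(["mean", "count"])
overall_mean_error = diag["error"].mean()

print(summary)
print(f"overall mean error = {overall_mean_error:.1f} s")
```

The overall mean error stays small because short trips are a small share of the sample, while the per-bucket view isolates the bias in the shortest bucket.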

Access

Repository available on GitHub.
