Machine Learning · Data Science

NYC Taxi Trip Duration Prediction

R² near 1.0. 70% error reduction. 3.4M records.

Python · Scikit-learn · Pandas · NumPy · Matplotlib · k-NN Regression · Feature Engineering

01

Overview

An end-to-end machine learning pipeline trained on 3.4M+ NYC taxi records to predict trip duration. The project covers the full ML lifecycle: raw data ingestion, exploratory analysis, cyclical feature engineering, k-NN regression with hyperparameter tuning, and cross-validated evaluation — achieving R² near 1.0 and a 70% reduction in prediction error over a naive baseline.

02

Problem

  • 01 Predicting trip duration requires handling complex temporal patterns: rush-hour and day-of-week effects that linear encodings misrepresent
  • 02 Raw taxi data contains 3.4M+ records with missing values, outliers, and features that require domain-informed engineering before modeling
  • 03 Models fed raw integer time features fail to capture cyclical patterns: they treat hour 23 and hour 0 as maximally different when the two are temporally adjacent
  • 04 Evaluating a model without proper cross-validation can hide overfitting and produce misleading performance claims
03

Solution

Built a full ML pipeline from raw data ingestion through cross-validated evaluation. Applied sine/cosine encoding to time features to properly represent their cyclical nature. Trained a k-NN regressor with grid search over k values and distance metrics. Validated with k-fold cross-validation and compared against a naive mean-duration baseline — making the performance story concrete and honest.

04

Key Features

Cyclical Feature Engineering

Sine/cosine encoding of hour, day-of-week, and month — preserving the circular nature of time and fixing a core failure point of standard linear encodings.
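The encoding can be sketched as follows. This is a minimal illustration, not the project's actual code: the helper names `add_cyclical` and `hour_distance` and the column name `hour` are assumptions made for the example.

```python
import numpy as np
import pandas as pd


def add_cyclical(df: pd.DataFrame, col: str, period: int) -> pd.DataFrame:
    """Return a copy of df with <col>_sin and <col>_cos columns added."""
    out = df.copy()
    # Map the value onto an angle around a circle with `period` positions.
    angle = 2 * np.pi * out[col] / period
    out[f"{col}_sin"] = np.sin(angle)
    out[f"{col}_cos"] = np.cos(angle)
    return out


# Tiny illustrative frame; the column name "hour" is hypothetical.
trips = pd.DataFrame({"hour": [0, 11, 12, 23]})
encoded = add_cyclical(trips, "hour", period=24)


def hour_distance(a: int, b: int) -> float:
    """Euclidean distance between two hours in (sin, cos) space."""
    pts = encoded.set_index("hour")[["hour_sin", "hour_cos"]]
    return float(np.linalg.norm(pts.loc[a].to_numpy() - pts.loc[b].to_numpy()))
```

With this encoding, `hour_distance(23, 0)` equals `hour_distance(11, 12)`, because both pairs are one hour apart on the circle, while hours 0 and 12 sit at opposite points. The same helper would apply to day-of-week with `period=7` and month with `period=12`.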

k-NN Regression + Tuning

Grid search hyperparameter tuning over k values and distance metrics — selecting the optimal model through systematic, cross-validated comparison.
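A search of this kind might look like the sketch below. The data is synthetic, and the candidate values for k, the two distance metrics, and the MAE scoring choice are assumptions for illustration; the project's actual grid is not specified here.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data: three features already on the same [0, 1] scale.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 3))
y = 10 * X[:, 0] + 5 * X[:, 1] + rng.normal(0, 0.1, size=300)

# Hypothetical grid: candidate neighbor counts and distance metrics.
param_grid = {
    "n_neighbors": [3, 5, 11, 21],
    "metric": ["euclidean", "manhattan"],
}

# Each combination is scored with 5-fold cross-validation.
search = GridSearchCV(
    KNeighborsRegressor(),
    param_grid,
    cv=5,
    scoring="neg_mean_absolute_error",
)
search.fit(X, y)
best_params = search.best_params_
```

`search.best_params_` holds the winning combination and `search.cv_results_` the per-combination scores, which makes the comparison auditable rather than a single reported number.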

3.4M Record Pipeline

Full data pipeline handling 3.4M+ records with missing value imputation, outlier detection, and memory-efficient processing.
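The cleaning steps could be sketched roughly as below. The column names `passenger_count` and `trip_duration`, the median imputation, and the 1.5 × IQR outlier rule are all assumptions chosen for the example, not confirmed details of the project.

```python
import numpy as np
import pandas as pd


def clean_trips(df: pd.DataFrame) -> pd.DataFrame:
    """Impute, drop duration outliers, and downcast dtypes (illustrative)."""
    out = df.copy()
    # Impute missing passenger counts with the column median (assumed strategy).
    median_passengers = out["passenger_count"].median()
    out["passenger_count"] = out["passenger_count"].fillna(median_passengers)
    # Keep only durations inside the 1.5 * IQR fences (assumed outlier rule).
    q1, q3 = out["trip_duration"].quantile([0.25, 0.75])
    iqr = q3 - q1
    keep = out["trip_duration"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    out = out.loc[keep].copy()
    # Use smaller dtypes to reduce memory on large frames.
    out["passenger_count"] = out["passenger_count"].astype("int8")
    out["trip_duration"] = pd.to_numeric(out["trip_duration"], downcast="float")
    return out


# Six made-up rows: one missing passenger count, one 24-hour "trip".
raw = pd.DataFrame({
    "passenger_count": [1, 2, np.nan, 1, 3, 1],
    "trip_duration": [600.0, 720.0, 540.0, 660.0, 580.0, 86400.0],
})
cleaned = clean_trips(raw)
```

On the toy frame the 86,400-second row falls outside the upper fence and is dropped, leaving five rows with no missing values.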

Visual Diagnostics

Residual plots, prediction vs. actual scatter plots, and feature visualizations — diagnosing model behavior, not just reporting metrics.
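A minimal sketch of the two core plots, using made-up predictions rather than the project's outputs. Residuals are defined here as actual minus predicted, which is an assumed convention; under it, over-prediction shows up as points below zero.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic durations in seconds, with noisy "predictions" for illustration.
rng = np.random.default_rng(1)
y_true = rng.uniform(120, 3600, size=500)
y_pred = y_true + rng.normal(0, 90, size=500)
residuals = y_true - y_pred  # actual minus predicted

fig, (ax_scatter, ax_resid) = plt.subplots(1, 2, figsize=(10, 4))

# Predicted vs. actual: points on the diagonal are perfect predictions.
ax_scatter.scatter(y_true, y_pred, s=5, alpha=0.4)
lims = [y_true.min(), y_true.max()]
ax_scatter.plot(lims, lims, color="black", linewidth=1)
ax_scatter.set_xlabel("Actual duration (s)")
ax_scatter.set_ylabel("Predicted duration (s)")

# Residuals vs. predicted: structure around the zero line signals bias.
ax_resid.scatter(y_pred, residuals, s=5, alpha=0.4)
ax_resid.axhline(0, color="black", linewidth=1)
ax_resid.set_xlabel("Predicted duration (s)")
ax_resid.set_ylabel("Residual (s)")

fig.tight_layout()
```

Calling `fig.savefig("diagnostics.png")` would write the figure to disk; the file name is only an example.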

Cross-Validated Evaluation

k-fold cross-validation with explicit baseline comparison: R² near 1.0 and a 70% error reduction over the naive mean-duration predictor.
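The baseline comparison can be sketched like this. The data is synthetic, and mean absolute error is an assumed choice of error metric, so the resulting numbers are illustrative and do not reproduce the project's 70% figure.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data: duration depends mostly on the first feature.
rng = np.random.default_rng(2)
X = rng.uniform(0, 1, size=(400, 2))
y = 600 + 1800 * X[:, 0] + rng.normal(0, 30, size=400)

# Same folds for both models so the comparison is like-for-like.
cv = KFold(n_splits=5, shuffle=True, random_state=0)


def cv_mae(model) -> float:
    """Mean absolute error averaged over the cross-validation folds."""
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_absolute_error")
    return float(-scores.mean())


# Naive baseline: always predict the mean duration of the training fold.
baseline_mae = cv_mae(DummyRegressor(strategy="mean"))
knn_mae = cv_mae(KNeighborsRegressor(n_neighbors=5))

# Fraction of the baseline error removed by the model.
error_reduction = 1 - knn_mae / baseline_mae
```

Reporting `error_reduction` next to `baseline_mae` keeps the reference point visible, which is what makes a claim such as "70% error reduction" interpretable.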

05

Tech Stack

Backend: Python
AI/ML: Scikit-learn · k-NN Regression · Feature Engineering
Data: Pandas · NumPy
Analytics: Matplotlib
06

Impact

  • Achieved R² near 1.0, validated with k-fold cross-validation rather than a single train/test split
  • 70% reduction in prediction error compared to a naive mean-duration baseline — a quantified, meaningful improvement with an honest reference point
  • Cyclical time encoding was the single highest-impact decision — directly responsible for the majority of the error reduction over standard approaches
07

What I Learned

Technical
  • Cyclical feature encoding: why linear time representations fail and how sine/cosine transformation fixes it at the geometry level
  • k-NN behavior at scale — computational cost tradeoffs, distance metric sensitivity, and the importance of feature scaling before fitting
  • Baseline models are not optional: without a naive comparator, "R² near 1.0" is an incomplete and potentially misleading result
  • Cross-validation rigor — using k-fold correctly to prevent data leakage between train and validation splits
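One common way to get both the scaling and the leakage points right is to put the scaler inside a scikit-learn pipeline, so it is refit on the training portion of every fold. The sketch below uses synthetic data with deliberately mismatched feature scales; it is an illustration of the idea, not the project's code.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Two features on very different scales; only the small-scale one drives y.
rng = np.random.default_rng(3)
X = np.column_stack([rng.uniform(0, 1, 300), rng.uniform(0, 1000, 300)])
y = 20 * X[:, 0] + rng.normal(0, 0.5, 300)

cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Scaler lives inside the pipeline, so its mean and variance are learned
# from each training fold only and never see the validation fold.
scaled_model = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
scaled_r2 = cross_val_score(scaled_model, X, y, cv=cv, scoring="r2").mean()

# Without scaling, the large-range feature dominates the distance metric.
unscaled_model = KNeighborsRegressor(n_neighbors=5)
unscaled_r2 = cross_val_score(unscaled_model, X, y, cv=cv, scoring="r2").mean()
```

On this toy data the scaled pipeline scores far higher than the unscaled model, because unscaled distances are driven almost entirely by the irrelevant large-range feature.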
Product Thinking
  • Visual diagnostics are as important as metrics — residual patterns revealed systematic over-prediction on short trips that aggregate metrics alone masked
  • The full ML lifecycle matters as much as the model: data quality, feature decisions, and evaluation rigor all contribute to a credible result
  • Communicating model performance requires context: the metric only means something next to the baseline it beat
08

Access

Repository available on GitHub — link above.
