Demo
Demo video coming soon
A full walkthrough of NYC Taxi Trip Duration Prediction will be embedded here
Overview
An end-to-end machine learning pipeline trained on 3.4M+ NYC taxi records to predict trip duration. The project covers the full ML lifecycle: raw data ingestion, exploratory analysis, cyclical feature engineering, k-NN regression with hyperparameter tuning, and cross-validated evaluation — achieving R² near 1.0 and a 70% reduction in prediction error over a naive baseline.
Problem
- Predicting trip duration requires handling complex temporal patterns: rush-hour and day-of-week effects that linear encodings misrepresent
- Raw taxi data contains 3.4M+ records with missing values, outliers, and features that require domain-informed engineering before modeling
- Naive models that use raw hour values fail to capture cyclical time patterns, treating hour 23 and hour 0 as maximally different when they are temporally adjacent
- Model evaluation without proper cross-validation can hide overfitting and produce misleading performance claims
Solution
Built a full ML pipeline from raw data ingestion through cross-validated evaluation. Applied sine/cosine encoding to time features to properly represent their cyclical nature. Trained a k-NN regressor with grid search over k values and distance metrics. Validated with k-fold cross-validation and compared against a naive mean-duration baseline — making the performance story concrete and honest.
Key Features
Cyclical Feature Engineering
Sine/cosine encoding of hour, day-of-week, and month — preserving the circular nature of time and fixing a core failure point of standard linear encodings.
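As a rough illustration of the encoding described above, the sketch below maps a periodic value onto the unit circle and compares distances between hours. The function names `encode_cyclical` and `circle_distance` are hypothetical and are not taken from the project code.

```python
import math


def encode_cyclical(value: float, period: float) -> tuple[float, float]:
    """Map a periodic value onto the unit circle as a (sin, cos) pair."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def circle_distance(a: float, b: float, period: float) -> float:
    """Euclidean distance between two values after cyclical encoding."""
    sin_a, cos_a = encode_cyclical(a, period)
    sin_b, cos_b = encode_cyclical(b, period)
    return math.hypot(sin_a - sin_b, cos_a - cos_b)


# Hour of day has period 24; day of week would use 7 and month 12
# (with months shifted to 0-11 so the cycle starts at zero).
late_vs_midnight = circle_distance(23, 0, period=24)
noon_vs_midnight = circle_distance(12, 0, period=24)

# With raw integers, 23 and 0 are 23 units apart. After encoding,
# they are as close as any other pair of neighbouring hours.
print(late_vs_midnight < noon_vs_midnight)
```

Both components are needed: sine alone would give hour 3 and hour 9 the same value, and the cosine term is what separates them.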
k-NN Regression + Tuning
Grid search hyperparameter tuning over k values and distance metrics — selecting the optimal model through systematic, cross-validated comparison.
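The sketch below shows one way a cross-validated grid search over k and distance metric could look. It uses only the standard library and synthetic data. The page does not say which library or search grid the project used, so every name and value here is an assumption; in practice, scikit-learn's `KNeighborsRegressor` and `GridSearchCV` would typically fill this role.

```python
import math
import random
from itertools import product


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


METRICS = {"euclidean": euclidean, "manhattan": manhattan}


def knn_predict(train_X, train_y, query, k, metric):
    """Average the targets of the k training points nearest to the query."""
    ranked = sorted(zip(train_X, train_y), key=lambda pair: metric(pair[0], query))
    nearest = ranked[:k]
    return sum(target for _, target in nearest) / len(nearest)


def kfold_indices(n, n_folds, seed=0):
    """Shuffle the indices once and split them into disjoint validation folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def cv_mae(X, y, k, metric, n_folds=5):
    """Mean absolute error averaged over the validation folds."""
    fold_errors = []
    for val_idx in kfold_indices(len(X), n_folds):
        held_out = set(val_idx)
        train_X = [X[i] for i in range(len(X)) if i not in held_out]
        train_y = [y[i] for i in range(len(X)) if i not in held_out]
        errors = [abs(knn_predict(train_X, train_y, X[i], k, metric) - y[i])
                  for i in val_idx]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / len(fold_errors)


def grid_search(X, y, k_values, metric_names, n_folds=5):
    """Score every (k, metric) pair and return the one with the lowest CV error."""
    scores = {
        (k, name): cv_mae(X, y, k, METRICS[name], n_folds)
        for k, name in product(k_values, metric_names)
    }
    best = min(scores, key=scores.get)
    return best, scores


# Synthetic stand-in data, not the taxi dataset.
rng = random.Random(42)
X = [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(120)]
y = [3.0 * a + 2.0 * b + rng.gauss(0, 0.5) for a, b in X]

best_params, scores = grid_search(
    X, y, k_values=[1, 3, 5, 9], metric_names=["euclidean", "manhattan"]
)
print(best_params, round(scores[best_params], 3))
```

This brute-force neighbour search is quadratic in the number of rows, so it would not scale to millions of records; it is only meant to make the selection logic explicit.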
3.4M Record Pipeline
Full data pipeline handling 3.4M+ records with missing value imputation, outlier detection, and memory-efficient processing.
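The page does not specify which imputation or outlier rules were applied, so the sketch below assumes median imputation, a 1.5 × IQR outlier filter, and chunked iteration as one plausible combination. The helper names and the sample durations are made up for illustration.

```python
import statistics
from itertools import islice


def median_impute(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in values]


def iqr_bounds(values, factor=1.5):
    """Return (low, high) limits based on the interquartile range."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    return q1 - factor * spread, q3 + factor * spread


def in_chunks(rows, size):
    """Yield lists of at most `size` rows so the whole input is never held at once."""
    iterator = iter(rows)
    while chunk := list(islice(iterator, size)):
        yield chunk


# Hypothetical trip durations in seconds; 86000 stands in for a bad record.
durations = [420, 380, None, 510, 460, 86000, 395, None, 440, 405]

filled = median_impute(durations)
low, high = iqr_bounds(filled)
kept = [d for d in filled if low <= d <= high]
chunks = list(in_chunks(kept, size=4))
print(len(kept), [len(c) for c in chunks])
```

In a real pipeline the imputation statistics and outlier limits would normally be computed on training data only and then reused, for the same leakage reasons that apply to cross-validation.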
Visual Diagnostics
Residual plots, prediction vs. actual scatter plots, and feature visualizations — diagnosing model behavior, not just reporting metrics.
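A numeric counterpart to a residual plot is to average residuals within duration buckets, which makes a systematic bias in one range visible even when the overall error looks small. The sketch below assumes residuals are defined as predicted minus actual; the bucket edges and sample values are invented for illustration.

```python
from collections import defaultdict


def mean_residual_by_bucket(actual, predicted, edges):
    """Average (predicted - actual) within buckets of the actual value.

    `edges` must be sorted in ascending order; a value falls into the
    first bucket whose upper edge it is below.
    """
    buckets = defaultdict(list)
    for a, p in zip(actual, predicted):
        label = next((f"<{edge}" for edge in edges if a < edge), f">={edges[-1]}")
        buckets[label].append(p - a)
    return {label: sum(res) / len(res) for label, res in buckets.items()}


# Made-up durations in seconds: short trips are over-predicted, others are not.
actual = [120, 180, 240, 900, 1000, 2400, 2600]
predicted = [200, 250, 300, 890, 1010, 2390, 2610]

summary = mean_residual_by_bucket(actual, predicted, edges=[300, 1800])
print(summary)
```

A positive mean in the shortest bucket next to near-zero means elsewhere is the kind of pattern an aggregate metric would average away.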
Cross-Validated Evaluation
k-fold cross-validation with explicit baseline comparison: R² near 1.0 and a 70% error reduction over the naive mean-duration predictor.
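The two headline numbers relate to the baseline in a simple way: R² compares squared error against a constant mean predictor, and the error reduction is one minus the ratio of model error to baseline error. The sketch below works through both with toy numbers that are not the project's results. Using mean absolute error as the error measure is an assumption, since the page does not name the metric.

```python
import statistics


def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """1 - SS_res / SS_tot, where SS_tot uses the mean of the actual values."""
    mean_actual = statistics.fmean(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def error_reduction(baseline_error, model_error):
    """Fraction of the baseline error removed by the model."""
    return 1.0 - model_error / baseline_error


# Toy durations in seconds, for illustration only.
train_durations = [450, 750, 1050]
actual = [300, 600, 900, 1200]
model_pred = [330, 570, 930, 1170]

# The naive baseline predicts the training mean for every trip.
baseline_pred = [statistics.fmean(train_durations)] * len(actual)

baseline_mae = mean_absolute_error(actual, baseline_pred)
model_mae = mean_absolute_error(actual, model_pred)
reduction = error_reduction(baseline_mae, model_mae)
model_r2 = r_squared(actual, model_pred)
print(baseline_mae, model_mae, reduction, model_r2)
```

Reporting both numbers side by side guards against a high R² that comes mostly from a wide spread in the target rather than from accurate predictions.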
Impact
- Achieved R² near 1.0: near-perfect prediction performance validated with k-fold cross-validation, not just a single train/test split
- 70% reduction in prediction error compared to a naive mean-duration baseline — a quantified, meaningful improvement with an honest reference point
- Cyclical time encoding was the single highest-impact decision — directly responsible for the majority of the error reduction over standard approaches
What I Learned
- Cyclical feature encoding: why linear time representations fail and how sine/cosine transformation fixes it at the geometry level
- k-NN behavior at scale — computational cost tradeoffs, distance metric sensitivity, and the importance of feature scaling before fitting
- Baseline models are not optional: without a naive comparator, "R² near 1.0" is an incomplete and potentially misleading result
- Cross-validation rigor: using k-fold correctly, with preprocessing fitted only on the training folds, to prevent data leakage between train and validation splits
- Visual diagnostics are as important as metrics — residual patterns revealed systematic over-prediction on short trips that aggregate metrics alone masked
- The full ML lifecycle matters as much as the model: data quality, feature decisions, and evaluation rigor all contribute to a credible result
- Communicating model performance requires context: the metric only means something next to the baseline it beat
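The scaling and leakage points in the list above can be made concrete with a small sketch. It assumes standardization (zero mean, unit variance) as the scaling method, which the page does not confirm, and the helper names are hypothetical. The key detail is that the scaler statistics come from the training fold only and are then reused on the validation fold.

```python
import statistics


def fit_scaler(rows):
    """Compute per-column mean and standard deviation from training rows only."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Fall back to 1.0 for constant columns to avoid dividing by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds


def apply_scaler(rows, means, stds):
    """Standardize rows using statistics that were fitted elsewhere."""
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, stds))
        for row in rows
    ]


# Two made-up features on very different scales,
# e.g. passenger count and trip distance in meters.
train_fold = [(1.0, 1000.0), (2.0, 3000.0), (3.0, 5000.0)]
valid_fold = [(4.0, 7000.0)]

means, stds = fit_scaler(train_fold)          # fitted on training data only
train_scaled = apply_scaler(train_fold, means, stds)
valid_scaled = apply_scaler(valid_fold, means, stds)  # reuses training statistics
print(valid_scaled)
```

Without scaling, the second feature would dominate every k-NN distance simply because its raw values are larger. Fitting the scaler on all rows before splitting would let validation data influence the training features.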