
Project Overview

This project benchmarks four regression models (Linear Regression, Random Forest, Gradient Boosting, and XGBoost) for used car price prediction. It is deployed as an interactive Streamlit app where users enter car details and instantly get an estimated market price.

Best R² (XGBoost): 0.92
Avg RMSE: ₹54K
Models Compared: 4
Streamlit App: Live

Feature Engineering

Raw used car data needed significant cleaning and feature engineering before modeling:

  • Age: current year − manufacturing year
  • Brand encoding: Target encoding (brand average price), which avoids the 40+ sparse columns that one-hot encoding would create
  • Mileage transformation: Log transform to handle right-skewed distribution
  • Fuel type & transmission: Integer label encoding (each category mapped to an integer code)
  • Engine + Power: Extracted numeric values from "1498 CC" and "102 bhp" strings
Python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df['car_age']   = 2026 - df['year']  # current year is hardcoded; update when retraining
df['log_km']    = np.log1p(df['kms_driven'])
df['engine_cc'] = df['engine'].str.extract(r'(\d+)').astype(float)
df['power_bhp'] = df['max_power'].str.extract(r'(\d+\.?\d*)').astype(float)

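# Target encoding for brand
# Note: computing the averages on the full dataset leaks test-set prices;
# compute brand_avg on the training split only when evaluating models.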
brand_avg = df.groupby('brand')['selling_price'].mean()
df['brand_enc'] = df['brand'].map(brand_avg)

df['fuel_enc']  = LabelEncoder().fit_transform(df['fuel'])
df['trans_enc'] = LabelEncoder().fit_transform(df['transmission'])

features = ['car_age','log_km','engine_cc','power_bhp',
            'brand_enc','fuel_enc','trans_enc','seats']
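
Because target encoding uses the label itself, the brand averages are best learned from the training rows only and then applied to the test rows. The following is a minimal sketch of that idea, not the project's actual pipeline: the helper name `add_brand_encoding`, the toy data, the 75/25 split, and the fallback to the overall training mean for unseen brands are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def add_brand_encoding(train, test, col="brand", target="selling_price"):
    """Target-encode `col` using statistics from the training split only."""
    train, test = train.copy(), test.copy()
    brand_avg = train.groupby(col)[target].mean()
    # Assumed fallback: brands absent from training get the overall training mean.
    global_avg = train[target].mean()
    train["brand_enc"] = train[col].map(brand_avg)
    test["brand_enc"] = test[col].map(brand_avg).fillna(global_avg)
    return train, test


# Toy data for illustration only (not taken from the project dataset).
toy = pd.DataFrame({
    "brand": ["Maruti", "Maruti", "Hyundai", "Hyundai",
              "Honda", "Honda", "Tata", "Tata"],
    "selling_price": [300000, 350000, 450000, 500000,
                      600000, 650000, 250000, 280000],
})

train_df, test_df = train_test_split(toy, test_size=0.25, random_state=42)
train_df, test_df = add_brand_encoding(train_df, test_df)
```

With this ordering (split first, encode second), the test-set score is not inflated by prices the model should never have seen.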

Model Comparison

Python
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
from sklearn.metrics import r2_score, mean_squared_error

models = {
    'Linear Regression':    LinearRegression(),
    'Random Forest':        RandomForestRegressor(n_estimators=200, max_depth=15),
    'Gradient Boosting':    GradientBoostingRegressor(n_estimators=200, learning_rate=0.05),
    'XGBoost':              XGBRegressor(n_estimators=300, learning_rate=0.05, max_depth=6),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
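    # Take the square root explicitly: the `squared=False` argument is
    # deprecated since scikit-learn 1.4.
    rmse = mean_squared_error(y_test, preds) ** 0.5
    print(f"{name}: R²={r2_score(y_test, preds):.4f}  RMSE={rmse:,.0f}")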
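
To serve a fitted model from the Streamlit app, the raw values a user enters have to go through the same transformations as the training data. Below is a rough sketch of how that step might look. The function `build_feature_row`, its parameters, the dictionary-based lookups, and the fallback for unknown brands are hypothetical and are not taken from the project's code.

```python
import numpy as np
import pandas as pd

# Same column order as the training feature list.
FEATURES = ['car_age', 'log_km', 'engine_cc', 'power_bhp',
            'brand_enc', 'fuel_enc', 'trans_enc', 'seats']


def build_feature_row(year, kms_driven, engine_cc, power_bhp, brand, fuel,
                      transmission, seats, brand_avg, fuel_codes, trans_codes,
                      reference_year=2026):
    """Turn one car's raw details into a single-row feature DataFrame.

    brand_avg, fuel_codes and trans_codes are plain dicts that are assumed
    to have been saved at training time (hypothetical artefacts).
    """
    # Assumed fallback: unknown brands get the mean of the known brand averages.
    default_brand = float(np.mean(list(brand_avg.values())))
    row = {
        'car_age': reference_year - year,
        'log_km': np.log1p(kms_driven),
        'engine_cc': float(engine_cc),
        'power_bhp': float(power_bhp),
        'brand_enc': brand_avg.get(brand, default_brand),
        'fuel_enc': fuel_codes[fuel],            # KeyError on an unknown fuel type
        'trans_enc': trans_codes[transmission],  # KeyError on an unknown transmission
        'seats': seats,
    }
    return pd.DataFrame([row], columns=FEATURES)


# Example with made-up lookup tables.
example = build_feature_row(
    year=2016, kms_driven=50000, engine_cc=1498, power_bhp=102,
    brand='Honda', fuel='Petrol', transmission='Manual', seats=5,
    brand_avg={'Maruti': 400000.0, 'Honda': 600000.0},
    fuel_codes={'Diesel': 0, 'Petrol': 1},
    trans_codes={'Automatic': 0, 'Manual': 1},
)
```

In the app, the arguments would typically come from widgets such as `st.number_input` and `st.selectbox`, and the returned row would be passed to `model.predict`.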

Results

Linear Regression R²: 0.71
Random Forest R²: 0.88
Gradient Boosting R²: 0.90
XGBoost R²: 0.92 ✅
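
For readers less familiar with the two metrics reported here, this short sketch computes R² and RMSE directly from their definitions. The helper name `r2_and_rmse` is made up for this example; the project itself uses scikit-learn's metric functions.

```python
import numpy as np


def r2_and_rmse(y_true, y_pred):
    """Return (R², RMSE) computed from their textbook definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    rmse = np.sqrt(ss_res / len(y_true))
    return r2, rmse
```

R² is the share of price variance the model explains (1.0 is a perfect fit, 0.0 is no better than always predicting the mean), while RMSE is expressed in the same unit as the price.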
XGBoost, Random Forest, Scikit-learn, Pandas, Streamlit, Python