At Trembit, we recently tackled a Vehicle Price Prediction project that not only estimates a car’s current market value with a high level of accuracy but also forecasts how that value will change over the next seven years. This approach allows businesses to make data-driven decisions on trade-ins, pricing strategies, and inventory management.
Our client, a leading leasing platform, sought to understand the dynamic nature of used car prices over time. This knowledge was crucial for informed business decisions and maximizing profitability. Our team developed a robust predictive model to address this critical need.
Accurate vehicle pricing is paramount across the automotive industry, regardless of whether you operate a dealership, manage a fleet, or facilitate online car sales. An inaccurate price or price prediction, whether too high or too low, directly impacts profitability.
Traditional pricing methodologies often rely on generalized estimates and historical sales data, which can quickly become obsolete. However, integrating advanced Machine Learning techniques into automotive practices is revolutionizing price forecasting, enabling businesses to predict future market values with greater accuracy.
In this article, we’ll walk you through the end-to-end process behind our solution: gathering and cleaning massive automotive datasets, building a stacked ML regressor with ten base models and one meta-model, and finally showcasing a Streamlit application that brings it all together in a user-friendly interface.
Project Overview
When we started this project, our mission was clear: develop a cutting-edge predictive analytics solution that could deliver highly accurate used car prices and forecast vehicle depreciation over seven years. We knew this meant going beyond simple linear models. Instead, we envisioned an ensemble approach—specifically, a model stacking architecture with ten base algorithms and a powerful meta-model on top.
By combining multiple regression techniques, including Gradient Boosting, XGBoost, and various linear models like Lasso and Ridge, we aimed to capture the intricate patterns underlying vehicle price fluctuations. This ensemble strategy was paired with robust data preprocessing and feature engineering to ensure the highest possible accuracy. Finally, we wrapped the solution in a Streamlit application—giving users an intuitive way to select vehicle attributes and instantly see how prices might evolve in time and under various mileage usage scenarios.
Data Collection & Preprocessing
Uncovering the Real Dataset
Our journey began with a sizable dataset containing more than 1.7 million records of cars listed for sale. Yet, on closer inspection, we found that these records represented just 84,000 unique vehicles. Sellers often relist or update their listings over time. In some cases, a single car appeared as many as 130 times, each entry with slightly different pricing or descriptions.
To address this duplication:
- We relied on a “url” field that served as an identifier, enabling us to group identical cars across different listings.
- We decided to keep only the most recent record for each vehicle, ensuring that we captured the most accurate, up-to-date market price.
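As an illustration, the deduplication step could look roughly like this in pandas. Only the "url" field comes from the project description; the "scraped_at" and "price" column names and the toy rows are assumptions made for this sketch.

```python
import pandas as pd

# Toy listings. "url" is the identifier mentioned in the article;
# "scraped_at" and "price" are hypothetical column names.
listings = pd.DataFrame(
    {
        "url": ["car-a", "car-a", "car-b", "car-a", "car-b"],
        "scraped_at": pd.to_datetime(
            ["2024-01-05", "2024-03-01", "2024-02-10", "2024-02-01", "2024-01-15"]
        ),
        "price": [15500, 14200, 9900, 14900, 10400],
    }
)

# How many times each car was listed before deduplication.
relist_counts = listings.groupby("url").size()

# Sort so the newest snapshot of each car comes last, then keep that row.
latest = (
    listings.sort_values("scraped_at")
    .drop_duplicates(subset="url", keep="last")
    .reset_index(drop=True)
)

print(relist_counts)
print(latest)
```

Sorting before `drop_duplicates(keep="last")` makes the "most recent record" rule explicit instead of relying on the order in which rows happened to be collected.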
Handling Outliers to Reflect Reality
Data exploration revealed that certain listings had prices as high as 100,000,000 euros—clearly unrealistic for a used car market. Meanwhile, some cars were priced below 100 euros, which further investigation showed were actually rentals (despite filters intended to capture only vehicles for sale).
- Price: We capped the upper range at 150,000 euros and eliminated implausibly low-price rentals.
- Year Built: We removed cars built before 1993 or after 2025. This kept the distribution closer to Gaussian and excluded build years that could not plausibly precede the listing date.
- Power: While most vehicles had power under ~500 HP, extreme outliers claimed up to 800,000 HP! We found that removing cars with power over 751 HP (less than 0.1% of the dataset) gave us a more realistic, bell-shaped power distribution.
- Mileage: The mileage values followed a multimodal distribution, with spikes at 0 or 1 km (often brand-new listings), a cluster around 500 km, and another around 700,000 km. We also spotted outliers with over 2,000,000 kilometers. For realistic predictions, we removed those ultra-high-mileage outliers.
By systematically removing (or capping) these outliers, we arrived at a dataset that reflected real-world market conditions and did not skew the training data.
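The filters above can be sketched as a single boolean mask. The thresholds are the ones quoted in the list; the column names are hypothetical, and this sketch drops out-of-range rows rather than clipping them, which is an assumption about how the limits were applied.

```python
import pandas as pd

# Thresholds quoted in the article; column names are hypothetical.
MIN_PRICE_EUR = 100          # cheaper listings turned out to be rentals
MAX_PRICE_EUR = 150_000
MIN_YEAR, MAX_YEAR = 1993, 2025
MAX_POWER_HP = 751
MAX_MILEAGE_KM = 2_000_000


def remove_outliers(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only rows whose values fall inside the plausible ranges."""
    mask = (
        df["price"].between(MIN_PRICE_EUR, MAX_PRICE_EUR)
        & df["year_built"].between(MIN_YEAR, MAX_YEAR)
        & (df["power_hp"] <= MAX_POWER_HP)
        & (df["mileage_km"] <= MAX_MILEAGE_KM)
    )
    return df.loc[mask].reset_index(drop=True)


cars = pd.DataFrame(
    {
        "price": [12000, 100_000_000, 45, 30000, 8000, 5000, 21000],
        "year_built": [2015, 2018, 2020, 1985, 2012, 2008, 2019],
        "power_hp": [110, 300, 90, 150, 800_000, 140, 190],
        "mileage_km": [90_000, 20_000, 10, 150_000, 120_000, 2_500_000, 40_000],
    }
)

clean = remove_outliers(cars)
print(clean)
```

Keeping the limits as named constants makes it easy to rerun the exploration with different cut-offs and compare the resulting distributions.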
Feature Engineering & Transformation
Taming Missing Values
Missing data can reduce model quality significantly if handled poorly. We took a feature-by-feature approach:
- Seller Type: Missing 79% of its values. Initially, we considered dropping the feature entirely. However, careful experimentation showed better results when introducing an “unknown seller” category.
- fuelType, bodyType, color: Each had its unique pattern of missingness, so we tested various fill strategies (e.g., grouping, introducing “unknown” categories) to see which improved performance the most.
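A minimal sketch of the "unknown category" strategy, assuming pandas and hypothetical column names and labels. It first measures how much of each feature is missing, then turns missingness into an explicit category instead of dropping the feature.

```python
import pandas as pd

# Toy data; the column names, values, and fill labels are hypothetical.
cars = pd.DataFrame(
    {
        "seller_type": ["dealer", None, None, "private", None],
        "fuelType": ["petrol", "diesel", None, "petrol", "petrol"],
    }
)

# Share of missing values per feature, to decide how to treat each one.
missing_share = cars.isna().mean()

# Keep every row and make "missing" a category of its own.
filled = cars.fillna({"seller_type": "unknown_seller", "fuelType": "unknown"})

print(missing_share)
print(filled["seller_type"].value_counts())
```

Whether an explicit category beats dropping or imputing a feature is an empirical question, so each variant would typically be compared on a validation set.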
Categorical Grouping
Realistic automotive data includes many distinct categories:
- Brand: Over 100 unique brands in the raw data. By consolidating the least popular 0.5% into a single “other” category, we went from 108 to 22 unique brands.
- Location: With 816 distinct locations initially, we grouped the least popular 1% into “rare_location,” ending up with 67 categories.
- Model: The model field was the biggest culprit, with 10,536 entries. Grouping anything with fewer than 10 occurrences into “rare_model” decreased this to 535.
By reducing data sparsity, these grouping techniques helped models learn more effectively without being overwhelmed by rarely seen categories.
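Both grouping rules (by share of rows and by absolute count) can be expressed with one small helper. This is a sketch under assumptions: the helper name `group_rare` is hypothetical, and "least popular 0.5%" is interpreted here as "categories that account for less than 0.5% of rows", which may differ from the rule actually used.

```python
import pandas as pd


def group_rare(series, label, min_count=None, min_share=None):
    """Replace infrequent categories with a single placeholder label."""
    counts = series.value_counts()
    if min_count is not None:
        rare = counts[counts < min_count].index
    else:
        rare = counts[counts / len(series) < min_share].index
    # Keep frequent values as they are, overwrite the rare ones.
    return series.where(~series.isin(rare), label)


# Count-based rule: fewer than 10 occurrences becomes "rare_model".
models = pd.Series(["golf"] * 12 + ["a3"] * 10 + ["x1"] * 3 + ["z9"] * 1)
models_grouped = group_rare(models, "rare_model", min_count=10)

# Share-based rule: below 0.5% of rows becomes "other".
brands = pd.Series(["vw"] * 200 + ["bmw"] * 99 + ["exotic"] * 1)
brands_grouped = group_rare(brands, "other", min_share=0.005)

print(models_grouped.value_counts())
print(brands_grouped.value_counts())
```

In a real pipeline the set of rare categories should be learned on the training split only and then applied to validation and test data.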
Custom Features & Transformations
We introduced new features to reflect the real-world context:
- Age: publish_year – year_built, directly measuring how old the car was at the time of listing.
- Mileage Categories: We added boolean indicators that distinguish “new” (mileage of 0–1), “slightly used,” and “used” cars. This captured the multi-peak nature of the mileage distribution in a more explicit, machine-friendly way.
Finally, we applied a Box-Cox transformation to the target variable (price) to address skewness. This step brought the price distribution closer to normal, giving our regressor a smoother, better-behaved target.
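The feature and target transformations described in this section might be sketched as follows, assuming pandas and SciPy. The boundary between “slightly used” and “used” is not given in the article, so `SLIGHTLY_USED_MAX_KM` is a made-up value, and the prices are synthetic.

```python
import numpy as np
import pandas as pd
from scipy.special import inv_boxcox
from scipy.stats import boxcox, skew

cars = pd.DataFrame(
    {
        "publish_year": [2024, 2024, 2023, 2024],
        "year_built": [2024, 2019, 2010, 2021],
        "mileage_km": [0, 45_000, 210_000, 800],
    }
)

# Age at the time of listing.
cars["age"] = cars["publish_year"] - cars["year_built"]

# Boolean mileage indicators. The 0-1 range for "new" is from the article;
# the 5,000 km boundary is a hypothetical choice for this sketch.
SLIGHTLY_USED_MAX_KM = 5_000
cars["is_new"] = cars["mileage_km"] <= 1
cars["is_slightly_used"] = cars["mileage_km"].between(2, SLIGHTLY_USED_MAX_KM)
cars["is_used"] = cars["mileage_km"] > SLIGHTLY_USED_MAX_KM

# Box-Cox on a synthetic, right-skewed price sample (values must be > 0).
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=9.5, sigma=0.6, size=500)
prices_bc, lam = boxcox(prices)

# Predictions made on the transformed scale are mapped back with the same lambda.
restored = inv_boxcox(prices_bc, lam)

print(cars)
print("skew before:", skew(prices), "skew after:", skew(prices_bc))
```

The fitted lambda has to be stored alongside the model, because it is needed to convert predictions back into euros.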
Model Stacking Approach & Architecture
Why Stacking?
Rather than relying on a single regressor, we employed model stacking—a technique where multiple algorithms generate predictions that feed into a meta-model. Each algorithm captures a different facet of the data. Combining them often yields stronger predictions, improves generalization, and mitigates overfitting.
Ten Base Models + One Meta-Model
Our ten base regressors included:
- Linear
- Ridge
- Lasso
- ElasticNet
- SGD Regressor
- XGB Regressor
- Random Forest
- Gradient Boosting Regressor (GBR)
- Bagging Regressor (BR)
- AdaBoost Regressor (ABR)
After training each base model, their out-of-fold predictions became the features for the final layer—our meta-model: HistGradientBoostingRegressor. This final model learned how to combine each base model’s strengths into a single, highly accurate prediction pipeline.
K-Fold Cross-Validation: Preventing Data Leakage
To ensure robust performance and prevent data leakage, we used K-fold cross-validation (usually 4 or 5 folds). Each fold produced predictions on unseen data, generating reliable “second-level” inputs for the meta-model. This rigorous approach helped us maintain high integrity in our results and avoid overly optimistic metrics.
Predicting Future Depreciation
The Challenge: Forecasting 7 Years Ahead
While our primary regressor excelled at predicting a car’s current price, our client needed to understand how and why that price would change over the next seven years. The biggest wild card? Mileage. A car’s mileage evolves differently for every driver, making a single “mileage over time” model unreliable.
Symbolic Regression for Realistic Mileage Curves
We discovered that typical regression approaches on our dataset indicated decreasing mileage after year 15, which is physically impossible because an odometer reading can only grow. Instead, we used a Symbolic Regression approach for the first 15 years, capturing an exponential-type curve that rises and then naturally plateaus rather than declining. This more physically plausible function better represents how most cars accumulate mileage over time.
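The difference between the two behaviours can be illustrated on synthetic data. Note that this is not symbolic regression: symbolic regression searches for the functional form itself, whereas the sketch below fixes a saturating exponential in advance and fits it with SciPy’s `curve_fit`, then compares it with a quadratic fit. The data, parameter values, and function name are all assumptions.

```python
import numpy as np
from scipy.optimize import curve_fit


def saturating(age, plateau, rate):
    """Mileage that rises quickly at first and then levels off."""
    return plateau * (1.0 - np.exp(-rate * age))


# Synthetic average mileage for ages 0..15 (hypothetical numbers).
ages = np.arange(0, 16, dtype=float)
rng = np.random.default_rng(0)
observed_km = saturating(ages, 250_000, 0.2) + rng.normal(0, 3_000, ages.size)

# Fit the saturating curve and, for comparison, a plain quadratic.
(plateau, rate), _ = curve_fit(saturating, ages, observed_km, p0=(200_000, 0.1))
quadratic = np.polynomial.Polynomial.fit(ages, observed_km, deg=2)

# Extrapolate both fits beyond the observed range.
future_ages = np.arange(15, 31, dtype=float)
saturating_km = saturating(future_ages, plateau, rate)
quadratic_km = quadratic(future_ages)

print("saturating at 30 years:", round(saturating_km[-1]))
print("quadratic at 30 years:", round(quadratic_km[-1]))
```

The quadratic bends downward once extrapolated past its peak, while the saturating form can only flatten out, which is the property the forecast needs.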
Seven Scenarios, One Dynamic Forecast
Since real drivers use vehicles at different rates, we introduced seven usage multipliers (e.g., ×0.75, ×1, ×1.25) to create multiple “what-if” scenarios:
- Low-mileage usage (e.g., weekend drivers)
- Typical usage
- High-mileage usage (e.g., commercial fleets)
For each scenario, we updated the car’s “age” and “mileage” each year, then ran these updated features through our stacked regression model to predict future prices. The result? A suite of seven price curves showing how value might shift from year to year under different usage patterns.
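The yearly update loop can be sketched as below. Both `expected_mileage_km` and `predict_price` are hypothetical stand-ins (for the fitted mileage curve and the stacked regressor, respectively), and only the three multipliers quoted in the article are used instead of all seven.

```python
import math

# The article uses seven multipliers; only these three are quoted there.
MULTIPLIERS = [0.75, 1.0, 1.25]


def expected_mileage_km(age_years):
    """Hypothetical saturating mileage curve (stand-in for the fitted one)."""
    return 300_000 * (1.0 - math.exp(-0.12 * age_years))


def predict_price(age_years, mileage_km):
    """Hypothetical stand-in for the stacked regressor."""
    return 30_000 * math.exp(-0.10 * age_years) * math.exp(-mileage_km / 400_000)


def forecast(age_now, mileage_now, horizon_years=7):
    """Return one list of yearly price predictions per usage multiplier."""
    curves = {}
    for multiplier in MULTIPLIERS:
        mileage = mileage_now
        prices = []
        for step in range(1, horizon_years + 1):
            age = age_now + step
            # Typical distance driven during this year, scaled by the scenario.
            yearly_km = expected_mileage_km(age) - expected_mileage_km(age - 1)
            mileage += multiplier * yearly_km
            prices.append(round(predict_price(age, mileage), 2))
        curves[multiplier] = prices
    return curves


curves = forecast(age_now=3, mileage_now=45_000)
for multiplier, prices in curves.items():
    print(multiplier, prices)
```

Because mileage is accumulated year by year from the car’s actual starting point, a vehicle that begins above or below the typical curve keeps that offset throughout the forecast.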
Model Performance & Evaluation
Quantifying Success
Using a held-out test set, we measured the performance of the stacked regressor:
- MSE (Mean Squared Error): 0.00207
- MAE (Mean Absolute Error): 0.02529
- R² (Coefficient of Determination): 0.913

Figure 1: Screenshot of the results received.
An R² score of 0.913 indicates that our model explains over 91% of the variance in vehicle prices—a robust showing in the predictive analytics domain. These results highlight our system’s capacity to deliver accurate and consistent price predictions across various vehicles.
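For reference, the three metrics are computed as shown below, using plain NumPy and made-up numbers. The article does not state on which scale the MSE and MAE were measured; when the target has been transformed, these two values are expressed in the transformed units, while R² is unitless.

```python
import numpy as np

# Made-up targets and predictions, for illustration only.
y_true = np.array([3.0, 2.5, 4.0, 5.0])
y_pred = np.array([2.8, 2.7, 4.1, 4.6])

errors = y_true - y_pred

# Mean Squared Error: average of the squared residuals.
mse = np.mean(errors ** 2)

# Mean Absolute Error: average magnitude of the residuals.
mae = np.mean(np.abs(errors))

# R^2: one minus the ratio of residual variance to total variance.
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"MSE={mse:.4f}  MAE={mae:.4f}  R2={r2:.4f}")
```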
Comparisons to Traditional Methods
Classic pricing strategies often rely on broad heuristics such as age brackets, brand popularity, or national average depreciation rates. Our ensemble is designed to improve on these rule-of-thumb models by learning from tens of thousands of vehicles, each described by its features, location, usage profile, and more. While direct, side-by-side comparisons with “traditional methods” are difficult to set up, an R² of 0.913 on a held-out test set is a strong result for used-car price prediction.
Client-Ready Deployment
To make the tool accessible, we wrapped the solution in a Streamlit application. The app featured:
- Dropdown menus for selecting car attributes.
- Real-time predictions of price trends over seven years.
- Interactive visualizations for easy interpretation.

Figure 2: Screenshot of the Streamlit app interface.
This user-friendly approach ensured that even non-technical users could harness the power of machine learning for strategic decision-making.
Business Impact & Potential Use Cases
Maximizing Profits & Streamlining Operations
Our solution acts as a real-time vehicle pricing intelligence tool for dealerships and used car marketplaces. By accurately forecasting depreciation:
- Inventory Management: Reduce the risk of overpaying for trade-ins or letting undervalued cars linger.
- Optimized Selling Strategies: Price vehicles more competitively from day one, accelerating sales and reducing holding costs.
- Targeted Marketing: Highlight lower future depreciation for better resale value and brand positioning.
Long-Term Planning & Fleet Management
Fleet managers and automotive leasing companies can project residual values of vehicles several years ahead:
- Lease Structuring: Calculate more precise monthly payments based on scientifically forecasted residual values.
- Maintenance Scheduling: Combine mileage forecasts with maintenance records, avoiding downtime and minimizing unexpected costs.
Conclusion
In an era where data-driven decision-making is crucial for success, our advanced vehicle price prediction model offers a transformative upgrade from traditional valuation approaches. Through a careful data cleaning process, rigorous feature engineering, and a stacked ensemble architecture, we’ve unlocked a powerful tool to forecast car prices accurately. Not only can dealerships and fleet owners optimize their current pricing, but they can also confidently plan for depreciation over the next seven years—giving them a decisive competitive edge.
At the heart of this project is our passion for delivering Machine Learning solutions that address real business needs. If you’re looking to enhance your automotive operations, improve pricing strategies, or explore predictive analytics for other complex problems, we’re here to help.
Ready for the Next Step?
Contact the Trembit team today to discover how our custom ML-based analysis can revolutionize your pricing strategies and unlock new revenue opportunities. Together, we’ll steer your automotive business toward greater efficiency, profitability, and customer satisfaction.