Project Overview
After doing classification in my earlier projects, I fancied having a go at regression: predicting actual numbers instead of just categories. I started with house prices, but the datasets were all over the place. Populations made much more sense: they're structured, follow trends, and aren't as messy. Perfect for testing out regression models.
I built two models:
- Linear Regression (LR) — simple, easy to understand, good starting point.
- Random Forest Regression (RFR) — more complex, uses loads of decision trees, better at handling tricky patterns.
The project ran in two parts: first, building and testing Linear Regression; then moving to Random Forest Regression and comparing the two. Together, they became a proper study of regression in practice.
Technical Implementation
Phase 1: Data Engineering - Cleaning, Encoding, and Structuring
I grabbed two population datasets from Kaggle. They weren't exactly tidy, but they were structured and had categorical data that couldn't go straight into models.
What I did:
- Dropped unnecessary columns, keeping only consistent features across both datasets.
- Label encoding: turned country codes into numbers.
- Renamed population columns by year (e.g., 2020, 2021, 2022) for easier reference.
- Exported to CSV for quick iteration (I've used MySQL in other projects, but CSV was very efficient for these smaller datasets).
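Here's a minimal pandas sketch of that prep. The filenames and column labels are placeholders, not the exact ones from the Kaggle datasets:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Load one of the Kaggle exports (hypothetical filename)
df = pd.read_csv("world_population.csv")

# Keep only the features shared across both datasets
df = df[["Country Code", "2020 Population", "2021 Population", "2022 Population"]]

# Label-encode the country codes so they can be fed to the models
df["Country Code"] = LabelEncoder().fit_transform(df["Country Code"])

# Rename the population columns by year for easier reference
df = df.rename(columns={
    "2020 Population": "2020",
    "2021 Population": "2021",
    "2022 Population": "2022",
})

# Export the cleaned table for quick iteration
df.to_csv("population_clean.csv", index=False)
```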
What I learned: This prep phase was crucial - I quickly realised that rubbish data = rubbish models, no matter how clever the algorithm.
Phase 2: Linear Regression - A Solid Starting Point
With the cleaned dataset, I built a multiple linear regression model using scikit-learn:
Model Setup:
- Target (y): population in 2022 (later extended to 2023).
- Features (X): earlier population years.
- Split into training and testing sets (with reproducibility via random_state).
- Fitted the model and ran predictions on the test set.
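In scikit-learn terms, the setup looked roughly like this (file and column names carried over from the hypothetical cleaned CSV above):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("population_clean.csv")

# Earlier years as features, the latest year as the target
X = df[["2020", "2021"]]
y = df["2022"]

# Reproducible train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the model and predict on the held-out countries
model = LinearRegression()
model.fit(X_train, y_train)
preds = model.predict(X_test)
```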
Evaluation metrics:
- MAPE (Mean Absolute Percentage Error): the average % difference between predicted and actual values.
- R² (coefficient of determination): how much variance in the data the model explains.
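Both metrics come straight out of sklearn.metrics; here's a self-contained sketch with made-up numbers, just to show the calls:

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error, r2_score

# Toy actual vs predicted populations (illustrative values only)
y_true = np.array([1_400_000_000, 67_000_000, 5_500_000])
y_pred = np.array([1_395_000_000, 68_200_000, 4_900_000])

mape = mean_absolute_percentage_error(y_true, y_pred)  # returned as a fraction, e.g. 0.043 ~ 4.3%
r2 = r2_score(y_true, y_pred)                          # 1.0 = perfect fit

print(f"MAPE: {mape:.2%}, R²: {r2:.4f}")
```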
Key findings:
- The model heavily weighted recent years (2020, 2021), which makes sense — they're the strongest predictors.
- Performance was good for large countries, terrible for small ones. E.g., India predicted well, but small island nations gave ridiculous outputs (even negative populations, due to the high intercept term).
- Adding 2022 as a new feature when predicting 2023 improved accuracy — showing the value of richer datasets.
- I even coded my own error metric, row by row, to visualise error spikes. Result: bigger countries had very low error (<1%), smaller ones spiked to 80%+.
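The hand-rolled check was essentially a row-by-row percentage error over a results table. A sketch with hypothetical countries and values, purely to illustrate the pattern:

```python
import pandas as pd

# Hypothetical per-country results: actual vs predicted 2022 populations
results = pd.DataFrame({
    "country": ["India", "United Kingdom", "Tuvalu"],
    "actual": [1_417_000_000, 67_500_000, 11_000],
    "predicted": [1_410_000_000, 67_200_000, -8_500],
})

# Row-by-row absolute percentage error
results["pct_error"] = (results["predicted"] - results["actual"]).abs() / results["actual"] * 100

# Large countries sit well under 1%; tiny populations spike dramatically
print(results.sort_values("pct_error", ascending=False))
```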
Verdict: Linear Regression is powerful for trend-heavy data, but fragile with outliers and small values.
Phase 3: Random Forest Regression - Power vs Overfitting
Part 2 was about stepping up my game. Linear models are limited, so I turned to Random Forest Regression (RFR):
Model Architecture:
- Built with sklearn.ensemble, using bagging (bootstrap aggregating) across decision trees.
- Explored hyperparameters: n_estimators (no. of trees), max_depth (to stop overfitting), bootstrap (whether to sample with replacement), and loads more (RFR has 17 tunable params vs LR's 4).
- Learned how bootstrapping stops single outliers from dominating the model, and how pruning (capping tree depth) helps generalisation.
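As a sketch of what that configuration looks like (the hyperparameter values here are illustrative, not the tuned ones from the project, and the CSV is the hypothetical cleaned file from earlier):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

df = pd.read_csv("population_clean.csv")
X, y = df[["2020", "2021"]], df["2022"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Bagging across many decision trees, with depth capped to rein in overfitting
rfr = RandomForestRegressor(
    n_estimators=200,   # number of trees in the forest
    max_depth=10,       # limit tree depth to help generalisation
    bootstrap=True,     # sample rows with replacement for each tree
    random_state=42,
)
rfr.fit(X_train, y_train)
print(rfr.score(X_test, y_test))  # R² on the held-out countries
```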
Experiments:
- Default vs hyper-tuned models: tuning improved performance on training data but made models overfit, collapsing on unseen data.
- Turning off bootstrapping + raising n_estimators boosted in-sample accuracy, but killed generalisation.
- Out-of-bag scoring (oob_score=True) gave me an internal measure of generalisation without needing a separate test split.
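Out-of-bag scoring is just a couple of flags on the constructor; a minimal sketch, again on the hypothetical cleaned CSV:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

df = pd.read_csv("population_clean.csv")
X, y = df[["2020", "2021"]], df["2022"]

# OOB scoring evaluates each tree on the rows left out of its bootstrap sample,
# so you get a rough generalisation estimate without a separate test split
rfr = RandomForestRegressor(
    n_estimators=200,
    bootstrap=True,   # required: OOB samples only exist when bagging is on
    oob_score=True,
    random_state=42,
)
rfr.fit(X, y)
print(rfr.oob_score_)  # internal R²-style estimate of out-of-sample performance
```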
Results:
- RFR was robust for small countries (better than LR).
- Hyper-tuned models nailed training datasets but failed cross-dataset tests — classic overfitting.
- RFR struggled with 2023 predictions compared to LR. The model often repeated population values it had seen in training, instead of extrapolating new ones.
Takeaway: RFR is powerful, but prone to memorising instead of generalising.
Phase 4: Evaluation, Comparison, and Lessons Learned
Linear Regression vs Random Forest Regression:
| Feature | Linear Regression | Random Forest Regression |
|---|---|---|
| Complexity | Simple, interpretable | Complex, many parameters |
| Speed | Very fast | Slower (depends on trees/depth) |
| Non-linear handling | Weak | Strong |
| Outlier sensitivity | High | Low (bootstrapping helps) |
| Overfitting | Rare | Common if over-tuned |
| Best at | Large-population trends | Small-population predictions |
Key conclusions:
- LR excelled at big-country predictions, thanks to assigning huge weight to recent years.
- RFR excelled at small-country predictions, thanks to its ensemble averaging.
- Both failed at generalising across datasets without careful tuning and preprocessing.
- Hyper-tuning is a double-edged sword: it improves fit on known data, but often at the cost of robustness.
Lessons I took away:
- Always start simple (LR gave me a strong baseline).
- More data is better, but only if it's clean and representative.
- Overfitting is the silent killer. A "great" training score often hides a useless model.
- Matching model to problem is everything. I found that LR was better for trend-heavy data, while RFR was better for non-linear and noisy cases.
Conclusion
This project wasn't just "fit two models and compare them." It was a full machine learning engineering cycle:
- Data wrangling (cleaning, encoding, structuring).
- Model building (LR + RFR).
- Hyperparameter exploration.
- Error analysis (MAPE, R², custom metrics).
- Cross-dataset testing for generalisation.
- Visualisation (Plotly graphs, error plots, scaling insights).
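As a flavour of the visualisation side, an error plot in Plotly Express might look like this, with hypothetical per-country numbers:

```python
import pandas as pd
import plotly.express as px

# Hypothetical per-country error table from the evaluation step
errors = pd.DataFrame({
    "country": ["India", "United Kingdom", "Iceland", "Tuvalu"],
    "pct_error": [0.5, 0.4, 12.0, 85.0],
})

# A bar chart makes the error spikes on small countries jump out
fig = px.bar(errors, x="country", y="pct_error",
             title="Prediction error by country (%)")
fig.show()
```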
Most importantly, I came out with a real understanding of when to use which model. Linear Regression gave speed and clarity; Random Forest gave flexibility and power, but at the cost of complexity and overfitting risk.
For me, this project was about more than predicting populations. It was about learning how to think like an ML engineer: test assumptions, challenge results, and understand that no model is universally "the best."