You know NumPy and Pandas. Now it is time to train a model.
scikit-learn is the standard library for machine learning in Python. It is simple, well-documented, and works for most real-world tasks without a GPU.
Setup
pip install scikit-learn pandas numpy
import sklearn
print(sklearn.__version__) # 1.5+
The ML Workflow
Nearly every supervised ML project follows the same steps:
1. Load data
2. Prepare features (X) and target (y)
3. Split into train and test sets
4. Train a model
5. Evaluate on test set
6. Make predictions on new data
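Every scikit-learn estimator shares the same fit/predict interface, so the whole loop fits in a few lines. A compressed sketch on synthetic data (make_regression generates a random linear dataset for illustration):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Steps 1-2: load data, separate features (X) and target (y)
X, y = make_regression(n_samples=200, n_features=3, noise=0.1, random_state=0)

# Step 3: hold out a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Step 4: train
model = LinearRegression().fit(X_train, y_train)

# Step 5: evaluate on data the model has never seen
print(f"R2: {r2_score(y_test, model.predict(X_test)):.3f}")

# Step 6: predict on new data
print(model.predict(X_test[:1]))
```

Swap LinearRegression for any other estimator and the rest of the code stays the same.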
Let’s go through each one.
Step 1: Load Data
We will use the California Housing dataset — a classic regression problem.
from sklearn.datasets import fetch_california_housing
import pandas as pd
housing = fetch_california_housing()
# Put into a DataFrame for easy exploration
df = pd.DataFrame(housing.data, columns=housing.feature_names)
df["target"] = housing.target # median house value (in $100,000s)
print(df.shape) # (20640, 9)
print(df.head(3))
print(df.describe())
Step 2: Prepare Features and Target
X = df.drop(columns=["target"]) # features: 8 columns
y = df["target"] # target: house price
print(X.shape) # (20640, 8)
print(y.shape) # (20640,)
Step 3: Train/Test Split
We train on 80% of the data and evaluate on 20%. The test set simulates new, unseen data.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42
)
print(X_train.shape) # (16512, 8)
print(X_test.shape) # (4128, 8)
random_state=42 makes the split reproducible. Use any number — it just fixes the random seed.
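You can verify the reproducibility yourself on a toy array: two calls with the same random_state return identical splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(10)

a_train, a_test = train_test_split(data, test_size=0.2, random_state=42)
b_train, b_test = train_test_split(data, test_size=0.2, random_state=42)
c_train, c_test = train_test_split(data, test_size=0.2, random_state=7)

print(np.array_equal(a_test, b_test))  # True: same seed, same split
print(np.array_equal(a_test, c_test))  # different seed, (almost certainly) a different split
```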
Step 4a: Train a Linear Regression Model
Linear regression models the target as a weighted sum of the features — a straight line in one dimension, a hyperplane in several. It is simple, fast, and interpretable.
from sklearn.linear_model import LinearRegression
model_lr = LinearRegression()
model_lr.fit(X_train, y_train)
print("Coefficients:", model_lr.coef_)
print("Intercept:", model_lr.intercept_)
Step 5a: Evaluate Linear Regression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np
y_pred_lr = model_lr.predict(X_test)
mse = mean_squared_error(y_test, y_pred_lr)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred_lr)
r2 = r2_score(y_test, y_pred_lr)
print(f"RMSE: {rmse:.3f}") # ~0.745
print(f"MAE: {mae:.3f}") # ~0.533
print(f"R²: {r2:.3f}") # ~0.576
What these metrics mean:
- RMSE — Root Mean Squared Error. Lower is better, and it is in the same units as y (here, $100,000s).
- MAE — Mean Absolute Error. The average absolute difference between predictions and true values. Less sensitive to outliers than RMSE, since errors are not squared.
- R² — The fraction of the target's variance the model explains. 1.0 is perfect; 0.0 means the model is no better than always predicting the mean.
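All three metrics are simple enough to compute by hand with NumPy, and doing it once makes the definitions concrete. A tiny example with made-up predictions:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([3.0, 2.0, 4.0, 1.0])
y_pred = np.array([2.5, 2.0, 3.0, 1.5])

errors = y_true - y_pred
rmse = np.sqrt(np.mean(errors ** 2))  # root of the average squared error
mae = np.mean(np.abs(errors))         # average absolute error
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(f"RMSE: {rmse:.3f}")  # 0.612
print(f"MAE:  {mae:.3f}")   # 0.500
print(f"R²:   {r2:.3f}")    # 0.700

# sklearn computes exactly the same numbers
assert np.isclose(rmse, np.sqrt(mean_squared_error(y_true, y_pred)))
assert np.isclose(mae, mean_absolute_error(y_true, y_pred))
assert np.isclose(r2, r2_score(y_true, y_pred))
```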
Linear regression gets R² ~0.58. Not bad, but we can do better.
Step 4b: Train a Random Forest
Random forests are a strong default for tabular data. They average many decision trees, each trained on a random sample of the data, which lets them capture nonlinear patterns while keeping overfitting in check.
from sklearn.ensemble import RandomForestRegressor
model_rf = RandomForestRegressor(
    n_estimators=100,  # number of trees
    random_state=42,
    n_jobs=-1          # use all CPU cores
)
model_rf.fit(X_train, y_train)
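Under the hood, the forest's prediction is just the average of its individual trees' predictions (each tree is fit on a bootstrap sample of the training data). You can check this on a small synthetic dataset:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=0.5, random_state=0)

forest = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Prediction of each individual tree for one sample
tree_preds = np.array([tree.predict(X_demo[:1])[0] for tree in forest.estimators_])

print(tree_preds.round(1))              # 10 possibly quite different values
print(tree_preds.mean().round(3))       # the forest averages them...
print(forest.predict(X_demo[:1])[0].round(3))  # ...same number (up to float rounding)
```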
Step 5b: Evaluate Random Forest
y_pred_rf = model_rf.predict(X_test)
rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))
mae_rf = mean_absolute_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)
print(f"RMSE: {rmse_rf:.3f}") # ~0.505
print(f"MAE: {mae_rf:.3f}") # ~0.328
print(f"R²: {r2_rf:.3f}") # ~0.805
Random forest gets R² ~0.80. That is a big improvement over linear regression.
Compare Both Models
print(f"{'Model':<20} {'RMSE':<10} {'MAE':<10} {'R²':<10}")
print("-" * 50)
print(f"{'LinearRegression':<20} {rmse:<10.3f} {mae:<10.3f} {r2:<10.3f}")
print(f"{'RandomForest':<20} {rmse_rf:<10.3f} {mae_rf:<10.3f} {r2_rf:<10.3f}")
Step 6: Make Predictions on New Data
import pandas as pd
# A single new house (must match training feature names and order)
new_house = pd.DataFrame([{
    "MedInc": 5.0,
    "HouseAge": 20.0,
    "AveRooms": 6.0,
    "AveBedrms": 1.0,
    "Population": 1200.0,
    "AveOccup": 3.0,
    "Latitude": 37.88,
    "Longitude": -122.23,
}])
price = model_rf.predict(new_house)
print(f"Predicted price: ${price[0] * 100_000:,.0f}")
# Predicted price: $280,000 (approximate)
Feature Importance
Random forests tell you which features matter most.
import pandas as pd
importances = pd.Series(
    model_rf.feature_importances_,
    index=X.columns
).sort_values(ascending=False)
print(importances)
# MedInc 0.522
# Latitude 0.143
# Longitude 0.112
# HouseAge 0.057
# ...
Median income (MedInc) is by far the most important feature, which makes sense: wealthier neighborhoods tend to have more expensive homes.
Quick Reference
| Task | Code |
|---|---|
| Split data | `train_test_split(X, y, test_size=0.2)` |
| Linear regression | `LinearRegression().fit(X_train, y_train)` |
| Random forest | `RandomForestRegressor(n_estimators=100)` |
| Predict | `model.predict(X_test)` |
| RMSE | `np.sqrt(mean_squared_error(y_test, y_pred))` |
| R² | `r2_score(y_test, y_pred)` |
| Feature importance | `model.feature_importances_` |
What’s Next?
You trained your first ML model. Now it is time to understand what is happening inside. The next article explains how neural networks work — the intuition behind neurons, layers, and backpropagation.
How Neural Networks Work: A Developer’s Guide