You know NumPy and Pandas. Now it is time to train a model.
scikit-learn is the standard library for machine learning in Python. It is simple, well-documented, and works for most real-world tasks without a GPU.
Setup
pip install scikit-learn pandas numpy
import sklearn
print(sklearn.__version__) # 1.5+
The ML Workflow
Nearly every supervised ML project follows the same steps:
1. Load data
2. Prepare features (X) and target (y)
3. Split into train and test sets
4. Train a model
5. Evaluate on test set
6. Make predictions on new data
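Every scikit-learn estimator shares the same fit/predict interface, so the whole loop fits in a few lines. A compressed sketch on synthetic data (make_regression generates a random linear dataset for illustration):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Steps 1-2: load data, separate features (X) and target (y)
X, y = make_regression(n_samples=200, n_features=3, noise=0.1, random_state=0)

# Step 3: hold out a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Step 4: train
model = LinearRegression().fit(X_train, y_train)

# Step 5: evaluate on data the model has never seen
print(f"R2: {r2_score(y_test, model.predict(X_test)):.3f}")

# Step 6: predict on new data
print(model.predict(X_test[:1]))
```

Swap LinearRegression for any other estimator and the rest of the code stays the same.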
Let’s go through each one.
Step 1: Load Data
We will use the California Housing dataset — a classic regression problem.
from sklearn.datasets import fetch_california_housing
import pandas as pd
housing = fetch_california_housing()
# Put into a DataFrame for easy exploration
df = pd.DataFrame(housing.data, columns=housing.feature_names)
df["target"] = housing.target # median house value (in $100,000s)
print(df.shape) # (20640, 9)
print(df.head(3))
print(df.describe())
Step 2: Prepare Features and Target
X = df.drop(columns=["target"]) # features: 8 columns
y = df["target"] # target: house price
print(X.shape) # (20640, 8)
print(y.shape) # (20640,)
Step 3: Train/Test Split
We train on 80% of the data and evaluate on 20%. The test set simulates new, unseen data.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42
)
print(X_train.shape) # (16512, 8)
print(X_test.shape) # (4128, 8)
random_state=42 makes the split reproducible. Use any number — it just fixes the random seed.
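You can verify the reproducibility yourself on a toy array: two calls with the same random_state return identical splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(10)

a_train, a_test = train_test_split(data, test_size=0.2, random_state=42)
b_train, b_test = train_test_split(data, test_size=0.2, random_state=42)
c_train, c_test = train_test_split(data, test_size=0.2, random_state=7)

print(np.array_equal(a_test, b_test))  # True: same seed, same split
print(np.array_equal(a_test, c_test))  # different seed, (almost certainly) a different split
```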
Step 4a: Train a Linear Regression Model
Linear regression models the target as a weighted sum of the features — a straight line in one dimension, a hyperplane in several. It is simple, fast, and interpretable.
from sklearn.linear_model import LinearRegression
model_lr = LinearRegression()
model_lr.fit(X_train, y_train)
print("Coefficients:", model_lr.coef_)
print("Intercept:", model_lr.intercept_)
Step 5a: Evaluate Linear Regression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np
y_pred_lr = model_lr.predict(X_test)
mse = mean_squared_error(y_test, y_pred_lr)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred_lr)
r2 = r2_score(y_test, y_pred_lr)
print(f"RMSE: {rmse:.3f}") # ~0.745
print(f"MAE: {mae:.3f}") # ~0.533
print(f"R²: {r2:.3f}") # ~0.576
What these metrics mean:
- RMSE — Root Mean Squared Error. Lower is better, and it is in the same units as y (here, $100,000s).
- MAE — Mean Absolute Error. The average absolute difference between predictions and true values. Less sensitive to outliers than RMSE, since errors are not squared.
- R² — The fraction of the target's variance the model explains. 1.0 is perfect; 0.0 means the model is no better than always predicting the mean.
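All three metrics are simple enough to compute by hand with NumPy, and doing it once makes the definitions concrete. A tiny example with made-up predictions:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([3.0, 2.0, 4.0, 1.0])
y_pred = np.array([2.5, 2.0, 3.0, 1.5])

errors = y_true - y_pred
rmse = np.sqrt(np.mean(errors ** 2))  # root of the average squared error
mae = np.mean(np.abs(errors))         # average absolute error
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(f"RMSE: {rmse:.3f}")  # 0.612
print(f"MAE:  {mae:.3f}")   # 0.500
print(f"R²:   {r2:.3f}")    # 0.700

# sklearn computes exactly the same numbers
assert np.isclose(rmse, np.sqrt(mean_squared_error(y_true, y_pred)))
assert np.isclose(mae, mean_absolute_error(y_true, y_pred))
assert np.isclose(r2, r2_score(y_true, y_pred))
```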
Linear regression gets R² ~0.58. Not bad, but we can do better.
Step 4b: Train a Random Forest
Random forests are a strong default for tabular data. They average many decision trees, each trained on a random sample of the data, which lets them capture nonlinear patterns while keeping overfitting in check.
from sklearn.ensemble import RandomForestRegressor
model_rf = RandomForestRegressor(
    n_estimators=100,  # number of trees
    random_state=42,
    n_jobs=-1          # use all CPU cores
)
model_rf.fit(X_train, y_train)
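Under the hood, the forest's prediction is just the average of its individual trees' predictions (each tree is fit on a bootstrap sample of the training data). You can check this on a small synthetic dataset:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=0.5, random_state=0)

forest = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Prediction of each individual tree for one sample
tree_preds = np.array([tree.predict(X_demo[:1])[0] for tree in forest.estimators_])

print(tree_preds.round(1))              # 10 possibly quite different values
print(tree_preds.mean().round(3))       # the forest averages them...
print(forest.predict(X_demo[:1])[0].round(3))  # ...same number (up to float rounding)
```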
Step 5b: Evaluate Random Forest
y_pred_rf = model_rf.predict(X_test)
rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))
mae_rf = mean_absolute_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)
print(f"RMSE: {rmse_rf:.3f}") # ~0.505
print(f"MAE: {mae_rf:.3f}") # ~0.328
print(f"R²: {r2_rf:.3f}") # ~0.805
Random forest gets R² ~0.80. That is a big improvement over linear regression.
Compare Both Models
print(f"{'Model':<20} {'RMSE':<10} {'MAE':<10} {'R²':<10}")
print("-" * 50)
print(f"{'LinearRegression':<20} {rmse:<10.3f} {mae:<10.3f} {r2:<10.3f}")
print(f"{'RandomForest':<20} {rmse_rf:<10.3f} {mae_rf:<10.3f} {r2_rf:<10.3f}")
Step 6: Make Predictions on New Data
import pandas as pd
# A single new house (must match training feature names and order)
new_house = pd.DataFrame([{
    "MedInc": 5.0,
    "HouseAge": 20.0,
    "AveRooms": 6.0,
    "AveBedrms": 1.0,
    "Population": 1200.0,
    "AveOccup": 3.0,
    "Latitude": 37.88,
    "Longitude": -122.23,
}])
price = model_rf.predict(new_house)
print(f"Predicted price: ${price[0] * 100_000:,.0f}")
# Predicted price: $280,000 (approximate)
Feature Importance
Random forests tell you which features matter most.
import pandas as pd
importances = pd.Series(
    model_rf.feature_importances_,
    index=X.columns
).sort_values(ascending=False)
print(importances)
# MedInc 0.522
# Latitude 0.143
# Longitude 0.112
# HouseAge 0.057
# ...
Median income (MedInc) is by far the most important feature, which makes sense: wealthier neighborhoods tend to have more expensive homes.
Quick Reference
| Task | Code |
|---|---|
| Split data | `train_test_split(X, y, test_size=0.2)` |
| Linear regression | `LinearRegression().fit(X_train, y_train)` |
| Random forest | `RandomForestRegressor(n_estimators=100)` |
| Predict | `model.predict(X_test)` |
| RMSE | `np.sqrt(mean_squared_error(y_test, y_pred))` |
| R² | `r2_score(y_test, y_pred)` |
| Feature importance | `model.feature_importances_` |
What’s Next?
You trained your first ML model. Now it is time to understand what is happening inside. The next article explains how neural networks work — the intuition behind neurons, layers, and backpropagation.
How Neural Networks Work: A Developer’s Guide