Before you train any machine learning model, you need to handle data. NumPy and Pandas are the two libraries you will use every day.

This is a practical crash course. No theory — just the operations you actually need.

Setup

pip install numpy pandas

Check versions:

import numpy as np
import pandas as pd

print(np.__version__)   # 2.x
print(pd.__version__)   # 2.x

NumPy: Arrays

NumPy gives you fast multi-dimensional arrays. Operations on them run element-wise in compiled code instead of a Python loop, so they are much faster than Python lists for math operations.
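
One rough way to see the difference is to run the same computation both ways and time it. The snippet below is only a sketch: the array size and the repeat count are arbitrary choices, and the timings depend on your machine, so no numbers are shown.

```python
import timeit

import numpy as np

n = 100_000  # arbitrary size, chosen only for illustration
py_list = list(range(n))
np_array = np.arange(n, dtype=np.int64)

# Square every element: Python loop vs. vectorized NumPy operation
squares_list = [x * x for x in py_list]
squares_array = np_array * np_array

# Both approaches produce the same values
same = np.array_equal(np.array(squares_list, dtype=np.int64), squares_array)
print(same)  # True

# Time each approach (results vary by machine)
t_list = timeit.timeit(lambda: [x * x for x in py_list], number=3)
t_array = timeit.timeit(lambda: np_array * np_array, number=3)
print(f"list: {t_list:.4f}s  numpy: {t_array:.4f}s")
```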

Creating Arrays

import numpy as np

# From a list
a = np.array([1, 2, 3, 4, 5])
print(a)         # [1 2 3 4 5]
print(a.dtype)   # int64
print(a.shape)   # (5,)

# 2D array (matrix)
m = np.array([[1, 2, 3],
              [4, 5, 6]])
print(m.shape)   # (2, 3)

# Zeros, ones, ranges
zeros = np.zeros((3, 3))
ones = np.ones((2, 4))
r = np.arange(0, 10, 2)    # [0 2 4 6 8]
linspace = np.linspace(0, 1, 5)  # [0.   0.25 0.5  0.75 1.  ]

Indexing and Slicing

a = np.array([10, 20, 30, 40, 50])

print(a[0])      # 10
print(a[-1])     # 50
print(a[1:4])    # [20 30 40]
print(a[::2])    # [10 30 50]

# 2D indexing
m = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(m[0, 1])   # 2  (row 0, col 1)
print(m[:, 1])   # [2 5 8]  (all rows, col 1)
print(m[1:, :2]) # [[4 5], [7 8]]
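
One thing to watch with slicing: a basic slice is a view of the original array, not a copy, so writing to the slice changes the original. A minimal sketch (the variable names `view` and `independent` are just for illustration):

```python
import numpy as np

a = np.array([10, 20, 30, 40, 50])

view = a[1:4]       # basic slicing returns a view of the same memory
view[0] = 99        # this also changes a[1]
print(a)            # [10 99 30 40 50]

independent = a[1:4].copy()   # .copy() gives you separate data
independent[0] = -1
print(a)            # [10 99 30 40 50]  (unchanged this time)
```

Call `.copy()` whenever you want to modify a slice without touching the source array.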

Broadcasting

Broadcasting lets you do math between arrays of different but compatible shapes: NumPy stretches the smaller array across the larger one. This is used constantly in ML.

a = np.array([1, 2, 3])

# Add a scalar to every element
print(a + 10)        # [11 12 13]
print(a * 2)         # [2 4 6]
print(a ** 2)        # [1 4 9]

# Broadcasting between 2D and 1D
m = np.ones((3, 3))
row = np.array([1, 2, 3])
print(m + row)
# [[2. 3. 4.]
#  [2. 3. 4.]
#  [2. 3. 4.]]
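
A typical ML use of broadcasting is scaling each feature column to zero mean and unit variance. The sketch below uses a made-up feature matrix `X`; the values are hypothetical and only there to show the shapes.

```python
import numpy as np

# Hypothetical feature matrix: 4 samples (rows), 2 features (columns)
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0],
              [4.0, 400.0]])

col_mean = X.mean(axis=0)   # shape (2,), one mean per column
col_std = X.std(axis=0)     # shape (2,), one std per column

# (4, 2) minus (2,) broadcasts the per-column values across every row
X_scaled = (X - col_mean) / col_std

print(X_scaled.shape)           # (4, 2)
print(X_scaled.mean(axis=0))    # approximately 0 for each column
print(X_scaled.std(axis=0))     # approximately 1 for each column
```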

Useful Math Operations

a = np.array([4, 1, 7, 2, 9, 3])

print(np.mean(a))    # 4.33
print(np.std(a))     # 2.81  (population std, ddof=0)
print(np.min(a))     # 1
print(np.max(a))     # 9
print(np.sum(a))     # 26
print(np.argmax(a))  # 4  (index of max value)
print(np.sort(a))    # [1 2 3 4 7 9]
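
On 2D arrays the same functions take an `axis` argument, which decides whether you get one number for the whole array or one per column or row. A short sketch with a small example matrix:

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])

print(np.sum(m))            # 21  (all elements)
print(np.sum(m, axis=0))    # [5 7 9]  (one value per column)
print(np.sum(m, axis=1))    # [ 6 15]  (one value per row)
print(np.mean(m, axis=0))   # [2.5 3.5 4.5]
```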

Boolean Indexing (Very Common in ML)

a = np.array([5, 12, 3, 8, 15, 1])

# Get values greater than 6
mask = a > 6
print(mask)       # [False  True False  True  True False]
print(a[mask])    # [12  8 15]

# Shorter way
print(a[a > 6])   # [12  8 15]
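
You can also combine masks. The sketch below reuses the same array; note that NumPy needs `&`, `|` and `~` here instead of `and`, `or` and `not`, and each condition needs its own parentheses.

```python
import numpy as np

a = np.array([5, 12, 3, 8, 15, 1])

# Combine conditions element by element
print(a[(a > 4) & (a < 13)])   # [ 5 12  8]
print(a[(a < 4) | (a > 12)])   # [ 3 15  1]
print(a[~(a > 6)])             # [5 3 1]

# np.where picks between two values for each element
labels = np.where(a > 6, 1, 0)
print(labels)                  # [0 1 0 1 1 0]
```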

Pandas: DataFrames

Pandas gives you labeled 2D data — like a spreadsheet in Python. ML datasets almost always start as Pandas DataFrames.

Creating a DataFrame

import pandas as pd

data = {
    "name": ["Alex", "Sam", "Jordan", "Taylor"],
    "age": [25, 32, 28, 45],
    "salary": [50000, 72000, 61000, 90000],
    "city": ["Berlin", "London", "Paris", "Berlin"],
}

df = pd.DataFrame(data)
print(df)

Output:

     name  age  salary    city
0    Alex   25   50000  Berlin
1     Sam   32   72000  London
2  Jordan   28   61000   Paris
3  Taylor   45   90000  Berlin

Exploring Data

print(df.shape)          # (4, 4)
print(df.dtypes)         # column types
print(df.describe())     # stats for numeric columns
df.info()                # overview with null counts (prints directly, returns None)
print(df.head(2))        # first 2 rows
print(df.tail(2))        # last 2 rows

Selecting Columns and Rows

# Single column (returns Series)
print(df["age"])

# Multiple columns
print(df[["name", "salary"]])

# Filter rows
print(df[df["age"] > 28])

# Filter with multiple conditions
print(df[(df["age"] > 25) & (df["city"] == "Berlin")])

# loc: label-based
print(df.loc[0, "name"])      # Alex
print(df.loc[1:2, ["name", "salary"]])

# iloc: position-based
print(df.iloc[0, 0])          # Alex
print(df.iloc[:2, 1:3])       # rows 0-1, cols 1-2
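
The difference between `loc` and `iloc` is easy to get wrong, so here is a small sketch. It uses a cut-down version of the DataFrame above. The point is that `loc` slices include the end label, `iloc` slices exclude the end position, and the two stop lining up once rows have been filtered.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alex", "Sam", "Jordan", "Taylor"],
    "age": [25, 32, 28, 45],
})

by_label = df.loc[1:2]       # labels 1 and 2, end label included
by_position = df.iloc[1:2]   # position 1 only, end position excluded
print(len(by_label))         # 2
print(len(by_position))      # 1

# After filtering, labels and positions no longer match
older = df[df["age"] > 28]
print(older.index.tolist())    # [1, 3]
print(older.iloc[0]["name"])   # Sam  (first row by position)
print(older.loc[3, "name"])    # Taylor  (row with label 3)
```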

Data Cleaning

Data cleaning often takes up most of the time in real ML work.

# Create a messy dataset
data = {
    "age": [25, None, 28, 45, None],
    "salary": [50000, 72000, None, 90000, 61000],
    "city": ["Berlin", "London", "Paris", None, "Berlin"],
}
df = pd.DataFrame(data)

# Check for missing values
print(df.isnull().sum())
# age       2
# salary    1
# city      1

# Drop rows with any missing values
df_clean = df.dropna()

# Fill missing values
df["age"] = df["age"].fillna(df["age"].mean())       # fill with mean
df["city"] = df["city"].fillna("Unknown")             # fill with string

# Drop duplicates
df = df.drop_duplicates()

# Rename columns
df = df.rename(columns={"salary": "annual_salary"})

# Change column type (astype(int) truncates decimals and fails if NaN remains)
df["age"] = df["age"].astype(int)
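
Two details in the cleaning steps are worth a closer look. Filling with the mean usually produces a non-integer value, and `astype(int)` cuts off the decimals instead of rounding. The sketch below uses the same age values as above and shows both behaviours, plus the median as one possible alternative fill value.

```python
import pandas as pd

# Same values as the "age" column in the messy dataset
age = pd.Series([25, None, 28, 45, None])

filled = age.fillna(age.mean())   # mean of 25, 28, 45 is about 32.67

truncated = filled.astype(int).tolist()
rounded = filled.round().astype(int).tolist()
print(truncated)   # [25, 32, 28, 45, 32]
print(rounded)     # [25, 33, 28, 45, 33]

# Alternative: the median is less sensitive to outliers
median_filled = age.fillna(age.median()).astype(int).tolist()
print(median_filled)   # [25, 28, 28, 45, 28]
```

Which fill value is right depends on the data; this only shows what each option does.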

Aggregation

data = {
    "city": ["Berlin", "London", "Berlin", "Paris", "London"],
    "salary": [50000, 72000, 61000, 90000, 68000],
}
df = pd.DataFrame(data)

# Group by city, get mean salary
print(df.groupby("city")["salary"].mean())
# city
# Berlin    55500.0
# London    70000.0
# Paris     90000.0

# Multiple aggregations
print(df.groupby("city")["salary"].agg(["mean", "min", "max"]))
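
If you want to control the names of the result columns, pandas also supports named aggregation. A minimal sketch on the same data; the output column names (`mean_salary`, `max_salary`, `n_people`) are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Berlin", "London", "Berlin", "Paris", "London"],
    "salary": [50000, 72000, 61000, 90000, 68000],
})

# new_column_name=(source_column, aggregation)
summary = df.groupby("city").agg(
    mean_salary=("salary", "mean"),
    max_salary=("salary", "max"),
    n_people=("salary", "count"),
)
print(summary)
```

The result is a DataFrame indexed by city with one column per named aggregation.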

Converting to NumPy for ML

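scikit-learn accepts NumPy arrays (and, in most cases, DataFrames directly), while PyTorch works on tensors, which are typically built from NumPy arrays. Converting a DataFrame is one call.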

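# Rebuild a small DataFrame that has both numeric columns
df = pd.DataFrame({"age": [25, 32, 28, 45], "salary": [50000, 72000, 61000, 90000]})

# Get the numeric columns as a NumPy array
X = df[["age", "salary"]].to_numpy()
print(X.shape)  # (4, 2)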
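
In practice you usually split the data into a feature matrix and a target vector before training. The sketch below assumes, purely for illustration, that `age` is the feature and `salary` is the target. It also shows what happens if a text column slips into the conversion.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, 32, 28, 45],
    "salary": [50000, 72000, 61000, 90000],
    "city": ["Berlin", "London", "Paris", "Berlin"],
})

# Features as a 2D array, target as a 1D array (hypothetical choice of columns)
X = df[["age"]].to_numpy(dtype=float)
y = df["salary"].to_numpy()
print(X.shape, y.shape)   # (4, 1) (4,)

# Mixing text and numbers gives a generic object array
mixed = df.to_numpy()
print(mixed.dtype)        # object

# select_dtypes keeps only the numeric columns
numeric = df.select_dtypes(include="number").to_numpy()
print(numeric.shape)      # (4, 2)
```

Note the double brackets in `df[["age"]]`: they keep the result 2D, which is the shape most estimators expect for features.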

Quick Reference

Task                Code
Create array        np.array([1, 2, 3])
Array shape         a.shape
Mean                np.mean(a)
Filter array        a[a > 5]
Create DataFrame    pd.DataFrame(dict)
Check nulls         df.isnull().sum()
Fill nulls          df.fillna(value)
Filter rows         df[df["col"] > val]
Group + aggregate   df.groupby("col")["val"].mean()
To NumPy            df.to_numpy()

What’s Next?

Now that you can handle data, it is time to train your first model. In the next article, you will use scikit-learn to build a real machine learning model from scratch.

Your First Machine Learning Model with scikit-learn