Before you train any machine learning model, you need to handle data. NumPy and Pandas are the two libraries you will use every day.
This is a practical crash course. No theory — just the operations you actually need.
Setup
pip install numpy pandas
Check versions:
import numpy as np
import pandas as pd
print(np.__version__) # 2.x
print(pd.__version__) # 2.x
NumPy: Arrays
NumPy gives you fast multi-dimensional arrays. They are much faster than Python lists for numerical work because they store elements contiguously in memory and run operations in compiled code.
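A quick, unscientific way to see the difference for yourself — exact timings vary by machine, so treat this as a sketch:

```python
import time
import numpy as np

n = 1_000_000
lst = list(range(n))
arr = np.arange(n)

# Square every element with a plain Python list comprehension
t0 = time.perf_counter()
squared_list = [x * x for x in lst]
list_time = time.perf_counter() - t0

# Same operation, vectorized in NumPy
t0 = time.perf_counter()
squared_arr = arr * arr
numpy_time = time.perf_counter() - t0

print(f"list: {list_time:.4f}s, numpy: {numpy_time:.4f}s")
```

On most machines the NumPy version is dramatically faster, and the gap grows with array size.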
Creating Arrays
import numpy as np
# From a list
a = np.array([1, 2, 3, 4, 5])
print(a) # [1 2 3 4 5]
print(a.dtype) # int64
print(a.shape) # (5,)
# 2D array (matrix)
m = np.array([[1, 2, 3],
              [4, 5, 6]])
print(m.shape) # (2, 3)
# Zeros, ones, ranges
zeros = np.zeros((3, 3))
ones = np.ones((2, 4))
r = np.arange(0, 10, 2) # [0 2 4 6 8]
linspace = np.linspace(0, 1, 5) # [0. 0.25 0.5 0.75 1. ]
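Another common pattern for building a 2D array is combining np.arange with reshape — a small sketch:

```python
import numpy as np

# Build a 2x3 matrix from a flat range of values
m = np.arange(1, 7).reshape(2, 3)
print(m)
# [[1 2 3]
#  [4 5 6]]
print(m.shape)  # (2, 3)
```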
Indexing and Slicing
a = np.array([10, 20, 30, 40, 50])
print(a[0]) # 10
print(a[-1]) # 50
print(a[1:4]) # [20 30 40]
print(a[::2]) # [10 30 50]
# 2D indexing
m = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(m[0, 1]) # 2 (row 0, col 1)
print(m[:, 1]) # [2 5 8] (all rows, col 1)
print(m[1:, :2]) # [[4 5], [7 8]]
Broadcasting
Broadcasting lets you do math between arrays of different shapes. This is used constantly in ML.
a = np.array([1, 2, 3])
# Add a scalar to every element
print(a + 10) # [11 12 13]
print(a * 2) # [2 4 6]
print(a ** 2) # [1 4 9]
# Broadcasting between 2D and 1D
m = np.ones((3, 3))
row = np.array([1, 2, 3])
print(m + row)
# [[2. 3. 4.]
#  [2. 3. 4.]
#  [2. 3. 4.]]
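The general rule: shapes are compared from the right, and any dimension of size 1 is stretched to match. That means a column vector of shape (3, 1) and a row of shape (3,) combine into a full 3x3 grid — a sketch:

```python
import numpy as np

col = np.array([[10], [20], [30]])  # shape (3, 1)
row = np.array([1, 2, 3])           # shape (3,)

# (3, 1) + (3,) broadcasts to (3, 3)
print(col + row)
# [[11 12 13]
#  [21 22 23]
#  [31 32 33]]
```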
Useful Math Operations
a = np.array([4, 1, 7, 2, 9, 3])
print(np.mean(a)) # 4.33 (approx)
print(np.std(a)) # 2.81 (approx, population std)
print(np.min(a)) # 1
print(np.max(a)) # 9
print(np.sum(a)) # 26
print(np.argmax(a)) # 4 (index of max value)
print(np.sort(a)) # [1 2 3 4 7 9]
Boolean Indexing (Very Common in ML)
a = np.array([5, 12, 3, 8, 15, 1])
# Get values greater than 6
mask = a > 6
print(mask) # [False True False True True False]
print(a[mask]) # [12 8 15]
# Shorter way
print(a[a > 6]) # [12 8 15]
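Masks combine with & (and), | (or), and ~ (not). The parentheses around each comparison are required, because these operators bind tighter than comparisons — a sketch:

```python
import numpy as np

a = np.array([5, 12, 3, 8, 15, 1])

# Values strictly between 4 and 13
print(a[(a > 4) & (a < 13)])     # [ 5 12  8]
# Everything outside that range
print(a[~((a > 4) & (a < 13))])  # [ 3 15  1]
```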
Pandas: DataFrames
Pandas gives you labeled 2D data — like a spreadsheet in Python. ML datasets almost always start as Pandas DataFrames.
Creating a DataFrame
import pandas as pd
data = {
    "name": ["Alex", "Sam", "Jordan", "Taylor"],
    "age": [25, 32, 28, 45],
    "salary": [50000, 72000, 61000, 90000],
    "city": ["Berlin", "London", "Paris", "Berlin"],
}
df = pd.DataFrame(data)
print(df)
Output:
     name  age  salary    city
0    Alex   25   50000  Berlin
1     Sam   32   72000  London
2  Jordan   28   61000   Paris
3  Taylor   45   90000  Berlin
Exploring Data
print(df.shape) # (4, 4)
print(df.dtypes) # column types
print(df.describe()) # stats for numeric columns
print(df.info()) # overview with null counts
print(df.head(2)) # first 2 rows
print(df.tail(2)) # last 2 rows
Selecting Columns and Rows
# Single column (returns Series)
print(df["age"])
# Multiple columns
print(df[["name", "salary"]])
# Filter rows
print(df[df["age"] > 28])
# Filter with multiple conditions
print(df[(df["age"] > 25) & (df["city"] == "Berlin")])
# loc: label-based
print(df.loc[0, "name"]) # Alex
print(df.loc[1:2, ["name", "salary"]])
# iloc: position-based
print(df.iloc[0, 0]) # Alex
print(df.iloc[:2, 1:3]) # rows 0-1, cols 1-2
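One subtlety worth internalizing: loc slices are inclusive of the end label, while iloc slices follow normal Python semantics and exclude the end position. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alex", "Sam", "Jordan", "Taylor"],
    "age": [25, 32, 28, 45],
})

print(len(df.loc[0:2]))   # 3 rows — end label 2 is included
print(len(df.iloc[0:2]))  # 2 rows — end position 2 is excluded
```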
Data Cleaning
This is 80% of real ML work.
# Create a messy dataset
data = {
    "age": [25, None, 28, 45, None],
    "salary": [50000, 72000, None, 90000, 61000],
    "city": ["Berlin", "London", "Paris", None, "Berlin"],
}
df = pd.DataFrame(data)
# Check for missing values
print(df.isnull().sum())
# age 2
# salary 1
# city 1
# Drop rows with any missing values
df_clean = df.dropna()
# Fill missing values
df["age"] = df["age"].fillna(df["age"].mean()) # fill with mean
df["city"] = df["city"].fillna("Unknown") # fill with string
# Drop duplicates
df = df.drop_duplicates()
# Rename columns
df = df.rename(columns={"salary": "annual_salary"})
# Change column type
df["age"] = df["age"].astype(int)
Aggregation
data = {
    "city": ["Berlin", "London", "Berlin", "Paris", "London"],
    "salary": [50000, 72000, 61000, 90000, 68000],
}
df = pd.DataFrame(data)
# Group by city, get mean salary
print(df.groupby("city")["salary"].mean())
# city
# Berlin 55500.0
# London 70000.0
# Paris 90000.0
# Multiple aggregations
print(df.groupby("city")["salary"].agg(["mean", "min", "max"]))
Converting to NumPy for ML
scikit-learn estimators expect array-like input — NumPy arrays work everywhere, and most estimators also accept DataFrames directly — while PyTorch works with tensors, which are easily built from NumPy arrays.
# Get the numeric columns as a NumPy array
# (assumes df still has the original "age" and "salary" columns;
# the aggregation example above only has "city" and "salary")
X = df[["age", "salary"]].to_numpy()
print(X.shape) # (n_rows, 2)
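One thing to watch for: to_numpy picks a single dtype that can hold every selected column, so mixing integer and float columns yields float64 (and mixing in strings yields object) — a sketch:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age": [25, 32, 28],
    "salary": [50000.0, 72000.0, 61000.0],
})

X = df[["age", "salary"]].to_numpy()
print(X.dtype)  # float64 — the int column is promoted to float
print(X.shape)  # (3, 2)
```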
Quick Reference
| Task | Code |
|---|---|
| Create array | np.array([1, 2, 3]) |
| Array shape | a.shape |
| Mean | np.mean(a) |
| Filter array | a[a > 5] |
| Create DataFrame | pd.DataFrame(dict) |
| Check nulls | df.isnull().sum() |
| Fill nulls | df.fillna(value) |
| Filter rows | df[df["col"] > val] |
| Group + aggregate | df.groupby("col")["col2"].mean() |
| To NumPy | df.to_numpy() |
What’s Next?
Now that you can handle data, it is time to train your first model. In the next article, you will use scikit-learn to build a real machine learning model from scratch.
Your First Machine Learning Model with scikit-learn