Pandas: Advanced Data Manipulation and Aggregation#
In this lesson, you will learn advanced data manipulation techniques using Pandas. Specifically, we will cover:
Combining dataframes
Data aggregation
Data reshaping
# Load dependencies (NumPy and Pandas)
import pandas as pd
import numpy as np
# We will keep using the Iris dataset for this tutorial
iris_df = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/refs/heads/master/iris.csv")
iris_df
|     | sepal_length | sepal_width | petal_length | petal_width | species |
|-----|--------------|-------------|--------------|-------------|-----------|
| 0   | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1   | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2   | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3   | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4   | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | virginica |

150 rows × 5 columns
Combine dataframes#
Concatenate: pd.concat()#
It allows you to concatenate pandas objects along a particular axis. See documentation for further details.
Concat rows
Here we combine two datasets that share the same features (columns) but contain different observations.
# Create two dfs and vertically stack them.
df1 = pd.DataFrame(np.random.randn(3, 4), columns=["a", "b", "c", "d"])
df2 = pd.DataFrame(np.random.randn(3, 4), columns=["a", "b", "c", "d"])
print(df1)
print('-'*45)
print(df2)
df3 = pd.concat([df1, df2], axis=0)
print('-'*45)
print(df3)
a b c d
0 0.174450 -1.018138 0.968461 -0.412483
1 0.703028 -0.594527 1.997723 -1.159000
2 -0.415667 0.290691 1.538168 0.236634
---------------------------------------------
a b c d
0 1.141372 0.539766 0.924608 -1.013957
1 1.333616 -0.966171 1.258023 0.185296
2 -0.060995 0.330361 -0.710879 -0.408728
---------------------------------------------
a b c d
0 0.174450 -1.018138 0.968461 -0.412483
1 0.703028 -0.594527 1.997723 -1.159000
2 -0.415667 0.290691 1.538168 0.236634
0 1.141372 0.539766 0.924608 -1.013957
1 1.333616 -0.966171 1.258023 0.185296
2 -0.060995 0.330361 -0.710879 -0.408728
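Note that concat keeps each dataframe's original row labels, so df3 ends up with a duplicated index (0, 1, 2 appearing twice). If you want a fresh 0 to n-1 index instead, you can pass ignore_index=True, as in this minimal sketch:
# Re-concatenate, discarding the original row labels so the result
# gets a fresh RangeIndex (0..5) instead of 0, 1, 2, 0, 1, 2.
df3_fresh = pd.concat([df1, df2], axis=0, ignore_index=True)
print(df3_fresh.index.tolist())  # [0, 1, 2, 3, 4, 5]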
Concat columns
Here our datasets share the same IDs (for example, subjects or time points) but have different measures (columns).
# Create two dfs and horizontally stack them.
df1 = pd.DataFrame(np.random.randn(3, 4), columns=["a", "b", "c", "d"])
df2 = pd.DataFrame(np.random.randn(3, 3), columns=["x", "y", "z"])
df4 = pd.concat([df1,df2], axis = 1)
df4
|   | a | b | c | d | x | y | z |
|---|---|---|---|---|---|---|---|
| 0 | 0.358929 | -0.851815 | 0.313507 | 0.727029 | 0.732533 | -1.176725 | 0.359086 |
| 1 | 1.212873 | -1.549474 | 0.040366 | 0.525790 | 1.353637 | 1.343218 | -0.161630 |
| 2 | -0.344516 | -0.645047 | -0.412381 | 0.107239 | -0.375055 | -0.723952 | 0.493891 |
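Column-wise concat aligns rows by their index, not by position; rows missing from one side are filled with NaN. A small sketch to illustrate (df6 is a hypothetical extra frame covering only rows 1 and 2):
# df6 only has index labels 1 and 2, so row 0 gets NaN in column e.
df6 = pd.DataFrame({"e": [10.0, 20.0]}, index=[1, 2])
print(pd.concat([df1, df6], axis=1))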
Merge: pd.merge()#
SQL-style joining of tables (dataframes).
Important parameters include:
how: the type of merge {'left', 'right', 'outer', 'inner', 'cross'}, default 'inner'
on: the name(s) of the column(s) to join on, used to match up observations between the two tables.
See documentation for further details.
Look at the following example:
# Create two tables, `left` and `right`.
left = pd.DataFrame({"key": ["jamie", "bill"], "lval": [15, 22]})
right = pd.DataFrame({"key": ["jamie", "bill", "asher"], "rval": [4, 5, 8]})
# Right join them on `key`, keeping all records from the right table.
joined = pd.merge(left, right, on="key", how="right")
print('---left')
print(left)
print('\n---right')
print(right)
print('\n---joined')
joined
---left
key lval
0 jamie 15
1 bill 22
---right
key rval
0 jamie 4
1 bill 5
2 asher 8
---joined
|   | key | lval | rval |
|---|-------|------|------|
| 0 | jamie | 15.0 | 4 |
| 1 | bill | 22.0 | 5 |
| 2 | asher | NaN | 8 |
# Compare to the default inner join, which keeps only matching keys
pd.merge(left, right, on="key")
|   | key | lval | rval |
|---|-------|------|------|
| 0 | jamie | 15 | 4 |
| 1 | bill | 22 | 5 |
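For completeness, an outer join keeps the union of keys from both tables, filling NaN wherever one side lacks a key. With these particular tables the result matches the right join above, because every key in left also appears in right:
# Outer join: union of keys from both tables.
pd.merge(left, right, on="key", how="outer")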
Join: join()#
An SQL-like join, but this one operates on the dataframes' indexes.
We give our dataframes indexes and distinct column names.
See documentation for further details.
left = pd.DataFrame(
{"A": ["A0", "A1", "A2"], "B": ["B0", "B1", "B2"]}, index=["K0", "K1", "K2"])
right = pd.DataFrame(
{"C": ["C0", "C2", "C3"], "D": ["D0", "D2", "D3"]}, index=["K0", "K2", "K3"])
right.join(left)
|    | C | D | A | B |
|----|----|----|-----|-----|
| K0 | C0 | D0 | A0 | B0 |
| K2 | C2 | D2 | A2 | B2 |
| K3 | C3 | D3 | NaN | NaN |
left.join(right)
|    | A | B | C | D |
|----|----|----|-----|-----|
| K0 | A0 | B0 | C0 | D0 |
| K1 | A1 | B1 | NaN | NaN |
| K2 | A2 | B2 | C2 | D2 |
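Like merge, join accepts a how parameter; the default is 'left', which keeps the calling dataframe's index. For instance, an outer join keeps the union of both indexes:
# Keep every index label from either dataframe (K0 through K3);
# missing values on either side become NaN.
left.join(right, how="outer")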
Summary#
Use concat to stack dataframes that share columns (row-wise) or share indexes (column-wise).
Use merge to combine datasets on a shared column (e.g., subject records).
Use join to combine datasets on shared indexes.
Data Aggregation#
Involves one or more of:
Splitting the data into groups
Applying a function to each group
Combining results
groupby() method#
It allows you to compute summary statistics (e.g., sum, mean) separately for each group of rows, which is essential for exploring grouped data.
Basic case:
dataframe.groupby("column_name").aggregation_method()
# Dataframe --> group by species --> aggregate through the mean
iris_df.groupby("species").mean()
| species | sepal_length | sepal_width | petal_length | petal_width |
|------------|-------|-------|-------|-------|
| setosa | 5.006 | 3.428 | 1.462 | 0.246 |
| versicolor | 5.936 | 2.770 | 4.260 | 1.326 |
| virginica | 6.588 | 2.974 | 5.552 | 2.026 |
# Dataframe --> group by species --> aggregate through the minimum
iris_df.groupby("species").min()
| species | sepal_length | sepal_width | petal_length | petal_width |
|------------|-----|-----|-----|-----|
| setosa | 4.3 | 2.3 | 1.0 | 0.1 |
| versicolor | 4.9 | 2.0 | 3.0 | 1.0 |
| virginica | 4.9 | 2.2 | 4.5 | 1.4 |
You can find a full list of aggregation methods here: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#built-in-aggregation-methods
More than one aggregation method: use the agg() method on the grouped dataframe.
See https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#the-aggregate-method
iris_df.groupby("species").agg(['min', 'mean', "max", "count"])
| species | sepal_length min | sepal_length mean | sepal_length max | sepal_length count | sepal_width min | sepal_width mean | sepal_width max | sepal_width count | petal_length min | petal_length mean | petal_length max | petal_length count | petal_width min | petal_width mean | petal_width max | petal_width count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| setosa | 4.3 | 5.006 | 5.8 | 50 | 2.3 | 3.428 | 4.4 | 50 | 1.0 | 1.462 | 1.9 | 50 | 0.1 | 0.246 | 0.6 | 50 |
| versicolor | 4.9 | 5.936 | 7.0 | 50 | 2.0 | 2.770 | 3.4 | 50 | 3.0 | 4.260 | 5.1 | 50 | 1.0 | 1.326 | 1.8 | 50 |
| virginica | 4.9 | 6.588 | 7.9 | 50 | 2.2 | 2.974 | 3.8 | 50 | 4.5 | 5.552 | 6.9 | 50 | 1.4 | 2.026 | 2.5 | 50 |
Multiple columns
# Bin petal_width into "high"/"low" relative to its overall mean
iris_df.loc[iris_df["petal_width"] >= iris_df["petal_width"].mean(), "petal_width_bin"] = "high"
iris_df.loc[iris_df["petal_width"] < iris_df["petal_width"].mean(), "petal_width_bin"] = "low"
# Group by species AND the new bin, then take the mean of each group
iris_df.groupby(["species", "petal_width_bin"]).mean()
| species | petal_width_bin | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|---|
| setosa | low | 5.0060 | 3.4280 | 1.4620 | 0.246 |
| versicolor | high | 6.0675 | 2.8625 | 4.4225 | 1.400 |
| versicolor | low | 5.4100 | 2.4000 | 3.6100 | 1.030 |
| virginica | high | 6.5880 | 2.9740 | 5.5520 | 2.026 |
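Grouping by several columns puts a MultiIndex on the result's rows. If you would rather work with ordinary columns downstream (for plotting or merging), one option is to flatten it with reset_index(), sketched below:
# Turn the (species, petal_width_bin) MultiIndex back into plain columns.
means = iris_df.groupby(["species", "petal_width_bin"]).mean().reset_index()
print(means.columns.tolist())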
Multiple columns and multiple aggregation methods
iris_df.groupby(["species", "petal_width_bin"]).agg(['min', 'mean', "max", "count"])
| species | petal_width_bin | sepal_length min | sepal_length mean | sepal_length max | sepal_length count | sepal_width min | sepal_width mean | sepal_width max | sepal_width count | petal_length min | petal_length mean | petal_length max | petal_length count | petal_width min | petal_width mean | petal_width max | petal_width count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| setosa | low | 4.3 | 5.0060 | 5.8 | 50 | 2.3 | 3.4280 | 4.4 | 50 | 1.0 | 1.4620 | 1.9 | 50 | 0.1 | 0.246 | 0.6 | 50 |
| versicolor | high | 5.2 | 6.0675 | 7.0 | 40 | 2.2 | 2.8625 | 3.4 | 40 | 3.6 | 4.4225 | 5.1 | 40 | 1.2 | 1.400 | 1.8 | 40 |
| versicolor | low | 4.9 | 5.4100 | 6.0 | 10 | 2.0 | 2.4000 | 2.7 | 10 | 3.0 | 3.6100 | 4.1 | 10 | 1.0 | 1.030 | 1.1 | 10 |
| virginica | high | 4.9 | 6.5880 | 7.9 | 50 | 2.2 | 2.9740 | 3.8 | 50 | 4.5 | 5.5520 | 6.9 | 50 | 1.4 | 2.026 | 2.5 | 50 |
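Passing a list to agg() also produces a MultiIndex over the columns. If you prefer flat, self-describing column names, pandas supports named aggregation, where each output column is declared as new_name=(input_column, function). A brief sketch (the output names here are arbitrary):
# Named aggregation: flat output columns with explicit names.
iris_df.groupby("species").agg(
    sepal_length_min=("sepal_length", "min"),
    sepal_length_mean=("sepal_length", "mean"),
    n_rows=("sepal_length", "count"),
)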
pd.pivot_table() function#
This function applies an aggregation function (aggfunc) to selected values, grouped by one or more columns. See documentation for further details.
Compute mean sepal length for each species:
# Recent pandas prefers the string "mean" over passing np.mean directly
pd.pivot_table(iris_df, values="sepal_length", columns=["species"], aggfunc="mean")
species | setosa | versicolor | virginica |
---|---|---|---|
sepal_length | 5.006 | 5.936 | 6.588 |
# Similar to:
iris_df.groupby("species")[["sepal_length"]].mean().T
species | setosa | versicolor | virginica |
---|---|---|---|
sepal_length | 5.006 | 5.936 | 6.588 |
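pivot_table also accepts an index parameter (to place the groups on the rows), and aggfunc can be a list of functions. A minimal sketch:
# Species as rows; one column block per aggregation function.
pd.pivot_table(iris_df, values="sepal_length", index="species", aggfunc=["mean", "median"])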
Reshaping Data#
pd.melt()#
It allows you to convert a dataframe to long format.
One or more columns are kept as identifier variables (id_vars), while all the other columns are treated as measured variables (value_vars).
From our original iris dataframe, say we want species to be the identifier variable and the rest to be treated as different measures. We can do the following:
# Drop the petal_width_bin column we created earlier
iris_df = iris_df.drop(columns="petal_width_bin")
iris_melted = pd.melt(iris_df, id_vars="species")
iris_melted
|     | species | variable | value |
|-----|---------|----------|-------|
| 0 | setosa | sepal_length | 5.1 |
| 1 | setosa | sepal_length | 4.9 |
| 2 | setosa | sepal_length | 4.7 |
| 3 | setosa | sepal_length | 4.6 |
| 4 | setosa | sepal_length | 5.0 |
| ... | ... | ... | ... |
| 595 | virginica | petal_width | 2.3 |
| 596 | virginica | petal_width | 1.9 |
| 597 | virginica | petal_width | 2.0 |
| 598 | virginica | petal_width | 2.3 |
| 599 | virginica | petal_width | 1.8 |

600 rows × 3 columns
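melt also lets you control the output: value_vars restricts which columns are unpivoted, while var_name and value_name rename the two generated columns. A minimal sketch:
# Melt only the petal measurements and give the generated
# columns more descriptive names.
pd.melt(iris_df, id_vars="species",
        value_vars=["petal_length", "petal_width"],
        var_name="measure", value_name="cm")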
This is very useful if we want to plot all the measures together, stratified by our identifier variable:
import seaborn as sns
sns.boxplot(x="variable", y="value", hue="species", data=iris_melted)
<Axes: xlabel='variable', ylabel='value'>
Practice exercises#
1- Given the following two dataframes, df_patients and df_conditions, representing patient information and their diagnosed conditions in a hospital setting respectively, do the following:
1.1- Use the join method to add the df_conditions dataframe to df_patients.
- See what happens when you use how='inner'. Which patients remain in the final dataframe?
- See what happens when you use how='outer'. How does the result differ?
1.2- Use the concat function to vertically stack df_patients and df_conditions. Why might concatenating row-wise not be very useful here?
1.3- Use concat to combine df_patients and df_conditions column-wise. See if the result looks similar to join. What do you notice about alignment?
import pandas as pd
# DataFrame with patient information
data_patients = {
'patient_id': [201, 202, 203, 204],
'age': [55, 63, 45, 70],
'weight': [68.0, 82.3, 74.5, 60.2]
}
df_patients = pd.DataFrame(data_patients)
df_patients.set_index('patient_id', inplace=True)
# DataFrame with medical condition details
data_conditions = {
'patient_id': [201, 202, 205],
'condition': ['Hypertension', 'Diabetes', 'Chronic Kidney Disease'],
'treatment_plan': ['Medication', 'Insulin Therapy', 'Dialysis']
}
df_conditions = pd.DataFrame(data_conditions)
df_conditions.set_index('patient_id', inplace=True)
# Your answers from here
2- Use a pivot table to compute the following statistics on sepal_width and petal_width, grouped by species:
- median
- mean
# Your answers from here
3- Given the following dataframe, which contains monthly patient visit counts for different departments, reshape it into long format using pd.melt(), so that each row represents the patient count for a department in a particular month. Set the identifier variable to "Department" and name the values column "Patient_Count". Check the documentation to figure out how to do this.
# Sample data
data = {
'Department': ['Cardiology', 'Neurology', 'Oncology'],
'Jan': [120, 80, 95],
'Feb': [150, 85, 100],
'Mar': [130, 90, 110]
}
# Create DataFrame
df = pd.DataFrame(data)
df
|   | Department | Jan | Feb | Mar |
|---|------------|-----|-----|-----|
| 0 | Cardiology | 120 | 150 | 130 |
| 1 | Neurology | 80 | 85 | 90 |
| 2 | Oncology | 95 | 100 | 110 |
# Your answers from here