Pandas: Advanced Data Manipulation and Aggregation#

In this lesson, you will learn advanced data manipulation techniques using Pandas. Specifically, we will cover:

  • Combining dataframes

  • Data aggregation

  • Data reshaping

# Load dependencies (NumPy and Pandas) 
import pandas as pd
import numpy as np
# We will keep using the Iris dataset for this tutorial
iris_df = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/refs/heads/master/iris.csv")
iris_df
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 virginica
146 6.3 2.5 5.0 1.9 virginica
147 6.5 3.0 5.2 2.0 virginica
148 6.2 3.4 5.4 2.3 virginica
149 5.9 3.0 5.1 1.8 virginica

150 rows × 5 columns

Combine dataframes#

Concat: pd.concat()#

It allows you to concatenate pandas objects along a particular axis. See documentation for further details.

  • Concat rows

Here we combine two datasets with the same features (columns) but different observations.

# Create two dfs and vertically stack them.
df1 = pd.DataFrame(np.random.randn(3, 4), columns=["a", "b", "c", "d"])
df2 = pd.DataFrame(np.random.randn(3, 4), columns=["a", "b", "c", "d"])

print(df1)
print('-'*45)
print(df2)

df3 = pd.concat([df1, df2], axis=0)

print('-'*45)
print(df3)
          a         b         c         d
0  0.174450 -1.018138  0.968461 -0.412483
1  0.703028 -0.594527  1.997723 -1.159000
2 -0.415667  0.290691  1.538168  0.236634
---------------------------------------------
          a         b         c         d
0  1.141372  0.539766  0.924608 -1.013957
1  1.333616 -0.966171  1.258023  0.185296
2 -0.060995  0.330361 -0.710879 -0.408728
---------------------------------------------
          a         b         c         d
0  0.174450 -1.018138  0.968461 -0.412483
1  0.703028 -0.594527  1.997723 -1.159000
2 -0.415667  0.290691  1.538168  0.236634
0  1.141372  0.539766  0.924608 -1.013957
1  1.333616 -0.966171  1.258023  0.185296
2 -0.060995  0.330361 -0.710879 -0.408728
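Notice that the stacked dataframe keeps the original index labels, so 0, 1, 2 appear twice. If you want a fresh index instead, pd.concat() accepts ignore_index=True. A minimal sketch:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randn(3, 4), columns=["a", "b", "c", "d"])
df2 = pd.DataFrame(np.random.randn(3, 4), columns=["a", "b", "c", "d"])

# ignore_index=True discards the original labels and
# relabels the stacked rows 0..5.
df3 = pd.concat([df1, df2], axis=0, ignore_index=True)
print(df3.index.tolist())  # [0, 1, 2, 3, 4, 5]
```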
  • Concat columns.

Here our datasets have the same IDs, for example, subjects or time points, but different measures (columns).

# Create two dfs and horizontally stack them (side by side).
df1 = pd.DataFrame(np.random.randn(3, 4), columns=["a", "b", "c", "d"])
df2 = pd.DataFrame(np.random.randn(3, 3), columns=["x", "y", "z"])

df4 = pd.concat([df1, df2], axis=1)

df4
a b c d x y z
0 0.358929 -0.851815 0.313507 0.727029 0.732533 -1.176725 0.359086
1 1.212873 -1.549474 0.040366 0.525790 1.353637 1.343218 -0.161630
2 -0.344516 -0.645047 -0.412381 0.107239 -0.375055 -0.723952 0.493891
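One thing to keep in mind: when concatenating column-wise, rows are aligned by index label, not by position. A small sketch with deliberately mismatched indexes:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2, 3]}, index=[0, 1, 2])
df2 = pd.DataFrame({"x": [10, 20, 30]}, index=[1, 2, 3])

# Rows are matched by index label; labels present in only one
# frame get NaN in the other frame's columns.
df4 = pd.concat([df1, df2], axis=1)
print(df4)
```

Here the result has four rows (labels 0 to 3), with NaN where a label exists in only one of the two frames.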

Merge: pd.merge()#

SQL-style joining of tables (dataframes)

Important parameters include:

  • how : type of merge {‘left’, ‘right’, ‘outer’, ‘inner’, ‘cross’}, default ‘inner’

  • on : names to join on. Normally it indicates the name of the column for matching up the observations.

See documentation for further details.

Personal note: This is probably one of the most useful functions in Pandas. I use it in almost every project where I have to combine different datasets (which is very common!).

Look at the following example:

# Create two tables, `left` and `right`.
left = pd.DataFrame({"key": ["jamie", "bill"], "lval": [15, 22]})
right = pd.DataFrame({"key": ["jamie", "bill", "asher"], "rval": [4, 5, 8]})

# Right join them on `key`, which keeps all records from the right table.
joined = pd.merge(left, right, on="key", how="right")

print('---left')
print(left)
print('\n---right')
print(right)
print('\n---joined')
joined
---left
     key  lval
0  jamie    15
1   bill    22

---right
     key  rval
0  jamie     4
1   bill     5
2  asher     8

---joined
key lval rval
0 jamie 15.0 4
1 bill 22.0 5
2 asher NaN 8
# Compare to the default merge, which is an inner join (only matching keys are kept)
pd.merge(left, right, on="key")
key lval rval
0 jamie 15 4
1 bill 22 5
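If you want to keep every record from both tables, use how="outer". Passing indicator=True also adds a _merge column that tells you where each row came from, which is handy for debugging joins. A sketch with the same tables:

```python
import pandas as pd

left = pd.DataFrame({"key": ["jamie", "bill"], "lval": [15, 22]})
right = pd.DataFrame({"key": ["jamie", "bill", "asher"], "rval": [4, 5, 8]})

# how="outer" keeps unmatched keys from both sides;
# indicator=True adds a "_merge" column with values
# "left_only", "right_only", or "both".
outer = pd.merge(left, right, on="key", how="outer", indicator=True)
print(outer)
```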

Join: join()#

An SQL-style join like merge(), but join() matches rows on the dataframes' indexes by default.

Below, we give our dataframes indexes and distinct column names.

See documentation for further details.

left = pd.DataFrame(
    {"A": ["A0", "A1", "A2"], "B": ["B0", "B1", "B2"]}, index=["K0", "K1", "K2"])
   
right = pd.DataFrame(
    {"C": ["C0", "C2", "C3"], "D": ["D0", "D2", "D3"]}, index=["K0", "K2", "K3"])
right.join(left)
C D A B
K0 C0 D0 A0 B0
K2 C2 D2 A2 B2
K3 C3 D3 NaN NaN
left.join(right)
A B C D
K0 A0 B0 C0 D0
K1 A1 B1 NaN NaN
K2 A2 B2 C2 D2
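join() also accepts a how parameter, just like merge(); the default is how="left". A sketch with the same frames:

```python
import pandas as pd

left = pd.DataFrame(
    {"A": ["A0", "A1", "A2"], "B": ["B0", "B1", "B2"]}, index=["K0", "K1", "K2"])
right = pd.DataFrame(
    {"C": ["C0", "C2", "C3"], "D": ["D0", "D2", "D3"]}, index=["K0", "K2", "K3"])

# how="inner" keeps only index labels shared by both frames;
# how="outer" keeps the union of labels (unmatched cells become NaN).
inner = left.join(right, how="inner")
outer = left.join(right, how="outer")
print(inner)
print(outer)
```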

Summary#

  • Use concat to combine based on shared indexes or columns.

  • Use merge if you want to combine datasets on a shared column (e.g., a subject ID).

  • Use join if you have shared indexes.

Data Aggregation#

Involves one or more of:

  • Splitting the data into groups

  • Applying a function to each group

  • Combining results

groupby() method#

It allows you to compute summary statistics (e.g., sum, mean) on groups of data, which is essential for summarizing and exploring grouped data.

  • Basic case: dataframe.groupby("column_name").aggregation_method()

# Dataframe --> group by species --> aggregate through the mean
iris_df.groupby("species").mean()
sepal_length sepal_width petal_length petal_width
species
setosa 5.006 3.428 1.462 0.246
versicolor 5.936 2.770 4.260 1.326
virginica 6.588 2.974 5.552 2.026
# Dataframe --> group by species --> aggregate through the minimum
iris_df.groupby("species").min()
sepal_length sepal_width petal_length petal_width
species
setosa 4.3 2.3 1.0 0.1
versicolor 4.9 2.0 3.0 1.0
virginica 4.9 2.2 4.5 1.4
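If you only care about one column, select it after groupby() so pandas does not aggregate the others. A minimal sketch on a toy dataframe (not the full Iris data):

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "virginica"],
    "sepal_length": [5.0, 5.2, 6.4, 6.6],
    "sepal_width": [3.5, 3.4, 3.0, 2.9],
})

# Selecting a single column after groupby() returns a Series
# and only computes the statistic for that column.
mean_sl = df.groupby("species")["sepal_length"].mean()
print(mean_sl)
```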

You can find a full list of aggregation methods here: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#built-in-aggregation-methods

  • More than one aggregation method: call the agg() method on the grouped dataframe

See https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#the-aggregate-method

iris_df.groupby("species").agg(['min', 'mean', "max", "count"])
sepal_length sepal_width petal_length petal_width
min mean max count min mean max count min mean max count min mean max count
species
setosa 4.3 5.006 5.8 50 2.3 3.428 4.4 50 1.0 1.462 1.9 50 0.1 0.246 0.6 50
versicolor 4.9 5.936 7.0 50 2.0 2.770 3.4 50 3.0 4.260 5.1 50 1.0 1.326 1.8 50
virginica 4.9 6.588 7.9 50 2.2 2.974 3.8 50 4.5 5.552 6.9 50 1.4 2.026 2.5 50
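The nested (column, statistic) header above can be awkward to work with. agg() also supports named aggregation, which gives each output column a flat name you choose. A sketch on a toy dataframe:

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "virginica"],
    "petal_width": [0.2, 0.4, 2.0, 2.4],
})

# Named aggregation: each keyword becomes a flat output column
# named after the keyword, computed as (source_column, statistic).
summary = df.groupby("species").agg(
    pw_min=("petal_width", "min"),
    pw_mean=("petal_width", "mean"),
    pw_max=("petal_width", "max"),
)
print(summary)
```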
  • Multiple columns

iris_df.loc[iris_df["petal_width"] >= iris_df["petal_width"].mean(), "petal_width_bin"] = "high"
iris_df.loc[iris_df["petal_width"] < iris_df["petal_width"].mean(), "petal_width_bin"] = "low"

iris_df.groupby(["species", "petal_width_bin"]).mean()
sepal_length sepal_width petal_length petal_width
species petal_width_bin
setosa low 5.0060 3.4280 1.4620 0.246
versicolor high 6.0675 2.8625 4.4225 1.400
low 5.4100 2.4000 3.6100 1.030
virginica high 6.5880 2.9740 5.5520 2.026
  • Multiple columns and multiple aggregation methods

iris_df.groupby(["species", "petal_width_bin"]).agg(['min', 'mean', "max", "count"])
sepal_length sepal_width petal_length petal_width
min mean max count min mean max count min mean max count min mean max count
species petal_width_bin
setosa low 4.3 5.0060 5.8 50 2.3 3.4280 4.4 50 1.0 1.4620 1.9 50 0.1 0.246 0.6 50
versicolor high 5.2 6.0675 7.0 40 2.2 2.8625 3.4 40 3.6 4.4225 5.1 40 1.2 1.400 1.8 40
low 4.9 5.4100 6.0 10 2.0 2.4000 2.7 10 3.0 3.6100 4.1 10 1.0 1.030 1.1 10
virginica high 4.9 6.5880 7.9 50 2.2 2.9740 3.8 50 4.5 5.5520 6.9 50 1.4 2.026 2.5 50
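Grouping by two columns produces a MultiIndex on the rows, as you can see above. If you prefer the group keys as ordinary columns, call reset_index() on the result. A sketch on a toy dataframe:

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "virginica"],
    "bin": ["low", "low", "high", "low"],
    "petal_width": [0.2, 0.4, 2.4, 1.4],
})

# Grouping by two columns yields a MultiIndex; reset_index()
# turns the group keys back into ordinary columns.
flat = df.groupby(["species", "bin"]).mean().reset_index()
print(flat)
```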

pd.pivot_table() function#

This function allows you to apply a function aggfunc to selected values grouped by columns. See documentation for further details.

Compute mean sepal length for each species:

pd.pivot_table(iris_df, values="sepal_length", columns=["species"], aggfunc="mean")
species setosa versicolor virginica
sepal_length 5.006 5.936 6.588
# Similar to:
iris_df.groupby("species")[["sepal_length"]].mean().T
species setosa versicolor virginica
sepal_length 5.006 5.936 6.588
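pivot_table() can also take an index argument and a list of aggregation functions. Note that in recent pandas versions, passing aggfunc as a string name (e.g. "mean") is preferred over a NumPy callable like np.mean. A sketch on a toy dataframe:

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "virginica"],
    "sepal_length": [5.0, 5.2, 6.4, 6.6],
})

# A list of string aggfuncs produces MultiIndex columns
# of the form (statistic, value_column).
table = pd.pivot_table(df, values="sepal_length", index="species",
                       aggfunc=["mean", "max"])
print(table)
```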

Reshaping Data#

pd.melt()#

It allows you to convert a dataframe to long format.

It is useful when you want one or more columns to act as identifier variables (id_vars), while all other columns are treated as measured variables (value_vars).

Personal note: This is another of the most useful functions in Pandas, which I also use in almost all projects, particularly when I have to plot things (see below).

From our original iris dataframe, say we want our species to be identifier variables, while the rest be different measures. We can do the following:

# This just drops the previously binarized petal_width column
iris_df = iris_df.drop(columns="petal_width_bin")
iris_melted = pd.melt(iris_df, id_vars="species")
iris_melted
species variable value
0 setosa sepal_length 5.1
1 setosa sepal_length 4.9
2 setosa sepal_length 4.7
3 setosa sepal_length 4.6
4 setosa sepal_length 5.0
... ... ... ...
595 virginica petal_width 2.3
596 virginica petal_width 1.9
597 virginica petal_width 2.0
598 virginica petal_width 2.3
599 virginica petal_width 1.8

600 rows × 3 columns
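melt() has a few more useful parameters: value_vars limits which columns are unpivoted, and var_name/value_name rename the default "variable" and "value" headers. A sketch on a toy dataframe:

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["setosa", "virginica"],
    "sepal_length": [5.1, 6.5],
    "petal_length": [1.4, 5.2],
})

# Only sepal_length is unpivoted; petal_length is dropped.
# var_name/value_name replace the default "variable"/"value" headers.
long_df = pd.melt(df, id_vars="species", value_vars=["sepal_length"],
                  var_name="measure", value_name="cm")
print(long_df)
```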

This is very useful if we want to plot both measures together, stratified by our identifier variable:

import seaborn as sns
sns.boxplot(x="variable", y="value", hue="species", data=iris_melted)
<Axes: xlabel='variable', ylabel='value'>
[Figure: boxplots of each measurement (variable) on the x-axis, with one box per species]

Practice exercises#

Exercise 47

1- Given the following two dataframes, df_patients and df_conditions, representing patient information and their diagnosed conditions in a hospital setting respectively, do the following:

1.1- Use the join method to add the df_conditions dataframe to df_patients.

  • See what happens when you use how='inner'. Which patients remain in the final dataframe?

  • See what happens when you use how='outer'. How does the result differ?

1.2- Use the concat function to vertically stack df_patients and df_conditions. Why might concatenating row-wise not be very useful here?

1.3- Use concat to combine df_patients and df_conditions column-wise. See if the result looks similar to join. What do you notice about alignment?

import pandas as pd

# DataFrame with patient information
data_patients = {
    'patient_id': [201, 202, 203, 204],
    'age': [55, 63, 45, 70],
    'weight': [68.0, 82.3, 74.5, 60.2]
}
df_patients = pd.DataFrame(data_patients)
df_patients.set_index('patient_id', inplace=True)

# DataFrame with medical condition details
data_conditions = {
    'patient_id': [201, 202, 205],
    'condition': ['Hypertension', 'Diabetes', 'Chronic Kidney Disease'],
    'treatment_plan': ['Medication', 'Insulin Therapy', 'Dialysis']
}
df_conditions = pd.DataFrame(data_conditions)
df_conditions.set_index('patient_id', inplace=True)
# Your answers from here

Exercise 48

Use a pivot table to compute the following statistics on sepal_width and petal_width grouped by species:

  • median

  • mean

# Your answers from here

Exercise 49

Given the following dataframe, which contains monthly patient visit counts for different departments, reshape it into a long format using pd.melt(), so that each row represents the patient count for a department in a particular month. Set the identifier variable as “Department” and the values column as “Patient_Count.” Check the documentation to figure out how to do this.

# Sample data
data = {
    'Department': ['Cardiology', 'Neurology', 'Oncology'],
    'Jan': [120, 80, 95],
    'Feb': [150, 85, 100],
    'Mar': [130, 90, 110]
}

# Create DataFrame
df = pd.DataFrame(data)
df
Department Jan Feb Mar
0 Cardiology 120 150 130
1 Neurology 80 85 90
2 Oncology 95 100 110
# Your answers from here