Pandas: Advanced Data Manipulation and Aggregation#

In this lesson, you will learn advanced data manipulation techniques using Pandas. Specifically, we will cover:

  • Combining dataframes

  • Data aggregation

  • Data reshaping

# Load dependencies (NumPy and Pandas) 
import pandas as pd
import numpy as np
# We will keep using the Iris dataset for this tutorial
iris_df = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/refs/heads/master/iris.csv")
iris_df
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 virginica
146 6.3 2.5 5.0 1.9 virginica
147 6.5 3.0 5.2 2.0 virginica
148 6.2 3.4 5.4 2.3 virginica
149 5.9 3.0 5.1 1.8 virginica

150 rows × 5 columns

Combine dataframes#

Concat: pd.concat()#

It allows you to concatenate pandas objects along a particular axis. See documentation for further details.

  • Concat rows

Here we combine two datasets with the same features (columns) but different observations.

# Create two dfs and vertically stack them.
df1 = pd.DataFrame(np.random.randn(3, 4), columns=["a", "b", "c", "d"])
df2 = pd.DataFrame(np.random.randn(3, 4), columns=["a", "b", "c", "d"])

print(df1)
print('-'*45)
print(df2)

df3 = pd.concat([df1, df2], axis=0)

print('-'*45)
print(df3)
          a         b         c         d
0  0.174450 -1.018138  0.968461 -0.412483
1  0.703028 -0.594527  1.997723 -1.159000
2 -0.415667  0.290691  1.538168  0.236634
---------------------------------------------
          a         b         c         d
0  1.141372  0.539766  0.924608 -1.013957
1  1.333616 -0.966171  1.258023  0.185296
2 -0.060995  0.330361 -0.710879 -0.408728
---------------------------------------------
          a         b         c         d
0  0.174450 -1.018138  0.968461 -0.412483
1  0.703028 -0.594527  1.997723 -1.159000
2 -0.415667  0.290691  1.538168  0.236634
0  1.141372  0.539766  0.924608 -1.013957
1  1.333616 -0.966171  1.258023  0.185296
2 -0.060995  0.330361 -0.710879 -0.408728
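Notice that the stacked dataframe keeps the original index labels, so 0, 1, 2 appear twice. If you want a fresh index instead, pd.concat() accepts ignore_index=True. A minimal sketch:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randn(3, 4), columns=["a", "b", "c", "d"])
df2 = pd.DataFrame(np.random.randn(3, 4), columns=["a", "b", "c", "d"])

# ignore_index=True discards the original labels and
# relabels the stacked rows 0..5.
df3 = pd.concat([df1, df2], axis=0, ignore_index=True)
print(df3.index.tolist())  # [0, 1, 2, 3, 4, 5]
```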
  • Concat columns.

Here our datasets have the same IDs, for example, subjects or time points, but different measures (columns).

# Create two dfs and horizontally stack them (side by side).
df1 = pd.DataFrame(np.random.randn(3, 4), columns=["a", "b", "c", "d"])
df2 = pd.DataFrame(np.random.randn(3, 3), columns=["x", "y", "z"])

df4 = pd.concat([df1, df2], axis=1)

df4
a b c d x y z
0 0.358929 -0.851815 0.313507 0.727029 0.732533 -1.176725 0.359086
1 1.212873 -1.549474 0.040366 0.525790 1.353637 1.343218 -0.161630
2 -0.344516 -0.645047 -0.412381 0.107239 -0.375055 -0.723952 0.493891
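One thing to keep in mind: when concatenating column-wise, rows are aligned by index label, not by position. A small sketch with deliberately mismatched indexes:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2, 3]}, index=[0, 1, 2])
df2 = pd.DataFrame({"x": [10, 20, 30]}, index=[1, 2, 3])

# Rows are matched by index label; labels present in only one
# frame get NaN in the other frame's columns.
df4 = pd.concat([df1, df2], axis=1)
print(df4)
```

Here the result has four rows (labels 0 to 3), with NaN where a label exists in only one of the two frames.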

Merge: pd.merge()#

SQL-style joining of tables (dataframes)

Important parameters include:

  • how : type of merge {‘left’, ‘right’, ‘outer’, ‘inner’, ‘cross’}, default ‘inner’

  • on : names to join on. Normally it indicates the name of the column for matching up the observations.

See documentation for further details.

Personal note: This is probably one of the most useful functions in Pandas. I use it in almost every project where I have to combine different datasets (which is very common!).

Look at the following example:

# Create two tables, `left` and `right`.
left = pd.DataFrame({"key": ["jamie", "bill"], "lval": [15, 22]})
right = pd.DataFrame({"key": ["jamie", "bill", "asher"], "rval": [4, 5, 8]})

# Right join them on `key`, which keeps all records from the right table.
joined = pd.merge(left, right, on="key", how="right")

print('---left')
print(left)
print('\n---right')
print(right)
print('\n---joined')
joined
---left
     key  lval
0  jamie    15
1   bill    22

---right
     key  rval
0  jamie     4
1   bill     5
2  asher     8

---joined
key lval rval
0 jamie 15.0 4
1 bill 22.0 5
2 asher NaN 8
# Compare to the default merge, which is an inner join (only matching keys are kept)
pd.merge(left, right, on="key")
key lval rval
0 jamie 15 4
1 bill 22 5
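If you want to keep every record from both tables, use how="outer". Passing indicator=True also adds a _merge column that tells you where each row came from, which is handy for debugging joins. A sketch with the same tables:

```python
import pandas as pd

left = pd.DataFrame({"key": ["jamie", "bill"], "lval": [15, 22]})
right = pd.DataFrame({"key": ["jamie", "bill", "asher"], "rval": [4, 5, 8]})

# how="outer" keeps unmatched keys from both sides;
# indicator=True adds a "_merge" column with values
# "left_only", "right_only", or "both".
outer = pd.merge(left, right, on="key", how="outer", indicator=True)
print(outer)
```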

Join: join()#

An SQL-style join like merge(), but join() matches rows on the dataframes' indexes by default.

Below, we give our dataframes indexes and distinct column names.

See documentation for further details.

left = pd.DataFrame(
    {"A": ["A0", "A1", "A2"], "B": ["B0", "B1", "B2"]}, index=["K0", "K1", "K2"])
   
right = pd.DataFrame(
    {"C": ["C0", "C2", "C3"], "D": ["D0", "D2", "D3"]}, index=["K0", "K2", "K3"])
right.join(left)
C D A B
K0 C0 D0 A0 B0
K2 C2 D2 A2 B2
K3 C3 D3 NaN NaN
left.join(right)
A B C D
K0 A0 B0 C0 D0
K1 A1 B1 NaN NaN
K2 A2 B2 C2 D2
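join() also accepts a how parameter, just like merge(); the default is how="left". A sketch with the same frames:

```python
import pandas as pd

left = pd.DataFrame(
    {"A": ["A0", "A1", "A2"], "B": ["B0", "B1", "B2"]}, index=["K0", "K1", "K2"])
right = pd.DataFrame(
    {"C": ["C0", "C2", "C3"], "D": ["D0", "D2", "D3"]}, index=["K0", "K2", "K3"])

# how="inner" keeps only index labels shared by both frames;
# how="outer" keeps the union of labels (unmatched cells become NaN).
inner = left.join(right, how="inner")
outer = left.join(right, how="outer")
print(inner)
print(outer)
```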

Summary#

  • Use concat to combine based on shared indexes or columns.

  • Use merge if you want to combine datasets on a shared column (e.g., a subject ID).

  • Use join if you have shared indexes.

Data Aggregation#

Involves one or more of:

  • Splitting the data into groups

  • Applying a function to each group

  • Combining results

groupby() method#

It allows you to compute summary statistics (e.g., sum, mean) on groups of data, which is essential for summarizing and exploring grouped data.

  • Basic case: dataframe.groupby("column_name").aggregation_method()

# Dataframe --> group by species --> aggregate through the mean
iris_df.groupby("species").mean()
sepal_length sepal_width petal_length petal_width
species
setosa 5.006 3.428 1.462 0.246
versicolor 5.936 2.770 4.260 1.326
virginica 6.588 2.974 5.552 2.026
# Dataframe --> group by species --> aggregate through the minimum
iris_df.groupby("species").min()
sepal_length sepal_width petal_length petal_width
species
setosa 4.3 2.3 1.0 0.1
versicolor 4.9 2.0 3.0 1.0
virginica 4.9 2.2 4.5 1.4
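If you only care about one column, select it after groupby() so pandas does not aggregate the others. A minimal sketch on a toy dataframe (not the full Iris data):

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "virginica"],
    "sepal_length": [5.0, 5.2, 6.4, 6.6],
    "sepal_width": [3.5, 3.4, 3.0, 2.9],
})

# Selecting a single column after groupby() returns a Series
# and only computes the statistic for that column.
mean_sl = df.groupby("species")["sepal_length"].mean()
print(mean_sl)
```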

You can find a full list of aggregation methods here: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#built-in-aggregation-methods

  • More than one aggregation method: call the agg() method on the grouped dataframe

See https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#the-aggregate-method

iris_df.groupby("species").agg(['min', 'mean', "max", "count"])
sepal_length sepal_width petal_length petal_width
min mean max count min mean max count min mean max count min mean max count
species
setosa 4.3 5.006 5.8 50 2.3 3.428 4.4 50 1.0 1.462 1.9 50 0.1 0.246 0.6 50
versicolor 4.9 5.936 7.0 50 2.0 2.770 3.4 50 3.0 4.260 5.1 50 1.0 1.326 1.8 50
virginica 4.9 6.588 7.9 50 2.2 2.974 3.8 50 4.5 5.552 6.9 50 1.4 2.026 2.5 50
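The nested (column, statistic) header above can be awkward to work with. agg() also supports named aggregation, which gives each output column a flat name you choose. A sketch on a toy dataframe:

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "virginica"],
    "petal_width": [0.2, 0.4, 2.0, 2.4],
})

# Named aggregation: each keyword becomes a flat output column
# named after the keyword, computed as (source_column, statistic).
summary = df.groupby("species").agg(
    pw_min=("petal_width", "min"),
    pw_mean=("petal_width", "mean"),
    pw_max=("petal_width", "max"),
)
print(summary)
```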
  • Multiple columns

iris_df.loc[iris_df["petal_width"] >= iris_df["petal_width"].mean(), "petal_width_bin"] = "high"
iris_df.loc[iris_df["petal_width"] < iris_df["petal_width"].mean(), "petal_width_bin"] = "low"

iris_df.groupby(["species", "petal_width_bin"]).mean()
sepal_length sepal_width petal_length petal_width
species petal_width_bin
setosa low 5.0060 3.4280 1.4620 0.246
versicolor high 6.0675 2.8625 4.4225 1.400
low 5.4100 2.4000 3.6100 1.030
virginica high 6.5880 2.9740 5.5520 2.026
  • Multiple columns and multiple aggregation methods

iris_df.groupby(["species", "petal_width_bin"]).agg(['min', 'mean', "max", "count"])
sepal_length sepal_width petal_length petal_width
min mean max count min mean max count min mean max count min mean max count
species petal_width_bin
setosa low 4.3 5.0060 5.8 50 2.3 3.4280 4.4 50 1.0 1.4620 1.9 50 0.1 0.246 0.6 50
versicolor high 5.2 6.0675 7.0 40 2.2 2.8625 3.4 40 3.6 4.4225 5.1 40 1.2 1.400 1.8 40
low 4.9 5.4100 6.0 10 2.0 2.4000 2.7 10 3.0 3.6100 4.1 10 1.0 1.030 1.1 10
virginica high 4.9 6.5880 7.9 50 2.2 2.9740 3.8 50 4.5 5.5520 6.9 50 1.4 2.026 2.5 50
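Grouping by two columns produces a MultiIndex on the rows, as you can see above. If you prefer the group keys as ordinary columns, call reset_index() on the result. A sketch on a toy dataframe:

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "virginica"],
    "bin": ["low", "low", "high", "low"],
    "petal_width": [0.2, 0.4, 2.4, 1.4],
})

# Grouping by two columns yields a MultiIndex; reset_index()
# turns the group keys back into ordinary columns.
flat = df.groupby(["species", "bin"]).mean().reset_index()
print(flat)
```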

pd.pivot_table() function#

This function allows you to apply a function aggfunc to selected values grouped by columns. See documentation for further details.

Compute mean sepal length for each species:

pd.pivot_table(iris_df, values="sepal_length", columns=["species"], aggfunc="mean")
species setosa versicolor virginica
sepal_length 5.006 5.936 6.588
# Similar to:
iris_df.groupby("species")[["sepal_length"]].mean().T
species setosa versicolor virginica
sepal_length 5.006 5.936 6.588
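pivot_table() can also take an index argument and a list of aggregation functions. Note that in recent pandas versions, passing aggfunc as a string name (e.g. "mean") is preferred over a NumPy callable like np.mean. A sketch on a toy dataframe:

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "virginica"],
    "sepal_length": [5.0, 5.2, 6.4, 6.6],
})

# A list of string aggfuncs produces MultiIndex columns
# of the form (statistic, value_column).
table = pd.pivot_table(df, values="sepal_length", index="species",
                       aggfunc=["mean", "max"])
print(table)
```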

Reshaping Data#

pd.melt()#

It allows you to convert a dataframe to long format.

It is useful when you want one or more columns to act as identifier variables (id_vars), while all other columns are treated as measured variables (value_vars).

Personal note: This is another of the most useful functions in Pandas, which I also use in almost all projects, particularly when I have to plot things (see below).

From our original iris dataframe, say we want our species to be identifier variables, while the rest be different measures. We can do the following:

# This just drops the previously binarized petal_width column
iris_df = iris_df.drop(columns="petal_width_bin")
iris_melted = pd.melt(iris_df, id_vars="species")
iris_melted
species variable value
0 setosa sepal_length 5.1
1 setosa sepal_length 4.9
2 setosa sepal_length 4.7
3 setosa sepal_length 4.6
4 setosa sepal_length 5.0
... ... ... ...
595 virginica petal_width 2.3
596 virginica petal_width 1.9
597 virginica petal_width 2.0
598 virginica petal_width 2.3
599 virginica petal_width 1.8

600 rows × 3 columns
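melt() has a few more useful parameters: value_vars limits which columns are unpivoted, and var_name/value_name rename the default "variable" and "value" headers. A sketch on a toy dataframe:

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["setosa", "virginica"],
    "sepal_length": [5.1, 6.5],
    "petal_length": [1.4, 5.2],
})

# Only sepal_length is unpivoted; petal_length is dropped.
# var_name/value_name replace the default "variable"/"value" headers.
long_df = pd.melt(df, id_vars="species", value_vars=["sepal_length"],
                  var_name="measure", value_name="cm")
print(long_df)
```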

This is very useful if we want to plot both measures together, stratified by our identifier variable:

import seaborn as sns
sns.boxplot(x="variable", y="value", hue="species", data=iris_melted)
<Axes: xlabel='variable', ylabel='value'>
[Figure: boxplots of each measurement (variable) on the x-axis, with one box per species]

Practice exercises#

Exercise 47

1- Given the following two dataframes, df_patients and df_conditions, representing patient information and their diagnosed conditions in a hospital setting respectively, do the following:

1.1- Use the join method to add the df_conditions dataframe to df_patients.

  • See what happens when you use how='inner'. Which patients remain in the final dataframe?

  • See what happens when you use how='outer'. How does the result differ?

1.2- Use the concat function to vertically stack df_patients and df_conditions. Why might concatenating row-wise not be very useful here?

1.3- Use concat to combine df_patients and df_conditions column-wise. See if the result looks similar to join. What do you notice about alignment?

import pandas as pd

# DataFrame with patient information
data_patients = {
    'patient_id': [201, 202, 203, 204],
    'age': [55, 63, 45, 70],
    'weight': [68.0, 82.3, 74.5, 60.2]
}
df_patients = pd.DataFrame(data_patients)
df_patients.set_index('patient_id', inplace=True)

# DataFrame with medical condition details
data_conditions = {
    'patient_id': [201, 202, 205],
    'condition': ['Hypertension', 'Diabetes', 'Chronic Kidney Disease'],
    'treatment_plan': ['Medication', 'Insulin Therapy', 'Dialysis']
}
df_conditions = pd.DataFrame(data_conditions)
df_conditions.set_index('patient_id', inplace=True)
# Your answers from here

Exercise 48

Use a pivot table to compute the following statistics on sepal_width and petal_width grouped by species:

  • median

  • mean

# Your answers from here

Exercise 49

Given the following dataframe, which contains monthly patient visit counts for different departments, reshape it into a long format using pd.melt(), so that each row represents the patient count for a department in a particular month. Set the identifier variable as “Department” and the values column as “Patient_Count.” Check the documentation to figure out how to do this.

# Sample data
data = {
    'Department': ['Cardiology', 'Neurology', 'Oncology'],
    'Jan': [120, 80, 95],
    'Feb': [150, 85, 100],
    'Mar': [130, 90, 110]
}

# Create DataFrame
df = pd.DataFrame(data)
df
Department Jan Feb Mar
0 Cardiology 120 150 130
1 Neurology 80 85 90
2 Oncology 95 100 110
# Your answers from here