Pandas: Data Manipulation#
In the previous lesson, we studied how Pandas can be used to learn about your data through inspection and exploration.
In this lesson, you’ll learn how Pandas can help you process the data to prepare them for analysis.
Specifically, we will cover:
Advance filtering and subsetting.
The creation and removal of columns.
Transforming data through functions.
Sorting data.
# As always, we need to import Pandas
import pandas as pd
# We are going to keep using the Iris data
iris_df = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/refs/heads/master/iris.csv")
iris_df
sepal_length | sepal_width | petal_length | petal_width | species | |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
... | ... | ... | ... | ... | ... |
145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica |
146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica |
147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |
149 | 5.9 | 3.0 | 5.1 | 1.8 | virginica |
150 rows × 5 columns
Advanced Filtering and Subsetting#
Boolean Filtering#
It is very common to subset a dataframe based on some condition on the data.
Like with NumPy, Pandas also knows what to do if you pass a boolean structure.
iris_df.sepal_length >= 7.5
0 False
1 False
2 False
3 False
4 False
...
145 False
146 False
147 False
148 False
149 False
Name: sepal_length, Length: 150, dtype: bool
# Rows satisfying the above condition
iris_df.loc[iris_df.sepal_length >= 7.5,:]
sepal_length | sepal_width | petal_length | petal_width | species | |
---|---|---|---|---|---|
105 | 7.6 | 3.0 | 6.6 | 2.1 | virginica |
117 | 7.7 | 3.8 | 6.7 | 2.2 | virginica |
118 | 7.7 | 2.6 | 6.9 | 2.3 | virginica |
122 | 7.7 | 2.8 | 6.7 | 2.0 | virginica |
131 | 7.9 | 3.8 | 6.4 | 2.0 | virginica |
135 | 7.7 | 3.0 | 6.1 | 2.3 | virginica |
We can combine more than one conditions, similarly to NumPy:
iris_df.loc[(iris_df['sepal_length' ]>= 4.5) & (iris_df['sepal_length'] <= 4.7),:]
sepal_length | sepal_width | petal_length | petal_width | species | |
---|---|---|---|---|---|
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
6 | 4.6 | 3.4 | 1.4 | 0.3 | setosa |
22 | 4.6 | 3.6 | 1.0 | 0.2 | setosa |
29 | 4.7 | 3.2 | 1.6 | 0.2 | setosa |
41 | 4.5 | 2.3 | 1.3 | 0.3 | setosa |
47 | 4.6 | 3.2 | 1.4 | 0.2 | setosa |
And like with NumPy, we can use this boolean filtering to modify existing observations:
iris_copy = iris_df.copy()
iris_copy.loc[(iris_copy['sepal_length' ]>= 4.5) & (iris_copy['sepal_length'] <= 4.7), "sepal_length"] = "medium_length"
iris_copy
sepal_length | sepal_width | petal_length | petal_width | species | |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
2 | medium_length | 3.2 | 1.3 | 0.2 | setosa |
3 | medium_length | 3.1 | 1.5 | 0.2 | setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
... | ... | ... | ... | ... | ... |
145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica |
146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica |
147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |
149 | 5.9 | 3.0 | 5.1 | 1.8 | virginica |
150 rows × 5 columns
Masking#
Here’s an example of masking using boolean conditions passed to the dataframe selector:
Here are the values for the feature sepal length
:
iris_df.sepal_length.values
array([5.1, 4.9, 4.7, 4.6, 5. , 5.4, 4.6, 5. , 4.4, 4.9, 5.4, 4.8, 4.8,
4.3, 5.8, 5.7, 5.4, 5.1, 5.7, 5.1, 5.4, 5.1, 4.6, 5.1, 4.8, 5. ,
5. , 5.2, 5.2, 4.7, 4.8, 5.4, 5.2, 5.5, 4.9, 5. , 5.5, 4.9, 4.4,
5.1, 5. , 4.5, 4.4, 5. , 5.1, 4.8, 5.1, 4.6, 5.3, 5. , 7. , 6.4,
6.9, 5.5, 6.5, 5.7, 6.3, 4.9, 6.6, 5.2, 5. , 5.9, 6. , 6.1, 5.6,
6.7, 5.6, 5.8, 6.2, 5.6, 5.9, 6.1, 6.3, 6.1, 6.4, 6.6, 6.8, 6.7,
6. , 5.7, 5.5, 5.5, 5.8, 6. , 5.4, 6. , 6.7, 6.3, 5.6, 5.5, 5.5,
6.1, 5.8, 5. , 5.6, 5.7, 5.7, 6.2, 5.1, 5.7, 6.3, 5.8, 7.1, 6.3,
6.5, 7.6, 4.9, 7.3, 6.7, 7.2, 6.5, 6.4, 6.8, 5.7, 5.8, 6.4, 6.5,
7.7, 7.7, 6. , 6.9, 5.6, 7.7, 6.3, 6.7, 7.2, 6.2, 6.1, 6.4, 7.2,
7.4, 7.9, 6.4, 6.3, 6.1, 7.7, 6.3, 6.4, 6. , 6.9, 6.7, 6.9, 5.8,
6.8, 6.7, 6.7, 6.3, 6.5, 6.2, 5.9])
And here are the boolean values generated by applying a comparison operator to those values:
mask = iris_df.sepal_length >= 7.5
mask.values
array([False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, False, False, False,
False, False, False, False, False, False, True, False, False,
False, False, False, False, False, False, False, False, False,
True, True, False, False, False, True, False, False, False,
False, False, False, False, False, True, False, False, False,
True, False, False, False, False, False, False, False, False,
False, False, False, False, False, False])
The two sets of values have the same shape.
We can now overlay the logical values over the numeric ones and keep only what is True
:
iris_df.sepal_length[mask].values
array([7.6, 7.7, 7.7, 7.7, 7.9, 7.7])
Creating columns#
It is typical to create new columns in a dataframe from existing ones.
Using brackets []
#
Like in dictionaries, you can create a new entry (column) in the dataframe using []
.
In this example, a new column (or field) is created by multiplying summing sepal_length
with sepal_width
:
iris_df['sepal_surface'] = iris_df.sepal_length * iris_df.sepal_width
iris_df.head()
sepal_length | sepal_width | petal_length | petal_width | species | sepal_surface | |
---|---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 17.85 |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 14.70 |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 15.04 |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 14.26 |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 18.00 |
Note that:
The left side has form: DataFrame name, bracket notation, new column name.
The assignment operator
=
is used.The right side contains an expression; here, two df columns are multiplied together.
Bracket notation also works on the fields, but it’s more typing:
iris_df['sepal_surface_2'] = iris_df['sepal_length'] * iris_df['sepal_width']
iris_df.head()
sepal_length | sepal_width | petal_length | petal_width | species | sepal_surface | sepal_surface_2 | |
---|---|---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 17.85 | 17.85 |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 14.70 | 14.70 |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 15.04 | 15.04 |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 14.26 | 14.26 |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 18.00 | 18.00 |
iris_df.'sepal_surface_3' = iris_df.sepal_length * iris_df.sepal_width
Cell In[13], line 1
iris_df.'sepal_surface_3' = iris_df.sepal_length * iris_df.sepal_width
^
SyntaxError: invalid syntax
iris_df.sepal_surface_3 = iris_df.sepal_length + iris_df.sepal_width
/tmp/ipykernel_28682/1425502001.py:1: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
iris_df.sepal_surface_3 = iris_df.sepal_length + iris_df.sepal_width
Using loc[]
#
We can also use loc to create new columns. It is basically very similar to using just brackets:
iris_df.loc[:, 'sepal_surface_loc'] = iris_df['sepal_length'] * iris_df['sepal_width']
iris_df
sepal_length | sepal_width | petal_length | petal_width | species | sepal_surface | sepal_surface_2 | sepal_surface_loc | |
---|---|---|---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 17.85 | 17.85 | 17.85 |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 14.70 | 14.70 | 14.70 |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 15.04 | 15.04 | 15.04 |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 14.26 | 14.26 | 14.26 |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 18.00 | 18.00 | 18.00 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica | 20.10 | 20.10 | 20.10 |
146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica | 15.75 | 15.75 | 15.75 |
147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica | 19.50 | 19.50 | 19.50 |
148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica | 21.08 | 21.08 | 21.08 |
149 | 5.9 | 3.0 | 5.1 | 1.8 | virginica | 17.70 | 17.70 | 17.70 |
150 rows × 8 columns
Notice the use of :
here, which applies the new column values across all rows. However, this isn’t always necessary—you can specify conditions to control which rows receive a particular value when creating a column.
Here for example, let’s create a binarized version of the sepal_surface property:
iris_df.loc[iris_df['sepal_surface'] >= 15, 'sepal_surface_bin'] = "high"
iris_df
sepal_length | sepal_width | petal_length | petal_width | species | sepal_surface | sepal_surface_2 | sepal_surface_loc | sepal_surface_bin | |
---|---|---|---|---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 17.85 | 17.85 | 17.85 | high |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 14.70 | 14.70 | 14.70 | NaN |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 15.04 | 15.04 | 15.04 | high |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 14.26 | 14.26 | 14.26 | NaN |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 18.00 | 18.00 | 18.00 | high |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica | 20.10 | 20.10 | 20.10 | high |
146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica | 15.75 | 15.75 | 15.75 | high |
147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica | 19.50 | 19.50 | 19.50 | high |
148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica | 21.08 | 21.08 | 21.08 | high |
149 | 5.9 | 3.0 | 5.1 | 1.8 | virginica | 17.70 | 17.70 | 17.70 | high |
150 rows × 9 columns
Unspecified observations will automatically be assigned a NaN value. However, you can always update these values later using boolean filtering to selectively assign values where needed:
iris_df.loc[(iris_df['sepal_length'] * iris_df['sepal_width']) < 15, 'sepal_surface_bin'] = "low"
iris_df
sepal_length | sepal_width | petal_length | petal_width | species | sepal_surface | sepal_surface_2 | sepal_surface_loc | sepal_surface_bin | |
---|---|---|---|---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 17.85 | 17.85 | 17.85 | high |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 14.70 | 14.70 | 14.70 | low |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 15.04 | 15.04 | 15.04 | high |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 14.26 | 14.26 | 14.26 | low |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 18.00 | 18.00 | 18.00 | high |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica | 20.10 | 20.10 | 20.10 | high |
146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica | 15.75 | 15.75 | 15.75 | high |
147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica | 19.50 | 19.50 | 19.50 | high |
148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica | 21.08 | 21.08 | 21.08 | high |
149 | 5.9 | 3.0 | 5.1 | 1.8 | virginica | 17.70 | 17.70 | 17.70 | high |
150 rows × 9 columns
Try it yourself with Practice Exercise 1 in this notebook!
Using assign
#
This method also allows you to create new columns in a given dataframe.
iris_df
sepal_length | sepal_width | petal_length | petal_width | species | sepal_surface | sepal_surface_2 | sepal_surface_loc | sepal_surface_bin | |
---|---|---|---|---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 17.85 | 17.85 | 17.85 | high |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 14.70 | 14.70 | 14.70 | low |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 15.04 | 15.04 | 15.04 | high |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 14.26 | 14.26 | 14.26 | low |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 18.00 | 18.00 | 18.00 | high |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica | 20.10 | 20.10 | 20.10 | high |
146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica | 15.75 | 15.75 | 15.75 | high |
147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica | 19.50 | 19.50 | 19.50 | high |
148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica | 21.08 | 21.08 | 21.08 | high |
149 | 5.9 | 3.0 | 5.1 | 1.8 | virginica | 17.70 | 17.70 | 17.70 | high |
150 rows × 9 columns
iris_df.assign(sepal_surface_assign = iris_df["sepal_length"] * iris_df["sepal_width"])
sepal_length | sepal_width | petal_length | petal_width | species | sepal_surface | sepal_surface_2 | sepal_surface_loc | sepal_surface_bin | sepal_surface_assign | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 17.85 | 17.85 | 17.85 | high | 17.85 |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 14.70 | 14.70 | 14.70 | low | 14.70 |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 15.04 | 15.04 | 15.04 | high | 15.04 |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 14.26 | 14.26 | 14.26 | low | 14.26 |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 18.00 | 18.00 | 18.00 | high | 18.00 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica | 20.10 | 20.10 | 20.10 | high | 20.10 |
146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica | 15.75 | 15.75 | 15.75 | high | 15.75 |
147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica | 19.50 | 19.50 | 19.50 | high | 19.50 |
148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica | 21.08 | 21.08 | 21.08 | high | 21.08 |
149 | 5.9 | 3.0 | 5.1 | 1.8 | virginica | 17.70 | 17.70 | 17.70 | high | 17.70 |
150 rows × 10 columns
Try it yourself with Practice Exercise 2 in this notebook!
Removing columns#
Using the reserverd keyword
del
to drop a DataFrame or single columns from the dataframe.
iris_df_drop = iris_df.copy()
iris_df_drop.head(2)
sepal_length | sepal_width | petal_length | petal_width | species | sepal_surface | sepal_surface_2 | sepal_surface_loc | sepal_surface_bin | |
---|---|---|---|---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 17.85 | 17.85 | 17.85 | high |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 14.70 | 14.70 | 14.70 | low |
# delete the column 'x'
del iris_df_drop['sepal_surface_2']
iris_df_drop
sepal_length | sepal_width | petal_length | petal_width | species | sepal_surface | sepal_surface_loc | sepal_surface_bin | |
---|---|---|---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 17.85 | 17.85 | high |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 14.70 | 14.70 | low |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 15.04 | 15.04 | high |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 14.26 | 14.26 | low |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 18.00 | 18.00 | high |
... | ... | ... | ... | ... | ... | ... | ... | ... |
145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica | 20.10 | 20.10 | high |
146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica | 15.75 | 15.75 | high |
147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica | 19.50 | 19.50 | high |
148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica | 21.08 | 21.08 | high |
149 | 5.9 | 3.0 | 5.1 | 1.8 | virginica | 17.70 | 17.70 | high |
150 rows × 8 columns
Using the method
drop()
to drop one or more columns by specifyingaxis
argument equal 1.
del
, here a new dataframe is created.
iris_df_drop.drop(['sepal_surface', 'species'], axis=1).head()
sepal_length | sepal_width | petal_length | petal_width | sepal_surface_loc | sepal_surface_bin | |
---|---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | 17.85 | high |
1 | 4.9 | 3.0 | 1.4 | 0.2 | 14.70 | low |
2 | 4.7 | 3.2 | 1.3 | 0.2 | 15.04 | high |
3 | 4.6 | 3.1 | 1.5 | 0.2 | 14.26 | low |
4 | 5.0 | 3.6 | 1.4 | 0.2 | 18.00 | high |
# The original dataframe is not modified
iris_df_drop.head()
sepal_length | sepal_width | petal_length | petal_width | species | sepal_surface | sepal_surface_loc | sepal_surface_bin | |
---|---|---|---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 17.85 | 17.85 | high |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 14.70 | 14.70 | low |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 15.04 | 15.04 | high |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 14.26 | 14.26 | low |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 18.00 | 18.00 | high |
# If we want to modify, we need to reassign the copy dataframe to it
iris_df_drop = iris_df_drop.drop(['sepal_surface', 'species'], axis=1)
iris_df_drop
sepal_length | sepal_width | petal_length | petal_width | sepal_surface_loc | sepal_surface_bin | |
---|---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | 17.85 | high |
1 | 4.9 | 3.0 | 1.4 | 0.2 | 14.70 | low |
2 | 4.7 | 3.2 | 1.3 | 0.2 | 15.04 | high |
3 | 4.6 | 3.1 | 1.5 | 0.2 | 14.26 | low |
4 | 5.0 | 3.6 | 1.4 | 0.2 | 18.00 | high |
... | ... | ... | ... | ... | ... | ... |
145 | 6.7 | 3.0 | 5.2 | 2.3 | 20.10 | high |
146 | 6.3 | 2.5 | 5.0 | 1.9 | 15.75 | high |
147 | 6.5 | 3.0 | 5.2 | 2.0 | 19.50 | high |
148 | 6.2 | 3.4 | 5.4 | 2.3 | 21.08 | high |
149 | 5.9 | 3.0 | 5.1 | 1.8 | 17.70 | high |
150 rows × 6 columns
Note that with this drop()
method you can also drop specific observations by setting axis
=0
# Now a particular observation
iris_df_drop = iris_df_drop.drop([0], axis=0)
iris_df_drop
sepal_length | sepal_width | petal_length | petal_width | sepal_surface_loc | sepal_surface_bin | |
---|---|---|---|---|---|---|
1 | 4.9 | 3.0 | 1.4 | 0.2 | 14.70 | low |
2 | 4.7 | 3.2 | 1.3 | 0.2 | 15.04 | high |
3 | 4.6 | 3.1 | 1.5 | 0.2 | 14.26 | low |
4 | 5.0 | 3.6 | 1.4 | 0.2 | 18.00 | high |
5 | 5.4 | 3.9 | 1.7 | 0.4 | 21.06 | high |
... | ... | ... | ... | ... | ... | ... |
145 | 6.7 | 3.0 | 5.2 | 2.3 | 20.10 | high |
146 | 6.3 | 2.5 | 5.0 | 1.9 | 15.75 | high |
147 | 6.5 | 3.0 | 5.2 | 2.0 | 19.50 | high |
148 | 6.2 | 3.4 | 5.4 | 2.3 | 21.08 | high |
149 | 5.9 | 3.0 | 5.1 | 1.8 | 17.70 | high |
149 rows × 6 columns
Transforming your data#
Sometimes, as part of preprocessing, you may need to apply a function to transform your dataframe.
The most straightforward way to do this is with the apply
method. Typical uses of this are the following:
Applying transformations or computations that are more complex than what’s available through built-in methods.
Reducing the dimensions of the data, thereby allowing for aggregation operations when used with a function that returns a single value.
Applying user-defined or lambda functions across rows or columns efficiently.
Let’s explore some of these uses:
Creation of new columns, particularly when these need to arise from a complex operation.
def my_func(x):
return x**3 - x**2 - x + 10
iris_df["sepal_length"].apply(my_func)
0 111.541
1 98.739
2 87.033
3 81.576
4 105.000
...
145 259.173
146 214.057
147 235.875
148 203.688
149 174.669
Name: sepal_length, Length: 150, dtype: float64
You can also use a lambda
function, which is basically a small anonymous function that can take any number of arguments, but can only have one expression.
lambda arguments : expression
See the documentation for more details.
iris_df["sepal_length"].apply(lambda x: x**3 - x**2 - x + 10)
0 111.541
1 98.739
2 87.033
3 81.576
4 105.000
...
145 259.173
146 214.057
147 235.875
148 203.688
149 174.669
Name: sepal_length, Length: 150, dtype: float64
apply
takes the argument axis
, which specifies the axis along which the function is applied. By default, axis
is set to 0, meaning the function is applied to each column.
import numpy as np
iris_df[["sepal_length", "petal_length"]].apply(np.mean)
sepal_length 5.843333
petal_length 3.758000
dtype: float64
But we can change this to apply a function to each row by setting axis
to 1:
import numpy
iris_df[["sepal_length", "petal_length"]].apply(numpy.mean, axis=1)
0 3.25
1 3.15
2 3.00
3 3.05
4 3.20
...
145 5.95
146 5.65
147 5.85
148 5.80
149 5.50
Length: 150, dtype: float64
iris_df["avg_length"] = iris_df[["sepal_length", "petal_length"]].apply(numpy.mean, axis=1)
iris_df["avg_width"] = iris_df[["sepal_width", "petal_width"]].apply(numpy.mean, axis=1)
iris_df.head()
sepal_length | sepal_width | petal_length | petal_width | species | sepal_surface | sepal_surface_2 | sepal_surface_loc | sepal_surface_bin | avg_length | avg_width | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 17.85 | 17.85 | 17.85 | high | 3.25 | 1.85 |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 14.70 | 14.70 | 14.70 | low | 3.15 | 1.60 |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 15.04 | 15.04 | 15.04 | high | 3.00 | 1.70 |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 14.26 | 14.26 | 14.26 | low | 3.05 | 1.65 |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 18.00 | 18.00 | 18.00 | high | 3.20 | 1.90 |
Try it yourself with Practice Exercise 3 in this notebook!
Sorting Data#
By values: sort_values()
#
You can customize this sorting with the following parameters:
by
parameter takes string or list of stringsascending
takes True or Falseinplace
will save sorted values into the df
Look at its documentation for further details.
iris_df.sort_values(by=['sepal_length','petal_width'])
sepal_length | sepal_width | petal_length | petal_width | species | sepal_surface | sepal_surface_2 | sepal_surface_loc | sepal_surface_bin | avg_length | avg_width | |
---|---|---|---|---|---|---|---|---|---|---|---|
13 | 4.3 | 3.0 | 1.1 | 0.1 | setosa | 12.90 | 12.90 | 12.90 | low | 2.70 | 1.55 |
8 | 4.4 | 2.9 | 1.4 | 0.2 | setosa | 12.76 | 12.76 | 12.76 | low | 2.90 | 1.55 |
38 | 4.4 | 3.0 | 1.3 | 0.2 | setosa | 13.20 | 13.20 | 13.20 | low | 2.85 | 1.60 |
42 | 4.4 | 3.2 | 1.3 | 0.2 | setosa | 14.08 | 14.08 | 14.08 | low | 2.85 | 1.70 |
41 | 4.5 | 2.3 | 1.3 | 0.3 | setosa | 10.35 | 10.35 | 10.35 | low | 2.90 | 1.30 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
122 | 7.7 | 2.8 | 6.7 | 2.0 | virginica | 21.56 | 21.56 | 21.56 | high | 7.20 | 2.40 |
117 | 7.7 | 3.8 | 6.7 | 2.2 | virginica | 29.26 | 29.26 | 29.26 | high | 7.20 | 3.00 |
118 | 7.7 | 2.6 | 6.9 | 2.3 | virginica | 20.02 | 20.02 | 20.02 | high | 7.30 | 2.45 |
135 | 7.7 | 3.0 | 6.1 | 2.3 | virginica | 23.10 | 23.10 | 23.10 | high | 6.90 | 2.65 |
131 | 7.9 | 3.8 | 6.4 | 2.0 | virginica | 30.02 | 30.02 | 30.02 | high | 7.15 | 2.90 |
150 rows × 11 columns
iris_df.sort_values(by=['sepal_length','petal_width'], ascending=False)
sepal_length | sepal_width | petal_length | petal_width | species | sepal_surface | sepal_surface_2 | sepal_surface_loc | sepal_surface_bin | avg_length | avg_width | |
---|---|---|---|---|---|---|---|---|---|---|---|
131 | 7.9 | 3.8 | 6.4 | 2.0 | virginica | 30.02 | 30.02 | 30.02 | high | 7.15 | 2.90 |
118 | 7.7 | 2.6 | 6.9 | 2.3 | virginica | 20.02 | 20.02 | 20.02 | high | 7.30 | 2.45 |
135 | 7.7 | 3.0 | 6.1 | 2.3 | virginica | 23.10 | 23.10 | 23.10 | high | 6.90 | 2.65 |
117 | 7.7 | 3.8 | 6.7 | 2.2 | virginica | 29.26 | 29.26 | 29.26 | high | 7.20 | 3.00 |
122 | 7.7 | 2.8 | 6.7 | 2.0 | virginica | 21.56 | 21.56 | 21.56 | high | 7.20 | 2.40 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
41 | 4.5 | 2.3 | 1.3 | 0.3 | setosa | 10.35 | 10.35 | 10.35 | low | 2.90 | 1.30 |
8 | 4.4 | 2.9 | 1.4 | 0.2 | setosa | 12.76 | 12.76 | 12.76 | low | 2.90 | 1.55 |
38 | 4.4 | 3.0 | 1.3 | 0.2 | setosa | 13.20 | 13.20 | 13.20 | low | 2.85 | 1.60 |
42 | 4.4 | 3.2 | 1.3 | 0.2 | setosa | 14.08 | 14.08 | 14.08 | low | 2.85 | 1.70 |
13 | 4.3 | 3.0 | 1.1 | 0.1 | setosa | 12.90 | 12.90 | 12.90 | low | 2.70 | 1.55 |
150 rows × 11 columns
By index: sort_index()
#
You can customize this sorting with the following parameters:
axis
along which to sort. The value 0 identifies the rows, and 1 identifies the columns.ascending
takes True or Falseinplace
will save sorted values into the df
Look at its documentation for further details.
iris_df.sort_index(axis=0)
sepal_length | sepal_width | petal_length | petal_width | species | sepal_surface | sepal_surface_2 | sepal_surface_loc | sepal_surface_bin | avg_length | avg_width | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 17.85 | 17.85 | 17.85 | high | 3.25 | 1.85 |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 14.70 | 14.70 | 14.70 | low | 3.15 | 1.60 |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 15.04 | 15.04 | 15.04 | high | 3.00 | 1.70 |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 14.26 | 14.26 | 14.26 | low | 3.05 | 1.65 |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 18.00 | 18.00 | 18.00 | high | 3.20 | 1.90 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica | 20.10 | 20.10 | 20.10 | high | 5.95 | 2.65 |
146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica | 15.75 | 15.75 | 15.75 | high | 5.65 | 2.20 |
147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica | 19.50 | 19.50 | 19.50 | high | 5.85 | 2.50 |
148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica | 21.08 | 21.08 | 21.08 | high | 5.80 | 2.85 |
149 | 5.9 | 3.0 | 5.1 | 1.8 | virginica | 17.70 | 17.70 | 17.70 | high | 5.50 | 2.40 |
150 rows × 11 columns
iris_df.sort_index(axis=0, ascending=False)
sepal_length | sepal_width | petal_length | petal_width | species | sepal_surface | sepal_surface_2 | sepal_surface_loc | sepal_surface_bin | avg_length | avg_width | |
---|---|---|---|---|---|---|---|---|---|---|---|
149 | 5.9 | 3.0 | 5.1 | 1.8 | virginica | 17.70 | 17.70 | 17.70 | high | 5.50 | 2.40 |
148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica | 21.08 | 21.08 | 21.08 | high | 5.80 | 2.85 |
147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica | 19.50 | 19.50 | 19.50 | high | 5.85 | 2.50 |
146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica | 15.75 | 15.75 | 15.75 | high | 5.65 | 2.20 |
145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica | 20.10 | 20.10 | 20.10 | high | 5.95 | 2.65 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 18.00 | 18.00 | 18.00 | high | 3.20 | 1.90 |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 14.26 | 14.26 | 14.26 | low | 3.05 | 1.65 |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 15.04 | 15.04 | 15.04 | high | 3.00 | 1.70 |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 14.70 | 14.70 | 14.70 | low | 3.15 | 1.60 |
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 17.85 | 17.85 | 17.85 | high | 3.25 | 1.85 |
150 rows × 11 columns
Practice exercises#
1- Create a new column named petal_length_cat
, where petal_length
values greater than the dataset’s average are labelled as “high” and those equal to or below the average are labeled as “low.” Then, print the count of occurrences for each category in this column.
# Your answers from here
2- Create new columns for both sepals and petals representing their aspect ratio (length divided by width). Use assign
for this.
Generate a scatterplot to show the relationship between these two new columns, and calculate the correlation coefficient between them.
Additionally, create a boxplot to illustrate differences in each of these new variables across species.
# Your answers from here
3- Use apply()
to append a new column that is the minimum of (petal_length, petal_width).
Print the head, tail of the new dataframe to check things look correct.
# Your answers from here