Pandas: Data Manipulation#

In the previous lesson, we studied how Pandas can be used to learn about your data through inspection and exploration.

In this lesson, you’ll learn how Pandas can help you process the data to prepare them for analysis.

Specifically, we will cover:

  • Advance filtering and subsetting.

  • The creation and removal of columns.

  • Transforming data through functions.

  • Sorting data.

# As always, we need to import Pandas
import pandas as pd
# We are going to keep using the Iris data
iris_df = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/refs/heads/master/iris.csv")
iris_df
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 virginica
146 6.3 2.5 5.0 1.9 virginica
147 6.5 3.0 5.2 2.0 virginica
148 6.2 3.4 5.4 2.3 virginica
149 5.9 3.0 5.1 1.8 virginica

150 rows × 5 columns

Advanced Filtering and Subsetting#

Boolean Filtering#

It is very common to subset a dataframe based on some condition on the data.

Like with NumPy, Pandas also knows what to do if you pass a boolean structure.

iris_df.sepal_length >= 7.5
0      False
1      False
2      False
3      False
4      False
       ...  
145    False
146    False
147    False
148    False
149    False
Name: sepal_length, Length: 150, dtype: bool
# Rows satisfying the above condition
iris_df.loc[iris_df.sepal_length >= 7.5,:]
sepal_length sepal_width petal_length petal_width species
105 7.6 3.0 6.6 2.1 virginica
117 7.7 3.8 6.7 2.2 virginica
118 7.7 2.6 6.9 2.3 virginica
122 7.7 2.8 6.7 2.0 virginica
131 7.9 3.8 6.4 2.0 virginica
135 7.7 3.0 6.1 2.3 virginica

We can combine more than one conditions, similarly to NumPy:

iris_df.loc[(iris_df['sepal_length' ]>= 4.5) & (iris_df['sepal_length'] <= 4.7),:]
sepal_length sepal_width petal_length petal_width species
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
6 4.6 3.4 1.4 0.3 setosa
22 4.6 3.6 1.0 0.2 setosa
29 4.7 3.2 1.6 0.2 setosa
41 4.5 2.3 1.3 0.3 setosa
47 4.6 3.2 1.4 0.2 setosa

And like with NumPy, we can use this boolean filtering to modify existing observations:

iris_copy = iris_df.copy()
iris_copy.loc[(iris_copy['sepal_length' ]>= 4.5) & (iris_copy['sepal_length'] <= 4.7), "sepal_length"] = "medium_length"
iris_copy
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 medium_length 3.2 1.3 0.2 setosa
3 medium_length 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 virginica
146 6.3 2.5 5.0 1.9 virginica
147 6.5 3.0 5.2 2.0 virginica
148 6.2 3.4 5.4 2.3 virginica
149 5.9 3.0 5.1 1.8 virginica

150 rows × 5 columns

Masking#

Here’s an example of masking using boolean conditions passed to the dataframe selector:

Here are the values for the feature sepal length:

iris_df.sepal_length.values
array([5.1, 4.9, 4.7, 4.6, 5. , 5.4, 4.6, 5. , 4.4, 4.9, 5.4, 4.8, 4.8,
       4.3, 5.8, 5.7, 5.4, 5.1, 5.7, 5.1, 5.4, 5.1, 4.6, 5.1, 4.8, 5. ,
       5. , 5.2, 5.2, 4.7, 4.8, 5.4, 5.2, 5.5, 4.9, 5. , 5.5, 4.9, 4.4,
       5.1, 5. , 4.5, 4.4, 5. , 5.1, 4.8, 5.1, 4.6, 5.3, 5. , 7. , 6.4,
       6.9, 5.5, 6.5, 5.7, 6.3, 4.9, 6.6, 5.2, 5. , 5.9, 6. , 6.1, 5.6,
       6.7, 5.6, 5.8, 6.2, 5.6, 5.9, 6.1, 6.3, 6.1, 6.4, 6.6, 6.8, 6.7,
       6. , 5.7, 5.5, 5.5, 5.8, 6. , 5.4, 6. , 6.7, 6.3, 5.6, 5.5, 5.5,
       6.1, 5.8, 5. , 5.6, 5.7, 5.7, 6.2, 5.1, 5.7, 6.3, 5.8, 7.1, 6.3,
       6.5, 7.6, 4.9, 7.3, 6.7, 7.2, 6.5, 6.4, 6.8, 5.7, 5.8, 6.4, 6.5,
       7.7, 7.7, 6. , 6.9, 5.6, 7.7, 6.3, 6.7, 7.2, 6.2, 6.1, 6.4, 7.2,
       7.4, 7.9, 6.4, 6.3, 6.1, 7.7, 6.3, 6.4, 6. , 6.9, 6.7, 6.9, 5.8,
       6.8, 6.7, 6.7, 6.3, 6.5, 6.2, 5.9])

And here are the boolean values generated by applying a comparison operator to those values:

mask = iris_df.sepal_length >= 7.5
mask.values
array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False,  True, False, False,
       False, False, False, False, False, False, False, False, False,
        True,  True, False, False, False,  True, False, False, False,
       False, False, False, False, False,  True, False, False, False,
        True, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False])

The two sets of values have the same shape.

We can now overlay the logical values over the numeric ones and keep only what is True:

iris_df.sepal_length[mask].values
array([7.6, 7.7, 7.7, 7.7, 7.9, 7.7])

Creating columns#

It is typical to create new columns in a dataframe from existing ones.

Using brackets []#

Like in dictionaries, you can create a new entry (column) in the dataframe using [].

In this example, a new column (or field) is created by multiplying summing sepal_length with sepal_width:

iris_df['sepal_surface'] = iris_df.sepal_length * iris_df.sepal_width

iris_df.head()
sepal_length sepal_width petal_length petal_width species sepal_surface
0 5.1 3.5 1.4 0.2 setosa 17.85
1 4.9 3.0 1.4 0.2 setosa 14.70
2 4.7 3.2 1.3 0.2 setosa 15.04
3 4.6 3.1 1.5 0.2 setosa 14.26
4 5.0 3.6 1.4 0.2 setosa 18.00

Note that:

  • The left side has form: DataFrame name, bracket notation, new column name.

  • The assignment operator = is used.

  • The right side contains an expression; here, two df columns are multiplied together.

Bracket notation also works on the fields, but it’s more typing:

iris_df['sepal_surface_2'] = iris_df['sepal_length'] * iris_df['sepal_width']

iris_df.head()
sepal_length sepal_width petal_length petal_width species sepal_surface sepal_surface_2
0 5.1 3.5 1.4 0.2 setosa 17.85 17.85
1 4.9 3.0 1.4 0.2 setosa 14.70 14.70
2 4.7 3.2 1.3 0.2 setosa 15.04 15.04
3 4.6 3.1 1.5 0.2 setosa 14.26 14.26
4 5.0 3.6 1.4 0.2 setosa 18.00 18.00
Beware: The bracket notation must be used when assigning to a new column. The following examples will break:
iris_df.'sepal_surface_3' = iris_df.sepal_length * iris_df.sepal_width
  Cell In[13], line 1
    iris_df.'sepal_surface_3' = iris_df.sepal_length * iris_df.sepal_width
            ^
SyntaxError: invalid syntax
iris_df.sepal_surface_3 = iris_df.sepal_length + iris_df.sepal_width
/tmp/ipykernel_28682/1425502001.py:1: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
  iris_df.sepal_surface_3 = iris_df.sepal_length + iris_df.sepal_width

Using loc[]#

We can also use loc to create new columns. It is basically very similar to using just brackets:

iris_df.loc[:, 'sepal_surface_loc'] = iris_df['sepal_length'] * iris_df['sepal_width']
iris_df
sepal_length sepal_width petal_length petal_width species sepal_surface sepal_surface_2 sepal_surface_loc
0 5.1 3.5 1.4 0.2 setosa 17.85 17.85 17.85
1 4.9 3.0 1.4 0.2 setosa 14.70 14.70 14.70
2 4.7 3.2 1.3 0.2 setosa 15.04 15.04 15.04
3 4.6 3.1 1.5 0.2 setosa 14.26 14.26 14.26
4 5.0 3.6 1.4 0.2 setosa 18.00 18.00 18.00
... ... ... ... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 virginica 20.10 20.10 20.10
146 6.3 2.5 5.0 1.9 virginica 15.75 15.75 15.75
147 6.5 3.0 5.2 2.0 virginica 19.50 19.50 19.50
148 6.2 3.4 5.4 2.3 virginica 21.08 21.08 21.08
149 5.9 3.0 5.1 1.8 virginica 17.70 17.70 17.70

150 rows × 8 columns

Notice the use of : here, which applies the new column values across all rows. However, this isn’t always necessary—you can specify conditions to control which rows receive a particular value when creating a column.

Here for example, let’s create a binarized version of the sepal_surface property:

iris_df.loc[iris_df['sepal_surface'] >= 15, 'sepal_surface_bin'] = "high"
iris_df
sepal_length sepal_width petal_length petal_width species sepal_surface sepal_surface_2 sepal_surface_loc sepal_surface_bin
0 5.1 3.5 1.4 0.2 setosa 17.85 17.85 17.85 high
1 4.9 3.0 1.4 0.2 setosa 14.70 14.70 14.70 NaN
2 4.7 3.2 1.3 0.2 setosa 15.04 15.04 15.04 high
3 4.6 3.1 1.5 0.2 setosa 14.26 14.26 14.26 NaN
4 5.0 3.6 1.4 0.2 setosa 18.00 18.00 18.00 high
... ... ... ... ... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 virginica 20.10 20.10 20.10 high
146 6.3 2.5 5.0 1.9 virginica 15.75 15.75 15.75 high
147 6.5 3.0 5.2 2.0 virginica 19.50 19.50 19.50 high
148 6.2 3.4 5.4 2.3 virginica 21.08 21.08 21.08 high
149 5.9 3.0 5.1 1.8 virginica 17.70 17.70 17.70 high

150 rows × 9 columns

Unspecified observations will automatically be assigned a NaN value. However, you can always update these values later using boolean filtering to selectively assign values where needed:

iris_df.loc[(iris_df['sepal_length'] * iris_df['sepal_width']) < 15, 'sepal_surface_bin'] = "low"
iris_df
sepal_length sepal_width petal_length petal_width species sepal_surface sepal_surface_2 sepal_surface_loc sepal_surface_bin
0 5.1 3.5 1.4 0.2 setosa 17.85 17.85 17.85 high
1 4.9 3.0 1.4 0.2 setosa 14.70 14.70 14.70 low
2 4.7 3.2 1.3 0.2 setosa 15.04 15.04 15.04 high
3 4.6 3.1 1.5 0.2 setosa 14.26 14.26 14.26 low
4 5.0 3.6 1.4 0.2 setosa 18.00 18.00 18.00 high
... ... ... ... ... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 virginica 20.10 20.10 20.10 high
146 6.3 2.5 5.0 1.9 virginica 15.75 15.75 15.75 high
147 6.5 3.0 5.2 2.0 virginica 19.50 19.50 19.50 high
148 6.2 3.4 5.4 2.3 virginica 21.08 21.08 21.08 high
149 5.9 3.0 5.1 1.8 virginica 17.70 17.70 17.70 high

150 rows × 9 columns

Try it yourself with Practice Exercise 1 in this notebook!

Using assign#

This method also allows you to create new columns in a given dataframe.

Info: This method is very useful when you want to create a new dataframe while maintaining the original as it was.
iris_df
sepal_length sepal_width petal_length petal_width species sepal_surface sepal_surface_2 sepal_surface_loc sepal_surface_bin
0 5.1 3.5 1.4 0.2 setosa 17.85 17.85 17.85 high
1 4.9 3.0 1.4 0.2 setosa 14.70 14.70 14.70 low
2 4.7 3.2 1.3 0.2 setosa 15.04 15.04 15.04 high
3 4.6 3.1 1.5 0.2 setosa 14.26 14.26 14.26 low
4 5.0 3.6 1.4 0.2 setosa 18.00 18.00 18.00 high
... ... ... ... ... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 virginica 20.10 20.10 20.10 high
146 6.3 2.5 5.0 1.9 virginica 15.75 15.75 15.75 high
147 6.5 3.0 5.2 2.0 virginica 19.50 19.50 19.50 high
148 6.2 3.4 5.4 2.3 virginica 21.08 21.08 21.08 high
149 5.9 3.0 5.1 1.8 virginica 17.70 17.70 17.70 high

150 rows × 9 columns

iris_df.assign(sepal_surface_assign = iris_df["sepal_length"] * iris_df["sepal_width"])
sepal_length sepal_width petal_length petal_width species sepal_surface sepal_surface_2 sepal_surface_loc sepal_surface_bin sepal_surface_assign
0 5.1 3.5 1.4 0.2 setosa 17.85 17.85 17.85 high 17.85
1 4.9 3.0 1.4 0.2 setosa 14.70 14.70 14.70 low 14.70
2 4.7 3.2 1.3 0.2 setosa 15.04 15.04 15.04 high 15.04
3 4.6 3.1 1.5 0.2 setosa 14.26 14.26 14.26 low 14.26
4 5.0 3.6 1.4 0.2 setosa 18.00 18.00 18.00 high 18.00
... ... ... ... ... ... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 virginica 20.10 20.10 20.10 high 20.10
146 6.3 2.5 5.0 1.9 virginica 15.75 15.75 15.75 high 15.75
147 6.5 3.0 5.2 2.0 virginica 19.50 19.50 19.50 high 19.50
148 6.2 3.4 5.4 2.3 virginica 21.08 21.08 21.08 high 21.08
149 5.9 3.0 5.1 1.8 virginica 17.70 17.70 17.70 high 17.70

150 rows × 10 columns

Try it yourself with Practice Exercise 2 in this notebook!

Removing columns#

  • Using the reserverd keyword del to drop a DataFrame or single columns from the dataframe.

Beware: This deletion happens in-place, that is, the original datafame is modified.
iris_df_drop = iris_df.copy()
iris_df_drop.head(2)
sepal_length sepal_width petal_length petal_width species sepal_surface sepal_surface_2 sepal_surface_loc sepal_surface_bin
0 5.1 3.5 1.4 0.2 setosa 17.85 17.85 17.85 high
1 4.9 3.0 1.4 0.2 setosa 14.70 14.70 14.70 low
# delete the column 'x'
del iris_df_drop['sepal_surface_2']
iris_df_drop
sepal_length sepal_width petal_length petal_width species sepal_surface sepal_surface_loc sepal_surface_bin
0 5.1 3.5 1.4 0.2 setosa 17.85 17.85 high
1 4.9 3.0 1.4 0.2 setosa 14.70 14.70 low
2 4.7 3.2 1.3 0.2 setosa 15.04 15.04 high
3 4.6 3.1 1.5 0.2 setosa 14.26 14.26 low
4 5.0 3.6 1.4 0.2 setosa 18.00 18.00 high
... ... ... ... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 virginica 20.10 20.10 high
146 6.3 2.5 5.0 1.9 virginica 15.75 15.75 high
147 6.5 3.0 5.2 2.0 virginica 19.50 19.50 high
148 6.2 3.4 5.4 2.3 virginica 21.08 21.08 high
149 5.9 3.0 5.1 1.8 virginica 17.70 17.70 high

150 rows × 8 columns

  • Using the method drop() to drop one or more columns by specifying axis argument equal 1.

Info: In contrast to using del, here a new dataframe is created.
iris_df_drop.drop(['sepal_surface', 'species'], axis=1).head()
sepal_length sepal_width petal_length petal_width sepal_surface_loc sepal_surface_bin
0 5.1 3.5 1.4 0.2 17.85 high
1 4.9 3.0 1.4 0.2 14.70 low
2 4.7 3.2 1.3 0.2 15.04 high
3 4.6 3.1 1.5 0.2 14.26 low
4 5.0 3.6 1.4 0.2 18.00 high
# The original dataframe is not modified
iris_df_drop.head()
sepal_length sepal_width petal_length petal_width species sepal_surface sepal_surface_loc sepal_surface_bin
0 5.1 3.5 1.4 0.2 setosa 17.85 17.85 high
1 4.9 3.0 1.4 0.2 setosa 14.70 14.70 low
2 4.7 3.2 1.3 0.2 setosa 15.04 15.04 high
3 4.6 3.1 1.5 0.2 setosa 14.26 14.26 low
4 5.0 3.6 1.4 0.2 setosa 18.00 18.00 high
# If we want to modify, we need to reassign the copy dataframe to it 
iris_df_drop = iris_df_drop.drop(['sepal_surface', 'species'], axis=1)
iris_df_drop
sepal_length sepal_width petal_length petal_width sepal_surface_loc sepal_surface_bin
0 5.1 3.5 1.4 0.2 17.85 high
1 4.9 3.0 1.4 0.2 14.70 low
2 4.7 3.2 1.3 0.2 15.04 high
3 4.6 3.1 1.5 0.2 14.26 low
4 5.0 3.6 1.4 0.2 18.00 high
... ... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 20.10 high
146 6.3 2.5 5.0 1.9 15.75 high
147 6.5 3.0 5.2 2.0 19.50 high
148 6.2 3.4 5.4 2.3 21.08 high
149 5.9 3.0 5.1 1.8 17.70 high

150 rows × 6 columns

Note that with this drop() method you can also drop specific observations by setting axis=0

# Now a particular observation
iris_df_drop = iris_df_drop.drop([0], axis=0)
iris_df_drop
sepal_length sepal_width petal_length petal_width sepal_surface_loc sepal_surface_bin
1 4.9 3.0 1.4 0.2 14.70 low
2 4.7 3.2 1.3 0.2 15.04 high
3 4.6 3.1 1.5 0.2 14.26 low
4 5.0 3.6 1.4 0.2 18.00 high
5 5.4 3.9 1.7 0.4 21.06 high
... ... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 20.10 high
146 6.3 2.5 5.0 1.9 15.75 high
147 6.5 3.0 5.2 2.0 19.50 high
148 6.2 3.4 5.4 2.3 21.08 high
149 5.9 3.0 5.1 1.8 17.70 high

149 rows × 6 columns

Transforming your data#

Sometimes, as part of preprocessing, you may need to apply a function to transform your dataframe.

The most straightforward way to do this is with the apply method. Typical uses of this are the following:

  • Applying transformations or computations that are more complex than what’s available through built-in methods.

  • Reducing the dimensions of the data, thereby allowing for aggregation operations when used with a function that returns a single value.

  • Applying user-defined or lambda functions across rows or columns efficiently.

Let’s explore some of these uses:

  • Creation of new columns, particularly when these need to arise from a complex operation.

def my_func(x):
    return x**3 - x**2 - x + 10

iris_df["sepal_length"].apply(my_func)
0      111.541
1       98.739
2       87.033
3       81.576
4      105.000
        ...   
145    259.173
146    214.057
147    235.875
148    203.688
149    174.669
Name: sepal_length, Length: 150, dtype: float64

You can also use a lambda function, which is basically a small anonymous function that can take any number of arguments, but can only have one expression.

lambda arguments : expression

See the documentation for more details.

iris_df["sepal_length"].apply(lambda x: x**3 - x**2 - x + 10)
0      111.541
1       98.739
2       87.033
3       81.576
4      105.000
        ...   
145    259.173
146    214.057
147    235.875
148    203.688
149    174.669
Name: sepal_length, Length: 150, dtype: float64

apply takes the argument axis, which specifies the axis along which the function is applied. By default, axis is set to 0, meaning the function is applied to each column.

import numpy as np
iris_df[["sepal_length", "petal_length"]].apply(np.mean)
sepal_length    5.843333
petal_length    3.758000
dtype: float64

But we can change this to apply a function to each row by setting axis to 1:

import numpy
iris_df[["sepal_length", "petal_length"]].apply(numpy.mean, axis=1)
0      3.25
1      3.15
2      3.00
3      3.05
4      3.20
       ... 
145    5.95
146    5.65
147    5.85
148    5.80
149    5.50
Length: 150, dtype: float64
iris_df["avg_length"] = iris_df[["sepal_length", "petal_length"]].apply(numpy.mean, axis=1)
iris_df["avg_width"] = iris_df[["sepal_width", "petal_width"]].apply(numpy.mean, axis=1)

iris_df.head()
sepal_length sepal_width petal_length petal_width species sepal_surface sepal_surface_2 sepal_surface_loc sepal_surface_bin avg_length avg_width
0 5.1 3.5 1.4 0.2 setosa 17.85 17.85 17.85 high 3.25 1.85
1 4.9 3.0 1.4 0.2 setosa 14.70 14.70 14.70 low 3.15 1.60
2 4.7 3.2 1.3 0.2 setosa 15.04 15.04 15.04 high 3.00 1.70
3 4.6 3.1 1.5 0.2 setosa 14.26 14.26 14.26 low 3.05 1.65
4 5.0 3.6 1.4 0.2 setosa 18.00 18.00 18.00 high 3.20 1.90

Try it yourself with Practice Exercise 3 in this notebook!

Sorting Data#

By values: sort_values()#

You can customize this sorting with the following parameters:

  • by parameter takes string or list of strings

  • ascending takes True or False

  • inplace will save sorted values into the df

Look at its documentation for further details.

iris_df.sort_values(by=['sepal_length','petal_width'])
sepal_length sepal_width petal_length petal_width species sepal_surface sepal_surface_2 sepal_surface_loc sepal_surface_bin avg_length avg_width
13 4.3 3.0 1.1 0.1 setosa 12.90 12.90 12.90 low 2.70 1.55
8 4.4 2.9 1.4 0.2 setosa 12.76 12.76 12.76 low 2.90 1.55
38 4.4 3.0 1.3 0.2 setosa 13.20 13.20 13.20 low 2.85 1.60
42 4.4 3.2 1.3 0.2 setosa 14.08 14.08 14.08 low 2.85 1.70
41 4.5 2.3 1.3 0.3 setosa 10.35 10.35 10.35 low 2.90 1.30
... ... ... ... ... ... ... ... ... ... ... ...
122 7.7 2.8 6.7 2.0 virginica 21.56 21.56 21.56 high 7.20 2.40
117 7.7 3.8 6.7 2.2 virginica 29.26 29.26 29.26 high 7.20 3.00
118 7.7 2.6 6.9 2.3 virginica 20.02 20.02 20.02 high 7.30 2.45
135 7.7 3.0 6.1 2.3 virginica 23.10 23.10 23.10 high 6.90 2.65
131 7.9 3.8 6.4 2.0 virginica 30.02 30.02 30.02 high 7.15 2.90

150 rows × 11 columns

iris_df.sort_values(by=['sepal_length','petal_width'], ascending=False)
sepal_length sepal_width petal_length petal_width species sepal_surface sepal_surface_2 sepal_surface_loc sepal_surface_bin avg_length avg_width
131 7.9 3.8 6.4 2.0 virginica 30.02 30.02 30.02 high 7.15 2.90
118 7.7 2.6 6.9 2.3 virginica 20.02 20.02 20.02 high 7.30 2.45
135 7.7 3.0 6.1 2.3 virginica 23.10 23.10 23.10 high 6.90 2.65
117 7.7 3.8 6.7 2.2 virginica 29.26 29.26 29.26 high 7.20 3.00
122 7.7 2.8 6.7 2.0 virginica 21.56 21.56 21.56 high 7.20 2.40
... ... ... ... ... ... ... ... ... ... ... ...
41 4.5 2.3 1.3 0.3 setosa 10.35 10.35 10.35 low 2.90 1.30
8 4.4 2.9 1.4 0.2 setosa 12.76 12.76 12.76 low 2.90 1.55
38 4.4 3.0 1.3 0.2 setosa 13.20 13.20 13.20 low 2.85 1.60
42 4.4 3.2 1.3 0.2 setosa 14.08 14.08 14.08 low 2.85 1.70
13 4.3 3.0 1.1 0.1 setosa 12.90 12.90 12.90 low 2.70 1.55

150 rows × 11 columns

By index: sort_index()#

You can customize this sorting with the following parameters:

  • axis along which to sort. The value 0 identifies the rows, and 1 identifies the columns.

  • ascending takes True or False

  • inplace will save sorted values into the df

Look at its documentation for further details.

iris_df.sort_index(axis=0)
sepal_length sepal_width petal_length petal_width species sepal_surface sepal_surface_2 sepal_surface_loc sepal_surface_bin avg_length avg_width
0 5.1 3.5 1.4 0.2 setosa 17.85 17.85 17.85 high 3.25 1.85
1 4.9 3.0 1.4 0.2 setosa 14.70 14.70 14.70 low 3.15 1.60
2 4.7 3.2 1.3 0.2 setosa 15.04 15.04 15.04 high 3.00 1.70
3 4.6 3.1 1.5 0.2 setosa 14.26 14.26 14.26 low 3.05 1.65
4 5.0 3.6 1.4 0.2 setosa 18.00 18.00 18.00 high 3.20 1.90
... ... ... ... ... ... ... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 virginica 20.10 20.10 20.10 high 5.95 2.65
146 6.3 2.5 5.0 1.9 virginica 15.75 15.75 15.75 high 5.65 2.20
147 6.5 3.0 5.2 2.0 virginica 19.50 19.50 19.50 high 5.85 2.50
148 6.2 3.4 5.4 2.3 virginica 21.08 21.08 21.08 high 5.80 2.85
149 5.9 3.0 5.1 1.8 virginica 17.70 17.70 17.70 high 5.50 2.40

150 rows × 11 columns

iris_df.sort_index(axis=0, ascending=False)
sepal_length sepal_width petal_length petal_width species sepal_surface sepal_surface_2 sepal_surface_loc sepal_surface_bin avg_length avg_width
149 5.9 3.0 5.1 1.8 virginica 17.70 17.70 17.70 high 5.50 2.40
148 6.2 3.4 5.4 2.3 virginica 21.08 21.08 21.08 high 5.80 2.85
147 6.5 3.0 5.2 2.0 virginica 19.50 19.50 19.50 high 5.85 2.50
146 6.3 2.5 5.0 1.9 virginica 15.75 15.75 15.75 high 5.65 2.20
145 6.7 3.0 5.2 2.3 virginica 20.10 20.10 20.10 high 5.95 2.65
... ... ... ... ... ... ... ... ... ... ... ...
4 5.0 3.6 1.4 0.2 setosa 18.00 18.00 18.00 high 3.20 1.90
3 4.6 3.1 1.5 0.2 setosa 14.26 14.26 14.26 low 3.05 1.65
2 4.7 3.2 1.3 0.2 setosa 15.04 15.04 15.04 high 3.00 1.70
1 4.9 3.0 1.4 0.2 setosa 14.70 14.70 14.70 low 3.15 1.60
0 5.1 3.5 1.4 0.2 setosa 17.85 17.85 17.85 high 3.25 1.85

150 rows × 11 columns

Practice exercises#

Exercise 44

1- Create a new column named petal_length_cat, where petal_length values greater than the dataset’s average are labelled as “high” and those equal to or below the average are labeled as “low.” Then, print the count of occurrences for each category in this column.

# Your answers from here

Exercise 45

2- Create new columns for both sepals and petals representing their aspect ratio (length divided by width). Use assign for this.

Generate a scatterplot to show the relationship between these two new columns, and calculate the correlation coefficient between them.

Additionally, create a boxplot to illustrate differences in each of these new variables across species.

# Your answers from here

Exercise 46

3- Use apply() to append a new column that is the minimum of (petal_length, petal_width).

Print the head, tail of the new dataframe to check things look correct.

# Your answers from here