Pandas: Data Manipulation

Pandas: Data Manipulation#

In the previous lesson, we studied how Pandas can be used to learn about your data through inspection and exploration.

In this lesson, you’ll learn how Pandas can help you process the data to prepare them for analysis.

Specifically, we will cover:

Advance filtering and subsetting.
The creation and removal of columns.
Transforming data through functions.
Sorting data.

# As always, we need to import Pandas
import pandas as pd

# We are going to keep using the Iris data
iris_df = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/refs/heads/master/iris.csv")
iris_df

	sepal_length	sepal_width	petal_length	petal_width	species
0	5.1	3.5	1.4	0.2	setosa
1	4.9	3.0	1.4	0.2	setosa
2	4.7	3.2	1.3	0.2	setosa
3	4.6	3.1	1.5	0.2	setosa
4	5.0	3.6	1.4	0.2	setosa
...	...	...	...	...	...
145	6.7	3.0	5.2	2.3	virginica
146	6.3	2.5	5.0	1.9	virginica
147	6.5	3.0	5.2	2.0	virginica
148	6.2	3.4	5.4	2.3	virginica
149	5.9	3.0	5.1	1.8	virginica

150 rows × 5 columns

Advanced Filtering and Subsetting#

Boolean Filtering#

It is very common to subset a dataframe based on some condition on the data.

Like with NumPy, Pandas also knows what to do if you pass a boolean structure.

iris_df.sepal_length >= 7.5

    False
    False
    False
    False
    False
       ...  
  False
  False
  False
  False
  False
Name: sepal_length, Length: 150, dtype: bool

# Rows satisfying the above condition
iris_df.loc[iris_df.sepal_length >= 7.5,:]

	sepal_length	sepal_width	petal_length	petal_width	species
105	7.6	3.0	6.6	2.1	virginica
117	7.7	3.8	6.7	2.2	virginica
118	7.7	2.6	6.9	2.3	virginica
122	7.7	2.8	6.7	2.0	virginica
131	7.9	3.8	6.4	2.0	virginica
135	7.7	3.0	6.1	2.3	virginica

We can combine more than one conditions, similarly to NumPy:

iris_df.loc[(iris_df['sepal_length' ]>= 4.5) & (iris_df['sepal_length'] <= 4.7),:]

	sepal_length	sepal_width	petal_length	petal_width	species
2	4.7	3.2	1.3	0.2	setosa
3	4.6	3.1	1.5	0.2	setosa
6	4.6	3.4	1.4	0.3	setosa
22	4.6	3.6	1.0	0.2	setosa
29	4.7	3.2	1.6	0.2	setosa
41	4.5	2.3	1.3	0.3	setosa
47	4.6	3.2	1.4	0.2	setosa

And like with NumPy, we can use this boolean filtering to modify existing observations:

iris_copy = iris_df.copy()
iris_copy.loc[(iris_copy['sepal_length' ]>= 4.5) & (iris_copy['sepal_length'] <= 4.7), "sepal_length"] = "medium_length"
iris_copy

	sepal_length	sepal_width	petal_length	petal_width	species
0	5.1	3.5	1.4	0.2	setosa
1	4.9	3.0	1.4	0.2	setosa
2	medium_length	3.2	1.3	0.2	setosa
3	medium_length	3.1	1.5	0.2	setosa
4	5.0	3.6	1.4	0.2	setosa
...	...	...	...	...	...
145	6.7	3.0	5.2	2.3	virginica
146	6.3	2.5	5.0	1.9	virginica
147	6.5	3.0	5.2	2.0	virginica
148	6.2	3.4	5.4	2.3	virginica
149	5.9	3.0	5.1	1.8	virginica

150 rows × 5 columns

Masking#

Here’s an example of masking using boolean conditions passed to the dataframe selector:

Here are the values for the feature sepal length:

iris_df.sepal_length.values

array([5.1, 4.9, 4.7, 4.6, 5. , 5.4, 4.6, 5. , 4.4, 4.9, 5.4, 4.8, 4.8,
       4.3, 5.8, 5.7, 5.4, 5.1, 5.7, 5.1, 5.4, 5.1, 4.6, 5.1, 4.8, 5. ,
       5. , 5.2, 5.2, 4.7, 4.8, 5.4, 5.2, 5.5, 4.9, 5. , 5.5, 4.9, 4.4,
       5.1, 5. , 4.5, 4.4, 5. , 5.1, 4.8, 5.1, 4.6, 5.3, 5. , 7. , 6.4,
       6.9, 5.5, 6.5, 5.7, 6.3, 4.9, 6.6, 5.2, 5. , 5.9, 6. , 6.1, 5.6,
       6.7, 5.6, 5.8, 6.2, 5.6, 5.9, 6.1, 6.3, 6.1, 6.4, 6.6, 6.8, 6.7,
       6. , 5.7, 5.5, 5.5, 5.8, 6. , 5.4, 6. , 6.7, 6.3, 5.6, 5.5, 5.5,
       6.1, 5.8, 5. , 5.6, 5.7, 5.7, 6.2, 5.1, 5.7, 6.3, 5.8, 7.1, 6.3,
       6.5, 7.6, 4.9, 7.3, 6.7, 7.2, 6.5, 6.4, 6.8, 5.7, 5.8, 6.4, 6.5,
       7.7, 7.7, 6. , 6.9, 5.6, 7.7, 6.3, 6.7, 7.2, 6.2, 6.1, 6.4, 7.2,
       7.4, 7.9, 6.4, 6.3, 6.1, 7.7, 6.3, 6.4, 6. , 6.9, 6.7, 6.9, 5.8,
       6.8, 6.7, 6.7, 6.3, 6.5, 6.2, 5.9])

And here are the boolean values generated by applying a comparison operator to those values:

mask = iris_df.sepal_length >= 7.5

mask.values

array([False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False,  True, False, False,
       False, False, False, False, False, False, False, False, False,
        True,  True, False, False, False,  True, False, False, False,
       False, False, False, False, False,  True, False, False, False,
        True, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False])

The two sets of values have the same shape.

We can now overlay the logical values over the numeric ones and keep only what is True:

iris_df.sepal_length[mask].values

array([7.6, 7.7, 7.7, 7.7, 7.9, 7.7])

Creating columns#

It is typical to create new columns in a dataframe from existing ones.

Using brackets `[]`#

Like in dictionaries, you can create a new entry (column) in the dataframe using [].

In this example, a new column (or field) is created by multiplying summing sepal_length with sepal_width:

iris_df['sepal_surface'] = iris_df.sepal_length * iris_df.sepal_width

iris_df.head()

	sepal_length	sepal_width	petal_length	petal_width	species	sepal_surface
0	5.1	3.5	1.4	0.2	setosa	17.85
1	4.9	3.0	1.4	0.2	setosa	14.70
2	4.7	3.2	1.3	0.2	setosa	15.04
3	4.6	3.1	1.5	0.2	setosa	14.26
4	5.0	3.6	1.4	0.2	setosa	18.00

Note that:

The left side has form: DataFrame name, bracket notation, new column name.
The assignment operator = is used.
The right side contains an expression; here, two df columns are multiplied together.

Bracket notation also works on the fields, but it’s more typing:

iris_df['sepal_surface_2'] = iris_df['sepal_length'] * iris_df['sepal_width']

iris_df.head()

	sepal_length	sepal_width	petal_length	petal_width	species	sepal_surface	sepal_surface_2
0	5.1	3.5	1.4	0.2	setosa	17.85	17.85
1	4.9	3.0	1.4	0.2	setosa	14.70	14.70
2	4.7	3.2	1.3	0.2	setosa	15.04	15.04
3	4.6	3.1	1.5	0.2	setosa	14.26	14.26
4	5.0	3.6	1.4	0.2	setosa	18.00	18.00

Beware: The bracket notation must be used when assigning to a new column. The following examples will break:

iris_df.'sepal_surface_3' = iris_df.sepal_length * iris_df.sepal_width

  Cell In[13], line 1
    iris_df.'sepal_surface_3' = iris_df.sepal_length * iris_df.sepal_width
            ^
SyntaxError: invalid syntax

iris_df.sepal_surface_3 = iris_df.sepal_length + iris_df.sepal_width

/tmp/ipykernel_28682/1425502001.py:1: UserWarning: Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access
  iris_df.sepal_surface_3 = iris_df.sepal_length + iris_df.sepal_width

Using `loc[]`#

We can also use loc to create new columns. It is basically very similar to using just brackets:

iris_df.loc[:, 'sepal_surface_loc'] = iris_df['sepal_length'] * iris_df['sepal_width']
iris_df

	sepal_length	sepal_width	petal_length	petal_width	species	sepal_surface	sepal_surface_2	sepal_surface_loc
0	5.1	3.5	1.4	0.2	setosa	17.85	17.85	17.85
1	4.9	3.0	1.4	0.2	setosa	14.70	14.70	14.70
2	4.7	3.2	1.3	0.2	setosa	15.04	15.04	15.04
3	4.6	3.1	1.5	0.2	setosa	14.26	14.26	14.26
4	5.0	3.6	1.4	0.2	setosa	18.00	18.00	18.00
...	...	...	...	...	...	...	...	...
145	6.7	3.0	5.2	2.3	virginica	20.10	20.10	20.10
146	6.3	2.5	5.0	1.9	virginica	15.75	15.75	15.75
147	6.5	3.0	5.2	2.0	virginica	19.50	19.50	19.50
148	6.2	3.4	5.4	2.3	virginica	21.08	21.08	21.08
149	5.9	3.0	5.1	1.8	virginica	17.70	17.70	17.70

150 rows × 8 columns

Notice the use of : here, which applies the new column values across all rows. However, this isn’t always necessary—you can specify conditions to control which rows receive a particular value when creating a column.

Here for example, let’s create a binarized version of the sepal_surface property:

iris_df.loc[iris_df['sepal_surface'] >= 15, 'sepal_surface_bin'] = "high"
iris_df

	sepal_length	sepal_width	petal_length	petal_width	species	sepal_surface	sepal_surface_2	sepal_surface_loc	sepal_surface_bin
0	5.1	3.5	1.4	0.2	setosa	17.85	17.85	17.85	high
1	4.9	3.0	1.4	0.2	setosa	14.70	14.70	14.70	NaN
2	4.7	3.2	1.3	0.2	setosa	15.04	15.04	15.04	high
3	4.6	3.1	1.5	0.2	setosa	14.26	14.26	14.26	NaN
4	5.0	3.6	1.4	0.2	setosa	18.00	18.00	18.00	high
...	...	...	...	...	...	...	...	...	...
145	6.7	3.0	5.2	2.3	virginica	20.10	20.10	20.10	high
146	6.3	2.5	5.0	1.9	virginica	15.75	15.75	15.75	high
147	6.5	3.0	5.2	2.0	virginica	19.50	19.50	19.50	high
148	6.2	3.4	5.4	2.3	virginica	21.08	21.08	21.08	high
149	5.9	3.0	5.1	1.8	virginica	17.70	17.70	17.70	high

150 rows × 9 columns

Unspecified observations will automatically be assigned a NaN value. However, you can always update these values later using boolean filtering to selectively assign values where needed:

iris_df.loc[(iris_df['sepal_length'] * iris_df['sepal_width']) < 15, 'sepal_surface_bin'] = "low"
iris_df

	sepal_length	sepal_width	petal_length	petal_width	species	sepal_surface	sepal_surface_2	sepal_surface_loc	sepal_surface_bin
0	5.1	3.5	1.4	0.2	setosa	17.85	17.85	17.85	high
1	4.9	3.0	1.4	0.2	setosa	14.70	14.70	14.70	low
2	4.7	3.2	1.3	0.2	setosa	15.04	15.04	15.04	high
3	4.6	3.1	1.5	0.2	setosa	14.26	14.26	14.26	low
4	5.0	3.6	1.4	0.2	setosa	18.00	18.00	18.00	high
...	...	...	...	...	...	...	...	...	...
145	6.7	3.0	5.2	2.3	virginica	20.10	20.10	20.10	high
146	6.3	2.5	5.0	1.9	virginica	15.75	15.75	15.75	high
147	6.5	3.0	5.2	2.0	virginica	19.50	19.50	19.50	high
148	6.2	3.4	5.4	2.3	virginica	21.08	21.08	21.08	high
149	5.9	3.0	5.1	1.8	virginica	17.70	17.70	17.70	high

150 rows × 9 columns

Try it yourself with Practice Exercise 1 in this notebook!

Using `assign`#

This method also allows you to create new columns in a given dataframe.

Info: This method is very useful when you want to create a new dataframe while maintaining the original as it was.

iris_df

	sepal_length	sepal_width	petal_length	petal_width	species	sepal_surface	sepal_surface_2	sepal_surface_loc	sepal_surface_bin
0	5.1	3.5	1.4	0.2	setosa	17.85	17.85	17.85	high
1	4.9	3.0	1.4	0.2	setosa	14.70	14.70	14.70	low
2	4.7	3.2	1.3	0.2	setosa	15.04	15.04	15.04	high
3	4.6	3.1	1.5	0.2	setosa	14.26	14.26	14.26	low
4	5.0	3.6	1.4	0.2	setosa	18.00	18.00	18.00	high
...	...	...	...	...	...	...	...	...	...
145	6.7	3.0	5.2	2.3	virginica	20.10	20.10	20.10	high
146	6.3	2.5	5.0	1.9	virginica	15.75	15.75	15.75	high
147	6.5	3.0	5.2	2.0	virginica	19.50	19.50	19.50	high
148	6.2	3.4	5.4	2.3	virginica	21.08	21.08	21.08	high
149	5.9	3.0	5.1	1.8	virginica	17.70	17.70	17.70	high

150 rows × 9 columns

iris_df.assign(sepal_surface_assign = iris_df["sepal_length"] * iris_df["sepal_width"])

	sepal_length	sepal_width	petal_length	petal_width	species	sepal_surface	sepal_surface_2	sepal_surface_loc	sepal_surface_bin	sepal_surface_assign
0	5.1	3.5	1.4	0.2	setosa	17.85	17.85	17.85	high	17.85
1	4.9	3.0	1.4	0.2	setosa	14.70	14.70	14.70	low	14.70
2	4.7	3.2	1.3	0.2	setosa	15.04	15.04	15.04	high	15.04
3	4.6	3.1	1.5	0.2	setosa	14.26	14.26	14.26	low	14.26
4	5.0	3.6	1.4	0.2	setosa	18.00	18.00	18.00	high	18.00
...	...	...	...	...	...	...	...	...	...	...
145	6.7	3.0	5.2	2.3	virginica	20.10	20.10	20.10	high	20.10
146	6.3	2.5	5.0	1.9	virginica	15.75	15.75	15.75	high	15.75
147	6.5	3.0	5.2	2.0	virginica	19.50	19.50	19.50	high	19.50
148	6.2	3.4	5.4	2.3	virginica	21.08	21.08	21.08	high	21.08
149	5.9	3.0	5.1	1.8	virginica	17.70	17.70	17.70	high	17.70

150 rows × 10 columns

Try it yourself with Practice Exercise 2 in this notebook!

Removing columns#

Using the reserverd keyword del to drop a DataFrame or single columns from the dataframe.

Beware: This deletion happens in-place, that is, the original datafame is modified.

iris_df_drop = iris_df.copy()

iris_df_drop.head(2)

	sepal_length	sepal_width	petal_length	petal_width	species	sepal_surface	sepal_surface_2	sepal_surface_loc	sepal_surface_bin
0	5.1	3.5	1.4	0.2	setosa	17.85	17.85	17.85	high
1	4.9	3.0	1.4	0.2	setosa	14.70	14.70	14.70	low

# delete the column 'x'
del iris_df_drop['sepal_surface_2']

iris_df_drop

	sepal_length	sepal_width	petal_length	petal_width	species	sepal_surface	sepal_surface_loc	sepal_surface_bin
0	5.1	3.5	1.4	0.2	setosa	17.85	17.85	high
1	4.9	3.0	1.4	0.2	setosa	14.70	14.70	low
2	4.7	3.2	1.3	0.2	setosa	15.04	15.04	high
3	4.6	3.1	1.5	0.2	setosa	14.26	14.26	low
4	5.0	3.6	1.4	0.2	setosa	18.00	18.00	high
...	...	...	...	...	...	...	...	...
145	6.7	3.0	5.2	2.3	virginica	20.10	20.10	high
146	6.3	2.5	5.0	1.9	virginica	15.75	15.75	high
147	6.5	3.0	5.2	2.0	virginica	19.50	19.50	high
148	6.2	3.4	5.4	2.3	virginica	21.08	21.08	high
149	5.9	3.0	5.1	1.8	virginica	17.70	17.70	high

150 rows × 8 columns

Using the method drop() to drop one or more columns by specifying axis argument equal 1.

Info: In contrast to using del, here a new dataframe is created.

iris_df_drop.drop(['sepal_surface', 'species'], axis=1).head()

	sepal_length	sepal_width	petal_length	petal_width	sepal_surface_loc	sepal_surface_bin
0	5.1	3.5	1.4	0.2	17.85	high
1	4.9	3.0	1.4	0.2	14.70	low
2	4.7	3.2	1.3	0.2	15.04	high
3	4.6	3.1	1.5	0.2	14.26	low
4	5.0	3.6	1.4	0.2	18.00	high

# The original dataframe is not modified
iris_df_drop.head()

	sepal_length	sepal_width	petal_length	petal_width	species	sepal_surface	sepal_surface_loc	sepal_surface_bin
0	5.1	3.5	1.4	0.2	setosa	17.85	17.85	high
1	4.9	3.0	1.4	0.2	setosa	14.70	14.70	low
2	4.7	3.2	1.3	0.2	setosa	15.04	15.04	high
3	4.6	3.1	1.5	0.2	setosa	14.26	14.26	low
4	5.0	3.6	1.4	0.2	setosa	18.00	18.00	high

# If we want to modify, we need to reassign the copy dataframe to it 
iris_df_drop = iris_df_drop.drop(['sepal_surface', 'species'], axis=1)
iris_df_drop

	sepal_length	sepal_width	petal_length	petal_width	sepal_surface_loc	sepal_surface_bin
0	5.1	3.5	1.4	0.2	17.85	high
1	4.9	3.0	1.4	0.2	14.70	low
2	4.7	3.2	1.3	0.2	15.04	high
3	4.6	3.1	1.5	0.2	14.26	low
4	5.0	3.6	1.4	0.2	18.00	high
...	...	...	...	...	...	...
145	6.7	3.0	5.2	2.3	20.10	high
146	6.3	2.5	5.0	1.9	15.75	high
147	6.5	3.0	5.2	2.0	19.50	high
148	6.2	3.4	5.4	2.3	21.08	high
149	5.9	3.0	5.1	1.8	17.70	high

150 rows × 6 columns

Note that with this drop() method you can also drop specific observations by setting axis=0

# Now a particular observation
iris_df_drop = iris_df_drop.drop([0], axis=0)
iris_df_drop

	sepal_length	sepal_width	petal_length	petal_width	sepal_surface_loc	sepal_surface_bin
1	4.9	3.0	1.4	0.2	14.70	low
2	4.7	3.2	1.3	0.2	15.04	high
3	4.6	3.1	1.5	0.2	14.26	low
4	5.0	3.6	1.4	0.2	18.00	high
5	5.4	3.9	1.7	0.4	21.06	high
...	...	...	...	...	...	...
145	6.7	3.0	5.2	2.3	20.10	high
146	6.3	2.5	5.0	1.9	15.75	high
147	6.5	3.0	5.2	2.0	19.50	high
148	6.2	3.4	5.4	2.3	21.08	high
149	5.9	3.0	5.1	1.8	17.70	high

149 rows × 6 columns

Transforming your data#

Sometimes, as part of preprocessing, you may need to apply a function to transform your dataframe.

The most straightforward way to do this is with the apply method. Typical uses of this are the following:

Applying transformations or computations that are more complex than what’s available through built-in methods.
Reducing the dimensions of the data, thereby allowing for aggregation operations when used with a function that returns a single value.
Applying user-defined or lambda functions across rows or columns efficiently.

Let’s explore some of these uses:

Creation of new columns, particularly when these need to arise from a complex operation.

def my_func(x):
    return x**3 - x**2 - x + 10

iris_df["sepal_length"].apply(my_func)

    111.541
     98.739
     87.033
     81.576
    105.000
        ...   
  259.173
  214.057
  235.875
  203.688
  174.669
Name: sepal_length, Length: 150, dtype: float64

You can also use a lambda function, which is basically a small anonymous function that can take any number of arguments, but can only have one expression.

lambda arguments : expression

See the documentation for more details.

iris_df["sepal_length"].apply(lambda x: x**3 - x**2 - x + 10)

    111.541
     98.739
     87.033
     81.576
    105.000
        ...   
  259.173
  214.057
  235.875
  203.688
  174.669
Name: sepal_length, Length: 150, dtype: float64

apply takes the argument axis, which specifies the axis along which the function is applied. By default, axis is set to 0, meaning the function is applied to each column.

import numpy as np
iris_df[["sepal_length", "petal_length"]].apply(np.mean)

sepal_length    5.843333
petal_length    3.758000
dtype: float64

But we can change this to apply a function to each row by setting axis to 1:

import numpy
iris_df[["sepal_length", "petal_length"]].apply(numpy.mean, axis=1)

    3.25
    3.15
    3.00
    3.05
    3.20
       ... 
  5.95
  5.65
  5.85
  5.80
  5.50
Length: 150, dtype: float64

iris_df["avg_length"] = iris_df[["sepal_length", "petal_length"]].apply(numpy.mean, axis=1)
iris_df["avg_width"] = iris_df[["sepal_width", "petal_width"]].apply(numpy.mean, axis=1)

iris_df.head()

	sepal_length	sepal_width	petal_length	petal_width	species	sepal_surface	sepal_surface_2	sepal_surface_loc	sepal_surface_bin	avg_length	avg_width
0	5.1	3.5	1.4	0.2	setosa	17.85	17.85	17.85	high	3.25	1.85
1	4.9	3.0	1.4	0.2	setosa	14.70	14.70	14.70	low	3.15	1.60
2	4.7	3.2	1.3	0.2	setosa	15.04	15.04	15.04	high	3.00	1.70
3	4.6	3.1	1.5	0.2	setosa	14.26	14.26	14.26	low	3.05	1.65
4	5.0	3.6	1.4	0.2	setosa	18.00	18.00	18.00	high	3.20	1.90

Try it yourself with Practice Exercise 3 in this notebook!

Sorting Data#

By values: `sort_values()`#

You can customize this sorting with the following parameters:

by parameter takes string or list of strings
ascending takes True or False
inplace will save sorted values into the df

Look at its documentation for further details.

iris_df.sort_values(by=['sepal_length','petal_width'])

	sepal_length	sepal_width	petal_length	petal_width	species	sepal_surface	sepal_surface_2	sepal_surface_loc	sepal_surface_bin	avg_length	avg_width
13	4.3	3.0	1.1	0.1	setosa	12.90	12.90	12.90	low	2.70	1.55
8	4.4	2.9	1.4	0.2	setosa	12.76	12.76	12.76	low	2.90	1.55
38	4.4	3.0	1.3	0.2	setosa	13.20	13.20	13.20	low	2.85	1.60
42	4.4	3.2	1.3	0.2	setosa	14.08	14.08	14.08	low	2.85	1.70
41	4.5	2.3	1.3	0.3	setosa	10.35	10.35	10.35	low	2.90	1.30
...	...	...	...	...	...	...	...	...	...	...	...
122	7.7	2.8	6.7	2.0	virginica	21.56	21.56	21.56	high	7.20	2.40
117	7.7	3.8	6.7	2.2	virginica	29.26	29.26	29.26	high	7.20	3.00
118	7.7	2.6	6.9	2.3	virginica	20.02	20.02	20.02	high	7.30	2.45
135	7.7	3.0	6.1	2.3	virginica	23.10	23.10	23.10	high	6.90	2.65
131	7.9	3.8	6.4	2.0	virginica	30.02	30.02	30.02	high	7.15	2.90

150 rows × 11 columns

iris_df.sort_values(by=['sepal_length','petal_width'], ascending=False)

	sepal_length	sepal_width	petal_length	petal_width	species	sepal_surface	sepal_surface_2	sepal_surface_loc	sepal_surface_bin	avg_length	avg_width
131	7.9	3.8	6.4	2.0	virginica	30.02	30.02	30.02	high	7.15	2.90
118	7.7	2.6	6.9	2.3	virginica	20.02	20.02	20.02	high	7.30	2.45
135	7.7	3.0	6.1	2.3	virginica	23.10	23.10	23.10	high	6.90	2.65
117	7.7	3.8	6.7	2.2	virginica	29.26	29.26	29.26	high	7.20	3.00
122	7.7	2.8	6.7	2.0	virginica	21.56	21.56	21.56	high	7.20	2.40
...	...	...	...	...	...	...	...	...	...	...	...
41	4.5	2.3	1.3	0.3	setosa	10.35	10.35	10.35	low	2.90	1.30
8	4.4	2.9	1.4	0.2	setosa	12.76	12.76	12.76	low	2.90	1.55
38	4.4	3.0	1.3	0.2	setosa	13.20	13.20	13.20	low	2.85	1.60
42	4.4	3.2	1.3	0.2	setosa	14.08	14.08	14.08	low	2.85	1.70
13	4.3	3.0	1.1	0.1	setosa	12.90	12.90	12.90	low	2.70	1.55

150 rows × 11 columns

By index: `sort_index()`#

You can customize this sorting with the following parameters:

axis along which to sort. The value 0 identifies the rows, and 1 identifies the columns.
ascending takes True or False
inplace will save sorted values into the df

Look at its documentation for further details.

iris_df.sort_index(axis=0)

	sepal_length	sepal_width	petal_length	petal_width	species	sepal_surface	sepal_surface_2	sepal_surface_loc	sepal_surface_bin	avg_length	avg_width
0	5.1	3.5	1.4	0.2	setosa	17.85	17.85	17.85	high	3.25	1.85
1	4.9	3.0	1.4	0.2	setosa	14.70	14.70	14.70	low	3.15	1.60
2	4.7	3.2	1.3	0.2	setosa	15.04	15.04	15.04	high	3.00	1.70
3	4.6	3.1	1.5	0.2	setosa	14.26	14.26	14.26	low	3.05	1.65
4	5.0	3.6	1.4	0.2	setosa	18.00	18.00	18.00	high	3.20	1.90
...	...	...	...	...	...	...	...	...	...	...	...
145	6.7	3.0	5.2	2.3	virginica	20.10	20.10	20.10	high	5.95	2.65
146	6.3	2.5	5.0	1.9	virginica	15.75	15.75	15.75	high	5.65	2.20
147	6.5	3.0	5.2	2.0	virginica	19.50	19.50	19.50	high	5.85	2.50
148	6.2	3.4	5.4	2.3	virginica	21.08	21.08	21.08	high	5.80	2.85
149	5.9	3.0	5.1	1.8	virginica	17.70	17.70	17.70	high	5.50	2.40

150 rows × 11 columns

iris_df.sort_index(axis=0, ascending=False)

	sepal_length	sepal_width	petal_length	petal_width	species	sepal_surface	sepal_surface_2	sepal_surface_loc	sepal_surface_bin	avg_length	avg_width
149	5.9	3.0	5.1	1.8	virginica	17.70	17.70	17.70	high	5.50	2.40
148	6.2	3.4	5.4	2.3	virginica	21.08	21.08	21.08	high	5.80	2.85
147	6.5	3.0	5.2	2.0	virginica	19.50	19.50	19.50	high	5.85	2.50
146	6.3	2.5	5.0	1.9	virginica	15.75	15.75	15.75	high	5.65	2.20
145	6.7	3.0	5.2	2.3	virginica	20.10	20.10	20.10	high	5.95	2.65
...	...	...	...	...	...	...	...	...	...	...	...
4	5.0	3.6	1.4	0.2	setosa	18.00	18.00	18.00	high	3.20	1.90
3	4.6	3.1	1.5	0.2	setosa	14.26	14.26	14.26	low	3.05	1.65
2	4.7	3.2	1.3	0.2	setosa	15.04	15.04	15.04	high	3.00	1.70
1	4.9	3.0	1.4	0.2	setosa	14.70	14.70	14.70	low	3.15	1.60
0	5.1	3.5	1.4	0.2	setosa	17.85	17.85	17.85	high	3.25	1.85

150 rows × 11 columns

Practice exercises#

Exercise 44

1- Create a new column named petal_length_cat, where petal_length values greater than the dataset’s average are labelled as “high” and those equal to or below the average are labeled as “low.” Then, print the count of occurrences for each category in this column.

# Your answers from here

Exercise 45

2- Create new columns for both sepals and petals representing their aspect ratio (length divided by width). Use assign for this.

Generate a scatterplot to show the relationship between these two new columns, and calculate the correlation coefficient between them.

Additionally, create a boxplot to illustrate differences in each of these new variables across species.

# Your answers from here

Exercise 46

3- Use apply() to append a new column that is the minimum of (petal_length, petal_width).

Print the head, tail of the new dataframe to check things look correct.

# Your answers from here

Pandas: Data Manipulation

Contents

Pandas: Data Manipulation#

Advanced Filtering and Subsetting#

Boolean Filtering#

Masking#

Creating columns#

Using brackets []#

Using loc[]#

Using assign#

Removing columns#

Transforming your data#

Sorting Data#

By values: sort_values()#

By index: sort_index()#

Practice exercises#

Using brackets `[]`#

Using `loc[]`#

Using `assign`#

By values: `sort_values()`#

By index: `sort_index()`#