Pandas: Introduction to Feature Engineering#
In this lesson, you will be introduced to feature engineering techniques using pandas. Specifically, we will cover:
Handling missing data
Creating new columns
Working with categorical data
Working with text data
Introduction#
Feature engineering is the process of selecting, manipulating, and transforming raw data into features that can be used in subsequent analyses. This process is particularly crucial in data science applications.
Pandas provides many functionalities for feature engineering, and we will cover some of them here with a caveat: data science applications often emphasize generalizability. To achieve this, it is standard practice to split the data into fitting (sometimes called “training”) and testing partitions. The fitting partition is used to estimate everything needed, including feature engineering steps, before model deployment. The model’s performance is then evaluated on the testing partition. This approach ensures there is no data leakage, which occurs when information that would not be available at prediction time is used in building the model, often resulting in overly optimistic results.
Unfortunately, feature engineering in Pandas does not by itself keep these two partitions strictly separated, which is why you may want to use scikit-learn for this process in the future.
Nevertheless, in this lesson we will cover some of Pandas' feature engineering functionality in a way that is safe in an out-of-sample context.
# Load dependencies
import pandas as pd
# We will use this dataset, common for new learners in machine learning
full_df = pd.read_csv('https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv')
full_df.head()
 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
# Use the info method to get information about this dataset
full_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
# Let's create our partitions: the fitting set takes the first 80% of the observations and the testing set the remaining 20%
n_fitting = int(full_df.shape[0] * 0.8)
fitting_df = full_df.iloc[:n_fitting, :].copy()  # The copy avoids SettingWithCopyWarning when we add columns later
testing_df = full_df.iloc[n_fitting:, :].copy()
print("we have", fitting_df.shape[0], "observations in the fitting set;", "and", testing_df.shape[0], "in the test set")
we have 712 observations in the fitting set; and 179 in the test set
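The split above takes rows in file order, which is fine when the row order carries no information. If the rows might be sorted in some meaningful way, a shuffled split is a common alternative. The following is only a sketch on a hypothetical toy DataFrame; `toy_df`, the column names, and the seed are illustrative assumptions, not part of the lesson's dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the full dataset
toy_df = pd.DataFrame({"x": range(10), "y": [0, 1] * 5})

# Draw 80% of the rows at random (fixed seed so the split is reproducible)
toy_fitting = toy_df.sample(frac=0.8, random_state=0)
# The testing partition is every row that was not drawn
toy_testing = toy_df.drop(toy_fitting.index)

print(len(toy_fitting), "fitting rows;", len(toy_testing), "testing rows")
```

Because the testing rows are obtained by dropping the fitting index, the two partitions cannot overlap.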
Handling missing values#
dropna(): In principle, this is a method that you can safely apply. Refer to a previous lesson to learn how to use it.
fillna(): This is fine as long as you fill all the NaNs with a fixed value chosen in advance (e.g. zero).
# This is OK: the fill value is a constant that does not depend on the data
full_df["Age"].fillna(0).info()
<class 'pandas.core.series.Series'>
RangeIndex: 891 entries, 0 to 890
Series name: Age
Non-Null Count Dtype
-------------- -----
891 non-null float64
dtypes: float64(1)
memory usage: 7.1 KB
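Returning to dropna(): one way to see why it is safe is that dropping rows before or after splitting gives the same partition, since each row is judged only on its own values. A minimal sketch with a hypothetical toy DataFrame (`toy_df` and its values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing ages
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05],
})

# Drop rows with a missing Age first, then keep the rows labelled 0 to 2 ...
dropped_then_split = toy_df.dropna(subset=["Age"]).loc[:2]
# ... or keep the rows labelled 0 to 2 first, then drop
split_then_dropped = toy_df.loc[:2].dropna(subset=["Age"])

# Both orders keep exactly the same rows
print(dropped_then_split.equals(split_then_dropped))
```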
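Replacing NaNs with, for example, the mean of the full dataset and then splitting the data would be problematic: the filled values in the fitting partition would carry information from the testing partition, so the two partitions would no longer be independent.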
full_df["Age"].fillna(full_df["Age"].mean())
0 22.000000
1 38.000000
2 26.000000
3 35.000000
4 35.000000
...
886 27.000000
887 19.000000
888 29.699118
889 26.000000
890 32.000000
Name: Age, Length: 891, dtype: float64
You have to estimate the mean on the fitting partition and use that estimation to populate both data partitions.
# The mean is estimated on the fitting partition only
mean_fitting = fitting_df["Age"].mean()
# And it is used to fill NaNs in both partitions
fitting_df["Age"] = fitting_df["Age"].fillna(mean_fitting)
testing_df["Age"] = testing_df["Age"].fillna(mean_fitting)
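To make the difference concrete, the sketch below compares the "leaky" mean (computed on all rows) with the fitting-only mean on a small hypothetical example. The names `toy_df`, `leaky_mean`, and `safe_mean` and the values are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the first four rows play the role of the fitting partition
toy_df = pd.DataFrame({"Age": [20.0, np.nan, 40.0, 30.0, np.nan, 80.0]})
toy_fitting = toy_df.iloc[:4].copy()
toy_testing = toy_df.iloc[4:].copy()

leaky_mean = toy_df["Age"].mean()       # also uses the testing rows
safe_mean = toy_fitting["Age"].mean()   # uses the fitting rows only

# Fill both partitions with the fitting-only estimate
toy_fitting["Age"] = toy_fitting["Age"].fillna(safe_mean)
toy_testing["Age"] = toy_testing["Age"].fillna(safe_mean)

print("leaky mean:", leaky_mean, "| safe mean:", safe_mean)
```

Here the large age in the testing rows pulls the leaky mean upwards, which is exactly the kind of information that would not be available at prediction time.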
Creating new columns#
This is in principle a safe operation as long as it is applied at the row level, that is, each new value depends only on other values in the same row. Refer to a previous lesson for further details on how to create new columns.
Working with categorical data#
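As a rough illustration of the row-level condition, the sketch below builds one column that only uses values from the same row and one that needs a statistic pooled across rows. The toy data and the column names `family_size` and `fare_centered` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; assume the first two rows are the fitting partition
toy_df = pd.DataFrame({
    "SibSp": [1, 0, 3],
    "Parch": [0, 0, 2],
    "Fare": [7.25, 8.05, 31.5],
})

# Row-level: each value depends only on its own row, so this is safe before splitting
toy_df["family_size"] = toy_df["SibSp"] + toy_df["Parch"] + 1

# Not row-level: the mean pools information across rows, so it has to be
# estimated on the fitting rows only and then applied to every row
fitting_mean_fare = toy_df["Fare"].iloc[:2].mean()
toy_df["fare_centered"] = toy_df["Fare"] - fitting_mean_fare

print(toy_df)
```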
Convert continuous data into categories#
Normally here, one would do the following two steps:
Step 1: Bin fitting data and retrieve bin edges.
Step 2: Apply consistent binning to out-of-sample test data.
pd.cut(): Bin values into discrete intervals.
The following parameters are important: bins, which sets the criterion to bin by (use help to see the different values it can take), and retbins, which returns the bin edges so that they can be transferred to the testing partition.
# Bin the age from the fitting data into 5 equal-width ranges. Note that retbins is True to return the estimated edges
fitting_df['age_cat'], bin_edges = pd.cut(fitting_df['Age'], bins=5, retbins=True)
print("Bin edges:", bin_edges)
Bin edges: [ 0.67075 16.6 32.45 48.3 64.15 80. ]
Use the bins argument and the computed edges to bin your test data:
# Apply the same binning with the predefined edges
testing_df['age_cat'] = pd.cut(testing_df['Age'], bins=bin_edges)
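One thing to keep in mind when reusing fitted edges is that test values outside the fitted range fall in no bin and become NaN. The sketch below shows this on hypothetical toy Series and one possible workaround, opening up the two outer edges; whether that is appropriate depends on the application, and the variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ages
toy_fitting_age = pd.Series([1.0, 20.0, 40.0, 60.0])
toy_testing_age = pd.Series([0.5, 30.0, 75.0])  # 0.5 and 75.0 lie outside the fitting range

# Estimate three equal-width bins on the fitting data and keep the edges
_, toy_edges = pd.cut(toy_fitting_age, bins=3, retbins=True)

# With the fitted edges as they are, out-of-range test values become NaN
strict = pd.cut(toy_testing_age, bins=toy_edges)
print(strict.isna().tolist())

# One option: replace the outer edges with -inf / +inf so every value lands in a bin
open_edges = toy_edges.copy()
open_edges[0], open_edges[-1] = -np.inf, np.inf
relaxed = pd.cut(toy_testing_age, bins=open_edges)
print(relaxed.isna().tolist())
```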
Note that supplying pre-specified ranges before splitting the data is OK, because the edges are not estimated from the data.
# This would be OK
full_df['age_cat'] = pd.cut(full_df['Age'], bins = [20, 30, 40, 50, 60])
pd.qcut(): Quantile-based discretization function.
# Bin the fitting data using qcut and retrieve the edges
fitting_df['fare_cat'], bin_edges = pd.qcut(fitting_df['Fare'], q=3, labels=['Low', 'Medium', 'High'], retbins=True)
print("Bin edges:", bin_edges)
Bin edges: [ 0. 8.6625 26.25 512.3292]
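# Use pd.cut() with the bin edges from the fitting data; include_lowest=True keeps fares equal to the lowest edge (0) in the 'Low' bin, as qcut did
testing_df['fare_cat'] = pd.cut(testing_df['Fare'], bins=bin_edges, labels=['Low', 'Medium', 'High'], include_lowest=True)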
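A detail worth checking when passing qcut edges to pd.cut() is the treatment of the lowest edge: by default pd.cut() leaves the first interval open on the left, whereas qcut assigns the minimum value to the first bin. The sketch below illustrates this with hypothetical toy fares; all names and values are made up for the example.

```python
import pandas as pd

# Hypothetical toy fares
toy_fitting_fare = pd.Series([0.0, 5.0, 10.0, 20.0, 50.0, 100.0])
toy_testing_fare = pd.Series([0.0, 15.0, 100.0])
labels = ["Low", "Medium", "High"]

# Estimate tercile edges on the fitting data
_, toy_edges = pd.qcut(toy_fitting_fare, q=3, labels=labels, retbins=True)

# Default: the first interval excludes its left edge, so a fare of 0.0 gets no bin
default_cut = pd.cut(toy_testing_fare, bins=toy_edges, labels=labels)
# include_lowest=True closes the first interval on the left
closed_cut = pd.cut(toy_testing_fare, bins=toy_edges, labels=labels, include_lowest=True)

print(default_cut.tolist())
print(closed_cut.tolist())
```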
One-Hot Encoding#
Many algorithms require numerical inputs. For these, we need to convert categorical data into numbers.
We could be tempted to replace each category with a given number. For example, if we had three categories (A, B, C), we could replace them with (0, 1, 2), but by doing so we would be imposing an ordering (0 < 1 < 2) that need not exist (why should A be lower than B?).
To prevent this, we can do one-hot encoding, where each category is represented by a separate binary column. For each observation, a ‘1’ is placed in the column corresponding to its original category, with ‘0’s in all other columns.
In pandas we can do this using the pd.get_dummies() function.
Important parameters:
prefix: string prepended to the new column names (a good idea for later use)
drop_first: remove the first level, as only k-1 variables are needed to represent k levels. You will normally want to set this to True.
Have a look at the documentation for further details.
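As a small illustration of these two parameters, the sketch below encodes the three-category example (A, B, C) mentioned earlier. The DataFrame `toy_df`, the column name `group`, and the prefix `grp` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with three categories
toy_df = pd.DataFrame({"group": ["A", "B", "C", "A"]})

# prefix names the new columns; drop_first removes the first level ("A"),
# which is then represented by zeros in all remaining columns.
# dtype=int asks for 0/1 values (recent pandas versions return booleans by default).
dummies = pd.get_dummies(toy_df, columns=["group"], prefix="grp", drop_first=True, dtype=int)
print(dummies)
```

With three levels, two columns (`grp_B` and `grp_C`) are enough: a row with zeros in both corresponds to level A.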
Let’s apply this to the Embarked column.
Step 1: One-Hot encoding on the fitting partition
Use pd.get_dummies() on the fitting data to create one-hot encoded columns.
Capture the resulting columns in the fitting data to use as a reference for the test data.
fitting_encoded_df = pd.get_dummies(fitting_df, columns=['Embarked'], drop_first=True)
# Save the columns after one-hot encoding the training set
fitting_columns = fitting_encoded_df.columns
fitting_encoded_df.head()
 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | age_cat | fare_cat | Embarked_Q | Embarked_S |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | (16.6, 32.45] | Low | 0 | 1 |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | (32.45, 48.3] | High | 0 | 0 |
2 | 3 | 1 | 3 | Heikkinen, Miss Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | (16.6, 32.45] | Low | 0 | 1 |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | (32.45, 48.3] | High | 0 | 1 |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | (32.45, 48.3] | Low | 0 | 1 |
Step 2: Apply consistent encoding to out-of-sample test data
Some categorical features in the test data may not include all the categories present in the fitting data. In this case, applying pd.get_dummies() would yield fewer columns than in the fitting data.
We can see this if we use just the first 5 observations of our test data:
pd.get_dummies(testing_df.iloc[:5,:], columns=['Embarked'], drop_first=True)
 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | age_cat | fare_cat | Embarked_S |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
712 | 713 | 1 | 1 | Taylor, Mr. Elmer Zebley | male | 48.0 | 1 | 0 | 19996 | 52.0000 | C126 | (32.45, 48.3] | High | 1 |
713 | 714 | 0 | 3 | Larsson, Mr. August Viktor | male | 29.0 | 0 | 0 | 7545 | 9.4833 | NaN | (16.6, 32.45] | Medium | 1 |
714 | 715 | 0 | 2 | Greenberg, Mr. Samuel | male | 52.0 | 0 | 0 | 250647 | 13.0000 | NaN | (48.3, 64.15] | Medium | 1 |
715 | 716 | 0 | 3 | Soholt, Mr. Peter Andreas Lauritz Andersen | male | 19.0 | 0 | 0 | 348124 | 7.6500 | F G73 | (16.6, 32.45] | Low | 1 |
716 | 717 | 1 | 1 | Endres, Miss Caroline Louise | female | 38.0 | 0 | 0 | PC 17757 | 227.5250 | C45 | (32.45, 48.3] | High | 0 |
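To handle this, reindex the one-hot encoded test data to match the columns from the fitting data. Reindexing adds any missing columns, filled with zeros (since those categories are absent in the test data), drops any extra ones, and so ensures both datasets have the same structure.
Note that drop_first=True removes whichever level comes first in the data it is given, which in the test data may not be the level that was dropped in the fitting data. It is therefore safer to encode the test data without it and let the reindexing decide which columns to keep.
# One-hot encode the test data, keeping every level that is present
testing_encoded_df = pd.get_dummies(testing_df.iloc[:5,:], columns=['Embarked'])
# Reindex test data to match fitting data columns, filling missing columns with 0
testing_encoded_df = testing_encoded_df.reindex(columns=fitting_columns, fill_value=0)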
testing_encoded_df
 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | age_cat | fare_cat | Embarked_Q | Embarked_S |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
712 | 713 | 1 | 1 | Taylor, Mr. Elmer Zebley | male | 48.0 | 1 | 0 | 19996 | 52.0000 | C126 | (32.45, 48.3] | High | 0 | 1 |
713 | 714 | 0 | 3 | Larsson, Mr. August Viktor | male | 29.0 | 0 | 0 | 7545 | 9.4833 | NaN | (16.6, 32.45] | Medium | 0 | 1 |
714 | 715 | 0 | 2 | Greenberg, Mr. Samuel | male | 52.0 | 0 | 0 | 250647 | 13.0000 | NaN | (48.3, 64.15] | Medium | 0 | 1 |
715 | 716 | 0 | 3 | Soholt, Mr. Peter Andreas Lauritz Andersen | male | 19.0 | 0 | 0 | 348124 | 7.6500 | F G73 | (16.6, 32.45] | Low | 0 | 1 |
716 | 717 | 1 | 1 | Endres, Miss Caroline Louise | female | 38.0 | 0 | 0 | PC 17757 | 227.5250 | C45 | (32.45, 48.3] | High | 0 | 0 |
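An alternative that avoids reindexing is to fix the category levels from the fitting partition and declare them on both partitions before encoding. The sketch below is one possible way to do this, not the only one; the toy DataFrames and the helper `encode` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Hypothetical toy partitions; the testing one has no "C"
toy_fitting = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})
toy_testing = pd.DataFrame({"Embarked": ["Q", "S"]})

# Learn the category levels from the fitting partition only
levels = sorted(toy_fitting["Embarked"].dropna().unique())

def encode(df):
    """One-hot encode Embarked using the levels fixed on the fitting data."""
    # Declaring the levels up front makes get_dummies emit one column per level,
    # even for levels that do not occur in this particular partition
    as_cat = df.assign(Embarked=pd.Categorical(df["Embarked"], categories=levels))
    return pd.get_dummies(as_cat, columns=["Embarked"], drop_first=True, dtype=int)

fitting_encoded = encode(toy_fitting)
testing_encoded = encode(toy_testing)
print(testing_encoded)
```

Since the dropped level is always the first declared level ("C" here), both partitions end up with the same columns. A test value that was never seen in the fitting data becomes NaN in the categorical and is encoded as zeros in every column.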
Working with text data#
This is a very common type of data.
Common text data problems include:
data inconsistency
fixed length violations
typos
Pandas provides a set of string processing methods that make it easy to operate on each element of a string column. These can be accessed via the str attribute and generally have names matching the equivalent built-in string methods, such as lower(), upper(), split(), and replace(), plus pandas-specific additions such as contains().
This is a safe operation in terms of data leakage, since it acts on each observation individually:
full_df.head()
 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | age_cat |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | (20, 30] |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | (30, 40] |
2 | 3 | 1 | 3 | Heikkinen, Miss Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | (20, 30] |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | (30, 40] |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | (30, 40] |
# convert names to lowercase
full_df["Name"].str.lower()
0 braund, mr. owen harris
1 cumings, mrs. john bradley (florence briggs th...
2 heikkinen, miss laina
3 futrelle, mrs. jacques heath (lily may peel)
4 allen, mr. william henry
...
886 montvila, rev. juozas
887 graham, miss margaret edith
888 johnston, miss catherine helen "carrie"
889 behr, mr. karl howell
890 dooley, mr. patrick
Name: Name, Length: 891, dtype: object
# Get last names: everything before the first comma (this also keeps multi-word surnames whole)
full_df["Name"].str.split(",").str[0]
0 Braund
1 Cumings
2 Heikkinen
3 Futrelle
4 Allen
...
886 Montvila
887 Graham
888 Johnston
889 Behr
890 Dooley
Name: Name, Length: 891, dtype: object
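The other str methods follow the same pattern. As a sketch, the example below uses contains() to flag names with a parenthesised part and extract() to pull out the word that follows the comma (the title in these names). The Series `names` reuses three names from the table above; the regular expression is an illustrative assumption and would need adjusting for multi-word titles.

```python
import pandas as pd

# Three names taken from the dataset preview
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss Laina",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
])

# contains(): boolean flag for names that include an opening parenthesis
# (regex=False treats "(" as a literal character)
has_parenthesis = names.str.contains("(", regex=False)

# extract(): capture the first word after the comma
title = names.str.extract(r",\s*(\w+)", expand=False)

print(has_parenthesis.tolist())
print(title.tolist())
```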
Have a look at this for further details.
Practice exercises#
The dataframe below contains two categoricals. Apply one-hot encoding to each of them, giving them a prefix and dropping the first level from each.
Print the new dataframe to ensure correctness.
Hint: You might want to dummify each column into separate new dataframes, and then merge them together by using.
cats = pd.DataFrame({'breed':['persian','persian','siamese','himalayan','burmese'],
'color':['calico','white','seal point','cream','sable']})
# Your answers from here