Pandas: Introduction to Feature Engineering#

In this lesson, you will be introduced to feature engineering techniques using pandas. Specifically, we will cover:

  • Handling missing data

  • Creating new columns

  • Working with categorical data

  • Working with text data

Introduction#

Feature engineering is the process of selecting, manipulating, and transforming raw data into features that can be used in subsequent analyses. This process is particularly crucial in data science applications.

Pandas provides many functionalities for feature engineering, and we will cover some of them here with a caveat: data science applications often emphasize generalizability. To achieve this, it is standard practice to split the data into fitting (sometimes called “training”) and testing partitions. The fitting partition is used to estimate everything needed, including feature engineering steps, before model deployment. The model’s performance is then evaluated on the testing partition. This approach ensures there is no data leakage, which occurs when information that would not be available at prediction time is used in building the model, often resulting in overly optimistic results.

Unfortunately, feature engineering in Pandas may not always keep these two partitions strictly separated, which is why you may want to use scikit-learn for this process in the future.

Nevertheless, in this lesson we will cover some of Pandas’ functionalities for feature engineering and show how to apply them safely in an out-of-sample context.

# Load dependencies
import pandas as pd
# We will use this dataset, common for new learners in machine learning
full_df = pd.read_csv('https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv')
full_df.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
# Use the info method to get information about this dataset
full_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
# Let's create our partitions, with the fitting set taking 80% of the observations and the testing set the remaining
fitting_df = full_df.iloc[:int(full_df.shape[0]*0.8),:].copy() 
testing_df = full_df.iloc[int(full_df.shape[0]*0.8):,:].copy() # This copy operation is to avoid some annoying warnings later.

print("we have", fitting_df.shape[0], "observations in the fitting set;", "and", testing_df.shape[0], "in the test set")
we have 712 observations in the fitting set; and 179 in the test set

Handling missing values#

  • dropna(): In principle, this is a method that you can safely apply. Refer to a previous lesson to learn how to use it.

  • fillna(): This is fine as long as you fill all the NaN’s with a fixed, pre-specified value (e.g. a zero).
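As a minimal sketch of why dropna() is safe, consider a toy frame (hypothetical data, not the Titanic set): each row is kept or dropped based only on its own values, so the operation can be applied to each partition separately.

```python
import pandas as pd

# A toy frame with missing values (hypothetical data)
df = pd.DataFrame({"Age": [22.0, None, 35.0],
                   "Fare": [7.25, 8.05, None]})

# dropna() decides row by row, so applying it after splitting
# carries no information between partitions
complete = df.dropna()
print(complete.shape[0])  # 1 complete row remains
```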

# This is OK
full_df["Age"].fillna(0).info()
<class 'pandas.core.series.Series'>
RangeIndex: 891 entries, 0 to 890
Series name: Age
Non-Null Count  Dtype  
--------------  -----  
891 non-null    float64
dtypes: float64(1)
memory usage: 7.1 KB

Replacing NaN’s with, for example, the mean first and then splitting the data would be problematic, because the two partitions would no longer be independent.

full_df["Age"].fillna(full_df["Age"].mean())
0      22.000000
1      38.000000
2      26.000000
3      35.000000
4      35.000000
         ...    
886    27.000000
887    19.000000
888    29.699118
889    26.000000
890    32.000000
Name: Age, Length: 891, dtype: float64

You have to estimate the mean on the fitting partition and use that estimation to populate both data partitions.

# The mean is estimated on the fitting partition
mean_fitting = fitting_df["Age"].mean()

# And used to fill NaN's in both partitions
fitting_df["Age"] = fitting_df["Age"].fillna(mean_fitting)
testing_df["Age"] = testing_df["Age"].fillna(mean_fitting)

Creating new columns#

This is in principle a safe operation as long as it is applied at the row level. Refer to a previous lesson for further details on how to create new columns.
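To illustrate the point, here is a sketch of a row-level column creation; the FamilySize and FarePerPerson names are illustrative choices, not columns from the original lesson. Because each new value depends only on its own row, the result is identical whether computed before or after the train/test split.

```python
import pandas as pd

# Hypothetical passenger data with the same column names as above
df = pd.DataFrame({"SibSp": [1, 0], "Parch": [0, 2], "Fare": [7.25, 71.28]})

# Each new value is computed from its own row only
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
df["FarePerPerson"] = df["Fare"] / df["FamilySize"]
print(df["FamilySize"].tolist())  # [2, 3]
```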

Working with categorical data#

Convert continuous data into categories#

Normally here, one would do the following two steps:

  • Step 1: Bin fitting data and retrieve bin edges.

  • Step 2: Apply consistent binning to out-of-sample test data.

  • pd.cut(): Bin values into discrete intervals.

The following parameters are important: bins, which sets the criteria to bin by (use help to see the different values it can take), and retbins, which returns the bin edges so they can be transferred to the testing partition.

# Bin the age from the fitting data into 5 equally spaced ranges. Note that retbins is True to return the estimated edges
fitting_df['age_cat'], bin_edges = pd.cut(fitting_df['Age'], bins=5, retbins=True)
print("Bin edges:", bin_edges)
Bin edges: [ 0.67075 16.6     32.45    48.3     64.15    80.     ]

Use the bins argument and the computed ranges to bin your test data:

# Apply the same binning with the predefined edges
testing_df['age_cat'] = pd.cut(testing_df['Age'], bins=bin_edges)

Note that supplying pre-specified ranges is OK even before data splitting, since the edges are not estimated from the data:

# This would be OK
full_df['age_cat'] = pd.cut(full_df['Age'], bins = [20, 30, 40, 50, 60])
  • pd.qcut(): Quantile-based discretization function.

# Bin the training data using qcut and retrieve the edges
fitting_df['fare_cat'], bin_edges = pd.qcut(fitting_df['Fare'], q=3, labels=['Low', 'Medium', 'High'], retbins=True)
print("Bin edges:", bin_edges)
Bin edges: [  0.       8.6625  26.25   512.3292]
# Use pd.cut() with the bin edges from the training data
testing_df['fare_cat'] = pd.cut(testing_df['Fare'], bins=bin_edges, labels=['Low', 'Medium', 'High'])

One-Hot Encoding#

Many algorithms require numerical inputs. For these, we need to convert categorical data into numbers.

We could be tempted to replace each category with a given number. For example, if we had three categories (A, B, C), we could replace them with (0, 1, 2), but by doing so we would impose an ordinal relationship (0 < 1 < 2), which does not need to hold (why should A be lower than B?).
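To make the pitfall concrete, here is a sketch of that naive mapping (not recommended), using the toy A/B/C categories from the text:

```python
import pandas as pd

s = pd.Series(["A", "B", "C", "A"])

# Replacing categories with integers imposes an ordering (0 < 1 < 2)
# that the original categories may not actually have
encoded = s.map({"A": 0, "B": 1, "C": 2})
print(encoded.tolist())  # [0, 1, 2, 0]
```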

To prevent this, we can do one-hot encoding, where each category is represented by a separate binary column. For each observation, a ‘1’ is placed in the column corresponding to its original category, with ‘0’s in all other columns.

In pandas, we can do this using the pd.get_dummies() function.

Important parameters:

  • prefix: append a prefix to the new column names (a good idea for later use)

  • drop_first: remove the first level, since only k-1 variables are needed to represent k levels. You will normally want to set this to True.

Have a look at the documentation for further details.

Let’s apply this to the Embarked column.

  • Step 1: One-Hot encoding on the fitting partition

Use pd.get_dummies() on the fitting data to create one-hot encoded columns.
Capture the resulting columns in the fitting data to use as a reference for the test data.

fitting_encoded_df = pd.get_dummies(fitting_df, columns=['Embarked'], drop_first=True)

# Save the columns after one-hot encoding the training set
fitting_columns = fitting_encoded_df.columns

fitting_encoded_df.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin age_cat fare_cat Embarked_Q Embarked_S
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN (16.6, 32.45] Low 0 1
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 (32.45, 48.3] High 0 0
2 3 1 3 Heikkinen, Miss Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN (16.6, 32.45] Low 0 1
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 (32.45, 48.3] High 0 1
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN (32.45, 48.3] Low 0 1
  • Step 2: Apply consistent encoding to out-of-sample test data

Some categorical features in the test data may not include all the categories present in the fitting data. In this case, applying pd.get_dummies() would yield fewer columns than in the fitting data.

We can see this if we just use the first 5 observations of our test data:

pd.get_dummies(testing_df.iloc[:5,:], columns=['Embarked'], drop_first=True)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin age_cat fare_cat Embarked_S
712 713 1 1 Taylor, Mr. Elmer Zebley male 48.0 1 0 19996 52.0000 C126 (32.45, 48.3] High 1
713 714 0 3 Larsson, Mr. August Viktor male 29.0 0 0 7545 9.4833 NaN (16.6, 32.45] Medium 1
714 715 0 2 Greenberg, Mr. Samuel male 52.0 0 0 250647 13.0000 NaN (48.3, 64.15] Medium 1
715 716 0 3 Soholt, Mr. Peter Andreas Lauritz Andersen male 19.0 0 0 348124 7.6500 F G73 (16.6, 32.45] Low 1
716 717 1 1 Endres, Miss Caroline Louise female 38.0 0 0 PC 17757 227.5250 C45 (32.45, 48.3] High 0

To handle this, reindex the one-hot encoded test data to match the columns from the fitting data. Reindexing adds any missing columns, which we fill with zeros (since those categories are absent in the test data), ensuring both datasets have the same structure.

# One-hot encode the test data
testing_encoded_df = pd.get_dummies(testing_df.iloc[:5,:], columns=['Embarked'], drop_first=True)

# Reindex test data to match training data columns, filling missing columns with 0
testing_encoded_df = testing_encoded_df.reindex(columns=fitting_columns, fill_value=0)

testing_encoded_df
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin age_cat fare_cat Embarked_Q Embarked_S
712 713 1 1 Taylor, Mr. Elmer Zebley male 48.0 1 0 19996 52.0000 C126 (32.45, 48.3] High 0 1
713 714 0 3 Larsson, Mr. August Viktor male 29.0 0 0 7545 9.4833 NaN (16.6, 32.45] Medium 0 1
714 715 0 2 Greenberg, Mr. Samuel male 52.0 0 0 250647 13.0000 NaN (48.3, 64.15] Medium 0 1
715 716 0 3 Soholt, Mr. Peter Andreas Lauritz Andersen male 19.0 0 0 348124 7.6500 F G73 (16.6, 32.45] Low 0 1
716 717 1 1 Endres, Miss Caroline Louise female 38.0 0 0 PC 17757 227.5250 C45 (32.45, 48.3] High 0 0

Working with text data#

This is a very common type of data.

Common text data problems involve:

  1. data inconsistency

  2. fixed-length violations

  3. typos

Pandas provides a set of string processing methods that make it easy to operate on each element of a string column. These can be accessed via the str attribute and generally have names matching the equivalent built-in string methods, such as lower(), upper(), split(), contains(), and replace().

This is a safe operation in terms of data leakage, since it acts on each observation individually:

full_df.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked age_cat
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S (20, 30]
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C (30, 40]
2 3 1 3 Heikkinen, Miss Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S (20, 30]
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S (30, 40]
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S (30, 40]
# convert names to lowercase
full_df["Name"].str.lower()
0                                braund, mr. owen harris
1      cumings, mrs. john bradley (florence briggs th...
2                                  heikkinen, miss laina
3           futrelle, mrs. jacques heath (lily may peel)
4                               allen, mr. william henry
                             ...                        
886                                montvila, rev. juozas
887                          graham, miss margaret edith
888              johnston, miss catherine helen "carrie"
889                                behr, mr. karl howell
890                                  dooley, mr. patrick
Name: Name, Length: 891, dtype: object
# Get last names
full_df["Name"].str.split().str[0].str.replace(",", "")
0         Braund
1        Cumings
2      Heikkinen
3       Futrelle
4          Allen
         ...    
886     Montvila
887       Graham
888     Johnston
889         Behr
890       Dooley
Name: Name, Length: 891, dtype: object

Have a look at the pandas documentation on string methods for further details.

Practice exercises#

Exercise 50

The dataframe below contains two categoricals. Apply one-hot encoding to each of them, giving them a prefix and dropping the first level from each.

Print the new dataframe to ensure correctness.

Hint: You might want to dummify each column into separate new dataframes, and then merge them together using pd.concat().

cats = pd.DataFrame({'breed':['persian','persian','siamese','himalayan','burmese'], 
                     'color':['calico','white','seal point','cream','sable']})
# Your answers from here