Pandas: Introduction to Feature Engineering#

In this lesson, you will be introduced to feature engineering techniques using pandas. Specifically, we will cover:

  • Handling missing data

  • Creating new columns

  • Working with categorical data

  • Working with text data

Introduction#

Feature engineering is the process of selecting, manipulating, and transforming raw data into features that can be used in subsequent analyses. This process is particularly crucial in data science applications.

Pandas provides many functionalities for feature engineering, and we will cover some of them here with a caveat: data science applications often emphasize generalizability. To achieve this, it is standard practice to split the data into fitting (sometimes called “training”) and testing partitions. The fitting partition is used to estimate everything needed, including feature engineering steps, before model deployment. The model’s performance is then evaluated on the testing partition. This approach ensures there is no data leakage, which occurs when information that would not be available at prediction time is used in building the model, often resulting in overly optimistic results.

Unfortunately, feature engineering in Pandas may not always keep these two partitions strictly separated, which is why you may want to use scikit-learn for this process in the future.

Nevertheless, in this lesson we will cover some of Pandas’ functionalities for feature engineering and show how to apply them safely in an out-of-sample context.

# Load dependencies
import pandas as pd
# We will use this dataset, common for new learners in machine learning
full_df = pd.read_csv('https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv')
full_df.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
# Use the info method to get information about this dataset
full_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
# Let's create our partitions, with the fitting set taking 80% of the observations and the testing set the remaining
fitting_df = full_df.iloc[:int(full_df.shape[0]*0.8),:].copy() 
testing_df = full_df.iloc[int(full_df.shape[0]*0.8):,:].copy() # This copy operation is to avoid some annoying warnings later.

print("we have", fitting_df.shape[0], "observations in the fitting set;", "and", testing_df.shape[0], "in the test set")
we have 712 observations in the fitting set; and 179 in the test set

Handling missing values#

  • dropna(): In principle, this is a method that you can safely apply. Refer to a previous lesson to learn how to use it.

  • fillna(): This is fine as long as you fill all the NaN’s with a fixed, pre-specified value (e.g. a zero).
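As a minimal sketch of why dropna() is safe, consider a toy frame (hypothetical data, not the Titanic set): each row is kept or dropped based only on its own values, so the operation can be applied to each partition separately.

```python
import pandas as pd

# A toy frame with missing values (hypothetical data)
df = pd.DataFrame({"Age": [22.0, None, 35.0],
                   "Fare": [7.25, 8.05, None]})

# dropna() decides row by row, so applying it after splitting
# carries no information between partitions
complete = df.dropna()
print(complete.shape[0])  # 1 complete row remains
```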

# This is OK
full_df["Age"].fillna(0).info()
<class 'pandas.core.series.Series'>
RangeIndex: 891 entries, 0 to 890
Series name: Age
Non-Null Count  Dtype  
--------------  -----  
891 non-null    float64
dtypes: float64(1)
memory usage: 7.1 KB

Replacing NaN’s with, for example, the mean first and then splitting the data would be problematic, because the two partitions would no longer be independent.

full_df["Age"].fillna(full_df["Age"].mean())
0      22.000000
1      38.000000
2      26.000000
3      35.000000
4      35.000000
         ...    
886    27.000000
887    19.000000
888    29.699118
889    26.000000
890    32.000000
Name: Age, Length: 891, dtype: float64

You have to estimate the mean on the fitting partition and use that estimation to populate both data partitions.

# The mean is estimated on the fitting partition
mean_fitting = fitting_df["Age"].mean()

# And used to fill NaN's in both partitions
fitting_df["Age"] = fitting_df["Age"].fillna(mean_fitting)
testing_df["Age"] = testing_df["Age"].fillna(mean_fitting)

Creating new columns#

This is in principle a safe operation as long as it is applied at the row level. Refer to a previous lesson for further details on how to create new columns.
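To illustrate the point, here is a sketch of a row-level column creation; the FamilySize and FarePerPerson names are illustrative choices, not columns from the original lesson. Because each new value depends only on its own row, the result is identical whether computed before or after the train/test split.

```python
import pandas as pd

# Hypothetical passenger data with the same column names as above
df = pd.DataFrame({"SibSp": [1, 0], "Parch": [0, 2], "Fare": [7.25, 71.28]})

# Each new value is computed from its own row only
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
df["FarePerPerson"] = df["Fare"] / df["FamilySize"]
print(df["FamilySize"].tolist())  # [2, 3]
```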

Working with categorical data#

Convert continuous data into categories#

Normally here, one would do the following two steps:

  • Step 1: Bin fitting data and retrieve bin edges.

  • Step 2: Apply consistent binning to out-of-sample test data.

  • pd.cut(): Bin values into discrete intervals.

The following parameters are important: bins, which sets the criteria to bin by (use help to see the different values it can take), and retbins, which returns the bin edges so they can be transferred to the testing partition.

# Bin the age from the fitting data into 5 equally spaced ranges. Note that retbins is True to return the estimated edges
fitting_df['age_cat'], bin_edges = pd.cut(fitting_df['Age'], bins=5, retbins=True)
print("Bin edges:", bin_edges)
Bin edges: [ 0.67075 16.6     32.45    48.3     64.15    80.     ]

Use the bins argument and the computed ranges to bin your test data:

# Apply the same binning with the predefined edges
testing_df['age_cat'] = pd.cut(testing_df['Age'], bins=bin_edges)

Note that supplying pre-specified ranges is OK even before data splitting, since the edges are not estimated from the data:

# This would be OK
full_df['age_cat'] = pd.cut(full_df['Age'], bins = [20, 30, 40, 50, 60])
  • pd.qcut(): Quantile-based discretization function.

# Bin the training data using qcut and retrieve the edges
fitting_df['fare_cat'], bin_edges = pd.qcut(fitting_df['Fare'], q=3, labels=['Low', 'Medium', 'High'], retbins=True)
print("Bin edges:", bin_edges)
Bin edges: [  0.       8.6625  26.25   512.3292]
# Use pd.cut() with the bin edges from the training data
testing_df['fare_cat'] = pd.cut(testing_df['Fare'], bins=bin_edges, labels=['Low', 'Medium', 'High'])

One-Hot Encoding#

Many algorithms require numerical inputs. For these, we need to convert categorical data into numbers.

We could be tempted to replace each category with a given number. For example, if we had three categories (A, B, C), we could replace them with (0, 1, 2), but by doing so we would impose an ordinal relationship (0 < 1 < 2), which does not need to hold (why should A be lower than B?).
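To make the pitfall concrete, here is a sketch of that naive mapping (not recommended), using the toy A/B/C categories from the text:

```python
import pandas as pd

s = pd.Series(["A", "B", "C", "A"])

# Replacing categories with integers imposes an ordering (0 < 1 < 2)
# that the original categories may not actually have
encoded = s.map({"A": 0, "B": 1, "C": 2})
print(encoded.tolist())  # [0, 1, 2, 0]
```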

To prevent this, we can do one-hot encoding, where each category is represented by a separate binary column. For each observation, a ‘1’ is placed in the column corresponding to its original category, with ‘0’s in all other columns.

In pandas, we can do this using the pd.get_dummies() function.

Important parameters:

  • prefix: append a prefix to the new column names (a good idea for later use)

  • drop_first: remove the first level, since only k-1 variables are needed to represent k levels. You will normally want to set this to True.

Have a look at the documentation for further details.

Let’s apply this to the Embarked column.

  • Step 1: One-Hot encoding on the fitting partition

Use pd.get_dummies() on the fitting data to create one-hot encoded columns.
Capture the resulting columns in the fitting data to use as a reference for the test data.

fitting_encoded_df = pd.get_dummies(fitting_df, columns=['Embarked'], drop_first=True)

# Save the columns after one-hot encoding the training set
fitting_columns = fitting_encoded_df.columns

fitting_encoded_df.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin age_cat fare_cat Embarked_Q Embarked_S
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN (16.6, 32.45] Low 0 1
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 (32.45, 48.3] High 0 0
2 3 1 3 Heikkinen, Miss Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN (16.6, 32.45] Low 0 1
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 (32.45, 48.3] High 0 1
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN (32.45, 48.3] Low 0 1
  • Step 2: Apply consistent encoding to out-of-sample test data

Some categorical features in the test data may not include all the categories present in the fitting data. In this case, applying pd.get_dummies() would yield fewer columns than in the fitting data.

We can see this if we just use the first 5 observations of our test data:

pd.get_dummies(testing_df.iloc[:5,:], columns=['Embarked'], drop_first=True)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin age_cat fare_cat Embarked_S
712 713 1 1 Taylor, Mr. Elmer Zebley male 48.0 1 0 19996 52.0000 C126 (32.45, 48.3] High 1
713 714 0 3 Larsson, Mr. August Viktor male 29.0 0 0 7545 9.4833 NaN (16.6, 32.45] Medium 1
714 715 0 2 Greenberg, Mr. Samuel male 52.0 0 0 250647 13.0000 NaN (48.3, 64.15] Medium 1
715 716 0 3 Soholt, Mr. Peter Andreas Lauritz Andersen male 19.0 0 0 348124 7.6500 F G73 (16.6, 32.45] Low 1
716 717 1 1 Endres, Miss Caroline Louise female 38.0 0 0 PC 17757 227.5250 C45 (32.45, 48.3] High 0

To handle this, reindex the one-hot encoded test data to match the columns from the fitting data. Reindexing adds any missing columns, which we fill with zeros (since those categories are absent in the test data), ensuring both datasets have the same structure.

# One-hot encode the test data
testing_encoded_df = pd.get_dummies(testing_df.iloc[:5,:], columns=['Embarked'], drop_first=True)

# Reindex test data to match training data columns, filling missing columns with 0
testing_encoded_df = testing_encoded_df.reindex(columns=fitting_columns, fill_value=0)

testing_encoded_df
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin age_cat fare_cat Embarked_Q Embarked_S
712 713 1 1 Taylor, Mr. Elmer Zebley male 48.0 1 0 19996 52.0000 C126 (32.45, 48.3] High 0 1
713 714 0 3 Larsson, Mr. August Viktor male 29.0 0 0 7545 9.4833 NaN (16.6, 32.45] Medium 0 1
714 715 0 2 Greenberg, Mr. Samuel male 52.0 0 0 250647 13.0000 NaN (48.3, 64.15] Medium 0 1
715 716 0 3 Soholt, Mr. Peter Andreas Lauritz Andersen male 19.0 0 0 348124 7.6500 F G73 (16.6, 32.45] Low 0 1
716 717 1 1 Endres, Miss Caroline Louise female 38.0 0 0 PC 17757 227.5250 C45 (32.45, 48.3] High 0 0

Working with text data#

This is a very common type of data.

Common text data problems involve:

  1. data inconsistency

  2. fixed-length violations

  3. typos

Pandas provides a set of string processing methods that make it easy to operate on each element of a string column. These can be accessed via the str attribute and generally have names matching the equivalent built-in string methods, such as lower(), upper(), split(), contains(), and replace().

This is a safe operation in terms of data leakage, since it acts on each observation individually:

full_df.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked age_cat
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S (20, 30]
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C (30, 40]
2 3 1 3 Heikkinen, Miss Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S (20, 30]
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S (30, 40]
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S (30, 40]
# convert names to lowercase
full_df["Name"].str.lower()
0                                braund, mr. owen harris
1      cumings, mrs. john bradley (florence briggs th...
2                                  heikkinen, miss laina
3           futrelle, mrs. jacques heath (lily may peel)
4                               allen, mr. william henry
                             ...                        
886                                montvila, rev. juozas
887                          graham, miss margaret edith
888              johnston, miss catherine helen "carrie"
889                                behr, mr. karl howell
890                                  dooley, mr. patrick
Name: Name, Length: 891, dtype: object
# Get last names
full_df["Name"].str.split().str[0].str.replace(",", "")
0         Braund
1        Cumings
2      Heikkinen
3       Futrelle
4          Allen
         ...    
886     Montvila
887       Graham
888     Johnston
889         Behr
890       Dooley
Name: Name, Length: 891, dtype: object

Have a look at the pandas documentation on string methods for further details.

Practice exercises#

Exercise 50

The dataframe below contains two categoricals. Apply one-hot encoding to each of them, giving them a prefix and dropping the first level from each.

Print the new dataframe to ensure correctness.

Hint: You might want to dummify each column into separate new dataframes, and then merge them together using pd.concat().

cats = pd.DataFrame({'breed':['persian','persian','siamese','himalayan','burmese'], 
                     'color':['calico','white','seal point','cream','sable']})
# Your answers from here