{ "cells": [ { "cell_type": "markdown", "id": "a1c1a862-4c4e-4588-9a50-7f772506cae8", "metadata": {}, "source": [ "# Pandas: Introduction to Feature Engineering" ] }, { "cell_type": "markdown", "id": "7b28d51c-bab5-4ecd-be40-7e2956d38939", "metadata": {}, "source": [ "![](https://ds1002-resources.s3.amazonaws.com/images/workflow.png)\n", "\n", "In this lesson, you will be introduced to feature enginering techniques using pandas. Specifically, we will cover:\n", " \n", "- Handling missing data\n", "- Creating new columns\n", "- Working with categorical data\n", "- Working with text data" ] }, { "cell_type": "markdown", "id": "34615f2f-0779-4799-9d09-c3ace9a6dbc1", "metadata": {}, "source": [ "## Introduction\n", "\n", "Feature engineering is the process of selecting, manipulating, and transforming raw data into features that can be used in subsequent analyses. This process is particularly crucial in data science applications.\n", "\n", "Pandas provides many functionalities for feature engineering, and we will cover some of them here with a caveat: data science applications often emphasize **generalizability**. To achieve this, it is standard practice to split the data into **fitting** (sometimes called \"training\") and **testing** partitions. The fitting partition is used to estimate everything needed, including feature engineering steps, before model deployment. The model's performance is then evaluated on the testing partition. 
This approach ensures there is no **data leakage**, which occurs when information that would not be available at prediction time is used in building the model, often resulting in overly optimistic results.\n", "\n", "Unfortunately, feature engineering in Pandas may not always keep these two partitions strictly separated, which is why you may want to use [scikit-learn](https://scikit-learn.org/stable/modules/preprocessing.html) for this process in the future.\n", "\n", "Nevertheless, in this lesson, we will try to cover some of Pandas' functionalities for feature engineering in a safe way in an out-of-sample context." ] }, { "cell_type": "code", "execution_count": 1, "id": "05675acc-7b87-405a-bef3-9bffc1996d97", "metadata": {}, "outputs": [], "source": [ "# Load dependencies\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "id": "8c1f9c8b-0dd2-4566-8905-7d88f5aea763", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
2313Heikkinen, Miss Lainafemale26.000STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS
\n", "
" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "0 1 0 3 \n", "1 2 1 1 \n", "2 3 1 3 \n", "3 4 1 1 \n", "4 5 0 3 \n", "\n", " Name Sex Age SibSp \\\n", "0 Braund, Mr. Owen Harris male 22.0 1 \n", "1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n", "2 Heikkinen, Miss Laina female 26.0 0 \n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n", "4 Allen, Mr. William Henry male 35.0 0 \n", "\n", " Parch Ticket Fare Cabin Embarked \n", "0 0 A/5 21171 7.2500 NaN S \n", "1 0 PC 17599 71.2833 C85 C \n", "2 0 STON/O2. 3101282 7.9250 NaN S \n", "3 0 113803 53.1000 C123 S \n", "4 0 373450 8.0500 NaN S " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# We will use this dataset, common for new learners in machine learning\n", "full_df = pd.read_csv('https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv')\n", "full_df.head()" ] }, { "cell_type": "code", "execution_count": 4, "id": "1094e195-79d5-4505-af21-c345b992e8a2", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 891 entries, 0 to 890\n", "Data columns (total 12 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 PassengerId 891 non-null int64 \n", " 1 Survived 891 non-null int64 \n", " 2 Pclass 891 non-null int64 \n", " 3 Name 891 non-null object \n", " 4 Sex 891 non-null object \n", " 5 Age 714 non-null float64\n", " 6 SibSp 891 non-null int64 \n", " 7 Parch 891 non-null int64 \n", " 8 Ticket 891 non-null object \n", " 9 Fare 891 non-null float64\n", " 10 Cabin 204 non-null object \n", " 11 Embarked 889 non-null object \n", "dtypes: float64(2), int64(5), object(5)\n", "memory usage: 83.7+ KB\n" ] } ], "source": [ "# Use the info method to get information about this dataset\n", "full_df.info()" ] }, { "cell_type": "code", "execution_count": 5, "id": "abfccb9c-0b1e-4df0-8989-95797776ea16", "metadata": {}, "outputs": [ 
{ "name": "stdout", "output_type": "stream", "text": [ "we have 712 observations in the fitting set; and 179 in the test set\n" ] } ], "source": [ "# Let's create our partitions, with the fitting set taking 80% of the observations and the testing set the remaining\n", "fitting_df = full_df.iloc[:int(full_df.shape[0]*0.8),:].copy() \n", "testing_df = full_df.iloc[int(full_df.shape[0]*0.8):,:].copy() # This copy operation is to avoid some annoying warnings later.\n", "\n", "print(\"we have\", fitting_df.shape[0], \"observations in the fitting set;\", \"and\", testing_df.shape[0], \"in the test set\")" ] }, { "cell_type": "markdown", "id": "0020fb6f-5039-4934-a605-66d0dce851bf", "metadata": {}, "source": [ "## Handling nissing values\n", "\n", "- `dropna()`: In principle, this is a method that you can safely apply. Refer to a previous lesson to know how to use it.\n", "- `fillna()`: This is fine as long as you fill all the NaN's with a specific value (e.g. a zero)." ] }, { "cell_type": "code", "execution_count": 6, "id": "88d6d56e-fc08-41cb-bcc4-561d6b25ceeb", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 891 entries, 0 to 890\n", "Series name: Age\n", "Non-Null Count Dtype \n", "-------------- ----- \n", "891 non-null float64\n", "dtypes: float64(1)\n", "memory usage: 7.1 KB\n" ] } ], "source": [ "# This are OK\n", "full_df[\"Age\"].fillna(0).info()" ] }, { "cell_type": "markdown", "id": "8b379713-a3c8-47ef-8d8c-a7f4b589e93d", "metadata": {}, "source": [ "Replacing NaN's with for example the mean first and then splitting the data would be problematic, because both partitions would no longer be independent." ] }, { "cell_type": "code", "execution_count": 13, "id": "352b7108-e74f-437b-9cf1-c2b411a814ea", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 22.000000\n", "1 38.000000\n", "2 26.000000\n", "3 35.000000\n", "4 35.000000\n", " ... 
\n", "886 27.000000\n", "887 19.000000\n", "888 29.699118\n", "889 26.000000\n", "890 32.000000\n", "Name: Age, Length: 891, dtype: float64" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "full_df[\"Age\"].fillna(full_df[\"Age\"].mean())" ] }, { "cell_type": "markdown", "id": "200682a0-05eb-4a4c-8bab-cd60793f973b", "metadata": {}, "source": [ "You have to estimate the mean on the fitting partition and use that estimation to populate both data partitions." ] }, { "cell_type": "code", "execution_count": 7, "id": "94463461-3418-4e05-89fd-5f72d0cebdb0", "metadata": {}, "outputs": [], "source": [ "# The mean is estimating on the fitting partition\n", "mean_fitting = fitting_df[\"Age\"].mean()\n", "\n", "# And use to fill NaN's in both\n", "fitting_df[\"Age\"] = fitting_df[\"Age\"].fillna(mean_fitting)\n", "testing_df[\"Age\"] = testing_df[\"Age\"].fillna(mean_fitting)" ] }, { "cell_type": "markdown", "id": "4803ef2e-0914-424f-b014-fde679d0e58d", "metadata": {}, "source": [ "## Creating new columns\n", "\n", "This is in principle a safe operation as long as it applied at the row-level. Refer to a previous lesson for further details on how to create new columns." ] }, { "cell_type": "markdown", "id": "1dd11b64-b5ac-41b3-99f7-3379a08f8523", "metadata": {}, "source": [ "## Working categorical data" ] }, { "cell_type": "markdown", "id": "67312d70-f16f-4f13-9913-0fd110f04e47", "metadata": {}, "source": [ "### Convert continuous data into categories" ] }, { "cell_type": "markdown", "id": "cf83dacb-6d18-40d4-9d2e-e8d1c3d4d3d5", "metadata": {}, "source": [ "Normally here, one would do the following two steps:\n", "\n", "- **Step 1**: Bin fitting data and retrieve bin edges.\n", "- **Step 2**: Apply consistent binning to out-of-sample test data." 
] }, { "cell_type": "markdown", "id": "68550f01-6cb2-449c-a523-270889bce309", "metadata": {}, "source": [ "- `pd.cut()`: Bin values into discrete intervals.\n", "\n", "The following parameter are important: `bin`, which sets the criteria to bin by (Use `help` to see the different values it can take), and `retbins` to be able to transfer information to the testing partition." ] }, { "cell_type": "code", "execution_count": 8, "id": "6d97de53-2699-4e2e-997a-458df6d6355d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Bin edges: [ 0.67075 16.6 32.45 48.3 64.15 80. ]\n" ] } ], "source": [ "# Bin the age from the fitting'dat in 5 equally separated ranges. Note that retbins is True to return the estimated edges\n", "fitting_df['age_cat'], bin_edges = pd.cut(fitting_df['Age'], bins=5, retbins=True)\n", "print(\"Bin edges:\", bin_edges)" ] }, { "cell_type": "markdown", "id": "12cdb9a4-94d1-413e-8261-5713b261d4f8", "metadata": {}, "source": [ "Use the `bins` argument and the computed ranges to bin your test data:" ] }, { "cell_type": "code", "execution_count": 9, "id": "a9bf0a44-288a-4361-98f5-072f4fe75bde", "metadata": {}, "outputs": [], "source": [ "# Apply the same binning with the predefined edges\n", "testing_df['age_cat'] = pd.cut(testing_df['Age'], bins=bin_edges)" ] }, { "cell_type": "markdown", "id": "3540da34-6a58-4682-8ad5-a4c2d23d3aee", "metadata": {}, "source": [ "**Note that supplying pre-specified ranges is OK before data splitting**" ] }, { "cell_type": "code", "execution_count": 10, "id": "7e544299-47b5-46c3-96c1-466223d8a2a7", "metadata": {}, "outputs": [], "source": [ "# This would be OK\n", "full_df['age_cat'] = pd.cut(full_df['Age'], bins = [20, 30, 40, 50, 60])" ] }, { "cell_type": "markdown", "id": "84a8580e-bc4f-4eaa-85b9-7600092efc25", "metadata": {}, "source": [ "- `pd.qcut()`: Quantile-based discretization function." 
] }, { "cell_type": "code", "execution_count": 11, "id": "5984704b-9696-4841-ba43-6c10dfa3ed58", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Bin edges: [ 0. 8.6625 26.25 512.3292]\n" ] } ], "source": [ "# Bin the training data using qcut and retrieve the edges\n", "fitting_df['fare_cat'], bin_edges = pd.qcut(fitting_df.copy()['Fare'], q=3, labels=['Low', 'Medium', 'High'], retbins=True)\n", "print(\"Bin edges:\", bin_edges)" ] }, { "cell_type": "code", "execution_count": 12, "id": "0813355d-197b-4cfb-93bb-51a3fee8217a", "metadata": {}, "outputs": [], "source": [ "# Use pd.cut() with the bin edges from the training data\n", "testing_df['fare_cat'] = pd.cut(testing_df['Fare'], bins=bin_edges, labels=['Low', 'Medium', 'High'])" ] }, { "cell_type": "markdown", "id": "9034685c-8352-48fd-affb-7dcda7f70901", "metadata": {}, "source": [ "### One-Hot Encoding" ] }, { "cell_type": "markdown", "id": "054555c7-6c6d-4de0-bc8b-ce5841b1c04b", "metadata": {}, "source": [ "Many algorithms require numerical inputs. For these, we need to convert categorical data into numbers.\n", "\n", "We could be tempted to replace each category with a given number. For example, if we had three categories (A, B,C), we could replace them with (0, 1, 2), but by doing so we would be imposing an ordinal trend (0<1<2), which does not need to be the case (Why A should be lower than B?).\n", "\n", "To prevent this, we can do one-hot encoding, where each category is represented by a separate binary column. For each observation, a '1' is placed in the column corresponding to its original category, with '0's in all other columns.\n", "\n", "In pandas we can do this using `pd.get_dummies()` function. \n", "\n", "Important parameters: \n", "\n", "- `prefix` : append prefix to column names (a good idea for later use)\n", "- `drop_first`: remove first level, as only `k-1` variables needed to represent `k` levels. 
You will normally want to set this to `True`.\n", "\n", "\n", "Have a look at the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) for further details." ] }, { "cell_type": "markdown", "id": "87c9f161-3042-4dc6-a91b-bbdbf220c521", "metadata": {}, "source": [ "Let's apply this to the `Embarked` column." ] }, { "cell_type": "markdown", "id": "0a04750e-f680-4ef6-9b69-70ea134d5764", "metadata": {}, "source": [ "- **Step 1:** One-Hot encoding on the fitting partition\n", "\n", "Use `pd.get_dummies()` on the fitting data to create one-hot encoded columns. \n", "Capture the resulting columns in the fitting data to use as a reference for the test data." ] }, { "cell_type": "code", "execution_count": 20, "id": "d9970721-d0e2-40de-b2c8-d30ec32b0b7f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinage_catfare_catEmbarked_QEmbarked_S
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaN(16.6, 32.45]Low01
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85(32.45, 48.3]High00
2313Heikkinen, Miss Lainafemale26.000STON/O2. 31012827.9250NaN(16.6, 32.45]Low01
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123(32.45, 48.3]High01
4503Allen, Mr. William Henrymale35.0003734508.0500NaN(32.45, 48.3]Low01
\n", "
" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "0 1 0 3 \n", "1 2 1 1 \n", "2 3 1 3 \n", "3 4 1 1 \n", "4 5 0 3 \n", "\n", " Name Sex Age SibSp \\\n", "0 Braund, Mr. Owen Harris male 22.0 1 \n", "1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n", "2 Heikkinen, Miss Laina female 26.0 0 \n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n", "4 Allen, Mr. William Henry male 35.0 0 \n", "\n", " Parch Ticket Fare Cabin age_cat fare_cat Embarked_Q \\\n", "0 0 A/5 21171 7.2500 NaN (16.6, 32.45] Low 0 \n", "1 0 PC 17599 71.2833 C85 (32.45, 48.3] High 0 \n", "2 0 STON/O2. 3101282 7.9250 NaN (16.6, 32.45] Low 0 \n", "3 0 113803 53.1000 C123 (32.45, 48.3] High 0 \n", "4 0 373450 8.0500 NaN (32.45, 48.3] Low 0 \n", "\n", " Embarked_S \n", "0 1 \n", "1 0 \n", "2 1 \n", "3 1 \n", "4 1 " ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fitting_encoded_df = pd.get_dummies(fitting_df, columns=['Embarked'], drop_first=True)\n", "\n", "# Save the columns after one-hot encoding the training set\n", "fitting_columns = fitting_encoded_df.columns\n", "\n", "fitting_encoded_df.head()" ] }, { "cell_type": "markdown", "id": "cb930e22-0d9a-4e5e-afd5-3bace4aaf493", "metadata": {}, "source": [ "- **Step 2**: Apply consistent encoding to out-of-sample test data\n", "\n", "Some categorical features in the test data may not include all the categories as in the fitting data. In this case, applying `pd.get_dummies()` would yield fewer columns relative to the fitting data.\n", "\n", "We can see this if we just use the first 5 observations of our test data:" ] }, { "cell_type": "code", "execution_count": 21, "id": "32a9e4d7-6ba4-46dd-a94f-5f6ad5697a2f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinage_catfare_catEmbarked_S
71271311Taylor, Mr. Elmer Zebleymale48.0101999652.0000C126(32.45, 48.3]High1
71371403Larsson, Mr. August Viktormale29.00075459.4833NaN(16.6, 32.45]Medium1
71471502Greenberg, Mr. Samuelmale52.00025064713.0000NaN(48.3, 64.15]Medium1
71571603Soholt, Mr. Peter Andreas Lauritz Andersenmale19.0003481247.6500F G73(16.6, 32.45]Low1
71671711Endres, Miss Caroline Louisefemale38.000PC 17757227.5250C45(32.45, 48.3]High0
\n", "
" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "712 713 1 1 \n", "713 714 0 3 \n", "714 715 0 2 \n", "715 716 0 3 \n", "716 717 1 1 \n", "\n", " Name Sex Age SibSp Parch \\\n", "712 Taylor, Mr. Elmer Zebley male 48.0 1 0 \n", "713 Larsson, Mr. August Viktor male 29.0 0 0 \n", "714 Greenberg, Mr. Samuel male 52.0 0 0 \n", "715 Soholt, Mr. Peter Andreas Lauritz Andersen male 19.0 0 0 \n", "716 Endres, Miss Caroline Louise female 38.0 0 0 \n", "\n", " Ticket Fare Cabin age_cat fare_cat Embarked_S \n", "712 19996 52.0000 C126 (32.45, 48.3] High 1 \n", "713 7545 9.4833 NaN (16.6, 32.45] Medium 1 \n", "714 250647 13.0000 NaN (48.3, 64.15] Medium 1 \n", "715 348124 7.6500 F G73 (16.6, 32.45] Low 1 \n", "716 PC 17757 227.5250 C45 (32.45, 48.3] High 0 " ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.get_dummies(testing_df.iloc[:5,:], columns=['Embarked'], drop_first=True)" ] }, { "cell_type": "markdown", "id": "50eb1de5-108f-407f-84ad-d5ee7d641185", "metadata": {}, "source": [ "To handle this:\n", "\n", "(1) Reindex the test data to match the columns from the fitting data, filling any missing columns with zeros (since those categories are absent in the test data). \n", "(2) Add missing columns to ensure both datasets have the same structure. \n" ] }, { "cell_type": "code", "execution_count": 22, "id": "5607d79e-771a-439d-aebb-0b13aed15a09", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinage_catfare_catEmbarked_QEmbarked_S
71271311Taylor, Mr. Elmer Zebleymale48.0101999652.0000C126(32.45, 48.3]High01
71371403Larsson, Mr. August Viktormale29.00075459.4833NaN(16.6, 32.45]Medium01
71471502Greenberg, Mr. Samuelmale52.00025064713.0000NaN(48.3, 64.15]Medium01
71571603Soholt, Mr. Peter Andreas Lauritz Andersenmale19.0003481247.6500F G73(16.6, 32.45]Low01
71671711Endres, Miss Caroline Louisefemale38.000PC 17757227.5250C45(32.45, 48.3]High00
\n", "
" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "712 713 1 1 \n", "713 714 0 3 \n", "714 715 0 2 \n", "715 716 0 3 \n", "716 717 1 1 \n", "\n", " Name Sex Age SibSp Parch \\\n", "712 Taylor, Mr. Elmer Zebley male 48.0 1 0 \n", "713 Larsson, Mr. August Viktor male 29.0 0 0 \n", "714 Greenberg, Mr. Samuel male 52.0 0 0 \n", "715 Soholt, Mr. Peter Andreas Lauritz Andersen male 19.0 0 0 \n", "716 Endres, Miss Caroline Louise female 38.0 0 0 \n", "\n", " Ticket Fare Cabin age_cat fare_cat Embarked_Q Embarked_S \n", "712 19996 52.0000 C126 (32.45, 48.3] High 0 1 \n", "713 7545 9.4833 NaN (16.6, 32.45] Medium 0 1 \n", "714 250647 13.0000 NaN (48.3, 64.15] Medium 0 1 \n", "715 348124 7.6500 F G73 (16.6, 32.45] Low 0 1 \n", "716 PC 17757 227.5250 C45 (32.45, 48.3] High 0 0 " ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# One-hot encode the test data\n", "testing_encoded_df = pd.get_dummies(testing_df.iloc[:5,:], columns=['Embarked'], drop_first=True)\n", "\n", "# Reindex test data to match training data columns, filling missing columns with 0\n", "testing_encoded_df = testing_encoded_df.reindex(columns=fitting_columns, fill_value=0)\n", "\n", "testing_encoded_df" ] }, { "cell_type": "markdown", "id": "861d6038-3deb-44b8-8337-059d6168d22f", "metadata": {}, "source": [ "## Working with text data" ] }, { "cell_type": "markdown", "id": "aee726db-9c3f-49b1-b290-2fff80c8198a", "metadata": {}, "source": [ "This a very common type of data. \n", "\n", "Common text data problems involve:\n", "\n", "1) data inconsistency\n", "2) fixed length violations\n", "3) typos" ] }, { "cell_type": "markdown", "id": "9c6726f6-cb32-4f62-866c-3b54af812bb1", "metadata": {}, "source": [ "Pandas provides a set of string processing methods to easilty operate on each element of the string elements. 
These can be accessed via the `str` attribute and generally have names matching the equivalent built-in string methods, such as `lower()`, `upper()`, `split()`, `contains()`, and `replace()`.\n", "\n", "This is a safe operation in terms of data leakage, since it acts on each observation individually:" ] }, { "cell_type": "code", "execution_count": 16, "id": "855f8092-6cc9-414d-8d33-893952a6227b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarkedage_cat
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS(20, 30]
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C(30, 40]
2313Heikkinen, Miss Lainafemale26.000STON/O2. 31012827.9250NaNS(20, 30]
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S(30, 40]
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS(30, 40]
\n", "
" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "0 1 0 3 \n", "1 2 1 1 \n", "2 3 1 3 \n", "3 4 1 1 \n", "4 5 0 3 \n", "\n", " Name Sex Age SibSp \\\n", "0 Braund, Mr. Owen Harris male 22.0 1 \n", "1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n", "2 Heikkinen, Miss Laina female 26.0 0 \n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n", "4 Allen, Mr. William Henry male 35.0 0 \n", "\n", " Parch Ticket Fare Cabin Embarked age_cat \n", "0 0 A/5 21171 7.2500 NaN S (20, 30] \n", "1 0 PC 17599 71.2833 C85 C (30, 40] \n", "2 0 STON/O2. 3101282 7.9250 NaN S (20, 30] \n", "3 0 113803 53.1000 C123 S (30, 40] \n", "4 0 373450 8.0500 NaN S (30, 40] " ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "full_df.head()" ] }, { "cell_type": "code", "execution_count": 17, "id": "8562b51b-ad9e-442e-bf8f-f41bd51ca27f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 braund, mr. owen harris\n", "1 cumings, mrs. john bradley (florence briggs th...\n", "2 heikkinen, miss laina\n", "3 futrelle, mrs. jacques heath (lily may peel)\n", "4 allen, mr. william henry\n", " ... \n", "886 montvila, rev. juozas\n", "887 graham, miss margaret edith\n", "888 johnston, miss catherine helen \"carrie\"\n", "889 behr, mr. karl howell\n", "890 dooley, mr. patrick\n", "Name: Name, Length: 891, dtype: object" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# convert names to lowercase\n", "full_df[\"Name\"].str.lower()" ] }, { "cell_type": "code", "execution_count": 18, "id": "b6083e5f-ef6d-4e80-82a2-1d983825fe03", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 Braund\n", "1 Cumings\n", "2 Heikkinen\n", "3 Futrelle\n", "4 Allen\n", " ... 
\n", "886 Montvila\n", "887 Graham\n", "888 Johnston\n", "889 Behr\n", "890 Dooley\n", "Name: Name, Length: 891, dtype: object" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Get last names\n", "full_df[\"Name\"].str.split().str[0].str.replace(\",\", \"\")" ] }, { "cell_type": "markdown", "id": "bfbb9ece-ad60-443a-b675-a4d848e162ce", "metadata": {}, "source": [ "Have a loot at [this](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html) for further details." ] }, { "cell_type": "markdown", "id": "70bb1d15-417c-4f8a-9b1e-e4834c0179a4", "metadata": {}, "source": [ "## Practice exercises" ] }, { "cell_type": "markdown", "id": "4551c543-58ca-4fea-a065-9d4ff113572d", "metadata": {}, "source": [ "```{exercise}\n", ":label: pandas15\n", "\n", "The dataframe below contains two categoricals. Apply one-hot encoding to each of them, giving them a prefix and dropping the first level from each. \n", "\n", "Print the new dataframe to insure correctness.\n", "\n", "Hint: You might want to dummify each column into separate new dataframes, and then merge them together by using.\n", "```" ] }, { "cell_type": "code", "execution_count": 23, "id": "a4243404-8e8d-447b-9205-e4d8d1696de4", "metadata": {}, "outputs": [], "source": [ "cats = pd.DataFrame({'breed':['persian','persian','siamese','himalayan','burmese'], \n", " 'color':['calico','white','seal point','cream','sable']})" ] }, { "cell_type": "code", "execution_count": 24, "id": "2d5d412b-8281-4851-b866-5d820a4244af", "metadata": {}, "outputs": [], "source": [ "# Your answers from here" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.9" } }, "nbformat": 4, "nbformat_minor": 5 }