{ "cells": [ { "cell_type": "markdown", "id": "a1c1a862-4c4e-4588-9a50-7f772506cae8", "metadata": {}, "source": [ "# Pandas: Introduction to Feature Engineering" ] }, { "cell_type": "markdown", "id": "7b28d51c-bab5-4ecd-be40-7e2956d38939", "metadata": {}, "source": [ "![](https://ds1002-resources.s3.amazonaws.com/images/workflow.png)\n", "\n", "In this lesson, you will be introduced to feature enginering techniques using pandas. Specifically, we will cover:\n", " \n", "- Handling missing data\n", "- Creating new columns\n", "- Working with categorical data\n", "- Working with text data" ] }, { "cell_type": "markdown", "id": "34615f2f-0779-4799-9d09-c3ace9a6dbc1", "metadata": {}, "source": [ "## Introduction\n", "\n", "Feature engineering is the process of selecting, manipulating, and transforming raw data into features that can be used in subsequent analyses. This process is particularly crucial in data science applications.\n", "\n", "Pandas provides many functionalities for feature engineering, and we will cover some of them here with a caveat: data science applications often emphasize **generalizability**. To achieve this, it is standard practice to split the data into **fitting** (sometimes called \"training\") and **testing** partitions. The fitting partition is used to estimate everything needed, including feature engineering steps, before model deployment. The model's performance is then evaluated on the testing partition. 
This approach ensures there is no **data leakage**, which occurs when information that would not be available at prediction time is used in building the model, often resulting in overly optimistic results.\n", "\n", "Unfortunately, feature engineering in Pandas may not always keep these two partitions strictly separated, which is why you may want to use [scikit-learn](https://scikit-learn.org/stable/modules/preprocessing.html) for this process in the future.\n", "\n", "Nevertheless, in this lesson, we will try to cover some of Pandas' functionalities for feature engineering in a safe way in an out-of-sample context." ] }, { "cell_type": "code", "execution_count": 1, "id": "05675acc-7b87-405a-bef3-9bffc1996d97", "metadata": {}, "outputs": [], "source": [ "# Load dependencies\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "id": "8c1f9c8b-0dd2-4566-8905-7d88f5aea763", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
2313Heikkinen, Miss Lainafemale26.000STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS
\n", "
" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "0 1 0 3 \n", "1 2 1 1 \n", "2 3 1 3 \n", "3 4 1 1 \n", "4 5 0 3 \n", "\n", " Name Sex Age SibSp \\\n", "0 Braund, Mr. Owen Harris male 22.0 1 \n", "1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n", "2 Heikkinen, Miss Laina female 26.0 0 \n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n", "4 Allen, Mr. William Henry male 35.0 0 \n", "\n", " Parch Ticket Fare Cabin Embarked \n", "0 0 A/5 21171 7.2500 NaN S \n", "1 0 PC 17599 71.2833 C85 C \n", "2 0 STON/O2. 3101282 7.9250 NaN S \n", "3 0 113803 53.1000 C123 S \n", "4 0 373450 8.0500 NaN S " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# We will use this dataset, common for new learners in machine learning\n", "full_df = pd.read_csv('https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv')\n", "full_df.head()" ] }, { "cell_type": "code", "execution_count": 4, "id": "1094e195-79d5-4505-af21-c345b992e8a2", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 891 entries, 0 to 890\n", "Data columns (total 12 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 PassengerId 891 non-null int64 \n", " 1 Survived 891 non-null int64 \n", " 2 Pclass 891 non-null int64 \n", " 3 Name 891 non-null object \n", " 4 Sex 891 non-null object \n", " 5 Age 714 non-null float64\n", " 6 SibSp 891 non-null int64 \n", " 7 Parch 891 non-null int64 \n", " 8 Ticket 891 non-null object \n", " 9 Fare 891 non-null float64\n", " 10 Cabin 204 non-null object \n", " 11 Embarked 889 non-null object \n", "dtypes: float64(2), int64(5), object(5)\n", "memory usage: 83.7+ KB\n" ] } ], "source": [ "# Use the info method to get information about this dataset\n", "full_df.info()" ] }, { "cell_type": "code", "execution_count": 5, "id": "abfccb9c-0b1e-4df0-8989-95797776ea16", "metadata": {}, "outputs": [ 
{ "name": "stdout", "output_type": "stream", "text": [ "we have 712 observations in the fitting set; and 179 in the test set\n" ] } ], "source": [ "# Let's create our partitions, with the fitting set taking 80% of the observations and the testing set the remaining\n", "fitting_df = full_df.iloc[:int(full_df.shape[0]*0.8),:].copy() \n", "testing_df = full_df.iloc[int(full_df.shape[0]*0.8):,:].copy() # This copy operation is to avoid some annoying warnings later.\n", "\n", "print(\"we have\", fitting_df.shape[0], \"observations in the fitting set;\", \"and\", testing_df.shape[0], \"in the test set\")" ] }, { "cell_type": "markdown", "id": "0020fb6f-5039-4934-a605-66d0dce851bf", "metadata": {}, "source": [ "## Handling nissing values\n", "\n", "- `dropna()`: In principle, this is a method that you can safely apply. Refer to a previous lesson to know how to use it.\n", "- `fillna()`: This is fine as long as you fill all the NaN's with a specific value (e.g. a zero)." ] }, { "cell_type": "code", "execution_count": 6, "id": "88d6d56e-fc08-41cb-bcc4-561d6b25ceeb", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 891 entries, 0 to 890\n", "Series name: Age\n", "Non-Null Count Dtype \n", "-------------- ----- \n", "891 non-null float64\n", "dtypes: float64(1)\n", "memory usage: 7.1 KB\n" ] } ], "source": [ "# This are OK\n", "full_df[\"Age\"].fillna(0).info()" ] }, { "cell_type": "markdown", "id": "8b379713-a3c8-47ef-8d8c-a7f4b589e93d", "metadata": {}, "source": [ "Replacing NaN's with for example the mean first and then splitting the data would be problematic, because both partitions would no longer be independent." ] }, { "cell_type": "code", "execution_count": 13, "id": "352b7108-e74f-437b-9cf1-c2b411a814ea", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 22.000000\n", "1 38.000000\n", "2 26.000000\n", "3 35.000000\n", "4 35.000000\n", " ... 
\n", "886 27.000000\n", "887 19.000000\n", "888 29.699118\n", "889 26.000000\n", "890 32.000000\n", "Name: Age, Length: 891, dtype: float64" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "full_df[\"Age\"].fillna(full_df[\"Age\"].mean())" ] }, { "cell_type": "markdown", "id": "200682a0-05eb-4a4c-8bab-cd60793f973b", "metadata": {}, "source": [ "You have to estimate the mean on the fitting partition and use that estimation to populate both data partitions." ] }, { "cell_type": "code", "execution_count": 7, "id": "94463461-3418-4e05-89fd-5f72d0cebdb0", "metadata": {}, "outputs": [], "source": [ "# The mean is estimating on the fitting partition\n", "mean_fitting = fitting_df[\"Age\"].mean()\n", "\n", "# And use to fill NaN's in both\n", "fitting_df[\"Age\"] = fitting_df[\"Age\"].fillna(mean_fitting)\n", "testing_df[\"Age\"] = testing_df[\"Age\"].fillna(mean_fitting)" ] }, { "cell_type": "markdown", "id": "4803ef2e-0914-424f-b014-fde679d0e58d", "metadata": {}, "source": [ "## Creating new columns\n", "\n", "This is in principle a safe operation as long as it applied at the row-level. Refer to a previous lesson for further details on how to create new columns." ] }, { "cell_type": "markdown", "id": "1dd11b64-b5ac-41b3-99f7-3379a08f8523", "metadata": {}, "source": [ "## Working categorical data" ] }, { "cell_type": "markdown", "id": "67312d70-f16f-4f13-9913-0fd110f04e47", "metadata": {}, "source": [ "### Convert continuous data into categories" ] }, { "cell_type": "markdown", "id": "cf83dacb-6d18-40d4-9d2e-e8d1c3d4d3d5", "metadata": {}, "source": [ "Normally here, one would do the following two steps:\n", "\n", "- **Step 1**: Bin fitting data and retrieve bin edges.\n", "- **Step 2**: Apply consistent binning to out-of-sample test data." 
] }, { "cell_type": "markdown", "id": "68550f01-6cb2-449c-a523-270889bce309", "metadata": {}, "source": [ "- `pd.cut()`: Bin values into discrete intervals.\n", "\n", "The following parameter are important: `bin`, which sets the criteria to bin by (Use `help` to see the different values it can take), and `retbins` to be able to transfer information to the testing partition." ] }, { "cell_type": "code", "execution_count": 8, "id": "6d97de53-2699-4e2e-997a-458df6d6355d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Bin edges: [ 0.67075 16.6 32.45 48.3 64.15 80. ]\n" ] } ], "source": [ "# Bin the age from the fitting'dat in 5 equally separated ranges. Note that retbins is True to return the estimated edges\n", "fitting_df['age_cat'], bin_edges = pd.cut(fitting_df['Age'], bins=5, retbins=True)\n", "print(\"Bin edges:\", bin_edges)" ] }, { "cell_type": "markdown", "id": "12cdb9a4-94d1-413e-8261-5713b261d4f8", "metadata": {}, "source": [ "Use the `bins` argument and the computed ranges to bin your test data:" ] }, { "cell_type": "code", "execution_count": 9, "id": "a9bf0a44-288a-4361-98f5-072f4fe75bde", "metadata": {}, "outputs": [], "source": [ "# Apply the same binning with the predefined edges\n", "testing_df['age_cat'] = pd.cut(testing_df['Age'], bins=bin_edges)" ] }, { "cell_type": "markdown", "id": "3540da34-6a58-4682-8ad5-a4c2d23d3aee", "metadata": {}, "source": [ "**Note that supplying pre-specified ranges is OK before data splitting**" ] }, { "cell_type": "code", "execution_count": 10, "id": "7e544299-47b5-46c3-96c1-466223d8a2a7", "metadata": {}, "outputs": [], "source": [ "# This would be OK\n", "full_df['age_cat'] = pd.cut(full_df['Age'], bins = [20, 30, 40, 50, 60])" ] }, { "cell_type": "markdown", "id": "84a8580e-bc4f-4eaa-85b9-7600092efc25", "metadata": {}, "source": [ "- `pd.qcut()`: Quantile-based discretization function." 
] }, { "cell_type": "code", "execution_count": 11, "id": "5984704b-9696-4841-ba43-6c10dfa3ed58", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Bin edges: [ 0. 8.6625 26.25 512.3292]\n" ] } ], "source": [ "# Bin the training data using qcut and retrieve the edges\n", "fitting_df['fare_cat'], bin_edges = pd.qcut(fitting_df.copy()['Fare'], q=3, labels=['Low', 'Medium', 'High'], retbins=True)\n", "print(\"Bin edges:\", bin_edges)" ] }, { "cell_type": "code", "execution_count": 12, "id": "0813355d-197b-4cfb-93bb-51a3fee8217a", "metadata": {}, "outputs": [], "source": [ "# Use pd.cut() with the bin edges from the training data\n", "testing_df['fare_cat'] = pd.cut(testing_df['Fare'], bins=bin_edges, labels=['Low', 'Medium', 'High'])" ] }, { "cell_type": "markdown", "id": "9034685c-8352-48fd-affb-7dcda7f70901", "metadata": {}, "source": [ "### One-Hot Encoding" ] }, { "cell_type": "markdown", "id": "054555c7-6c6d-4de0-bc8b-ce5841b1c04b", "metadata": {}, "source": [ "Many algorithms require numerical inputs. For these, we need to convert categorical data into numbers.\n", "\n", "We could be tempted to replace each category with a given number. For example, if we had three categories (A, B,C), we could replace them with (0, 1, 2), but by doing so we would be imposing an ordinal trend (0<1<2), which does not need to be the case (Why A should be lower than B?).\n", "\n", "To prevent this, we can do one-hot encoding, where each category is represented by a separate binary column. For each observation, a '1' is placed in the column corresponding to its original category, with '0's in all other columns.\n", "\n", "In pandas we can do this using `pd.get_dummies()` function. \n", "\n", "Important parameters: \n", "\n", "- `prefix` : append prefix to column names (a good idea for later use)\n", "- `drop_first`: remove first level, as only `k-1` variables needed to represent `k` levels. 
You will normally want to set this to `True`.\n", "\n", "\n", "Have a look at the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) for further details." ] }, { "cell_type": "markdown", "id": "87c9f161-3042-4dc6-a91b-bbdbf220c521", "metadata": {}, "source": [ "Let's apply this to the `Embarked` column." ] }, { "cell_type": "markdown", "id": "0a04750e-f680-4ef6-9b69-70ea134d5764", "metadata": {}, "source": [ "- **Step 1:** One-Hot encoding on the fitting partition\n", "\n", "Use `pd.get_dummies()` on the fitting data to create one-hot encoded columns. \n", "Capture the resulting columns in the fitting data to use as a reference for the test data." ] }, { "cell_type": "code", "execution_count": 20, "id": "d9970721-d0e2-40de-b2c8-d30ec32b0b7f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinage_catfare_catEmbarked_QEmbarked_S
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaN(16.6, 32.45]Low01
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85(32.45, 48.3]High00
2313Heikkinen, Miss Lainafemale26.000STON/O2. 31012827.9250NaN(16.6, 32.45]Low01
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123(32.45, 48.3]High01
4503Allen, Mr. William Henrymale35.0003734508.0500NaN(32.45, 48.3]Low01
\n", "
" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "0 1 0 3 \n", "1 2 1 1 \n", "2 3 1 3 \n", "3 4 1 1 \n", "4 5 0 3 \n", "\n", " Name Sex Age SibSp \\\n", "0 Braund, Mr. Owen Harris male 22.0 1 \n", "1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n", "2 Heikkinen, Miss Laina female 26.0 0 \n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n", "4 Allen, Mr. William Henry male 35.0 0 \n", "\n", " Parch Ticket Fare Cabin age_cat fare_cat Embarked_Q \\\n", "0 0 A/5 21171 7.2500 NaN (16.6, 32.45] Low 0 \n", "1 0 PC 17599 71.2833 C85 (32.45, 48.3] High 0 \n", "2 0 STON/O2. 3101282 7.9250 NaN (16.6, 32.45] Low 0 \n", "3 0 113803 53.1000 C123 (32.45, 48.3] High 0 \n", "4 0 373450 8.0500 NaN (32.45, 48.3] Low 0 \n", "\n", " Embarked_S \n", "0 1 \n", "1 0 \n", "2 1 \n", "3 1 \n", "4 1 " ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fitting_encoded_df = pd.get_dummies(fitting_df, columns=['Embarked'], drop_first=True)\n", "\n", "# Save the columns after one-hot encoding the training set\n", "fitting_columns = fitting_encoded_df.columns\n", "\n", "fitting_encoded_df.head()" ] }, { "cell_type": "markdown", "id": "cb930e22-0d9a-4e5e-afd5-3bace4aaf493", "metadata": {}, "source": [ "- **Step 2**: Apply consistent encoding to out-of-sample test data\n", "\n", "Some categorical features in the test data may not include all the categories as in the fitting data. In this case, applying `pd.get_dummies()` would yield fewer columns relative to the fitting data.\n", "\n", "We can see this if we just use the first 5 observations of our test data:" ] }, { "cell_type": "code", "execution_count": 21, "id": "32a9e4d7-6ba4-46dd-a94f-5f6ad5697a2f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinage_catfare_catEmbarked_S
71271311Taylor, Mr. Elmer Zebleymale48.0101999652.0000C126(32.45, 48.3]High1
71371403Larsson, Mr. August Viktormale29.00075459.4833NaN(16.6, 32.45]Medium1
71471502Greenberg, Mr. Samuelmale52.00025064713.0000NaN(48.3, 64.15]Medium1
71571603Soholt, Mr. Peter Andreas Lauritz Andersenmale19.0003481247.6500F G73(16.6, 32.45]Low1
71671711Endres, Miss Caroline Louisefemale38.000PC 17757227.5250C45(32.45, 48.3]High0
\n", "
" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "712 713 1 1 \n", "713 714 0 3 \n", "714 715 0 2 \n", "715 716 0 3 \n", "716 717 1 1 \n", "\n", " Name Sex Age SibSp Parch \\\n", "712 Taylor, Mr. Elmer Zebley male 48.0 1 0 \n", "713 Larsson, Mr. August Viktor male 29.0 0 0 \n", "714 Greenberg, Mr. Samuel male 52.0 0 0 \n", "715 Soholt, Mr. Peter Andreas Lauritz Andersen male 19.0 0 0 \n", "716 Endres, Miss Caroline Louise female 38.0 0 0 \n", "\n", " Ticket Fare Cabin age_cat fare_cat Embarked_S \n", "712 19996 52.0000 C126 (32.45, 48.3] High 1 \n", "713 7545 9.4833 NaN (16.6, 32.45] Medium 1 \n", "714 250647 13.0000 NaN (48.3, 64.15] Medium 1 \n", "715 348124 7.6500 F G73 (16.6, 32.45] Low 1 \n", "716 PC 17757 227.5250 C45 (32.45, 48.3] High 0 " ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.get_dummies(testing_df.iloc[:5,:], columns=['Embarked'], drop_first=True)" ] }, { "cell_type": "markdown", "id": "50eb1de5-108f-407f-84ad-d5ee7d641185", "metadata": {}, "source": [ "To handle this:\n", "\n", "(1) Reindex the test data to match the columns from the fitting data, filling any missing columns with zeros (since those categories are absent in the test data). \n", "(2) Add missing columns to ensure both datasets have the same structure. \n" ] }, { "cell_type": "code", "execution_count": 22, "id": "5607d79e-771a-439d-aebb-0b13aed15a09", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinage_catfare_catEmbarked_QEmbarked_S
71271311Taylor, Mr. Elmer Zebleymale48.0101999652.0000C126(32.45, 48.3]High01
71371403Larsson, Mr. August Viktormale29.00075459.4833NaN(16.6, 32.45]Medium01
71471502Greenberg, Mr. Samuelmale52.00025064713.0000NaN(48.3, 64.15]Medium01
71571603Soholt, Mr. Peter Andreas Lauritz Andersenmale19.0003481247.6500F G73(16.6, 32.45]Low01
71671711Endres, Miss Caroline Louisefemale38.000PC 17757227.5250C45(32.45, 48.3]High00
\n", "
" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "712 713 1 1 \n", "713 714 0 3 \n", "714 715 0 2 \n", "715 716 0 3 \n", "716 717 1 1 \n", "\n", " Name Sex Age SibSp Parch \\\n", "712 Taylor, Mr. Elmer Zebley male 48.0 1 0 \n", "713 Larsson, Mr. August Viktor male 29.0 0 0 \n", "714 Greenberg, Mr. Samuel male 52.0 0 0 \n", "715 Soholt, Mr. Peter Andreas Lauritz Andersen male 19.0 0 0 \n", "716 Endres, Miss Caroline Louise female 38.0 0 0 \n", "\n", " Ticket Fare Cabin age_cat fare_cat Embarked_Q Embarked_S \n", "712 19996 52.0000 C126 (32.45, 48.3] High 0 1 \n", "713 7545 9.4833 NaN (16.6, 32.45] Medium 0 1 \n", "714 250647 13.0000 NaN (48.3, 64.15] Medium 0 1 \n", "715 348124 7.6500 F G73 (16.6, 32.45] Low 0 1 \n", "716 PC 17757 227.5250 C45 (32.45, 48.3] High 0 0 " ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# One-hot encode the test data\n", "testing_encoded_df = pd.get_dummies(testing_df.iloc[:5,:], columns=['Embarked'], drop_first=True)\n", "\n", "# Reindex test data to match training data columns, filling missing columns with 0\n", "testing_encoded_df = testing_encoded_df.reindex(columns=fitting_columns, fill_value=0)\n", "\n", "testing_encoded_df" ] }, { "cell_type": "markdown", "id": "861d6038-3deb-44b8-8337-059d6168d22f", "metadata": {}, "source": [ "## Working with text data" ] }, { "cell_type": "markdown", "id": "aee726db-9c3f-49b1-b290-2fff80c8198a", "metadata": {}, "source": [ "This a very common type of data. \n", "\n", "Common text data problems involve:\n", "\n", "1) data inconsistency\n", "2) fixed length violations\n", "3) typos" ] }, { "cell_type": "markdown", "id": "9c6726f6-cb32-4f62-866c-3b54af812bb1", "metadata": {}, "source": [ "Pandas provides a set of string processing methods to easilty operate on each element of the string elements. 
These can be accessed via the `str` attribute and generally have names matching the equivalent built-in string methods, such as `lower()`, `upper()`, `split()`, `contains()`, and `replace()`.\n", "\n", "This is a safe operation in terms of data leakage, since it acts on each observation individually:" ] }, { "cell_type": "code", "execution_count": 16, "id": "855f8092-6cc9-414d-8d33-893952a6227b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarkedage_cat
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS(20, 30]
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C(30, 40]
2313Heikkinen, Miss Lainafemale26.000STON/O2. 31012827.9250NaNS(20, 30]
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S(30, 40]
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS(30, 40]
\n", "
" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "0 1 0 3 \n", "1 2 1 1 \n", "2 3 1 3 \n", "3 4 1 1 \n", "4 5 0 3 \n", "\n", " Name Sex Age SibSp \\\n", "0 Braund, Mr. Owen Harris male 22.0 1 \n", "1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n", "2 Heikkinen, Miss Laina female 26.0 0 \n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n", "4 Allen, Mr. William Henry male 35.0 0 \n", "\n", " Parch Ticket Fare Cabin Embarked age_cat \n", "0 0 A/5 21171 7.2500 NaN S (20, 30] \n", "1 0 PC 17599 71.2833 C85 C (30, 40] \n", "2 0 STON/O2. 3101282 7.9250 NaN S (20, 30] \n", "3 0 113803 53.1000 C123 S (30, 40] \n", "4 0 373450 8.0500 NaN S (30, 40] " ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "full_df.head()" ] }, { "cell_type": "code", "execution_count": 17, "id": "8562b51b-ad9e-442e-bf8f-f41bd51ca27f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 braund, mr. owen harris\n", "1 cumings, mrs. john bradley (florence briggs th...\n", "2 heikkinen, miss laina\n", "3 futrelle, mrs. jacques heath (lily may peel)\n", "4 allen, mr. william henry\n", " ... \n", "886 montvila, rev. juozas\n", "887 graham, miss margaret edith\n", "888 johnston, miss catherine helen \"carrie\"\n", "889 behr, mr. karl howell\n", "890 dooley, mr. patrick\n", "Name: Name, Length: 891, dtype: object" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# convert names to lowercase\n", "full_df[\"Name\"].str.lower()" ] }, { "cell_type": "code", "execution_count": 18, "id": "b6083e5f-ef6d-4e80-82a2-1d983825fe03", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 Braund\n", "1 Cumings\n", "2 Heikkinen\n", "3 Futrelle\n", "4 Allen\n", " ... 
\n", "886 Montvila\n", "887 Graham\n", "888 Johnston\n", "889 Behr\n", "890 Dooley\n", "Name: Name, Length: 891, dtype: object" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Get last names\n", "full_df[\"Name\"].str.split().str[0].str.replace(\",\", \"\")" ] }, { "cell_type": "markdown", "id": "bfbb9ece-ad60-443a-b675-a4d848e162ce", "metadata": {}, "source": [ "Have a loot at [this](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html) for further details." ] }, { "cell_type": "markdown", "id": "70bb1d15-417c-4f8a-9b1e-e4834c0179a4", "metadata": {}, "source": [ "## Practice exercises" ] }, { "cell_type": "markdown", "id": "4551c543-58ca-4fea-a065-9d4ff113572d", "metadata": {}, "source": [ "```{exercise}\n", ":label: pandas15\n", "\n", "The dataframe below contains two categoricals. Apply one-hot encoding to each of them, giving them a prefix and dropping the first level from each. \n", "\n", "Print the new dataframe to insure correctness.\n", "\n", "Hint: You might want to dummify each column into separate new dataframes, and then merge them together by using.\n", "```" ] }, { "cell_type": "code", "execution_count": 23, "id": "a4243404-8e8d-447b-9205-e4d8d1696de4", "metadata": {}, "outputs": [], "source": [ "cats = pd.DataFrame({'breed':['persian','persian','siamese','himalayan','burmese'], \n", " 'color':['calico','white','seal point','cream','sable']})" ] }, { "cell_type": "code", "execution_count": 24, "id": "2d5d412b-8281-4851-b866-5d820a4244af", "metadata": {}, "outputs": [], "source": [ "# Your answers from here" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.9" } }, "nbformat": 4, "nbformat_minor": 5 }