Introduction to Pandas

What you will learn in this lesson:

What is Pandas
How to import Pandas
Create series and dataframes
A first glimpse of dataframe attributes and methods.

What is Pandas?#

Pandas is a fundamental data manipulation library in Python, widely used in data science and analytics.

It provides two key data structures:

Series: A one-dimensional labeled array capable of holding any data type.
DataFrame: A two-dimensional labeled data structure with columns that can contain different types of data.

By far, the most important data structure in Pandas (and R) is the dataframe. In most data science applications, we work with tabular data where rows represent observations and columns represent features. Effective data manipulation is critical for preparing clean and useful datasets for analysis, and this is where Pandas (or R’s dplyr) and DataFrames play an essential role.

While Pandas dataframes are inspired by R’s data frame structure, there are key differences beyond the programming languages. Notably, Pandas dataframes have indexes, whereas R data frames do not, which leads to different approaches to handling and manipulating data.

Importing pandas#

Pandas is a package, so to use it, we need to first import it.

It is very common to import Pandas under the alias pd:

import pandas as pd

Axis Labels#

Before diving into series and dataframes, it is important to understand that both data structures store data along axes (as in NumPy), but the data also carry labels along each axis. These axis labels are collectively referred to as the index.

Therefore, series and dataframes have:

An array that holds the data.
The indexes that hold the labels for observations (rows) and features (columns).

In contrast to NumPy, then, Pandas attaches identifying labels to the data in a natural way, making it easier to work with structured data.

Why do we use an index?

It allows you to access elements in an array by name.
It enables series objects with shared index labels to be easily combined.
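The two points above can be sketched with a pair of small series. The labels and numbers here are made up for illustration; the idea is that lookups use the label, and arithmetic pairs values by label rather than by position:

```python
import pandas as pd

# Hypothetical data: labels and numbers are made up for illustration
prices = pd.Series([1.0, 2.0, 3.0], index=["apple", "banana", "cherry"])
stock = pd.Series([10, 20, 30], index=["cherry", "apple", "banana"])

# Access an element by its label rather than by its position
print(prices["banana"])

# Arithmetic aligns the two series on their labels, not on their order
total_value = prices * stock
print(total_value)
```

Although the two series list their labels in a different order, each price is multiplied by the stock that carries the same label.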

In fact, a dataframe is a collection of series with a common index.

To this collection of series the dataframe adds a set of labels along the horizontal axis.

The index holds the row labels, along axis 0.
The columns are another kind of index, called axis 1.
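As a minimal sketch of this idea (the column names and values are invented), selecting one column returns a series that shares the dataframe's row index, and the axis argument picks the direction along which an operation collapses the data:

```python
import pandas as pd

# Hypothetical dataframe with explicit row labels
df_axes = pd.DataFrame(
    {"height": [1.6, 1.8], "weight": [55, 80]},
    index=["ana", "ben"],
)

# Each column is a Series that shares the dataframe's row index
col = df_axes["height"]
print(type(col).__name__)
print(col.index.equals(df_axes.index))

# axis=0 collapses the rows (one result per column);
# axis=1 collapses the columns (one result per row)
print(df_axes.sum(axis=0))
print(df_axes.sum(axis=1))
```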

It is crucial to understand the difference between the index of a dataframe and its data in order to understand how dataframes work. Many errors stem from not understanding this difference.

Indexes are powerful and controversial.

They enable complex operations when accessing or combining data.
However, they can be costly in terms of performance and challenging to work with (especially multi-indexes).
Users coming from R might find Pandas dataframes behave differently than expected, leading to some confusion.

Series#

A series is essentially a one-dimensional array with labels along its axis. Its data must be of a single type, similar to NumPy arrays (which are used internally by Pandas).

The simplest way to create a series is by using the pd.Series() function.

How to create a series#

From a list

data = [10, 20, 30, 40, 50]
series = pd.Series(data)
print(series)

0    10
1    20
2    30
3    40
4    50
dtype: int64

From a dictionary

# Series from a dictionary
data_dict = {'a': 1, 'b': 2, 'c': 3}
series_dict = pd.Series(data_dict)
print(series_dict)

a    1
b    2
c    3
dtype: int64
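Two further details may be worth a quick sketch: pd.Series() also accepts an index argument for explicit labels, and mixed numeric input is stored under a single common type. The values and labels below are arbitrary:

```python
import pandas as pd

# Pass explicit labels with the index argument
s_labeled = pd.Series([10, 20, 30], index=["x", "y", "z"])
print(s_labeled)

# Mixing integers and floats yields one common dtype for the whole series
s_mixed = pd.Series([1, 2.5, 3])
print(s_mixed.dtype)
```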

Properties overview#

With the default integer index, indexing and slicing work much as they do with lists:

# Accessing elements in a Series
print(series[0])  # First element
print(series[1:3])  # Slicing

10
1    20
2    30
dtype: int64
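When the index is not the default range of integers, it helps to say explicitly whether a label or a position is meant. A small sketch with an arbitrary string index, using the .loc (label-based) and .iloc (position-based) accessors:

```python
import pandas as pd

s_abc = pd.Series([10, 20, 30], index=["a", "b", "c"])

print(s_abc.loc["b"])      # select by label
print(s_abc.iloc[1])       # select by position
print(s_abc.loc["a":"b"])  # label slices include the end label
print(s_abc.iloc[0:1])     # position slices exclude the end, as with lists
```

Note the difference between the two slices: the label slice returns two elements, while the position slice returns one.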

It has methods and attributes:

# This attribute returns the series' data as a NumPy array
series.values

array([10, 20, 30, 40, 50])

# These two methods return the sum and the mean
print(series.sum())
print(series.mean())

150
30.0

Data Frames#

As mentioned earlier, a dataframe is a two-dimensional labeled data structure with columns that can contain different data types. You can think of it as similar to an Excel table, where each column can store different types of data (e.g., numbers, text, or dates).

How to create a dataframe#

The simplest way to create a dataframe is by using the pd.DataFrame() function.

From a dict of arrays or lists. This is the easiest and probably the most common way, along with reading from a file:

# Creating a DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 27, 22, 32],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston'],
    'Coffe Lover': [True, False, True, True]
}
df = pd.DataFrame(data)
df

	Name	Age	City	Coffe Lover
0	Alice	24	New York	True
1	Bob	27	Los Angeles	False
2	Charlie	22	Chicago	True
3	David	32	Houston	True

From a list of lists, where each inner list corresponds to one observation:

# Creating a DataFrame from a list of lists
data = [
    ['Alice', 24, 'New York', True],
    ['Bob', 27, 'Los Angeles', False],
    ['Charlie',  22, 'Chicago', True],
    ['David', 32, 'Houston', True]
]
df = pd.DataFrame(data)
df

	0	1	2	3
0	Alice	24	New York	True
1	Bob	27	Los Angeles	False
2	Charlie	22	Chicago	True
3	David	32	Houston	True

As you can see, if we only pass the data, Pandas will automatically assign sequential integers as labels for both axes (rows and columns).

However, we can customize this behavior by specifying our own labels when creating the dataframe:

index = ["obs1","obs2","obs3","obs4"]
columns = ['Name', 'Age', 'City', 'Coffe Lover']

df = pd.DataFrame(data, columns=columns, index=index)
df

	Name	Age	City	Coffe Lover
obs1	Alice	24	New York	True
obs2	Bob	27	Los Angeles	False
obs3	Charlie	22	Chicago	True
obs4	David	32	Houston	True

Alternatively, the labels can be set after the dataframe has been created. Dataframes are objects, so they come with attributes and methods; the index and columns attributes let you retrieve the labels for axis 0 (rows) and axis 1 (columns) and reassign them if needed:

data = [
    ['Alice', 24, 'New York', True],
    ['Bob', 27, 'Los Angeles', False],
    ['Charlie',  22, 'Chicago', True],
    ['David', 32, 'Houston', True]
]
df = pd.DataFrame(data)
print(df)

df.index = index
df.columns = columns

print(df)

         0   1            2      3
0    Alice  24     New York   True
1      Bob  27  Los Angeles  False
2  Charlie  22      Chicago   True
3    David  32      Houston   True
         Name  Age         City  Coffe Lover
obs1    Alice   24     New York         True
obs2      Bob   27  Los Angeles        False
obs3  Charlie   22      Chicago         True
obs4    David   32      Houston         True

From a file using, for example, pd.read_csv:

pd.read_csv.__doc__.split("\n")[:15]

['',
 'Read a comma-separated values (csv) file into DataFrame.',
 '',
 'Also supports optionally iterating or breaking of the file',
 'into chunks.',
 '',
 'Additional help can be found in the online docs for',
 '`IO Tools <https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html>`_.',
 '',
 'Parameters',
 '----------',
 'filepath_or_buffer : str, path object or file-like object',
 '    Any valid string path is acceptable. The string could be a URL. Valid',
 '    URL schemes include http, ftp, s3, gs, and file. For file URLs, a host is',
 '    expected. A local file could be: file://localhost/path/to/table.csv.']

df = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/refs/heads/master/iris.csv")
df

	sepal_length	sepal_width	petal_length	petal_width	species
0	5.1	3.5	1.4	0.2	setosa
1	4.9	3.0	1.4	0.2	setosa
2	4.7	3.2	1.3	0.2	setosa
3	4.6	3.1	1.5	0.2	setosa
4	5.0	3.6	1.4	0.2	setosa
...	...	...	...	...	...
145	6.7	3.0	5.2	2.3	virginica
146	6.3	2.5	5.0	1.9	virginica
147	6.5	3.0	5.2	2.0	virginica
148	6.2	3.4	5.4	2.3	virginica
149	5.9	3.0	5.1	1.8	virginica

150 rows × 5 columns

By default, this function expects a file with comma-separated values (CSV). However, you have the flexibility to read files with different delimiters by specifying the sep parameter:

# Here the columns are separated by tab characters
df = pd.read_csv("https://www4.stat.ncsu.edu/~boos/var.select/diabetes.tab.txt", sep="\t")
df

	AGE	SEX	BMI	BP	S1	S2	S3	S4	S5	S6	Y
0	59	2	32.1	101.00	157	93.2	38.0	4.00	4.8598	87	151
1	48	1	21.6	87.00	183	103.2	70.0	3.00	3.8918	69	75
2	72	2	30.5	93.00	156	93.6	41.0	4.00	4.6728	85	141
3	24	1	25.3	84.00	198	131.4	40.0	5.00	4.8903	89	206
4	50	1	23.0	101.00	192	125.4	52.0	4.00	4.2905	80	135
...	...	...	...	...	...	...	...	...	...	...	...
437	60	2	28.2	112.00	185	113.8	42.0	4.00	4.9836	93	178
438	47	2	24.9	75.00	225	166.0	42.0	5.00	4.4427	102	104
439	60	2	24.9	99.67	162	106.6	43.0	3.77	4.1271	95	132
440	36	1	30.0	95.00	201	125.2	42.0	4.79	5.1299	85	220
441	36	1	19.6	71.00	250	133.2	97.0	3.00	4.5951	92	57

442 rows × 11 columns
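To try the sep parameter without downloading anything, a string can stand in for a file. This is only a sketch: the semicolon-separated text below is made up, and io.StringIO wraps it in a file-like object that pd.read_csv can read:

```python
import io

import pandas as pd

# Made-up, semicolon-separated text standing in for a file on disk
raw_text = "name;age\nAlice;24\nBob;27\n"

df_demo = pd.read_csv(io.StringIO(raw_text), sep=";")
print(df_demo)
```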

An introduction to some attributes and methods#

index, columns: Retrieve the row and column labels.

df.index, df.columns

(RangeIndex(start=0, stop=442, step=1),
 Index(['AGE', 'SEX', 'BMI', 'BP', 'S1', 'S2', 'S3', 'S4', 'S5', 'S6', 'Y'], dtype='object'))

We can also name the axes themselves, not just the individual rows (observations) and columns (features):

df.index.name = 'obs_id'
df.columns.name = 'cols_id'

df

cols_id	AGE	SEX	BMI	BP	S1	S2	S3	S4	S5	S6	Y
obs_id
0	59	2	32.1	101.00	157	93.2	38.0	4.00	4.8598	87	151
1	48	1	21.6	87.00	183	103.2	70.0	3.00	3.8918	69	75
2	72	2	30.5	93.00	156	93.6	41.0	4.00	4.6728	85	141
3	24	1	25.3	84.00	198	131.4	40.0	5.00	4.8903	89	206
4	50	1	23.0	101.00	192	125.4	52.0	4.00	4.2905	80	135
...	...	...	...	...	...	...	...	...	...	...	...
437	60	2	28.2	112.00	185	113.8	42.0	4.00	4.9836	93	178
438	47	2	24.9	75.00	225	166.0	42.0	5.00	4.4427	102	104
439	60	2	24.9	99.67	162	106.6	43.0	3.77	4.1271	95	132
440	36	1	30.0	95.00	201	125.2	42.0	4.79	5.1299	85	220
441	36	1	19.6	71.00	250	133.2	97.0	3.00	4.5951	92	57

442 rows × 11 columns

values: Retrieves the dataframe’s data as a NumPy array:

df.values

array([[ 59.    ,   2.    ,  32.1   , ...,   4.8598,  87.    , 151.    ],
       [ 48.    ,   1.    ,  21.6   , ...,   3.8918,  69.    ,  75.    ],
       [ 72.    ,   2.    ,  30.5   , ...,   4.6728,  85.    , 141.    ],
       ...,
       [ 60.    ,   2.    ,  24.9   , ...,   4.1271,  95.    , 132.    ],
       [ 36.    ,   1.    ,  30.    , ...,   5.1299,  85.    , 220.    ],
       [ 36.    ,   1.    ,  19.6   , ...,   4.5951,  92.    ,  57.    ]])

type(df.values)

numpy.ndarray
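A NumPy array has a single dtype, so it is worth checking what comes out when the columns have different types. The sketch below uses a made-up two-column dataframe and the to_numpy() method, which also returns the data as a NumPy array:

```python
import pandas as pd

# Hypothetical dataframe mixing a text column and a numeric column
df_mixed = pd.DataFrame({"name": ["Alice", "Bob"], "age": [24, 27]})

# The whole table must fit in one dtype, so NumPy falls back to generic objects
arr = df_mixed.to_numpy()
print(arr.dtype)

# A single numeric column keeps its numeric dtype
print(df_mixed["age"].to_numpy().dtype)
```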

copy(): returns an independent copy of the dataframe. Plain assignment, by contrast, does not copy anything: the new variable points to the same object as the original.

df = pd.DataFrame({'x':[0,2,1,5], 'y':[1,1,0,0], 'z':[True,False,False,False]}) 

df_deep    = df.copy()  # deep copy (the default); later changes to df do not affect it
df_shallow = df         # not a copy: a second name for the same object, so changes to df show through

print(hex(id(df)), hex(id(df_deep)), hex(id(df_shallow)))

0x7f442135e590 0x7f4426df3ed0 0x7f442135e590
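The matching identifiers show that plain assignment shares the object. A short sketch of the practical consequence, using a throwaway dataframe and hypothetical variable names:

```python
import pandas as pd

original = pd.DataFrame({"x": [0, 2, 1]})
alias = original               # a second name for the same object
independent = original.copy()  # a separate object with its own data

original.loc[0, "x"] = 99      # modify the original in place

print(alias.loc[0, "x"])        # the alias sees the change
print(independent.loc[0, "x"])  # the copy keeps the old value
```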

dtypes: provides the types of each column:

df.dtypes

x    int64
y    int64
z     bool
dtype: object

info(): prints information about the dataframe including the index dtype and columns, non-null values and memory usage.

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   x       4 non-null      int64
 1   y       4 non-null      int64
 2   z       4 non-null      bool 
dtypes: bool(1), int64(2)
memory usage: 200.0 bytes

rename(): Renames columns or index labels. It can rename one or more fields at once using a dict, which acts as a mapper:

df.rename(columns={'z': 'is_label'}, index={2: "obs3"})

	x	y	is_label
0	0	1	True
1	2	1	False
obs3	1	0	False
3	5	0	False

Note that rename() returns a new dataframe and leaves the original unchanged. To keep the result, reassign it to the variable that stores the dataframe.

df

	x	y	z
0	0	1	True
1	2	1	False
2	1	0	False
3	5	0	False

df = df.rename(columns={'z': 'is_label'}, index={2: "obs3"})

df

	x	y	is_label
0	0	1	True
1	2	1	False
obs3	1	0	False
3	5	0	False
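Besides a dict, rename() also accepts a function as the mapper, which is then applied to every label. A minimal sketch with a small made-up dataframe, using the built-in str.upper as the mapping function:

```python
import pandas as pd

df_small = pd.DataFrame({"x": [0, 2], "y": [1, 1]})

# The function is applied to every column label
df_upper = df_small.rename(columns=str.upper)
print(df_upper.columns.tolist())

# The original dataframe keeps its labels
print(df_small.columns.tolist())
```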

Practice exercises#

Exercise 36

Create a dataframe called dat by passing a dictionary of inputs. Here are the requirements:

has a column named features containing floats
has a column named labels containing integers 0, 1, 2

Print dat.

# Your answers from here

Exercise 37

Rename the labels column in dat to label.

# Your answers from here