R for data science: Tidyverse#

The tidyverse is one of the most popular ecosystems for data science in R. It includes many R packages commonly used in everyday data analysis. The core packages are as follows:

  • ggplot2 for visualization

  • dplyr for data manipulation

  • tidyr to tidy your data

  • readr to read data

  • purrr for functional programming

  • tibble for enhanced data frames

  • stringr for string manipulation

  • forcats for working with categorical data (factors)

  • lubridate for working with dates and times

Loading the tidyverse#

The tidyverse is a collection of packages. In R, to load a library or package, you simply use the library() function. This is similar to using the import keyword in Python.

# install.packages("tidyverse")
library("tidyverse")
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
 dplyr     1.1.4      readr     2.1.5
 forcats   1.0.0      stringr   1.5.1
 ggplot2   3.5.1      tibble    3.2.1
 lubridate 1.9.3      tidyr     1.3.1
 purrr     1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
 dplyr::filter() masks stats::filter()
 dplyr::lag()    masks stats::lag()
 Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

The pipe (%>%) operator#

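One of the most powerful tools in the tidyverse is the pipe (%>%) operator, provided by the magrittr package. It passes the result of the expression on its left as the first argument of the function on its right, so that functions can be chained in a clear, readable way. This is especially useful for data manipulation tasks.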

Here’s an example where we first apply a logical filter, then select specific columns, and finally sort the data:

# Example with the iris dataset
iris.df <- read.csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/refs/heads/master/iris.csv")

iris.df.pipe <- # save the result of the pipeline in this variable
    iris.df %>% # start with the original dataframe
    filter(species == "virginica") %>% # keep the observations corresponding to virginica
    select(petal_length, petal_width) %>% # select the petal_length and petal_width columns
    arrange(desc(petal_length)) # sort in descending order of petal_length

head(iris.df.pipe)
A data.frame: 6 × 2
  petal_length petal_width
  <dbl>        <dbl>
1 6.9          2.3
2 6.7          2.2
3 6.7          2.0
4 6.6          2.1
5 6.4          2.0
6 6.3          1.8
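
To make the mechanics of the pipe more concrete, here is a minimal sketch that compares a nested call with its piped equivalent. It uses a small made-up vector instead of the iris data; the names x, nested and piped are introduced here only for illustration.

```r
library(magrittr)  # provides %>%; it is also attached by library(tidyverse)

x <- c(4, 9, 16)  # toy values, for illustration only

# Nested form: has to be read from the inside out
nested <- round(mean(sqrt(x)), 1)

# Piped form: read from left to right; each result becomes
# the first argument of the next function
piped <- x %>% sqrt() %>% mean() %>% round(1)

identical(nested, piped)
```

Both forms compute the same value; the piped version simply lists the steps in the order in which they are applied.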

Data cleaning#

df.to.clean <- data.frame(x = c(2, NA, 1, 1),
                          y = c(NA, NA, 6, 6))
df.to.clean
A data.frame: 4 × 2
x     y
<dbl> <dbl>
2     NA
NA    NA
1     6
1     6

drop_na: drop missing values#

df.to.clean %>% drop_na()
A data.frame: 2 × 2
x     y
<dbl> <dbl>
1     6
1     6
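
Called without further arguments, drop_na removes every row that has a missing value in any column. As a sketch of a more targeted use, it also accepts column names, in which case only missing values in those columns are considered. The data frame is recreated here so that the example is self-contained; df.x.complete is a name chosen only for this illustration.

```r
library(tidyr)  # drop_na(); tidyr also re-exports %>%

df.to.clean <- data.frame(x = c(2, NA, 1, 1),
                          y = c(NA, NA, 6, 6))

# Drop rows only when x is missing; the first row is kept
# even though its y value is NA
df.x.complete <- df.to.clean %>% drop_na(x)
df.x.complete
```

This can be useful when only some columns are needed for the analysis and rows should not be discarded because of gaps elsewhere.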

replace_na: replace missing values#

df.to.clean %>% replace_na(list(x = 0, y = 0))
A data.frame: 4 × 2
x     y
<dbl> <dbl>
2     0
0     0
1     6
1     6

distinct: drop duplicated data#

df.to.clean %>% distinct()
A data.frame: 3 × 2
x     y
<dbl> <dbl>
2     NA
NA    NA
1     6

Remember that these operations can be chained together with the pipe:

df.to.clean %>% replace_na(list(x=0, y=0)) %>% distinct()
A data.frame: 3 × 2
x     y
<dbl> <dbl>
2     0
0     0
1     6

Data manipulation#

# Let's load the iris dataframe again
iris.df <- read.csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/refs/heads/master/iris.csv")
dim(iris.df)
  1. 150
  2. 5
str(iris.df)
'data.frame':	150 obs. of  5 variables:
 $ sepal_length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ sepal_width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ petal_length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ petal_width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ species     : chr  "setosa" "setosa" "setosa" "setosa" ...

filter: Pick observations by their values#

This function allows you to subset observations based on specific conditions.

iris.df.setosa <- iris.df %>% filter(species == "setosa")

head(iris.df.setosa)
A data.frame: 6 × 5
  sepal_length sepal_width petal_length petal_width species
  <dbl>        <dbl>       <dbl>        <dbl>       <chr>
1 5.1          3.5         1.4          0.2         setosa
2 4.9          3.0         1.4          0.2         setosa
3 4.7          3.2         1.3          0.2         setosa
4 4.6          3.1         1.5          0.2         setosa
5 5.0          3.6         1.4          0.2         setosa
6 5.4          3.9         1.7          0.4         setosa
# Here we filter on two columns; conditions separated by commas are combined with AND
iris.df.setosa.2 <- iris.df %>% filter(species == "setosa", sepal_width > 3)

head(iris.df.setosa.2)
A data.frame: 6 × 5
  sepal_length sepal_width petal_length petal_width species
  <dbl>        <dbl>       <dbl>        <dbl>       <chr>
1 5.1          3.5         1.4          0.2         setosa
2 4.7          3.2         1.3          0.2         setosa
3 4.6          3.1         1.5          0.2         setosa
4 5.0          3.6         1.4          0.2         setosa
5 5.4          3.9         1.7          0.4         setosa
6 4.6          3.4         1.4          0.3         setosa
# Here we filter with two logical conditions on the same column
iris.df.3 <- iris.df %>% filter(sepal_width > 2 & sepal_width <= 3)

head(iris.df.3)
A data.frame: 6 × 5
  sepal_length sepal_width petal_length petal_width species
  <dbl>        <dbl>       <dbl>        <dbl>       <chr>
1 4.9          3.0         1.4          0.2         setosa
2 4.4          2.9         1.4          0.2         setosa
3 4.8          3.0         1.4          0.1         setosa
4 4.3          3.0         1.1          0.1         setosa
5 5.0          3.0         1.6          0.2         setosa
6 4.4          3.0         1.3          0.2         setosa
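
The examples above all combine conditions with AND. As a complementary sketch, conditions can also be combined with OR (|), and the %in% operator tests membership in a set of values. The data frame toy below is made up for illustration and only borrows two column names from iris.df.

```r
library(dplyr)

# Toy data with two of the iris.df column names (values made up for illustration)
toy <- data.frame(
  species     = c("setosa", "versicolor", "virginica", "setosa"),
  sepal_width = c(3.5, 2.8, 3.0, 2.9)
)

# OR condition: keep rows that satisfy at least one of the two tests
toy.or <- toy %>% filter(species == "virginica" | sepal_width > 3)

# %in%: keep rows whose species is any of the listed values
toy.in <- toy %>% filter(species %in% c("setosa", "virginica"))
```

Here toy.or keeps the first and third rows, while toy.in keeps every row except the versicolor one.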

arrange: sort data by value#

This function takes a dataframe, and a set of column names (or more complicated expressions) to order by (in ascending order by default). If more than one column name is provided, each additional column is used to break ties in the values of preceding columns.

# This sorts the data based on sepal_length first and then sepal_width
iris.df.sorted <- iris.df %>% arrange(sepal_length, sepal_width)
head(iris.df.sorted)
A data.frame: 6 × 5
  sepal_length sepal_width petal_length petal_width species
  <dbl>        <dbl>       <dbl>        <dbl>       <chr>
1 4.3          3.0         1.1          0.1         setosa
2 4.4          2.9         1.4          0.2         setosa
3 4.4          3.0         1.3          0.2         setosa
4 4.4          3.2         1.3          0.2         setosa
5 4.5          2.3         1.3          0.3         setosa
6 4.6          3.1         1.5          0.2         setosa
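
The tie-breaking rule is easier to see on a very small example. The following sketch uses a made-up data frame (toy, with hypothetical columns a and b) in which the first sorting column contains repeated values.

```r
library(dplyr)

# Toy data: column a has ties, column b is used to break them
toy <- data.frame(a = c(2, 1, 2, 1),
                  b = c(1, 4, 3, 2))

# Sort by a; rows with equal a are then ordered by b
toy.sorted <- toy %>% arrange(a, b)
toy.sorted
```

The two rows with a equal to 1 come first, ordered by their b values (2, then 4), followed by the two rows with a equal to 2 (b equal to 1, then 3).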

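To sort in descending order, wrap the column in the desc function. Because desc is applied to individual columns, ascending and descending orders can be mixed in a single call (in Python’s Pandas, the equivalent is passing a list to the ascending argument of sort_values):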

# This sorts the data based on sepal_length first, in descending order, and then sepal_width, in ascending order
iris.df.sorted <- iris.df %>% arrange(desc(sepal_length), sepal_width)
head(iris.df.sorted)
A data.frame: 6 × 5
  sepal_length sepal_width petal_length petal_width species
  <dbl>        <dbl>       <dbl>        <dbl>       <chr>
1 7.9          3.8         6.4          2.0         virginica
2 7.7          2.6         6.9          2.3         virginica
3 7.7          2.8         6.7          2.0         virginica
4 7.7          3.0         6.1          2.3         virginica
5 7.7          3.8         6.7          2.2         virginica
6 7.6          3.0         6.6          2.1         virginica

select: select columns#

This function allows you to pick a subset of variables (columns).

# This selects columns by name. Here we select only sepal_length, sepal_width and species
iris.df.select <- iris.df %>% select(sepal_length, sepal_width, species)

head(iris.df.select)
A data.frame: 6 × 3
  sepal_length sepal_width species
  <dbl>        <dbl>       <chr>
1 5.1          3.5         setosa
2 4.9          3.0         setosa
3 4.7          3.2         setosa
4 4.6          3.1         setosa
5 5.0          3.6         setosa
6 5.4          3.9         setosa

To select consecutive columns, you can use the : operator, as with vectors.

# This selects all columns between sepal_length and petal_width (inclusive)
iris.df.select.2 <- iris.df %>% select(sepal_length:petal_width)

head(iris.df.select.2)
A data.frame: 6 × 4
  sepal_length sepal_width petal_length petal_width
  <dbl>        <dbl>       <dbl>        <dbl>
1 5.1          3.5         1.4          0.2
2 4.9          3.0         1.4          0.2
3 4.7          3.2         1.3          0.2
4 4.6          3.1         1.5          0.2
5 5.0          3.6         1.4          0.2
6 5.4          3.9         1.7          0.4

You can also put a minus (-) operator before a column name to exclude that column.

# This selects all columns except sepal_length and sepal_width
iris.df.select.3 <- iris.df %>% select(-sepal_length, -sepal_width)

head(iris.df.select.3)
A data.frame: 6 × 3
  petal_length petal_width species
  <dbl>        <dbl>       <chr>
1 1.4          0.2         setosa
2 1.4          0.2         setosa
3 1.3          0.2         setosa
4 1.5          0.2         setosa
5 1.4          0.2         setosa
6 1.7          0.4         setosa

And you can also remove consecutive columns by combining both operators:

iris.df.select.4 <- iris.df %>% select(-(sepal_length:petal_width))

head(iris.df.select.4)
A data.frame: 6 × 1
  species
  <chr>
1 setosa
2 setosa
3 setosa
4 setosa
5 setosa
6 setosa
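
Besides names, ranges and exclusions, select also accepts helper functions that match columns by patterns in their names. The following is a minimal sketch using starts_with, ends_with and contains on a one-row made-up data frame (toy) that reuses the iris.df column names.

```r
library(dplyr)

# One-row toy data frame with the iris.df column names (values for illustration)
toy <- data.frame(sepal_length = 5.1, sepal_width = 3.5,
                  petal_length = 1.4, petal_width = 0.2,
                  species = "setosa")

toy.sepal <- toy %>% select(starts_with("sepal"))  # sepal_length, sepal_width
toy.width <- toy %>% select(ends_with("width"))    # sepal_width, petal_width
toy.rest  <- toy %>% select(-contains("length"))   # all but the *_length columns
```

These helpers are convenient when a dataset has many columns that share a naming scheme.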

mutate: creating columns#

This function allows you to add new columns to a data frame. It is roughly equivalent to assign in Python’s Pandas.

# This creates a new column by multiplying sepal_length by sepal_width
iris.df.volume <- iris.df %>% mutate(sepal_volume = sepal_length * sepal_width)
head(iris.df.volume)
A data.frame: 6 × 6
  sepal_length sepal_width petal_length petal_width species sepal_volume
  <dbl>        <dbl>       <dbl>        <dbl>       <chr>   <dbl>
1 5.1          3.5         1.4          0.2         setosa  17.85
2 4.9          3.0         1.4          0.2         setosa  14.70
3 4.7          3.2         1.3          0.2         setosa  15.04
4 4.6          3.1         1.5          0.2         setosa  14.26
5 5.0          3.6         1.4          0.2         setosa  18.00
6 5.4          3.9         1.7          0.4         setosa  21.06

A column created earlier in the same mutate call can be used to define a later one.

# This creates a new column by multiplying sepal_length by sepal_width, and a second column with its logarithm
iris.df.volume.2 <- iris.df %>% mutate(sepal_volume = sepal_length * sepal_width, sepal_volume_log = log(sepal_volume))
head(iris.df.volume.2)
A data.frame: 6 × 7
  sepal_length sepal_width petal_length petal_width species sepal_volume sepal_volume_log
  <dbl>        <dbl>       <dbl>        <dbl>       <chr>   <dbl>        <dbl>
1 5.1          3.5         1.4          0.2         setosa  17.85        2.882004
2 4.9          3.0         1.4          0.2         setosa  14.70        2.687847
3 4.7          3.2         1.3          0.2         setosa  15.04        2.710713
4 4.6          3.1         1.5          0.2         setosa  14.26        2.657458
5 5.0          3.6         1.4          0.2         setosa  18.00        2.890372
6 5.4          3.9         1.7          0.4         setosa  21.06        3.047376
# This raises an error because sepal_volume is used before it has been created
iris.df.volume.3 <- iris.df %>% mutate(sepal_volume_log = log(sepal_volume), sepal_volume = sepal_length * sepal_width)
head(iris.df.volume.3)
Error in `mutate()`:
 In argument: `sepal_volume_log = log(sepal_volume)`.
Caused by error:
! object 'sepal_volume' not found
Traceback:

1. mutate(., sepal_volume_log = log(sepal_volume), sepal_volume = sepal_length * 
 .     sepal_width)
2. mutate.data.frame(., sepal_volume_log = log(sepal_volume), sepal_volume = sepal_length * 
 .     sepal_width)
3. mutate_cols(.data, dplyr_quosures(...), by)
4. withCallingHandlers(for (i in seq_along(dots)) {
 .     poke_error_context(dots, i, mask = mask)
 .     context_poke("column", old_current_column)
 .     new_columns <- mutate_col(dots[[i]], data, mask, new_columns)
 . }, error = dplyr_error_handler(dots = dots, mask = mask, bullets = mutate_bullets, 
 .     error_call = error_call, error_class = "dplyr:::mutate_error"), 
 .     warning = dplyr_warning_handler(state = warnings_state, mask = mask, 
 .         error_call = error_call))
5. mutate_col(dots[[i]], data, mask, new_columns)
6. mask$eval_all_mutate(quo)
7. eval()
8. .handleSimpleError(function (cnd) 
 . {
 .     local_error_context(dots, i = frame[[i_sym]], mask = mask)
 .     if (inherits(cnd, "dplyr:::internal_error")) {
 .         parent <- error_cnd(message = bullets(cnd))
 .     }
 .     else {
 .         parent <- cnd
 .     }
 .     message <- c(cnd_bullet_header(action), i = if (has_active_group_context(mask)) cnd_bullet_cur_group_label())
 .     abort(message, class = error_class, parent = parent, call = error_call)
 . }, "object 'sepal_volume' not found", base::quote(NULL))
9. h(simpleError(msg, call))
10. abort(message, class = error_class, parent = parent, call = error_call)
11. signal_abort(cnd, .file)
12. signalCondition(cnd)

group_by and summarize: collapse down to a single summary#

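group_by splits the data frame into groups, and summarize then collapses each group into a single row of summary statistics. This is similar to using the groupby method on Pandas dataframes.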

# This first groups by species and then calculates the average sepal length within each category
iris.df %>% group_by(species) %>% summarize(mean_sepal_length = mean(sepal_length))
A tibble: 3 × 2
species    mean_sepal_length
<chr>      <dbl>
setosa     5.006
versicolor 5.936
virginica  6.588
# You can compute several summaries at once
iris.df %>% group_by(species) %>% summarize(mean_sepal_length = mean(sepal_length),
                                            median_sepal_length = median(sepal_length))
A tibble: 3 × 3
species    mean_sepal_length median_sepal_length
<chr>      <dbl>             <dbl>
setosa     5.006             5.0
versicolor 5.936             5.9
virginica  6.588             6.5

If you want to apply the same aggregation method to several columns, you need to use the across function:

# This first groups by species and then calculates the average across all columns between sepal_length and petal_width
iris.df %>% group_by(species) %>% summarize(across(sepal_length:petal_width, mean))
A tibble: 3 × 5
species    sepal_length sepal_width petal_length petal_width
<chr>      <dbl>        <dbl>       <dbl>        <dbl>
setosa     5.006        3.428       1.462        0.246
versicolor 5.936        2.770       4.260        1.326
virginica  6.588        2.974       5.552        2.026
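
The across function can also apply several functions at once when they are passed as a named list. The sketch below, on a made-up data frame (toy), computes the mean and standard deviation of two columns per group and adds the group size with n(). With a named list, the resulting columns are named by joining the column name and the function name with an underscore.

```r
library(dplyr)

# Toy data with a few of the iris.df column names (values made up for illustration)
toy <- data.frame(
  species      = c("setosa", "setosa", "virginica", "virginica"),
  sepal_length = c(5.0, 5.2, 6.4, 6.8),
  petal_length = c(1.4, 1.6, 5.5, 5.9)
)

toy.summary <- toy %>%
  group_by(species) %>%
  summarize(
    # mean and sd of every column between sepal_length and petal_length
    across(sepal_length:petal_length, list(mean = mean, sd = sd)),
    n = n()  # number of rows in each group
  )
toy.summary
```

The result has one row per species, with columns such as sepal_length_mean and sepal_length_sd.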

pivot_longer: reshape data to long format#

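pivot_longer stacks the selected columns into two new columns: name, which holds the original column names, and value, which holds their values. This is similar to pd.melt in Python’s Pandas.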

iris.df.melt <- iris.df %>% pivot_longer(sepal_length:petal_width)

head(iris.df.melt)
A tibble: 6 × 3
species name         value
<chr>   <chr>        <dbl>
setosa  sepal_length 5.1
setosa  sepal_width  3.5
setosa  petal_length 1.4
setosa  petal_width  0.2
setosa  sepal_length 4.9
setosa  sepal_width  3.0
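
The names of the two new columns can be chosen with the names_to and values_to arguments, and pivot_wider performs the reverse operation. The following sketch uses a made-up data frame (toy) with a hypothetical id column, which is what allows pivot_wider to tell which long rows belong to the same original row; the column names measurement and measured_value are chosen only for this illustration.

```r
library(tidyr)  # pivot_longer(), pivot_wider(); tidyr also re-exports %>%

# Toy data: id identifies each observation (values made up for illustration)
toy <- data.frame(
  id           = c(1, 2),
  sepal_length = c(5.1, 4.9),
  sepal_width  = c(3.5, 3.0)
)

# Long format with explicit names for the two new columns
toy.long <- toy %>%
  pivot_longer(sepal_length:sepal_width,
               names_to = "measurement", values_to = "measured_value")

# pivot_wider() spreads the long data back into one column per measurement
toy.wide <- toy.long %>%
  pivot_wider(names_from = measurement, values_from = measured_value)
```

Here toy.long has four rows (two measurements for each id), and toy.wide recovers the original two-row layout.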

Data visualization#

I don’t want to finish this introduction to doing data science in R without briefly showing ggplot2.

ggplot2 is for data visualization, and personally, I think it is the best thing in R. It allows you to create magnificent, professional plots in an elegant and versatile way.

It is based on the grammar of graphics, a coherent system for describing and building graphs.

To build a ggplot, you need to use the following basic template that can be used for different types of plots:

ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) +  <GEOM_FUNCTION>()

or

ggplot(data = <DATA>) +  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))

In a nutshell:

  • A plot begins with the function ggplot, which creates a coordinate system to which you can add layers.

  • The first argument of ggplot is the dataset to use in the graph.

  • The graph can then be completed by adding one or more layers to ggplot using a geom function.

  • Each geom function in ggplot2 takes a mapping argument. This defines how variables in your dataset are mapped to visual properties. The mapping argument is always paired with the function aes, and the x and y arguments of aes() specify which variables to map to the x- and y-axes.

  • ggplot2 comes with many geom functions, each of which adds a different type of layer to a plot.

iris.df %>% ggplot(aes(x=sepal_length, y=petal_length)) + geom_point()

# or

iris.df %>% ggplot() + geom_point(aes(x=sepal_length, y=petal_length))
[Figure: scatter plot of petal_length against sepal_length; both calls produce the same plot]
iris.df %>% ggplot(aes(x=petal_length)) + geom_histogram(binwidth = 0.1)
[Figure: histogram of petal_length with a bin width of 0.1]

We can assign the plot to a variable and render it any time we like:

g <- iris.df %>% ggplot(aes(x=sepal_length, y=petal_length)) + geom_point()
g
[Figure: scatter plot of petal_length against sepal_length]

And we can keep adding layers to incorporate more graphics to the plot:

g + geom_smooth(method='lm')
`geom_smooth()` using formula = 'y ~ x'
[Figure: the same scatter plot with a linear fit added by geom_smooth]

The full list of available layers (geom_ functions) can be found at https://ggplot2.tidyverse.org/reference.

And as we mentioned, we can also control the aesthetics of the plot through the aes function. Examples of this include:

  • Position (i.e., on the x and y axes)

  • Color (“outside” color)

  • Fill (“inside” color)

  • Shape (of points)

  • Alpha (Transparency)

  • Line type

  • Size

iris.df %>% ggplot(aes(x=sepal_length, y=petal_length, color=species)) + geom_point()
[Figure: scatter plot of petal_length against sepal_length, colored by species]
iris.df %>% ggplot(aes(x=sepal_length, y=petal_length, color=species, size=petal_width)) + geom_point()
[Figure: the same scatter plot, colored by species, with point size mapped to petal_width]
iris.df %>% ggplot(aes(x=petal_length, fill=species)) + geom_histogram(binwidth = 0.1)
[Figure: histogram of petal_length with bars filled by species]
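
One detail that often causes confusion is the difference between mapping an aesthetic to a variable and setting it to a fixed value. The following minimal sketch, on a made-up data frame (toy), shows both: an aesthetic that depends on the data goes inside aes, whereas a constant value is passed directly to the geom function. The names p.mapped and p.fixed are introduced here only for illustration.

```r
library(ggplot2)

# Toy data with a few of the iris.df column names (values made up for illustration)
toy <- data.frame(
  sepal_length = c(5.1, 5.9, 6.5),
  petal_length = c(1.4, 4.2, 5.5),
  species      = c("setosa", "versicolor", "virginica")
)

# Mapping: the color depends on a variable, so it goes inside aes()
p.mapped <- ggplot(toy, aes(x = sepal_length, y = petal_length, color = species)) +
  geom_point()

# Setting: a fixed color and size for all points go outside aes()
p.fixed <- ggplot(toy, aes(x = sepal_length, y = petal_length)) +
  geom_point(color = "blue", size = 3)
```

In the first plot ggplot2 picks one color per species and adds a legend; in the second every point is drawn in blue and no legend is needed.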

I recommend visiting the R graph gallery for a complete list of recipes on creating graphs using ggplot. You’ll have fun!

Practice exercises#

Exercise 61

1- Import the data from “https://vincentarelbundock.github.io/Rdatasets/csv/psych/sat.act.csv”, which contains SAT and ACT scores for a sample of students. Save it to a variable named sat.dat.

2- Convert the “education” and “gender” columns to factor type. (Hint: use the as.factor function)

3- Using the pipe (%>%) operator, perform the following operations in sequence:

  • Filter the data to include only observations where age is between 18 and 45 years.

  • Create a new variable, SAT.avg, representing the average of SATQ and SATV.

  • Select the columns gender, education, and SAT.avg.

  • Group the dataframe by gender and education.

  • Summarize the dataframe by computing the mean and standard deviation of SAT.avg. Since some values are missing, you will need to exclude those observations. You can do this directly in the mean and sd functions; look into their documentation to figure out how.

Save the resulting dataframe in a variable named sat.dat.preprocessed and display it.

4- With the resulting dataframe, create a barplot using geom_bar:

  • Set the x position to each level of education, and the height (y position) to the mean of SAT.avg.

  • To display a separate bar for each gender, set fill = gender in the aesthetics, and use position = "dodge" to place the bars side by side rather than stacked.

Adjust the plot aesthetics to make it more visually appealing. You may find the following page useful for this: https://r-graph-gallery.com/4-barplot-with-error-bar.html

# Your answers from here