Data Structures#
In this lesson, you will learn about the basic data structures in R, which include:
Vector
Matrix
Factors
List
Data Frame
Vectors#
A vector is a sequence of data elements, all of the same basic type.
Initialization#
Using the c function (short for “combine”) is the most basic way to initialize a vector.
# A sequence of the numbers 1, 2, 3, 4
myvector <- c(1, 2, 3, 4)
myvector
- 1
- 2
- 3
- 4
# A vector of logical (boolean) values
c(TRUE, FALSE, TRUE, FALSE, FALSE)
- TRUE
- FALSE
- TRUE
- FALSE
- FALSE
# A vector of characters
c("aa", "bb", "cc", "dd", "ee")
- 'aa'
- 'bb'
- 'cc'
- 'dd'
- 'ee'
Using : to create a sequence of consecutive integers.
s1 <- 2:5
s1
- 2
- 3
- 4
- 5
Using seq(), which functions similarly to Python’s range().
s2 <- seq(from=1, to=5, by=2)
s2
- 1
- 3
- 5
Using the rep() function, which creates a series of repeated values.
s3 <- rep(1, 5)
s3
- 1
- 1
- 1
- 1
- 1
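Both seq() and rep() accept further arguments that are often handy. The following is a small sketch using the standard length.out, times and each arguments; the variable names are only illustrative.

```r
# seq() can take a target length instead of a step size
evenly <- seq(from = 0, to = 1, length.out = 5)
evenly            # 0.00 0.25 0.50 0.75 1.00

# rep() can repeat a whole vector, or each element in turn
whole <- rep(c(1, 2), times = 3)
whole             # 1 2 1 2 1 2
each_one <- rep(c(1, 2), each = 3)
each_one          # 1 1 1 2 2 2
```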
Accessing individual elements#
You can access any individual element of a vector by using [] and the index number corresponding to the element’s position. Note that indexing in R starts at 1.
myvector[1] # this gives the first element
myvector[2] # this gives the second element
myvector[3] # this gives the third element
myvector[4] # this gives the fourth element
# and so on...
Values for out-of-range indexes are reported as NA.
myvector[5]
Note: Missing values in R are denoted by NA (comparable to the way NaN is commonly used in Python).
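To make the behaviour of NA more concrete, here is a minimal sketch with illustrative vectors v and w. It uses is.na() to test for missing values and the na.rm argument of mean() to skip them.

```r
v <- c(1, 2, 3, 4)
v[5]                    # NA: there is no fifth element
is.na(v[5])             # TRUE

# NA propagates through most computations unless told otherwise
w <- c(1, NA, 3)
mean(w)                 # NA
mean(w, na.rm = TRUE)   # 2
```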
Changing a vector#
You can change the values stored in a particular vector element by reassigning it:
myvector[2]<-100
myvector
- 1
- 100
- 3
- 4
Indexing#
By index number
Similar to accessing a single element, but this time we pass a vector of indices for the elements we want to extract. For example:
myvector[c(1,2)]
- 1
- 100
myvector[c(1,4)]
- 1
- 4
Note that elements will be returned according to the order of the vector supplied:
myvector[c(2,1)]
- 100
- 1
myvector[1:4]
- 1
- 100
- 3
- 4
By logical indexing
myvector
myvector[myvector>10]
myvector[myvector == 4]
- 1
- 100
- 3
- 4
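Logical indexing works in two steps: the comparison first produces a vector of TRUE/FALSE values, and that vector then selects the elements to keep. A short sketch, using an illustrative vector v with the same values as myvector at this point:

```r
v <- c(1, 100, 3, 4)

mask <- v > 3        # FALSE TRUE FALSE TRUE
v[mask]              # 100 4
which(mask)          # 2 4 (positions of the TRUE values)

# Conditions can be combined with & (and) and | (or)
v[v > 3 & v < 50]    # 4
```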
Operations with vectors#
You can modify all elements of a vector simultaneously by applying operations such as addition, subtraction, multiplication, and division.
# Add 1
myvector + 1
# Subtract 2
myvector - 2
# Multiply by 10
myvector*10
# Divide by 5
myvector/5
- 2
- 101
- 4
- 5
- -1
- 98
- 1
- 2
- 10
- 1000
- 30
- 40
- 0.2
- 20
- 0.6
- 0.8
You can also perform these operations between vectors:
myvector + (3*myvector)
- 4
- 400
- 12
- 16
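Operations between vectors are carried out element by element, pairing up the values at the same position. A minimal sketch with two illustrative vectors a and b of equal length:

```r
a <- c(1, 2, 3, 4)
b <- c(10, 20, 30, 40)

a + b        # 11 22 33 44
a * b        # 10 40 90 160
sum(a * b)   # 300
```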
Handy functions#
length: it yields the number of elements in the vector.
length(myvector)
lapply: it allows you to apply a certain function to each element of a vector. It returns a list (see below).
# This applies a log function to each element of the vector
lapply(myvector, log)
- 0
- 4.60517018598809
- 1.09861228866811
- 1.38629436111989
sapply: the same as lapply, but it coerces the output to a vector.
# Same as above, but as a vector
sapply(myvector, log)
- 0
- 4.60517018598809
- 1.09861228866811
- 1.38629436111989
# This is the same as using the log function
log(myvector)
- 0
- 4.60517018598809
- 1.09861228866811
- 1.38629436111989
The above result could also be achieved by passing the vector directly to the log function, but lapply and sapply are especially useful for applying more complex functions. For example:
# Let's see the documentation of this function again
?sapply
# Apply a custom function to each index position of the vector (1, 2, 3, 4)
sapply(seq_along(myvector), function(x) sqrt(x^2 + 1))
- 1.4142135623731
- 2.23606797749979
- 3.16227766016838
- 4.12310562561766
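The practical difference between lapply and sapply is the type of the result. The sketch below checks this with class(), using an illustrative vector v and the built-in sqrt function.

```r
v <- c(1, 100, 3, 4)

as_list   <- lapply(v, sqrt)   # one list entry per element
as_vector <- sapply(v, sqrt)   # simplified to a numeric vector

class(as_list)     # "list"
class(as_vector)   # "numeric"

as_list[[2]]       # 10
as_vector[2]       # 10
```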
Matrix#
Matrices are essentially two-dimensional versions of vectors, with rows and columns. Like vectors, all elements in a matrix must be of the same data type.
Most of the rules that apply to vectors also apply to matrices.
Initialization#
A matrix can be initialized with the matrix function. By default, the values are filled in column by column:
# Arrange the above vector's 4 elements into a 2x2 matrix
mymatrix<-matrix(myvector, nrow = 2, ncol=2)
mymatrix
1 | 3 |
100 | 4 |
Another way of creating a matrix is by converting existing data with the as.matrix function. A vector becomes a matrix with a single column:
as.matrix(myvector)
1 |
100 |
3 |
4 |
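Two details are worth seeing in code: the order in which matrix() fills the cells, and how elements are accessed with a [row, column] pair. The following sketch uses an illustrative vector v and the standard byrow argument.

```r
v <- c(1, 100, 3, 4)

by_col <- matrix(v, nrow = 2, ncol = 2)                 # default: fill column by column
by_row <- matrix(v, nrow = 2, ncol = 2, byrow = TRUE)   # fill row by row

by_col[1, 2]   # 3: row 1, column 2
by_row[1, 2]   # 100
by_col[, 1]    # 1 100 (the first column, returned as a vector)

# Arithmetic is element-wise, as with vectors
by_col * 2
```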
Some handy functions#
nrow: it gives the number of rows.
ncol: it gives the number of columns.
dim: it gives both the number of rows and the number of columns.
nrow(mymatrix)
ncol(mymatrix)
dim(mymatrix)
- 2
- 2
dim(as.matrix(myvector))
- 4
- 1
Factors#
Factors are similar to vectors, but each value has a distinct label, making them useful for working with categorical data.
Initialization#
Factors are created using the factor() function.
# This is a group
group.factor <- factor(c(1,1,1,2,2,2,3,3,3))
group.factor
- 1
- 1
- 1
- 2
- 2
- 2
- 3
- 3
- 3
Levels:
- '1'
- '2'
- '3'
We could also have created this variable by coercing the given vector with the as.factor function:
as.factor(c(1,1,1,2,2,2,3,3,3))
- 1
- 1
- 1
- 2
- 2
- 2
- 3
- 3
- 3
Levels:
- '1'
- '2'
- '3'
Many operations that apply to usual vectors also apply to factors, e.g. indexing:
group.factor[1]
Levels:
- '1'
- '2'
- '3'
group.factor[1:4]
- 1
- 1
- 1
- 2
Levels:
- '1'
- '2'
- '3'
However, there are certain operations that you cannot perform on factors because they don’t make sense, e.g. addition, subtraction, multiplication and division:
group.factor + 2
Warning message in Ops.factor(group.factor, 2):
“‘+’ not meaningful for factors”
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
group.factor / 3
Warning message in Ops.factor(group.factor, 3):
“‘/’ not meaningful for factors”
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
- <NA>
Handy functions#
levels: it yields the different levels.
nlevels: it yields the count of levels.
levels(group.factor)
nlevels(group.factor)
- '1'
- '2'
- '3'
With the levels function we can also redefine the labels of the factor:
levels(group.factor) <- c("Label1","Label2", "Label3")
group.factor
- Label1
- Label1
- Label1
- Label2
- Label2
- Label2
- Label3
- Label3
- Label3
Levels:
- 'Label1'
- 'Label2'
- 'Label3'
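Labels can also be set when the factor is created, and a factor can be summarised by counting the observations in each level. The sketch below uses the levels and labels arguments of factor() together with table(); the variable name group and the label names are only illustrative.

```r
group <- factor(c(1, 1, 2, 3, 3, 3),
                levels = c(1, 2, 3),
                labels = c("low", "mid", "high"))

levels(group)       # "low" "mid" "high"
table(group)        # counts per level: low 2, mid 1, high 3

# Internally, each value is stored as the integer position of its level
as.integer(group)   # 1 1 2 3 3 3
```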
Lists#
A list in R is a data structure whose elements can be of different types, such as vectors, functions and even other lists. It is a very important data structure in R.
Initialization#
A list can be initialized with the list function:
n <- c(2, 3, 5)
s <- c("aa", "bb", "cc", "dd", "ee")
b <- c(TRUE, FALSE, TRUE, FALSE, FALSE)
x <- list(n, s, b, 3) # x contains copies of n, s, b
x
-
- 2
- 3
- 5
-
- 'aa'
- 'bb'
- 'cc'
- 'dd'
- 'ee'
-
- TRUE
- FALSE
- TRUE
- FALSE
- FALSE
- 3
The elements of a list can also be named, which makes lists work somewhat like Python’s dictionaries:
Javi <- list( age = 39, married = TRUE, parents = c("Joseba","Edita"))
Javi
- $age
- 39
- $married
- TRUE
- $parents
-
- 'Joseba'
- 'Edita'
As you can see, this list contains three entries of different types (and lengths).
Try it out yourself with practice exercise 1.
Accessing lists#
By index
x[1]
-
- 2
- 3
- 5
x[[1]]
- 2
- 3
- 5
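The two forms above differ in what they return: single brackets give back a (shorter) list, while double brackets give back the element itself. A minimal sketch with an illustrative list x:

```r
x <- list(c(2, 3, 5), c("aa", "bb"), TRUE)

class(x[1])      # "list": a list of length 1 holding the vector
class(x[[1]])    # "numeric": the vector itself

x[[1]][2]        # 3: second value of the first element
length(x[1:2])   # 2: single brackets can select several entries
```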
By label:
Use the $ operator to access an entry by its name:
# This accesses my age
Javi$age
# this tells whether I am married or not
Javi$married
# this shows the names of my parents
Javi$parents
- 'Joseba'
- 'Edita'
Modifying lists#
We can again use the $ operator to edit an entry value, or add a new entry. For example:
# Let's modify my age
Javi$age <- 37.5
# Let's add my height as a new entry (in cm)
Javi$height <- 177
Javi
- $age
- 37.5
- $married
- TRUE
- $parents
-
- 'Joseba'
- 'Edita'
- $height
- 177
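Named entries can also be modified with double brackets, which is useful when the name is stored in a variable, and an entry can be removed by assigning NULL to it. A short sketch with a hypothetical list called person:

```r
person <- list(age = 39, married = TRUE)

person[["age"]] <- 40      # same effect as person$age <- 40

key <- "height"            # the entry name is held in a variable
person[[key]] <- 177

person$married <- NULL     # removes the entry from the list

names(person)              # "age" "height"
```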
Data Frames#
Initialization#
Using the data.frame function.
age <- c(17, 19, 21, 37, 22, 35, 18)
gender<-factor(c("female", "female", "male", "female", "male", "female", "male"))
score <- c(12, 10, 11, 15, 12, 10, 11)
my.dataframe <- data.frame ( age=age, gender=gender, score=score )
my.dataframe
age | gender | score |
---|---|---|
<dbl> | <fct> | <dbl> |
17 | female | 12 |
19 | female | 10 |
21 | male | 11 |
37 | female | 15 |
22 | male | 12 |
35 | female | 10 |
18 | male | 11 |
From an external file using the read.csv function.
iris_df<-read.csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/refs/heads/master/iris.csv")
head(iris_df) # The head function is described below
sepal_length | sepal_width | petal_length | petal_width | species | |
---|---|---|---|---|---|
<dbl> | <dbl> | <dbl> | <dbl> | <chr> | |
1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
6 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |
Access and modify#
As with lists, the $ operator is used to access and modify elements in dataframes.
# This shows the age variable
my.dataframe$age
- 17
- 19
- 21
- 37
- 22
- 35
- 18
# This adds an entry called "var1", with the same value in all its elements
my.dataframe$var1<-1
my.dataframe
age | gender | score | var1 |
---|---|---|---|
<dbl> | <fct> | <dbl> | <dbl> |
17 | female | 12 | 1 |
19 | female | 10 | 1 |
21 | male | 11 | 1 |
37 | female | 15 | 1 |
22 | male | 12 | 1 |
35 | female | 10 | 1 |
18 | male | 11 | 1 |
# This adds an entry called "var2", with different values
my.dataframe$var2<-c(1:7)
my.dataframe
age | gender | score | var1 | var2 |
---|---|---|---|---|
<dbl> | <fct> | <dbl> | <dbl> | <int> |
17 | female | 12 | 1 | 1 |
19 | female | 10 | 1 | 2 |
21 | male | 11 | 1 | 3 |
37 | female | 15 | 1 | 4 |
22 | male | 12 | 1 | 5 |
35 | female | 10 | 1 | 6 |
18 | male | 11 | 1 | 7 |
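Besides $, a dataframe can be indexed with [rows, columns], which combines naturally with the logical indexing seen for vectors. The following is a minimal sketch on a small illustrative dataframe df; the column name passed and the threshold values are assumptions made for the example.

```r
df <- data.frame(age   = c(17, 19, 21, 37),
                 score = c(12, 10, 11, 15))

# Add a column derived from an existing one
df$passed <- df$score >= 11

# Keep only the rows that satisfy a condition (note the trailing comma)
adults <- df[df$age >= 18, ]
adults$age                 # 19 21 37

df[2, "score"]             # 10: row 2 of the score column
df[, c("age", "passed")]   # a dataframe with only these two columns
```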
Handy functions#
head: it displays the first six rows of the dataframe (this number can be changed with the argument n).
head(my.dataframe)
age | gender | score | var1 | var2 | |
---|---|---|---|---|---|
<dbl> | <fct> | <dbl> | <dbl> | <int> | |
1 | 17 | female | 12 | 1 | 1 |
2 | 19 | female | 10 | 1 | 2 |
3 | 21 | male | 11 | 1 | 3 |
4 | 37 | female | 15 | 1 | 4 |
5 | 22 | male | 12 | 1 | 5 |
6 | 35 | female | 10 | 1 | 6 |
head(my.dataframe, n = 2)
age | gender | score | var1 | var2 | |
---|---|---|---|---|---|
<dbl> | <fct> | <dbl> | <dbl> | <int> | |
1 | 17 | female | 12 | 1 | 1 |
2 | 19 | female | 10 | 1 | 2 |
tail: the same as head, but displaying the last six rows.
tail(my.dataframe)
age | gender | score | var1 | var2 | |
---|---|---|---|---|---|
<dbl> | <fct> | <dbl> | <dbl> | <int> | |
2 | 19 | female | 10 | 1 | 2 |
3 | 21 | male | 11 | 1 | 3 |
4 | 37 | female | 15 | 1 | 4 |
5 | 22 | male | 12 | 1 | 5 |
6 | 35 | female | 10 | 1 | 6 |
7 | 18 | male | 11 | 1 | 7 |
tail(my.dataframe, n=2)
age | gender | score | var1 | var2 | |
---|---|---|---|---|---|
<dbl> | <fct> | <dbl> | <dbl> | <int> | |
6 | 35 | female | 10 | 1 | 6 |
7 | 18 | male | 11 | 1 | 7 |
names: it shows the names of all columns in the dataframe.
names(my.dataframe)
- 'age'
- 'gender'
- 'score'
- 'var1'
- 'var2'
str: it displays the structure of the dataframe: the name, type and a preview of the data in each column.
str(my.dataframe)
'data.frame': 7 obs. of 5 variables:
$ age : num 17 19 21 37 22 35 18
$ gender: Factor w/ 2 levels "female","male": 1 1 2 1 2 1 2
$ score : num 12 10 11 15 12 10 11
$ var1 : num 1 1 1 1 1 1 1
$ var2 : int 1 2 3 4 5 6 7
And these are the same functions that we saw before for matrices (you’re free to try them out):
dim: it returns the dimensions of the dataframe (i.e. number of rows and number of columns).
nrow: it returns the number of rows.
ncol: it returns the number of columns.
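As a quick check of these three functions on a dataframe, here is a sketch with a small illustrative dataframe df:

```r
df <- data.frame(age = c(17, 19, 21), score = c(12, 10, 11))

dim(df)    # 3 2 (rows first, then columns)
nrow(df)   # 3
ncol(df)   # 2
```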
Try it out yourself with practice exercise 2!
Practice exercises#
1- Create a list that includes the following elements:
A variable named “name” containing your name.
A variable named “weight” with your weight.
A variable named “films”, which is a vector containing the names of your favorite films.
A variable named “coffee_drinker” set to TRUE or FALSE, depending on whether you like coffee.
# Your answer here
2- Create a dataframe containing information about at least four people you know (e.g., family, friends). This dataframe should include the following columns:
A column for their names.
A column for their age.
A column for their gender/sex (use a factor variable for this).
A column indicating whether you think they like coffee or not.
# Your answer here