Data Structures

In this lesson, you will learn about the basic data structures in R, which include:

Vectors
Matrices
Factors
Lists
Data Frames

Vectors#

A vector is a sequence of data elements, all of the same basic type.

Initialization#

Using the c function (short for “combine”) is the most basic way to initialize a vector.

# A sequence of the numbers 1, 2, 3, 4
myvector<-c(1,2,3,4)
myvector

1
2
3
4

# Vector of booleans
c(TRUE, FALSE, TRUE, FALSE, FALSE)

TRUE
FALSE
TRUE
FALSE
FALSE

# A vector of characters
c("aa", "bb", "cc", "dd", "ee")

'aa'
'bb'
'cc'
'dd'
'ee'
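As a small sketch of what "all of the same basic type" means in practice, the snippet below mixes types inside c(). The variable names mixed and num_logical are made up for this illustration. R does not raise an error here; it silently converts the elements to a common type.

```r
# Mixing a number, a string and a logical: everything becomes character
mixed <- c(1, "a", TRUE)
mixed          # "1" "a" "TRUE"
class(mixed)   # "character"

# Mixing numbers and logicals: TRUE/FALSE become 1/0
num_logical <- c(1, TRUE, FALSE)
num_logical          # 1 1 0
class(num_logical)   # "numeric"
```

It can be worth checking class() whenever a vector behaves unexpectedly, since a single stray string turns a whole numeric vector into characters.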

Using : to create a sequence of consecutive integers.

s1 <- 2:5
s1

2
3
4
5

Using seq(), which works much like Python’s range(), except that the end value (to) is included when the step lands on it.

s2 <- seq(from=1, to=5, by=2)
s2

1
3
5
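A few more ways of calling seq() may help here. This is only a sketch, and the variable names (s_by, s_len, s_n) are chosen for this example.

```r
# The end value is included when the step lands exactly on it
s_by <- seq(from = 1, to = 10, by = 3)
s_by    # 1 4 7 10

# Ask for a number of equally spaced points instead of a step size
s_len <- seq(from = 0, to = 1, length.out = 5)
s_len   # 0.00 0.25 0.50 0.75 1.00

# seq_len(n) is a shortcut for the integers 1, 2, ..., n
s_n <- seq_len(4)
s_n     # 1 2 3 4
```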

Using the rep() function, which creates a series of repeated values.

s3 <- rep(1, 5)
s3

1
1
1
1
1
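rep() can also repeat a whole vector, not just a single value. The sketch below contrasts its times and each arguments; the variable names are illustrative only.

```r
# times: repeat the whole vector three times
r_times <- rep(c(1, 2), times = 3)
r_times   # 1 2 1 2 1 2

# each: repeat every element three times before moving on
r_each <- rep(c(1, 2), each = 3)
r_each    # 1 1 1 2 2 2
```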

Accessing individual elements#

You can access any individual element of a vector by using [] and the index number corresponding to the element’s position.

Remember: In R, indexing starts at 1!

myvector[1] # this gives the first element
myvector[2] # this gives the second element
myvector[3] # this gives the third element
myvector[4] # this gives the fourth element
# and so on...

1

2

3

4

Values for out-of-range indexes are reported as NA.

myvector[5]

<NA>

Note: Missing values in R are denoted by NA (roughly comparable to None in Python, or NaN in NumPy and pandas).
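The following sketch shows a few common ways of dealing with NA. It uses a separate vector v (a name chosen for this example) so that myvector is left untouched.

```r
v <- c(10, 20, 30)

# An out-of-range index gives NA, which is.na() can detect
out_of_range <- v[5]
is.na(out_of_range)   # TRUE

# Many functions return NA if any input is NA...
with_na <- c(1, NA, 3)
mean(with_na)                     # NA

# ...unless they are told to drop missing values first
mean_no_na <- mean(with_na, na.rm = TRUE)
mean_no_na                        # 2
```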

Changing a vector#

You can change the values stored in a particular vector element by reassigning it:

myvector[2]<-100
myvector

1
100
3
4

Indexing#

By index number

Similar to accessing a single element, but this time we pass a vector of indices for the elements we want to extract. For example:

myvector[c(1,2)]

1
100

myvector[c(1,4)]

1
4

Note that elements will be returned according to the order of the vector supplied:

myvector[c(2,1)]

100
1

myvector[1:4]

1
100
3
4

By logical indexing

myvector
myvector[myvector>10]
myvector[myvector == 4]

1
100
3
4

100

4
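Logical indexing works because the comparison itself returns a vector of TRUE/FALSE values, and only the positions marked TRUE are kept. A minimal sketch, using a separate vector v with the same values as myvector at this point:

```r
v <- c(1, 100, 3, 4)

# The comparison produces one TRUE/FALSE per element
mask <- v > 2
mask                    # FALSE TRUE TRUE TRUE

# Using the mask as an index keeps the TRUE positions
kept <- v[mask]
kept                    # 100 3 4

# Conditions can be combined with & (and) and | (or)
both <- v[v > 2 & v < 50]
both                    # 3 4

# which() returns the positions rather than the values
positions <- which(v > 2)
positions               # 2 3 4
```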

Operations with vectors#

You can modify all elements of a vector simultaneously by applying operations such as addition, subtraction, multiplication, and division.

# Add 1
myvector + 1
# Subtract 2
myvector -2
# Multiply by 10
myvector*10
# Divide by 5
myvector/5

2
101
4
5

-1
98
1
2

10
1000
30
40

0.2
20
0.6
0.8

You can also perform these operations between vectors:

myvector + (3*myvector)

4
400
12
16
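Operations between vectors are applied element by element. When the vectors differ in length, R repeats ("recycles") the shorter one, which is easy to overlook. A small sketch with illustrative names a and b:

```r
a <- c(1, 2, 3, 4)
b <- c(10, 20)

# Same length: element-by-element product
prod_same <- a * a
prod_same    # 1 4 9 16

# Different lengths: b is recycled as 10 20 10 20
sum_recycled <- a + b
sum_recycled # 11 22 13 24
```

If the longer length is not a multiple of the shorter one, R still recycles but issues a warning, so it is usually safer to make the lengths match explicitly.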

Handy functions#

length: it yields the number of elements in the vector.

length(myvector)

4

lapply: It allows you to apply a given function to each element of a vector. It returns a list (see below).

# This applies a log function to each element of the vector
lapply(myvector, log)

0
4.60517018598809
1.09861228866811
1.38629436111989

sapply: The same as lapply, but it coerces the output to a vector.

# Same as above, but as a vector 
sapply(myvector, log)

0
4.60517018598809
1.09861228866811
1.38629436111989

# This is the same as using the log function
log(myvector)

0
4.60517018598809
1.09861228866811
1.38629436111989

Obviously, the above result could also be achieved by directly passing the vector to the log function, but lapply and sapply are especially useful for applying more complex functions. For example:

# Let's see the documentation of this function again
?sapply

sapply(seq_along(myvector), function(x) sqrt(x^2 + 1))

1.4142135623731
2.23606797749979
3.16227766016838
4.12310562561766
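To make the difference in return types concrete, here is a short sketch that checks what lapply and sapply give back. It also shows vapply, a base R relative of sapply in which you declare the expected type of each result. The vector v and the result names are chosen for this example.

```r
v <- c(1, 100, 3, 4)

res_list <- lapply(v, log)   # a list with one entry per element
res_vec  <- sapply(v, log)   # simplified to a numeric vector

is.list(res_list)      # TRUE
is.numeric(res_vec)    # TRUE

# vapply: like sapply, but each result must be a single number here
res_checked <- vapply(v, function(x) sqrt(x^2 + 1), numeric(1))
length(res_checked)    # 4
```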

Matrix#

Matrices are essentially two-dimensional versions of vectors, with rows and columns. Like vectors, all elements in a matrix must be of the same data type.

Most of the rules that apply to vectors also apply to matrices.

Initialization#

A matrix can be initialized with the matrix function:

# Arrange the above vector's 4 elements into a 2x2 matrix 
mymatrix<-matrix(myvector, nrow = 2, ncol=2)
mymatrix

A matrix: 2 × 2 of type dbl
1	3
100	4
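Note in the output above that the values fill the matrix column by column. The sketch below compares this default with the byrow = TRUE argument of matrix, and shows how to pick out elements with [row, column]. The names v, m_col and m_row are illustrative.

```r
v <- c(1, 100, 3, 4)

# Default: fill column by column
m_col <- matrix(v, nrow = 2, ncol = 2)
# byrow = TRUE: fill row by row instead
m_row <- matrix(v, nrow = 2, ncol = 2, byrow = TRUE)

m_col[1, 2]   # 3   (row 1, column 2)
m_row[1, 2]   # 100
m_col[2, ]    # 100 4 (the whole second row)
```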

Another way of creating a matrix is by converting data to a matrix using the function as.matrix:

as.matrix(myvector)

A matrix: 4 × 1 of type dbl
1
100
3
4

Some handy functions#

nrow: it gives the number of rows.
ncol: it gives the number of columns.
dim: it gives both the number of rows and the number of columns.

nrow(mymatrix)
ncol(mymatrix)
dim(mymatrix)

2

2
2

dim(as.matrix(myvector))

4
1

Factors#

Factors are similar to vectors, but each value has a distinct label, making them useful for working with categorical data.

Initialization#

Factors are created using the factor() function.

# This is a group 
group.factor <- factor(c(1,1,1,2,2,2,3,3,3))
group.factor

1
1
1
2
2
2
3
3
3

Levels:

'1'
'2'
'3'

We could also have created this variable by coercing the given vector with the function as.factor:

as.factor(c(1,1,1,2,2,2,3,3,3))

1
1
1
2
2
2
3
3
3

Levels:

'1'
'2'
'3'

Note: The labels are always characters, regardless of whether the underlying values are numeric, character, boolean, or another type.
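One practical consequence of the labels being characters shows up when converting a factor back to numbers. In the sketch below (the names f, codes and values are made up for this example), converting directly gives the internal level codes, not the original values.

```r
f <- factor(c(10, 20, 20, 30))

# Direct conversion returns the position of each level: 1 2 2 3
codes <- as.integer(f)
codes

# Going through the character labels recovers the original numbers
values <- as.numeric(as.character(f))
values   # 10 20 20 30
```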

Many operations that apply to usual vectors also apply to factors, e.g. indexing:

group.factor[1]

1

Levels:

'1'
'2'
'3'

group.factor[1:4]

1
1
1
2

Levels:

'1'
'2'
'3'

However, there are certain operations that you cannot perform on factors because they don’t make sense, e.g. addition, subtraction, multiplication and division:

group.factor + 2

Warning message in Ops.factor(group.factor, 2):
“‘+’ not meaningful for factors”

<NA>
<NA>
<NA>
<NA>
<NA>
<NA>
<NA>
<NA>
<NA>

group.factor / 3

Warning message in Ops.factor(group.factor, 3):
“‘/’ not meaningful for factors”

<NA>
<NA>
<NA>
<NA>
<NA>
<NA>
<NA>
<NA>
<NA>

Handy functions#

levels: it yields the different levels.
nlevels: it yields the count of levels.

levels(group.factor)
nlevels(group.factor)

'1'
'2'
'3'

3

With the levels function we can also redefine the labels of the factor:

levels(group.factor) <- c("Label1","Label2", "Label3")

group.factor

Label1
Label1
Label1
Label2
Label2
Label2
Label3
Label3
Label3

Levels:

'Label1'
'Label2'
'Label3'
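Labels can also be set at creation time through the levels and labels arguments of factor(). The sketch below does that and then counts the observations per level with table(). The data and the label names ("low", "mid", "high") are invented for this example.

```r
g <- factor(c(1, 1, 2, 3),
            levels = c(1, 2, 3),
            labels = c("low", "mid", "high"))
g            # low low mid high

# table() counts how many observations fall in each level
counts <- table(g)
counts       # low: 2, mid: 1, high: 1
```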

Lists#

A list in R is a data structure that can hold elements of many different types, such as vectors, functions and even other lists. It is a very important data structure in R.

Important: Compared to Python, R lists can work as either lists or dictionaries.

Initialization#

A list can be initialized with the list function:

n <- c(2, 3, 5)
s <- c("aa", "bb", "cc", "dd", "ee")
b <- c(TRUE, FALSE, TRUE, FALSE, FALSE)

x <- list(n, s, b, 3)   # x contains copies of n, s, b, plus the number 3
x

1. 2
2. 3
3. 5
1. 'aa'
2. 'bb'
3. 'cc'
4. 'dd'
5. 'ee'
1. TRUE
2. FALSE
3. TRUE
4. FALSE
5. FALSE
3

But we can also use them much like Python’s dictionaries:

Javi <- list( age = 39, married = TRUE, parents = c("Joseba","Edita"))
Javi

$age

39

$married

TRUE

$parents

'Joseba'
'Edita'

As you can see, this list contains three entries of different types (and sizes).

Try it out yourself with Practice exercise 1

Accessing lists#

By index

x[1]

1. 2
2. 3
3. 5

x[[1]]

2
3
5
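The two calls above look alike but return different things: single brackets give back a (shorter) list, while double brackets give back the element itself. A minimal sketch, using a separate list called items so that x is left unchanged:

```r
items <- list(c(2, 3, 5), c("aa", "bb"))

sub_list <- items[1]     # still a list, holding one element
element  <- items[[1]]   # the numeric vector itself

class(sub_list)   # "list"
class(element)    # "numeric"

# Only the extracted element can be indexed like a vector
second_value <- items[[1]][2]
second_value      # 3
```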

By label

Using the $ operator to access an entry by its name:

# This accesses my age
Javi$age
# this tells whether I am married or not
Javi$married
# this shows the names of my parents
Javi$parents

39

TRUE

'Joseba'
'Edita'
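Besides $, named entries can be reached with double brackets and a string, which is handy when the name is stored in a variable. The sketch below uses a hypothetical list called person rather than the one defined above.

```r
person <- list(age = 39, married = TRUE)

# Double brackets accept the name as a string
person[["married"]]    # TRUE

# ...including a string held in a variable
key <- "age"
age_value <- person[[key]]
age_value              # 39

# names() lists the available labels, much like dictionary keys
entry_names <- names(person)
entry_names            # "age" "married"
"height" %in% entry_names   # FALSE
```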

Modifying lists#

We can again use the $ operator to edit an entry value, or add a new entry. For example:

# Let's modify my age
Javi$age<- 37.5
# Let's add my height as a new entry (in cm)
Javi$height<- 177

Javi

$age

37.5

$married

TRUE

$parents

'Joseba'
'Edita'

$height

177
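Entries can be removed as well as added: assigning NULL to an entry deletes it from the list. A short sketch with a hypothetical list named person:

```r
person <- list(age = 39, height = 177)

# Assigning NULL removes the entry entirely
person$height <- NULL

remaining <- names(person)
remaining        # "age"
length(person)   # 1
```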

Data Frames#

Initialization#

Using the data.frame function.

age <- c(17, 19, 21, 37, 22, 35, 18)
gender<-factor(c("female", "female", "male", "female", "male", "female", "male"))
score <- c(12, 10, 11, 15, 12, 10, 11)

my.dataframe <- data.frame ( age=age, gender=gender, score=score )

my.dataframe

A data.frame: 7 × 3
age	gender	score
<dbl>	<fct>	<dbl>
17	female	12
19	female	10
21	male	11
37	female	15
22	male	12
35	female	10
18	male	11

From an external file using the read.csv function.

iris_df<-read.csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/refs/heads/master/iris.csv")
head(iris_df) # The head function is described below

A data.frame: 6 × 5
	sepal_length	sepal_width	petal_length	petal_width	species
	<dbl>	<dbl>	<dbl>	<dbl>	<chr>
1	5.1	3.5	1.4	0.2	setosa
2	4.9	3.0	1.4	0.2	setosa
3	4.7	3.2	1.3	0.2	setosa
4	4.6	3.1	1.5	0.2	setosa
5	5.0	3.6	1.4	0.2	setosa
6	5.4	3.9	1.7	0.4	setosa
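Notice that the species column above was read as character (<chr>), not as a factor. As a sketch of how to request factors instead, the example below reads a tiny made-up CSV from a string through the text argument of read.csv (so that it runs without a network connection) and sets stringsAsFactors = TRUE. The data and the name small_df are invented for this illustration.

```r
csv_text <- "name,score
ana,12
ben,10"

# text = reads from a string instead of a file or URL
small_df <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(small_df$name)    # "factor"
class(small_df$score)   # "integer"
nrow(small_df)          # 2
```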

Access and modify#

As with lists, the $ operator is used to access and modify elements in dataframes.

# This shows the age variable
my.dataframe$age

17
19
21
37
22
35
18

# This adds a column called "var1", with the same value in all its rows
my.dataframe$var1<-1
my.dataframe

A data.frame: 7 × 4
age	gender	score	var1
<dbl>	<fct>	<dbl>	<dbl>
17	female	12	1
19	female	10	1
21	male	11	1
37	female	15	1
22	male	12	1
35	female	10	1
18	male	11	1

# This adds a column called "var2", with different values
my.dataframe$var2<-c(1:7)
my.dataframe

A data.frame: 7 × 5
age	gender	score	var1	var2
<dbl>	<fct>	<dbl>	<dbl>	<int>
17	female	12	1	1
19	female	10	1	2
21	male	11	1	3
37	female	15	1	4
22	male	12	1	5
35	female	10	1	6
18	male	11	1	7
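Data frames can also be indexed with [rows, columns], in the same spirit as matrices, and rows can be filtered with a logical condition as with vectors. A minimal sketch on a small made-up data frame (small_scores is a name chosen for this example):

```r
small_scores <- data.frame(age = c(17, 19, 21), score = c(12, 10, 11))

# A single cell: row 2 of the "age" column
cell <- small_scores[2, "age"]
cell              # 19

# Keep only the rows where age is above 18 (note the trailing comma)
adults <- small_scores[small_scores$age > 18, ]
nrow(adults)      # 2

# Double brackets with a name return the column as a vector, like $
score_col <- small_scores[["score"]]
score_col         # 12 10 11
```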

Handy functions#

head: it displays the first six rows of the dataframe (this number can be changed with the argument n).

head(my.dataframe)

A data.frame: 6 × 5
	age	gender	score	var1	var2
	<dbl>	<fct>	<dbl>	<dbl>	<int>
1	17	female	12	1	1
2	19	female	10	1	2
3	21	male	11	1	3
4	37	female	15	1	4
5	22	male	12	1	5
6	35	female	10	1	6

head(my.dataframe, n = 2)

A data.frame: 2 × 5
	age	gender	score	var1	var2
	<dbl>	<fct>	<dbl>	<dbl>	<int>
1	17	female	12	1	1
2	19	female	10	1	2

tail: the same as head, but displaying the last six rows.

tail(my.dataframe)

A data.frame: 6 × 5
	age	gender	score	var1	var2
	<dbl>	<fct>	<dbl>	<dbl>	<int>
2	19	female	10	1	2
3	21	male	11	1	3
4	37	female	15	1	4
5	22	male	12	1	5
6	35	female	10	1	6
7	18	male	11	1	7

tail(my.dataframe, n=2)

A data.frame: 2 × 5
	age	gender	score	var1	var2
	<dbl>	<fct>	<dbl>	<dbl>	<int>
6	35	female	10	1	6
7	18	male	11	1	7

names: it shows the names of all columns in the dataframe.

names(my.dataframe)

'age'
'gender'
'score'
'var1'
'var2'

str: it displays the structure of the data frame: the name, type and a preview of the data in each column.

str(my.dataframe)

'data.frame':	7 obs. of  5 variables:
 $ age   : num  17 19 21 37 22 35 18
 $ gender: Factor w/ 2 levels "female","male": 1 1 2 1 2 1 2
 $ score : num  12 10 11 15 12 10 11
 $ var1  : num  1 1 1 1 1 1 1
 $ var2  : int  1 2 3 4 5 6 7

These are the same functions that we saw before for matrices (feel free to try them out):

dim: it returns the dimensions of the data frame (i.e. number of rows and number of columns)
nrow: it returns the number of rows
ncol: it returns the number of columns
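For instance, on a small made-up data frame (named small_scores here purely for illustration), these functions could be tried as follows:

```r
small_scores <- data.frame(age = c(17, 19, 21), score = c(12, 10, 11))

d <- dim(small_scores)
d                     # 3 2  (rows, then columns)
nrow(small_scores)    # 3
ncol(small_scores)    # 2
```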

Try it out yourself with practice exercise 2!

Practice exercises#

Exercise 53

1- Create a list that includes the following elements:

A variable named “name” containing your name.
A variable named “weight” with your weight.
A variable named “films”, which is a vector containing the names of your favorite films.
A variable named “coffee_drinker” set to TRUE or FALSE, depending on whether you like coffee.

# Your answer here

Exercise 54

2- Create a dataframe containing information about at least four people you know (e.g., family, friends). This dataframe should include the following columns:

A column for their names.
A column for their age.
A column for their gender/sex (use a factor variable for this).
A column indicating whether you think they like coffee or not.

# Your answer here
