Data Structures#

In this lesson, you will learn about the basic data structures in R, which include:

  • Vector

  • Matrix

  • Factors

  • List

  • Data Frame

Vectors#

A vector is a sequence of data elements, all of the same basic type.

Initialization#

  • Using the c function (short for “combine”) is the most basic way to initialize a vector.

# A sequence of the numbers 1, 2, 3, 4
myvector<-c(1,2,3,4)
myvector
  1. 1
  2. 2
  3. 3
  4. 4
# Vector of booleans
c(TRUE, FALSE, TRUE, FALSE, FALSE)
  1. TRUE
  2. FALSE
  3. TRUE
  4. FALSE
  5. FALSE
# A vector of characters
c("aa", "bb", "cc", "dd", "ee")
  1. 'aa'
  2. 'bb'
  3. 'cc'
  4. 'dd'
  5. 'ee'
  • Using : to create a sequence of consecutive integers.

s1 <- 2:5
s1
  1. 2
  2. 3
  3. 4
  4. 5
  • Using seq(), which works much like Python’s range(), except that the end value (to) is included in the result.

s2 <- seq(from=1, to=5, by=2)
s2
  1. 1
  2. 3
  3. 5
  • Using the rep() function, which creates a series of repeated values.

s3 <- rep(1, 5)
s3
  1. 1
  2. 1
  3. 1
  4. 1
  5. 1
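
The initializers above take a few more arguments that are often useful. The following is a minimal sketch rather than an exhaustive tour; the variable names s4, s5 and s6 are only illustrative and do not appear elsewhere in this lesson.

```r
# seq() can fix the number of values instead of the step size
s4 <- seq(from = 0, to = 1, length.out = 5)
s4   # 0.00 0.25 0.50 0.75 1.00

# rep() can repeat a whole vector (times) or each element in turn (each)
s5 <- rep(c("a", "b"), times = 2)   # "a" "b" "a" "b"
s6 <- rep(c("a", "b"), each = 2)    # "a" "a" "b" "b"
```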

Accessing individual elements#

You can access any individual element of a vector by using [] and the index number corresponding to the element’s position.

Remember: In R, indexing starts at 1!
myvector[1] # this gives the first element
myvector[2] # this gives the second element
myvector[3] # this gives the third element
myvector[4] # this gives the fourth element
# and so on...
1
2
3
4

Values for out-of-range indexes are reported as NA.

myvector[5]
<NA>

Note: Missing values in R are denoted by NA (roughly comparable to None or NaN in Python).
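
As a small illustration of how NA behaves in practice, the sketch below uses a hypothetical vector v. It shows how to detect missing values and how they propagate through a computation unless they are explicitly dropped.

```r
v <- c(1, NA, 3)        # hypothetical vector with one missing value

is.na(v)                # FALSE TRUE FALSE
mean(v)                 # NA, because one value is missing
mean(v, na.rm = TRUE)   # 2, the missing value is dropped first

# R also has NaN ("not a number"), produced by undefined arithmetic
is.nan(0 / 0)           # TRUE
```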

Changing a vector#

You can change the values stored in a particular vector element by reassigning it:

myvector[2]<-100
myvector
  1. 1
  2. 100
  3. 3
  4. 4
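
Reassignment is not limited to one element at a time. The sketch below works on a separate, hypothetical vector v so that myvector is left untouched. It changes several positions at once, replaces values selected by a condition, and shows what happens when assigning past the end of a vector.

```r
v <- c(1, 2, 3, 4)        # hypothetical vector

v[c(1, 3)] <- c(10, 30)   # v is now 10 2 30 4
v[v > 5] <- 0             # v is now 0 2 0 4
v[6] <- 9                 # the vector grows; the gap is filled with NA
v                         # 0 2 0 4 NA 9
```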

Indexing#

  • By index number

Similar to accessing a single element, but this time we pass a vector of indices for the elements we want to extract. For example:

myvector[c(1,2)]
  1. 1
  2. 100
myvector[c(1,4)]
  1. 1
  2. 4

Note that elements will be returned according to the order of the vector supplied:

myvector[c(2,1)]
  1. 100
  2. 1
myvector[1:4]
  1. 1
  2. 100
  3. 3
  4. 4
  • By logical indexing

myvector
myvector[myvector>10]
myvector[myvector == 4]
  1. 1
  2. 100
  3. 3
  4. 4
100
4
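
Logical indexing works because a comparison such as myvector>10 itself returns a vector of TRUE/FALSE values, and only the positions marked TRUE are kept. A minimal sketch, recreating myvector with the values it has at this point in the lesson (the name mask is only illustrative):

```r
myvector <- c(1, 100, 3, 4)

mask <- myvector > 10     # FALSE TRUE FALSE FALSE
myvector[mask]            # 100
which(mask)               # 2, the position where the condition holds

# Conditions can be combined with & (and) and | (or)
myvector[myvector > 2 & myvector < 50]   # 3 4
```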

Operations with vectors#

You can modify all elements of a vector simultaneously by applying operations such as addition, subtraction, multiplication, and division.

# Add 1
myvector + 1
# Subtract 2
myvector -2
# Multiply by 10
myvector*10
# Divide by 5
myvector/5
  1. 2
  2. 101
  3. 4
  4. 5
  1. -1
  2. 98
  3. 1
  4. 2
  1. 10
  2. 1000
  3. 30
  4. 40
  1. 0.2
  2. 20
  3. 0.6
  4. 0.8

You can also perform these operations between vectors:

myvector + (3*myvector)
  1. 4
  2. 400
  3. 12
  4. 16
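
Operations between vectors are applied element by element. When the two vectors differ in length, R recycles the shorter one. The sketch below uses two hypothetical vectors, a and b, to show both cases.

```r
a <- c(1, 2, 3, 4)   # hypothetical vectors
b <- c(10, 20)

a * a    # 1 4 9 16, element by element
a + b    # 11 22 13 24, b is recycled as 10 20 10 20
```

Recycling is silent when the longer length is a multiple of the shorter one, as it is here; otherwise R issues a warning.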

Handy functions#

  • length: it yields the number of elements in the vector.

length(myvector)
4
  • lapply: it allows you to apply a given function to each element of a vector. It returns a list (see below).

# This applies a log function to each element of the vector
lapply(myvector, log)
  1. 0
  2. 4.60517018598809
  3. 1.09861228866811
  4. 1.38629436111989
  • sapply: The same as lapply, but it coerces the output to a vector.

# Same as above, but as a vector 
sapply(myvector, log)
  1. 0
  2. 4.60517018598809
  3. 1.09861228866811
  4. 1.38629436111989
# This is the same as using the log function
log(myvector)
  1. 0
  2. 4.60517018598809
  3. 1.09861228866811
  4. 1.38629436111989

Obviously, the above result could also be achieved by directly passing the vector to the log function, but lapply and sapply are especially useful for applying more complex functions. For example:

# Let's see the documentation of this function again
?sapply
sapply(c(1:length(myvector)), function(x) sqrt(x**2 + 1))
  1. 1.4142135623731
  2. 2.23606797749979
  3. 3.16227766016838
  4. 4.12310562561766
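
A function that contains a branch is a case where passing the whole vector directly would not work as intended, so applying it element by element helps. The sketch below uses a hypothetical helper, halve_if_large, together with vapply, a stricter relative of sapply that requires you to declare the type of each result.

```r
myvector <- c(1, 100, 3, 4)

# Hypothetical helper: halve values above 10, leave the rest unchanged
halve_if_large <- function(x) if (x > 10) x / 2 else x

# FUN.VALUE declares that each call must return a single number
res <- vapply(myvector, halve_if_large, FUN.VALUE = numeric(1))
res   # 1 50 3 4
```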

Matrix#

Matrices are essentially two-dimensional versions of vectors, with rows and columns. Like vectors, all elements in a matrix must be of the same data type.

Most of the rules that apply to vectors also apply to matrices.

Initialization#

A matrix can be initialized with the matrix function:

# Arrange the above vector's 4 elements into a 2x2 matrix 
mymatrix<-matrix(myvector, nrow = 2, ncol=2)
mymatrix
A matrix: 2 × 2 of type dbl
  1 3
100 4

Another way of creating a matrix is by converting existing data with the function as.matrix:

as.matrix(myvector)
A matrix: 4 × 1 of type dbl
1
100
3
4
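
By default, matrix fills the values column by column, which is why 1 and 100 end up in the first column of a 2 × 2 matrix built from this vector. The sketch below contrasts this with byrow = TRUE and shows how elements are accessed with [row, column]. The names m_col and m_row are only illustrative.

```r
myvector <- c(1, 100, 3, 4)

m_col <- matrix(myvector, nrow = 2, ncol = 2)                 # filled by column
m_row <- matrix(myvector, nrow = 2, ncol = 2, byrow = TRUE)   # filled by row

m_col[1, 2]   # 3
m_row[1, 2]   # 100

m_col[2, ]    # 100 4, the whole second row
m_col[, 1]    # 1 100, the whole first column
```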

Some handy functions#

  • nrow: it gives the number of rows.

  • ncol: it gives the number of columns.

  • dim: it gives both the number of rows and the number of columns.

nrow(mymatrix)
ncol(mymatrix)
dim(mymatrix)
2
2
  1. 2
  2. 2
dim(as.matrix(myvector))
  1. 4
  2. 1

Factors#

Factors are similar to vectors, but each value has a distinct label, making them useful for working with categorical data.

Initialization#

Factors are created using the factor() function.

# This is a group 
group.factor <- factor(c(1,1,1,2,2,2,3,3,3))
group.factor
  1. 1
  2. 1
  3. 1
  4. 2
  5. 2
  6. 2
  7. 3
  8. 3
  9. 3
Levels:
  1. '1'
  2. '2'
  3. '3'

We could also have created this variable by coercing the given vector with the function as.factor:

as.factor(c(1,1,1,2,2,2,3,3,3))
  1. 1
  2. 1
  3. 1
  4. 2
  5. 2
  6. 2
  7. 3
  8. 3
  9. 3
Levels:
  1. '1'
  2. '2'
  3. '3'
Note: The labels are always characters, regardless of whether the underlying values are numeric, character, boolean, or another type.
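
When the categories have a natural order that differs from alphabetical order, the levels argument of factor() lets you state that order explicitly. A minimal sketch with a hypothetical variable, sizes:

```r
# Hypothetical categorical variable with an explicit level order
sizes <- factor(c("small", "large", "medium", "small"),
                levels = c("small", "medium", "large"))

levels(sizes)       # "small" "medium" "large"
as.integer(sizes)   # 1 3 2 1, the internal integer code of each value
```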

Many operations that apply to usual vectors also apply to factors, e.g. indexing:

group.factor[1]
1
Levels:
  1. '1'
  2. '2'
  3. '3'
group.factor[1:4]
  1. 1
  2. 1
  3. 1
  4. 2
Levels:
  1. '1'
  2. '2'
  3. '3'

However, certain operations cannot be performed on factors because they do not make sense, e.g. addition, subtraction, multiplication and division:

group.factor + 2
Warning message in Ops.factor(group.factor, 2):
“‘+’ not meaningful for factors”
  1. <NA>
  2. <NA>
  3. <NA>
  4. <NA>
  5. <NA>
  6. <NA>
  7. <NA>
  8. <NA>
  9. <NA>
group.factor / 3
Warning message in Ops.factor(group.factor, 3):
“‘/’ not meaningful for factors”
  1. <NA>
  2. <NA>
  3. <NA>
  4. <NA>
  5. <NA>
  6. <NA>
  7. <NA>
  8. <NA>
  9. <NA>
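
If the labels of a factor really are numbers and you do need arithmetic, one common approach is to convert the labels back to numeric values first. The sketch below shows this, along with a pitfall: calling as.numeric directly on a factor returns the internal level codes, not the labels. The variable f is hypothetical.

```r
group.factor <- factor(c(1, 1, 1, 2, 2, 2, 3, 3, 3))

# Go through the character labels to recover the original numbers
group.numeric <- as.numeric(as.character(group.factor))
group.numeric + 2   # 3 3 3 4 4 4 5 5 5

# Pitfall: as.numeric() alone gives the level codes
f <- factor(c(10, 20, 10))      # hypothetical factor
as.numeric(f)                   # 1 2 1
as.numeric(as.character(f))     # 10 20 10
```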

Handy functions#

  • levels: it yields the different levels.

  • nlevels: it yields the count of levels.

levels(group.factor)
nlevels(group.factor)
  1. '1'
  2. '2'
  3. '3'
3

With the levels function we can also redefine the labels of a factor:

levels(group.factor) <- c("Label1","Label2", "Label3")

group.factor
  1. Label1
  2. Label1
  3. Label1
  4. Label2
  5. Label2
  6. Label2
  7. Label3
  8. Label3
  9. Label3
Levels:
  1. 'Label1'
  2. 'Label2'
  3. 'Label3'

Lists#

A list in R is a data structure whose elements can be of different types, such as vectors, functions, and even other lists. It is one of the most important data structures in R.

Important: An R list can play the role of either a Python list or a Python dictionary.

Initialization#

A list can be initialized with the list function:

n <- c(2, 3, 5)
s <- c("aa", "bb", "cc", "dd", "ee")
b <- c(TRUE, FALSE, TRUE, FALSE, FALSE)

x <- list(n, s, b, 3)   # x contains copies of n, s, b
x
    1. 2
    2. 3
    3. 5
    1. 'aa'
    2. 'bb'
    3. 'cc'
    4. 'dd'
    5. 'ee'
    1. TRUE
    2. FALSE
    3. TRUE
    4. FALSE
    5. FALSE
  1. 3

But we can also use them much like Python’s dictionaries, by giving each entry a name:

Javi <- list( age = 39, married = TRUE, parents = c("Joseba","Edita"))
Javi
$age
39
$married
TRUE
$parents
  1. 'Joseba'
  2. 'Edita'

As you can see, this list contains three entries of different types (and sizes).

Try it out yourself with Practice exercise 1

Accessing lists#

  • By index

x[1]
    1. 2
    2. 3
    3. 5
x[[1]]
  1. 2
  2. 3
  3. 5
  • By label:

Using the $ operator to access an entry by its name:

# This accesses my age
Javi$age
# this tells whether I am married or not
Javi$married
# this shows the names of my parents
Javi$parents
39
TRUE
  1. 'Joseba'
  2. 'Edita'
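
The two forms of index access shown above are not equivalent: single brackets return a shorter list, while double brackets return the element itself. The sketch below makes the difference visible with class(), and also shows that named entries can be reached with [[ ]] and a string, which is handy when the name is stored in a variable. The list person and the variable key are hypothetical.

```r
x <- list(c(2, 3, 5), c("aa", "bb"))

class(x[1])      # "list", still a list holding one element
class(x[[1]])    # "numeric", the vector itself
x[[1]][2]        # 3, second value of the first element

person <- list(age = 39, married = TRUE)   # hypothetical named list
person[["age"]]  # 39, same as person$age
key <- "married"
person[[key]]    # TRUE, the name is looked up from a variable
```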

Modifying lists#

We can again use the $ operator to edit an entry value, or add a new entry. For example:

# Let's modify my age
Javi$age<- 37.5
# Let's add my height as a new entry (in cms)
Javi$height<- 177

Javi
$age
37.5
$married
TRUE
$parents
  1. 'Joseba'
  2. 'Edita'
$height
177
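
Entries can also be removed. A minimal sketch on a hypothetical list, person: assigning NULL to an entry deletes it, and names() and length() confirm what is left.

```r
person <- list(age = 39, married = TRUE, height = 177)   # hypothetical list

person$height <- NULL                    # remove the entry
person[["age"]] <- person[["age"]] + 1   # update using the current value

names(person)    # "age" "married"
length(person)   # 2
person$age       # 40
```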

Data Frames#

Initialization#

  • Using the data.frame function.

age <- c(17, 19, 21, 37, 22, 35, 18)
gender<-factor(c("female", "female", "male", "female", "male", "female", "male"))
score <- c(12, 10, 11, 15, 12, 10, 11)

my.dataframe <- data.frame ( age=age, gender=gender, score=score )

my.dataframe
A data.frame: 7 × 3
age    gender  score
<dbl>  <fct>   <dbl>
17     female  12
19     female  10
21     male    11
37     female  15
22     male    12
35     female  10
18     male    11
  • From an external file using the read.csv function.

iris_df<-read.csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/refs/heads/master/iris.csv")
head(iris_df) # See below this head function
A data.frame: 6 × 5
   sepal_length  sepal_width  petal_length  petal_width  species
   <dbl>         <dbl>        <dbl>         <dbl>        <chr>
1  5.1           3.5          1.4           0.2          setosa
2  4.9           3.0          1.4           0.2          setosa
3  4.7           3.2          1.3           0.2          setosa
4  4.6           3.1          1.5           0.2          setosa
5  5.0           3.6          1.4           0.2          setosa
6  5.4           3.9          1.7           0.4          setosa
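
In the output above, read.csv read the species column as plain character data (<chr>). If you would rather get factors directly, read.csv accepts a stringsAsFactors argument. The sketch below avoids the network by first writing a small, made-up table to a temporary file; the file and its contents are purely illustrative.

```r
# Write a small, made-up table to a temporary CSV file
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3,
                     species = c("setosa", "setosa", "virginica")),
          tmp, row.names = FALSE)

# Read it back, converting text columns to factors
small_df <- read.csv(tmp, stringsAsFactors = TRUE)
class(small_df$species)     # "factor"
nlevels(small_df$species)   # 2

unlink(tmp)   # delete the temporary file
```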

Access and modify#

As with lists, the $ operator is used to access and modify elements in dataframes.

# This shows the age variable
my.dataframe$age
  1. 17
  2. 19
  3. 21
  4. 37
  5. 22
  6. 35
  7. 18
# This adds an entry called "var1", with the same value in all its elements 
my.dataframe$var1<-1
my.dataframe
A data.frame: 7 × 4
age    gender  score  var1
<dbl>  <fct>   <dbl>  <dbl>
17     female  12     1
19     female  10     1
21     male    11     1
37     female  15     1
22     male    12     1
35     female  10     1
18     male    11     1
# This adds an entry called "var2", with different values
my.dataframe$var2<-c(1:7)
my.dataframe
A data.frame: 7 × 5
age    gender  score  var1   var2
<dbl>  <fct>   <dbl>  <dbl>  <int>
17     female  12     1      1
19     female  10     1      2
21     male    11     1      3
37     female  15     1      4
22     male    12     1      5
35     female  10     1      6
18     male    11     1      7
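
Besides $, a dataframe can be indexed with [row, column], much like a matrix, and the row part can be a logical condition built from one of its columns. The sketch below rebuilds the dataframe from this section under the illustrative name df, so that it runs on its own.

```r
df <- data.frame(
  age    = c(17, 19, 21, 37, 22, 35, 18),
  gender = factor(c("female", "female", "male", "female", "male", "female", "male")),
  score  = c(12, 10, 11, 15, 12, 10, 11)
)

df[2, "score"]               # 10, a single cell
older <- df[df$age > 30, ]   # rows 4 and 6, all columns
males <- df[df$gender == "male", c("age", "score")]   # selected rows and columns
males$age                    # 21 22 18
```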

Handy functions#

  • head: it displays the first six rows of the dataframe (this number can be changed with the argument n).

head(my.dataframe)
A data.frame: 6 × 5
   age    gender  score  var1   var2
   <dbl>  <fct>   <dbl>  <dbl>  <int>
1  17     female  12     1      1
2  19     female  10     1      2
3  21     male    11     1      3
4  37     female  15     1      4
5  22     male    12     1      5
6  35     female  10     1      6
head(my.dataframe, n = 2)
A data.frame: 2 × 5
   age    gender  score  var1   var2
   <dbl>  <fct>   <dbl>  <dbl>  <int>
1  17     female  12     1      1
2  19     female  10     1      2
  • tail: the same as head, but displaying the last six rows.

tail(my.dataframe)
A data.frame: 6 × 5
   age    gender  score  var1   var2
   <dbl>  <fct>   <dbl>  <dbl>  <int>
2  19     female  10     1      2
3  21     male    11     1      3
4  37     female  15     1      4
5  22     male    12     1      5
6  35     female  10     1      6
7  18     male    11     1      7
tail(my.dataframe, n=2)
A data.frame: 2 × 5
   age    gender  score  var1   var2
   <dbl>  <fct>   <dbl>  <dbl>  <int>
6  35     female  10     1      6
7  18     male    11     1      7
  • names: it shows the names of all columns in the dataframe.

names(my.dataframe)
  1. 'age'
  2. 'gender'
  3. 'score'
  4. 'var1'
  5. 'var2'
  • str: it shows the structure of the data frame: the name, type, and a preview of the data in each column.

str(my.dataframe)
'data.frame':	7 obs. of  5 variables:
 $ age   : num  17 19 21 37 22 35 18
 $ gender: Factor w/ 2 levels "female","male": 1 1 2 1 2 1 2
 $ score : num  12 10 11 15 12 10 11
 $ var1  : num  1 1 1 1 1 1 1
 $ var2  : int  1 2 3 4 5 6 7

And these are the same functions that we saw before for matrices (you’re free to try them out):

  • dim: it returns the dimensions of the data frame (i.e. number of rows and number of columns)

  • nrow: it returns the number of rows

  • ncol: it returns the number of columns
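
A quick sketch of these three functions on a small, made-up dataframe (the name small and its values are only illustrative):

```r
small <- data.frame(age = c(17, 19, 21), score = c(12, 10, 11))

dim(small)    # 3 2, rows first and then columns
nrow(small)   # 3
ncol(small)   # 2
```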

Try it out yourself with practice exercise 2!

Practice exercises#

Exercise 53

1- Create a list that includes the following elements:

  • A variable named “name” containing your name.

  • A variable named “weight” with your weight.

  • A variable named “films”, which is a vector containing the names of your favorite films.

  • A variable named “coffee_drinker” set to TRUE or FALSE, depending on whether you like coffee.

# Your answer here

Exercise 54

2- Create a dataframe containing information about at least four people you know (e.g., family, friends). This dataframe should include the following columns:

  • A column for their names.

  • A column for their age.

  • A column for their gender/sex (use a factor variable for this).

  • A column indicating whether you think they like coffee or not.

# Your answer here