Overview
Teaching: 30 min
Exercises: 15 minQuestions
How can I work with subsets of data in R?
Objectives
To be able to subset vectors, factors, and data frames
To be able to extract individual and multiple elements: by index, by name, using comparison operations
To be able to skip and remove elements from various data structures.
Much of R’s power comes from it’s vectorization. R has many powerful subset operators and mastering them will allow you to easily perform complex operations on any kind of dataset without the resource depletion of loops.
There are a few different ways we can subset any kind of object, and different subsetting operators for the different data structures.
Let’s start with a data structure we’ve seen before, the workhorse of R: atomic vectors.
x <- c(5.4, 6.2, 7.1, 4.8, 7.5)
We can name elements within our vectors using the names
function. Here, we have named each element
with a different letter of the alphabet.
names(x) <- c('a', 'b', 'c', 'd', 'e')
x
a b c d e
5.4 6.2 7.1 4.8 7.5
So now that we’ve created a dummy vector to play with, how do we get at its contents?
To extract elements of a vector we can give their corresponding index, starting from one:
x[1]
a
5.4
x[4]
d
4.8
It may look different, but the square brackets operator is a function. For atomic vectors (and matrices), it means “get me the nth element”.
We can ask for multiple elements at once:
x[c(1, 3)]
a c
5.4 7.1
Or slices of the vector:
x[1:4]
a b c d
5.4 6.2 7.1 4.8
Remember that the :
operator creates a sequence of numbers from the left element to the right. Using it inside the square
brackets lets us select a range of elements.
We can ask for the same element multiple times:
x[c(1,1,3)]
a a c
5.4 5.4 7.1
If we ask for a number outside of the vector, R will return missing values:
x[6]
<NA>
NA
This is a vector of length one containing an NA
, whose name is also NA
.
If we ask for the 0th element, we get an empty vector:
x[0]
named numeric(0)
Vector numbering in R starts at 1
In many programming languages (C and python, for example), the first element of a vector has an index of 0. In R, the first element is 1.
If we use a negative number as the index of a vector, R will return every element except for the one specified:
x[-2]
a c d e
5.4 7.1 4.8 7.5
We can skip multiple elements:
x[c(-1, -5)] # or x[-c(1,5)]
b c d
6.2 7.1 4.8
Tip: Order of operations
A common trip up for novices occurs when trying to skip slices of a vector. Most people first try to negate a sequence like so:
x[-1:3]
This gives a somewhat cryptic error:
Error in x[-1:3]: only 0's may be mixed with negative subscripts
But remember the order of operations.
:
is really a function, so what happens is it takes its first argument as -1, and second as 3, so generates the sequence of numbers:c(-1, 0, 1, 2, 3)
.The correct solution is to wrap that function call in brackets, so that the
-
operator applies to the results:x[-(1:3)]
d e 4.8 7.5
To remove elements from a vector, we need to assign the results back into the variable:
x <- x[-4]
x
a b c e
5.4 6.2 7.1 7.5
We can extract elements by using their name, instead of index:
x[c("a", "c")]
a c
5.4 7.1
This is usually a much more reliable way to subset objects: the position of various elements can often change when chaining together subsetting operations, but the names will always remain the same!
Unfortunately we can’t skip or remove elements so easily when we extract them by their name.
To skip (or remove) a single named element:
x[-which(names(x) == "b")]
a c d e
5.4 7.1 4.8 7.5
The which
function returns the indices of all TRUE
elements of its argument.
Remember that expressions evaluate before being passed to functions. Let’s break
this down so that its clearer what’s happening.
First this happens:
names(x) == "b"
[1] FALSE TRUE FALSE FALSE FALSE
The condition operator is applied to every name of the vector x
. Only the
first name is “a” so that element is TRUE.
which
then converts this to an index:
which(names(x) == "b")
[1] 2
Only the first element is TRUE
, so which
returns 1. Now that we have indices
the skipping works because we have a negative index!
Skipping multiple named indices is similar, but uses a different comparison operator:
x[-which(names(x) %in% c("a", "c"))]
b d e
6.2 4.8 7.5
The %in%
goes through each element of its left argument, in this case the
names of x
, and asks, “Does this element occur in the second argument?”.
Tip: Getting help for operators
Remember you can search for help on operators by wrapping them in quotes:
help("%in%")
or?"%in%"
.
So why can’t we use ==
like before? That’s an excellent question.
Let’s take a look at the comparison component of this code:
names(x) == c('a', 'c')
Warning in names(x) == c("a", "c"): longer object length is not a multiple
of shorter object length
[1] TRUE FALSE TRUE
Obviously “c” is in the names of x
, so why didn’t this work? ==
works
slightly differently than %in%
. It will compare each element of its left argument
to the corresponding element of its right argument.
Here’s a mock illustration:
c("a", "b", "c", "d", "e") # names of x
| | | | | # The elements == is comparing
c("a", "c")
Remember from our last lesson, when one vector is shorter than the other, it gets recycled:
c("a", "b", "c", "d", "e") # names of x
| | | | | # The elements == is comparing
c("a", "c", "a", "c", "a")
In this case R simply repeats c("a", "c")
two and a half times. If the longer
vector length isn’t a multiple of the shorter vector length, then
R will also print out a warning message.
This difference between ==
and %in%
is important to remember,
because it can introduce hard to find and subtle bugs!
Challenge 1
Given the following code:
x <- c(5.4, 6.2, 7.1, 4.8, 7.5) names(x) <- c('a', 'b', 'c', 'd', 'e') print(x)
a b c d e 5.4 6.2 7.1 4.8 7.5
Come up with at least 2 different commands that will produce the following output:
b c d 6.2 7.1 4.8
After you find 2 different commands, compare notes with your neighbour. Did you have different strategies?
Solution to Challenge 1
Use the
c
function:x[c(2,3,4)]
b c d 6.2 7.1 4.8
Use the colon operator:
x[2:4]
b c d 6.2 7.1 4.8
Select elements by name:
x[c("b", "c", "d")]
b c d 6.2 7.1 4.8
Use the
-
(NOT) along with thec
function to remove elements you don’t want:x[-c(1,5)]
b c d 6.2 7.1 4.8
Challenge 2
Run the following code to define vector
x
as above:x <- c(5.4, 6.2, 7.1, 4.8, 7.5) names(x) <- c('a', 'b', 'c', 'd', 'e') print(x)
a b c d e 5.4 6.2 7.1 4.8 7.5
Given this vector
x
, what would you expect the following to do?x[-which(names(x) == "g")]
Test out your guess by trying out this command. Did this match your expectation? Why did we get this result? (Tip: test out each part of the command on it’s own - this is a useful debugging strategy)
Solution to Challenge 2
The
which
command returns the index of everyTRUE
value in its input. Thenames(x) == "g"
command didn’t return anyTRUE
values. Because there were noTRUE
values passed to thewhich
command, it returned an empty vector. Negating this vector with the minus sign didn’t change its meaning. Because we used this empty vector to retrieve values fromx
, it produced an empty numeric vector. It was anamed numeric
empty vector because the vector type of x is “named numeric” since we assigned names to the values (trystr(x)
).
Challenge 3
While it is not recommended, it is possible for multiple elements in a vector to have the same name. Consider this examples:
y <- 1:3 y
[1] 1 2 3
names(y) <- c('a', 'a', 'a') y
a a a 1 2 3
Can you come up with a command that will only return one of the ‘a’ values and a different command that will return all of the ‘a’ values? Does your answer differ from your neighbors?
Solution to challenge 3
y['a'] # only returns first value
a 1
y[which(names(y) == 'a')] # returns all three values
a a a 1 2 3
We can subset data by using boolean vectors:
x[c(TRUE, TRUE, FALSE, FALSE, FALSE)]
a a
1 2
R will return any values that are indicated by TRUE
in your vector, and filter out any that
are FALSE
.
x[c(TRUE, FALSE)]
a a
1 3
Notice how R also recycled our logical vector to the correct length?
Since comparison operators evaluate to logical vectors, we can also use them to succinctly subset vectors. When we do a logical comparison on a vector, R returns a logical vector as the result:
x > 7
a b c d e
FALSE FALSE TRUE FALSE TRUE
We can nest our comparison inside of our subsetting operators to tell R to return a subset of our data which matches whatever criteria we specify.
x[x > 7]
c e
7.1 7.5
Tip: Combining logical conditions
There are many situations in which you will wish to combine multiple logical criteria. For example, we might want to find all the countries that are located in Asia or Europe and have life expectancies within a certain range. Several operations for combining logical vectors exist in R:
&
, the “logical AND” operator: returnsTRUE
if both the left and right areTRUE
.|
, the “logical OR” operator: returnsTRUE
, if either the left or right (or both) areTRUE
.The recycling rule applies with both of these, so
TRUE & c(TRUE, FALSE, TRUE)
will compare the firstTRUE
on the left of the&
sign with each of the three conditions on the right.You may sometimes see
&&
and||
instead of&
and|
. These operators do not use the recycling rule: they only look at the first element of each vector and ignore the remaining elements. The longer operators are mainly used in programming, rather than data analysis.
!
, the “logical NOT” operator: convertsTRUE
toFALSE
andFALSE
toTRUE
. It can negate a single logical condition (eg!TRUE
becomesFALSE
), or a whole vector of conditions(eg!c(TRUE, FALSE)
becomesc(FALSE, TRUE)
).Additionally, you can compare the elements within a single vector using the
all
function (which returnsTRUE
if every element of the vector isTRUE
) and theany
function (which returnsTRUE
if one or more elements of the vector areTRUE
).
Challenge 4
Given the following code:
x <- c(5.4, 6.2, 7.1, 4.8, 7.5) names(x) <- c('a', 'b', 'c', 'd', 'e') print(x)
a b c d e 5.4 6.2 7.1 4.8 7.5
Write a subsetting command to return the values in x that are greater than 4 and less than 7.
Solution to Challenge 4
x_subset <- x[x<7 & x>4] print(x_subset)
a b d 5.4 6.2 4.8
Key Points
Indexing in R starts at 1, not 0.
Access individual values by location using
[]
.Access slices of data using
[low:high]
.Access arbitrary sets of data using
[c(...)]
.Use
which
to select subsets of data based on value.