Please ensure you have the latest version of R and RStudio installed on your machine.
Why R?
Why RStudio?
Basic layout
three panels:
Editor panel when you open scripts
There are two main ways one can work within RStudio.
Run button and key shortcuts
Project managment
Mostly use the console
”>” cursor - similar to the shell
The simplest thing you could do with R is do arithmetic:
1 + 100
Ignore the [1] for now.
Incomplete commands:
> 1 +
Can cancel with the “Esc” key, or “Ctrl+C” in command line - can use for running code too
order of operations
3 + 5 * 2
Use parentheses to force order:
(3 + 5) * 2
Use these to make code easier to read
(3 + (5 * (2 ^ 2))) ## hard to read
3 + 5 * 2 ^ 2 ## clear, if you remember the rules
3 + 5 * (2 ^ 2) ## if you forget some rules, this might help
”##” indicate comments
Scientific notation
2/10000
5e3 ## Note the lack of minus here
functions use name then parenthesis
sin(1) ## trigonometry functions
log(1) ## natural logarithm
log10(10) ## base-10 logarithm
exp(0.5) ## e^(1/2)
Can look up functions in google or use Autocomplete (tab)
1 == 1 ## equality (note two equals signs, read as "is equal to")
1 != 2 ## inequality (read as "is not equal to")
1 < 2 ## less than
1 <= 1 ## less than or equal to
1 > 0 ## greater than
1 >= -9 ## greater than or equal to
float point error, use all.equal instead of ‘==’ for non-integers
use assignment arrow to save values to variables
x <- 1/40
No output
x
stored as a decimal approximation called floating point number.
point out x in environment variables
log(x)
Can reassign values
x <- 100
Can reference the variable in the assignment
x <- x + 1 ##notice how RStudio updates its description of x on the top right tab
The right hand side of the assignment can be any valid R expression. The right hand side is fully evaluated before the assignment occurs.
character values:
y <- "green"
Variable names rules: can contain letters, numbers, underscores and periods cannot start with a number nor contain spaces at all.
Naming conventions:
consistency is important
can use =
operator for assignment:
z = 1/40
Less common, and sometimes it is confusing to use ‘=’ instead of ‘<-‘ remember consistency!
built in functions coding your own supplemental lesson
install.packages("packagename")
library(packagename)
install.packages("gapminder")
text scrolling for install
library(gapminder)
Point out how to install via the Packages tab.
Other useful commands:
installed.packages()
update.packages()
remove.packages("packagename")
CHALLENGES (allot 15 min)
help(function_name)
?function_name
Each help page is broken down into sections:
Some may have different sections, but these are the main ones.
Help files make it easier to use R because you don;’t have to remember the usage of every function.
To seek help on special operators, use quotes:
?"+"
Many packages come with “vignettes”: tutorials and extended example documentation. Without any arguments
vignette()
will list all vignettes for all installed packages;
vignette(package="package-name")
will list all available vignettes for package-name
and vignette("vignette-name")
will open the specified vignette.
If a package doesn’t have any vignettes, you can usually find help by typing
help("package-name")
.
fuzzy search:
??function_name
CRAN Task View - show website: http://cran.at.r-project.org/web/views
Stack Overflow - search using the [r]
tag.
CHALLENGES - allot 10 min
Review operators:
x + z
Try adding two different types of data:
x + y
Gives error because 101 + “green” is nonsense.
5 main data types: double
, integer
, complex
, logical
, and character
.
typeof(3.14)
A double
, also referred to as a floating point number, is how R stores numeric values
by default.
typeof(1L)
To use an integer
value in R, we use the L to tell R that this value is an integer value.
Without the L, R would store this value as a double
.
typeof(1+1i)
R can also support complex
values as well. Unless you are doing mathematical analyses or
complicated transformations, chances are you will not encounter this data type very often.
typeof(TRUE)
[1] "logical"
Logical
data types are particularly helpful in subsetting data frames and other types of data manipulation. We will explore this concept more later.
typeof('banana')
[1] "character"
Lastly, R stores strings as the character
type.
No matter how complicated our analyses become, all data in R is interpreted as one of these basic data types.
Remember the [1]? R never uses just a single value, but instead uses vectors. Output before was a vector of length 1.
As seen in the challenges, we can build vectors using the c
function.
x <- c(2, 4, 6, 8, 10, 12, 14, 16)
x
colon operator quickly creates sequential vectors:
y <- 1:8
y
can specify whatever start and stop point we want
-4:7
[1] -4 -3 -2 -1 0 1 2 3 4 5 6 7
R is vectorized operations on the entire vector - returns a vector
y + 10
[1] 11 12 13 14 15 16 17 18
x * 2
[1] 4 8 12 16 20 24 28 32
x + y
[1] 3 6 9 12 15 18 21 24
when operating on two or more vectors, R performs the operation element by element:
x: 2 4 6 8 10 12 14 16
+ + + + + + + +
y: 1 2 3 4 5 6 7 8
--------------------------
3 6 9 12 15 18 21 24
Vectors can be made up of any of the basic data types.
Character Vectors:
a <- c("one", "two", "three", "four")
a
similar to typeof
, str
will tell us the data type.
also will give us a compact view of some basic information about our object.
str(a)
chr that this is a character vector numbers in the brackets indicates the dimensions of our vector. then a list the first few elements
Logical Vectors:
b <- c(TRUE, TRUE, FALSE, TRUE)
b
we can use c
to add elements to an existing vector
c(x, 20, 25)
Changes won’t be saved until we use the assignment arrow
modify y
to contain all numbers to 20 by adding on to the existing vector
y <- c(y, 9:20)
y
This is a nested operation, from Order of Opers. earlier the sequence inside the parenthesis is created first, then added to y
length
can quickly return the length of a vector.
length(y)
Other data structures, lists and matrices
use ?list()
, ?matrix()
or supplemental lesson to learn about them.
primarily data frames today, continued after break:
CHALLENGES - allot 10 minutes
** HERE **
R’s power comes from it’s vectorization. many powerful subset operators that will allow you to easily perform complex operations on any kind of dataset without the resource depletion of loops.
Let’s start with a data structure we’ve seen before, the workhorse of R: atomic vectors.
x <- c(5.4, 6.2, 7.1, 4.8, 7.5)
can name elements within our vectors using the names
function.
names(x) <- c('a', 'b', 'c', 'd', 'e')
x
how do we get to individual contents?
we can give their corresponding index, starting from one:
x[1]
x[4]
the square brackets operator is a function For atomic vectors (and matrices), it means “get me the nth element”.
We can ask for multiple elements at once:
x[c(1, 3)]
Or slices of the vector:
x[1:4]
the :
operator lets us select a range of elements
We can ask for the same element multiple times:
x[c(1,1,3)]
a number outside of the vector, R will return missing values
x[6]
This is a vector of length one containing an NA
, whose name is also NA
.
If we ask for the 0th element, we get an empty vector:
x[0]
named numeric(0)
R starts indices with 1 instead of 0 like other programming languages such as C and python
use a negative number to return every element except for the one specified:
x[-2]
We can skip multiple elements:
x[c(-1, -5)] ## or x[-c(1,5)]
Combining positive and negative indices will return an error.
To make the subset permanent we need to assign the value.
x <- x[-4]
x
CHALLENGE - allot 5 minutes
can extract elements by using their name
x[c("a", "c")]
more reliable since positions can change. But we cannot as easily skip or remove by name.
To skip (or remove) a single named element:
x[-which(names(x) == "a")]
The which
function returns the indices of all TRUE
elements of its argument.
Step by step analysis of command:
names(x) == "a"
The condition operator is applied to every name of the vector x
. Only the
first name is “a” so that element is TRUE.
which
then converts this to an index:
which(names(x) == "a")
Only the first element is TRUE
, so which
returns 1.
the ‘-‘ makes this index negative and removes the element
Skipping multiple named indices is similar, but uses a different comparison operator:
x[-which(names(x) %in% c("a", "c"))]
%in%
goes through each element of its left argument, in this case the
names of x
, and asks, “Does this element occur in the second argument?”.
why can’t we use ==
like before? Good question.
names(x) == c('a', 'c')
gives a warning
==
works by comparing each element of its left argument
to the corresponding element of its right argument.
Here’s a mock illustration:
c("a", "b", "c", "d", "e") ## names of x
| | | | | ## The elements == is comparing
c("a", "c")
when one vector is shorter than the other, it gets recycled:
c("a", "b", "c", "d", "e") ## names of x
| | | | | ## The elements == is comparing
c("a", "c", "a", "c", "a")
R repeats c("a", "c")
two and a half times. If the longer
vector length isn’t a multiple of the shorter vector length, then
R will also print out a warning message.
This difference between ==
and %in%
is important to remember,
because it can introduce hard to find and subtle bugs!
CHALLENGES - allot 10 min
We can subset data by using boolean vectors:
x[c(TRUE, TRUE, FALSE, FALSE, FALSE)]
R will return any values that are indicated by TRUE
in your vector, and filter out any that
are FALSE
.
x[c(TRUE, FALSE)]
R also recycled our logical vector
comparison operators evaluate to logical vectors we can use them to succinctly subset vectors
x > 7
nest our comparison inside of our subsetting operators to tell R to return a subset
x[x > 7]
&
, the “logical AND” operator: returns TRUE
if both the left and right
are TRUE
.|
, the “logical OR” operator: returns TRUE
, if either the left or right
(or both) are TRUE
.The recycling rule applies with both of these
&&
and ||
do not use the recycling rule: they only look at the first element of each
vector and ignore the remaining elements. Usually used in programming not data analysis
!
, the “logical NOT” operator: converts TRUE
to FALSE
and FALSE
to
TRUE
. can negate a single logical condition, or a whole vector of conditionsthe all
function returns TRUE
if every element of the vector is TRUE
the any
function returns TRUE
if one or more elements of the vector are TRUE
.
CHALLENGES allot 5 min
so far data structures contained all of the same data type one of R’s most powerful features is its ability to deal with tabular data (like spreadsheet or CSV) Data Frames are built from vectors but can contain vectors of different data types.
build a data frame from existing vectors - data.frame()
command
coat <- c("calico", "black", "tabby")
weight <- c(2.1, 5.0, 3.2)
likes_string <- c(1,0,1)
cats <- data.frame(coat, weight, likes_string)
cats
can pull out columns by specifying them using the $
operator
cats$weight
cats$coat
can perform operations on columns within our data frame, just like with vectors
#### Say we discovered that the scale weighs two Kg light:
cats$weight + 2
paste("My cat is", cats$coat)
CHALLENGES allot 5 min
add an additional column for age using c
age <- c(2,3,5,12)
cats
We can then add this as a column in our data frame by using the cbind()
function:
cats <- cbind(cats, age)
Error - there are more elements in age but only 3 rows in cats.
age <- c(4,5,8)
cats <- cbind(cats, age)
cats
add a row, rows are lists since they contain different types of elements:
newRow <- list("tortoiseshell", 3.3, TRUE, 9)
cats <- rbind(cats, newRow)
Our list had the correct number of elements, so why did R give us a warning?
class(cats$coat)
Factors are data classes that R uses to handle categorical data. categories are called levels
Anything new that doesn’t fit into one of its categories is rejected as nonsense and
is replaced by an NA
until we explicitly add that as a level in the factor:
levels(cats$coat)
levels(cats$coat) <- c(levels(cats$coat), 'tortoiseshell')
cats <- rbind(cats, list("tortoiseshell", 3.3, TRUE, 9))
can also change the column to character to prevent this
str(cats)
cats$coat <- as.character(cats$coat)
str(cats)
We can now add rows and columns, but we’ve accidentally added a garbage row:
cats
coat weight likes_string age
1 calico 2.1 1 4
2 black 5.0 0 5
3 tabby 3.2 1 8
4 <NA> 3.3 1 9
5 tortoiseshell 3.3 1 9
We can ask for a data.frame minus this offending row:
cats[-4,]
Notice the comma with nothing after it to indicate we want to drop the entire fourth row.
using na.omit
allows us to drop all rows with NA
values:
na.omit(cats)
Let’s reassign the output to cats
, so that our changes will be permanent:
cats <- na.omit(cats)
remember that columns are vectors or factors, and rows are lists.
We can also glue two dataframes together with rbind
:
cats <- rbind(cats, cats)
cats
But now the row names are unnecessarily complicated. We can remove the rownames, and R will automatically re-name them sequentially:
rownames(cats) <- NULL
cats
CHALLENGES - allot 5 min
let’s use a more realistic dataset
the gapminder
data set built into the gapminder
package.
install the gapminder
package if you havent already
install.packages("gapminder")
load it now using the library
command (explain how to use the packages
tab
library('gapminder')
to make our analysis reproducible, we should put the code into a script file
mention loading libraries in script files
investigate the data - check
out what the data looks like with str
:
str(gapminder)
We can also examine individual columns of the data.frame with our typeof
function:
typeof(gapminder$year)
typeof(gapminder$lifeExp)
typeof(gapminder$country)
str(gapminder$country)
information about its dimensions;
remembering that str(gapminder)
said there were 1704 observations of 6
variables in gapminder
what do you think the following will produce, and why?
length(gapminder)
try it, point out that data frames are lists of vectors - each column is a vector/factor and each row is a list (hence the different data types)
typeof(gapminder)
length
gave us 6 because gapminder is built out of a list of 6
columns
To get the number of rows and columns in our dataset, try:
nrow(gapminder)
ncol(gapminder)
Or, both at once:
dim(gapminder)
titles of all the columns:
colnames(gapminder)
ask ourselves if the structure R is reporting matches our intuition or expectations do the basic data types reported for each column make sense? No? We need to fix it now
Once we’re happy that the data types and structures seem reasonable start digging into our data proper
head(gapminder)
CHALLENGES - 5 min
data frames are lists of vectors, so selecting a single element returns a single vector, or column of the data frame.
head(gapminder[5])
the c
command to returns multiple columns:
head(gapminder[c(1,5)])
$
provides a convenient shorthand to extract columns by name:
head(gapminder$year)
With two arguments, [
subsets on typical matrix format
if one of the arguments is blank, R will
default to include all of the rows or columns:
gapminder[1:3,]
If we subset a single row, the result will be a data frame
gapminder[3,]
But for a single column the result will be a vector.
CHALLENGES - 15 min
Often when we’re coding we want to control the flow of our actions. This can be done by setting actions to occur only if a condition or a set of conditions are met. Alternatively, we can also set an action to occur a particular number of times.
There are several ways you can control flow in R. For conditional statements, the most commonly used approaches:
## if
if (condition is true) {
perform action
}
## if ... else
if (condition is true) {
perform action
} else { ## that is, if the condition is false,
perform alternative action
}
if we want R to print a message if a variable x
has a particular value:
## sample a random number from a Poisson distribution
## with a mean (lambda) of 8
x <- rpois(1, lambda=8)
if (x >= 10) {
print("x is greater than or equal to 10")
}
x
you may not get the same output as your neighbour
Let’s set a seed so that we all generate the same ‘pseudo-random’ number
set.seed(10)
x <- rpois(1, lambda=8)
if (x >= 10) {
print("x is greater than or equal to 10")
} else if (x > 5) {
print("x is greater than 5")
} else {
print("x is less than 5")
}
In the above case, the function rpois()
generates a random number following a
Poisson distribution with a mean (i.e. lambda) of 8. The function set.seed()
guarantees that all machines will generate the exact same ‘pseudo-random’
number
when R evaluates the condition inside if()
statements, it is
looking for a logical element
This can cause some headaches for beginners.
x <- 4 == 3
if (x) {
"4 equals 3"
}
As we can see, the message was not printed because the vector x is FALSE
x <- 4 == 3
x
CHALLENGES - 5 min
Did anyone get a warning message like this?
Warning in if (gapminder$year == 2012) {: the condition has length > 1 and
only the first element will be used
If your condition evaluates to a vector with more than one logical element,
the function if()
will still run, but will only evaluate the condition in the first
element. Here you need to make sure your condition is of length 1.
The any()
function will return TRUE if at least one
TRUE value is found within a vector, otherwise it will return FALSE
.
This can be used in a similar way to the %in%
operator.
The function all()
, as the name suggests, will only return TRUE
if all values in
the vector are TRUE
.
to repeat an operation until a certain condition is met use a while()
loop
while(this condition is true){
do a thing
}
here’s a while loop
that generates random numbers from a uniform distribution (the runif
function)
between 0 and 1 until it gets one that’s less than 0.1.
z <- 1
while(z > 0.1){
z <- runif(1)
print(z)
}
You have to be careful that you don’t end up in an infinite loop because your condition is never met.
If you want to iterate over
a set of values a for()
loop will do the job.
This is the most
flexible of looping operations, but therefore also the hardest to use
correctly.
Avoid using for()
loops unless the the calculation at each iteration depends on the results of previous iterations.
The basic structure of a for()
loop is:
for(iterator in set of values){
do a thing
}
For example:
for(i in 1:10){
print(i)
}
The 1:10
bit creates a vector on the fly; you can iterate
over any other vector as well.
We can use a for()
loop nested within another for()
loop to iterate over two things at
once.
for(i in 1:5){
for(j in c('a', 'b', 'c', 'd', 'e')){
print(paste(i,j))
}
}
Rather than printing the results, we could write the loop output to a new object.
output_vector <- c()
for(i in 1:5){
for(j in c('a', 'b', 'c', 'd', 'e')){
temp_output <- paste(i, j)
output_vector <- c(output_vector, temp_output)
}
}
output_vector
This approach can be useful, but ‘growing your results’ is computationally inefficient, so avoid it when you are iterating through a lot of values.
A better way is to define your (empty) output object before filling in the values.
output_matrix <- matrix(nrow=5, ncol=5)
j_vector <- c('a', 'b', 'c', 'd', 'e')
for(i in 1:5){
for(j in 1:5){
temp_j_value <- j_vector[j]
temp_output <- paste(i, temp_j_value)
output_matrix[i, j] <- temp_output
}
}
output_vector2 <- as.vector(output_matrix)
output_vector2
CHALLENGES - 15 min
Manipulation of dataframes means many things to many researchers, we often select certain observations (rows) or variables (columns), we often group the data by a certain variable(s), or we even calculate summary statistics. We can do these operations using the normal base R operations:
mean(gapminder[gapminder$continent == "Africa", "gdpPercap"])
mean(gapminder[gapminder$continent == "Americas", "gdpPercap"])
mean(gapminder[gapminder$continent == "Asia", "gdpPercap"])
This can be repetitive, and repetition will cost you time, make your code bulky and hard to read and potentially introduce some nasty bugs
dplyr
packagethe dplyr
package provides a number of useful functions for manipulating dataframes
in a way that will reduce the above repetition, reduce the probability of making
errors, and probably even save you some typing. the dplyr
grammar can also make your code easier to read.
we’re going to cover 5 of the most commonly used functions as well as using
pipes (%>%
) to combine them
select()
filter()
group_by()
summarize()
mutate()
If you have have not installed this package earlier, please do so:
install.packages('dplyr')
Now let’s load the package:
library(dplyr)
If we wanted to use only a few of the variables in
our dataframe we could use the select()
function.
year_country_gdp <- select(gapminder,year,country,gdpPercap)
look at year_country_gdp
see that it only contains the year, country and gdpPercap.
we used ‘normal’ grammar
but the strengths of dplyr
lie in combining several functions using pipes. Since the pipes grammar
is unlike anything we’ve seen in R before, let’s repeat what we’ve done above
using pipes.
year_country_gdp <- gapminder %>% select(year,country,gdpPercap)
To help you understand why we wrote that in that way, let’s walk through it step
by step
First we summon the gapminder dataframe and pass it on, using the pipe
symbol %>%
, to the next step, which is the select()
function.
In R, a pipe symbol is %>%
while in the
shell it is |
using the above but only with European
countries, we can combine select
and filter
year_country_gdp_euro <- gapminder %>%
filter(continent=="Europe") %>%
select(year,country,gdpPercap)
line breaks between specific portions of our command make the code easier to read
first we pass the data frame to filter
then pass the filtered dataframe to select
.
If we reversed this, it would not work since we removed the continent data with select
CHALLENGES - 5 min
If we want to do the same above for each country, we can reduce repetitiveness with group_by
group_by
will essentially use every unique criteria that you
could have used in filter
.
str(gapminder)
str(gapminder %>% group_by(continent))
notice that the structure of the dataframe where we used group_by()
(grouped_df
) is not the same as the original gapminder
(data.frame
).
A
grouped_df
can be thought of as a list
where each item in the list
is a
data.frame
which contains only the rows that correspond to the a particular
value.
group_by()
is much more
exciting in conjunction with summarize()
summarize
allows us to create new
variable(s) by using functions that repeat for each of the continent-specific
data frames
gdp_bycontinents <- gapminder %>%
group_by(continent) %>%
summarize(mean_gdpPercap=mean(gdpPercap))
That allowed us to calculate the mean gdpPercap for each continent, but it gets even better.
CHALLENGES - 5 min
group_by()
allows us to group by multiple variables. Let’s group by year
and continent
.
gdp_bycontinents_byyear <- gapminder %>%
group_by(continent,year) %>%
summarize(mean_gdpPercap=mean(gdpPercap))
we can define more than 1 variable with summarize
gdp_pop_bycontinents_byyear <- gapminder %>%
group_by(continent,year) %>%
summarize(mean_gdpPercap=mean(gdpPercap),
sd_gdpPercap=sd(gdpPercap),
mean_pop=mean(pop),
sd_pop=sd(pop))
We can create new variables prior to (or even after) summarizing information using mutate()
.
gdp_pop_bycontinents_byyear <- gapminder %>%
mutate(gdp_billion=gdpPercap*pop/10^9) %>%
group_by(continent,year) %>%
summarize(mean_gdpPercap=mean(gdpPercap),
sd_gdpPercap=sd(gdpPercap),
mean_pop=mean(pop),
sd_pop=sd(pop),
mean_gdp_billion=mean(gdp_billion),
sd_gdp_billion=sd(gdp_billion))
** CHALLENGES ** - allot 10 min
Plotting our data is one of the best ways to quickly explore it and the various relationships between variables.
Today we’ll be learning about the ggplot2 package, because it is a bit harder to learn but produces better looking plots than the base plotting ability.
ggplot2 is built on the grammar of graphics (where the gg comes from) the idea that any plot can be expressed from the same set of components:
The key to understanding ggplot2 is thinking about a figure in layers.
Let’s start off with an example. The first thing we need to do is load the ggplot2
package
library("ggplot2")
If you haven’t previously installed the package, install it now using the command install.packages("ggplot2")
. Then load it using the command above.
to begin graphing, we use the ggplot
function
this lets R
know that we’re creating a new plot, and any of the arguments we give the
ggplot
function are the global options for the plot: they apply to all
layers on the plot.
library("ggplot2")
ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) +
geom_point()
mention that line breaks can be anywhere, but you need to have the + at the end of the line.
We’ve passed in two arguments to ggplot
.
First, we tell ggplot
what data we
want to show on our figure
For the second argument we passed in the aes
function, which
tells ggplot
how variables in the data map to aesthetic properties of
the figure, in this case the x and y locations.
Here we told ggplot
we
want to plot the “gdpPercap” column of the gapminder data frame on the x-axis, and
the “lifeExp” column on the y-axis.
we didn’t need to explicitly
pass aes
these columns (e.g. x = gapminder[, "gdpPercap"]
)
Other options that can be set with the aes
function include color, size, transparency and shape. We will talk more about that later.
By itself, the call to ggplot
isn’t enough to draw a figure:
ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp))
We need to tell ggplot
how we want to visually represent the data, which we
do by adding a new geom layer. In our example, we used geom_point
, which
tells ggplot
we want to visually represent the relationship between x and
y as a scatterplot of points:
ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) +
geom_point()
CHALLENGES - 10 min
Using a scatterplot probably isn’t the best for visualizing change over time.
Instead, let’s tell ggplot
to visualize the data as a line plot:
ggplot(data = gapminder, aes(x=year, y=lifeExp, by=country, color=continent)) +
geom_line()
Instead of adding a geom_point
layer, we’ve added a geom_line
layer. We’ve
added the by aesthetic, which tells ggplot
to draw a line for each
country.
if we want to visualize both lines and points on the plot? We can simply add another layer to the plot:
ggplot(data = gapminder, aes(x=year, y=lifeExp, by=country, color=continent)) +
geom_line() + geom_point()
note that each layer is drawn on top of the previous layer.
ggplot(data = gapminder, aes(x=year, y=lifeExp, by=country)) +
geom_line(aes(color=continent)) + geom_point()
the aesthetic mapping of color has been moved from the
global plot options in ggplot
to the geom_line
layer so it no longer applies
to the points.
Now we can clearly see that the points are drawn on top of the lines.
Demonstrate changing an aesthetic to a solid color:
ggplot(data = gapminder, aes(x=year, y=lifeExp, by=country)) +
geom_line(aes(color="red")) + geom_point()
ggplot
also makes it easy to overlay statistical models over the data. To
demonstrate we’ll go back to our first example:
ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) +
geom_point()
it’s hard to see the relationship between the points due outliers
change the scale of units on the x axis using the scale functions.
can also modify the transparency of the points, using the alpha function - helpful when large amount of data which is very clustered.
ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) +
geom_point(alpha = 0.5) + scale_x_log10()
The log10
function applied a transformation to the values of the gdpPercap
column before rendering them on the plot, so that each multiple of 10 now only
corresponds to an increase in 1 on the transformed scale, e.g. a GDP per capita
of 1,000 is now 3 on the y axis, a value of 10,000 corresponds to 4 on the y
axis and so on. This makes it easier to visualize the spread of data on the
x-axis.
point out that the alpha
aesthetic is applied only to the point geom. Can also make transparency based on variables such as:
ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) +
geom_point(aes(alpha = continent)) + scale_x_log10()
We can fit a simple relationship to the data by adding another layer,
geom_smooth
:
ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) +
geom_point() + scale_x_log10() + geom_smooth(method="lm")
We can make the line thicker by setting the size aesthetic in the
geom_smooth
layer:
ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) +
geom_point() + scale_x_log10() + geom_smooth(method="lm", size=1.5)
There are two ways an aesthetic can be specified. Here we set the size
aesthetic by passing it as an argument to geom_smooth
. Previously in the
lesson we’ve used the aes
function to define a mapping between data
variables and their visual representation.
CHALLENGES - 10 min
we can split this out over multiple panels by adding a layer of facet panels. Focusing only on those countries with names that start with the letter “A” or “Z”.
start by subsetting the data
use the substr
function to
pull out a part of a character string
the %in%
operator allows us to make multiple comparisons rather
than write out long subsetting conditions
starts.with <- substr(gapminder$country, start = 1, stop = 1)
az.countries <- gapminder[starts.with %in% c("A", "Z"), ]
ggplot(data = az.countries, aes(x = year, y = lifeExp, color=continent)) +
geom_line() + facet_wrap( ~ country)
The facet_wrap
layer takes a formula as its argument, denoted by the tilde
(~). This tells R to draw a panel for each unique value in the country column
of the gapminder dataset.
change some of the text elements.
rename our x
and y
axes using the xlab()
and ylab()
functions:
ggplot(data = az.countries, aes(x = year, y = lifeExp, color=continent)) +
geom_line() + facet_wrap( ~ country) +
xlab("Year") + ylab("Life Expectancy")
give our figure a title with the ggtitle()
function. And capitalize the label of our
legend. This can be done using the scales layer.
ggplot(data = az.countries, aes(x = year, y = lifeExp, color=continent)) +
geom_line() + facet_wrap( ~ country) +
xlab("Year") + ylab("Life Expectancy") +
ggtitle("Figure 1") + scale_colour_discrete(name="Continent")
let’s remove the x-axis labels so the plot is less cluttered. To do this, we use the theme layer which controls the axis text and overall text size.
ggplot(data = az.countries, aes(x = year, y = lifeExp, color=continent)) +
geom_line() + facet_wrap( ~ country) +
xlab("Year") + ylab("Life Expectancy") +
ggtitle("Figure 1") + scale_colour_discrete(name="Continent") +
theme(axis.text.x=element_blank(), axis.ticks.x=element_blank())
CHALLENGES - allot 5 min
So making publication quality plots is great but does us little good if we cannot get them out of R and into our documents.
You can save a plot from within RStudio using the ‘Export’ button in the ‘Plot’ window
if you are working in a command line environment, or want to create multiple plots without user interaction?
you can use the ggsave
function to save your plots quickly. This function can be used to save the last displayed plot in a format specified by the file name.
You can save as several different formats, such as a PDF:
ggsave("My_most_recent_plot.pdf")
Or as a JPG:
ggsave("My_most_recent_plot.jpg")
ggsave
also allows you to specify size and quality of the image. You can check out all of the
options using the ?ggsave
command to view the help file.
if you will want to save plots without creating them in the ‘Plot’ window first. Perhaps you want to make a pdf document with multiple pages: each one a different plot, for example. Or perhaps you’re looping through multiple subsets of a file, plotting data from each subset, and you want to save each plot, but obviously can’t stop the loop to click ‘Export’ for each one.
The function
pdf
creates a new pdf device. You can control the size and resolution
using the arguments to this function.
pdf("Life_Exp_vs_time.pdf", width=12, height=4)
ggplot(data=gapminder, aes(x=year, y=lifeExp, color=continent)) +
geom_point()
dev.off()
The pdf
command opens the pdf file, and any output between this command and the dev.off()
command
will be added to that file.
dont forget the dev.off
command
Open up this document and have a look.
CHALLENGES - 5 min
The commands jpeg
, png
etc. are used similarly to produce
documents in different formats.
At some point, you’ll also want to write out data from R.
We can use the write.table
function for this, which is
very similar to the read.table
function that we mentioned previously in Lesson 2.
Let’s create a data-cleaning script, for this analysis, we only want to focus on the gapminder data for Australia:
make a script file
aust_subset <- gapminder[gapminder$country == "Australia",]
write.table(aust_subset,
file="gapminder-aus.csv",
sep=","
)
in line comments and linebreaks within function calls
aust_subset <- gapminder[gapminder$country == "Australia",]
write.table(aust_subset, ## Gapminder data for countries located in Australia
file="gapminder-aus.csv", ## Name of the output file
sep="," ## Comma separated
)
open a shell window and navigate to the location of the file using your cd
command.
to find out where R is saving
your files, you can check the file
pane or use the getwd()
command.
head gapminder-aus.csv
you can view
the file in R by clicking on the filename in the file
pane and selecting View File
.
Where did all these quotation marks come from? Also the row numbers are meaningless. let’s look at the help file
?write.table
By default R will wrap character vectors with quotation marks when writing out to file. It will also write out the row and column names.
write.table(aust_subset, ## Gapminder data for countries located in Australia
file="gapminder-aus.csv", ## Name of the output file
sep=",", ## Comma separated
quote=FALSE, ## Turn off quotation marks
row.names=FALSE ## No row names
)
look at the data again using our shell skills:
head gapminder-aus.csv
That looks better!
CHALLENGES - 5 min
Don’t forget your R helpfiles and package vignettes which can be accessed
by using the ?
and vignette
commands.
Additional R topics that we could not cover today.
The R Club meets on East Campus twice a month. It is headed by Leonardo Bastos, a PhD student in Agronomy and Horticulture. You can email Leonardo for more information, or check out the club’s GitHub page which contains previous meeting topics.
R quick reference guides including today’s handouts and more!
Hadley Wickham is RStudio’s Chief Data Scientist and developer of the dplyr
and ggplot2
packages.
R for Data Science is his newest book, and is available here for free.
Following One R Tip a Day is a great way to learn new tips and tricks in R.
Twotorials is a compilation of 2 minute youtube videos which highlight a specific topic in R.
For more advanced topics, check out Hadley Wickham’s website based on his book “Advanced R”.
Remind them to fill out post-it survey