Home

Intro to R

Jump to ...
- Home
- Part 0
- Part 1
- Part 2
- Part 3
- Part 4
- Part 5
- Part 6
- Part 7
- Part 8
- Answers to tasks

1 Recap

In part 3, we learned:

How to organise our project folders and data files.
How to set the working directory (with setwd())
How to read in data from a csv spreadsheet file (with read.csv()) to create a ‘data frame’.
Data frames are two dimensional lists with rows and columns.
How to get a glimpse of the data in the console by calling head(d).
How to index a data frame to select specific rows (e.g. d[1:3, ]).
How to index a specific column using the $ operator.
How to create new variables in a data frame.
How to combine data from different sources using ID names, the match function or the merge function.

There’s a “cheatsheet” of all the commands we use in this course here.

2 Basic stats

In the last section, we used a database of roof shape and rainfall data. We’ll use that again here. Make a new R script in your analysis folder and copy this code in:

# Remember to set your working directory!
# setwd( ... )

# Load data
d <- read.csv("../data/RoofShapeData.csv", stringsAsFactors = F)
rainfallData <- read.csv("../data/RainfallData_onlyComplete.csv", stringsAsFactors = F)

# Add the rainfall data into the main data source
d$rainfall <- rainfallData[match(d$society_id, rainfallData$society_id) , ]$code

At the moment, the variable for roof shape is called code_label, which isn’t a very helpful label, and referring to it will make our code less clear. Let’s make a new variable called roofShape that is just a copy of code_label. Add this to your script:

d$roofShape <- d$code_label

Run your script so far to check that they work.

The data frame d now includes the variables rainfall and roofShape.

2.1 Mean, standard deviation, minimum and maximum

When we have a lot of data, it’s hard to view it all at once. Instead, we can use bits of programming to help us understand the range and distribution. For example, what is the mean rainfall in the data?

The function mean takes a numeric vector and returns the mean (average). Other simple functions include sd, max and min:

mean(d$rainfall)

## [1] 111704.8

sd(d$rainfall)

## [1] 69864.21

max(d$rainfall)

## [1] 383459.4

min(d$rainfall)

## [1] 2410.033

The values are in mean ml/m2/month (the average millileters of water falling per meter squared per month). You can see D-PLACE’s visualisation here

2.2 Basic statistics

We want to test the hypothesis that roof shape differs by amount of rainfall. Let’s remind ourselves of the categories of roof shape:

table(d$roofShape)

## 
##              Beehive shaped                     Conical 
##                          55                         350 
##                        Flat                 Four slopes 
##                          93                          85 
##               Hemispherical                   One slope 
##                          96                          14 
## Rounded or semi-cylindrical          Semi-hemispherical 
##                          36                           8 
##                  Two slopes 
##                         365

We might predict that flat rooves would be associated with lower rainfall compared to other roof types, because they are less good at draining water.

To test this we first need a variable in the data frame d that separates flat rooves from other types.

Task: Add a line to your script to make a variable in d named flatRoof that is TRUE if the roof type is “Flat” and FALSE otherwise.

Task: Now sub-set the data using this variable so that you can get the mean precipitation of societies with flat rooves and societies with rooves of other shapes.

Ok, so the mean rainfall on flat rooves is about half that on other roof types. Let’s run a very basic statistical test. t.test is a function which runs a two-sample t-test. It takes two arguments. The first is a formula, and the second is the source of data. A formula a dependent variable first (the measure we’re interested in comparing), followed by a tilde ~ (on a keyboard usually found to the left of the ‘1’ or to the right of the left-hand shift key), and the right is the variable that defines the two groups we want to test. Like this:

t.test(rainfall ~ flatRoof, data = d)

## 
##  Welch Two Sample t-test
## 
## data:  rainfall by flatRoof
## t = 12.622, df = 152.73, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  49268.66 67553.20
## sample estimates:
## mean in group FALSE  mean in group TRUE 
##           116634.24            58223.31

(note that ‘data’ is an optional argument, so we have to name it in the code)

The output to the conslode has a lot of statistics, including the pvalue and t score, and mean of each group. The p-value is very small (“< 2.2e-16” means less that 2 multiplied by 10 to the power of minus 16), suggesting that the difference is significant.

Hooray, we will be famous! Of course, we need to do all the proper statistical checks to be able to interpret this output meaningfully (are the samples normally distributed, large enough, equal variance etc.). We can do that with similar code, too. My aim here is just to show you how the functions work.

More importantly for this course, we might not be controlling for other explanations. For example, these data points are not statistically independent: they belong to societies that are related historically. So a t-test might not be a fair statistic. To go further, we can start using regression models which can (eventually) let us take into account some of this historical relationship with multi-level modelling or phylogenetics.

Below I apply a very simple linear model using lm. This function takes two arguments: a formula of the model we want to run, and a data frame.

The formula is like the one for t.test, but we can add extra independent variables seperated by a +. Let’s say that we want A function has a dependent variable first, followed by a tilde ~, then a list of dependent variables seperated by +. All the variables in the formula should exist in the data frame we give as the second argument.

Let’s say, for some reason, we think that we should control for the focal year (maybe the planet is getting wetter over time). We’d use this model:

model1 = lm(rainfall ~ flatRoof + focal_year, data=d)

The summary function gives us a table of parameter estimates, t-values and p-values.

summary(model1)

## 
## Call:
## lm(formula = rainfall ~ flatRoof + focal_year, data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -114252  -50366   -9048   38378  266738 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   58539.07   21466.62   2.727  0.00649 ** 
## flatRoofTRUE -55341.76    7431.79  -7.447 1.93e-13 ***
## focal_year       30.54      11.23   2.720  0.00664 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 67780 on 1099 degrees of freedom
## Multiple R-squared:  0.06039,    Adjusted R-squared:  0.05868 
## F-statistic: 35.31 on 2 and 1099 DF,  p-value: 1.367e-15

Add these lines to your script and run the model to see the results.

No need to worry about the statistical theory now, just the basic idea: You can apply regression or other statistical tests to a set of data, and assign the results to a variable. That variable (model1 above) now stores the output of the statistical test.

2.3 tapply

tapply is another powerful command in R. We can use it to get quick answers such as: Which group of societies (defined by language family) experiences the greatest average rainfall?

tapply, takes a vector of objects, groups the objects according to a second vector, then applies a function to each group.

For example, we might want to work out the standard deviation of rainfall within each family (the average distance of each society from the mean for their family: larger standard deviations indicate more variation within the family).

The function sd() works out the standard deviation of a vector of numbers, but we want to do this within each fmaily. We can do this using tapply. In the code below, the first argument is what we want to process, the second argument defines how we want to group the data, the thrid argument is the funciton we want to apply (without brackets). sd is the function that gives us the standard deviation:

rainfallByLangFam.SD <- tapply(d$rainfall, d$language_family, sd)

(The ‘.’ in the variable name is not special code, it’s just part of the label. It just helps us understand what the variable represents and contrasts it with some other variables below)

We then sort the data using sort (in decreasing order) and show the first few results using head.

rainfallByLangFam.SD.sorted <- sort(rainfallByLangFam.SD, decreasing = TRUE)
head(rainfallByLangFam.SD.sorted)

##         Arawakan         Chibchan            Mande     Austronesian 
##         85195.68         82175.48         69591.25         66556.29 
## Nuclear-Macro-Je  Central Sudanic 
##         60057.10         55027.98

Arawakan has the highest standard deviation of rainfall!

Task: Which basic word order type has the higest mean rainfall? Add this to your script

2.4 Custom functions

What do we do when we can’t find a function that does what we want? This is an optional section on custom functions.

Say we want to find out language family has the largest range in rainfall (difference between maximum and minimum). To answer this, we need to know the difference between the maximum and minimum rainfall for each family. There’s no built-in function that can calculate this, but we can make our own.

First, how would be calculate the range? Given the list of numbers below, we can work out the maximum and minimum value, then subtract one from the other:

x <- c(1,4,6,2,99,2)
max(x) - min(x)

## [1] 98

Functions in R are defined similarly to other objects. Below we create a function called rangeDiff which takes one argument (a list of numbers) and returns one number (the range of the numbers).

We declare it to be a function with one argument named x.
We use curly brackets to indicate where the code of the function begins and ends.
Then we create a variable r which calculates the range.
We pass the value of r back to the user with the function return.

rangeDiff <- function(x){
  r <- max(x) - min(x)
  return(r)
}

After running these lines, nothing will happen, but we can now use this function in other parts of the code.

x <- c(1,4,6,2,99,2)
rangeDiff(x)

## [1] 98

lotteryNums <- c(64,92,47,11)
rangeDiff(lotteryNums)

## [1] 81

Now we can use rangeDiff to work out the range for each language family:

rainByLangFam.Range <- tapply(d$rainfall, d$language_family, rangeDiff)
rainByLangFam.Range.sorted <- sort(rainByLangFam.Range, decreasing = TRUE)
head(rainByLangFam.Range.sorted)

##       Arawakan   Austronesian Atlantic-Congo          Mande   Sino-Tibetan 
##       290561.0       287889.5       236624.6       230405.1       222649.6 
##       Chibchan 
##       217780.5

Arawakan has the largest range in rainfall!

3 Making plots

R is great at visualising data to help us figure out what’s going on.

The most basic function is plot, which just makes an x-by-y plot.

Let’s plot the rainfall by focal year (no particular reason to do this except these are the two numeric values in the data)

plot(d$focal_year, d$rainfall)

Or the same outcome, but using a formula:

plot(rainfall ~ focal_year, data = d)

Hmm. There are a few earlier societies coded with very low rainfall. Let’s investigate that further.

hist is a function which produces a histogram:

hist(d$rainfall)

We can see that the distribution of rainfall is centered around 100,000 ml/m2/month, and is slightly right-skewed.

boxplot makes a box plot. This shows the mean and range of a particular variable for several categories of obervations. For example, here’s a boxplot of rainfall for the three largest language families:

# Make a data frame with just the data we want:
bigFamilies = c("Atlantic-Congo","Afro-Asiatic","Austronesian")
dx = d[d$language_family %in% bigFamilies,]
# Make boxplot
boxplot(rainfall ~ language_family, data = dx)

Task: Make a boxplot showing the mean rainfall for each type of roof shape. There are lots of types, which makes the legend difficult to read. Can you make a boxplot for just the top 4 most common roof shape types?

There are many other types of plot. The easiest way to find a recipie for making a plot is to use a gallery (like this one).

3.1 Writing plots to files

In the “Plots” window, there’s an “Export” button that lets you export the plot to a png or pdf file on your computer. You can then use this file with other programs like Word or Latex. The “Export” button is very useful, but we also want to be able to create files using code. Why? A few reasons:

The “Export” button needs you to specify the size of the plot. This might differ between runs.
You might need to export a plot at a larger size/resolution than the “Plots” window, which is tricky.
You might need to export to a particular file type that the Plots window doesn’t support (e.g. eps).
We want the script to reproduce our analysis with as little input from us as possible. That is, we want to externalise all the steps we take to make our plots. Remembering to click extra buttons (and which size, where to save etc.) increases the number of things we have to store in our heads.
We might want to make sure that the file is saved to a specific location. If you’re using a program like Latex, you can link images so that any update to the image file is automatically included in your manuscript. This makes it easier to make small changes further down the road.

So. How do we save a plot using code?

You can write plots to a pdf using the function pdf. Add the pdf funciton before the first plotting command, then dev.off() to stop writing to the file.

We’re writing out data to a file called “myMap.pdf” inside out ‘results’ folder:

# start writing
pdf(file="../results/myMap.pdf", width= 7, height = 7)
  # draw a map of the world
  map() 
  # add locations of languages
  points(d2$longitude, d2$latitude)
  # add a horizontal line at the equator
  abline(h = 0, col = 2)
# stop writing.
dev.off()

Other functions for different file types are available. Many of them are covered here.

Go to the next tutorial

Back to the index