In part 1, we learned how to do a number of things in R:
Assignment (saving values to variables like ‘x’ so we can call them later).
x <- 1
Different mathematical operators like +, -, / (division) and * (multiplication).
How to make vectors (lists of numbers or other items):
numbersOneToOneHundred <- 1:100
myNumbers <- c(1,2,5,7)
How to apply functions:
sumOfMyNumbers <- sum(myNumbers)
Indexing (extracting specific items from a list):
secondItemByNumber <- myNumbers[2]
secondItemByBooleanVector <- myNumbers[c(FALSE,TRUE,FALSE,FALSE)]
In part 2 we learned how to install packages for specific functions (install.packages("nameOfThePackage")), load packages (library(nameOfThePackage)), and how to get help with errors.
There’s a “cheatsheet” of all the commands we use in this course here.
Before we can process data with R, we need to load it into memory (also called “reading” the data). In RStudio, you can click File > Import Dataset, and you’ll see a menu where you can choose a file to load, just like in programs like Word or Excel. This is handy, but it’s more productive to write some code to load our data. There are several reasons for this:
In this tutorial, we’ll look at some data from the D-PLACE database on the shape of rooves across lots of societies. The data is originally from the Ethnographic Atlas. The goal is to see whether the amount of rainfall is related to the shape of your roof (more rainfall should mean steeper rooves).
To do this, we’ll load some data from a file. This is as easy as running one line of code which refers to the location of the file on your computer. For example, if I have a file called “RoofShapeData.csv” on my desktop, I could load it with this code:
d <- read.csv("~/Desktop/RoofShapeData.csv")
However, it’s important to keep your project files organised - we don’t want files from multiple projects all on the desktop! So we’ll have to do some preparation, which involves setting up a dedicated folder for your project and then telling R that we’ve done this.
First, create a folder on your computer to store data and scripts for this tutorial. I’ve put mine in the folder “~/Documents/Teaching/CARDIFF/RCourse/IntroToR”. Inside this folder, create two folders: one called “data” and one called “analysis”.
Next, download the data files (you may need to right-click these links and choose “save target as”):
(these files come from this GitHub repository): https://github.com/seannyD/IntroToRForAnthropology/tree/main/data.
Put the data files in the folder called “data”.
Note: If you’re using Windows, your browser may want to save these files as “.txt” files, rather than “.csv” files. That’s fine - the data won’t change - but you might get some errors when it comes to loading the files in R (because the filenames are different). To fix this, I recommend that you:
- Make Windows show file extensions in the File Explorer (see here).
- Download the files as “.txt” files.
- In File Explorer, find the files and rename them to end in “.csv”.
These files are the same kinds as you can download from D-PLACE. However, I’ve edited them slightly to make things easier for this tutorial.
Make a new R script file in RStudio (File > New File > R script).
When we read data, we need to tell R where the files are stored. The first step is usually to set the working directory. This sets the directory (the folder) that R will look in to find data, and to write files to.
We want to set the working directory to our analysis folder, because that’s where we’ll save our scripts.
The first line of your script will set the working directory. You can do this using the function setwd. This is a function that takes one argument: the location of the folder you want to use as the working directory. In the code below, I’ve set this to my particular location; you should adjust this to point to your analysis folder. The easiest way to find the address of your analysis folder is to:
For me, because I’m using a Mac, it’ll look something like this (note the first character is a tilde, which indicates the current user’s home directory):
setwd("~/Documents/Teaching/CARDIFF/RCourse/IntroToRForAnthropologists/analysis")
On a Windows machine, it might look something like this (note the forward slashes to separate directories - copying the path from Windows Explorer might give you backslashes, which need to be changed to forward slashes):
setwd("C:/Users/sean/Documents/R/R Course/Analysis")
Now copy this line of code to your script. This will mean that the first thing that your analysis script does is set the right working directory. This is important, because the working directory needs to be set each time you restart RStudio. Of course, it will only work for your specific computer, but it’s convenient for now.
If it worked, you should see no complaints.
If it didn’t work, you might get an error. Here are some hints for common errors:
If the error is “cannot change working directory”, then you might have spelled the location of the folder incorrectly.
If you see a “+” sign next to the cursor, did you include both the opening and the closing quote marks? Both the opening and the closing parentheses? If not, click inside the console window, and press the Escape key to cancel the current command. Adjust your code and try again.
If you get an error “could not find function”, then did you spell ‘setwd’ correctly?
Hard-coding the working directory like this only works on one computer. However, the most common mistake students make is not setting the working directory at the start of a session, then getting confused when lines don’t run. If you want, you can wrap the setwd call in the try() function, so that the script won’t stop with an error on someone else’s computer.
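As a rough sketch of what wrapping setwd in try() looks like, the code below uses a made-up folder path that is assumed not to exist. The names myFolder and result are just for illustration.

```r
# A folder that probably does not exist on your computer (hypothetical path)
myFolder <- "~/a/folder/that/does/not/exist/analysis"

# try() catches the error, so the rest of the script keeps running
result <- try(setwd(myFolder), silent = TRUE)

# If setwd failed, 'result' is an object of class "try-error"
if (inherits(result, "try-error")) {
  message("Could not set the working directory; still in: ", getwd())
}
```

If the folder does exist, setwd works as normal and no message is shown.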
If you can’t get this to work, use the following line of code to read the file directly from the web, and skip to section 1.5:
d <- read.csv("https://github.com/seannyD/IntroToRForAnthropology/raw/main/data/RoofShapeData.csv", stringsAsFactors = FALSE)
Now save your script to your analysis folder, named something like “part3.R”. You can do this by clicking File > Save As, and choosing the location, just as you would when saving any other type of file.
Next, we want to load the cultural data into R so we can analyse it. We can “read in” a csv file to an object called a data frame. Copy this line into your script, then run it in the console.
d <- read.csv("../data/RoofShapeData.csv", stringsAsFactors = FALSE)
To run a line from the script in the console, place the cursor on the line and press the “Run” button in the top right, or press Control + Enter (or Command + Enter on a Mac).
NOTE: The path is not the full location of the file; it is a relative path. The current working directory is the analysis folder, and the data we want is in the data folder, which sits alongside it (both are inside the main project folder). The “../” part at the beginning of the path means “go up to the folder above this one”. From there, R finds the folder called “data”, and inside that the file called “RoofShapeData.csv”. This might seem complicated, but now this line of code will work on anyone’s computer (if they have the data in the same folder structure). We’ve just made our code more reproducible!
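To see how a relative path behaves, here is a small self-contained sketch. It builds a throwaway project in a temporary folder, so the folder name “exampleProject” and the file “toy.csv” are made up for illustration and are not part of the course data.

```r
# Build a throwaway project folder with 'analysis' and 'data' subfolders
project <- file.path(tempdir(), "exampleProject")
dir.create(file.path(project, "analysis"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(project, "data"), showWarnings = FALSE)

# Write a tiny made-up csv file into the data folder
toy <- data.frame(society_id = c("A1", "B2"), code = c(1, 2))
write.csv(toy, file.path(project, "data", "toy.csv"), row.names = FALSE)

# Move into the analysis folder, remembering where we were before
oldwd <- setwd(file.path(project, "analysis"))

# "../data/toy.csv" means: go up one folder, then into 'data', then find 'toy.csv'
fileFound <- file.exists("../data/toy.csv")
toyRead <- read.csv("../data/toy.csv", stringsAsFactors = FALSE)

# Go back to the original working directory
setwd(oldwd)
```

Checking a path with file.exists before reading is a quick way to tell whether an error comes from the path or from the file itself.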
NOTE: in versions of R before 4.0.0, read.csv converts strings to factors by default, which can complicate things later on. We turn this off explicitly by using the stringsAsFactors argument, so the code behaves the same way in any version of R. As long as you copy the code above exactly, it should work.
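If you are curious what difference this argument makes, the sketch below reads the same made-up data twice. The data is supplied as text rather than from a file, and the names csvText, asStrings and asFactors are just for illustration.

```r
# A tiny made-up csv, supplied as text instead of a file
csvText <- "society_id,code_label\nA1,Flat\nB2,Conical\nC3,Flat"

# Read the same data with and without converting strings to factors
asStrings <- read.csv(text = csvText, stringsAsFactors = FALSE)
asFactors <- read.csv(text = csvText, stringsAsFactors = TRUE)

# Compare the type of the code_label column in each data frame
class(asStrings$code_label)
class(asFactors$code_label)
```

The first column type is plain text (character), while the second is a factor, which stores a fixed set of categories.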
It’s a good idea to check how big the data frame is, using dim (for ‘dimensions’). Since this is not required for reproducing our analysis, you can just type and run this directly in the console:
dim(d)
## [1] 1134 12
This data frame has 1134 rows and 12 columns.
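The functions nrow and ncol give the same two numbers one at a time. Here is a minimal sketch using a small made-up data frame called toy, rather than the real data:

```r
# A small made-up data frame with 3 rows and 2 columns
toy <- data.frame(
  society_name = c("SocietyA", "SocietyB", "SocietyC"),
  focal_year = c(1890, 1900, 1900)
)

dim(toy)   # number of rows, then number of columns
nrow(toy)  # just the number of rows
ncol(toy)  # just the number of columns
```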
Data frames are 2-dimensional objects - they have rows and columns. Typically, rows are data points and columns are variables. We can look at the data in many ways.
Task: What happens if you type d into the console to see what’s inside the object d?
The function str can be used to view the structure of an object:
str(d)
## 'data.frame': 1134 obs. of 12 variables:
## $ society_id : chr "Ad30" "Ae17" "Aj2" "Ca36" ...
## $ society_name : chr "Digo" "Rega" "Maasi" "Beni-Amer" ...
## $ society_xd_id : chr "xd99" "xd135" "xd395" "xd444" ...
## $ language_glottocode: chr "digo1243" "lega1249" "masa1300" "mans1267" ...
## $ language_name : chr "Digo" "Lega-Shabunda" "Masai" "Mansa'" ...
## $ language_family : chr "Atlantic-Congo" "Atlantic-Congo" "Nilotic" "Afro-Asiatic" ...
## $ variable_id : chr "EA082" "EA082" "EA082" "EA082" ...
## $ code : int 1 1 1 1 1 1 1 1 1 1 ...
## $ code_label : chr "Rounded or semi-cylindrical" "Rounded or semi-cylindrical" "Rounded or semi-cylindrical" "Rounded or semi-cylindrical" ...
## $ focal_year : int 1890 1900 1900 1860 1930 1940 1900 1930 1950 1950 ...
## $ sub_case : chr "" "" "Kisonko or Southern Masai of Tanzania" "" ...
## $ comment : chr "" "" "" "" ...
In the output, each row summarises one of the columns in the data.
The function head can be used to look at just part of the data. Here we see the data for the first 6 rows of the data frame.
head(d)
## society_id society_name society_xd_id language_glottocode
## 1 Ad30 Digo xd99 digo1243
## 2 Ae17 Rega xd135 lega1249
## 3 Aj2 Maasi xd395 masa1300
## 4 Ca36 Beni-Amer xd444 mans1267
## 5 Cb13 Habbaniya xd462 nort3133
## 6 Cb3 Songhai xd480 koyr1242
## language_name language_family variable_id code
## 1 Digo Atlantic-Congo EA082 1
## 2 Lega-Shabunda Atlantic-Congo EA082 1
## 3 Masai Nilotic EA082 1
## 4 Mansa' Afro-Asiatic EA082 1
## 5 North Kordofan Arabic Afro-Asiatic EA082 1
## 6 Koyraboro Senni Songhai Songhay EA082 1
## code_label focal_year sub_case
## 1 Rounded or semi-cylindrical 1890
## 2 Rounded or semi-cylindrical 1900
## 3 Rounded or semi-cylindrical 1900 Kisonko or Southern Masai of Tanzania
## 4 Rounded or semi-cylindrical 1860
## 5 Rounded or semi-cylindrical 1930
## 6 Rounded or semi-cylindrical 1940 Bamba division
## comment
## 1
## 2
## 3
## 4
## 5
## 6
Rows and columns are represented. If there are too many columns to fit in the width of the window, the remaining columns are printed separately below. This can sometimes be a bit confusing. You can also look at the data using the View function (note the capital V, and also note that this won’t update when you change the data).
View(d)
Don’t be afraid to take a peek at the data in a spreadsheet program first, if you want.
Can you tell what each column represents? Here’s a summary:
We can index the data frame in different ways. When indexing 1-dimensional objects like vectors, we used square brackets with a single number. With 2-dimensional objects like data frames, we need to specify 2 indices: what rows we want, and what columns we want.
For example, show row 1, column 2 (society name):
d[ 1 , 2 ]
## [1] "Digo"
For example, show rows 1, 4 and 6, and all columns (for all columns, we leave the second argument blank):
d[ c(1,4,6) , ]
Or all rows for the 6th column:
d[ , 6]
We can also refer to columns by their names. First, we use names to see the names of columns:
names(d)
## [1] "society_id" "society_name" "society_xd_id"
## [4] "language_glottocode" "language_name" "language_family"
## [7] "variable_id" "code" "code_label"
## [10] "focal_year" "sub_case" "comment"
Then we get all rows of the column code_label (the text label of the roof shape), using the special dollar sign character $.
d$code_label
Hmm, that’s a lot of text that just came out of the console! Let’s say we want to look at the first three items in this column. There are many ways of indexing this:
d[1:3, ]$code_label
d[1:3, c("code_label")]
d$code_label[1:3]
The first method above is very common: First we select the rows we are interested in, including all columns. Then we ask for the code_label column using the dollar sign.
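To check that the three methods really give the same answer, here is a small sketch on a made-up data frame (the society names and roof labels in toy are invented for illustration):

```r
# A small made-up data frame
toy <- data.frame(
  society_name = c("SocietyA", "SocietyB", "SocietyC", "SocietyD"),
  code_label = c("Flat", "Conical", "Conical", "Two slopes"),
  stringsAsFactors = FALSE
)

# Three ways of getting the first three items of the code_label column
a <- toy[1:3, ]$code_label
b <- toy[1:3, c("code_label")]
c3 <- toy$code_label[1:3]

# identical() returns TRUE if two objects are exactly the same
identical(a, b)
identical(b, c3)
```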
The function table summarises the counts of each unique value in a variable. So we can summarise a categorical variable such as code_label (the roof shape).
table(d$code_label)
##
## Beehive shaped Conical
## 57 353
## Flat Four slopes
## 95 88
## Hemispherical One slope
## 103 15
## Rounded or semi-cylindrical Semi-hemispherical
## 38 8
## Two slopes
## 377
The table here is printed with each name above its number of occurrences. For example, we can see that 95 societies have “Flat” rooves and 8 have “Semi-hemispherical” rooves.
We can provide multiple arguments to table, which will result in contingency tables. For example, a table of the number of roof shapes for each language family:
table(d$code_label,d$language_family)
##
## Abkhaz-Adyge Afro-Asiatic Ainu Algic Anim
## Beehive shaped 2 0 5 0 0 0
## Conical 3 1 33 0 17 0
## Flat 2 0 24 0 0 0
## Four slopes 3 0 4 0 0 0
## Hemispherical 7 0 13 0 4 0
## One slope 1 0 0 0 0 0
## Rounded or semi-cylindrical 2 0 11 0 1 0
## Semi-hemispherical 0 0 0 0 0 0
## Two slopes 12 0 9 1 4 1
##
## Araucanian Arawakan Athabaskan-Eyak-Tlingit
## Beehive shaped 0 0 0
## Conical 0 1 11
## Flat 0 0 0
## Four slopes 1 2 1
## Hemispherical 0 0 8
## One slope 0 0 0
## Rounded or semi-cylindrical 0 2 0
## Semi-hemispherical 0 0 0
## Two slopes 0 4 11
##
## Atlantic-Congo Austroasiatic Austronesian Aymaran
## Beehive shaped 27 1 3 0
## Conical 137 0 2 0
## Flat 22 0 1 0
## Four slopes 18 0 12 0
## Hemispherical 4 0 0 0
## One slope 1 2 0 0
## Rounded or semi-cylindrical 2 0 2 0
## Semi-hemispherical 0 0 0 0
## Two slopes 82 12 73 1
##
## Barbacoan Basque Blue Nile Mao Bororoan Caddoan
## Beehive shaped 0 0 1 0 2
## Conical 0 0 0 0 0
## Flat 0 0 0 0 0
## Four slopes 1 0 0 0 0
## Hemispherical 0 0 0 0 2
## One slope 0 0 0 0 0
## Rounded or semi-cylindrical 0 0 0 0 0
## Semi-hemispherical 0 0 0 0 0
## Two slopes 0 1 0 2 0
##
## Cariban Central Sudanic Chibchan Chicham
## Beehive shaped 1 1 0 0
## Conical 4 11 3 0
## Flat 0 0 0 0
## Four slopes 2 0 0 1
## Hemispherical 0 0 0 0
## One slope 0 0 0 0
## Rounded or semi-cylindrical 1 0 0 0
## Semi-hemispherical 0 0 0 0
## Two slopes 3 2 2 0
##
## Chimakuan Chinookan Chocoan Chonan
## Beehive shaped 0 0 0 0
## Conical 0 0 1 0
## Flat 0 0 0 0
## Four slopes 0 0 0 0
## Hemispherical 0 1 0 0
## One slope 0 0 0 0
## Rounded or semi-cylindrical 0 0 0 0
## Semi-hemispherical 0 0 0 2
## Two slopes 1 1 0 0
##
## Chukotko-Kamchatkan Chumashan Cochimi-Yuman
## Beehive shaped 0 0 0
## Conical 1 0 0
## Flat 1 0 0
## Four slopes 0 0 1
## Hemispherical 0 1 8
## One slope 0 0 0
## Rounded or semi-cylindrical 0 0 2
## Semi-hemispherical 0 0 0
## Two slopes 1 0 1
##
## Coosan Dizoid Dogon Dravidian Eleman Eskimo-Aleut
## Beehive shaped 0 0 0 0 0 0
## Conical 0 1 0 1 0 1
## Flat 0 0 1 0 0 1
## Four slopes 0 0 0 2 0 1
## Hemispherical 0 0 0 0 0 10
## One slope 0 0 0 0 0 0
## Rounded or semi-cylindrical 0 0 0 1 0 0
## Semi-hemispherical 0 0 0 0 0 2
## Two slopes 1 0 0 6 1 2
##
## Furan Goilalan Great Andamanese Greater Kwerba
## Beehive shaped 0 0 0 0
## Conical 1 0 0 0
## Flat 0 0 0 0
## Four slopes 0 0 0 0
## Hemispherical 0 0 0 0
## One slope 0 0 1 0
## Rounded or semi-cylindrical 0 0 0 0
## Semi-hemispherical 0 0 0 0
## Two slopes 0 1 0 1
##
## Guahiboan Guaicuruan Haida Heibanic Hmong-Mien
## Beehive shaped 0 0 0 0 0
## Conical 0 0 0 5 0
## Flat 0 0 0 0 0
## Four slopes 0 0 0 0 0
## Hemispherical 0 0 0 0 0
## One slope 0 1 0 0 0
## Rounded or semi-cylindrical 0 0 0 0 0
## Semi-hemispherical 0 0 0 0 0
## Two slopes 1 2 1 0 1
##
## Huavean Huitotoan Ijoid Indo-European Iroquoian
## Beehive shaped 0 0 0 0 0
## Conical 0 0 0 0 0
## Flat 0 0 0 13 0
## Four slopes 0 0 0 6 0
## Hemispherical 0 0 0 0 0
## One slope 0 0 0 0 0
## Rounded or semi-cylindrical 0 0 0 0 2
## Semi-hemispherical 0 0 0 0 0
## Two slopes 1 1 1 26 1
##
## Japonic Jodi-Saliban Kadugli-Krongo Kartvelian
## Beehive shaped 0 1 0 0
## Conical 0 0 2 0
## Flat 0 0 0 3
## Four slopes 2 0 0 0
## Hemispherical 0 0 0 0
## One slope 0 0 0 0
## Rounded or semi-cylindrical 0 0 0 0
## Semi-hemispherical 0 0 0 0
## Two slopes 0 0 0 0
##
## Kawesqar Keresan Khoe-Kwadi Kiowa-Tanoan Koiarian
## Beehive shaped 0 0 1 0 0
## Conical 0 0 0 1 0
## Flat 0 5 0 6 0
## Four slopes 0 0 0 0 0
## Hemispherical 1 0 1 0 0
## One slope 0 0 0 0 0
## Rounded or semi-cylindrical 0 0 0 0 0
## Semi-hemispherical 0 0 1 0 0
## Two slopes 0 0 0 0 2
##
## Kolopom Koman Koreanic Kxa Lencan
## Beehive shaped 1 0 0 0 0
## Conical 0 1 0 0 0
## Flat 0 0 0 0 0
## Four slopes 0 0 1 0 1
## Hemispherical 0 0 0 1 0
## One slope 0 0 0 0 0
## Rounded or semi-cylindrical 0 0 0 0 0
## Semi-hemispherical 0 0 0 0 0
## Two slopes 0 0 0 0 0
##
## Lower Sepik-Ramu Maiduan Mailuan Mande Maningrida
## Beehive shaped 0 0 0 0 0
## Conical 0 2 0 14 0
## Flat 0 0 0 7 0
## Four slopes 0 0 0 1 0
## Hemispherical 0 0 0 0 0
## One slope 0 0 0 0 0
## Rounded or semi-cylindrical 0 0 0 0 1
## Semi-hemispherical 0 0 0 1 0
## Two slopes 1 0 1 0 0
##
## Matacoan Mayan Misumalpan Miwok-Costanoan
## Beehive shaped 1 0 0 0
## Conical 1 0 0 0
## Flat 0 0 0 0
## Four slopes 0 0 0 0
## Hemispherical 0 0 0 2
## One slope 0 0 0 0
## Rounded or semi-cylindrical 0 0 0 0
## Semi-hemispherical 0 0 0 0
## Two slopes 0 5 1 0
##
## Mixe-Zoque Mongolic Morehead-Wasur Muskogean
## Beehive shaped 0 0 0 0
## Conical 0 2 0 1
## Flat 0 0 0 0
## Four slopes 2 0 0 0
## Hemispherical 0 1 0 0
## One slope 0 0 0 0
## Rounded or semi-cylindrical 0 0 0 0
## Semi-hemispherical 0 0 0 0
## Two slopes 0 1 1 2
##
## Nakh-Daghestanian Nambiquaran Narrow Talodi Ndu
## Beehive shaped 0 1 0 0
## Conical 0 0 1 0
## Flat 1 0 0 0
## Four slopes 0 0 0 0
## Hemispherical 0 0 0 0
## One slope 0 0 0 0
## Rounded or semi-cylindrical 0 0 0 0
## Semi-hemispherical 0 0 0 0
## Two slopes 0 0 0 2
##
## Nilotic Nubian Nuclear Torricelli
## Beehive shaped 3 0 0
## Conical 18 1 0
## Flat 1 1 0
## Four slopes 0 0 0
## Hemispherical 6 1 0
## One slope 0 0 0
## Rounded or semi-cylindrical 1 0 0
## Semi-hemispherical 0 0 0
## Two slopes 0 0 1
##
## Nuclear Trans New Guinea Nuclear-Macro-Je Nyimang
## Beehive shaped 0 1 0
## Conical 2 0 1
## Flat 0 0 0
## Four slopes 1 2 0
## Hemispherical 1 1 0
## One slope 0 0 0
## Rounded or semi-cylindrical 1 2 0
## Semi-hemispherical 0 0 0
## Two slopes 5 2 0
##
## Otomanguean Palaihnihan Pama-Nyungan Pano-Tacanan
## Beehive shaped 0 0 0 0
## Conical 0 1 0 0
## Flat 0 0 0 0
## Four slopes 0 1 0 1
## Hemispherical 0 0 0 0
## One slope 0 0 2 0
## Rounded or semi-cylindrical 0 0 1 0
## Semi-hemispherical 0 0 0 0
## Two slopes 3 0 0 2
##
## Peba-Yagua Pomoan Quechuan Sahaptian Saharan
## Beehive shaped 0 0 0 0 1
## Conical 0 1 0 1 2
## Flat 0 0 0 0 0
## Four slopes 1 0 0 1 0
## Hemispherical 0 2 0 0 0
## One slope 0 0 0 0 0
## Rounded or semi-cylindrical 0 0 0 0 1
## Semi-hemispherical 0 0 0 0 0
## Two slopes 0 0 1 2 0
##
## Salishan Sepik Shastan Sino-Tibetan Siouan
## Beehive shaped 0 0 0 0 0
## Conical 3 0 0 0 4
## Flat 0 0 0 1 0
## Four slopes 4 0 0 2 0
## Hemispherical 0 0 0 0 6
## One slope 3 0 0 0 0
## Rounded or semi-cylindrical 0 0 0 0 0
## Semi-hemispherical 0 0 0 0 0
## Two slopes 11 1 1 19 1
##
## Songhay South Bougainville South Omotic Surmic
## Beehive shaped 0 0 0 2
## Conical 1 0 6 1
## Flat 0 0 0 0
## Four slopes 0 0 0 0
## Hemispherical 0 0 0 0
## One slope 0 0 0 0
## Rounded or semi-cylindrical 1 0 0 0
## Semi-hemispherical 0 0 0 0
## Two slopes 0 1 0 0
##
## Ta-Ne-Omotic Tai-Kadai Tamaic Tarascan
## Beehive shaped 2 0 0 0
## Conical 5 0 0 0
## Flat 0 0 0 0
## Four slopes 0 0 0 1
## Hemispherical 0 0 1 0
## One slope 0 0 0 0
## Rounded or semi-cylindrical 0 1 0 0
## Semi-hemispherical 0 0 0 0
## Two slopes 0 1 0 0
##
## Ticuna-Yuri Totonacan Tsimshian Tucanoan Tungusic
## Beehive shaped 0 0 0 0 0
## Conical 0 0 0 0 3
## Flat 0 0 0 0 0
## Four slopes 1 1 0 0 0
## Hemispherical 0 0 0 0 0
## One slope 0 0 0 0 0
## Rounded or semi-cylindrical 0 0 0 0 0
## Semi-hemispherical 0 0 0 0 0
## Two slopes 0 0 1 1 4
##
## Tupian Turkic Tuu Uralic Uto-Aztecan Wakashan
## Beehive shaped 0 0 0 0 0 0
## Conical 0 0 0 4 32 0
## Flat 0 2 0 0 2 0
## Four slopes 3 2 0 1 3 0
## Hemispherical 0 1 0 0 17 0
## One slope 0 0 0 0 0 1
## Rounded or semi-cylindrical 2 0 0 0 0 0
## Semi-hemispherical 0 0 1 0 0 0
## Two slopes 5 6 0 11 7 4
##
## Western Tasmanian Wintuan Yanomamic Yeniseian
## Beehive shaped 0 0 0 0
## Conical 0 2 1 0
## Flat 0 0 0 1
## Four slopes 0 0 1 0
## Hemispherical 0 1 0 0
## One slope 0 0 2 0
## Rounded or semi-cylindrical 0 0 0 0
## Semi-hemispherical 1 0 0 0
## Two slopes 0 0 0 0
##
## Yokutsan Yukaghir Yuki-Wappo Zamucoan
## Beehive shaped 0 0 0 0
## Conical 1 1 3 0
## Flat 0 0 0 0
## Four slopes 0 0 0 0
## Hemispherical 1 0 1 0
## One slope 0 0 0 1
## Rounded or semi-cylindrical 1 0 0 0
## Semi-hemispherical 0 0 0 0
## Two slopes 0 0 0 0
For example, Uto-Aztecan has mainly Conical and Hemispherical rooves.
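A large contingency table can be hard to read in the console. One option is to store the table in a variable and index it by name, in the same way as a data frame. This is a minimal sketch using made-up data (the family names “FamilyA” and “FamilyB” are invented):

```r
# A small made-up data frame of roof shapes and language families
toy <- data.frame(
  roof = c("Flat", "Conical", "Conical", "Flat", "Conical"),
  family = c("FamilyA", "FamilyA", "FamilyB", "FamilyB", "FamilyB"),
  stringsAsFactors = FALSE
)

# Store the contingency table in a variable
tab <- table(toy$roof, toy$family)

# All roof shape counts for one family (all rows, one named column)
tab[, "FamilyB"]

# A single cell: the number of Conical rooves in FamilyB
tab["Conical", "FamilyB"]
```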
Our roof shape data observations come from focal years as early as 2000 BC. However, our rainfall data only relates to more recent times. Let’s say we want to look at all data where the focal year is greater than 1850.
Task: Make a boolean vector which is TRUE if the focal_year variable is greater than 1850, and FALSE otherwise. Assign this to a variable named recentFocalYears (create a variable called recentFocalYears and store the boolean vector inside it).
Task: Make a table of counts of roof shape types for languages in the recent data. Hint: you should index the rows with the variable recentFocalYears and the code_label column.
In the task above, we made a variable called recentFocalYears. Instead of doing this, we can add a new column directly to the data frame as follows:
d$recentFocalYears <- d$focal_year > 1850
head(d)
## society_id society_name society_xd_id language_glottocode
## 1 Ad30 Digo xd99 digo1243
## 2 Ae17 Rega xd135 lega1249
## 3 Aj2 Maasi xd395 masa1300
## 4 Ca36 Beni-Amer xd444 mans1267
## 5 Cb13 Habbaniya xd462 nort3133
## 6 Cb3 Songhai xd480 koyr1242
## language_name language_family variable_id code
## 1 Digo Atlantic-Congo EA082 1
## 2 Lega-Shabunda Atlantic-Congo EA082 1
## 3 Masai Nilotic EA082 1
## 4 Mansa' Afro-Asiatic EA082 1
## 5 North Kordofan Arabic Afro-Asiatic EA082 1
## 6 Koyraboro Senni Songhai Songhay EA082 1
## code_label focal_year sub_case
## 1 Rounded or semi-cylindrical 1890
## 2 Rounded or semi-cylindrical 1900
## 3 Rounded or semi-cylindrical 1900 Kisonko or Southern Masai of Tanzania
## 4 Rounded or semi-cylindrical 1860
## 5 Rounded or semi-cylindrical 1930
## 6 Rounded or semi-cylindrical 1940 Bamba division
## comment recentFocalYears
## 1 TRUE
## 2 TRUE
## 3 TRUE
## 4 TRUE
## 5 TRUE
## 6 TRUE
How many societies are coded between 1900 and 1920? This requires two boolean tests.
We can combine boolean values with boolean operators like & (and) and | (or):
# AND
c(TRUE, TRUE, FALSE) & c(TRUE, FALSE, FALSE)
## [1] TRUE FALSE FALSE
# OR
c(TRUE, TRUE, FALSE) | c(TRUE, FALSE, FALSE)
## [1] TRUE TRUE FALSE
The code below makes a boolean vector which is TRUE only when the focal year is greater than 1899 and less than 1921. We can then take the sum of this vector to count the number of TRUE values (TRUE is treated as 1 and FALSE as 0).
codedBetween1900And1920 <- d$focal_year > 1899 & d$focal_year < 1921
sum(codedBetween1900And1920)
## [1] 376
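The same kind of test can be written with the operators >= (greater than or equal to) and <= (less than or equal to), which for whole years gives the same result. Here is a small sketch on a made-up vector of years, so you can check the counts by eye:

```r
# A made-up vector of focal years
years <- c(1890, 1900, 1910, 1920, 1930)

# TRUE for years from 1900 to 1920, including both ends
inRange <- years >= 1900 & years <= 1920
sum(inRange)

# TRUE for years before 1900 OR after 1920
outsideRange <- years < 1900 | years > 1920
sum(outsideRange)
```

Three of the years fall inside the range and two fall outside it.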
Merging two sources of data is one of the most powerful skills you can gain from coding. This will allow you to compare data from different sources. To achieve this, the data must contain information on how to cross-reference. This is typically an ID (identification) which will be a unique combination of letters and numbers for each society. For example, eHRAF uses two letters and two numbers to refer to cultures (e.g. Tudor Britain is “ES14”), and Glottolog uses four letters and four numbers to refer to languages (e.g. Welsh is “wels1247”). We can use R to match data across two files and join them together.
We’d like to compare the roof shapes of societies to the rainfall for that area. But the data for rainfall exists in another file. Both files can be cross-referenced through the society_id variable, which refers to the society that has been coded.
We want to match the society_id in the roof shape data to the society_id in the rainfall data, then add the numeric rainfall data into the roof shape data frame.
Task: Load the data in ../data/RainfallData_onlyComplete.csv into a data frame called rainfallData. This includes the Monthly Mean Precipitation.
Task: What are the names and formats of the variables in rainfallData? What is the name of the column that includes the numeric amount of rainfall?
It is possible to match data frames using rownames. However, this assumes that row ids are unique and that there are no missing values. So this is not recommended.
match
match is a function which takes two arguments: a vector of things to be matched, and a vector of values to be matched against. For each item of the first vector, it returns the index of that item in the second vector:
match(c(3,5,2), c(2,3,5))
## [1] 2 3 1
That is, 3 is the second item in the second vector, 5 is the 3rd item in the second vector and 2 is the 1st item in the second vector. This applies to any type of data:
match(c("c","b","b",'a'), c("a","b","c"))
## [1] 3 2 2 1
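It is worth knowing what happens when an item has no match. As the sketch below shows with made-up values, match returns NA (a missing value) for any item that does not appear in the second vector:

```r
# "z" does not appear in the second vector
positions <- match(c("a", "z", "b"), c("a", "b", "c"))
print(positions)

# is.na() tells us which items could not be matched
is.na(positions)
```

Here the second position is NA, because “z” is not in the second vector.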
If we have three society ids:
sx <- c("Ad30", "Ae17", "Aj2")
We can use match to get the rainfall of these societies. The code below finds where the ids in sx appear in the society_id column of rainfallData. These are the row numbers where the society ids correspond. It then uses these row numbers to index rainfallData and return the code variable (which holds the rainfall values):
# example of what match returns:
match(sx,rainfallData$society_id)
## [1] 955 1699 563
# Get rainfall for each member of sx
rainfallData[match(sx,rainfallData$society_id), ]$code
## [1] 96037.26 193073.32 62002.16
We can match every item in d and make a new column called rainfall this way:
d$rainfall <- rainfallData[match(d$society_id, rainfallData$society_id) , ]$code
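To see how this works on a scale small enough to check by eye, here is a sketch with two made-up data frames (roofs and rain are invented for illustration, and the ids and rainfall values are not real data). Note that one society has no rainfall data:

```r
# Made-up roof shape data for three societies
roofs <- data.frame(
  society_id = c("S1", "S2", "S3"),
  roof = c("Flat", "Conical", "Flat"),
  stringsAsFactors = FALSE
)

# Made-up rainfall data, in a different order and with no entry for "S2"
rain <- data.frame(
  society_id = c("S3", "S1"),
  code = c(50.5, 120),
  stringsAsFactors = FALSE
)

# Find the matching rows, then copy the rainfall values across
roofs$rainfall <- rain[match(roofs$society_id, rain$society_id), ]$code
roofs

# Count how many societies have no rainfall value
sum(is.na(roofs$rainfall))
```

The society without a match gets NA in the new column, so it is a good idea to count the missing values after matching.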
merge
Two data frames can be merged using merge and specifying which columns should be used to align the data. This is simpler to write, but results in a data frame with all information from both sources. This can be a bit tricky to use if that’s not what you want to do.
To use merge, it’s best to have a primary data frame with all the current columns, and a secondary data frame with only the cross-referencing variable and the new target variable (or all the variables you want to transfer). Also note that the rainfall data is stored in the variable “code”. Our roof shape data already has a variable like this, so we want to avoid confusion. Ideally, the new target variable should already be renamed in the secondary data frame:
# Copy the 'code' column to a new column called 'rainfall'
rainfallData$rainfall <- rainfallData$code
# Make a new secondary data frame with only the columns we want
rf2 <- rainfallData[,c("society_id","rainfall")]
Now we can merge the two data sources. The code below makes a new variable d2 which merges d and rf2 by matching on the society_id column. Note that the name of the matching column can be different in each data frame, so we have to give both separately (by.x for the first data frame and by.y for the second).
d2 <- merge(d, rf2, by.x='society_id', by.y='society_id')
The last column of d2 should now have the rainfall data. Have a look for yourself, e.g.:
head(d2)
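One thing to watch out for: by default, merge only keeps rows that have a match in both data frames. The sketch below uses two made-up data frames (roofs and rain, invented for illustration) to show the difference made by the all.x argument, which keeps every row of the first data frame:

```r
# Made-up roof shape data for three societies
roofs <- data.frame(
  society_id = c("S1", "S2", "S3"),
  roof = c("Flat", "Conical", "Flat"),
  stringsAsFactors = FALSE
)

# Made-up rainfall data with no entry for "S2"
rain <- data.frame(
  society_id = c("S3", "S1"),
  rainfall = c(50.5, 120),
  stringsAsFactors = FALSE
)

# Default merge: societies without rainfall data are dropped
onlyMatched <- merge(roofs, rain, by.x = "society_id", by.y = "society_id")
nrow(onlyMatched)

# all.x = TRUE: keep every row of 'roofs', filling gaps with NA
keepAll <- merge(roofs, rain, by.x = "society_id", by.y = "society_id", all.x = TRUE)
nrow(keepAll)
```

Comparing the number of rows before and after a merge is a quick check that no societies were lost along the way.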