1 Recap

In part 1, we learned how to do a number of things in R:

Assignment (saving values to variables like ‘x’ so we can call them later).

x <- 1

Different mathematical operators like +, -, / (division) and * (multiplication).

How to make vectors (lists of numbers or other items):

numbersOneToOneHundered <- 1:100
myNumbers <- c(1,2,5,7)

How to apply functions:

sumOfMyNumbers <- sum(myNumbers)

Indexing (extracting specific items from a list):

secondItemByNumber <- myNumbers[2]
secondItemByBooleanVector <- myNumbers[c(FALSE,TRUE,FALSE,FALSE)]

In part 2 we learned how to install packages for specific functions (install.packages("nameOfThePackage")), load packages (library(nameOfThePackage)), and how to get help with errors.

There’s a “cheatsheet” of all the commands we use in this course here.

2 Reading data

Before we can process data with R, we need to load it into memory (also called “reading” the data). In RStudio, you can click File > Import Dataset, and you’ll see a menu where you can choose a file to load, just like in programs like Word or Excel. This is handy, but it’s more productive to write some code to load our data. There are several reasons for this:

  • The only way to load data is to run code. The “Import Dataset” button just creates some code automatically and then runs the code (the code will appear in the Console if you’re interested).
  • The “Import Dataset” button is a special feature of RStudio. Not all platforms for running R have this.
  • We want our script to reproduce all the steps in our analysis. This includes details about which file to import and exactly how to import it. We don’t want to keep this information in our heads - that’s stressful!
  • We want to reduce the number of buttons we have to click to run our script. Writing the code takes a bit more time up front, but you’ll be working on scripts over days or weeks. Running one line of code is much easier than clicking around to find the file each time.
  • You might need to import file types that aren’t supported by the “Import Dataset” button.

3 Reading data from a csv file

In this tutorial, we’ll look at some data from the D-PLACE database on the shape of rooves across lots of societies. The data is originally from the Ethnographic Atlas. The goal is to see whether the amount of rainfall is related to the shape of your roof (more rainfall should mean steeper rooves).

To do this, we’ll load some data from a file. This is as easy as running one line of code which refers to the location of the file on your computer. For example, if I have a file called “RoofShapeData.csv” on my desktop, I could load it with this code:

d <- read.csv("~/Desktop/RoofShapeData.csv")

However, it’s important to keep your project files organised - we don’t want files from multiple projects all on the desktop! So we’ll have to do some preparation, which involves setting up a dedicated folder for your project and then telling R that we’ve done this.

3.1 Getting the data

First, create a folder on your computer to store data and scripts for this tutorial. I’ve put mine in the folder “~/Documents/Teaching/CARDIFF/RCourse/IntroToR”. Inside this folder create:

  • A folder called “data” to store data files.
  • A folder called “analysis” to store scripts.
  • A folder called “results” to store results.
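
If you would rather create these folders from R than from your file browser, something like the following sketch should work. The project location is an assumption: a temporary folder is used here so the example is safe to run anywhere, and you would replace it with your own project folder.

```r
# Hypothetical project location: a temporary folder, so nothing on your
# computer is overwritten. Replace this with your own project folder.
project <- file.path(tempdir(), "IntroToR")

# Full paths of the three sub-folders used in this tutorial
folders <- file.path(project, c("data", "analysis", "results"))

# Create each folder (recursive = TRUE also creates the project folder itself)
for (f in folders) {
  dir.create(f, recursive = TRUE, showWarnings = FALSE)
}

# Check that all three folders now exist
dir.exists(folders)
```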

Next, download the data files (you may need to right-click these links and choose “save target as”):

(these files come from this GitHub repository): https://github.com/seannyD/IntroToRForAnthropology/tree/main/data.

Put the data files in the folder called “data”.

Note: If you’re using Windows, your browser may want to save these files as “.txt” files, rather than “.csv” files. That’s fine - the data won’t change - but you might get some errors when it comes to loading the files in R (because the filenames are different). To fix this, I recommend that you:

  • Make Windows show file extensions in the File Explorer (see here).
  • Download the files as “.txt” files.
  • In File Explorer, find the files and rename them to end in “.csv”.
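
The renaming step can also be done from R. This is only a sketch: the file below is a made-up stand-in created in a temporary folder, and you would point the code at your own downloaded file instead.

```r
# Create a stand-in for a downloaded file in a fresh temporary folder
# (hypothetical file name and contents, for illustration only)
downloadFolder <- tempfile("downloads_")
dir.create(downloadFolder)
downloaded <- file.path(downloadFolder, "RoofShapeData.txt")
writeLines("society_id,code", downloaded)

# Swap the ".txt" ending for ".csv", then rename the file
renamed <- sub("\\.txt$", ".csv", downloaded)
file.rename(downloaded, renamed)

# The folder should now contain the ".csv" version only
list.files(downloadFolder)
```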

These files are the same kinds as you can download from D-PLACE. However, I’ve edited them slightly to make things easier for this tutorial.

3.2 Setting the working directory

Make a new R script file in RStudio (File > New File > R script).

When we read data, we need to tell R where the files are stored. The first step is usually to set the working directory. This sets the directory (the folder) that R will look in to find data, and to write files to.

We want to set the working directory to our analysis folder, because that’s where we’ll save our scripts.

The first line of your script will set the working directory. You can do this using the function setwd. This is a function that takes one argument: the location of the folder you want to use as the working directory. In the code below, I’ve set this to my particular location. You should adjust this to point to your analysis folder. The easiest way to find the address of your analysis folder is to:

  • Open the “Session” menu in RStudio (top of the screen, to the right of File, Edit, Code etc.).
  • Click “Set Working Directory”, then “Choose Directory”
  • Find your analysis directory and click “open”
  • RStudio will add a line of code to the console and run it.

For me, because I’m using a Mac, it’ll look something like this (note the first character is a tilde, which indicates the current user’s home directory):

setwd("~/Documents/Teaching/CARDIFF/RCourse/IntroToRForAnthropologists/analysis")

On a Windows machine, it might look something like this (note the forward slashes to separate directories - copying the path from Windows Explorer might give you backslashes, which need to be changed to forward slashes):

setwd("C:/Users/sean/Documents/R/R Course/Analysis")

Now copy this line of code to your script. This will mean that the first thing that your analysis script does is set the right working directory. This is important, because the working directory needs to be set each time you restart RStudio. Of course, it will only work for your specific computer, but it’s convenient for now.

If it worked, you should see no complaints.
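
If you want to confirm which folder R is currently using, the function getwd reports it. In this small sketch, tempdir() stands in for your analysis folder (an assumption made so the example runs on any computer):

```r
# Remember the current working directory so we can restore it afterwards
old <- getwd()

# tempdir() stands in for your analysis folder in this example
setwd(tempdir())

# getwd() reports the directory R is currently working in
getwd()

# Put things back as they were
setwd(old)
```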

3.3 Errors

3.3.1 Errors

If it didn’t work, you might get an error. The hints below cover the most common problems.

3.3.2 cannot change working directory

If the error is “cannot change working directory”, then you might have spelled the location of the folder incorrectly.
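
One way to track down a misspelled location is to ask R whether the folder exists before calling setwd. The paths below are made up for illustration; substitute the path you are trying to use.

```r
# A path with a deliberate spelling mistake (hypothetical folder name)
wrongPath <- file.path(tempdir(), "analysiss")

# dir.exists() returns TRUE only if the folder is really there
dir.exists(wrongPath)
dir.exists(tempdir())

# list.files() shows what is inside a folder, which helps to spot typos
list.files(tempdir())
```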

3.3.3 Nothing happens

If you see a “+” sign next to the cursor, did you include both the opening and the closing quote marks? Both the opening and the closing parentheses? If not, click inside the console window, and press the Escape key to cancel the current command. Adjust your code and try again.

3.3.4 could not find function

If you get the error “could not find function”, check that you spelled ‘setwd’ correctly.

3.3.5 I hate you for hard coding a path

Yeah, but the most common mistake students make is not setting the working directory at the start of a session, then getting confused when lines don’t run. If you want, you can surround the hard-coding with the try() function, so that it won’t break on someone else’s computer.
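
As a rough sketch of that idea, the code below wraps setwd in try. The path is a hypothetical one that does not exist, so setwd fails, but the script carries on instead of stopping.

```r
# Hypothetical path that only exists on the author's computer
myPath <- file.path(tempdir(), "folder-that-does-not-exist")

# try() catches the error; silent = TRUE hides the error message
result <- try(setwd(myPath), silent = TRUE)

# We can check whether the attempt failed
failed <- inherits(result, "try-error")
failed
```

The trade-off is that a failed setwd no longer stops the script, so later lines that rely on the working directory may fail instead.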

3.3.6 Nothing works!

If you can’t get this to work, use the following line of code to read the file directly from the web, and skip to section 3.5 (Summarising data):

d <- read.csv("https://github.com/seannyD/IntroToRForAnthropology/raw/main/data/RoofShapeData.csv", stringsAsFactors = FALSE)

3.4 Reading the data

Now save your script to your analysis folder, named something like “part3.R”. You can do this by clicking File > Save As, and choosing the location just as you would when saving other types of files.

Next, we want to load the cultural data into R so we can analyse it. We can “read in” a csv file to an object called a data frame. Copy this line into your script, then run it in the console.

d <- read.csv("../data/RoofShapeData.csv", stringsAsFactors = FALSE)

To run a line from the script in the console, place the cursor on the line and press the “Run” button in the top right, or press Control + Enter (or Command + Enter on a Mac).

NOTE: The path is not the full location of the file; it is a relative path. The current working directory is the analysis folder, and the data we want is in the data folder, which sits next to the analysis folder inside the project folder. The “../” part at the beginning of the path means “go up one folder from the working directory”. So the whole path means “look in the folder above this one, find the folder called “data”, and inside that find the file called “RoofShapeData.csv””. This might seem complicated, but now this line of code will work on anyone’s computer (if they have the data and the same folder layout). We’ve just made our code more reproducible!
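
To see how a relative path is resolved, here is a self-contained sketch. It builds a small stand-in project in a temporary folder; the folder layout mirrors the tutorial, but the file “example.csv” and its contents are made up for illustration.

```r
# Build a small stand-in project in a temporary folder
project <- tempfile("project_")
dir.create(file.path(project, "data"), recursive = TRUE)
dir.create(file.path(project, "analysis"))

# Write a tiny example csv file into the data folder (made-up values)
toyData <- data.frame(id = 1:3, shape = c("Flat", "Conical", "Flat"))
write.csv(toyData, file.path(project, "data", "example.csv"), row.names = FALSE)

# Work from the analysis folder, as in the tutorial
old <- getwd()
setwd(file.path(project, "analysis"))

# "../data/example.csv": go up one folder, into "data", then find the file
toy <- read.csv("../data/example.csv", stringsAsFactors = FALSE)

# normalizePath() shows the full location that the relative path points to
normalizePath("../data/example.csv")

# Restore the original working directory
setwd(old)
```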

NOTE: in versions of R before 4.0.0, read.csv converts strings to factors by default, which can complicate things later on. We turn this off explicitly using the stringsAsFactors argument, so the code behaves the same way in any version of R. As long as you copy the code above exactly, it should work.
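
The effect of the stringsAsFactors argument can be seen with a tiny made-up dataset. In this sketch the csv contents are supplied through the text argument of read.csv, so no file is needed.

```r
# A tiny csv supplied as text (made-up values, for illustration only)
csvText <- "name,shape\nA,Flat\nB,Conical\nC,Flat"

# Read the same data twice, with the argument off and on
asStrings <- read.csv(text = csvText, stringsAsFactors = FALSE)
asFactors <- read.csv(text = csvText, stringsAsFactors = TRUE)

# Compare the type of the "shape" column in each version
class(asStrings$shape)
class(asFactors$shape)
```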

It’s a good idea to check how big the data frame is, using dim (for ‘dimensions’). Since this step is not required to reproduce our analysis, you can just type and run it directly in the console:

dim(d)
## [1] 1134   12

This data frame has 1134 rows and 12 columns.
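
The two numbers can also be retrieved separately. The sketch below uses a small made-up data frame (not the roof data) to show dim alongside the related functions nrow and ncol.

```r
# A small made-up data frame with 4 rows and 2 columns
toy <- data.frame(id = 1:4,
                  shape = c("Flat", "Conical", "Flat", "One slope"))

# dim() gives the number of rows, then the number of columns
dim(toy)

# nrow() and ncol() give each number on its own
nrow(toy)
ncol(toy)
```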

3.5 Summarising data

Data frames are 2-dimensional objects - they have rows and columns. Typically, rows are data points and columns are variables. We can look at the data in many ways.

Task: What happens if you type d into the console to see what’s inside the object d?

The function str can be used to view the structure of an object:

str(d)
## 'data.frame':    1134 obs. of  12 variables:
##  $ society_id         : chr  "Ad30" "Ae17" "Aj2" "Ca36" ...
##  $ society_name       : chr  "Digo" "Rega" "Maasi" "Beni-Amer" ...
##  $ society_xd_id      : chr  "xd99" "xd135" "xd395" "xd444" ...
##  $ language_glottocode: chr  "digo1243" "lega1249" "masa1300" "mans1267" ...
##  $ language_name      : chr  "Digo" "Lega-Shabunda" "Masai" "Mansa'" ...
##  $ language_family    : chr  "Atlantic-Congo" "Atlantic-Congo" "Nilotic" "Afro-Asiatic" ...
##  $ variable_id        : chr  "EA082" "EA082" "EA082" "EA082" ...
##  $ code               : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ code_label         : chr  "Rounded or semi-cylindrical" "Rounded or semi-cylindrical" "Rounded or semi-cylindrical" "Rounded or semi-cylindrical" ...
##  $ focal_year         : int  1890 1900 1900 1860 1930 1940 1900 1930 1950 1950 ...
##  $ sub_case           : chr  "" "" "Kisonko or Southern Masai of Tanzania" "" ...
##  $ comment            : chr  "" "" "" "" ...

In the output, each row summarises one of the columns in the data.

The function head can be used to look at just part of the data. Here we see the data for the first 6 rows of the data frame.

head(d)
##   society_id society_name society_xd_id language_glottocode
## 1       Ad30         Digo          xd99            digo1243
## 2       Ae17         Rega         xd135            lega1249
## 3        Aj2        Maasi         xd395            masa1300
## 4       Ca36    Beni-Amer         xd444            mans1267
## 5       Cb13    Habbaniya         xd462            nort3133
## 6        Cb3      Songhai         xd480            koyr1242
##             language_name language_family variable_id code
## 1                    Digo  Atlantic-Congo       EA082    1
## 2           Lega-Shabunda  Atlantic-Congo       EA082    1
## 3                   Masai         Nilotic       EA082    1
## 4                  Mansa'    Afro-Asiatic       EA082    1
## 5   North Kordofan Arabic    Afro-Asiatic       EA082    1
## 6 Koyraboro Senni Songhai         Songhay       EA082    1
##                    code_label focal_year                              sub_case
## 1 Rounded or semi-cylindrical       1890                                      
## 2 Rounded or semi-cylindrical       1900                                      
## 3 Rounded or semi-cylindrical       1900 Kisonko or Southern Masai of Tanzania
## 4 Rounded or semi-cylindrical       1860                                      
## 5 Rounded or semi-cylindrical       1930                                      
## 6 Rounded or semi-cylindrical       1940                        Bamba division
##   comment
## 1        
## 2        
## 3        
## 4        
## 5        
## 6

The output shows both rows and columns. If there are too many columns to fit in the width of the window, the remaining columns are printed in separate blocks below, which can be a bit confusing. You can also look at the data using the View function (note the capital V, and also note that the view won’t update when you change the data).

View(d)

If you want, don’t be afraid to take a peek at the data in a spreadsheet program first.

Can you tell what each column represents? Here’s a summary:

  1. “society_id”: ID for the society (D-PLACE).
  2. “society_name”: Society name.
  3. “society_xd_id”: ID for the society (D-PLACE).
  4. “language_glottocode”: ID for the society’s language (Glottolog).
  5. “language_name”: Society’s language name.
  6. “language_family”: Language family of the society’s language.
  7. “variable_id”: ID of the roof shape variable (Ethnographic Atlas).
  8. “code”: The numeric code for the roof shape of the society (EA).
  9. “code_label”: The text label for that code (EA).
  10. “focal_year”: Year the observation was made.
  11. “sub_case”: Note on particular society.
  12. “comment”: Comments from D-PLACE or EA.

4 Indexing data frames

We can index the data frame in different ways. When indexing 1-dimensional objects like vectors, we used square brackets with a single number. With 2-dimensional objects like data frames, we need to specify 2 indices: what rows we want, and what columns we want.

For example, show row 1, column 2 (society name):

d[ 1 , 2 ]
## [1] "Digo"

For example, show rows 1, 4 and 6, and all columns (for all columns, we leave the second argument blank):

d[ c(1,4,6) , ]

Or all rows for the 6th column:

d[ , 6]
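
To make the pattern easier to see, here is the same kind of indexing on a small made-up data frame (the names and values are invented for illustration). Note that selecting several rows returns a data frame, while selecting a single column returns a vector.

```r
# A small made-up data frame with 3 rows and 3 columns
toy <- data.frame(name = c("A", "B", "C"),
                  shape = c("Flat", "Conical", "Flat"),
                  year = c(1900, 1920, 1950),
                  stringsAsFactors = FALSE)

# Row 2, column 1: a single value
toy[2, 1]

# Rows 1 and 3, all columns: a smaller data frame
toy[c(1, 3), ]

# All rows of column 3: a vector of numbers
toy[, 3]
```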

We can also refer to columns by their names. First, we use names to see the names of columns:

names(d)
##  [1] "society_id"          "society_name"        "society_xd_id"      
##  [4] "language_glottocode" "language_name"       "language_family"    
##  [7] "variable_id"         "code"                "code_label"         
## [10] "focal_year"          "sub_case"            "comment"

Then we get all rows of the column code_label (the text label of the roof shape), using the special dollar sign character $.

d$code_label

Hmm, that’s a lot of text that just came out of the console! Let’s say we want to look at the first three items in this column. There are many ways of indexing this:

d[1:3, ]$code_label
d[1:3, c("code_label")]
d$code_label[1:3]

The first method above is very common: First we select the rows we are interested in, including all columns. Then we ask for the code_label column using the dollar sign.
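
A quick way to convince yourself that the three methods are equivalent is to compare their results with identical. This sketch uses a small made-up data frame rather than the roof data.

```r
# A small made-up data frame with a code_label column
toy <- data.frame(id = 1:5,
                  code_label = c("Flat", "Conical", "Flat",
                                 "Two slopes", "Conical"),
                  stringsAsFactors = FALSE)

# The three ways of getting the first three labels
method1 <- toy[1:3, ]$code_label
method2 <- toy[1:3, c("code_label")]
method3 <- toy$code_label[1:3]

# identical() returns TRUE if two objects are exactly the same
identical(method1, method2)
identical(method2, method3)
```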

The function table counts how many times each unique value occurs in a variable. So we can summarise a categorical variable such as code_label (the roof shape).

table(d$code_label)
## 
##              Beehive shaped                     Conical 
##                          57                         353 
##                        Flat                 Four slopes 
##                          95                          88 
##               Hemispherical                   One slope 
##                         103                          15 
## Rounded or semi-cylindrical          Semi-hemispherical 
##                          38                           8 
##                  Two slopes 
##                         377

The table is printed with each name and the number of occurrences underneath. For example, we can see that 95 societies have “Flat” rooves and 8 have “Semi-hemispherical” rooves.
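
The result of table can be stored and indexed like other objects. The sketch below uses a short made-up vector of roof shapes to show how to pull out a single count and how to sort the counts.

```r
# A short made-up vector of roof shapes
shapes <- c("Flat", "Conical", "Flat", "Two slopes", "Conical", "Flat")

# Count how many times each shape occurs
counts <- table(shapes)
counts

# Extract the count for one shape by its name
counts["Flat"]

# Sort the counts from most to least frequent
sort(counts, decreasing = TRUE)
```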

We can provide multiple arguments to table, which will result in contingency tables. For example, a table of the number of roof shapes for each language family:

table(d$code_label,d$language_family)
##                              
##                                   Abkhaz-Adyge Afro-Asiatic Ainu Algic Anim
##   Beehive shaped                2            0            5    0     0    0
##   Conical                       3            1           33    0    17    0
##   Flat                          2            0           24    0     0    0
##   Four slopes                   3            0            4    0     0    0
##   Hemispherical                 7            0           13    0     4    0
##   One slope                     1            0            0    0     0    0
##   Rounded or semi-cylindrical   2            0           11    0     1    0
##   Semi-hemispherical            0            0            0    0     0    0
##   Two slopes                   12            0            9    1     4    1
##                              
##                               Araucanian Arawakan Athabaskan-Eyak-Tlingit
##   Beehive shaped                       0        0                       0
##   Conical                              0        1                      11
##   Flat                                 0        0                       0
##   Four slopes                          1        2                       1
##   Hemispherical                        0        0                       8
##   One slope                            0        0                       0
##   Rounded or semi-cylindrical          0        2                       0
##   Semi-hemispherical                   0        0                       0
##   Two slopes                           0        4                      11
##                              
##                               Atlantic-Congo Austroasiatic Austronesian Aymaran
##   Beehive shaped                          27             1            3       0
##   Conical                                137             0            2       0
##   Flat                                    22             0            1       0
##   Four slopes                             18             0           12       0
##   Hemispherical                            4             0            0       0
##   One slope                                1             2            0       0
##   Rounded or semi-cylindrical              2             0            2       0
##   Semi-hemispherical                       0             0            0       0
##   Two slopes                              82            12           73       1
##                              
##                               Barbacoan Basque Blue Nile Mao Bororoan Caddoan
##   Beehive shaped                      0      0             1        0       2
##   Conical                             0      0             0        0       0
##   Flat                                0      0             0        0       0
##   Four slopes                         1      0             0        0       0
##   Hemispherical                       0      0             0        0       2
##   One slope                           0      0             0        0       0
##   Rounded or semi-cylindrical         0      0             0        0       0
##   Semi-hemispherical                  0      0             0        0       0
##   Two slopes                          0      1             0        2       0
##                              
##                               Cariban Central Sudanic Chibchan Chicham
##   Beehive shaped                    1               1        0       0
##   Conical                           4              11        3       0
##   Flat                              0               0        0       0
##   Four slopes                       2               0        0       1
##   Hemispherical                     0               0        0       0
##   One slope                         0               0        0       0
##   Rounded or semi-cylindrical       1               0        0       0
##   Semi-hemispherical                0               0        0       0
##   Two slopes                        3               2        2       0
##                              
##                               Chimakuan Chinookan Chocoan Chonan
##   Beehive shaped                      0         0       0      0
##   Conical                             0         0       1      0
##   Flat                                0         0       0      0
##   Four slopes                         0         0       0      0
##   Hemispherical                       0         1       0      0
##   One slope                           0         0       0      0
##   Rounded or semi-cylindrical         0         0       0      0
##   Semi-hemispherical                  0         0       0      2
##   Two slopes                          1         1       0      0
##                              
##                               Chukotko-Kamchatkan Chumashan Cochimi-Yuman
##   Beehive shaped                                0         0             0
##   Conical                                       1         0             0
##   Flat                                          1         0             0
##   Four slopes                                   0         0             1
##   Hemispherical                                 0         1             8
##   One slope                                     0         0             0
##   Rounded or semi-cylindrical                   0         0             2
##   Semi-hemispherical                            0         0             0
##   Two slopes                                    1         0             1
##                              
##                               Coosan Dizoid Dogon Dravidian Eleman Eskimo-Aleut
##   Beehive shaped                   0      0     0         0      0            0
##   Conical                          0      1     0         1      0            1
##   Flat                             0      0     1         0      0            1
##   Four slopes                      0      0     0         2      0            1
##   Hemispherical                    0      0     0         0      0           10
##   One slope                        0      0     0         0      0            0
##   Rounded or semi-cylindrical      0      0     0         1      0            0
##   Semi-hemispherical               0      0     0         0      0            2
##   Two slopes                       1      0     0         6      1            2
##                              
##                               Furan Goilalan Great Andamanese Greater Kwerba
##   Beehive shaped                  0        0                0              0
##   Conical                         1        0                0              0
##   Flat                            0        0                0              0
##   Four slopes                     0        0                0              0
##   Hemispherical                   0        0                0              0
##   One slope                       0        0                1              0
##   Rounded or semi-cylindrical     0        0                0              0
##   Semi-hemispherical              0        0                0              0
##   Two slopes                      0        1                0              1
##                              
##                               Guahiboan Guaicuruan Haida Heibanic Hmong-Mien
##   Beehive shaped                      0          0     0        0          0
##   Conical                             0          0     0        5          0
##   Flat                                0          0     0        0          0
##   Four slopes                         0          0     0        0          0
##   Hemispherical                       0          0     0        0          0
##   One slope                           0          1     0        0          0
##   Rounded or semi-cylindrical         0          0     0        0          0
##   Semi-hemispherical                  0          0     0        0          0
##   Two slopes                          1          2     1        0          1
##                              
##                               Huavean Huitotoan Ijoid Indo-European Iroquoian
##   Beehive shaped                    0         0     0             0         0
##   Conical                           0         0     0             0         0
##   Flat                              0         0     0            13         0
##   Four slopes                       0         0     0             6         0
##   Hemispherical                     0         0     0             0         0
##   One slope                         0         0     0             0         0
##   Rounded or semi-cylindrical       0         0     0             0         2
##   Semi-hemispherical                0         0     0             0         0
##   Two slopes                        1         1     1            26         1
##                              
##                               Japonic Jodi-Saliban Kadugli-Krongo Kartvelian
##   Beehive shaped                    0            1              0          0
##   Conical                           0            0              2          0
##   Flat                              0            0              0          3
##   Four slopes                       2            0              0          0
##   Hemispherical                     0            0              0          0
##   One slope                         0            0              0          0
##   Rounded or semi-cylindrical       0            0              0          0
##   Semi-hemispherical                0            0              0          0
##   Two slopes                        0            0              0          0
##                              
##                               Kawesqar Keresan Khoe-Kwadi Kiowa-Tanoan Koiarian
##   Beehive shaped                     0       0          1            0        0
##   Conical                            0       0          0            1        0
##   Flat                               0       5          0            6        0
##   Four slopes                        0       0          0            0        0
##   Hemispherical                      1       0          1            0        0
##   One slope                          0       0          0            0        0
##   Rounded or semi-cylindrical        0       0          0            0        0
##   Semi-hemispherical                 0       0          1            0        0
##   Two slopes                         0       0          0            0        2
##                              
##                               Kolopom Koman Koreanic Kxa Lencan
##   Beehive shaped                    1     0        0   0      0
##   Conical                           0     1        0   0      0
##   Flat                              0     0        0   0      0
##   Four slopes                       0     0        1   0      1
##   Hemispherical                     0     0        0   1      0
##   One slope                         0     0        0   0      0
##   Rounded or semi-cylindrical       0     0        0   0      0
##   Semi-hemispherical                0     0        0   0      0
##   Two slopes                        0     0        0   0      0
##                              
##                               Lower Sepik-Ramu Maiduan Mailuan Mande Maningrida
##   Beehive shaped                             0       0       0     0          0
##   Conical                                    0       2       0    14          0
##   Flat                                       0       0       0     7          0
##   Four slopes                                0       0       0     1          0
##   Hemispherical                              0       0       0     0          0
##   One slope                                  0       0       0     0          0
##   Rounded or semi-cylindrical                0       0       0     0          1
##   Semi-hemispherical                         0       0       0     1          0
##   Two slopes                                 1       0       1     0          0
##                              
##                               Matacoan Mayan Misumalpan Miwok-Costanoan
##   Beehive shaped                     1     0          0               0
##   Conical                            1     0          0               0
##   Flat                               0     0          0               0
##   Four slopes                        0     0          0               0
##   Hemispherical                      0     0          0               2
##   One slope                          0     0          0               0
##   Rounded or semi-cylindrical        0     0          0               0
##   Semi-hemispherical                 0     0          0               0
##   Two slopes                         0     5          1               0
##                              
##                               Mixe-Zoque Mongolic Morehead-Wasur Muskogean
##   Beehive shaped                       0        0              0         0
##   Conical                              0        2              0         1
##   Flat                                 0        0              0         0
##   Four slopes                          2        0              0         0
##   Hemispherical                        0        1              0         0
##   One slope                            0        0              0         0
##   Rounded or semi-cylindrical          0        0              0         0
##   Semi-hemispherical                   0        0              0         0
##   Two slopes                           0        1              1         2
##                              
##                               Nakh-Daghestanian Nambiquaran Narrow Talodi Ndu
##   Beehive shaped                              0           1             0   0
##   Conical                                     0           0             1   0
##   Flat                                        1           0             0   0
##   Four slopes                                 0           0             0   0
##   Hemispherical                               0           0             0   0
##   One slope                                   0           0             0   0
##   Rounded or semi-cylindrical                 0           0             0   0
##   Semi-hemispherical                          0           0             0   0
##   Two slopes                                  0           0             0   2
##                              
##                               Nilotic Nubian Nuclear Torricelli
##   Beehive shaped                    3      0                  0
##   Conical                          18      1                  0
##   Flat                              1      1                  0
##   Four slopes                       0      0                  0
##   Hemispherical                     6      1                  0
##   One slope                         0      0                  0
##   Rounded or semi-cylindrical       1      0                  0
##   Semi-hemispherical                0      0                  0
##   Two slopes                        0      0                  1
##                              
##                               Nuclear Trans New Guinea Nuclear-Macro-Je Nyimang
##   Beehive shaped                                     0                1       0
##   Conical                                            2                0       1
##   Flat                                               0                0       0
##   Four slopes                                        1                2       0
##   Hemispherical                                      1                1       0
##   One slope                                          0                0       0
##   Rounded or semi-cylindrical                        1                2       0
##   Semi-hemispherical                                 0                0       0
##   Two slopes                                         5                2       0
##                              
##                               Otomanguean Palaihnihan Pama-Nyungan Pano-Tacanan
##   Beehive shaped                        0           0            0            0
##   Conical                               0           1            0            0
##   Flat                                  0           0            0            0
##   Four slopes                           0           1            0            1
##   Hemispherical                         0           0            0            0
##   One slope                             0           0            2            0
##   Rounded or semi-cylindrical           0           0            1            0
##   Semi-hemispherical                    0           0            0            0
##   Two slopes                            3           0            0            2
##                              
##                               Peba-Yagua Pomoan Quechuan Sahaptian Saharan
##   Beehive shaped                       0      0        0         0       1
##   Conical                              0      1        0         1       2
##   Flat                                 0      0        0         0       0
##   Four slopes                          1      0        0         1       0
##   Hemispherical                        0      2        0         0       0
##   One slope                            0      0        0         0       0
##   Rounded or semi-cylindrical          0      0        0         0       1
##   Semi-hemispherical                   0      0        0         0       0
##   Two slopes                           0      0        1         2       0
##                              
##                               Salishan Sepik Shastan Sino-Tibetan Siouan
##   Beehive shaped                     0     0       0            0      0
##   Conical                            3     0       0            0      4
##   Flat                               0     0       0            1      0
##   Four slopes                        4     0       0            2      0
##   Hemispherical                      0     0       0            0      6
##   One slope                          3     0       0            0      0
##   Rounded or semi-cylindrical        0     0       0            0      0
##   Semi-hemispherical                 0     0       0            0      0
##   Two slopes                        11     1       1           19      1
##                              
##                               Songhay South Bougainville South Omotic Surmic
##   Beehive shaped                    0                  0            0      2
##   Conical                           1                  0            6      1
##   Flat                              0                  0            0      0
##   Four slopes                       0                  0            0      0
##   Hemispherical                     0                  0            0      0
##   One slope                         0                  0            0      0
##   Rounded or semi-cylindrical       1                  0            0      0
##   Semi-hemispherical                0                  0            0      0
##   Two slopes                        0                  1            0      0
##                              
##                               Ta-Ne-Omotic Tai-Kadai Tamaic Tarascan
##   Beehive shaped                         2         0      0        0
##   Conical                                5         0      0        0
##   Flat                                   0         0      0        0
##   Four slopes                            0         0      0        1
##   Hemispherical                          0         0      1        0
##   One slope                              0         0      0        0
##   Rounded or semi-cylindrical            0         1      0        0
##   Semi-hemispherical                     0         0      0        0
##   Two slopes                             0         1      0        0
##                              
##                               Ticuna-Yuri Totonacan Tsimshian Tucanoan Tungusic
##   Beehive shaped                        0         0         0        0        0
##   Conical                               0         0         0        0        3
##   Flat                                  0         0         0        0        0
##   Four slopes                           1         1         0        0        0
##   Hemispherical                         0         0         0        0        0
##   One slope                             0         0         0        0        0
##   Rounded or semi-cylindrical           0         0         0        0        0
##   Semi-hemispherical                    0         0         0        0        0
##   Two slopes                            0         0         1        1        4
##                              
##                               Tupian Turkic Tuu Uralic Uto-Aztecan Wakashan
##   Beehive shaped                   0      0   0      0           0        0
##   Conical                          0      0   0      4          32        0
##   Flat                             0      2   0      0           2        0
##   Four slopes                      3      2   0      1           3        0
##   Hemispherical                    0      1   0      0          17        0
##   One slope                        0      0   0      0           0        1
##   Rounded or semi-cylindrical      2      0   0      0           0        0
##   Semi-hemispherical               0      0   1      0           0        0
##   Two slopes                       5      6   0     11           7        4
##                              
##                               Western Tasmanian Wintuan Yanomamic Yeniseian
##   Beehive shaped                              0       0         0         0
##   Conical                                     0       2         1         0
##   Flat                                        0       0         0         1
##   Four slopes                                 0       0         1         0
##   Hemispherical                               0       1         0         0
##   One slope                                   0       0         2         0
##   Rounded or semi-cylindrical                 0       0         0         0
##   Semi-hemispherical                          1       0         0         0
##   Two slopes                                  0       0         0         0
##                              
##                               Yokutsan Yukaghir Yuki-Wappo Zamucoan
##   Beehive shaped                     0        0          0        0
##   Conical                            1        1          3        0
##   Flat                               0        0          0        0
##   Four slopes                        0        0          0        0
##   Hemispherical                      1        0          1        0
##   One slope                          0        0          0        1
##   Rounded or semi-cylindrical        1        0          0        0
##   Semi-hemispherical                 0        0          0        0
##   Two slopes                         0        0          0        0

For example, Uto-Aztecan has mainly Conical and Hemispherical roofs.

4.1 Sub-sets of the data

The observations in our roof shape data come from focal years as early as 2000 BC. However, our rainfall data only relates to more recent times. Let’s say we want to look at all data where the focal year is greater than 1850.

Task: Make a boolean vector which is TRUE if the focal_year variable is greater than 1850, and FALSE otherwise. Assign this to a variable named recentFocalYears (create a variable called recentFocalYears and store the boolean vector inside it).

Task: Make a table of counts of roof shape types for languages in the recent data. Hint: index the rows with the variable recentFocalYears and select the code_label column.
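
As a warm-up for these tasks, here is a minimal sketch of the same pattern on a made-up data frame. The names toy, year, shape, isRecent and recentShapes are hypothetical and are not part of the roof shape data, so you will still need to adapt the idea to d yourself.

```r
# A small made-up data frame (hypothetical values, not the real data)
toy <- data.frame(
  year  = c(1700, 1900, 1920, 1800),
  shape = c("Flat", "Conical", "Flat", "Conical")
)
# Boolean vector: TRUE where year is greater than 1850
isRecent <- toy$year > 1850
# Use the boolean vector to pick rows, and the column name to pick a column
recentShapes <- toy[isRecent, "shape"]
# Count how often each shape occurs in the selected rows
table(recentShapes)
```

Only the second and third rows pass the test, so the table counts one Conical and one Flat roof.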

4.2 Making a new variable inside a data frame

In the task above, we made a variable called recentFocalYears. Instead of doing this, we can add a new column directly to the data frame as follows:

d$recentFocalYears <- d$focal_year > 1850
head(d)
##   society_id society_name society_xd_id language_glottocode
## 1       Ad30         Digo          xd99            digo1243
## 2       Ae17         Rega         xd135            lega1249
## 3        Aj2        Maasi         xd395            masa1300
## 4       Ca36    Beni-Amer         xd444            mans1267
## 5       Cb13    Habbaniya         xd462            nort3133
## 6        Cb3      Songhai         xd480            koyr1242
##             language_name language_family variable_id code
## 1                    Digo  Atlantic-Congo       EA082    1
## 2           Lega-Shabunda  Atlantic-Congo       EA082    1
## 3                   Masai         Nilotic       EA082    1
## 4                  Mansa'    Afro-Asiatic       EA082    1
## 5   North Kordofan Arabic    Afro-Asiatic       EA082    1
## 6 Koyraboro Senni Songhai         Songhay       EA082    1
##                    code_label focal_year                              sub_case
## 1 Rounded or semi-cylindrical       1890                                      
## 2 Rounded or semi-cylindrical       1900                                      
## 3 Rounded or semi-cylindrical       1900 Kisonko or Southern Masai of Tanzania
## 4 Rounded or semi-cylindrical       1860                                      
## 5 Rounded or semi-cylindrical       1930                                      
## 6 Rounded or semi-cylindrical       1940                        Bamba division
##   comment recentFocalYears
## 1                     TRUE
## 2                     TRUE
## 3                     TRUE
## 4                     TRUE
## 5                     TRUE
## 6                     TRUE

How many societies are coded between 1900 and 1920? This requires two boolean tests.

We can combine boolean values with boolean operators like & (and) and | (or):

# AND
c(TRUE, TRUE, FALSE) & c(TRUE, FALSE, FALSE)
## [1]  TRUE FALSE FALSE
# OR
c(TRUE, TRUE, FALSE) | c(TRUE, FALSE, FALSE)
## [1]  TRUE  TRUE FALSE

The code below makes a boolean vector which is TRUE only when the focal year is greater than 1899 and the focal year is less than 1921. We can then take the sum of this variable to count the number of TRUE values (TRUE is treated as 1 and FALSE as zero).

focalYear1900to1920 <- d$focal_year > 1899 & d$focal_year < 1921
sum(focalYear1900to1920)
## [1] 376
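
To see how the two operators behave on a case small enough to check by hand, here is a short sketch using a made-up vector of years. The names years, inRange and outside are hypothetical. For whole years, >= 1900 gives the same result as > 1899.

```r
# Made-up focal years (hypothetical values)
years <- c(1890, 1900, 1910, 1920, 1930)
# AND: TRUE only if both tests are TRUE
inRange <- years >= 1900 & years <= 1920
sum(inRange)
## [1] 3
# OR: TRUE if at least one test is TRUE
outside <- years < 1900 | years > 1920
sum(outside)
## [1] 2
```

Every year is either inside or outside the range, so the two counts add up to the length of the vector.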

4.3 Combining data from different sources

Merging two sources of data is one of the most powerful skills you can gain from coding. This will allow you to compare data from different sources. To achieve this, the data must contain information on how to cross-reference. This is typically an ID (identification) which will be a unique combination of letters and numbers for each society. For example, eHRAF uses two letters and two numbers to refer to cultures (e.g. Tudor Britain is “ES14”), and Glottolog uses four letters and four numbers to refer to languages (e.g. Welsh is “wels1247”). We can use R to match data across two files and join them together.

We’d like to compare the roof shapes of societies to the rainfall for that area. But the data for rainfall exists in another file. Both files can be cross-referenced through the society_id variable, which refers to the society that has been coded.

We want to match the society_id in the roof shape data to the society_id in the rainfall data, then add the numeric rainfall data into the roof shape data frame.

Task: Load the data in ../data/RainfallData_onlyComplete.csv into a data frame called rainfallData. This includes the Monthly Mean Precipitation.

Task: What are the names and formats of the variables in rainfallData? What is the name of the column that includes the numeric amount of rainfall?

4.3.1 Using names

It is possible to match data frames using rownames. However, this assumes that row ids are unique and that there are no missing values, so it is not recommended.
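
For completeness, here is a minimal sketch of what matching by rownames could look like on two made-up data frames. The names roofs and rain and all the values are hypothetical. It only works cleanly here because every id appears exactly once in both data frames.

```r
# Two made-up data frames whose row names are society ids (hypothetical values)
roofs <- data.frame(shape = c("Flat", "Conical", "Flat"),
                    row.names = c("Ad30", "Ae17", "Aj2"))
rain <- data.frame(rainfall = c(10, 20, 30),
                   row.names = c("Aj2", "Ad30", "Ae17"))
# Index the rows of rain by name, in the order the ids appear in roofs
roofs$rainfall <- rain[rownames(roofs), "rainfall"]
roofs
```

The rows of rain are in a different order from roofs, but indexing by name lines them up again.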

4.3.2 Using match

match is a function which takes two arguments: a vector of things to be matched, and a vector of values to be matched against. For each item of the first vector, it returns the index of that item in the second vector:

match(c(3,5,2), c(2,3,5))
## [1] 2 3 1

That is, 3 is the 2nd item in the second vector, 5 is the 3rd item in the second vector and 2 is the 1st item in the second vector. This applies to any type of data:

match(c("c","b","b","a"), c("a","b","c"))
## [1] 3 2 2 1
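
It is also worth knowing what happens when an item has no match. The short sketch below uses made-up vectors (ids and pos are hypothetical names) to show that match returns NA in that position, which we can detect with is.na:

```r
# "z" does not appear in the second vector
ids <- c("a", "z", "b")
pos <- match(ids, c("a", "b", "c"))
pos
## [1]  1 NA  2
# Which items had no match?
is.na(pos)
## [1] FALSE  TRUE FALSE
```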

If we have three society ids:

sx <- c("Ad30", "Ae17", "Aj2")

We can use match to get the rainfall of these societies. The code below finds where the ids in sx appear in the society_id column of rainfallData. These are the row numbers where the society ids correspond. It then uses these row numbers to index rainfallData and return the code column, which holds the rainfall values:

# example of what match returns:
match(sx,rainfallData$society_id)
## [1]  955 1699  563
# Get rainfall for each member of sx
rainfallData[match(sx,rainfallData$society_id), ]$code
## [1]  96037.26 193073.32  62002.16

We can match every item in d and make a new column called rainfall this way:

d$rainfall <- rainfallData[match(d$society_id, rainfallData$society_id) , ]$code
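
After matching, it can be useful to check how many rows did not find a partner. The sketch below does this on two made-up data frames. The names roofs and rain, the id "Xx99" and all the values are hypothetical and are not taken from the real files.

```r
# Made-up data: "Xx99" has no entry in the rainfall data (hypothetical values)
roofs <- data.frame(society_id = c("Ad30", "Ae17", "Xx99"),
                    shape = c("Flat", "Conical", "Flat"))
rain <- data.frame(society_id = c("Ae17", "Ad30"),
                   code = c(200, 100))
# Look up the rainfall for each society in roofs
roofs$rainfall <- rain$code[match(roofs$society_id, rain$society_id)]
roofs$rainfall
## [1] 100 200  NA
# Count the societies with no rainfall value
sum(is.na(roofs$rainfall))
## [1] 1
```

Societies without a match get NA rather than causing an error, so a count like this is an easy way to spot them.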

4.3.3 Using merge

Two data frames can be merged using merge and specifying which columns should be used to align the data. This is simpler to write, but results in a data frame with all information from both sources. This can be a bit tricky to use if that’s not what you want to do.

To use merge, it’s best to have a primary data frame with all the current columns, and a secondary data frame with only the cross-referencing variable and the new target variable (or all the variables you want to transfer). Also note that the rainfall data is stored in the variable “code”. Our roof shape data already has a variable with this name, so we want to avoid confusion. Ideally, the new target variable should already be renamed in the secondary data frame:

# Copy the 'code' column to a new column called 'rainfall'
rainfallData$rainfall <- rainfallData$code
# Make a new secondary data frame with only the columns we want
rf2 <- rainfallData[,c("society_id","rainfall")]

Now we can merge the two data sources. The code below makes a new variable d2 which merges d and rf2 by matching on the society_id column. Note that the name of the matching column can be different in each data frame, so we have to give both separately (by.x for the first data frame and by.y for the second).

d2 <- merge(d, rf2, by.x='society_id', by.y='society_id')

The last column of d2 should now have the rainfall data. Have a look for yourself, e.g.:

head(d2)
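
One thing to watch for: by default, merge keeps only the rows that have a match in both data frames, and it sorts the result by the matching column. The sketch below shows this on two made-up data frames. The names roofs, rain, inner and left, the id "Xx99" and all the values are hypothetical. Setting all.x = TRUE keeps every row of the first data frame and fills in NA where there is no match.

```r
# Made-up data: "Xx99" has no entry in the rainfall data (hypothetical values)
roofs <- data.frame(society_id = c("Ad30", "Ae17", "Xx99"),
                    shape = c("Flat", "Conical", "Flat"))
rain <- data.frame(society_id = c("Ae17", "Ad30"),
                   rainfall = c(200, 100))
# Default merge: unmatched rows are dropped
inner <- merge(roofs, rain, by = "society_id")
nrow(inner)
## [1] 2
# all.x = TRUE: keep every row of roofs, with NA for missing rainfall
left <- merge(roofs, rain, by = "society_id", all.x = TRUE)
nrow(left)
## [1] 3
```

Comparing the number of rows before and after a merge is a quick way to check whether any societies were dropped.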
