Part 1

Task: can you apply the functions sum and length to boolean vectors?

Yes - sum counts the number of TRUE values.

Task: make a list of booleans where the items are TRUE if each corresponding item of nums is less or equal to 3

nums <= 3

Task: make a list of booleans where the items are TRUE if each corresponding item of nums is either 5 or 1.

nums %in% c(5,1)

Task : Using y to index nums worked because the length of the two vectors are the same. What happens if you make another variable y2 which only has 3 boolean items, and try to index nums using that?

y2 = c(TRUE, FALSE, TRUE)
nums[y2]

Task: Find the favourite numbers of all my friends who have more than 4 characters in their name.

Hint: build this up step by step. First of all, get the number of characters in each name, then test whether this number is greater than 4. This should result in a vector of booleans. Then index nums using this vector.

nums[nchar(friends)>4]

Part 3

Task: What happens if you type d into the console to see what’s inside the object d?

You get a lot of data printed to the screen! Very unhelpful.

Task: Make a boolean vector which is TRUE if the focal_year variable is greater than 1850, and FALSE otherwise. Assign this to a variable named recentFocalYears (create a variable called recentFocalYears and store the boolean vector inside it).

recentFocalYears <- d$focal_year > 1850

Task: Make a table of counts of roof shape types for languages in the recent data. Hint: you should index the rows with the variable recentFocalYears and the code_label column.

table(d[recentFocalYears,]$code_label)

Task: Load the data in data/Glottolog_Data.csv into a data frame called glottoData.

rainfallData <- read.csv("../data/RainfallData.csv", stringsAsFactors = F)

Task What are the names and formats of the variables in rainfallData? What is the name of the column that includes the numeric amount of rainfall?

str(rainfallData)

The code variable holds the data on rainfall)

Part 4

Task: Add a line to your script to make a variable in d named flatRoof that is TRUE if the roof type is “Flat” and FALSE otherwise.

d$flatRoof <- d$roofShape == "Flat"

Task: Which basic word order type has the higest mean rainfall? Add this to your script

rainfallByLangFam.MEAN <- tapply(d$rainfall, d$language_family, mean)
rainfallByLangFam.MEAN.sorted <- sort(rainfallByLangFam.MEAN, decreasing = TRUE)
head(rainfallByLangFam.MEAN.sorted)

Task: Make a boxplot showing the mean rainfall for each type of roof shape. There are lots of types, which makes the legend difficult to read. Can you make a boxplot for just the top 4 most common roof shape types?

# Plot of all the roof shape types:
boxplot(rainfall ~ roofShape, data = d)

# Plot the 5 most common roof shapes:
# Get a table of counts for roof shape:
roofShapeCounts = table(d$roofShape)
# Sort the list and take the top 4
top4RoofShapes = sort(roofShapeCounts,decreasing = T)[1:4]
# This is a list of numbers, but we want the NAMES of this list:
top4RoofShapes = names(top4RoofShapes)
# Plot the data:
boxplot(rainfall ~ roofShape, data = d[d$roofShape %in% top4RoofShapes,])

Task: Change the colour of the line in the abline function to green.

abline(h = 0, col = 'green')

Part 5

Task: What happens if you try to calculate the sum of this variable?

You get NA returned!

Task What happens when you try to work out the mean rainfall in the whole dataset?

mean(rainfallData$code)

You get an NA value!

Task Which languages have the rainfall data missing?

Here’s how to get rows where code is NA:

rainfallData[is.na(rainfallData$code),]

Task: How many datapoints in rainfallData$language_family are missing? Remember that is.na creates a vector of booleans and sum can count the number of TRUE values in a vector.

sum(is.na(rainfallData$language_family))
sum(is.na(rainfallData$code))

Task: Work out the mean rainfall for each language family in the data. There are a few ways to do this, and none is “better” than any other, as long as the code is clear.

If there were no NA values, then we could use a tapply function to solve this:

tapply(rainfallData$code, rainfallData$language_family, mean)

But this will produce some NA values.

There are a few ways to solve this. The first is to make a data frame with only rows where the code variable is NOT NA:

rainfallData2 = rainfallData[!is.na(rainfallData$code),]
tapply(rainfallData2$code, rainfallData2$language_family, mean)

You could also just pass the argument na.rm=TRUE as an extra argument to tapply. tapply passes this argument on to the function it’s given. Now, mean() will ignore any NA values.

tapply(rainfallData$code, rainfallData$language_family, mean, an.rm=TRUE)