Task: Can you apply the functions sum and length to boolean vectors?
Yes - sum counts the number of TRUE values, and length returns the total number of items.
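As a small illustration, here is a sketch using a made-up vector called flags (it is not part of the tutorial data). Because TRUE is treated as 1 and FALSE as 0, mean also gives the proportion of TRUE values.

```r
# Hypothetical example vector, not from the tutorial data
flags <- c(TRUE, FALSE, TRUE, TRUE)

n_true    <- sum(flags)     # TRUE counts as 1, FALSE as 0
n_items   <- length(flags)  # total number of items, TRUE or FALSE
prop_true <- mean(flags)    # proportion of items that are TRUE

print(c(n_true = n_true, n_items = n_items, prop_true = prop_true))
```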
Task: Make a list of booleans where each item is TRUE if the corresponding item of nums is less than or equal to 3.
nums <= 3
Task: Make a list of booleans where each item is TRUE if the corresponding item of nums is either 5 or 1.
nums %in% c(5,1)
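To see what these two tests return, here is a sketch with an assumed value for nums (the real nums from the tutorial may differ). It also shows why %in% is used rather than ==: with ==, the shorter vector is recycled and compared position by position, which is usually not what is wanted.

```r
# Assumed values for illustration only
nums <- c(4, 8, 1, 5, 3, 9)

small     <- nums <= 3          # TRUE where the item is 3 or less
five_or_1 <- nums %in% c(5, 1)  # TRUE where the item is 5 or 1

# Position-by-position comparison against 5, 1, 5, 1, 5, 1
pairwise  <- nums == c(5, 1)

print(small)
print(five_or_1)
print(pairwise)
```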
Task: Using y to index nums worked because the lengths of the two vectors are the same. What happens if you make another variable y2 which only has 3 boolean items, and try to index nums using that?
y2 = c(TRUE, FALSE, TRUE)
nums[y2]
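R does not raise an error here: the shorter boolean vector is recycled (repeated) until it is as long as the vector being indexed. The sketch below assumes a six-item nums purely for illustration.

```r
# Assumed values for illustration only
nums <- c(4, 8, 1, 5, 3, 9)
y2 <- c(TRUE, FALSE, TRUE)

# y2 is recycled to TRUE, FALSE, TRUE, TRUE, FALSE, TRUE
recycled_result <- nums[y2]

# Writing the recycled index out in full gives the same answer
explicit_result <- nums[rep(y2, length.out = length(nums))]

print(recycled_result)
```

Because the recycling is silent, it is usually safer to make sure the index has the same length as the vector it indexes.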
Task: Find the favourite numbers of all my friends who have more than 4 characters in their name.
Hint: build this up step by step. First of all, get the number of characters in each name, then test whether this number is greater than 4. This should result in a vector of booleans. Then index nums using this vector.
nums[nchar(friends)>4]
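The one-line answer can be unpacked into the steps from the hint. The names and numbers below are invented for the example; the variables friends and nums in the tutorial will hold different values.

```r
# Invented example data
friends <- c("Alejandro", "Bo", "Chen", "Dominic")
nums    <- c(7, 2, 13, 4)

# Step 1: number of characters in each name
name_lengths <- nchar(friends)

# Step 2: test whether each length is greater than 4
long_names <- name_lengths > 4

# Step 3: use the boolean vector to index nums
long_name_nums <- nums[long_names]

print(long_name_nums)
```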
Task: What happens if you type d into the console to see what’s inside the object d?
You get a lot of data printed to the screen! Very unhelpful.
Task: Make a boolean vector which is TRUE if the focal_year variable is greater than 1850, and FALSE otherwise. Assign this to a variable named recentFocalYears (create a variable called recentFocalYears and store the boolean vector inside it).
recentFocalYears <- d$focal_year > 1850
Task: Make a table of counts of roof shape types for languages in the recent data. Hint: you should index the rows with the variable recentFocalYears and the code_label column.
table(d[recentFocalYears,]$code_label)
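The same pattern can be tried on a tiny data frame. The one below, d_toy, is invented for this sketch and only borrows the column names focal_year and code_label.

```r
# Invented data frame with the same column names as in the task
d_toy <- data.frame(
  focal_year = c(1800, 1900, 1920, 1840, 1950),
  code_label = c("Flat", "Dome", "Flat", "Dome", "Flat"),
  stringsAsFactors = FALSE
)

# Boolean vector: TRUE for rows with a focal year after 1850
recent <- d_toy$focal_year > 1850

# Keep only the recent rows, then count the roof shape labels
recent_counts <- table(d_toy[recent, ]$code_label)

print(recent_counts)
```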
Task: Load the data in data/RainfallData.csv into a data frame called rainfallData.
rainfallData <- read.csv("../data/RainfallData.csv", stringsAsFactors = FALSE)
Task: What are the names and formats of the variables in rainfallData? What is the name of the column that includes the numeric amount of rainfall?
str(rainfallData)
The code variable holds the data on rainfall.
Task: Add a line to your script to make a variable in d named flatRoof that is TRUE if the roof type is “Flat” and FALSE otherwise.
d$flatRoof <- d$roofShape == "Flat"
Task: Which language family has the highest mean rainfall? Add this to your script.
rainfallByLangFam.MEAN <- tapply(d$rainfall, d$language_family, mean)
rainfallByLangFam.MEAN.sorted <- sort(rainfallByLangFam.MEAN, decreasing = TRUE)
head(rainfallByLangFam.MEAN.sorted)
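Here is how the tapply and sort steps behave on a small invented data frame (toy and its values are assumptions for illustration; only the column names match the answer above).

```r
# Invented data with the same column names as in the answer
toy <- data.frame(
  rainfall        = c(10, 20, 30, 50, 5),
  language_family = c("A", "A", "B", "B", "C")
)

# Mean rainfall within each language family
family_means <- tapply(toy$rainfall, toy$language_family, mean)

# Largest mean first; the names travel with the values
family_means_sorted <- sort(family_means, decreasing = TRUE)

# Name of the family with the highest mean
top_family <- names(family_means_sorted)[1]

print(family_means_sorted)
print(top_family)
```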
Task: Make a boxplot showing the distribution of rainfall for each type of roof shape. There are lots of types, which makes the axis labels difficult to read. Can you make a boxplot for just the top 4 most common roof shape types?
# Plot of all the roof shape types:
boxplot(rainfall ~ roofShape, data = d)
# Plot the 4 most common roof shapes:
# Get a table of counts for roof shape:
roofShapeCounts = table(d$roofShape)
# Sort the counts and take the top 4
top4RoofShapes = sort(roofShapeCounts, decreasing = TRUE)[1:4]
# This is a table of counts, but we want the NAMES of its entries:
top4RoofShapes = names(top4RoofShapes)
# Plot the data:
boxplot(rainfall ~ roofShape, data = d[d$roofShape %in% top4RoofShapes,])
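The selection logic can be checked without plotting. This sketch uses an invented vector of roof shapes and keeps the top 2 rather than the top 4, simply because the example is small.

```r
# Invented roof shape values
shapes <- c("Flat", "Dome", "Flat", "Cone", "Dome", "Flat", "Hip")

# Count each shape, sort with the most common first, keep the top 2 names
shape_counts <- table(shapes)
top_shapes <- names(sort(shape_counts, decreasing = TRUE)[1:2])

# Boolean vector marking the items that belong to a top shape
keep <- shapes %in% top_shapes

print(top_shapes)
print(shapes[keep])
```

If there could be fewer categories than the number requested, head(sort(shape_counts, decreasing = TRUE), 4) is a slightly safer variant, because indexing with [1:4] past the end of the table returns NA entries.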
Task: Change the colour of the line in the abline function to green.
abline(h = 0, col = 'green')
Task: What happens if you try to calculate the sum of this variable?
You get NA returned!
Task: What happens when you try to work out the mean rainfall in the whole dataset?
mean(rainfallData$code)
You get an NA value!
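Both sum and mean return NA as soon as a single value is missing, because the true total cannot be known. A short sketch with an invented vector shows this, along with the na.rm argument that both functions accept.

```r
# Invented values with one missing entry
x <- c(12, NA, 30)

total_with_na <- sum(x)   # NA: one value is unknown
mean_with_na  <- mean(x)  # NA for the same reason

# na.rm = TRUE drops the missing values before calculating
total_no_na <- sum(x, na.rm = TRUE)
mean_no_na  <- mean(x, na.rm = TRUE)

print(c(total_no_na = total_no_na, mean_no_na = mean_no_na))
```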
Task: Which languages have missing rainfall data?
Here’s how to get rows where code is NA:
rainfallData[is.na(rainfallData$code),]
Task: How many datapoints in rainfallData$language_family are missing? Remember that is.na creates a vector of booleans and sum can count the number of TRUE values in a vector.
sum(is.na(rainfallData$language_family))
sum(is.na(rainfallData$code))
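If you want the missing-value count for every column at once, one option is colSums applied to is.na of the whole data frame. The data frame below is invented for the sketch; only its column names match rainfallData.

```r
# Invented data frame with missing values in both columns
rain_toy <- data.frame(
  code            = c(1, NA, 3),
  language_family = c("A", NA, NA)
)

# is.na() on a data frame gives a matrix of booleans;
# colSums() then counts the TRUE values in each column
missing_per_column <- colSums(is.na(rain_toy))

print(missing_per_column)
```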
Task: Work out the mean rainfall for each language family in the data. There are a few ways to do this, and none is “better” than any other, as long as the code is clear.
If there were no NA values, then we could use a tapply function to solve this:
tapply(rainfallData$code, rainfallData$language_family, mean)
But this will produce NA for any language family that has at least one missing rainfall value.
There are a few ways to solve this. The first is to make a data frame with only rows where the code variable is NOT NA:
rainfallData2 = rainfallData[!is.na(rainfallData$code),]
tapply(rainfallData2$code, rainfallData2$language_family, mean)
You could also just pass the argument na.rm=TRUE as an extra argument to tapply. tapply passes this argument on to the function it’s given. Now, mean() will ignore any NA values.
tapply(rainfallData$code, rainfallData$language_family, mean, na.rm=TRUE)
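To check that the two approaches agree, here is a sketch on an invented data frame (rain_small and its values are assumptions; the column names follow rainfallData).

```r
# Invented data: family A has one missing rainfall value
rain_small <- data.frame(
  code            = c(10, NA, 30, 50),
  language_family = c("A", "A", "B", "B")
)

# Without handling NA: family A comes out as NA
means_with_na <- tapply(rain_small$code, rain_small$language_family, mean)

# Approach 1: drop the rows with NA first
rain_complete <- rain_small[!is.na(rain_small$code), ]
means_filtered <- tapply(rain_complete$code, rain_complete$language_family, mean)

# Approach 2: let tapply pass na.rm = TRUE on to mean
means_na_rm <- tapply(rain_small$code, rain_small$language_family, mean, na.rm = TRUE)

print(means_with_na)
print(means_na_rm)
```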