Task: Can you apply the functions sum and length to boolean vectors?
Yes. sum counts the number of TRUE values, and length returns the total number of items.
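As a small illustration (the vector flags below is made up for this sketch and is not part of the tutorial data), the two functions behave like this on a boolean vector:

```r
# Hypothetical boolean vector, for illustration only
flags <- c(TRUE, FALSE, TRUE, TRUE)

nTrue  <- sum(flags)     # TRUE is treated as 1 and FALSE as 0, so this counts the TRUE values
nItems <- length(flags)  # counts every item, whether TRUE or FALSE
propTrue <- mean(flags)  # the proportion of TRUE values

print(nTrue)     # 3
print(nItems)    # 4
print(propTrue)  # 0.75
```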
Task: Make a list of booleans where each item is TRUE if the corresponding item of nums is less than or equal to 3.
nums <= 3
Task: Make a list of booleans where each item is TRUE if the corresponding item of nums is either 5 or 1.
nums %in% c(5,1)
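It may help to see why %in% is used here rather than ==. The sketch below uses made-up values for nums (an assumption, not the tutorial's own vector): == compares item by item and recycles the shorter vector, whereas %in% tests membership.

```r
# Hypothetical values for nums, for illustration only
nums <- c(1, 5, 7, 5, 1, 2)

# Membership test: is each item one of 5 or 1?
byIn <- nums %in% c(5, 1)
print(byIn)   # TRUE TRUE FALSE TRUE TRUE FALSE

# Item-by-item comparison: c(5, 1) is recycled to 5, 1, 5, 1, 5, 1,
# so nums[1] is compared with 5, nums[2] with 1, and so on
byEq <- nums == c(5, 1)
print(byEq)   # FALSE FALSE FALSE FALSE FALSE FALSE

# An equivalent, more explicit way to write the membership test
byOr <- nums == 5 | nums == 1
print(identical(byIn, byOr))  # TRUE
```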
Task: Using y to index nums worked because the lengths of the two vectors are the same. What happens if you make another variable y2 which only has 3 boolean items, and try to index nums using that?
y2 = c(TRUE, FALSE, TRUE)
nums[y2]
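What happens is that R silently recycles the shorter boolean vector until it is as long as the vector being indexed. A minimal sketch, assuming a made-up six-item nums (the tutorial's own values may differ):

```r
# Hypothetical values for nums, for illustration only
nums <- c(5, 1, 7, 3, 9, 2)
y2 <- c(TRUE, FALSE, TRUE)

# y2 is recycled to TRUE FALSE TRUE TRUE FALSE TRUE before indexing
picked <- nums[y2]
print(picked)  # 5 7 3 2

# Writing the recycling out explicitly gives the same result
y2Full <- rep(y2, length.out = length(nums))
print(identical(picked, nums[y2Full]))  # TRUE
```

No warning is given, so a boolean index of the wrong length can go unnoticed.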
Task: Find the favourite numbers of all my friends who have more than 4 characters in their name.
Hint: build this up step by step. First of all, get the number of characters in each name, then test whether this number is greater than 4. This should result in a vector of booleans. Then index nums using this vector.
nums[nchar(friends)>4]
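The steps in the hint can be written out one at a time. This is a sketch with made-up friends and nums vectors (assumptions for illustration, and nameLengths and longNames are hypothetical helper names):

```r
# Hypothetical data, for illustration only
friends <- c("Anna", "Robert", "Li", "Marta")
nums <- c(7, 3, 12, 8)

# Step 1: number of characters in each name
nameLengths <- nchar(friends)
print(nameLengths)  # 4 6 2 5

# Step 2: test whether each length is greater than 4
longNames <- nameLengths > 4
print(longNames)    # FALSE TRUE FALSE TRUE

# Step 3: index nums with the boolean vector
favourites <- nums[longNames]
print(favourites)   # 3 8
```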
Task: What happens if you type d into the console to see what’s inside the object d?
You get a lot of data printed to the screen! Very unhelpful.
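More compact ways to inspect a data frame include head, dim, names and str. A minimal sketch using a tiny stand-in for d (the values are made up; only the column names focal_year and code_label come from the tasks):

```r
# Tiny stand-in for d, for illustration only
d <- data.frame(
  focal_year = c(1800, 1900, 1860),
  code_label = c("Flat", "Dome", "Flat"),
  stringsAsFactors = FALSE
)

firstRows <- head(d, 2)  # just the first 2 rows
print(firstRows)

print(dim(d))    # number of rows, then number of columns: 3 2
print(names(d))  # the column names
str(d)           # one line per column: its name, type and first few values
```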
Task: Make a boolean vector which is TRUE if the focal_year variable is greater than 1850, and FALSE otherwise. Assign this to a variable named recentFocalYears (create a variable called recentFocalYears and store the boolean vector inside it).
recentFocalYears <- d$focal_year > 1850
Task: Make a table of counts of roof shape types for languages in the recent data. Hint: you should index the rows with the variable recentFocalYears and the code_label column.
table(d[recentFocalYears,]$code_label)
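To see the row indexing at work, here is a sketch on a tiny stand-in for d (the values are made up; only the column names focal_year and code_label come from the tasks). It also shows an equivalent form that selects the column first and then indexes it.

```r
# Tiny stand-in for d, for illustration only
d <- data.frame(
  focal_year = c(1800, 1900, 1860, 1920),
  code_label = c("Flat", "Dome", "Flat", "Flat"),
  stringsAsFactors = FALSE
)

recentFocalYears <- d$focal_year > 1850
print(recentFocalYears)  # FALSE TRUE TRUE TRUE

# Index the rows first, then take the column
counts <- table(d[recentFocalYears, ]$code_label)
print(counts)  # Dome: 1, Flat: 2

# Equivalent: take the column first, then index it
countsAlt <- table(d$code_label[recentFocalYears])
```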
Task: Load the data in data/RainfallData.csv into a data frame called rainfallData.
rainfallData <- read.csv("../data/RainfallData.csv", stringsAsFactors = FALSE)
Task: What are the names and formats of the variables in rainfallData? What is the name of the column that includes the numeric amount of rainfall?
str(rainfallData)
The code variable holds the data on rainfall.
Task: Add a line to your script to make a variable in d named flatRoof that is TRUE if the roof type is “Flat” and FALSE otherwise.
d$flatRoof <- d$roofShape == "Flat"
Task: Which basic word order type has the highest mean rainfall? Add this to your script.
rainfallByLangFam.MEAN <- tapply(d$rainfall, d$language_family, mean)
rainfallByLangFam.MEAN.sorted <- sort(rainfallByLangFam.MEAN, decreasing = TRUE)
head(rainfallByLangFam.MEAN.sorted)
Task: Make a boxplot showing the rainfall for each type of roof shape. There are lots of types, which makes the axis labels difficult to read. Can you make a boxplot for just the 4 most common roof shape types?
# Plot of all the roof shape types:
boxplot(rainfall ~ roofShape, data = d)
# Plot the 4 most common roof shapes:
# Get a table of counts for roof shape:
roofShapeCounts = table(d$roofShape)
# Sort the counts and take the top 4
top4RoofShapes = sort(roofShapeCounts, decreasing = TRUE)[1:4]
# This is a list of numbers, but we want the NAMES of this list:
top4RoofShapes = names(top4RoofShapes)
# Plot the data:
boxplot(rainfall ~ roofShape, data = d[d$roofShape %in% top4RoofShapes,])
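The count, sort, names and %in% steps above can be checked on a small made-up vector before applying them to the real data. This sketch keeps the 2 most common values rather than 4, and shapes, top2 and keep are hypothetical names:

```r
# Hypothetical roof shape values, for illustration only
shapes <- c("Flat", "Dome", "Flat", "Gabled", "Flat",
            "Dome", "Conical", "Gabled", "Gabled", "Flat")

# Count each value, sort from most to least common, and keep the names of the top 2
shapeCounts <- table(shapes)
top2 <- names(sort(shapeCounts, decreasing = TRUE)[1:2])
print(top2)  # "Flat" "Gabled"

# Boolean vector marking the items that belong to one of the top 2 shapes
keep <- shapes %in% top2
print(sum(keep))  # 7
```

The boolean vector keep plays the same role as d$roofShape %in% top4RoofShapes in the row index above.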
Task: Change the colour of the line in the abline function to green.
abline(h = 0, col = 'green')
Task: What happens if you try to calculate the sum of this variable?
You get NA returned!
Task: What happens when you try to work out the mean rainfall in the whole dataset?
mean(rainfallData$code)
You get an NA value!
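A single missing value is enough to make sum and mean return NA. A minimal sketch with a made-up rainfall vector (not the tutorial data), including the na.rm = TRUE argument that tells both functions to skip missing values:

```r
# Hypothetical rainfall values with one missing, for illustration only
rainfall <- c(10, NA, 30)

print(sum(rainfall))   # NA
print(mean(rainfall))  # NA

# na.rm = TRUE drops the NA values before calculating
totalRain <- sum(rainfall, na.rm = TRUE)
meanRain  <- mean(rainfall, na.rm = TRUE)
print(totalRain)  # 40
print(meanRain)   # 20
```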
Task: Which languages have the rainfall data missing?
Here’s how to get rows where code is NA:
rainfallData[is.na(rainfallData$code),]
Task: How many datapoints in rainfallData$language_family are missing? Remember that is.na creates a vector of booleans and sum can count the number of TRUE values in a vector.
sum(is.na(rainfallData$language_family))
sum(is.na(rainfallData$code))
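The combination of is.na and sum can be tried on a small made-up vector first (families below is an assumption for illustration, not the tutorial data):

```r
# Hypothetical vector with two missing values, for illustration only
families <- c("A", NA, "B", NA)

missing <- is.na(families)  # TRUE where the value is missing
print(missing)              # FALSE TRUE FALSE TRUE

nMissing <- sum(missing)    # count of the TRUE values
print(nMissing)             # 2

whereMissing <- which(missing)  # positions of the missing values
print(whereMissing)             # 2 4
```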
Task: Work out the mean rainfall for each language family in the data. There are a few ways to do this, and none is “better” than any other, as long as the code is clear.
If there were no NA values, then we could use a tapply function to solve this:
tapply(rainfallData$code, rainfallData$language_family, mean)
But this will produce some NA values.
There are a few ways to solve this. The first is to make a data frame with only rows where the code variable is NOT NA:
rainfallData2 = rainfallData[!is.na(rainfallData$code),]
tapply(rainfallData2$code, rainfallData2$language_family, mean)
You could also just pass na.rm=TRUE as an extra argument to tapply. tapply passes this argument on to the function it’s given, so mean() will ignore any NA values.
tapply(rainfallData$code, rainfallData$language_family, mean, na.rm=TRUE)
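The effect of passing na.rm through tapply can be seen on a small made-up example (rain and fam are assumptions for illustration, not the tutorial data):

```r
# Hypothetical rainfall values and language families, for illustration only
rain <- c(10, NA, 30, 40, 50)
fam  <- c("A", "A", "B", "B", "B")

# Without na.rm, any group containing an NA gets an NA mean
withNA <- tapply(rain, fam, mean)
print(withNA)  # A: NA, B: 40

# na.rm = TRUE is passed on to mean(), which then skips the NA values
noNA <- tapply(rain, fam, mean, na.rm = TRUE)
print(noNA)    # A: 10, B: 40
```

The argument name needs to be spelled exactly na.rm. mean() accepts extra arguments through ..., so a misspelled name is silently ignored and the NA results remain.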