In part 4 we learned how to use mean() (average), sd() (standard deviation), max() and min(). We also learned how to use tapply to group data and apply a function to each group. For example, to get the mean latitude for each language family:
tapply(d2$latitude, d2$glotto.family, mean)
We also learned a few basics about plots: setting colours (for example, setting col to a string like “green”) and saving plots to a file (pdf() and dev.off()).
In this tutorial, we’ll cover missing data. A common problem when analysing data is that not all rows will have known values; for example, the word order of a particular language may not be available.
The variable below is a vector with two missing values (“NA”). A missing value is a logical constant which represents “Not available”.
x = c(2, 5, 3, NA, 1, NA)
Task: What happens if you try to calculate the sum of this variable?
You get NA returned! This isn’t helpful, but is technically correct - there’s no defined sum for missing data.
There are two ways of dealing with this. The first is to take out the NA values by indexing x. However, you can’t test for NA with the comparison operator ==, because any comparison with NA is itself NA:
x == NA
## [1] NA NA NA NA NA NA
Instead, you can use the function is.na:
is.na(x)
## [1] FALSE FALSE FALSE TRUE FALSE TRUE
We want to keep only the values that are not NA, so we can negate the booleans with the symbol !:
# Vector of booleans that indexes non-NA data:
!is.na(x)
## [1] TRUE TRUE TRUE FALSE TRUE FALSE
# Create a new vector with only non-NA data:
x2 = x[!is.na(x)]
sum(x2)
## [1] 11
The second option is to use the optional argument na.rm (short for ‘NA removed’) in sum. This argument controls whether NA values should be ignored when calculating the sum.
sum(x, na.rm=TRUE)
## [1] 11
When using this, be sure you know what you are excluding.
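One way to check what na.rm will exclude is to count the NA values first. The sketch below does that for the same vector. It also shows that mean() and max() accept na.rm in the same way as sum(). The name n_missing is made up for this example.

```r
x <- c(2, 5, 3, NA, 1, NA)

# How many values would na.rm = TRUE throw away?
n_missing <- sum(is.na(x))
n_missing
## [1] 2

# Without na.rm, a single NA makes the whole result NA
mean(x)
## [1] NA

# With na.rm, the mean is taken over the 4 known values only
mean(x, na.rm = TRUE)
## [1] 2.75
max(x, na.rm = TRUE)
## [1] 5
```

Note that the mean is 11 divided by 4, not by 6: the NA values are dropped, not treated as zeros.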
NA values can also be problematic when indexing:
# a vector of numbers with 1 NA
x = c(1,2,NA,4)
# a vector of characters
y = c("a",'b','c','d')
# index y using x
y2 = y[x]
y2
## [1] "a" "b" NA "d"
Suddenly, y2 includes NA values!
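If the NA in the index is not wanted, one option is to filter the index with is.na before using it. This is a minimal sketch using the same two vectors; the name y3 is made up for this example.

```r
x <- c(1, 2, NA, 4)
y <- c("a", "b", "c", "d")

# Drop the NA from the index first, then use it to index y
y3 <- y[x[!is.na(x)]]
y3
## [1] "a" "b" "d"
```

Whether dropping the NA is the right thing to do depends on the analysis: y3 is now shorter than x, so the two no longer line up position by position.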
In the previous data sets we looked at, there were no missing data. Let’s load a file with some missing data. Download this file (RainfallData_all.csv) into your data folder. This is mainly the same as the previous rainfall data we used. However, it includes the full data from D-PLACE, which has missing values in some columns. Specifically, some languages in the new data have values for rainfall, but do not have a defined language family. For these languages, the language family is recorded as “NA” (missing). Some care is needed when using data like this.
Create a new script called “part5.R”. Now load the data, like before, but load the file “RainfallData_all.csv” instead of “RainfallData_complete.csv”:
rainfallData = read.csv("../data/RainfallData_all.csv", stringsAsFactors = FALSE)
This dataset has some missing values for language_family and code (code is the name of the column with the measurement of the amount of rainfall).
Task: What happens when you try to work out the mean rainfall in the whole dataset?
Task: Which languages have the rainfall data missing?
Task: How many datapoints in rainfallData$language_family and code are missing? Remember that is.na creates a vector of booleans and sum can count the number of TRUE values in a vector. And remember you can look at the answers to help you.
Asking the questions above is an important step in any analysis, to check that you’re working with the right data.
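To run this kind of check on every column at once, one option is to combine is.na with colSums. The sketch below uses a small made-up data frame called toy, with invented names and values, rather than the real rainfall data, so it does not give away the answers to the tasks.

```r
# A made-up data frame with the same column names as the rainfall data
toy <- data.frame(
  language_name   = c("A", "B", "C", "D"),
  language_family = c("Fam1", NA, "Fam2", NA),
  code            = c(1200.5, NA, 830.0, 990.2),
  stringsAsFactors = FALSE
)

# is.na on a data frame gives a table of TRUE/FALSE values;
# colSums counts the TRUEs in each column
na_counts <- colSums(is.na(toy))
na_counts

# Names of the languages whose rainfall value is missing
missing_rain <- toy$language_name[is.na(toy$code)]
missing_rain
## [1] "B"
```

Here na_counts is 0 for language_name, 2 for language_family and 1 for code.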
Missing values in fields which are used for indexing can lead to problems. For example, suppose we want to get the rainfall for all languages in the “Dravidian” language family. The code below looks like it should work. Let’s run it and see what happens.
rainfallData[rainfallData$language_family=="Dravidian", ]$code
## [1] NA NA NA NA NA NA NA NA
## [9] NA NA NA NA NA NA NA NA
## [17] NA NA NA NA NA NA NA NA
## [25] NA NA NA NA NA NA NA NA
## [33] NA NA NA NA NA 100922.0 NA NA
## [41] NA 103422.3 NA NA NA NA NA NA
## [49] NA NA NA NA NA NA NA 108118.3
## [57] 109047.8 NA NA NA 112861.6 NA NA NA
## [65] NA NA NA NA NA NA 132169.3 NA
## [73] 133062.3 NA NA 135592.9 138058.4 NA 141620.7 141620.7
## [81] NA NA NA NA NA 164923.8 NA NA
## [89] 178297.3 178391.2 179394.2 179394.2 NA 189398.7 NA 193003.6
## [97] 196914.4 NA NA 206521.6 NA NA
Hmm. There are only 20 Dravidian languages, but we get over 100 values back and most of them are NA. In fact, we get an NA returned for every NA value in language_family. This is because indexing expects TRUE/FALSE values (or numbers), but comparing NA with == gives NA rather than TRUE or FALSE, and indexing with NA returns a row of NAs.
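The same behaviour can be seen on a much smaller scale. The sketch below uses two made-up vectors, fam and rain, with invented values in place of the real columns.

```r
fam  <- c("Dravidian", NA, "Uralic", "Dravidian")
rain <- c(10, 20, 30, 40)

# Comparing with == gives NA wherever fam is NA
matches <- fam == "Dravidian"
matches
## [1] TRUE NA FALSE TRUE

# Indexing with that vector returns an NA for the NA position
picked <- rain[matches]
picked
## [1] 10 NA 40
```

The second value of rain is 20, not NA, but we still get NA back: R can't tell whether that language is Dravidian, so it can't tell whether the value belongs in the result.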
This might seem like a failing of R, but the computer is just doing what we ask. To get the result we want, we just need to be more specific.
There are two ways to handle this. Either remove the NAs in situ using is.na, or remove NAs from the data frame. The former takes up less memory, but looks messier.
# in situ:
# (find rows where language_family is not NA, AND where
# language_family is equal to "Dravidian")
rainfallData[!is.na(rainfallData$language_family) &
rainfallData$language_family=="Dravidian", ]$code
## [1] 100922.0 103422.3 108118.3 109047.8 112861.6 132169.3 133062.3 135592.9
## [9] 138058.4 141620.7 141620.7 164923.8 178297.3 178391.2 179394.2 179394.2
## [17] 189398.7 193003.6 196914.4 206521.6
# remove NAs:
# Create new dataframe rainfallData2 which has no
# missing values for language_family.
rainfallData2 <- rainfallData[!is.na(rainfallData$language_family),]
# Now get the 'code' column from this new dataframe
rainfallData2[rainfallData2$language_family=="Dravidian",]$code
## [1] 100922.0 103422.3 108118.3 109047.8 112861.6 132169.3 133062.3 135592.9
## [9] 138058.4 141620.7 141620.7 164923.8 178297.3 178391.2 179394.2 179394.2
## [17] 189398.7 193003.6 196914.4 206521.6
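Two other base R idioms give the same result without a separate is.na test: wrapping the comparison in which(), or using %in% instead of ==. This is a sketch on a small made-up data frame called toy, with invented values, not the real rainfall data.

```r
toy <- data.frame(
  language_family = c("Dravidian", NA, "Uralic", "Dravidian"),
  code            = c(100, 200, 300, 400),
  stringsAsFactors = FALSE
)

# which() returns only the positions that are TRUE, so NAs are skipped
via_which <- toy$code[which(toy$language_family == "Dravidian")]
via_which
## [1] 100 400

# %in% returns FALSE (not NA) when the value being tested is NA
via_in <- toy$code[toy$language_family %in% "Dravidian"]
via_in
## [1] 100 400
```

These are shorter to write, but they hide the fact that NAs are being skipped, so the is.na version may be clearer to someone reading the code later.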
Another strategy is to change the NAs to an explicit value. The code below sets any language_family values that are NA to be “Isolate”.
missing <- is.na(rainfallData$language_family)
rainfallData[missing,]$language_family <- "Isolate"
However, this will make it look like all isolates are part of the same family. In this case, I usually set the language family to be the language name (so that each language is its own family):
missing <- is.na(rainfallData$language_family)
rainfallData[missing,]$language_family <-
rainfallData[missing,]$language_name
Note we have to index both the list we want to change and the source we want to take from, so that the rows line up.
Task: Work out the mean rainfall for each language family in the data. There are a few ways to do this, and none is “better” than any other, as long as the code is clear.
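As a hint for one possible approach, tapply passes any extra arguments on to the function it applies, so na.rm can be given after the function name. The sketch below uses a small made-up data frame called toy, with invented values, not the real rainfall data.

```r
toy <- data.frame(
  language_family = c("Fam1", "Fam1", "Fam2", NA),
  code            = c(100, NA, 300, 400),
  stringsAsFactors = FALSE
)

# na.rm = TRUE is passed on to mean() for each group
family_means <- tapply(toy$code, toy$language_family, mean, na.rm = TRUE)
family_means
# Fam1 is 100 and Fam2 is 300
```

Note that the row with an NA language family does not appear in the result at all: tapply leaves out rows whose group is NA. This is another case where it pays to know what is being excluded.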
Sometimes, calculations can lead to unexpected results. For example, if we divide a number by zero:
100 / 2
## [1] 50
100 / 0
## [1] Inf
We get Inf, which is a special value in R that represents infinity. This is technically correct, but can be confusing if we weren’t expecting zeros in our data. For example, the mean of a vector of ordinary numbers that contains an Inf value is still Inf. Check that each line of code works before moving on, and if you find Inf in your data, check whether any calculations are dividing by zero.
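R has functions for finding these values: is.infinite picks out Inf and -Inf, and is.finite picks out ordinary numbers. This is a minimal sketch with made-up numbers; the name ratio is invented for this example.

```r
# Dividing element by element; the second divisor is zero
ratio <- c(10, 20, 30) / c(2, 0, 5)
ratio
## [1] 5 Inf 6

# Which values are infinite?
is.infinite(ratio)
## [1] FALSE TRUE FALSE

# One Inf makes the mean Inf
mean(ratio)
## [1] Inf

# Mean of the finite values only
mean(ratio[is.finite(ratio)])
## [1] 5.5
```

As with na.rm, dropping the Inf values only hides the problem. It is usually better to work out why a zero appeared in the divisor.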
Other possible results include ‘NaN’ which stands for “not a number”. This can happen when a calculation has an undefined result:
0/0
## [1] NaN
Or when your vector has no items in it:
# This works as expected:
x = c(1,4,3)
mean(x)
## [1] 2.666667
# Now get the mean of the numbers in x that are above 10:
mean(x[x>10])
## [1] NaN
There are no numbers above 10, so in the second case mean() is given an empty vector. There is no defined ‘mean’ for this case.
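Two checks can help here: length() shows whether a selection is empty before a calculation is run on it, and is.nan tests for NaN specifically. The sketch below uses the same x, plus a made-up vector called vals. One thing to watch is that is.na is TRUE for NaN as well as for NA, while is.nan is TRUE only for NaN.

```r
x <- c(1, 4, 3)

# Check how many values were selected before taking the mean
big <- x[x > 10]
length(big)
## [1] 0

# is.na treats NaN as missing; is.nan does not treat NA as NaN
vals <- c(1, NA, NaN)
is.na(vals)
## [1] FALSE TRUE TRUE
is.nan(vals)
## [1] FALSE FALSE TRUE
```

This means that na.rm = TRUE also removes NaN values, which may or may not be what you want.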
You now know enough to get this joke:
x = c(1,2)
c(x[3:18], "Batman")
If you don’t get the joke, try importing 60s American TV shows into your brain.