1 Recap

In part 4 we learned:

  • How to apply a function to a column of a data frame to get some basic statistics like mean() (average), sd() (standard deviation), max(), min().
  • How to apply a t-test to two different samples of numbers.
  • How to make a linear regression model.
  • How to create our own functions.
  • How to use tapply to group data and apply a function to each group. For example, to get the mean latitude for each language family:
tapply(d2$latitude, d2$glotto.family, mean)

We also learned a few basics about plots:

  • How to make basic plots (figures with visual summaries of the data).
  • How colours work in R (usually by setting the argument col to a string like “green”).
  • How to write a plot to a file on your computer (putting the plot code between a call to pdf() and dev.off()).

2 Missing values: NA

In this tutorial, we’ll cover missing data. A common problem when analysing data is dealing with missing data. Not all rows in your data will have known values, for example the word order of a particular language may not be available.

The variable below is a vector with two missing values (“NA”). A missing value is a logical constant which represents “Not available”.

x = c(2, 5, 3, NA, 1, NA)

Task: What happens if you try to calculate the sum of this variable?

You get NA returned! This isn’t helpful, but is technically correct - there’s no defined sum for missing data.

There are two ways of dealing with this. First, take out NA values, and we can do this by indexing x. However, you can’t use the boolean operators with NAs:

x == NA
## [1] NA NA NA NA NA NA

Instead, you can use the function is.na:

is.na(x)
## [1] FALSE FALSE FALSE  TRUE FALSE  TRUE

We want to keep only the values that are not NA, so we can negate the booleans with the symbol !:

# Vector of booleans that indexes non-NA data:
!is.na(x)
## [1]  TRUE  TRUE  TRUE FALSE  TRUE FALSE
# Create a new vector with only non-NA data:
x2 = x[!is.na(x)]
sum(x2)
## [1] 11

The second option is to use the optional argument na.rm (short for ‘NA removed’) in sum. This argument controls whether NA values should be ignored when calcualting the sum.

sum(x, na.rm=TRUE)
## [1] 11

When using this, be sure you know what you are excluding.

NA values can also be problematic when indexing:

# a vector of numbers with 1 NA
x = c(1,2,NA,4)
# a vector of characters
y = c("a",'b','c','d')
# index y using x
y2 = y[x]
y2
## [1] "a" "b" NA  "d"

Suddenly, y2 includes NA values!

3 Data with NA values

In the previous data sets we looked at, there were no missing data. Let’s load a file with some missing data. Download RainfallData_all.csv this file (RainfallData_all.csv) into your data folder. This is mainly the same as the previous rainfall data we used. However, this data includes the full data from D-PLACE, which includes some missing data for some columns. Specifically, some languages in the new data have values for rainfall, but do not have a defined language family. For these languages, the language family is defined as “NA” (missing). Some care is needed when using data like this.

Create a new script called “part5.R”. Now load the data, like before, but load the file “RainfallData_all.csv” instead of “RainfallData_complete.csv”:

rainfallData = read.csv("../data/RainfallData_all.csv", stringsAsFactors = F)

This dataset has some missing values for language_family and code (code is the name of the column with the measurement of the amount of rainfall).

Task What happens when you try to work out the mean rainfall in the whole dataset?

Task Which languages have the rainfall data missing?

Task: How many datapoints in rainfallData$language_family and code are missing? Remember that is.na creates a vector of booleans and sum can count the number of TRUE values in a vector. And remember you can look at the answers to help you.

Asking the questions above is an important step in any analysis, to check that you’re working with the right data.

Missing values in fields which are used for indexing can lead to problems. For example, suppose we want to get the rainfall for all languages in the “Indo-European” language family. The code below looks like it should work. Let’s run it and see what happens.

rainfallData[rainfallData$language_family=="Dravidian", ]$code
##   [1]       NA       NA       NA       NA       NA       NA       NA       NA
##   [9]       NA       NA       NA       NA       NA       NA       NA       NA
##  [17]       NA       NA       NA       NA       NA       NA       NA       NA
##  [25]       NA       NA       NA       NA       NA       NA       NA       NA
##  [33]       NA       NA       NA       NA       NA 100922.0       NA       NA
##  [41]       NA 103422.3       NA       NA       NA       NA       NA       NA
##  [49]       NA       NA       NA       NA       NA       NA       NA 108118.3
##  [57] 109047.8       NA       NA       NA 112861.6       NA       NA       NA
##  [65]       NA       NA       NA       NA       NA       NA 132169.3       NA
##  [73] 133062.3       NA       NA 135592.9 138058.4       NA 141620.7 141620.7
##  [81]       NA       NA       NA       NA       NA 164923.8       NA       NA
##  [89] 178297.3 178391.2 179394.2 179394.2       NA 189398.7       NA 193003.6
##  [97] 196914.4       NA       NA 206521.6       NA       NA

Hmm. There are only 20 Dravidian languages, but we get over 100 values back and most of them are NA. In fact, we get an NA returned for every NA value in language_family. This is because indexing expects true/false (or a number), and NAs produce something else, so we get this strange behaviour.

This might seem like a failing or R, but the computer is just doing what we ask. To get the result we want, we just need to be more specific.

There are two ways to handle this. Either remove the NAs in situ using is.na, or remove NAs from the data frame. The former takes up less memory, but looks messier.

# in situ:
# (find rows where language_family is not NA, AND where
#    langauge_family is equal to "Dravidian")
rainfallData[!is.na(rainfallData$language_family) &
               rainfallData$language_family=="Dravidian", ]$code
##  [1] 100922.0 103422.3 108118.3 109047.8 112861.6 132169.3 133062.3 135592.9
##  [9] 138058.4 141620.7 141620.7 164923.8 178297.3 178391.2 179394.2 179394.2
## [17] 189398.7 193003.6 196914.4 206521.6
# remove NAs:
# Create new dataframe rainfallData2 which has no 
#   missing values for language_family.
rainfallData2 <- rainfallData[!is.na(rainfallData$language_family),]
# Now get the 'code' column from this new dataframe
rainfallData2[rainfallData2$language_family=="Dravidian",]$code
##  [1] 100922.0 103422.3 108118.3 109047.8 112861.6 132169.3 133062.3 135592.9
##  [9] 138058.4 141620.7 141620.7 164923.8 178297.3 178391.2 179394.2 179394.2
## [17] 189398.7 193003.6 196914.4 206521.6

Another strategy is to change the NAs to an explicit value. The code below sets any language_family values that are NA to be “Isolate”.

missing <- is.na(rainfallData$language_family)
rainfallData[missing,]$language_family <- "Isolate"

However, this will make it look like all isolates are part of the same family. In this case, I usually set the language famiy just to be the language name (the language is part of its own family):

missing <- is.na(rainfallData$language_family)
rainfallData[missing,]$language_family <-
    rainfallData[missing,]$language_name

Note we have to index both the list we want to change and the source we want to take from, so that the rows line up.

Task: Work out the mean rainfall for each language family in the data. There are a few ways to do this, and none is “better” than any other, as long as the code is clear.

4 Inf

Sometimes, calculatings can lead to unexpected results. For example, if we divide a number by zero:

100 / 2
## [1] 50
100 / 0
## [1] Inf

We get Inf - which is a special R object which represents infinity. This is technically correct, but can be confusing if we weren’t expecting zeros in our data. For example, calculating the mean of anything with an Inf value is still Inf. Check that each line of code works before moving on, and if you find Inf in your data, check whether any calculations are dividing by zero.

5 NaN

Other possible results include ‘NaN’ which stands for “not a number”. This can happen when a calculation has an undefined result:

0/0
## [1] NaN

Or your vector has no items in it:

# This works as expected:
x = c(1,4,3)
mean(x)
## [1] 2.666667
# Now get mean of numbers in X if the numbers are above 10:
mean(x[x>10])
## [1] NaN

There are no numbers above 10, so in the second case mean() is given an empty vector. There is no defined ‘mean’ for this case.

You now know enough to get this joke:

x = c(1,2)
c(x[3:18], "Batman")

If you don’t get the joke, try importing 60s American TV shows into your brain.


Go to the next tutorial

Back to the index