1 Permutation tests: principles

The idea behind random permutation is the following: if two variables are really correlated, then mixing up the data in one of the variables (permuting the variable) should break this relationship. If they’re not really correlated, then permuting a variable should not have a big effect on the statistic. Put another way, measuring the statistic when one variable is permuted gives us an idea of the baseline level of chance.

R has a function called sample which is useful for permutation. It takes one manditory argument: the vector of items to choose from, and an optional argument of how many items to choose (default is as many as in the original vector) and whether to allow choosing the same item twice (default is no). It then returns a random selection of the items in the vector.

With the default settings, it permutes the vector

x <- c(1,2,3,4,5,6)
sample(x)
## [1] 4 2 6 1 3 5

Note that, because this is a random process, if we do it again, we get a different answer:

sample(x)
## [1] 3 6 1 4 5 2

Let’s try a simple test. If we pick two sets of random numbers, they should not be correlated:

x <- rnorm(100)
y <- rnorm(100)

cor.test(x,y)
## 
##  Pearson's product-moment correlation
## 
## data:  x and y
## t = 0.63809, df = 98, p-value = 0.5249
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1337854  0.2574879
## sample estimates:
##        cor 
## 0.06432298

They’re not significantly correlated - good! But we can also test this by permuting one of the variables:

y.permuted <- sample(y)

cor.test(x,y.permuted)
## 
##  Pearson's product-moment correlation
## 
## data:  x and y.permuted
## t = -0.853, df = 98, p-value = 0.3957
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2775853  0.1124667
## sample estimates:
##         cor 
## -0.08584779

We get a different result, but it’s still not significant. What happens if we do this many times? To do this, let’s put the code above into a new custom function. We’ll have it return just the correlation coefficient rather than the whole test.

permutationTest <- function(x,y){
  y.permuted <- sample(y)
  r = cor(x,y.permuted)
  return(r)
}

Then we’ll call this custom function 100 times using the function replicate:

permutedCorrelation <- replicate(1000, permutationTest(x,y))

Now we can look at the results using a histogram. We’ll also add a red line to mark the position of the real correlation coefficient value, so we can compare them. (Note that we’re adjusting the limits of the x axis to show more of the plot).

realCorrelation <- cor(x,y)

hist(permutedCorrelation, xlim=c(-1, 1))
abline(v=realCorrelation, col=2)

This looks like a normal distribution centered around zero. In a few cases, the correlation is quite high, but it’s quite rate. That’s what we expected! This is a representation of the likelihood of the two variables being correlated by chance.

The red line is the real correlation coefficiet - and it sits in the middle of this distribution. This means that we’re quite likely to see the real correlation value, even by chance.

We can work out a statistic to tell us how strong the relationship is. The z score measures how many standard deviations away from the mean of permuted values the true value is. We can also work out a p-value by counting the proportion of permutations that resulted in a stronger correlation coefficient than the real one.

Note that below, I’m not concerned with the direction of the correlation, so I’m just looking at the absolute size of the value (how far it is from zero), rather than positive/negative. In other words, it’s a 2-tailed test. The function abs turns a vector of numbers into a vector of absolute values.

# number of standard deviations above the mean
z.score <- (realCorrelation-mean(permutedCorrelation)) / sd(permutedCorrelation)
z.score
## [1] 0.6867762
# proportion of stronger results in permutation
p.value <- sum(abs(permutedCorrelation) >= abs(realCorrelation)) / length(permutedCorrelation)
p.value
## [1] 0.526

That is, the strength of the correlation is very weak (z = 0.6867762, p = 0.526).

1.1 Correlated variables

Let’s do another test - this time with two variables that are correlated. We’ll do this by picking random numbers for x2, then copying them, slightly altered for y2:

x2 <- rnorm(100)
y2 <- jitter(x2, amount=2)
plot(x2,y2)

cor.test(x2,y2)
## 
##  Pearson's product-moment correlation
## 
## data:  x2 and y2
## t = 9.2149, df = 98, p-value = 6.123e-15
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5598495 0.7741578
## sample estimates:
##       cor 
## 0.6813439

Next we’ll work out the real correlation coefficient, and then the permutation test.

realCorrelation2 <- cor(x2,y2)
permutedCorrelation2 <- replicate(1000, permutationTest(x2,y2))

Now we can look at the results using a histogram, but we’ll also add a red line to mark the position of the real correlation coefficient value, so we can compare them. (Note that we’re adjusting the limits of the x axis to show more of the plot).

hist(permutedCorrelation2, xlim=c(-1, 1))
abline(v=realCorrelation2, col=2)

The real value is clearly outside the distribution of permuted values, which suggests that the two variables are correlated to a greater extent than we’d expect by chance. Here are the statistics:

z.score2 <- (realCorrelation2-mean(permutedCorrelation2)) / sd(permutedCorrelation2)
z.score2
## [1] 6.833018
# proportion of stronger results in permutation
p.value2 <- sum(abs(permutedCorrelation2) >= abs(realCorrelation2)) / length(permutedCorrelation2)
p.value2
## [1] 0

The p-value is zero, but this is not really a good description of the test. What we found was that, out of 1000 permutations, none were as strong as the real value. So, we could say that the p-value is lower than 1/1000, i.e. we would report “p < 0.001”. The greater the number of permutations, the lower this p value can become.

1.2 Summary

We ran some permutation tests above. The basic steps were:

  • Measure the real strength of the relationship between our variables
  • Permute one of the variables to try to break this relationship
  • Measure the strength of the relationship between the first variable and the permuted variable
  • Do the permutation many times, and get a distribution of permuted strengths
  • Compare the real strength of the relationship with the distribution of permuted strengths

Look again at the results for the uncorrelated variables:

  • Pearson test t = 0.6380864, p = 0.5249057
  • Permutation test z = 0.6867762, p = 0.526

Note that the measures are very similar. In fact, this is because the correlation test works on essentially the same principle - calculating the likelihood of observing a correlation as strong as the real one by chance. We used 1000 permutations for the test above. As the number of permutations increases, the similarity between the two tests should converge. In this case the similarity is also due to the fact that the variables were normally distributed, which is one of the assumptions of the Pearson correlation test. The advantage of the permutation test is that it does not require that the variables are normally distributed.

Another strength of the permutation test is that it can be modified to look at different baselines, for example permuting only within a linguistic area or family. We’ll try this in the next sections …

2 An example with real data: testing differences between groups

In this tutorial, we’ll investigate a correlation in our data. The hypothesis is the following:

Societies from places with more rainfall will be less likely to have flat rooves than societies from places with less rainfall.

This is not very well based in theory, but the aim is to demonstrate some principles of how permutation works.

First we load and merge our data

# set working directory
setwd("~/Documents/Teaching/CARDIFF/RCourse/IntroToRForAnthropologists/analysis/")

#Load the roof data
d = read.csv("../data/RoofShapeData.csv", stringsAsFactors = F)
# Creae a variable called 'roofShape'
# (no space so we can tell the difference in a list)
d$roofShape = "NotFlat"
d[d$code_label=="Flat",]$roofShape = "Flat"

# Load rainfall data
rainfallData = read.csv("../data/RainfallData_all.csv",stringsAsFactors = F)

# Add the rainfall data into the main data source
d$rainfall <- rainfallData[match(d$society_id, rainfallData$society_id) , ]$code
# get rid of NA values
d = d[!is.na(d$rainfall),]
# Set isolates to have language family of their own
d[d$language_family=="",]$language_family =
  d[d$language_family=="",]$language_name

# Load society data
societyData = read.csv("../data/societies.csv",stringsAsFactors = F)

# Add the geographic coordinates
d$longitude <- societyData[match(d$society_id, societyData$id) , ]$Long
d$latitude <- societyData[match(d$society_id, societyData$id) , ]$Lat

Next we have to convert the roofShape variable to a factor. This is a data type which recognises that the items represent discrete categories, rather than continuous numbers or a string of characters.

# convert to factors
d$roofShape = factor(d$roofShape)
head(d$roofShape)
## [1] NotFlat NotFlat NotFlat NotFlat NotFlat NotFlat
## Levels: Flat NotFlat

So, do flat rooves get more rainfall?

t.test(d$rainfall~d$roofShape)
## 
##  Welch Two Sample t-test
## 
## data:  d$rainfall by d$roofShape
## t = -12.869, df = 156.49, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -67327.91 -49409.79
## sample estimates:
##    mean in group Flat mean in group NotFlat 
##              58042.54             116411.39

The pattern certainly seems strong - but we can’t see whether the languages are related. Indeed, when we draw a map of the basic word order data, we notice that there are large geogrpahic areas of similar types:

We can use several methods to try and control for the geographic relatedness of languages.

2.1 Permutation tests

Recall that a basic step in the permutation test is measuring the strength of the relationship between two variables. In the section above with two continuous variables, we used the correlation coefficient. In this section, our prediction is that there is a difference in the means of the two groups.

So let’s measure the real difference in means between flat and non-flat rooved socieitis:

# function that takes a vector of numbers and a vector of what group each number belongs to, and works out the difference in means between them.
getDifferenceInMeans <- function(x,group){
  diff(tapply(x, group, mean))
}

realDifferenceInMeans <- getDifferenceInMeans(d$rainfall, d$roofShape)

Now, in a similar manner as above, we can make a function that permutes the basic word order between languages:

permTest <- function(x,group){
  # permute group
  group <- sample(group)
  # work out difference in means
  permutedDifferenceInMeans <-getDifferenceInMeans(x,group)
  # return the value
  return(permutedDifferenceInMeans)
}

And now run this many times, and look at the results:

res <- replicate(1000, permTest(d$rainfall,d$roofShape))

hist(res, xlim=c(-30000,60000))
abline(v= realDifferenceInMeans, col=2)

# Z value
(realDifferenceInMeans-mean(res)) / sd(res)
##  NotFlat 
## 7.791489
# P value
sum(res >= realDifferenceInMeans) / length(res)
## [1] 0

So, according to this test, the relationship is still strong. But now let’s try controlling for geogrpahic area. To do this, we’ll allow data to be permuted only within areas.

2.2 Using tapply to do structured permutation

This can be done using the function tapply, which groups data before applying a function to each group. Let’s say we have a vector of letters, and another vector saying whether they’re a consonant or a vowel:

myLetters   <-      c('x','z','a','d','e','o')
vowelOrConsonant <- c('c','c','v','c','v','v') 

We can permute myLetters 4 times - note that vowels and consonants are completely mixed up:

sample(myLetters)
## [1] "o" "z" "e" "a" "d" "x"
sample(myLetters)
## [1] "d" "a" "x" "z" "o" "e"
sample(myLetters)
## [1] "d" "a" "o" "z" "e" "x"
sample(myLetters)
## [1] "d" "o" "a" "z" "e" "x"

Next, we can use tapply to first group letters by type, then sample within each group:

tapply(myLetters,vowelOrConsonant, sample)
## $c
## [1] "z" "d" "x"
## 
## $v
## [1] "a" "o" "e"

This command returns a list, but we’d like it to return a vector, so we can use the function unlist. Let’s do it several times and look at the results:

unlist(tapply(myLetters,vowelOrConsonant, sample))
##  c1  c2  c3  v1  v2  v3 
## "x" "d" "z" "o" "e" "a"
unlist(tapply(myLetters,vowelOrConsonant, sample))
##  c1  c2  c3  v1  v2  v3 
## "x" "d" "z" "o" "e" "a"
unlist(tapply(myLetters,vowelOrConsonant, sample))
##  c1  c2  c3  v1  v2  v3 
## "d" "z" "x" "a" "e" "o"
unlist(tapply(myLetters,vowelOrConsonant, sample))
##  c1  c2  c3  v1  v2  v3 
## "x" "z" "d" "e" "a" "o"

Now note that the vowels can only trade places with anoter vowel. Note also that the resulting vector sorts the data so that the consonants all come first.

2.3 Controlling for language family

We can now write a function which takes a vector of numbers, a vector indicating which group they belong to and a vector saying which language family they belong to, and run a structured permutation test. (Maybe what we really want to do is group by climate region, but you get the idea).

Before we proceed, there’s a slight hiccup to get over: In the data we’re using, there are some language families with just one society. Unfortuantely, the function sample() behaves weirdly if you give it a vector with just one number. In this case, it returns a sample of numbers from 1 to the number given. So we have to make a new function that does what we want. We’ll call this sampleSimple:

sampleSimple = function(X,...){
  if(length(X)==1){
    return(X)
  } else{
    return(sample(X,...))
  }
}

The function below is similar to the permutation function above, except it uses the tapply trick to permute only within families Note also that the numeric vector has to be sorted by family.

permuteWithinFamilies <- function(x,group,family){
  # permute number
  x2 <- unlist(tapply(x,family,sampleSimple))
  # put group into right order
  group2 <- unlist(tapply(group,family,sampleSimple))
  permutedDifferenceInMeans <-getDifferenceInMeans(x2,group2)
  return(permutedDifferenceInMeans)
}

resFamily <- replicate(1000, 
          permuteWithinFamilies(d$rainfall,
                             d$roofShape,
                             d$language_family))

hist(resFamily, xlim=c(-30000,60000))
abline(v= realDifferenceInMeans, col=2)

Note how the distribution has shifted to the right - with the test above, we’re much more likely to see higher correlation values. The reason for the difference is that we’re using a different baseline to represent random chance.

# Z value
(realDifferenceInMeans-mean(resFamily)) / sd(resFamily)
##  NotFlat 
## 6.691929
# P value
sum(resFamily >= realDifferenceInMeans) / length(resFamily)
## [1] 0

The result is still significant, but the effect size is smaller. This suggests that there may be some geographic or historical signal in the data.

3 Independent samples

An extreme way of ensuring that your data is not affected by historical or geographical relations is only to run the test with data that you know are not related. That is, to use a sample of independent data points. For example, we could choose a random language from each geographic area for each basic word order type.

The function sample has an optional parameter “size”, which defines the number of random selections to return. We can pass this to sample inside the tapply function to pick just one item from each group. (the function sampleSimple passes these arguments on to sample)

# Set up some variables
x <- d$rainfall
group <- d$roofShape
family <- d$language_family

# split the numbers by group
g1 = x[group=="Flat"]
g2 = x[group=="NotFlat"]

# split the families by group
a1 = family[group=="Flat"]
a2 = family[group=="NotFlat"]

# pick one number from each family for group 1
indepSample1= tapply(g1,a1, sampleSimple, size=1)
# pick one number from each family for group 2
indepSample2= tapply(g2,a2, sampleSimple, size=1)

# work out the difference in means
indepDiff = mean(indepSample2) - mean(indepSample1)

indepDiff
## [1] 52676.25

So, this tells us that the difference between Flat and Non-flat rooves from a random independent sample is 5.267625310^{4}. That’s more than zero, but is it significantly more than zero? What would happen if we had chosen a different random selection?

We can answer these questions by running the code above many times. First, we put the code into a custom function, then repliate it many times:

indepSample <- function(x,group,family){
  # work out what the two groups are
  groupNames = sort(unique(group))
  
  # split the x variable into the two groups
  g1 = x[group==groupNames[1]]
  g2 = x[group==groupNames[2]]
  
  # split the family vairable into the two groups
  a1 = family[group==groupNames[1]]
  a2 = family[group==groupNames[2]]
  
  # choose independent samples from both groups
  indepSample1= tapply(g1,a1, sampleSimple, size=1)
  indepSample2= tapply(g2,a2, sampleSimple, size=1)
  
  # work out difference in means
  indepDiff = mean(indepSample2) - mean(indepSample1)
  return(indepDiff)
}

res = replicate(1000, 
          indepSample(d$rainfall, d$roofShape, d$language_family))

Let’s look at the results, and determine what proportion of independent samples resulted in a difference less than zero. What we’re seeing here is the difference in means, where a positive value means that the Flat rooves have a higher mean rainfall than Non-flat rooves.

hist(res)
abline(v= 0, col=2)

# P value
p.value <- sum(res < 0) / length(res)
p.value
## [1] 0

The p value is 0 (or < 1/1000) - so quite significant. This suggests that the correlation between rainfall and roof shape may not be explained away by Galton’s problem.

However, it’s worth noting that this test is very conservative - it throws away a lot of information and runs tests on a much smaller sample of the data. It’s possible that this would lead to reduced statistical power, and an inability to detect relationships that really were there.

3.1 Combining permutation and independent samples

In the tests above, we took one society from each family where we could. But this results in an imbalance in the sample sizes - there are more areas with non-flat rooves than with flat rooves. Larger samples can include more extreme measures by chance, which could affect our result. We could control for this by only selecting the same number of observations for each group.

In addition, we were assuming that the difference between indpendent samples by chance was zero.

We could address these shortcomings by combining independent samples and random permutation. The idea is to take indpendent samples, then run a mini permutation test on them.

In the code below, I add a function which takes indpendent samples from each family for each group, but then makes sure both samples are the same size by throwing out data from the larger group. The data I throw out is picked randomly.

In addition, I don’t just return the difference between the means of the two samples. When I’ve picked my two samples, I permute one of them and then return the difference between the real difference and the permuted difference.

Phew! That’s hard to explain in words. Here’s the code:

indepSample.equalSizes <- function(x,group,family){
  # work out what the two groups are
  groupNames = unique(group)
  
  # split the x variable into the two groups
  g1 = x[group==groupNames[1]]
  g2 = x[group==groupNames[2]]
  
  # split the family vairable into the two groups
  a1 = family[group==groupNames[1]]
  a2 = family[group==groupNames[2]]
  
  # choose independent samples from both groups
  indepSample1= tapply(g1,a1, sampleSimple, size=1)
  indepSample2= tapply(g2,a2, sampleSimple, size=1)
  
  # work out how many samples are in the smallest group
  chooseNum <- min(length(indepSample1),length(indepSample2))
  
  # randomly pick items from each group, so they have the same size
  indepSample1 <- sampleSimple(indepSample1, size=chooseNum)
  indepSample2 <- sampleSimple(indepSample2, size=chooseNum)
  
  # work out the difference in means
  indepDiff = mean(indepSample2) - mean(indepSample1)
  
  ### Permutation:
  
  # permute membership between samples by
  # 1) joining them together and permuting 
  permutedSamples1and2 <- sampleSimple(c(indepSample1,indepSample2))
  # 2) splitting them into 2 new groups
  permSample1 <-permutedSamples1and2[1:chooseNum]
  permSample2 <-permutedSamples1and2[(chooseNum+1):length(permutedSamples1and2)]
  
  # work out the difference in means for the permuted group
  permDiff = mean(permSample2) - mean(permSample1)
  
  # return difference between real independent samples and permuted independent samples
  return(indepDiff - permDiff)
}

If the function above returns a value greater than zero, that indicates that the relationship is stronger for the real data than for the permuted data.

Now we can run this many times and look at the results.

res = replicate(1000, 
        indepSample.equalSizes(d$rainfall,
                               d$roofShape,
                               d$language_family))
hist(res)
abline(v= 0, col=2)

# P value
sum(res >= 0) / length(res)
## [1] 0.009

The distribution is much closer to zero. The p-value higher (approaching p = 0.01), but still significant. However, the effect is now actually leaning the other way!

Comparing this result to the results above, we might conclude that part of the correlation between roof shape and rainfall is driven by cultural history and partly by an imbalance in sample size (more societies with non-flat rooves giving a wider range of rainfall).

This technique is discussed in this supporting information of this paper.

Everett, Blasí & Roberts (2016) Language evolution and climate: the case of desiccation and tone. Journal of Language Evolution Jan 2016, 1 (1) 33-46;

4 Mixed effects modelling

Another method for controlling for language relatedness is mixed effects modelling. The theory of how mixed effects modelling works is beyond the scope of this tutorial, but below I show how it might be done, to compare with the results above. Note that, in comparison to the indpendent sample method, this uses all the data, and so has greater statistical power. It’s also easier to test a range of constraints than the permutation method.

First, we load the lme4 library:

library(lme4)
## Loading required package: Matrix

Next, we make two models predicting the probability of a language having SOV order. One just tries to predict the overall probability, with a seperate probability for each area. The second adds the adposition order as a fixed effect.

If the second model provides a better fit to the data, then that suggests that basic word order does help predict name length.

library(lme4)
# Center the rainfall variable around 0:
d$rainfall.scaled = d$rainfall-mean(d$rainfall)

# Model 0 - null model with random effects for family
m0 = lmer(rainfall.scaled  ~ 1 + 
            (1 +roofShape | language_family), data=d)

# Model 1 - model with fixed effect for roof shape
m1 = lmer(rainfall.scaled ~ 1 + roofShape + 
            (1 + roofShape | language_family), data=d)

# Peek at Model 1
summary(m1)
## Linear mixed model fit by REML ['lmerMod']
## Formula: rainfall.scaled ~ 1 + roofShape + (1 + roofShape | language_family)
##    Data: d
## 
## REML criterion at convergence: 27612.8
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0587 -0.5319 -0.1002  0.5032  4.8941 
## 
## Random effects:
##  Groups          Name             Variance  Std.Dev. Corr
##  language_family (Intercept)      7.128e+08 26699        
##                  roofShapeNotFlat 2.350e+09 48472    0.53
##  Residual                         1.839e+09 42885        
## Number of obs: 1132, groups:  language_family, 155
## 
## Fixed effects:
##                  Estimate Std. Error t value
## (Intercept)        -42924       7328  -5.858
## roofShapeNotFlat    54802       8530   6.425
## 
## Correlation of Fixed Effects:
##             (Intr)
## rofShpNtFlt -0.713
# Has the model fit improved when adding adposition order?
anova(m0,m1)
## refitting model(s) with ML (instead of REML)
## Data: d
## Models:
## m0: rainfall.scaled ~ 1 + (1 + roofShape | language_family)
## m1: rainfall.scaled ~ 1 + roofShape + (1 + roofShape | language_family)
##    Df   AIC   BIC logLik deviance  Chisq Chi Df Pr(>Chisq)    
## m0  5 27682 27707 -13836    27672                             
## m1  6 27664 27694 -13826    27652 20.261      1  6.757e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The second model does improve the fit of the data. This suggests that the relationship between roof shape and rainfall is robust.

Here’s a plot of the random effects, which show that for some families, the relationship goes in the opposite direction to the hypothesised prediction.

library(lattice)
dotplot(ranef(m1))
## $language_family

If we look at the variance explained by the fixed effects, however, we see that the effect size (R2m) is reasonably small compared to the whole model (R2c):

library(MuMIn)
r.squaredGLMM(m1)
## Warning: 'r.squaredGLMM' now calculates a revised statistic. See the help page.
##             R2m       R2c
## [1,] 0.03729768 0.7031751

There are many more tests and checks that we should carry out in a real mixed effects analysis. The main point here is that we can apply a range of tests to the relationships in the data. Each of these has its own assumptions and limitations. But together, they might help us understand the data better.