We’d like to make a nexus file from some anthropological data, so we can use it in programs like Mesquite and Splitstree. The basic question is whether there is co-evolution of roof shape and roofing materials.
The easiest way to see the structure of a nexus file is to look at an example. You can open a nexus file in a good text editor. Here’s a very simple nexus file for 5 languages, storing two variables of binary data.
#NEXUS
BEGIN TAXA;
TITLE Taxa;
DIMENSIONS NTAX=5;
TAXLABELS
Welsh
Irish
Breton
Cornish
Manx
;
END;
BEGIN CHARACTERS;
TITLE Character_Matrix;
LINK TAXA = Taxa;
DIMENSIONS NCHAR=2;
FORMAT DATATYPE = STANDARD GAP = - MISSING = ? SYMBOLS = " 0 1";
MATRIX
Welsh 1 1
Irish 1 0
Breton 0 1
Cornish 1 1
Manx 0 0
;
This data has:
We’d like to load some data in R, then save it to a file format like this. For the csv format, we could just use the ‘write.csv’ function, but there’s no no built-in function to make a nexus file. We can make our own!
First, let’s load the roof shape data. If you don’t have this downloaded already, the roof shape data comes from D-PLACE, but you can download it from here.
We’ll also be using a similar dataset on roofing materials. Download the file here
Add this to a new script file (you may need to set the working directory):
# Load data
d <- read.csv("../data/RoofShapeData.csv", stringsAsFactors = F)
roofMat = read.csv("../data/RoofingMaterials.csv",stringsAsFactors = F)
# roof shape variable for clarity
d$roofShape = d$code_label
# Add roof material data to d
d$roofMaterial = roofMat[match(d$society_id,roofMat$society_id),]$code_label
# Make binary variables
d$flat_roof
## NULL
Let’s quickly visualise the raw data. The ‘trees’ below are just clusters in the raw data - these will be heavily skewed by the phylogentic signal, but it’s at least an easy way to visualise the raw data:
heatmap(table(d$roofMaterial,d$roofShape),cexCol = 0.7)
For this analysis, we’ll be concentrating on data from societis that speak languages from the Afro-Asiatic family. To make things simple, we’ll look at whether societies have a flat roof, and whether they use earth turf, grass or thatch compared to some other material (let’s say we have a theoretical reason to make this category). It’s possible to do phylogenetic analyses with more than just binary categories, and there are probably more sensible categories, but for illustration this will serve.
# Select only Atlantic Congo data
data = d[d$language_family=="Afro-Asiatic",]
# Make boolean variables
data$flatRoof = data$roofShape=="Flat"
# Plant roof is true if made from earth, turf, grass, leaves or thatch
data$plantRoof = data$roofMaterial=="Earth or turf" |
data$roofMaterial=="Grass, leaves or thatch"
In this case, some langauges are represented by multiple societies. We’ll ignore that for now and just select one society per language
data = data[!duplicated(data$language_glottocode),]
In the nexus file format, some of the file format is standard - we need the first ‘#NEXUS’ line in any file. And some of it is specific to the data - like the value of NTAX, and the list of languages. We can just copy the standard parts, but we need to calculate the specific parts. What’s standard and what’s specific?
The specific parts are:
We’ll use the glotto code to identify each language, and we’ll the character data is just the two linguistic variables.
taxlabels = data$language_glottocode
characterData = data[,c("flatRoof","plantRoof")]
We can now work out the values of NTAX and NCHAR:
numTax = nrow(characterData)
numChar = ncol(characterData)
So we have our first bits! Eventually, we’ll paste these into the right place inside a template. But we need the other bits first.
Let’s take a look at the character data:
head(characterData)
## flatRoof plantRoof
## 4 FALSE FALSE
## 5 FALSE FALSE
## 7 FALSE FALSE
## 8 FALSE FALSE
## 10 FALSE FALSE
## 11 FALSE FALSE
This isn’t quite right - the data should be binary. We need a way of converting this into 1s and 0s. Since the data is boolean, we can just use as.numeric
. Otherwise, you’ll need to convert to a factor first, then to a numeric (and then maybe take 1 away).
characterData$flatRoof = as.numeric(characterData$flatRoof)
characterData$plantRoof = as.numeric(characterData$plantRoof)
The method above converts each column to a binary number. This works fine for just two variables. But if we had lots of characters, then we’d need a more efficient system. We could do this with a for
loop:
for(i in 1:ncol(characterData)){
characterData[,i] <- as.numeric(characterData[,i])
}
This can be read as “For each number in the list from 1 to the number of columns (in this case: 1, 2), do the following: Replace column i in characterData with the binary conversion of column i.”
However, using for loops for lots of data can take a long time. A more ‘R’ approach is to use apply
.
The function apply
takes a matrix or data frame and applies a function to either each row or each column. Here’s a toy example of how it works. Let’s make a matrix with some numbers in:
# Make a matrix with some numbers in
x = table(characterData$flatRoof, characterData$plantRoof)
x
##
## FALSE TRUE
## FALSE 26 41
## TRUE 3 18
apply
takes three arguments: The data, a function (like sum
) and a number representing whether to apply the function to the rows or the columns:
# take the sum of each row:
apply(x, 1, sum)
## FALSE TRUE
## 67 21
# take the sum of each column:
apply(x, 2, sum)
## FALSE TRUE
## 29 59
So, we could convert each column in characterData to binary using this code:
characterData.binary = apply(characterData, 2, as.numeric)
Let’s check it worked:
head(characterData.binary)
## flatRoof plantRoof
## [1,] 0 0
## [2,] 0 0
## [3,] 0 0
## [4,] 0 0
## [5,] 0 0
## [6,] 0 0
This code will now work for 2 columns, or 200! It will also come in handy for the next step.
paste
Let’s look at the TAXLABELS section of the nexus file again:
TAXLABELS
Welsh
Irish
Breton
Cornish
Manx
;
We need a string with each language name on a different line. Recall that the language names in our WALS data are glotto codes:
head(taxlabels)
## [1] "mans1267" "nort3133" "taya1257" "algi1247" "hass1238" "hogg1238"
The function paste
combines a vector of strings into a single string. It has optional arguments “sep” and “collapse” which have different effects. For example:
# Combine a list of three names with a surname:
paste(c("Johnny","Joey","Tommy"), "Ramone", sep=" ")
## [1] "Johnny Ramone" "Joey Ramone" "Tommy Ramone"
# Join items in two lists with a hyphen:
paste(c("a","b","c"),c("A","B","C"), sep="-")
## [1] "a-A" "b-B" "c-C"
# Combine four strings into a single string:
paste(c("super","cali","fragilistic","expialidocious"), collapse = "")
## [1] "supercalifragilisticexpialidocious"
We can use paste
to create a single string that combines all the language names by new lines. New lines can be represented using the ‘escape’ character backslash (“\n”):
taxlabels.string <- paste(taxlabels,collapse="\n")
apply
and paste
:Let’s look at the format of the character block:
MATRIX
Welsh 1 1
Irish 1 0
Breton 0 1
Cornish 1 1
Manx 0 0
;
Each line starts with the taxon name, then the list of characters comes next. Note that for some programs, there should be no spaces between the character states.
We can make a data frame that looks like this by combining the taxlabels variable with the characterData data frame, using cbind
(which combines things by column):
characterData.binary.withNames = cbind(taxlabels,characterData.binary)
head(characterData.binary.withNames)
## taxlabels flatRoof plantRoof
## [1,] "mans1267" "0" "0"
## [2,] "nort3133" "0" "0"
## [3,] "taya1257" "0" "0"
## [4,] "algi1247" "0" "0"
## [5,] "hass1238" "0" "0"
## [6,] "hogg1238" "0" "0"
Now we need to do two things:
TASK: There are many ways to do this, but see if you can figure out how to do this using
apply
and paste.
Here’s my solution:
combineRows = apply(characterData.binary.withNames, 1,
paste, collapse=' ')
characterData.binary.string = paste(combineRows,collapse='\n')
We now have all the bits we need to make the file:
There are many ways we could put all the bits together. For example, we could make each line in the file using paste
, then paste all these lines together.
But here’s a more elegant solution: We’ll use a template string with all the standard parts, with markers where we want to add the specific information.
We can define a template string by copying the text from the example file, putting it between single quotes and replacing the specific parts with unique strings we can refer to later:
(note that definitions of variables can be spread over multiple lines)
nexus_template =
'#NEXUS
BEGIN TAXA;
TITLE Taxa;
DIMENSIONS NTAX=ntax_here;
TAXLABELS
taxlabels_here
;
END;
BEGIN CHARACTERS;
TITLE Character_Matrix;
LINK TAXA = Taxa;
DIMENSIONS NCHAR=nchar_here;
FORMAT DATATYPE = STANDARD GAP = - MISSING = ? SYMBOLS = " 0 1";
MATRIX
characterData_here
;
'
Now all we need to do is replace “ntax_here” with the variable numTax etc. To do this we can use the function gsub
which substitutes (or “replaces”") one string with another:
# change all 's' to 'k' in the string 'snow':
gsub("s", "k", "snow")
## [1] "know"
So, we just need to copy the template text, then replace the placeholders with the bits we calculated.
# copy template to new variable
nex = nexus_template
# add data to template
nex = gsub("ntax_here", numTax, nex)
nex = gsub("nchar_here", numChar, nex)
nex = gsub("taxlabels_here", taxlabels.string, nex)
nex = gsub("characterData_here", characterData.binary.string, nex)
The final step is to write this to a file. We can use the function cat
:
filename = "data/Afro-Asiatic.nex"
cat(nex, file=filename)
And we’re done! This creates a file that can be opened in programs like Mesquite (see below for the full script as it would appear in a script file).
We wrote a script to convert data from Atlantic-Congo languages to a nexus file. If we want to do this for another set of data, we could just copy the whole script and replace bits and pieces. However, a much more efficient approach would be to make a custom function that takes a set of data and produces a nexus file.
TASK: Create a custom function named
makeNexusFile
which takes three arguments: a list of language names (taxlabels), a data frame of character data (characterData) and a filename. It should create a nexus file format and write it to a file.
We already have all the lines of code we need, we just need to put the right ones inside a function. Have a think: what bits would need to be specified each time, and which bits can be calcualted automatically? For example, we can calculate the number of taxa directly from the character data.
Here’s how the script file should be organised:
convertToBinary
function)makeNexusFile
)makeNexusFile
functionHere’s my code. My function takes three variables: the list of languages, the character data and the name of the file to write to.
############
# PARAMETERS
# Nexus file template
nexus_template =
'#NEXUS
BEGIN TAXA;
TITLE Taxa;
DIMENSIONS NTAX=ntax_here;
TAXLABELS
taxlabels_here
;
END;
BEGIN CHARACTERS;
TITLE Character_Matrix;
LINK TAXA = Taxa;
DIMENSIONS NCHAR=nchar_here;
FORMAT DATATYPE = STANDARD GAP = - MISSING = ? SYMBOLS = " 0 1";
MATRIX
characterData_here
;
'
############
# FUNCTIONS
# Function to convert a vector to a binary vector
# This will work for character, factor and boolean variables
convertToBinary = function(x){
return(as.numeric(as.factor(x))-1)
}
# Function to convert data to a nexus format
makeNexusFile = function(taxlabels, characterData, filename){
# make tax labels string
taxlabels.string = paste(taxlabels,collapse='\n')
# Get number of taxa
numTax = nrow(characterData)
# Get number of characters (variables)
numChar = ncol(characterData)
# convert characters to binary
characterData.binary = apply(characterData, 2, convertToBinary)
# add taxon labels to character section
characterData.binary.withNames = cbind(taxlabels,characterData.binary)
# paste everything together
combineRows = apply(characterData.binary.withNames,
1, paste, collapse=' ')
characterData.binary.string =
paste(combineRows,collapse='\n')
# Copy the template
nex = nexus_template
# add data to template by replacing markers
nex = gsub("ntax_here", numTax, nex)
nex = gsub("nchar_here", numChar, nex)
nex = gsub("taxlabels_here", taxlabels.string, nex)
nex = gsub("characterData_here", characterData.binary.string, nex)
cat(nex, file=filename)
}
#############
# DATA
# Load the data
d <- read.csv("../data/RoofShapeData.csv", stringsAsFactors = F)
roofMat = read.csv("../data/RoofingMaterials.csv",stringsAsFactors = F)
# roof shape variable for clarity
d$roofShape = d$code_label
# Add roof material data to d
d$roofMaterial = roofMat[match(d$society_id,roofMat$society_id),]$code_label
# Select only Afro-Asiatic languages
data = d[d$language_family=="Afro-Asiatic",]
# Make boolean variables
data$flatRoof = data$roofShape=="Flat"
# Plant roof is true if made from earth, turf, grass, leaves or thatch
data$plantRoof = data$roofMaterial=="Earth or turf" |
data$roofMaterial=="Grass, leaves or thatch"
# Ignore duplicated data
data = data[!duplicated(data$language_glottocode),]
# Make variables for the function:
# Taxon labels
taxlabels = data$language_glottocode
# Character data (data frame)
characterData = data[,c("flatRoof","plantRoof")]
# Call our function to make the nexus file
makeNexusFile(taxlabels, characterData, "../data/Afro-Asiatic.nex")
The code above means that we could call the makeNexusFile
function many times, without having to repeat the code within the function. But we can go even further. This function could be quite useful to many other scripts, so we could put the function inside its own script file, then load it inside any script where we need it.
For example, we could copy everything in the code above the DATA
section into another script file (called “makeNexusFileFunctions.R”), then use the source
function to load it into memory.
Our script to convert data is now much shorter (note that we’re also making a file for Uto-Aztecan):
# Load funcitons from another file
source("makeNexusFileFunctions.R")
#############
# DATA
# Load the WALS data
# Load the data
d <- read.csv("../data/RoofShapeData.csv", stringsAsFactors = F)
roofMat = read.csv("../data/RoofingMaterials.csv",stringsAsFactors = F)
# roof shape variable for clarity
d$roofShape = d$code_label
# Add roof material data to d
d$roofMaterial = roofMat[match(d$society_id,roofMat$society_id),]$code_label
# Make boolean variables
data$flatRoof = data$roofShape=="Flat"
# Plant roof is true if made from earth, turf, grass, leaves or thatch
data$plantRoof = data$roofMaterial=="Earth or turf" |
data$roofMaterial=="Grass, leaves or thatch"
# Ignore duplicated data
data = data[!duplicated(data$language_glottocode),]
###############
# Make a nexus file for Atlantic Congo
data = d[d$language_family=="Atlantic-Congo",]
# Make variables for the function:
# Taxon labels
taxlabels = data$iso_code
# Character data (data frame)
characterData = data[,c("flatRoof","plantRoof")]
# Call our function to make the nexus file
makeNexusFile(taxlabels, characterData,"../data/Atlantic-Congo.nex")
###############
# Do the same for Uto-Aztecan
data = d[d$language_family=="Uto-Aztecano",]
taxlabels = data$iso_code
characterData = data[,c("flatRoof","plantRoof")]
makeNexusFile(taxlabels, characterData,"../data/Uto-Aztecano.nex")
####
# Do the same for Afro-Asiatic
data = d[d$language_family=="Afro-Asiatic",]
taxlabels = data$iso_code
characterData = data[,c("flatRoof","plantRoof")]
makeNexusFile(taxlabels, characterData,"../data/Afro-Asiatic.nex")
TASK: Can you see any redundancy in the script above? Let’s say you wanted to make a nexus file for every language family. Is there a way to put some of this code into a function, then call the function for each language family?