Lorentz Center@Snellius workshop

Utilizing Genealogical Phylogenetic Networks in Evolutionary Biology:
Touching the Data

7 to 11 July, 2014

Here is a compilation of empirical datasets that might prove useful for investigating mathematical algorithms associated with those phylogenetic networks intended to represent evolutionary history.

Note that the datasets at http://phylonetworks.blogspot.se/p/datasets.html are also available for discussion.

This page last updated: Tuesday 24 June 2014

Dataset 1

Responsible person: Axel Janke
Datafile: Bear.zip
Size: 2.3 MB
Kutschera VE, Bidon T, Hailer F, Rodi JL, Fain SR, Janke A (2014) Bears in a forest of gene trees: phylogenetic inference is complicated by incomplete lineage sorting and gene flow. Molecular Biology and Evolution (in press). DOI:10.1093/molbev/msu186
Short description:
This is a multi-locus dataset from bears, with conflicting data. There are seven well-recognized bear species, but the gene trees from 14 intron loci and the mtDNA are incongruent. Pairwise IMa2 analyses show that gene flow (hybridization) is a very likely explanation.

The zip-file is available here:
It contains the following files:
1. 14 gene trees of autosomal introns, from *BEAST inferences of a species tree, with branch lengths
2. 14 gene trees of autosomal introns, from *BEAST inferences of a species tree, with branch lengths AND posterior probabilities
3. A list of the 14 autosomal introns, in the same order as the corresponding gene trees appear in files 1 and 2
4. A tab-delimited table of the species names, abbreviations used, and the individual IDs
5.-19. Fasta files of the 14 autosomal introns
20.-21. In-press paper (plus supplementary material).
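
For anyone who wants to inspect the intron alignments without installing extra software, here is a minimal sketch of a FASTA reader in plain Python. It assumes standard FASTA layout (a ">" header line followed by one or more sequence lines); the sequence labels and bases in the example are made up and do not come from Bear.zip.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {header: sequence} dictionary."""
    records: Dict[str, str] = {}
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            # Sequence data may be wrapped over several lines
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records


# Hypothetical two-sequence alignment; labels and bases are invented
example = """>sample_A
ACGT-ACGT
ACGT
>sample_B
ACGTTACGT
ACGA
""".splitlines()

alignment = parse_fasta(example)
for name, seq in alignment.items():
    print(name, len(seq))
```

The same function accepts an open file handle, so a real alignment could be read with `parse_fasta(open(path))`, where `path` points at one of the unzipped FASTA files. If two sequences share a header, the later one overwrites the earlier one in this sketch.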

Dataset 2

Responsible person: Scot Kelchner
Datafile: Bamboo.zip
Size: 1.9 MB
Kelchner SA, Bamboo Phylogeny Group (2013) Higher level phylogenetic relationships within the bamboos (Poaceae: Bambusoideae) based on five plastid markers. Molecular Phylogenetics and Evolution 67: 404-413. DOI:10.1016/j.ympev.2013.02.005
Short description:
Multi-locus dataset for 40 species of bamboo. There are five chloroplast markers, including one coding sequence, two introns and two spacers. The biological problem is a very tough (and contentious) one. Tree analyses show only one rather unlikely answer, but when networks are used a more biologically plausible answer is revealed.

The zip-file is available here:
It contains the following files:
1. Readme file
2. PDF of published paper
3. aligned sequence data in Nexus format, including annotations of repeats, inversions and ambiguous regions
4. aligned sequence data in Fasta format.

Dataset 3

Responsible person: Mattis List
Datafile: gist7c56ce7c9e983aecad25-63e4b3d823165ac8225d9ccf27043fb35aa37c3f.tar.gz
Size: 233 KB
List J-M, Nelson-Sathi S, Geisler H, Martin W (2013) Networks of lexical borrowing and lateral gene transfer in language and genome evolution. Bioessays 36: 141-150. DOI:10.1002/bies.201300096
Short description:
This is a test dataset for phylogenetic network approaches in historical linguistics. It consists of cognate word sets (i.e. homologous words) for 40 Indo-European languages, along with a reference phylogenetic tree. Known borrowings, which represent phylogenetic reticulations, have been deliberately reintroduced into the data.

The link to the GitHub repository is here:
The repository (click the "Download Gist" button) contains the following files:
1. Readme file
2. tree-representation of the underlying taxa using the Newick format (nwk-file)
3. csv-representation of the presence-absence patterns of the data (csv-file)
4. Nexus-representation of the presence-absence matrix of the data (nex-file)
5. wordlist representation of the data which is important for additional linguistic analyses (qlc-format).
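
A quick way to check which taxa a Newick file contains is to pull out its leaf labels. The following is a minimal sketch in plain Python, assuming unquoted labels and no square-bracket comments in the tree string; the language labels in the example are placeholders, not the names used in the repository.

```python
import re
from typing import List


def newick_leaf_names(newick: str) -> List[str]:
    """Return the leaf labels of a Newick string, in left-to-right order.

    Simplifying assumptions: labels are unquoted and the string
    contains no square-bracket comments.
    """
    leaves = []
    # A leaf label directly follows "(" or ","; internal-node labels and
    # support values follow ")" and are therefore never matched.
    for match in re.finditer(r"[(,]\s*([^(),:;\s]+)", newick):
        leaves.append(match.group(1))
    return leaves


# Hypothetical four-taxon tree with branch lengths and support values
tree = "((Lang_A:0.1,Lang_B:0.2)0.95:0.3,(Lang_C:0.1,Lang_D:0.4)0.80:0.2);"
taxa = newick_leaf_names(tree)
print(len(taxa), taxa)
```

For anything beyond a sanity check, a dedicated tree library is the safer choice, since this sketch ignores quoting and comments entirely.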

Dataset 4

Responsible person: Rob Beiko
Datafile: Beiko_trees.tar.gz
Size: 68.1 MB
Beiko RG (2011) Telling the whole story in a 10,000-genome world. Biology Direct 6: 34. DOI:10.1186/1745-6150-6-34
Short description:
This dataset was first used in the Big Analysis challenge at iEvoBio 2012. There, the data were presented as a 244-taxon bacterial SPR Supertree constructed from a subset of the data from 159,905 prokaryotic phylogenetic trees.

The links to the Dryad data repository and the published paper are here:
The Dryad repository contains:
1. Readme file
2. csv file with the taxonomic affiliations of all organisms
3. the full set of trees in a couple of different formats (basically, linking each sequence back to NCBI).

Dataset 5

Responsible person: Jim Whitfield
Datafiles: Genomes.zip, Genomes-dataset4.zip
Size: 29.2 MB, 29.6 MB
Various, as explained in accompanying text files.
Short description:
There are four genomic datasets, each of which is an example of a different type of data that is commonly collected, and which has associated data-analysis issues.

The zip-files are available here:
They contain the following folders:
1. Dataset 1 - Corbiculate bees
2. Dataset 2 - Aculeate bees and wasps
3. Dataset 3 - Metazoan phyla
4. Dataset 4 - Placental mammals

Dataset 6

Responsible person: Mike Steel
Datafile: mike_data2.pdf
Size: 78 KB
None.
Short description:
Three constructed datasets, designed to investigate various properties of parsimony analysis.

The PDF-file is available here: