Korak: An efficient method for exploring the space of gene tree/species tree reconciliations in a probabilistic framework
Jean-Philippe Doyon, Sylvie Hamel, and Cedric Chauve
IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2012
Given a gene tree G and a species tree S (both in Newick format), together with duplication/loss rates and branch lengths for S,
Korak computes the following:
- The size and diameter of the whole space of reconciliations between G and S.
- The likelihood of the LCA-based reconciliation.
- The sum of the likelihoods over all reconciliations located in a subspace of the whole space of reconciliations (given a maximal depth), that is, the probability mass of the considered subspace.
- The (exact/approximate) posterior probability (given the tree G) of each visited reconciliation.
- A comparison of the probabilistic analysis with the real reconciliation (for simulated gene trees only).
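The quantities listed above can be related as follows. This is a sketch under assumed notation: the symbols \Omega, \Omega_d, P, and M_d are not taken from the manual, and the form of the approximate posterior in particular is an assumption about how the subspace mass is used.

```latex
% \Omega(G,S): the whole space of reconciliations between G and S
% \Omega_d \subseteq \Omega(G,S): the subspace explored for maximal depth d
% P(G, r): likelihood of reconciliation r under the given rates and branch lengths
\begin{align*}
M_d &= \sum_{r' \in \Omega_d} P(G, r')
  && \text{(probability mass of the subspace)} \\
P(r \mid G) &= \frac{P(G, r)}{\sum_{r' \in \Omega(G,S)} P(G, r')}
  && \text{(exact posterior)} \\
\tilde{P}_d(r \mid G) &= \frac{P(G, r)}{M_d}, \quad r \in \Omega_d
  && \text{(approximate posterior, assumed form)}
\end{align*}
```

Under this assumed form, M_d is at most the sum over the whole space, so the approximate value is never smaller than the exact one, and the two coincide when the subspace covers the whole space.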
A user manual for the program is available: manualExploration.pdf.
The archive Exploration.tgz contains the following:
- The C++ code, which consists of four libraries (folder LIBRARY) and the main project (folder EXPLORATION/exploration/).
- Several input files (folder EXPLORATION/INPUT_FILES/).
Follow these steps to build and run the Korak binary (the executable file is named Exploration):
- Download the archive Exploration.tgz.
- Create a new directory and move the archive into it.
- Extract the archive: 'tar -zxvf Exploration.tgz'
- 'cd EXPLORATION/exploration/'
- Type 'cmake .' to generate the makefile corresponding to your system.
- Type 'make' to build the binary file called Exploration (several warnings are written to the shell; this is expected, and they will be addressed later).
- Execute the binary: './Exploration D ../INPUT_FILES/DATA2 E L Q D'
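The steps above can be collected into a small shell function. This is a minimal sketch, not part of the Korak distribution: the function name build_korak and the default directory name korak_build are hypothetical, and it assumes that Exploration.tgz has already been downloaded and that tar, cmake, and make are installed.

```shell
#!/bin/sh
# Sketch only: build_korak and the default directory korak_build are
# hypothetical names, not part of the Korak distribution.
# Usage: build_korak [path/to/Exploration.tgz] [new_directory]
build_korak() (
    archive="${1:-Exploration.tgz}"
    workdir="${2:-korak_build}"

    # Stop early if the archive has not been downloaded yet.
    if [ ! -f "$archive" ]; then
        echo "archive not found: $archive" >&2
        exit 1
    fi

    # Each step runs only if the previous one succeeded.
    mkdir -p "$workdir" &&
    mv "$archive" "$workdir"/ &&
    cd "$workdir" &&
    tar -zxvf "$(basename "$archive")" &&
    cd EXPLORATION/exploration/ &&
    cmake . &&
    make &&
    ./Exploration D ../INPUT_FILES/DATA2 E L Q D
)
```

Calling build_korak with no arguments from the directory that holds the archive runs the steps with the defaults. The function body runs in a subshell, so the caller's working directory is left unchanged.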
The options of the program and the format of the output are described in
manualExploration.pdf. The same example as in Step 7 above is used.
Output files of the Exploration program:
- exploration.log2 : summary of the running time
- exploration.results : results of the computation (see manualExploration.pdf)
Probabilistic Analysis on Real and Simulated Gene Trees
This section contains the following:
- Input data:
  - Real data for 12 fungal genomes [1]:
    - 1278 real gene family trees.
    - Species tree with branch lengths (in millions of years) and duplication and loss rates (computed by CAFE [2]).
  - Synthetic data based on the real data above:
    - Simulated gene trees using a "recursive" birth-and-death process (see the paper).
    - Based on the rates R computed by CAFE, three duplication/loss rate categories are considered:
      - 1051 trees with R x 1;
      - 1025 trees with R x 1.4;
      - 924 trees with R x 1.8.
- Output data:
  - Real gene trees:
    - Complete exploration
    - Incomplete exploration
  - Simulated gene trees:
    - Input gene trees
    - Probabilistic analysis
References
[1] I. Wapinski, A. Pfeffer, N. Friedman, and A. Regev. Natural history and evolutionary principles of gene duplication in fungi. Nature, 449:54–61, 2007.
[2] T. De Bie, N. Cristianini, J.P. Demuth, and M.W. Hahn. CAFE: a computational tool for the study of gene family evolution. Bioinformatics, 22(10):1269–1271, 2006.