Korak program

Korak: An efficient method for exploring the space of gene tree/species tree reconciliations in a probabilistic framework

Jean-Philippe Doyon, Sylvie Hamel, and Cedric Chauve
IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2012

Given a gene tree G and a species tree S (both in Newick format), together with duplication/loss rates and branch lengths for S, Korak computes the following:

The size and diameter of the whole space of reconciliations between G and S.
The likelihood of the LCA-based reconciliation.
The sum of the likelihoods over all reconciliations located in a subspace of the whole space of reconciliations (given a maximal depth), that is, the probability mass of the considered subspace.
The (exact/approximate) posterior probability (given the tree G) of each visited reconciliation.
A comparison of the probabilistic analysis with the real reconciliation (for simulated gene trees only).
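
The relation between the probability mass of a set of reconciliations and the posterior probability of each one can be sketched as follows. This is a minimal illustration in Python, not Korak's C++ code: it assumes that the posterior of a reconciliation is its likelihood divided by the summed likelihood of the reconciliations considered. The function name, the reconciliation labels and the likelihood values are hypothetical. When only a subspace has been visited, the normaliser is the mass of that subspace and the result is only an approximation; how Korak itself defines the approximate posterior is specified in the paper and in manualExploration.pdf.

```python
import math


def posteriors_from_log_likelihoods(log_likelihoods):
    """Normalise log-likelihoods into probabilities that sum to 1.

    log_likelihoods: dict mapping a reconciliation label to its
    log-likelihood (hypothetical input format, not Korak's output format).
    Returns (posteriors, log_mass), where log_mass is the log of the
    summed likelihood of the given reconciliations.
    """
    # Log-sum-exp: subtract the maximum before exponentiating so that
    # very small likelihoods do not underflow to zero.
    highest = max(log_likelihoods.values())
    log_mass = highest + math.log(
        sum(math.exp(value - highest) for value in log_likelihoods.values())
    )
    posteriors = {
        label: math.exp(value - log_mass)
        for label, value in log_likelihoods.items()
    }
    return posteriors, log_mass


# Hypothetical likelihoods for three visited reconciliations.
visited = {
    "lca": math.log(6e-5),
    "alternative_1": math.log(3e-5),
    "alternative_2": math.log(1e-5),
}
posteriors, log_mass = posteriors_from_log_likelihoods(visited)
for label, probability in posteriors.items():
    print(label, round(probability, 4))
```

Working in log space is a design choice for this sketch: likelihoods of reconciliations on large trees can be extremely small, and normalising the raw values directly would risk underflow.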

A user manual of the program: manualExploration.pdf.

The archive Exploration.tgz contains the following:

The C++ code, which consists of four libraries (folder LIBRARY) and the main project (folder EXPLORATION/exploration/).
Several input files (folder EXPLORATION/INPUT_FILES/).

Follow these steps to build the Korak binary (the executable is named Exploration):

1. Download the archive Exploration.tgz.
2. Create a new directory and move the archive into it.
3. Extract the archive: 'tar -zxvf Exploration.tgz'
4. Change to the project directory: 'cd EXPLORATION/exploration/'
5. Type 'cmake .' to generate the makefile corresponding to your system.
6. Type 'make' to build the binary file called Exploration. Several warnings are written to the shell; this is expected, and cleaning them up is left for later.
7. Execute the binary: './Exploration D ../INPUT_FILES/DATA2 E L Q D'

The options of the program and the format of the output are described in manualExploration.pdf, which uses the same example as Step 7 above.

Output files of the Exploration program:

exploration.log2 : summary of the running time
exploration.results : results of the computation (see manualExploration.pdf)

Probabilistic Analysis on Real and Simulated Gene Trees

This section contains the following:

Input data:

Real data for 12 fungal genomes [1]:

1278 real gene family trees.

Species tree with branch lengths (in millions of years) and duplication and loss rates (computed by CAFE [2])

Synthetic data based on the real ones above:

Simulated gene trees using a "recursive" birth-and-death process (see the paper);
Based on the rates R computed by CAFE, three duplication/loss rate categories are considered:

1051 trees with R x 1;
1025 trees with R x 1.4;
924 trees with R x 1.8.
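
The three rate categories can be illustrated with a short Python sketch. It assumes that "R x f" means every duplication rate and every loss rate in R is multiplied by the same factor f. The branch names, the rate values and the dictionary layout are hypothetical and do not reflect the format of the edgeValues files.

```python
def scale_rates(rates, factor):
    """Return a copy of the per-branch rates multiplied by factor.

    rates: dict mapping a species tree branch label to a
    (duplication_rate, loss_rate) pair (hypothetical layout).
    """
    return {
        branch: (duplication * factor, loss * factor)
        for branch, (duplication, loss) in rates.items()
    }


# Hypothetical base rates R for two branches of the species tree.
base_rates = {
    "branch_a": (0.002, 0.003),
    "branch_b": (0.001, 0.004),
}

# One rate set per category: R x 1, R x 1.4 and R x 1.8.
categories = {factor: scale_rates(base_rates, factor) for factor in (1, 1.4, 1.8)}
for factor, scaled in categories.items():
    print(factor, scaled)
```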

Output data:

Real gene trees:

Complete exploration
Incomplete exploration

Simulated gene trees

Complete exploration

Input gene trees

	Increasing Factor (I.F.)	Gene Trees	Branch Lengths (in time) and Rates	Species Tree (12 fungal genomes)
Real gene trees	Not applicable	1278 trees realGeneTree.tgz	edgeValues-1	speciesTree
Simulated gene trees	1	1051 trees simulatedGeneTree_1.tgz	edgeValues-1
	1.4	1025 trees simulatedGeneTree_1.4.tgz	edgeValues-1.4
	1.8	924 trees simulatedGeneTree_1.8.tgz	edgeValues-1.8

Probabilistic analysis

Reconciliation Tree Explored	Real Gene Trees	Simulated Gene Trees (I.F. 1)	Simulated Gene Trees (I.F. 1.4)	Simulated Gene Trees (I.F. 1.8)
Whole tree	realGeneTree_CompleteExploration.tgz	simulatedGeneTree_CompleteExploration_1.tgz	simulatedGeneTree_CompleteExploration_1.4.tgz	simulatedGeneTree_CompleteExploration_1.8.tgz
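Subtree with depth 0	realGeneTree_Depth_0.tgz
Subtree with depth 1	realGeneTree_Depth_1.tgz
Subtree with depth 2	realGeneTree_Depth_2.tgz
Subtree with depth 3	realGeneTree_Depth_3.tgz
Subtree with depth 4	realGeneTree_Depth_4.tgz
Subtree with depth 5	realGeneTree_Depth_5.tgz
Subtree with depth 6	realGeneTree_Depth_6.tgz
Subtree with depth 7	realGeneTree_Depth_7.tgz
Subtree with depth 8	realGeneTree_Depth_8.tgz
Subtree with depth 9	realGeneTree_Depth_9.tgz
Subtree with depth 10	realGeneTree_Depth_10.tgz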

References
[1] I. Wapinski, A. Pfeffer, N. Friedman, and A. Regev. Natural history and evolutionary principles of gene duplication in fungi. Nature, 449:54–61, 2007.
[2] T. De Bie, N. Cristianini, J.P. Demuth, and M.W. Hahn. CAFE: a computational tool for the study of gene family evolution. Bioinformatics, 22(10):1269–1271, 2006.