Computational Methods for Paleogenomics and Comparative Genomics

The IRMACS Center at Simon Fraser University

Fast Phylogenetic Scaffolding of Ancient Contigs (FPSAC) and application to the medieval Black Death agent

The code and data provided on this page are described in the following papers:

Description File(s) Format Comments
Code
Archive fpsac_1.0.tar.gz Tarball (tar.gz) Python and shell scripts
Input Data
Contigs black_death_contigs_individual_8291.fa.gz FASTA Each contig sequence is 20 bp longer than the length given in its contig id.
Extant genomes black_death_extant_genomes.fa.gz FASTA Names have been modified to avoid the characters '.', ':' and '-'. Only chromosome sequences (no plasmids) are considered.
Species tree black_death_species_tree.nhx NHX augmented:
ancestral node of interest marked by @
Outgroups are not resolved (non-binary root)
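The 20 bp offset between a contig's sequence and the length given in its id can be checked with a few lines of Python. The sketch below is only an illustration: the page does not specify how the length is encoded in the contig id, so `declared_length` assumes (hypothetically) that it is the last integer in the id, and the helper names are not part of FPSAC.

```python
import gzip
import re


def read_fasta(path):
    """Yield (header, sequence) pairs from a plain or gzipped FASTA file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    header, chunks = None, []
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def declared_length(contig_id):
    """ASSUMPTION: the declared length is the last integer in the contig id."""
    numbers = re.findall(r"\d+", contig_id)
    return int(numbers[-1]) if numbers else None


def length_mismatches(records, offset=20):
    """Return the ids whose sequence length is not declared length + offset."""
    mismatched = []
    for header, sequence in records:
        declared = declared_length(header)
        if declared is None or len(sequence) != declared + offset:
            mismatched.append(header)
    return mismatched
```

With the contig file at hand, `length_mismatches(read_fasta("black_death_contigs_individual_8291.fa.gz"))` would list the contigs that do not follow the convention, provided the id assumption above holds.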
Intermediate results
Megablast output black_death_megablast_hits.txt BLAST hit table Obtained using NCBI Megablast with default parameters.
Homologous markers families black_death_homologous_markers_families.txt family_header = family_id family_multiplicity
extant_occurrence_1 = extant_genome.Chr:start-end orientation(+/-) contig_id_and_length:start-end(,contig_id_and_length:start-end...)
...
extant_occurrence_k = extant_genome.Chr:start-end orientation(+/-) contig_id_and_length:start-end(,contig_id_and_length:start-end...)
Family 13 was excluded from further analysis.
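An occurrence line of the families file can be split into its fields as sketched below. This is a minimal sketch, assuming that the three fields are separated by whitespace and that coordinates are non-negative integers; the `Occurrence` type and function names are hypothetical. Because genome names avoid the characters '.', ':' and '-', splitting on those characters is unambiguous for the genome part.

```python
from typing import List, NamedTuple, Tuple


class Occurrence(NamedTuple):
    genome: str
    chromosome: str
    start: int
    end: int
    orientation: str
    contig_hits: List[Tuple[str, int, int]]


def parse_interval(text):
    """Split 'name:start-end' into (name, start, end), cutting from the right."""
    name, _, span = text.rpartition(":")
    start, _, end = span.partition("-")
    return name, int(start), int(end)


def parse_occurrence(line):
    """Parse one extant_occurrence line (ASSUMPTION: whitespace-separated fields)."""
    locus, orientation, hits = line.split()
    if orientation not in ("+", "-"):
        raise ValueError(f"unexpected orientation: {orientation!r}")
    name, start, end = parse_interval(locus)
    genome, _, chromosome = name.partition(".")
    contig_hits = [parse_interval(hit) for hit in hits.split(",")]
    return Occurrence(genome, chromosome, start, end, orientation, contig_hits)
```

For example, the made-up line `GenomeA.chr1:1000-1500 + contig_12_500:1-480,contig_7_300:5-20` would yield one occurrence on `GenomeA`, chromosome `chr1`, with two contig hits.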
Adjacencies and repeat spanning intervals black_death_adjacencies.txt black_death_repeat_spanning_intervals.txt ANGES format augmented to include gap coordinates.
Each adjacency or common interval (character) is represented by a single row in the file:
character_id|phylogenetic_weight;list_of_species_containing_character:list_of_markers_in_character list_of_gaps_coordinates

Markers were doubled to account for orientation: family X induces two families, with ids 2X (heads of the markers) and 2X-1 (tails of the markers). Adjacencies (2X,2X-1) are weighted by 10000 to ensure that markers are properly reconstructed. See there for further details.
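The doubling scheme amounts to simple integer arithmetic, illustrated by the following sketch. The function names are hypothetical and not taken from the FPSAC scripts; the sign convention in `undouble_order` (tail followed by head gives a positive marker) is an assumption made for this example.

```python
MARKER_ADJACENCY_WEIGHT = 10000  # weight given to adjacencies (2X, 2X-1)


def double_marker(family_id):
    """Return (tail_id, head_id) = (2X-1, 2X) for family X."""
    return 2 * family_id - 1, 2 * family_id


def undouble(extremity_id):
    """Map a doubled id back to (family_id, 'head' or 'tail')."""
    family = (extremity_id + 1) // 2
    side = "head" if extremity_id % 2 == 0 else "tail"
    return family, side


def is_marker_adjacency(a, b):
    """True if a and b are the two extremities of the same marker."""
    return a != b and undouble(a)[0] == undouble(b)[0]


def undouble_order(extremities):
    """Turn a list of doubled ids (consecutive pairs = one marker) into signed markers.

    ASSUMPTION: a marker read tail-then-head is positive, head-then-tail negative.
    """
    if len(extremities) % 2 != 0:
        raise ValueError("expected an even number of marker extremities")
    signed = []
    for first, second in zip(extremities[0::2], extremities[1::2]):
        if not is_marker_adjacency(first, second):
            raise ValueError(f"{first} and {second} are not ends of one marker")
        family, side = undouble(first)
        signed.append(family if side == "tail" else -family)
    return signed
```

For instance, family 13 corresponds to the doubled ids 25 (tail) and 26 (head).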

Selected subset of adjacencies of maximum weight and compatible with a circular structure
Kept adjacencies: black_death_kept_adjacencies.txt
Discarded adjacencies: black_death_discarded_adjacencies.txt
ANGES format augmented to include gap coordinates, as above.
Algorithm: Linearization of ancestral multichromosomal genomes.

Markers are still doubled, as above.

Selected subset of repeat spanning intervals of maximum weight and compatible with the selected adjacencies
Kept intervals: black_death_kept_repeat_spanning_intervals.txt
Discarded intervals: black_death_discarded_repeat_spanning_intervals.txt
ANGES format augmented to include gap coordinates, as above.

Markers are still doubled, as above.

Markers circular order (without outgroup adjacencies) black_death_markers_order.txt Undoubled markers.
Outgroup supported adjacencies black_death_outgroup_adjacencies.txt
Ancestral gaps black_death_gaps.txt Undoubled markers.
Extant gaps alignments black_death_gaps_alignments.tar.gz Default Muscle output: FASTA (see Muscle manual). Computed using Muscle 3.8.31.
Final results
Ancestral genome: DNA sequence black_death_DNA_sequence.fa.gz Gaps for outgroup adjacencies are replaced by a sequence of 50 Ns.
Ancestral genome: sequence map with extant annotations black_death_ancestral_sequence_map
Annotation black_death_ancestral_sequence_annotation_Basys.gbk Obtained with Basys.
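The way gaps for outgroup adjacencies appear in the ancestral DNA sequence (a run of 50 Ns) can be illustrated as follows. This is a minimal sketch, not the FPSAC implementation: it assumes marker and gap sequences are already available as strings in genome order, and that a gap supported only by an outgroup adjacency is represented by `None`.

```python
N_GAP_LENGTH = 50  # length of the N run used for outgroup-adjacency gaps


def assemble_sequence(markers, gaps):
    """Concatenate markers[i] followed by gaps[i] for every i.

    ASSUMPTION: gaps[i] is None when the gap comes from an outgroup adjacency,
    in which case it is written as a run of N_GAP_LENGTH Ns.
    """
    if len(markers) != len(gaps):
        raise ValueError("expected exactly one gap after each marker")
    parts = []
    for marker, gap in zip(markers, gaps):
        parts.append(marker)
        parts.append("N" * N_GAP_LENGTH if gap is None else gap)
    return "".join(parts)
```

Since the marker order is circular, the last gap in the list is the one that closes the circle back to the first marker.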