Chromosome painting using TraceAncestor

TraceAncestor is a suite of scripts that estimates the dosage of ancestral alleles in hybrid individuals and then performs chromosome painting.

Installation

git clone https://github.com/gdroc/GeMo_tutorials.git
cd GeMo_tutorials

To download the dataset, run the script download_dataset.pl without any parameters:

perl download_dataset.pl

This script creates a new directory, data:

data/
├── Ahmed_et_al_2019_color.txt
├── Ahmed_et_al_2019_individuals.txt
├── Ahmed_et_al_2019_origin.txt
├── Ahmed_et_al_2019.vcf

Workflow

vcf2gst.pl

Usage

This script computes GST values from individuals identified as pure-bred representatives of an ancestor.

It must be run on pure-bred individuals only. If part of an individual's genome is introgressed, that part must be removed before the analysis.

bin/vcf2gst.pl --help

Parameters :
   --vcf       vcf containing the ancestors and other individuals to scan [Required]
   --ancestor  A two column file with individuals in the first column and group tag (i.e. origin) in the second column [Required]
   --depth     minimal depth for a snp to be used in the analysis (Default 5)
   --output    output file name (Default GSTmatrix.txt)
   --help

Input

--ancestor Ancestor file (Required)

A two column file with individuals in the first column and group tag (i.e. origin) in the second column

individuals    origin
De_Chios       Mandarin
Shekwasha      Mandarin
Sunki          Mandarin
Cleopatra      Mandarin
Pink           Pummello
Timor          Pummello
Tahitian       Pummello
Deep_red       Pummello
Corsican       Citron
Buddha_Hand    Citron
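To check an ancestor file before running vcf2gst.pl, the sketch below groups individuals by origin. It is only an illustration: the function name read_ancestor_file is hypothetical, and it assumes whitespace-separated columns and an optional header row starting with "individuals".

```python
import io
from collections import defaultdict


def read_ancestor_file(handle):
    """Return {origin: [individuals]} from a two-column, whitespace-separated file."""
    groups = defaultdict(list)
    for line in handle:
        fields = line.split()
        # Skip blank lines and the (assumed) header row.
        if len(fields) < 2 or fields[0] == "individuals":
            continue
        individual, origin = fields[0], fields[1]
        groups[origin].append(individual)
    return dict(groups)


# A few rows taken from the table above, used as an in-memory example.
example = io.StringIO(
    "individuals\torigin\n"
    "De_Chios\tMandarin\n"
    "Pink\tPummello\n"
    "Timor\tPummello\n"
    "Corsican\tCitron\n"
)
groups = read_ancestor_file(example)
for origin, members in groups.items():
    print(origin, len(members), ",".join(members))
```

With the real file, replace the in-memory example with open("data/Ahmed_et_al_2019_origin.txt").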

--vcf VCF file (Required)

Now, you can run the following command

perl bin/vcf2gst.pl --ancestor data/Ahmed_et_al_2019_origin.txt --vcf data/Ahmed_et_al_2019.vcf --output GSTMatrix.txt

Output

The output is a CSV file containing GST (inter-population differentiation parameter) information, with the following columns:

  • #CHROM = chromosome name

  • POS = position of the diagnostic SNP (DSNP)

  • REF = Base of the reference allele of this DSNP

  • ALT = Base of the alternative allele of this DSNP

  • %Nref = Percentage of maximal missing data for this DSNP

  • GST = GST value (inter-population differentiation parameter), one column per ancestor (1, 2, 3 stand for the ancestor names)

  • F = Alternative allele frequency, one column per ancestor (1, 2, 3 stand for the ancestor names)
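As a reminder of what a GST value measures, here is a minimal sketch of Nei's GST for a single biallelic SNP, computed from the alternative-allele frequency in each group. It is an assumption that this textbook form with equal group weights matches what vcf2gst.pl computes. The script may weight or pair the groups differently, and the function name gst is hypothetical.

```python
def gst(alt_freqs):
    """Nei's GST for one biallelic SNP.

    alt_freqs: alternative-allele frequency in each group (equal weights assumed).
    GST = (HT - HS) / HT, where HT is the expected heterozygosity of the pooled
    groups and HS is the mean expected heterozygosity within groups.
    """
    n = len(alt_freqs)
    mean_p = sum(alt_freqs) / n
    h_total = 2 * mean_p * (1 - mean_p)
    if h_total == 0:
        return 0.0  # monomorphic site: no differentiation to measure
    h_within = sum(2 * p * (1 - p) for p in alt_freqs) / n
    return (h_total - h_within) / h_total


# Two groups fixed for different alleles are fully differentiated.
print(gst([1.0, 0.0]))
# Two groups with identical frequencies are not differentiated at all.
print(gst([0.5, 0.5]))
```

A SNP that is fixed for one allele in one ancestor and for the other allele elsewhere reaches the maximum value, which is why high GST values point to diagnostic markers.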

prefilter.pl

Usage

This script builds a matrix of ancestry-informative markers from the GST matrix produced in the previous step (vcf2gst.pl).

bin/prefilter.pl --help

Parameters :
    --matrix    GST matrix [Required]
    --gst       threshold for gst (Default : 0.9)
    --missing   threshold for missing data (Default 0.3)
    --output    output file name (Default Diagnosis_matrix)
    --help      display this help

Now, you can run the following command

perl bin/prefilter.pl --matrix GSTMatrix.txt --output Diagnosis_matrix.txt

Output

The output is a matrix containing all the ancestry-informative markers for each ancestor, with the following columns:

  • ancestor = Ancestor names

  • chromosome = Chromosome numbers

  • position = Position of the SNP marker

  • allele = Base of the ancestral allele
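The filtering idea behind the --gst and --missing thresholds can be sketched as follows. This is not the code of prefilter.pl: the row layout, the key names and the rule used to pick the ancestral allele (ALT when it is the majority allele in that ancestor, otherwise REF) are all assumptions made for illustration, and missing data is expressed here as a fraction.

```python
def select_markers(rows, gst_min=0.9, missing_max=0.3):
    """Yield (ancestor, chromosome, position, allele) for markers passing both thresholds.

    Each row is a dict with hypothetical keys: chrom, pos, ref, alt, missing
    (fraction of missing data), gst ({ancestor: GST}) and alt_freq
    ({ancestor: ALT frequency}).
    """
    for row in rows:
        if row["missing"] > missing_max:
            continue  # too much missing data at this SNP
        for ancestor, value in row["gst"].items():
            if value < gst_min:
                continue  # not differentiated enough for this ancestor
            # Assumption: the ancestral allele is the majority allele in the ancestor.
            allele = row["alt"] if row["alt_freq"][ancestor] > 0.5 else row["ref"]
            yield ancestor, row["chrom"], row["pos"], allele


# Made-up rows, for illustration only.
rows = [
    {"chrom": "1", "pos": 1200, "ref": "A", "alt": "G", "missing": 0.1,
     "gst": {"Citron": 0.95, "Mandarin": 0.2},
     "alt_freq": {"Citron": 1.0, "Mandarin": 0.1}},
    {"chrom": "1", "pos": 5300, "ref": "C", "alt": "T", "missing": 0.5,
     "gst": {"Citron": 1.0, "Mandarin": 1.0},
     "alt_freq": {"Citron": 1.0, "Mandarin": 0.0}},
    {"chrom": "2", "pos": 800, "ref": "T", "alt": "C", "missing": 0.0,
     "gst": {"Citron": 0.1, "Mandarin": 0.97},
     "alt_freq": {"Citron": 0.6, "Mandarin": 0.0}},
]
markers = list(select_markers(rows))
for marker in markers:
    print(*marker, sep="\t")
```

Raising gst_min keeps fewer but more reliable markers. Lowering missing_max discards SNPs that are poorly covered across the ancestors.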

TraceAncestor.pl

Usage

bin/TraceAncestor.pl --help

Parameters :
    --matrix     Diagnosis matrix [Required]
    --vcf       vcf of the hybrid population
    --individuals    A two column file with individuals to scan for origin (same as defined in the VCF headerline) in the first column and the ploidy in the second column [Required]
    --window    number of markers by window (Default 10)
    --lod       LOD value to conclude for one hypothesis (Default 3)
    --freq      theoretical frequency used to calcul the LOD (Default 0.99)
    --cut       number of K bases in one window (Default 100)
    --dirout    Directory output (Default result)
    --help      display this help

Input

--individuals A two column file with individuals to scan for origin (same as defined in the VCF headerline) in the first column and the ploidy in the second column.
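Because the names in the individuals file must match the VCF header line, a quick consistency check can save a failed run. The sketch below relies on the standard VCF layout, in which sample names start at the tenth column of the #CHROM line. The function names, the sample Unknown_sample and the ploidy values are hypothetical.

```python
import io


def vcf_samples(handle):
    """Return the sample names from the #CHROM header line of a VCF."""
    for line in handle:
        if line.startswith("#CHROM"):
            # The nine fixed columns (CHROM ... FORMAT) come before the samples.
            return line.rstrip("\n").split("\t")[9:]
    return []


def read_individuals(handle):
    """Return {individual: ploidy} from a two-column, whitespace-separated file."""
    ploidy = {}
    for line in handle:
        fields = line.split()
        if len(fields) >= 2 and fields[1].isdigit():
            ploidy[fields[0]] = int(fields[1])
    return ploidy


# In-memory stand-ins for the VCF and the individuals file (illustrative values).
vcf = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tBergamot\tCorsican\n"
)
individuals = io.StringIO("Bergamot\t2\nUnknown_sample\t2\n")

samples = vcf_samples(vcf)
ploidy = read_individuals(individuals)
missing = [name for name in ploidy if name not in samples]
print("Not found in the VCF header:", missing)
```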

Now, you can run the following command

perl bin/TraceAncestor.pl --matrix Diagnosis_matrix.txt --vcf data/Ahmed_et_al_2019.vcf --individuals data/Ahmed_et_al_2019_individuals.txt

Output

For each individual listed in data/Ahmed_et_al_2019_individuals.txt, four output files are produced, each prefixed with the name of the individual (Bergamot in the examples below):

  • Bergamot_ideo.txt : a text file describing the ancestry mosaic as a succession of genomic blocks along each chromosome, with the position of each block

chr    haplotype    start       end         ancestral_group
1      0            1           28700000    Citron
1      1            1           28700000    Pummello
2      0            1           600000      Citron
2      0            3000001     4200000     Mandarin
2      0            4200001     10400000    Citron
2      0            10800001    35200000    Citron

  • Bergamot_chrom.txt : a tab-delimited file with the chromosome name, length and karyotype, based on ploidy.

  • Bergamot_ancestor.txt : frequency of ancestral alleles along each chromosome for this hybrid.

  • Bergamot_curve.txt : frequency of ancestral alleles along each chromosome, formatted for the GeMo visualization tool.
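The block file (Bergamot_ideo.txt) is easy to summarize outside GeMo. The sketch below adds up the painted length per ancestral group, using the rows of the example above. It assumes a header row, whitespace-separated columns and 1-based inclusive coordinates. The function name painted_length is hypothetical.

```python
import io
from collections import defaultdict

# Rows copied from the Bergamot_ideo.txt example above.
IDEO = (
    "chr\thaplotype\tstart\tend\tancestral_group\n"
    "1\t0\t1\t28700000\tCitron\n"
    "1\t1\t1\t28700000\tPummello\n"
    "2\t0\t1\t600000\tCitron\n"
    "2\t0\t3000001\t4200000\tMandarin\n"
    "2\t0\t4200001\t10400000\tCitron\n"
    "2\t0\t10800001\t35200000\tCitron\n"
)


def painted_length(handle):
    """Return {ancestral_group: total number of bases covered by its blocks}."""
    totals = defaultdict(int)
    next(handle)  # skip the (assumed) header row
    for line in handle:
        fields = line.split()
        if len(fields) != 5:
            continue  # ignore blank or malformed lines
        _chrom, _haplotype, start, end, group = fields
        # Coordinates are treated as 1-based and inclusive.
        totals[group] += int(end) - int(start) + 1
    return dict(totals)


totals = painted_length(io.StringIO(IDEO))
for group, length in sorted(totals.items()):
    print(group, length)
```

Gaps between consecutive blocks, such as 600001 to 3000000 on chromosome 2 in the example, are not attributed to any group and therefore do not count toward any total.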

Visualization and block refinement with GeMo

Go to GeMo WebApp

  • Load the data as follows:

[Screenshot: Gemo_Vizualise]

References