Chromosome painting using non admixed ancestral accessions (VCFHunter)

The aims of this tutorial are to showing how data should be processed to be then visualized with the GeMo

Installation

Install VCFHunter following the documentation presented above:

git clone https://github.com/gdroc/GeMo_tutorials.git
cd GeMo_tutorials
python3 -m venv $PWD/venv
source venv/bin/activate
pip install numpy
pip install matplotlib
pip install scipy

Download datasets

Two ways :

Download Baurens_et_al_2019.zip available on Zenodo

mkdir data
cd data
wget https://zenodo.org/record/6542870/files/Baurens_et_al_2019.zip
unzip Baurens_et_al_2019.zip
ls Baurens_et_al_2019.vcf > Vcf.conf

Goto Identification of private alleles and formatting output for more analysis

Create input dataset using Gigwa, a web application for managing and exploring high-density genotyping data, to download a VCF

Select the database Populations_A_B

Select the accessions P2 and T01 to T11 on the Indivuals drop down menu, and click on Search button

Download result (check radio “Export Metadata” and “Keep file on servers”)

Copy the link, and create a repository on your terminal

mkdir data
cd data
wget --no-check-certificate https://www.crop-diversity.org/gigwa/genofilt/tmpOutput/anonymousUser/b429763f507dc1bb2b169d7da5cf1804/Population_A-B__project1__2021-10-12__148329variants__VCF.zip
unzip Population_A-B__project1__2021-10-12__148329variants__VCF.zip
cut -f 1,6 Population_A-B__21individuals_metadata.tsv > Baurens_et_al_2019_origin.txt
sed -i 's:balbisiana:BB:' Baurens_et_al_2019_origin.txt
sed -i 's:acuminata:AA:' Baurens_et_al_2019_origin.txt
ls Population_A-B__148329variants__10individuals.vcf > Vcf.conf

VCF content

grep "^#CHROM" Population_A-B__148329variants__21individuals.vcf
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  ACC48-FPG   ACC48-FPN   ACC48-P_Ceylan  ACC48-Red_Yade  DYN163-Kunnan   DYN275-Pelipita DYN359-Safet_Velchi GP1 GP2 P1  P2  T01 T02 T03 T04 T05 T06 T07 T08 T10 T11

Workflow

The principle of this analysis is to :

Identify specific allele of distinct genetic pools,
Calculate the expected allelic ratio of these alleles in these genetic pools,
Calculate the observed allelic ratio a/several given accessions
Normalize these observed ratios using expected ratio to infer the number of haplotypes of each genetic pools that are present on a given windows of the studied accession.

Files obtained at the end of the process can be given to GeMo tools to visualize data and optimize parameters.

Input

Baurens_et_al_2019_origin.txt
Vcf.conf is a file which contained path to vcf files which will be used for e-chromosome painting.
Baurens_et_al_2019_chromosome.txt (tabulated file with the chromosome name and length)
Baurens_et_al_2019_color.txt

group	name	r	g	b
AA	acuminata	0	255	0
BB	balbisiana	255	0	0

Identification of private alleles and formatting output for more analysis

bin/IdentPrivateAllele.py -c data/Vcf.conf -g Baurens_et_al_2019_origin.txt -o step1 -a y  -m y

In this first step, the program use genotyping information provided in vcf files passed in Vcf.conf file and the file Origin.tab containing the corresponding genetic pools of some accessions of the vcf to identify alleles specific of each pools.

Outputs can be found in directory passed in -o option. For each accessions identified as belonging to a genetic pool a directory is created.

tree step1
step1
├── P2
│   ├── P2_ratio.tab.gz
│   └── tmp_1_P2_stats.tab
├── T01
│   ├── T01_ratio.tab.gz
│   └── tmp_1_T01_stats.tab
├── T02
│   ├── T02_ratio.tab.gz
│   └── tmp_1_T02_stats.tab
├── T03
│   ├── T03_ratio.tab.gz
│   └── tmp_1_T03_stats.tab
├── T04
│   ├── T04_ratio.tab.gz
│   └── tmp_1_T04_stats.tab
├── T05
│   ├── T05_ratio.tab.gz
│   └── tmp_1_T05_stats.tab
├── T06
│   ├── T06_ratio.tab.gz
│   └── tmp_1_T06_stats.tab
├── T07
│   ├── T07_ratio.tab.gz
│   └── tmp_1_T07_stats.tab
├── T08
│   ├── T08_ratio.tab.gz
│   └── tmp_1_T08_stats.tab
├── T10
│   ├── T10_ratio.tab.gz
│   └── tmp_1_T10_stats.tab
└── T11
    ├── T11_ratio.tab.gz
    └── tmp_1_T11_stats.tab

Determination of expected read ratio for each ancestral position based on ancestral accessions merged together

bin/allele_ratio_group.py -g Baurens_et_al_2019_origin.txt -p _ratio.tab.gz -o step2 -i step1

In this second step the program take the input of specific allele identified in each accessions used to define genetic pools (ratio.tab.gz files of step1 folder) and calculate an average expected allele ratio (globally a proxy of the fixation level of the allele) in the genetic pool the allele belongs.

A tabulated file is generated per genetic pool with the following format:

c hromosome	position	allele	genetic pool	average allelic ratio observed	number of ancestral a ccessions
chr02	15033812	A	AA	0.9959677 419354839	8
chr02	17722345	G	AA	1.0	8
chr09	39501254	T	AA	1.0	8
chr05	17536961	T	AA	1.0	8
chr06	10144735	A	AA	0.9931737 588652483	8
chr08	4718673	T	AA	0.9932432 432432432	8
chr10	37498708	T	AA	0.9239074 518611573	8

Calculation of observed ratio in other accessions

The third step is to calculate, for each position in which an allele specific of a genetic pool was identified, the observed allelic ratio in a studied accession. In this example we calculate this ratio on the Kunnan accession.

bin/allele_ratio_per_acc.py -c Vcf.conf -g Baurens_et_al_2019_origin.txt -i step2 -o step3 -a DYN163-Kunnan

The output can be found in the step3 folder passed in -o option. This tabulated file contained 6 columns: column 1 corresponded to the chromosome, column 2 is the position of the allele, column 3 is the allele, column 4 corresponded to the observed allele frequency in the accession, column 5 is the expected allele frequency calculated at step 2 and column 6 is the genetic group to which the allele has been attributed.

For example : zmore step3/DYN163-Kunnan_ratio.tab.gz

chr	pos	allele	obs_ratio	exp_ratio	grp
chr01	20888	A	0.0	0.23513227513227516	BB
chr01	20916	C	0.14754098360655737	0.28604868303910713	BB
chr01	21019	G	0.21875	0.3700537473602161	BB
chr01	67413	T	0.5818181818181818	1.0	AA
chr01	67413	A	0.41818181818181815	1.0	BB
chr01	67461	G	0.0	0.975	AA
chr01	89923	G	0.6842105263157895	1.0	AA
chr01	89923	T	0.3157894736842105	1.0	BB
chr01	89958	T	0.6842105263157895	1.0	AA

Calculation on sliding of the normalized observed ratio and ancestral blocs

In this step, in a given sliding windows, the observed average allelic ratio is calculated for each genetic pool and normalized by the expected allelic ratio. The resulting value is used to infer the number of haplotypes from the studied genetic pool present in the studied accession.

Output are of two types:

<accession>_win_ratio.tab.gz file containing normalized values for each genetic pools in the given windows. This file contained 4 + X columns, X being the number of genetic pools tested. The column 1 contained the chromosome name, column 2 contained the position of the central allele in the windows, column 3 contained the start position of the windows and column 4 contained the end position of the windows. Columns 5 to end contained the normalized ratio calculated for the accessions. A columns per genetic pool.
<accession>_<chromosome>_<haplotype>.tab contained the hypothesized haplotypes from this accession given results from tab.gz file. Haplotype are hypothetic ones that tries to minimize recombinations events between distinct genetic pools. These files are formatted as follows: column 1 contained accession name, column 2 contained chromosome ID, column 3, 4 and 5 contained start, end, and origin of a region.

mkdir step4
bin/PaintArp.py -a DYN163-Kunnan -r step3/DYN163-Kunnan_ratio.tab.gz -c Baurens_et_al_2019_color.txt -o step4/DYN163-Kunnan -w 12 -O 0 -s Baurens_et_al_2019_chromosome.txt

File formatting for GeMo visualization

This steps aims at reformatting the files so that they are compatible with GeMo tool. GeMo tool performs two tasks, the first one consists in drawing ancestral block identified at step 4. The second one also draw these blocks but allowed refinement of these block using custom and adjustable parameters. For block drawing of step 4 we will reformat block files so that they match expectation with GeMo. For this run the following command line:

mkdir step5
bin/convertForIdeo.py --name DYN163-Kunnan --dir step4 --col Baurens_et_al_2019_color.txt --size Baurens_et_al_2019_chromosome.txt --prefix step5/DYN163-Kunnan --plo 2

This command generate several files with the following names:

<accession_id>_ideo.txt that contained block that could be drawn with GeMo (data section),
<accession_id>_curve.txt that contained block that could be drawn with GeMo (data section),
<accession_id>_ideoProb.txt that contained block that could be drawn with GeMo (data section),
<accession_id>_chrom.txt that contained information required to draw chromosomes.
<accession_id>_color.txt contained color information that could be used to draw blocks with custom color.