Welcome to GeMo’s documentation!

GeMo is a WebApp to represent Genome Mosaics with current focus on plants. However, GeMo is developed in a generic way it can be also applied to other organisms.

Introduction

GeMo is a WebApp to represent Genome Mosaics with current focus on plants. However, GeMo is developed in a generic way it can be also applied to other organisms.

Main features

  • Dynamic chromosome painting visualisation

  • Online Data curation of mosaic prediction

  • Markers or Genes Plots on mosaic karyotypes

  • Data and high quality image export

Input formats

GeMo requires two types of datasets to generate the ideogram visualization

Menu

The position of the mosaic blocks along the chromosomes. It accepts two types of files:

  • Genomic blocks

chr

haplotype

start

end

ancestral_group

chr01

0

1

29070452

g4

chr01

1

1

29070452

g4

chr02

0

1

29511734

g4

chr02

1

1

29511734

g4

  • Normalized curves

chr

start

end

V

T

S

chr01

1145

189582

0.001671988

0.014082301

0.001638686

chr01

189593

356965

0.001244196

0.012867256

0.001810139

chr01

356968

488069

0.001117959

0.010035172

0.000759437

chr01

488097

633373

0.002678213

0.010470727

0.003896031

  • Chromosomes sizes and labels

Chromosome data format, each column tab separated chr, len, centromereInf (optional), centromereSup (optional), label (optional)

chr

len

label

chr01

37945898

AB

chr02

34728925

AB

chr03

40528553

AB

chr04

34728925

AB

chr05

44598304

AB

chr06

46248384

AB

chr07

42818424

AB

chr08

38870123

AB

  • Optional files

Users can provide their own color codes or use the online features (custom or color blind friendly palettes)

Color

group

name

hex

g1

group1

#000000

g2

group2

#ffc000

g3

group3

#1440cd

g4

group4

#00b009

Annotations

A list of genomic coordinates (e.g. genes of interest, QTLs) can be provided in a BED-like to visually spot the corresponding regions on the chromosomes. This can be particularly useful to check correlations between parental/ancestral blocks and genes/regions of interest.

chr01

5287838

5289224

gene

0

-

chr01

15485703

15486813

gene

0

+

chr02

2276353

2277821

gene

0

+

bed_annot

Data outputs

Once data is provided the chromosome diagram is generated on the fly. Chromosomes display colored blocks usually corresponding to their ancestral/parental origin. An interactive legend is present to label each group with a corresponding color. The user can modify the color of a group directly in the legend.

  • Blocks

In the example below, the 11 chromosomes of an doploid organism is visualized. Three main colors (green, blue and red) are visible and corresponds to 3 distinct genepools that contributed to the genetic make up of this genotype. The segements in grey corresponds to unknown.

blocks
  • Curves

In this mode, the graph represents the proportion of haplotypes of each ancestral origin along chromosomes. They are the results of a normalisation of the number of reads supporting each origin on a given window.

curves

In this example, allelic ratio for a range of founding genepools are respresented by different colors for chromosome 1. Two genepools in green is the main contributor with smaller contribtuons from the blue and red gene pools.

Data curation and export

Uploaded datasets are automatically loaded in the text box of the GeMo menu, allowing users to update the content and reflect it on the image by clicking on the “update image” button.

In curve mode, users can visually set the threshold on the graph to recalculate the origin and size of clored block forming the mosacis. This can be particularly useful when multiple putative parental gene pools with unclear signals can create noisy mosaics or to switch segments from one haplotype to another for better consistency. Once a threshold is changed, the karyotype diagram is automatically updated.

treshold

For pre-loaded data, the curve mode can be activated only when the normalized curves dataset exists. In this case, a toggle button labeled “Curve based mode” is present at the top of the user input form.

GeMo offers the possibility to download the latest version of the data sets and export the output graphics as SVG for publication purposes. In addition, data can be also stored temporarily online with a unique URL allowing to share it with multiple users.

Live demo

GeMo is available for free to use at https://gemo.southgreen.fr/ where anyone can upload its own data or test with pre-loaded mosaics/datasets.

Citation

Summo M, Comte A, Martin G, Weitz E, Perelle P, Droc G and Rouard M. GeMo: A mosaic genome painting tool for plant genomes. (in prep)

Acknowledgements

GeMo has been developed in the framework of the Genome Harvest project supported by the Agropolis fondation.

Troubleshootings and web browser compatibility

  • Some issues were reported for color management when using the exported SVG with Inkscape.

  • It is optimized for Chrome and works in Firefox and Edge but some design issues may occur with Safari.

The web interfaces were tested with the following platforms and web browsers:

OS

Version

Chrome

Firefox

Edge

Safari

Windows 10

10

88.0.4324.150

94.0.1

96.0.1054.29

n/a

Mac OS

11.2

97.0.4692.36

94.0.2

n/a

14.0.3

Quick Start

The objective of this tutorial is to reproduce part of the results presented in Baurens et al (2019) and Ahmed et al (2019), using respectively VCFHunter and TraceAncestor.

The outputs of these programs can then be used in the GeMo webapps.

Installation requirements

This tutorial is developed to run on Linux or Apple (MAC OS X) operating systems. There are no versions planned for Windows.

Software requirements:

  • Perl 5 for TraceAncestor

  • Python 3 for VCFHunter

Testing your Perl installation

To test that Perl 5 is installed, enter on the command line

perl -version

Testing your Python installation

To test that Python 3 is installed, enter on the command line

python3 --version

Now, you can clone the repository, create a virtualenv and install several additionnal package using pip.

git clone https://github.com/gdroc/GeMo_tutorials.git
cd GeMo_tutorials
python3 -m venv $PWD/venv
source venv/bin/activate
pip install numpy
pip install matplotlib
pip install scipy

Download Dataset

For this tutorial, Dataset that will be used by TraceAncestor or by VCFHunter are accessible on Zenodo https://doi.org/10.5281/zenodo.6539270

To download this, you only need to launch the script download_dataset.pl without any parameter

perl download_dataset.pl

This script create a new directory data

data/
├── Ahmed_et_al_2019_color.txt
├── Ahmed_et_al_2019_individuals.txt
├── Ahmed_et_al_2019_origin.txt
├── Ahmed_et_al_2019.vcf
├── Baurens_et_al_2019_color.txt
├── Baurens_et_al_2019_individuals.txt
├── Baurens_et_al_2019_origin.txt
├── Baurens_et_al_2019_chromosome.txt
└── Baurens_et_al_2019.vcf

These files are require for this tutorial to run VCFHunter or TraceAncestor

Input

  • Baurens_et_al_2019_origin.txt : A two column file with individuals in the first column and group tag (i.e. origin) in the second column

individuals

origin

P2

AA

T01

BB

T02

BB

T03

AA

T04

AA

T05

AA

T06

AA

T07

AA

T08

BB

  • Baurens_et_al_2019.vcf : A vcf file with ancestral and admixed individuals

grep #CHROM data/Baurens_et_al_2019.vcf
#CHROM       POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  ACC48-FPG       ACC48-FPN       ACC48-P_Ceylan  ACC48-Red_Yade  DYN163-Kunnan   DYN275-Pelipita DYN359-Safet_Velchi     GP1     GP2     P1      P2      T01     T02     T03     T04     T05     T06     T07     T08     T10     T11
  • Baurens_et_al_2019_individuals.txt : A two column file with individuals to scan for origin (same as defined in the VCF headerline) in the first column and the ploidy in the second column.

  • Baurens_et_al_2019_color.txt : A color file with 4 columns: col1=group and the three last column corresponded to RGB code.

group

name

r

g

b

AA

acuminata

0

255

0

BB

balbisiana

255

0

0

Run workflow using create_gemo_input.pl

perl create_gemo_input.pl --help
Parameters :
    -v, --vcf         A vcf file [required]
    -o, --origin      A two column file with individuals in the first column and group tag (i.e. origin) in the second column [Required]
    -i, --individuals List of individuals to scan from vcf, as defined in the VCF headerline [Required]
    -m, --method      Permissible values: vcfhunter traceancestor (String). Default vcfhunter
    -c, --color       A color file with 4 columns: col1=group and the three last column corresponded to RGB code.
    -t, --threads     Number of threads
    -d, --dirout      Path to the output directory (Default method option name)
    -h, --help        display this help

1. With VCFHunter method

You must use the dataset prefixed with Baurens_et_al.

perl create_gemo_input.pl --vcf data/Baurens_et_al_2019.vcf --origin data/Baurens_et_al_2019_origin.txt --individuals data/Baurens_et_al_2019_individuals.txt --method vcfhunter --color data/Baurens_et_al_2019_color.txt --threads 4

2. With TraceAncestor method

You must use the dataset prefixed with with Ahmed_et_al.

perl create_gemo_input.pl --vcf data/Ahmed_et_al_2019.vcf --origin data/Ahmed_et_al_2019_origin.txt --individuals data/Ahmed_et_al_2019_individuals.txt --method traceancestor --color data/Ahmed_et_al_2019_color.txt

Explanation of outputs

A directory was create depending on parameter dirout (default method name)

For example, for VCFHunter, for each individual present in the file data/Baurens_et_al_2019_individuals.txt, 4 outputs are produced in this directory, prefixed with the name of indivual :

  • DYN163-Kunnan_ideo.txt : A text file of the position of genomic blocks the ancestry mosaic with a succession of genomic blocks along the chromosome

chr

haplotype

start

end

ancestral_group

chr01

0

0

20888

AA

chr01

0

20888

451633

AA

chr01

0

451633

848109

AA

chr01

0

848109

1198648

AA

chr01

0

1198648

1555128

un

chr01

0

1555128

1899887

AA

chr01

0

1899887

2296417

un

chr01

0

2296417

2759817

un

  • DYN163-Kunnan_chrom.txt : A tab file with name, length and karyotype based on ploidy (optionaly the location of centromere).

chr

len

centromereInf

centromereSup

label

chr01

29070452

14535226

14535228

AB

chr02

29511734

14755867

14755869

AB

chr03

35020413

17510206

17510208

AB

chr04

37105743

18552871

18552873

AB

chr05

41853232

20926616

20926618

AB

chr06

37593364

18796682

18796684

AB

chr07

35028021

17514010

17514012

AB

chr08

44889171

22444585

22444587

AB

chr09

41306725

20653362

20653364

AB

chr10

37674811

18837405

18837407

AB

chr11

27954350

13977175

13977177

AB

  • BDYN163-Kunnan_color.txt : Frequency of ancestors alleles along chromosome for the particular hybrid focused.

group

name

hex

AA

acuminata

#00ff00

BB

balbisiana

#ff0000

un

un

#bdbdbd

  • DYN163-Kunnan_curve.txt : Frequency of ancestors alleles along chromosome for the GeMo visualization tool.

chr

start

end

AA

BB

chr01

20888

525207

0.660757486645395

0.30378982223766354

chr01

525207

1086954

0.6425583592191819

0.3508607451997505

chr01

1086954

1563263

0.7355412887547506

0.2661255866893344

chr01

1563263

2058335

0.6136974042002844

0.3851682528896984

chr01

2058335

2638987

0.5543371247412866

0.39469329280411

chr01

2638987

3190388

0.6752108036341729

0.3208947817296506

chr01

3190388

3905155

0.6951554613138214

0.3155181655339866

chr01

3905155

4800522

0.6813746934348566

0.32271710110143237

Visualization and block refinement with GeMo

Go to GeMo WebApp

  • Ideogram Mode

GeMo_Vizualise
  • Curve mode

GeMo_Vizualise

References

Chromosome painting using non admixed ancestral accessions (VCFHunter)

The aims of this tutorial are to showing how data should be processed to be then visualized with the GeMo

Installation

Install VCFHunter following the documentation presented above:

git clone https://github.com/gdroc/GeMo_tutorials.git
cd GeMo_tutorials
python3 -m venv $PWD/venv
source venv/bin/activate
pip install numpy
pip install matplotlib
pip install scipy

Download datasets

Two ways :

mkdir data
cd data
wget https://zenodo.org/record/6542870/files/Baurens_et_al_2019.zip
unzip Baurens_et_al_2019.zip
ls Baurens_et_al_2019.vcf > Vcf.conf

Goto Identification of private alleles and formatting output for more analysis

  • Create input dataset using Gigwa, a web application for managing and exploring high-density genotyping data, to download a VCF

Select the database Populations_A_B

Select Database

Select the accessions P2 and T01 to T11 on the Indivuals drop down menu, and click on Search button

Select Individu

Download result (check radio “Export Metadata” and “Keep file on servers”)

Export Gigwa

Copy the link, and create a repository on your terminal

mkdir data
cd data
wget --no-check-certificate https://www.crop-diversity.org/gigwa/genofilt/tmpOutput/anonymousUser/b429763f507dc1bb2b169d7da5cf1804/Population_A-B__project1__2021-10-12__148329variants__VCF.zip
unzip Population_A-B__project1__2021-10-12__148329variants__VCF.zip
cut -f 1,6 Population_A-B__21individuals_metadata.tsv > Baurens_et_al_2019_origin.txt
sed -i 's:balbisiana:BB:' Baurens_et_al_2019_origin.txt
sed -i 's:acuminata:AA:' Baurens_et_al_2019_origin.txt
ls Population_A-B__148329variants__10individuals.vcf > Vcf.conf

VCF content

grep "^#CHROM" Population_A-B__148329variants__21individuals.vcf
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  ACC48-FPG   ACC48-FPN   ACC48-P_Ceylan  ACC48-Red_Yade  DYN163-Kunnan   DYN275-Pelipita DYN359-Safet_Velchi GP1 GP2 P1  P2  T01 T02 T03 T04 T05 T06 T07 T08 T10 T11

Workflow

The principle of this analysis is to :

  • Identify specific allele of distinct genetic pools,

  • Calculate the expected allelic ratio of these alleles in these genetic pools,

  • Calculate the observed allelic ratio a/several given accessions

  • Normalize these observed ratios using expected ratio to infer the number of haplotypes of each genetic pools that are present on a given windows of the studied accession.

Files obtained at the end of the process can be given to GeMo tools to visualize data and optimize parameters.

Input

  • Baurens_et_al_2019_origin.txt

  • Vcf.conf is a file which contained path to vcf files which will be used for e-chromosome painting.

  • Baurens_et_al_2019_chromosome.txt (tabulated file with the chromosome name and length)

  • Baurens_et_al_2019_color.txt

group

name

r

g

b

AA

acuminata

0

255

0

BB

balbisiana

255

0

0

Identification of private alleles and formatting output for more analysis

bin/IdentPrivateAllele.py -c data/Vcf.conf -g Baurens_et_al_2019_origin.txt -o step1 -a y  -m y

In this first step, the program use genotyping information provided in vcf files passed in Vcf.conf file and the file Origin.tab containing the corresponding genetic pools of some accessions of the vcf to identify alleles specific of each pools.

Outputs can be found in directory passed in -o option. For each accessions identified as belonging to a genetic pool a directory is created.

tree step1
step1
├── P2
│   ├── P2_ratio.tab.gz
│   └── tmp_1_P2_stats.tab
├── T01
│   ├── T01_ratio.tab.gz
│   └── tmp_1_T01_stats.tab
├── T02
│   ├── T02_ratio.tab.gz
│   └── tmp_1_T02_stats.tab
├── T03
│   ├── T03_ratio.tab.gz
│   └── tmp_1_T03_stats.tab
├── T04
│   ├── T04_ratio.tab.gz
│   └── tmp_1_T04_stats.tab
├── T05
│   ├── T05_ratio.tab.gz
│   └── tmp_1_T05_stats.tab
├── T06
│   ├── T06_ratio.tab.gz
│   └── tmp_1_T06_stats.tab
├── T07
│   ├── T07_ratio.tab.gz
│   └── tmp_1_T07_stats.tab
├── T08
│   ├── T08_ratio.tab.gz
│   └── tmp_1_T08_stats.tab
├── T10
│   ├── T10_ratio.tab.gz
│   └── tmp_1_T10_stats.tab
└── T11
    ├── T11_ratio.tab.gz
    └── tmp_1_T11_stats.tab

Determination of expected read ratio for each ancestral position based on ancestral accessions merged together

bin/allele_ratio_group.py -g Baurens_et_al_2019_origin.txt -p _ratio.tab.gz -o step2 -i step1

In this second step the program take the input of specific allele identified in each accessions used to define genetic pools (ratio.tab.gz files of step1 folder) and calculate an average expected allele ratio (globally a proxy of the fixation level of the allele) in the genetic pool the allele belongs.

A tabulated file is generated per genetic pool with the following format:

c hromosome

position

allele

genetic pool

average allelic ratio observed

number of ancestral a ccessions

chr02

15033812

A

AA

0.9959677 419354839

8

chr02

17722345

G

AA

1.0

8

chr09

39501254

T

AA

1.0

8

chr05

17536961

T

AA

1.0

8

chr06

10144735

A

AA

0.9931737 588652483

8

chr08

4718673

T

AA

0.9932432 432432432

8

chr10

37498708

T

AA

0.9239074 518611573

8

Calculation of observed ratio in other accessions

The third step is to calculate, for each position in which an allele specific of a genetic pool was identified, the observed allelic ratio in a studied accession. In this example we calculate this ratio on the Kunnan accession.

bin/allele_ratio_per_acc.py -c Vcf.conf -g Baurens_et_al_2019_origin.txt -i step2 -o step3 -a DYN163-Kunnan

The output can be found in the step3 folder passed in -o option. This tabulated file contained 6 columns: column 1 corresponded to the chromosome, column 2 is the position of the allele, column 3 is the allele, column 4 corresponded to the observed allele frequency in the accession, column 5 is the expected allele frequency calculated at step 2 and column 6 is the genetic group to which the allele has been attributed.

For example : zmore step3/DYN163-Kunnan_ratio.tab.gz

chr

pos

allele

obs_ratio

exp_ratio

grp

chr01

20888

A

0.0

0.23513227513227516

BB

chr01

20916

C

0.14754098360655737

0.28604868303910713

BB

chr01

21019

G

0.21875

0.3700537473602161

BB

chr01

67413

T

0.5818181818181818

1.0

AA

chr01

67413

A

0.41818181818181815

1.0

BB

chr01

67461

G

0.0

0.975

AA

chr01

89923

G

0.6842105263157895

1.0

AA

chr01

89923

T

0.3157894736842105

1.0

BB

chr01

89958

T

0.6842105263157895

1.0

AA

Calculation on sliding of the normalized observed ratio and ancestral blocs

In this step, in a given sliding windows, the observed average allelic ratio is calculated for each genetic pool and normalized by the expected allelic ratio. The resulting value is used to infer the number of haplotypes from the studied genetic pool present in the studied accession.

Output are of two types:

  • <accession>_win_ratio.tab.gz file containing normalized values for each genetic pools in the given windows. This file contained 4 + X columns, X being the number of genetic pools tested. The column 1 contained the chromosome name, column 2 contained the position of the central allele in the windows, column 3 contained the start position of the windows and column 4 contained the end position of the windows. Columns 5 to end contained the normalized ratio calculated for the accessions. A columns per genetic pool.

  • <accession>_<chromosome>_<haplotype>.tab contained the hypothesized haplotypes from this accession given results from tab.gz file. Haplotype are hypothetic ones that tries to minimize recombinations events between distinct genetic pools. These files are formatted as follows: column 1 contained accession name, column 2 contained chromosome ID, column 3, 4 and 5 contained start, end, and origin of a region.

mkdir step4
bin/PaintArp.py -a DYN163-Kunnan -r step3/DYN163-Kunnan_ratio.tab.gz -c Baurens_et_al_2019_color.txt -o step4/DYN163-Kunnan -w 12 -O 0 -s Baurens_et_al_2019_chromosome.txt

File formatting for GeMo visualization

This steps aims at reformatting the files so that they are compatible with GeMo tool. GeMo tool performs two tasks, the first one consists in drawing ancestral block identified at step 4. The second one also draw these blocks but allowed refinement of these block using custom and adjustable parameters. For block drawing of step 4 we will reformat block files so that they match expectation with GeMo. For this run the following command line:

mkdir step5
bin/convertForIdeo.py --name DYN163-Kunnan --dir step4 --col Baurens_et_al_2019_color.txt --size Baurens_et_al_2019_chromosome.txt --prefix step5/DYN163-Kunnan --plo 2

This command generate several files with the following names:

  • <accession_id>_ideo.txt that contained block that could be drawn with GeMo (data section),

  • <accession_id>_curve.txt that contained block that could be drawn with GeMo (data section),

  • <accession_id>_ideoProb.txt that contained block that could be drawn with GeMo (data section),

  • <accession_id>_chrom.txt that contained information required to draw chromosomes.

  • <accession_id>_color.txt contained color information that could be used to draw blocks with custom color.

References

Chromosome painting using TraceAncestor

TraceAncestor is a suite of script that allows to estimate the allelic dosage of ancestral alleles in hybrid individuals and then to perform chromosome painting.

Installation

git clone https://github.com/gdroc/GeMo_tutorials.git
cd GeMo_tutorials

Download dataset, you only need to launch the script download_dataset.pl without any parameter

perl download_dataset.pl

This script create a new directory data

data/
├── Ahmed_et_al_2019_color.txt
├── Ahmed_et_al_2019_individuals.txt
├── Ahmed_et_al_2019_origin.txt
├── Ahmed_et_al_2019.vcf

Workflow

vcf2gst.pl

Usage

This script is used to define GST values from individuals that are identified as pure breed for an ancestor.

Must be used on pure breed. If there is introgressed part on the genome of the individual, the part must be removed before analysis.

bin/vcf2gst.pl --help

Parameters :
   --vcf       vcf containing the ancestors and other individuals to scan [Required]
   --ancestor  A two column file with individuals in the first column and group tag (i.e. origin) in the second column [Required]
   --depth     minimal depth for a snp to be used in the analysis (Default 5)
   --output    output file name (Default GSTmatrix.txt)
   --help

Input

–ancestor Ancestor file (Required)

A two column file with individuals in the first column and group tag (i.e. origin) in the second column

individuals

origin

De_Chios

Mandarin

Shekwasha

Mandarin

Sunki

Mandarin

Cleopatra

Mandarin

Pink

Pummello

Timor

Pummello

Tahitian

Pummello

Deep_red

Pummello

Corsican

Citron

Buddha_Hand

Citron

–vcf VCF file (Required)

Now, you can run the following command

perl bin/vcf2gst.pl --ancestor data/Ahmed_et_al_2019_origin.txt --vcf data/Ahmed_et_al_2019.vcf --output GSTMatrix.txt

Output

The output is a CSV file containing GST (inter-population differentiation parameter) information:

with :

  • #CHROM = chromosome name

  • POS = position of DSNP

  • REF = Base of the reference allele of this DSNP

  • ALT = Base of the alternative allele of this DSNP

  • %Nref = Percentage of maximal missing data for this DSNP

  • GST = value of GST (inter-population differentiation parameter) (With 1,2,3 the ancestors names)

  • F = Alternative allele frequency for each ancestor (With 1,2,3 the ancestors names)

prefilter.pl

Usage

This script is used to define a matrix of ancestry informative markers from the matrix gotten at the step 1.

bin/prefilter.pl --help
Parameters :
    --matrix    GST matrix [Required]
    --gst       threshold for gst (Default : 0.9)
    --missing   threshold for missing data (Default 0.3)
    --output    output file name (Default Diagnosis_matrix)
    --help      display this help

Now, you can run the following command

perl bin/prefilter.pl --matrix GSTMatrix.txt --output Diagnosis_matrix.txt

Output

A matrix containing all the ancestry informative markers for every ancestors.

with:

  • ancestor = Ancestor names

  • chromosome = Chromosome numbers

  • position = Position of the SNP marker

  • allele = Base of the ancestral allele

TraceAncestor.pl

Usage

bin/TraceAncestor.pl --help

Parameters :
    --matrix     Diagnosis matrix [Required]
    --vcf       vcf of the hybrid population
    --individuals    A two column file with individuals to scan for origin (same as defined in the VCF headerline) in the first column and the ploidy in the second column [Required]
    --window    number of markers by window (Default 10)
    --lod       LOD value to conclude for one hypothesis (Default 3)
    --freq      theoretical frequency used to calcul the LOD (Default 0.99)
    --cut       number of K bases in one window (Default 100)
    --dirout    Directory output (Default result)
    --help      display this help

Input

–individuals A two column file with individuals to scan for origin (same as defined in the VCF headerline) in the first column and the ploidy in the second column.

Now, you can run the following command

perl bin/TraceAncestor.pl --matrix Diagnosis_matrix.txt --vcf data/Ahmed_et_al_2019.vcf --individuals data/Ahmed_et_al_2019_individuals.txt

Output

For each individual present in the file data/Ahmed_et_al_2019_individuals.txt, 4 outputs are produced, prefixed with the name of indivual :

  • Bergamot_ideo.txt : A text file of the position of genomic blocks the ancestry mosaic with a succession of genomic blocks along the chromosome

chr

haplotype

start

end

ancestral_group

1

0

1

28700000

Citron

1

1

1

28700000

Pummello

2

0

1

600000

Citron

2

0

3000001

4200000

Mandarin

2

0

4200001

10400000

Citron

2

0

10800001

35200000

Citron

  • Bergamot_chrom.txt : A tab file with name, length and karyotype based on ploidy.

  • Bergamot_ancestor.txt : Frequency of ancestors alleles along chromosome for the particular hybrid focused.

  • Bergamot_curve.txt : Frequency of ancestors alleles along chromosome for the GeMo visualization tool.

Visualization and block refinement with GeMo

Go to GeMo WebApp

  • Load data has follow

Gemo_Vizualise

References

PCA analysis

Installation

pip install Bio
pip install sklearn

Dependencies

plink

R Package ade4

R
install.packages("ade4")

Datasets

  • Download Rice 3K RG 404k CoreSNP Dataset, all chromosomes

cd data
wget https://s3.amazonaws.com/3kricegenome/snpseek-dl/3krg-base-filt-core-v0.7/core_v0.7.bed.gz
wget https://s3.amazonaws.com/3kricegenome/snpseek-dl/3krg-base-filt-core-v0.7/core_v0.7.bim.gz
wget https://s3.amazonaws.com/3kricegenome/snpseek-dl/3krg-base-filt-core-v0.7/core_v0.7.fam.gz

gunzip core_v0.7.bed.gz
gunzip core_v0.7.bim.gz
gunzip core_v0.7.fam.gz
  • Download information for a subset of these accession

wget https://raw.githubusercontent.com/SantosJGND/Galaxy_KDE_classifier/v1.2/Downstream_functions/Analyses_Jsubtrop_self_KDE/Order_core.txt
grep -v "COUNTRY" Order_core.txt | cut -f 2 > sample.txt

Workflow

  • Convert to vcf using plink

plink --bfile core_v0.7 --recode vcf-iid --keep-fam sample.txt --out core_v0.7
  • Adjust some missing value on vcf file

sed -i 's=GT=GT:AD:DP=' core_v0.7.vcf
sed -i 's=0/0=0/0:20,0:20=g' core_v0.7.vcf
sed -i 's=0/1=0/1:10,10:20=g' core_v0.7.vcf
sed -i 's=1/1=1/1:0,20:20=g' core_v0.7.vcf
sed -i 's=\.\/\.=\.\/\.:\.,\.:\.=g' core_v0.7.vcf

The first step of the Chromosome painting is to perform a PCA analysis on the vcf file to cluster the alleles and the accession.

Create a folder in which the analysis will be performed and run the following command line:

mkdir PCA
bin/vcf2struct.1.0.py --vcf data/core_v0.7.vcf --names data/sample.txt --type FACTORIAL --prefix PCA/Analysis --nAxes 6 --mulType coa

The last command line run the factorial analysis (–type FACTORIAL option). During this analysis the vcf file is recoded as followed : For each allele at each variants site two markers were generated; One marker for the presence of the allele (0/1 coded) and one for the absence of the allele (0/1 coded).

_images/Vcf2struct_Fig3.png

Only alleles present or absent in part (not all) of selected accessions were included in the final matrix file named PCA/Analysis_matrix_4_PCA.tab in this example. An additional column named “GROUP” can be identified. This column is filled with “UN” value if no –group argument is passed. We will explain later this argument.

The factorial analysis (here a COA, –mulType option) was performed on the transposed matrix using R (The R script is generated by the script and can be found here: PCA/Analysis_multivariate.R). R warning messages and command lines are recorded in the file named Analysis_multivariate.Rout. Graphical outputs of the analysis were draw and for example accessions and alleles can be projected along axis in the following picture.

_images/PCA1.png

Correspond to the file : PCA/Analysis_axis_1_vs_2.pdf

In this example the left graph represent accessions projected along axis 1 and 2 and the right represent the allele projected along synthetic axis. A graphical representation is performed for each axis combinations and each file is named according to the following nomenclature *prefix + _axis_X_vs_Y.pdf*. Several pdf for accessions along axis only is also generated and are named according to the following nomenclature *prefix + _axis_X_vs_Y_accessions.pdf*.

Individual and variables coordinates for the selected 6 first axis (–nAxes option) are recorded in files named PCA/Analysis_individuals_coordinates.tab and PCA/Analysis_variables_coordinates.tab respectively. A third file named PCA/Analysis_variables_coordinates_scaled.tab containing allele scaled coordinates (columns centered and reduced) along synthetic axis is generated.

sort -k 2n,2 PCA/Analysis_individuals_coordinates.tab | cut -f 1 -d " " | tail -10 | sed 's:\"::g' | sed  's=\.=-=' | sed "s:$:\tg1:" > group1.txt
sort -k 3n,3 PCA/Analysis_individuals_coordinates.tab | cut -f 1 -d " " | tail -10 | sed 's:\"::g' | sed  's=\.=-=' | sed "s:$:\tg2:" > group2.txt
sort -k 3nr,3 PCA/Analysis_individuals_coordinates.tab | cut -f 1 -d " " | tail -10 | sed 's:\"::g' | sed  's=\.=-=' | sed "s:$:\tg3:" > group3.txt

echo '["group"]' > data/origin.txt
cat group1.txt group2.txt group3.txt >> data/origin.txt
echo '["color"]' >> data/origin.txt
echo -e "g1\tred=0:green=1:blue=0:alpha=0.7" >> data/origin.txt
echo -e "g2\tred=0:green=0:blue=1:alpha=0.7" >> data/origin.txt
echo -e "g3\tred=1:green=0:blue=0:alpha=0.7" >> data/origin.txt

The –group option

We assume that in some case you have additional informations on your dataset such as which accessions are admixed and which accessions are likely to be the ancestral one. And maybe you want to verify/project this information in your analysis. This can be done passing a configuration file with two section to the –group option. This file can be found in the data/config/ folder and is named AncestryInfo.tab. You can have a look at the file if you want but basically the two sections are named [group] and [color] and contained respectively the accession suspected grouping and a color (in RGB proportion) you want to attribute to each group. Accessions with no group should filled with “UN” value.

Warning

Group name should be written in upper case (due to R sorting).

mkdir -p PCA_group
bin/vcf2struct.1.0.py --vcf data/core_v0.7.vcf --names data/sample.txt --type FACTORIAL --prefix PCA_group/Analysis --nAxes 6 --mulType coa --group data/origin.txt
_images/PCA2.png

Mean Shift clustering Now that allele have been projected along synthetic axes, it is time to cluster these alleles. The idea is that the structure reflected by the synthetic axis represent the ancestral structure. In this context, the alleles at the extremities of the cloud of points will be the ancestral ones. These alleles can be clustered using several approaches. In this tutorial we will use a Mean Shift clustering approach.

bin/vcf2struct.1.0.py --type SNP_CLUST-MeanShift --VarCoord PCA_group/Analysis_variables_coordinates.tab --dAxes 1:2 --mat PCA_group/Analysis_matrix_4_PCA.tab --thread 8 --prefix PCA_group/Analysis --quantile 0.15

The Mean Shift clustering is performed with only the 2 first axes of the COA (–dAxes 1:2) because the analysis showed that most of the inertia is on these axes. With a mean shift approach, the number of group is automatically detected.

During the process, several informations are returned to standard output, but at the end of the process three main informations are returned:

  • the number of alleles used for the analysis. Allele present or absent in all accessions are removed.

  • the number of estimated clusters which can be found in the line:

Performing MeanShift
Bandwidth estimation: 0.5199882678747445
number of estimated clusters : 4
  • the number of allele grouped within each group is returned and should look like as followed:

Group g0 contained 28363 dots
Group g1 contained 8704 dots
Group g2 contained 3444 dots
Group g3 contained 3300 dots

Five file are generated and can be found in the PCA_group folder:

  • PCA_group/Analysis_kMean_allele.tab file which correspond to the PCA_group/Analysis_matrix_4_PCA.tab in which the allele grouping has been recorded.

  • PCA_group/Analysis_centroid_coordinates.tab file which regroup the centroids coordinates.

  • PCA_group/Analysis_centroid_iteration_grouping.tab file which records for each centroid its grouping.

  • PCA_group/Analysis_group_color.tab file that attribute a color to the groups.

  • PCA_group/Analysis_kMean_gp_prop.tab file that report for each allele the probability to be in each groups. This is not a “real” probability, the idea was to have a statistics in case you want to filter alleles. This value was calculated as the inverse of the euclidian distance of one point and each centroids and these values were normalized so that the sum is equal to 1.

Visualization of the allele grouping can be done as followed:

./bin/vcf2struct.1.0.py --type VISUALIZE_VAR_2D --VarCoord PCA_group/Analysis_variables_coordinates.tab --dAxes 1:2 --mat PCA_group/Analysis_kMean_allele.tab --group PCA_group/Analysis_group_color.tab --prefix PCA_group/AlleleGrouping

And corresponding representation :

_images/AlleleGrouping_axis1_vs_axis2.png

PCA_group/AlleleGrouping_axis1_vs_axis2.png

It is not necessary to have a 3d visualization but we can try the command anyway:

./bin/vcf2struct.1.0.py --type VISUALIZE_VAR_3D --VarCoord PCA_group/Analysis_variables_coordinates.tab --dAxes 1:2:3 --mat PCA_group/Analysis_kMean_allele.tab --group PCA_group/Analysis_group_color.tab

A window which should look like this should open:

_images/PCA3.png

This 3d visualization can be rotated with the mouse.

Local Install

Prerequisites

To install GeMo on your computer you need a local server environment like MAMP.

You will also need to install Python 3 and Node. We recommand to install NVM to manage Node and NPM versions.

Clone the GeMo repository

git clone https://github.com/SouthGreenPlatform/GeMo.git
cd GeMo

Install Node dependencies

npm install
npm ci

Create required directories

mkdir tmp
mkdir tmp/gemo_run
mkdir tmp/gemo_saved

Launch node server

In GeMo directory :

npm run server

Configure socket variable

In the GeMo directory, modify the index.php file to connect to your local node server :

var socket = io('http://localhost:9070');

Configure MAMP

Start MAMP and click the “Start” button in the toolbar. In MAMP > Preferences... > Web Server the Document root is set to /Applications/MAMP/htdocs. You can change the path to point on the GeMo directorie.

Your local GeMo is now accessible in your web browser : http://localhost:8888/