ABSTRACT
The 1000 Genomes Project is the first international collaborative project to
sequence the genomes of populations at low coverage (4X) in order to
identify genetic variants that have frequencies of at least 1%. The
generated data provide a comprehensive resource on human genetic
variation through the characterization of many millions of multi-allelic SNPs,
several classes of structural variants (SVs), and their haplotype contexts.
Analysis of variation data is a critical step in the interpretation of
sequencing data. Although genetic variant data from the 1000 Genomes
Project are freely and publicly available for research, a high-resolution
analysis has yet to be done. To address this gap, we set out
to explore the genetic variant data of 96 Pakistani individuals (the PJL
sub-population). VCFtools was used to mine genomic variant data specific to
the PJL sub-population. Sample analysis revealed a total of 62 families, of
which 42 are involved in the trio study. SNP counts, InDel counts and
transition-to-transversion (Ts/Tv) ratios were calculated for each chromosome.
SNP densities in bins of 1 Mb were calculated, showing that chr4 is the least
variable in the PJL sub-population while chr22 has the most variable
pattern. Principal Component Analysis (PCA) was performed with the R
statistical package to examine the relationship among chromosomes within the
PJL data, and a comparative model was developed against the whole 1000
Genomes Project data. This study will also help to identify demographic
history in the future.
INTRODUCTION
Understanding genetic variation is essential to decode the traces that
evolution has left in our genomes, and the availability of whole-genome
sequence data now allows us to interpret these signals at a resolution
never possible before. Genetic variation in humans generally follows clines
defined by geographical regions, and there are possibly very few fixed
differences between any pair of continents or populations. Nevertheless,
genetic differences among populations exist, mainly reflecting past
demographic events. Common population-specific SNPs are
non-randomly distributed throughout the genome. In some cases,
differences accumulate as an adaptation to population-specific environmental
pressure, a process known as positive selection.
Pakistan is situated at the crossroads of the Indian subcontinent,
Central Asia, and the Middle East. With an ethnically and linguistically
diverse population of >170 million, Pakistan is the 6th most populous country
in the world. Most of the Pakistani population has an ancestral north Indian
origin and is genetically close to Middle Easterners, Central Asians and
Europeans. The data produced by the 1000 Genomes Project have made it possible
to reconstruct the complex evolutionary history of the human species in
remarkable detail. The Pakistani sub-population (PJL) data within the South
Asian population represent characteristic variation sets that will be an
important asset for improving the genetic variation map of this region.
Remarkably, this simple approach, if applied to whole-genome sequences from
large population samples, usually seems to lead directly to the functional
variants responsible for the differentiated phenotype.
In this study, we aimed to understand the genetic differences between the
PJL sub-population and the 1000 Genomes Project data as a whole.
At the start of this project, emphasis was placed on counting
the genomic variants, supported by multivariate analysis to develop a
model representing the divergence at the chromosome level. In the future, the
information generated by this work will be used to further explore the
abundant phenotypic variation and to uncover evolutionary history.
COMPUTATIONAL METHODS
*.vcf files (v4.2) for each chromosome (chr1 to chr22) were downloaded
from EBI’s FTP server of the 1000 Genomes Project, along with other
accessory files (ped file, panel file, etc.)
The vcf-subset script was used with the following options to generate
*.vcf files containing only PJL samples:
“-c” − list of PJL sample IDs to be kept in the PJL *.vcf files for each
chromosome, and
“-p” − print only those sites that have alternative alleles in the PJL
samples and skip any other sites that are all REF allele in the PJL
samples.
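As an illustration of what this subsetting step does (it is not the vcf-subset implementation), the following is a minimal Python sketch. It assumes tab-separated VCF text with the nine fixed columns preceding the sample columns and the genotype (GT) as the first per-sample field. The helper name `subset_vcf` and the sample IDs `S1` to `S3` are hypothetical.

```python
import re

def subset_vcf(lines, keep_samples):
    """Keep only the listed sample columns and drop sites where every
    kept genotype is reference (0) or missing (.)."""
    keep = set(keep_samples)
    out = []
    idx = []  # column indices of the kept samples, filled from the #CHROM line
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("##"):          # meta-information lines pass through
            out.append(line)
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):      # header: select the sample columns
            idx = [i for i, name in enumerate(fields) if i >= 9 and name in keep]
            out.append("\t".join(fields[:9] + [fields[i] for i in idx]))
            continue
        genotypes = [fields[i] for i in idx]
        has_alt = False
        for sample in genotypes:
            gt = sample.split(":")[0]      # GT is assumed to be the first sub-field
            alleles = re.split(r"[/|]", gt)
            if any(a not in ("0", ".") for a in alleles):
                has_alt = True
                break
        if has_alt:
            out.append("\t".join(fields[:9] + genotypes))
    return out

# Made-up demo records (not real 1000 Genomes data).
demo_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3",
    "1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0|0\t0|1\t1|1",
    "1\t200\t.\tC\tT\t.\tPASS\t.\tGT\t0|0\t0|0\t0|1",
]
subset = subset_vcf(demo_vcf, ["S1", "S2"])
```

In the demo, the site at position 200 is dropped because neither kept sample carries an alternative allele there.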
BCFtools stats (1.1+htslib-1.1) was used to count SNPs and InDels and to
compute the Ts/Tv ratio; SNP densities were calculated in bins of 1
Mb using the SNPdensity output statistics option.
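To make the two statistics concrete, here is a minimal Python sketch of how a Ts/Tv ratio and a binned SNP density can be computed from a list of SNPs. It illustrates the definitions only and is not how BCFtools or VCFtools compute them internally. The function names, the bin-boundary convention (1-based positions, bin 0 covering positions 1 to 1,000,000) and the demo values are assumptions.

```python
from collections import Counter

# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T);
# every other single-base substitution is a transversion.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def ts_tv_ratio(snps):
    """snps: iterable of (ref, alt) single-base pairs.
    Returns (transitions, transversions, ratio)."""
    ts = tv = 0
    for ref, alt in snps:
        if (ref, alt) in TRANSITIONS:
            ts += 1
        else:
            tv += 1
    ratio = ts / tv if tv else float("nan")
    return ts, tv, ratio

def snp_density(positions, bin_size=1_000_000):
    """Count SNPs per bin of bin_size bases (1-based positions assumed)."""
    return Counter((pos - 1) // bin_size for pos in positions)

# Made-up demo values.
demo_snps = [("A", "G"), ("C", "T"), ("A", "C"), ("G", "T")]
ts, tv, ratio = ts_tv_ratio(demo_snps)

demo_positions = [1, 999_999, 1_000_000, 1_000_001, 2_500_000]
density = snp_density(demo_positions)
```

Dividing each bin count by the bin size in kb would give a per-kb density; the sketch keeps raw counts for simplicity.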
Perl API scripts of VCFtools (v0.1.11) were used to mine the data of the
PJL sub-population
CONCLUSION
Genetic variant data of the PJL sub-population showed that adaptation has been
frequent in our evolutionary history.
More focus is needed on chr4 and chr22 of the PJL data, as these two
chromosomes have the most distinctive patterns.
PCA provides simple models representing the comparative behavior at the
population level.
Using the 1000 Genomes Project data, a more comprehensive genetic variation
map of PJL will be produced to support the study of evolutionary pressure on the PJL genome.
REFERENCES
1. http://www.1000genomes.org/
2. The 1000 Genomes Project Consortium, An integrated map of genetic
variation from 1,092 human genomes, Nature 491, 2012, 56–65,
doi:10.1038/nature11632.
Analysis of Genomic Variants of Pakistani Sub-Population Sequenced by the 1000 Genomes Project
Waqasuddin Khan,* Ishtiaq A. Khan, and M. Kamran Azim
Jamil-ur-Rahman Center for Genome Research, Dr. Panjwani Center for Molecular Medicine and Drug Research, International Center for Chemical and Biological Sciences,
University of Karachi, Karachi-75270, Pakistan.
96 PJL sample (individual) IDs were extracted manually from the
1000 Genomes Project’s panel file
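The manual extraction of sample IDs could also be scripted. The sketch below is a hypothetical Python alternative, assuming the panel file is tab-separated with a header row containing `sample` and `pop` columns. The column names, the function name and the demo IDs are assumptions and should be checked against the actual panel file.

```python
import csv
import io

def samples_for_population(panel_text, population="PJL"):
    """Return the sample IDs whose population code matches `population`.
    Assumes a tab-separated panel with 'sample' and 'pop' header columns."""
    reader = csv.DictReader(io.StringIO(panel_text), delimiter="\t")
    return [row["sample"] for row in reader if row["pop"] == population]

# Made-up demo panel (IDs are placeholders, not real sample IDs).
demo_panel = (
    "sample\tpop\tsuper_pop\tgender\n"
    "ID001\tPJL\tSAS\tmale\n"
    "ID002\tGBR\tEUR\tfemale\n"
    "ID003\tPJL\tSAS\tfemale\n"
)
pjl_ids = samples_for_population(demo_panel)
```

Writing `pjl_ids` one per line to a text file would give a sample list in the form that column-subsetting tools typically accept.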
PCA was performed by R statistical package (v3.1.2)
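For readers unfamiliar with PCA, the following is a minimal sketch in Python rather than the R workflow used in this study. It extracts only the first principal component by power iteration on the sample covariance matrix. The layout (rows as chromosomes, columns as SNP-density variables), the function name and the demo values are assumptions for illustration.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (direction, scores) of the first principal component.
    rows: list of equal-length numeric lists (observations x variables)."""
    n = len(rows)
    d = len(rows[0])
    # Center each variable on its mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeated multiplication converges to the leading
    # eigenvector, provided the start vector is not orthogonal to it.
    vec = [1.0] * d
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        vec = [x / norm for x in nxt]
    # Project the centered observations onto the component.
    scores = [sum(c[j] * vec[j] for j in range(d)) for c in centered]
    return vec, scores

# Made-up demo data lying exactly on the line y = 2x.
demo_rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
direction, scores = first_principal_component(demo_rows)
```

For the demo data, the recovered direction is proportional to (1, 2), as expected for points on the line y = 2x. A full analysis would use an established PCA routine that returns all components and their explained variance.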
RESULTS AND DISCUSSION
Fig. 1. SNP and InDel counts.
Fig. 2. InDel frequency as calculated by BCFtools.
Fig. 3. Substitution types as calculated by BCFtools.
Fig. 4. Counts of Ts and Tv and their ratios as calculated by BCFtools.
Fig. 5. For adjusting SNP ratios on the scale of 0-1, corrected SNP counts were calculated by the following formula:
$$\text{Corrected SNP Count} = \frac{\text{Ratio of Total SNP Count} - \text{Minimum SNP Count}}{\text{Maximum SNP Count} - \text{Minimum SNP Count}}$$
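The rescaling used for Fig. 5 has the form of a min-max normalization. A minimal Python sketch, with a hypothetical function name and made-up values, shows how it maps the smallest value to 0 and the largest to 1:

```python
def min_max_scale(values):
    """Rescale values linearly so that min -> 0.0 and max -> 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: the formula would divide by zero,
        # so return zeros by convention (an assumption of this sketch).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Made-up demo values, not data from this study.
demo_counts = [10, 20, 30, 50]
scaled = min_max_scale(demo_counts)
```

Here the value 20 becomes (20 - 10) / (50 - 10) = 0.25.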
Fig. 6. Heat map of corrected SNP counts by R function. Heat map with colors scaled according to the SNP densities (Transformed SNP densities: from orange to light yellow region; from low Z-scores to high Z-score values). Each column represents the chromosomes labelled on the vertical axis (right), and each row shows the SNP densities labelled on the horizontal axis (bottom) of the heat map. The dendrogram obtained with the hierarchical cluster analysis is displayed on the left. Clustering of chromosomes is achieved on the basis of SNP densities.
Fig. 7. Exploratory multivariate analysis of SNP densities with an R package. PCA of (A) the PJL sub-population, and (B) the 1000 Genomes Project. Both quantitative and qualitative variables, along with supplementary variables and observations, were included in the analysis. In (A), chr4 (red circle) has the most dimensionality in terms of SNP densities.
Table 1. Individuals of the PJL sub-population as reported by the 1000 Genomes Project
Category                               Count
Father                                 52
Mother                                 52
Child                                  54
Unrelated                              1
Total Individuals                      159

Table 2. Classification of families on the basis of the individuals selected, as reported by the 1000 Genomes Project
Family composition                     Count
Families Having Father Only            3
Families Having Mother Only            3
Families Having Child Only             3
Families Having Father/Child           4
Families Having Mother/Child           4
Families Having Father/Mother          3
Families Having Father/Mother/Child    42
Total Families                         63