2015
WHOLE-GENOME GENETIC DIVERSITY AND FUNCTIONAL CLASSIFICATION OF VARIATIONS OF
A PAKISTANI INDIVIDUAL
MUHAMMAD ILYAS
National Centre of Excellence in Molecular Biology
UNIVERSITY OF THE PUNJAB, LAHORE PAKISTAN
2015
Whole-Genome Genetic Diversity and Functional Classification of Variations of a Pakistani Individual
A THESIS SUBMITTED TO
UNIVERSITY OF THE PUNJAB
In Partial Fulfillment of the Requirement for the Degree of
DOCTORATE OF PHILOSOPHY
in MOLECULAR BIOLOGY
(Human Genomics, Bioinformatics)
Submitted by MUHAMMAD ILYAS
Supervisors DR. ZIAUR RAHMAN
PROF. DR. JONG BHAK
National Centre of Excellence in Molecular Biology
University of the Punjab, Lahore, Pakistan
“IN THE NAME OF ALLAH, THE MOST BENEFICENT, THE MOST MERCIFUL”
DEDICATED TO
MY MOTHER AND FATHER
WHOSE AFFECTION, LOVE, ENCOURAGEMENT AND PRAYERS,
DAY AND NIGHT, ENABLED ME TO ACHIEVE SUCH SUCCESS AND
HONOR,
ALONG WITH ALL MY HARD-WORKING AND RESPECTED TEACHERS
CERTIFICATE
This is to certify that the experimental work described in the thesis submitted by
MUHAMMAD ILYAS has been carried out under my direct supervision. Data/results reported
in this manuscript are duly recorded in the Centre’s official note book(s). I have personally gone
through the raw data and certify the authenticity of all the results reported herein. I further certify
that these data have not been used in part or full, in a manuscript already submitted or in the
process of submission in partial/complete fulfillment of the award of any other degree from any
other institution at home or abroad. I also certify that the enclosed manuscript has been prepared
under my supervision, and I endorse its evaluation for the award of the PhD degree through the
official procedures of the Centre/University.
In accordance with the rules of the Centre, data book No. 1078 is declared an
unexpendable document that will be kept in the registry of the Centre for a minimum of three
years from the date of the thesis defense examination.
Signature of Supervisor ___________________________
Name of Supervisor: Dr. Ziaur Rahman
Signature of Co-Supervisor: ________________________
Name of Co-Supervisor: Prof. Dr. Jong Bhak
SUMMARY
Pakistan covers a key geographic area in human history, being both part of the Indus
River region that acted as one of the cradles of civilization and as a link between Western
Eurasia and Eastern Asia. This region is inhabited by a number of distinct ethnic groups, the
largest being the Punjabi, Pathan (Pakhtun), Sindhi, and Baloch. We analyzed the first male
Pakistani genome (PTN) from the north-west province of Pakistan, by sequencing it to 29.7-fold
coverage using the Illumina HiSeq2000 platform. A total of 3.8 million single nucleotide
variations (SNVs) and 0.5 million small indels were identified by comparing with the human
reference genome. Among the SNVs, 129,441 were novel, and 10,315 nonsynonymous SNVs
were found in 5,344 genes. SNVs were annotated for health consequences and high-risk diseases,
as well as for possible influences on drug efficacy. We confirmed that the PTN genome presented
here is representative of the Pathan/Pakhtun ethnic group by comparing it to a panel of Central
Asians from the HGDP-CEPH panels typed for ~650k SNPs. The mtDNA (C4a1a1) and Y
haplogroup (L1) of this individual were also typical of his geographic region of origin. The
demographic history reconstructed with PSMC highlights a recent increase in effective
population size, compatible with the admixture between European and Asian lineages expected in
this geographic region. The genome is a useful resource for understanding genetic variation and human
migration across the whole Asian continent. Finally, we conclude that modern
Pathans/Pakhtuns are an admixture of European and Asian lineages, which distinguishes them from
other world populations. Their genetic makeup will help in discovering rare variants and
facilitate the development of personalized medicine.
ACKNOWLEDGEMENTS
At the outset, I bow my head to the Omnipotent, the Most Merciful, the Compassionate
and the Omniscient Almighty Allah, who showered upon me all HIS blessings throughout my
life and who gave me the strength to complete this research work.
I wish to acknowledge the remarkable contribution of Prof. Dr. Sheikh Riazuddin (S.I.,
T.I., HI) founder and Ex-Director, and Prof. Dr. Tayyab Husnain (I.F., T.I.) Director, Centre of
Excellence in Molecular Biology in the establishment and strengthening of the prestigious
institute CEMB, where I began learning research and science.
I am also grateful to my supervisor Dr. Ziaur Rahman for his guidance, energy, time
and other forms of contribution. I am deeply grateful to him for his confidence in me, for being a
constant source of inspiration and for always supporting me in pursuing my own ideas, and I am
most indebted to him for the extremely friendly atmosphere on both a professional and a personal level.
Without his support my research work would have been impossible.
Foremost, I would like to express my deep and sincere gratitude to my co-supervisor,
Prof. Dr. Jong Bhak, Director and CEO of the Personal Genomics Institute, Genome Research
Foundation, South Korea. His vision, patience and motivation at every step of the study made it
possible for me to work in this exciting and emerging field of research. His encouragement and
support helped me understand and carry out my research project in South Korea. The positive
atmosphere and excellent working facilities in his laboratory deepened my devotion to learning and
knowledge. Colleagues at Jong’s lab (JongSo Kim, Yunsung Cho, Hakmin, Jesse Cooper and
Jaewoo Moon) helped me complete this research successfully.
I am indebted to the members of the Tonellato lab at Harvard Medical School for
providing a stimulating environment for intellectual development and research. From the day I
joined the group, Prof. Dr. Peter J Tonellato played a crucial role in getting me up to speed
with biomedical informatics and personalized medicine. I constantly benefited from his
continuous support and guidance throughout my work. Informal discussions with Michiyo
Yamada, Sheida Nabavi, Latrice Landry and Yassine Souilmi were crucial for the success of my
research project. My whole stay at Harvard has been a rewarding and most agreeable experience,
and Boston is one of the most enjoyable cities I have lived in.
I wish to express my deepest gratitude to my senior colleagues at CEMB: Khalid
Masood, who always motivated me to do my best in bioinformatics, Sobia Ahsan Halim,
Muhammad Israr, Aneela Yasmin and Shahid ur Rahman, for their helpful guidance. I also
acknowledge my lab members Atif Anwar Mirza and Zulfiqar Ali Mir for their kind cooperation
during my PhD.
I would also like to express my indebtedness to Prof. Dr. Andrea Manica (Cambridge
University, UK), Prof. Dr. Qasim Ayub (Wellcome Trust Sanger Institute, UK), Prof. Dr.
Sultan-e-Rome (Government Jehanzeb College Swat), Khwaja Aftab Ahmad (Swat) and Dr. Muhammad
Fahim (IBGE Peshawar), who encouraged me by showing interest in my work. They generously
provided reading material and shared their knowledge with me. Special thanks are due
to Prof. Dr. Habib Ahmad and Prof. Dr. Mukhtar Alam for their kind words and continuous
guidance.
I additionally appreciate the support of my friends Ziaur Rahman, Sulaiman Shams,
Imtiaz Ali, Sahib Zar and Inamullah. Their endless help and support allowed me to overcome all
of the difficult times.
Finally, I thank those who are dearest to me, who have loved me unconditionally and
stood by me during times of confusion and frustration: my mother and father, my brother
Muhammad Abbas, my sisters and my loving wife, who helped me get through some of the
most difficult challenges that I have faced to date. I thank her for her patience and understanding
over the past few years. Last but not least, I am grateful to the rest of my family for their
endless love, support and encouragement throughout my entire academic career. My family has
been far away from me these years, but they were closer than ever in my mind and heart.
I would like to acknowledge the financial support of the Genome Research Foundation while I
was working in South Korea. I thank the Higher Education Commission of Pakistan for providing
me with the fellowship that allowed me to obtain advanced training in personalized genomics and
biomedical informatics at Harvard University, Boston, USA.
Many people, especially my classmates and team members, have made valuable
comments and suggestions on this project, which inspired me to improve my research. I
thank everyone who helped, directly or indirectly, to complete this dissertation.
Muhammad Ilyas
Lahore, 2015
LIST OF ABBREVIATIONS
BAC: Bacterial Artificial Chromosome
BGI: Beijing Genomics Institute
ddNTP: dideoxynucleotide triphosphate
dNTP: deoxynucleotide triphosphate
EST: Expressed Sequence Tag
FISH: Fluorescence in situ Hybridization
GWAS: Genome-Wide Association Study
NGS: Next Generation Sequencing
qPCR: quantitative PCR
SNP: Single Nucleotide Polymorphism
TGS: Third Generation Sequencing
WGA: Whole Genome Amplification
KPGP: Korean Personal Genomes Project
1KGP: 1000 Genomes Project
SNV: Single Nucleotide Variant
CAMDA: Critical Assessment of Massive Data Analysis
CDS: Coding DNA Sequence
UTR: Untranslated Region
NMD: Nonsense-Mediated Decay
PTN: Pathan Genome
PK1: Pakistani Genome (Sindhi)
SJK: First Korean Genome
PGP: Personal Genome Project
TABLE OF CONTENTS
SUMMARY I
ACKNOWLEDGEMENT II
LIST OF ABBREVIATIONS V
LIST OF FIGURES IX
LIST OF TABLES X
CHAPTER 1
1. INTRODUCTION 1
CHAPTER 2
2. LITERATURE REVIEW 7
2.1. SEQUENCING TECHNIQUES 11
2.1.1. HIGH-THROUGHPUT SEQUENCING 11
2.1.2. DE NOVO SEQUENCING 12
2.1.3. RE-SEQUENCING 13
2.1.4. EXOME SEQUENCING 13
2.2. HIGH THROUGHPUT SEQUENCING PLATFORMS 14
2.2.1. ROCHE 454 SYSTEM: PYROSEQUENCING 14
2.2.2. AB SOLID SYSTEM: SEQUENCING BY LIGATION 16
2.2.3. ILLUMINA/SOLEXA SYSTEM: SEQUENCING WITH REVERSIBLE TERMINATORS 17
2.2.4. ION TORRENT: SEMICONDUCTOR SEQUENCING 19
2.2.5. THE THIRD GENERATION SEQUENCER 19
2.3. GENETIC VARIANTS IN HUMAN GENOME 20
2.3.1. SINGLE NUCLEOTIDE VARIANTS/POLYMORPHISMS 21
2.3.2. STRUCTURAL VARIATIONS 22
2.3.3. COPY NUMBER VARIATIONS 22
2.3.4. LINEAGE MARKERS FOR POPULATION STUDY 23
2.3.5. VARIABLE NUMBER TANDEM REPEATS 24
2.3.6. SHORT TANDEM REPEATS (STRS) 24
2.4. APPLICATIONS OF GENOME VARIANTS 25
2.4.1. GENETIC ANCESTRY AND ADMIXTURE MAPPING 26
2.4.2. MEDICAL AND CLINICAL IMPLICATIONS 26
2.4.3. PHARMACOGENOMICS 28
2.5. PERSONAL AND POPULATION GENOME PROJECTS 30
2.5.1. PERSONAL GENOME PROJECT (PGP) 30
2.5.2 1000 GENOMES PROJECT (1KGP) 31
2.5.3 PAN-ASIAN POPULATION GENOMICS INITIATIVE (PAPGI) 31
2.5.4 ONE MILLION GENOMES 31
2.5.5 HUMAN GENOME DIVERSITY PROJECT (HGDP) 32
2.5.6 BILLION GENOMES PROJECT 32
2.5.7 OTHER GENOME CONSORTIUMS 32
CHAPTER 3
3. MATERIALS AND METHODS 33
3.1. SUBJECT SELECTION AND ETHICAL STATEMENT 33
3.2. DATA SOURCES 34
3.3. DNA EXTRACTION 34
3.4. CYTOGENETIC ANALYSIS 35
3.5. LIBRARY PREPARATION AND WHOLE GENOME SEQUENCING 35
3.6. WORKFLOW FOR GENOMIC DATA ANALYSIS 37
3.7. SEQUENCE ALIGNMENT 39
3.8. SNP AND INDEL DETECTION 40
3.9. COPY NUMBER VARIATION DETECTION 40
3.10. FUNCTIONAL ANNOTATION 41
3.11. PHARMACOGENOMICS ANALYSIS 43
3.12. MULTIDIMENSIONAL SCALING AND ADMIXTURE 43
3.13. PAIRWISE SEQUENTIALLY MARKOVIAN COALESCENT ANALYSIS 44
3.14. PHYLOGENOMIC ANALYSIS 45
CHAPTER 4
4. RESULTS 46
4.1. GENOME SEQUENCING AND VARIANTS IDENTIFICATION 46
4.2. FUNCTIONAL CLASSIFICATION AND CLINICAL RELEVANCE OF VARIANTS 49
4.3. PHARMACOGENOMICS ANALYSIS 52
4.4. COMPARISON OF PTN GENOME TO WORLDWIDE POPULATIONS 58
4.5. COMPARISON WITH OTHER PAKISTANI INDIVIDUALS 61
4.6. DEMOGRAPHIC HISTORY ANALYSIS 64
4.7. MTDNA AND Y-CHROMOSOME ANALYSES 65
4.8. PHYLOGENOMIC ANALYSIS 65
CHAPTER 5
5. DISCUSSION 67
5.1. CLINICAL RELEVANCE AND VARIANT CHARACTERIZATION 68
5.2. PHARMACOGENOMIC PROFILE 71
5.3. GENEALOGICAL AND ADMIXTURE ANALYSIS 72
5.4. DEMOGRAPHIC HISTORY ANALYSIS AND ANCESTRAL POPULATION SIZE 73
5.5. CONCLUSION 74
CHAPTER 6
6. REFERENCES 75
LIST OF PUBLICATIONS 92
APPENDIX-I WEBSITE USED 93
APPENDIX-II IRB APPROVAL 94
LIST OF FIGURES
FIGURE 2.1: THE DROP IN THE COST OF SEQUENCING A COMPLETE HUMAN GENOME. 8
FIGURE 2.2: THE PYROSEQUENCING PROCESS. 16
FIGURE 2.3: APPLIED BIOSYSTEM’S SOLID SEQUENCING BY LIGATION. 17
FIGURE 2.4: REVERSIBLE TERMINATOR CHEMISTRY UTILIZES IN THE ILLUMINA PLATFORMS. 18
FIGURE 3.1: FAMILY PEDIGREE OF DONOR WITH MEMBERS HAVING GENETIC DISORDERS. 33
FIGURE 3.2: CYTOGENETIC ANALYSIS THROUGH GTG BANDING KARYOTYPE AND LEGENDS. 35
FIGURE 3.3: ILLUMINA HISEQ2000 MACHINE AND ACCESSORIES. 36
FIGURE 3.4: LIBRARY QUALITY GENERATED BY BIOANALYZER. 37
FIGURE 3.5: WORKFLOW OF THE NEXT GENERATION SEQUENCING AND BIOINFORMATICS DATA ANALYSIS. 38
FIGURE 3.6: SCHEMATICS REPRESENTATION OF THE PIPELINE DEVELOPED. 42
FIGURE 3.7: SCHEMA OF THE PHARMACOGENOMICS ANALYSIS. 43
FIGURE 4.1: NOVEL SNVS IN PERSONAL GENOMES IN THIRTEEN DIFFERENT ETHNIC GROUPS. 48
FIGURE 4.2: COPY NUMBER VARIATIONS COUNTS DISTRIBUTED IN EACH CHROMOSOME. 48
FIGURE 4.3: COMPARATIVE VARIANT COUNT OF OTHER REPORTED INDIVIDUAL GENOMES WITH PAKISTANI (PTN) GENOME. 50
FIGURE 4.4: MULTIDIMENSIONAL SCALING (MDS) PLOT GENERATED BY PLINK. 59
FIGURE 4.5: ADMIXTURE RESULTS FOR K = 2 AND K = 3 FOR THE PTN INDIVIDUAL. 60
FIGURE 4.6: CHROMOSOME PAINTING OF POSSIBLE GENOMIC ADMIXTURE. 61
FIGURE 4.7: ADMIXTURE RESULTS OF PAKISTANI PATHAN (PTN) INDIVIDUAL TO OTHER ETHNIC GROUPS IN SOUTH ASIA. 62
FIGURE 4.8: RELATIONSHIP OF PAKISTANI PATHAN INDIVIDUAL TO OTHER ETHNIC GROUPS IN SOUTH ASIA. 63
FIGURE 4.9: PAIRWISE SEQUENTIALLY MARKOVIAN COALESCENT (PSMC) MODEL FOR RECONSTRUCTING PAKISTAN’S DEMOGRAPHIC HISTORY. 64
FIGURE 4.10: PHYLOGENOMIC TREE OF PAKISTANI PTN GENOME WITH OTHER WORLD ETHNIC GENOMES. 66
LIST OF TABLES
TABLE 4.1. SUMMARY OF DATA PRODUCTION AND MAPPING RESULTS 46
TABLE 4.2. SUMMARY OF SNVS FOUND IN PATHAN’S GENOME AND OVERLAPS WITH DBSNP137 47
TABLE 4.3. VARIANTS (SNVS, INDELS AND CNVRS) IDENTIFIED IN PAKISTANI (PTN) GENOME 47
TABLE 4.4. FUNCTIONALLY DAMAGED NOVEL NSSNVS. 51
TABLE 4.5. CLINICAL RELEVANCE CODING SNVS IN PAKISTANI PTN WHOLE GENOME. 53
TABLE 4.6. DAMAGED NSSNVS AND THE DRUGS. 54
TABLE 4.7. LIST OF DRUGS (PHARMGKB) IN THE PTN GENOME. VIP 57
CHAPTER 1
1. Introduction
Next generation sequencing (NGS) technology is among the most exciting recent scientific
achievements for the research community. It refers to a set of new DNA sequencing
procedures that greatly increase sequencing capacity by running massively
parallel reactions on millions of genomic fragments (Mardis, 2008). The cost of sequencing
comparatively short DNA fragments is now at least two orders of magnitude lower than with the
conventional Sanger procedure (Hudson, 2008). Costs have plummeted in recent years, rapidly outpacing
Moore’s law, the traditional benchmark for the declining cost of technology
(Mayer, 2006). Many techniques, including new chemistries, amplification methods and
efficient high-resolution microscopy, were reworked to make this development possible
(Park, 2008).
Genome-wide studies using microarray technology have driven important
developments for many years (Kelly et al., 2013). Microarray chip technologies
were initially used for gene expression analysis, but they later found extensive use in estimating copy
number alterations, microRNA studies, genotyping single nucleotide variants and mapping
the binding sites for protein-protein and DNA-protein interactions (Mardis, 2008). However,
NGS technology offers important advances and is likely to replace many of the
microchip platforms in the near future (Elingarami et al., 2013). Sequencing equipment is
still unavailable or unaffordable for many researchers at this time,
but competing market forces are expected to bring prices down substantially in the coming years.
A new era of personalized genomics has begun with the advances in
sequencing technologies. To date, genome sequences of many individuals from distinct regions
have been reported. Venter was the first to sequence his personal genome using the Sanger
dideoxy method, which is still the method of choice for de novo sequencing because of its high per-base
accuracy (99.9%) over long reads of almost 1000 bp (Levy et al., 2007). With the Sanger method,
diploid sequences were assembled with phase information, which has not been done for other
published genomes (Bentley et al., 2008, Kitzman et al., 2010, Pushkarev et al., 2009). Despite
its limited read length, which is extremely important for the assembly of contigs and final
genomes, it is NGS technology that has made personal genomics possible by dramatically
reducing cost and increasing efficiency. To date, more than ten individual genome
sequences analyzed by NGS have been published, including two individuals of northwest
European origin (Levy et al., 2007, Wheeler et al., 2008), a Yoruba (Bentley et al., 2008), an
Indian Gujarati (Kitzman et al., 2010) as well as an Indian female and a male (Gupta et al.,
2012, Patowary et al., 2012), a person from China (Wang et al., 2008), Korean individuals (Kim
et al., 2009, Ahn et al., 2009), an Aboriginal Australian (Rasmussen et al., 2011), a Japanese
(Fujimoto et al., 2010), a Pakistani (Azim et al., 2013), a Sri Lankan (Dissanayake et al., 2011) and
a Turkish individual (Dogan et al., 2014). NGS allows researchers to map short reads to a
known reference genome and thereby avoid expensive and laborious long-fragment-based de
novo assembly (Metzker, 2010). However, as the large percentage of unmapped data in
previous human genome re-sequencing projects shows, a re-sequenced genome may not fully
reflect ethnic and individual genetic differences, because its assembly depends on the
previously sequenced reference genome. Since the introduction of NGS, the bottleneck in
sequencing the genomes of a whole population is no longer the sequencing process itself but the
bioinformatics: fast and accurate mapping to the available data, structural variation
analysis, phylogenetic analysis, association studies, and application to phenotypes such as
diseases (Ahn et al., 2009).
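The re-sequencing idea described above, mapping short reads to a known reference and reading variants off the mismatches, can be illustrated with a toy sketch. The Python below is not the pipeline used in this thesis: the reference string, the reads, the function names (`map_read`, `call_snvs`) and the thresholds are all hypothetical, and a brute-force Hamming-distance search stands in for a real aligner. It also shows how a read that differs too much from the reference is simply left unmapped.

```python
from collections import Counter

# Hypothetical toy data (assumptions for illustration only).
REFERENCE = "ACGTTAGCCATGGATCCGTA"
READS = [
    "ACGTTAGC",  # matches the reference exactly at position 0
    "TAGCTATG",  # carries a C>T difference at reference position 8
    "GCTATGGA",  # second read supporting the same C>T difference
    "GGGGGGGG",  # too divergent: stays unmapped
]


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def map_read(read, reference, max_mismatches=1):
    """Return the first 0-based position with the fewest mismatches,
    or None if every position exceeds max_mismatches."""
    best_pos, best_dist = None, max_mismatches + 1
    for pos in range(len(reference) - len(read) + 1):
        dist = hamming(read, reference[pos:pos + len(read)])
        if dist < best_dist:
            best_pos, best_dist = pos, dist
    return best_pos


def call_snvs(reads, reference, min_depth=2):
    """Pile up mapped reads and report positions whose most frequent
    base differs from the reference, given at least min_depth reads."""
    pileup = [Counter() for _ in reference]
    unmapped = 0
    for read in reads:
        pos = map_read(read, reference)
        if pos is None:
            unmapped += 1
            continue
        for offset, base in enumerate(read):
            pileup[pos + offset][base] += 1

    snvs = []
    for i, counts in enumerate(pileup):
        if sum(counts.values()) < min_depth:
            continue
        base, _ = counts.most_common(1)[0]
        if base != reference[i]:
            snvs.append((i, reference[i], base))
    return snvs, unmapped
```

Calling `call_snvs(READS, REFERENCE)` returns the list of (position, reference base, observed base) tuples together with the number of unmapped reads. The unmapped count is the toy analogue of the unmapped data mentioned above: sequence that is too different from the reference never contributes to the variant calls.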
Sequencing technology is improving fast, and its cost has dropped drastically (Lander et
al. 2001). Thanks to these advances, knowledge of human genetic diversity and population
history has greatly expanded (Veeramah and Hammer 2014), enabling us to investigate variants
with health consequences and paving the way to personalized medicine (Feero and Guttmacher,
2014). Genome-wide association studies (GWAS) have characterized the function of thousands of
common SNVs, but millions of variants remain unexplored (Sebastiani et al. 2009).
Therefore, whole genome sequencing is necessary for a detailed study of rare genomic variants.
A number of international consortia have started sequencing the whole genomes of large panels,
including the 1000 Genomes Project, which covers populations from Nigeria, Japan, China,
Europe, Kenya, Italy, Peru, India and the United States (www.1000genomes.org), the PGP consortium
(www.personalgenomes.org), the Simons Genome Diversity Project (www.simonsfoundation.org),
which consists of data from 260 genomes from 127 populations (Africans, Native Americans,
Central Asians or Siberians, East Asians, Oceanians, South Asians and West Eurasians), the Korean
Personal Genomes Project (kpgp.kr), Complete Genomics (www.completegenomics.com), the
Iranian Genome Project (www.irangenes.com/) and the 100 Malay genomes (Wong et al. 2013).
These consortia, as well as several geographically more restricted projects, aim to understand
the functional aspects of both common and unique variants in humans. Genetic variants are the
genetic differences between individuals or populations; on average, any human is
99.9% similar to any other human (Collins and Mansoura, 2001). Even
identical twins, who develop from a single zygote, are not genetically identical: they carry genetic
variations caused by mutations that occur during development (Patwari and Lee, 2008). This
information makes each person unique. Studying genetic variation, also known as
variomics, has important applications in ancestry and clinical studies. Researchers use these
variants to understand ancient humans, their migrations and the admixed genetic structure
that relates them to other diverse populations in the world (Schork et al., 2009). Some
disease-associated variants occur more frequently in individuals from certain
geographic regions. Researchers around the globe are searching for such rare and common
variants to solve the mystery of different diseases (Bodmer and Bonilla, 2008). Genetic
variations range from point mutations, e.g. SNPs, to large
alterations, e.g. CNVs. A SNP is a single-nucleotide difference between members of a
species that occurs in at least 1% of the population (Collins et al., 1998). Approximately 30
million polymorphic positions have been reported in humans so far. A CNV is a large
chromosomal region that is deleted or duplicated; CNVs have also been reported to
have strong associations with diseases such as cancer, autism and other neurological disorders
(Rendon et al., 2006).
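As an illustration of the 1% frequency convention and the 99.9% similarity figure, the following Python sketch classifies a site by its minor allele frequency and estimates how many positions differ between two genomes. The allele counts, the function names and the genome length of 3 billion bp are assumptions made for the example, not values from this study.

```python
def minor_allele_frequency(allele_counts):
    """Frequency of the second most common allele at a site.

    allele_counts maps an allele (e.g. "A") to the number of
    chromosomes carrying it in the sampled population.
    """
    total = sum(allele_counts.values())
    if total == 0:
        raise ValueError("no alleles observed at this site")
    ordered = sorted(allele_counts.values(), reverse=True)
    if len(ordered) < 2:
        return 0.0
    return ordered[1] / total


def classify_site(allele_counts, threshold=0.01):
    """Label a site using the conventional 1% frequency cut-off."""
    maf = minor_allele_frequency(allele_counts)
    if maf == 0.0:
        return "monomorphic"
    if maf >= threshold:
        return "polymorphism (SNP)"
    return "rare variant"


def expected_differences(genome_length_bp, identity):
    """Approximate number of differing positions between two genomes
    that are `identity` (e.g. 0.999) similar."""
    return round(genome_length_bp * (1.0 - identity))


# Hypothetical counts from 1,000 sampled chromosomes.
common_site = {"A": 990, "G": 10}   # minor allele at 1%
rare_site = {"C": 999, "T": 1}      # minor allele at 0.1%

# Assumed genome length of about 3 billion bp (not a value from the text).
ASSUMED_GENOME_LENGTH_BP = 3_000_000_000
```

Under these assumptions, `classify_site(common_site)` labels the first site a polymorphism and `classify_site(rare_site)` labels the second a rare variant, while `expected_differences(ASSUMED_GENOME_LENGTH_BP, 0.999)` gives roughly three million differing positions between two individuals. Rare variants of the second kind are the ones that array-based genotyping tends to miss and that whole genome sequencing can recover.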
Besides their value for biomedicine, individual genome sequences are a rich source of
information about human evolution (Sankararaman et al., 2014). Human DNA can help us
explore the history and peopling of a region, and various groups have undertaken studies
in this regard (Do et al., 2015). It has previously been reported that minor contributions from
Iranian, Arab, Turkish and Greek populations are present in the people living in the northwest province of
Pakistan (Firasat et al., 2007). The claim is mainly based on the Greek invasion of the Indian
sub-continent by Alexander the Great in 327-323 BC and the subsequent stay of Greek soldiers
in the area (Mansoor et al., 2001). Other historians mention that, when Afghanistan and
present-day Pakistan were the eastern provinces of Xerxes’s Persian kingdom, Greek slaves
were brought to and kept in this region about 150 years before Alexander’s
arrival (Wood, 2001).
Pakistan lies at an important junction between the Indian sub-continent to the east and
the Central Asian states to the west, with China to the north. Owing to its particular
geography, climate and socio-religio-cultural history, a number of ethnic and linguistic groups
such as the Punjabi, Pathan, Sindhi and Baloch live in the country (Bolstad, 2010). A number of these
groups have been included in genetic panels typed for uniparental microsatellites and SNPs
(Cavalli-Sforza 2005). The Human Genome Diversity Panel included 190 individuals belonging to
eight different ethnic groups from Pakistan, who were typed for ~650K SNPs. This left
much genetic information unexplored, so whole genome sequencing was needed to
uncover the hidden information in the genetic makeup of Pakistani populations. Until now, only
one male Pakistani individual, of Sindhi ethnic origin, has been sequenced (Azim et al.
2013).
Here we report the whole genome sequence of an individual from Khyber Pakhtunkhwa,
the north-west province of Pakistan. The genome was aligned to the human reference genome, which is
a composite of several ethnic populations. We identified a number of variants, including SNPs,
indels and CNVs, in this northwestern Pakistani genome. Traditional methods were used to obtain
highly reliable variants for medical considerations. ns-SNPs, exonic indels and copy number
alterations were screened for potential clinical phenotypes. Several other complete genome
sequences reported from different ethnic populations were used to understand the genetic
ancestry, migration patterns and population bottlenecks of the Pakistani population. Variants were
then annotated and scanned for associated functions, along with SNVs that could modulate drug
response. Possibly deleterious non-synonymous SNVs (nsSNVs) were investigated for potential
effects on the pharmacokinetics and pharmacodynamics of drugs. Additionally, multiple
analytical approaches were used to assess the influence of ancestral contributions within the
Pakistani genome. This genome is a useful resource for understanding genetic variation and human migration
across the whole of Asia. The genetic data and variant functions for the Pakistani individual
genome (PTN) will provide an important public resource that will be helpful for clinical
genetics research and diagnostics.
CHAPTER 2
2. Literature Review
Genomics is the field of biological science that applies recombinant DNA, DNA
sequencing methods and computational analysis to the structure and function of the genome,
the entire set of DNA within a single cell of an organism (Bild et al.,
2014). Developments in genomics have enabled revolutionary research into
even the most complex biological systems, such as the brain (Biswal et al., 2010).
Intragenomic phenomena such as heterosis, epistasis and pleiotropy are also included in this field of
biology (Ragunath et al., 2014). In contrast, the search for the function and role of single genes is
the primary focus of molecular genetics and is a common area of interest in modern
medical and biological research (Carroll 2003).
The draft human genome sequence was produced through the collaboration of many
international institutes (Collins et al., 2003). They presented the primary analysis of their
data, showing features that can be observed by analyzing a sequence
(Sachidanandam et al., 2001). A draft chimpanzee genome sequence was presented and
compared with the human genome, marking the differences between the two (Prüfer
2012). The population genetics and phylogenetic relationships of humans were also investigated using
the chimpanzee genome (The Chimpanzee Sequencing and Analysis Consortium 2005).
HapMap-III helped characterize 3.1 million SNVs in 270 human individuals from 4 diverse
populations of different geographical origin (Pemberton et al., 2010). The regions shared
among different populations were also defined (Li et al., 2008). An accurate, economical and
rapid approach for studying intra-species genetic variation has been described (Bentley et al., 2008).
A low-cost method based on reversible terminator chemistry was used to decode the genome
of a Yoruba male from Ibadan, Nigeria. Four million single nucleotide polymorphisms were
characterized along with 400,000 structural variants (Figure 2.1) (Manolio and Collins 2009).
Figure 2.1: The drop in the cost of sequencing a complete human genome using next generation sequencing
technologies (http://www.meragenome.com).
A snapshot of next generation sequencing approaches for understanding the
properties and functions of a genome has been provided (Marguerat et al., 2008). Microarray-based assays are
being superseded by sequencing-based assays, and the data obtained from these distinct approaches have been
compared and contrasted (Laird 2010). The first Asian individual genome was sequenced using
massively parallel sequencing technology (Wang et al., 2008). Three million single nucleotide
polymorphisms were identified in this genome with high accuracy and consistency (Li et al.,
2009). These results demonstrated the potential importance of high-throughput sequencing technology
for individual genomics (Wang et al., 2008). The individual genome of James Watson
was sequenced within a couple of months by massively parallel sequencing in picolitre-size
reaction vessels (Wadman 2008). It was the first genome sequenced via NGS, which
made it possible to obtain a personal genome sequence in a very short time (Wheeler et al., 2008). A
single-molecule method was reported for sequencing an individual human genome
(Pushkarev et al., 2009). The genome of an anonymous African individual was
sequenced using a ligation-based sequencing assay (McKernan et al., 2009). This method was
used because it improves the accuracy of results through a unique error correction method
(Zhang et al., 2011). The genome of the first male Korean individual was sequenced using Illumina
paired-end sequencing (Ahn et al., 2009). The results were analyzed and
compared with the Chinese genome (YH), the only Asian genome then available, to identify significant
differences between the genomes of closely related ethnic groups (Li et al., 2009). A combined
approach was used to decode the other Korean genome, AK1. The approach included
whole genome shotgun sequencing, targeted BAC sequencing and high-resolution
comparative genomic hybridization using traditional microarrays (Kim et al., 2009).
A genome sequencer with efficient imaging and low reagent consumption was
developed, which used combinatorial probe-anchor ligation (cPAL) chemistry and assayed each base
from self-assembling DNA nanoballs or patterned nanoarrays (Drmanac et al., 2010). Researchers
used this technology to sequence three human genomes because of its high accuracy and the
affordable cost of sequencing consumables (Liu et al., 2012).
An era of personalized genomics has begun thanks to advancements in
sequencing technologies. Many individual genomes have been reported from distinct regions,
including two individuals of northwest European origin (Wheeler et al., 2008; Levy et al.,
2007), a Yoruba (Bentley et al., 2008), an Indian Gujarati (Kitzman et al., 2010) as well as an
Indian female and male (Gupta et al., 2012; Patowary et al., 2012), a person from China
(Wang et al., 2008), Korean individuals (Kim et al., 2009; Ahn et al., 2009), an Aboriginal
Australian (Rasmussen et al., 2011), a Japanese (Fujimoto et al., 2010), and 1,000 genomes
from a consortium (Dits, 2010). The complete genome sequences derived from numerous
diverse ethnic populations are helping us to understand genetic ancestry, migration patterns
and population bottlenecks.
Venter was the first to have his personal genome sequenced using the Sanger dideoxy method,
which is still the method of choice for de novo sequencing (Levy et al., 2007). With the Sanger
method, diploid sequences were assembled with phase information, which has not been done
in other published genomes (Bentley et al., 2008; Kitzman et al., 2010; Pushkarev et al.,
2009). Despite limitations in read length, which is extremely important for the assembly of
contigs and final genomes, it is next generation sequencing (NGS) technology that has made
personal genomics possible by dramatically reducing cost and increasing efficiency
(Metzker 2010). To re-sequence a genome, scientists can simply map short reads from an NGS
machine to a reference sequence, avoiding expensive and laborious long-fragment-based
de novo assembly (Goto et al., 2011). As demonstrated by the large percentage of unmapped data
in previous human genome re-sequencing projects, a re-sequenced
genome may not fully reflect ethnic and individual genetic differences because its assembly
depends on the previously sequenced genome (Halaschek-Wiener et al., 2009). Since the
introduction of NGS, the bottleneck in sequencing the genomes of a whole population is no longer
the sequencing process itself, but the bioinformatics process of fast and accurate mapping to
known data, structural variation analyses, phylogenetic analyses, association studies, and
application to phenotypes such as diseases (Veltman et al., 2013).
2.1 Sequencing Techniques:
NGS technologies have made it possible for researchers to generate large amounts of
sequence data at high speed and at a cost reduced to between 0.1% and 4% of that of the Sanger
system, although the platforms differ in their error profiles and limitations (Kircher 2012). The
choice of an appropriate sequencing platform depends on the research project (Ekblom and Wolf
2014; Meldrum et al., 2011). In the last few years, the time-consuming step has shifted from
sequencing to the computational analysis of the generated data (Bielejec et al., 2014). In the
future, researchers are expected to spend more time, expertise and funds on analyzing the
generated data (Burrows and Savage 2014). Comparatively small research teams will find it hard
to set up and manage the infrastructure needed to store and analyze hundreds of terabits of raw
and processed sequencing data (Sathi 2014). Even well-established genome centers face the same
problems in the ongoing use of NGS platforms (Eisenstein 2012). Current instruments are therefore
likely to be improved to further increase throughput and lower the price of decoding DNA
molecules (Glenn 2011). All these advancements will be helpful for future research in biological
data analysis.
2.1.1 High-throughput sequencing
High-throughput sequencing techniques have expanded vastly over the last few
years. The initial determination of a draft of the human genome took ten years, at an estimated
cost of US$ 3 × 10^9 (Del Giacco and Cattaneo 2012). Instruments now exist that can produce
250 Gb per week (Lesk 2011). The largest dedicated institution in the field, the BGI (formerly
the Beijing Genomics Institute, but currently in Shenzhen), has 128 such instruments (Rubenstein
2010). Each can produce 25 × 10^9 bp per day, which corresponds to one human genome at over 8X
coverage (Del Giacco and Cattaneo 2012). Running at full capacity, these resources could
produce 10,000 human genomes per year.
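The coverage figures quoted above can be checked with a short calculation. The following is a minimal sketch, assuming a haploid human genome size of about 3.1 × 10^9 bp (an assumed round figure that is not given in the text); the per-instrument throughput and the number of instruments are those quoted above.

```python
# Illustrative arithmetic only. GENOME_SIZE_BP is an assumed round figure,
# not a value taken from the text.
GENOME_SIZE_BP = 3.1e9   # assumed haploid human genome size
BASES_PER_DAY = 25e9     # per instrument, as quoted in the text
INSTRUMENTS = 128        # as quoted in the text


def coverage(total_bases, genome_size=GENOME_SIZE_BP):
    """Average depth of coverage: sequenced bases divided by genome length."""
    return total_bases / genome_size


# Depth obtained from one instrument running for one day.
daily_depth = coverage(BASES_PER_DAY)

# Depth available per genome if a year of output is shared by 10,000 genomes.
yearly_bases = BASES_PER_DAY * INSTRUMENTS * 365
depth_per_genome = coverage(yearly_bases / 10_000)

print(f"one instrument, one day: {daily_depth:.1f}X")
print(f"10,000 genomes per year: {depth_per_genome:.1f}X each")
```

Under this assumption one instrument yields slightly more than 8X per day, and 10,000 genomes per year would correspond to roughly 38X each, which is in the range usually aimed at for re-sequencing.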
Moreover, there is no reason to think that technical progress will not continue to
accelerate. There are two aspects of a large-scale sequencing project (Lander et al., 2001). One is
the generation of the raw data (Ramos et al., 2011); the other is the assembly of those data.
Most methods sequence long DNA molecules by fragmenting them and partially sequencing the
pieces (Alberts et al., 2002). To determine the first genome from a species, these short
sequences must be assembled into the whole sequence, using overlaps between the individual
fragments (Li et al., 2010). The typical length of the individual short sequences reported is
called the read length of the method. The goals of contemporary technical development are to
increase not only the number of bases sequenced per unit time and per unit cost, but also the
read length (DePristo et al., 2011). Both the generation of raw data and assembly depend
crucially on effective and efficient computer programs. Some contemporary genome centres have
as many computational biologists on their staff as ‘wet-lab’ scientists (Sloot et al., 2006). The
very high throughput of new instruments makes it possible to address several types of
biological questions (Mardis 2010).
2.1.2 De novo sequencing
De novo sequencing of a genome is a challenging task because there is no reference to
compare it with (Davey et al., 2011; Elshire et al., 2011). Researchers working on such projects
obtain millions of short DNA fragments of about 200 bp in size (Robasky et al., 2014;
Grabherr et al., 2011; Butler et al., 2008). They therefore need fragments at high coverage
in order to assemble the complete genome. Designing new and advanced bioinformatics algorithms
and computational tools for efficient de novo assembly is currently an emerging field of science
(Schatz et al., 2010).
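The idea of assembling short fragments through their overlaps can be illustrated with a toy example. The following is a minimal sketch of a greedy overlap merge on three invented fragments; the function names, the fragments and the minimum overlap of 3 bases are assumptions made for illustration, and real assemblers use far more elaborate graph-based algorithms that tolerate sequencing errors and repeats.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of fragments with the longest overlap."""
    frags = list(fragments)
    while len(frags) > 1:
        best = (0, None, None)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # no remaining overlaps: leave the contigs separate
        merged = frags[i] + frags[j][n:]
        frags = [f for k, f in enumerate(frags) if k not in (i, j)] + [merged]
    return frags


# Three invented fragments drawn from the toy sequence ATGGCGTACGTTAGC.
contigs = greedy_assemble(["ATGGCGTA", "CGTACGTT", "CGTTAGC"])
print(contigs)
```

With error-free fragments and no repeats the three fragments collapse into a single contig; higher coverage increases the chance that every neighbouring pair of fragments shares a sufficiently long overlap.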
2.1.3 Re-Sequencing
Re-sequencing the genome of a species is much easier than de novo sequencing
(Bentley et al., 2008). The DNA fragments generated are compared to a reference genome that
has already been successfully assembled using de novo analysis (Del Giacco and Cattaneo 2012).
Sequencing coverage must be sufficient to avoid errors in sequence determination and variant
calling.
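The dependence of variant calling on coverage can be sketched with a toy re-sequencing example. The code below is an illustration only: reads are placed on the reference by a naive scan for the position with the fewest mismatches, and a variant is reported where the majority base differs from the reference at a minimum depth. The sequences, the function names and the `min_depth` threshold are assumptions; production pipelines use indexed aligners and statistical genotype models.

```python
from collections import Counter


def best_position(read, reference):
    """Offset at which the read has the fewest mismatches (naive full scan)."""
    best, best_mm = 0, len(read) + 1
    for i in range(len(reference) - len(read) + 1):
        mm = sum(a != b for a, b in zip(read, reference[i:i + len(read)]))
        if mm < best_mm:
            best, best_mm = i, mm
    return best


def call_variants(reads, reference, min_depth=3):
    """Pile reads up on the reference and report majority bases that differ."""
    pileup = [Counter() for _ in reference]
    for read in reads:
        start = best_position(read, reference)
        for offset, base in enumerate(read):
            pileup[start + offset][base] += 1
    variants = []
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to trust a call at this position
        base, n = counts.most_common(1)[0]
        if base != reference[pos] and n / depth > 0.5:
            variants.append((pos, reference[pos], base, depth))
    return variants


reference = "ACGTTGCAAGCTTAGGCTAACG"
sample = "ACGTTGCAAGGTTAGGCTAACG"  # invented sample: C>G at 0-based index 10
reads = [sample[i:i + 8] for i in range(0, len(sample) - 7, 2)]
variants = call_variants(reads, reference)
print(variants)
```

Raising `min_depth` above the depth actually achieved suppresses the call, which mirrors the point made above that insufficient coverage leads to missed or unreliable variants.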
2.1.4 Exome sequencing
The exome comprises the regions of the human genome that encode the proteins
necessary for the human body (Ng et al., 2008). One goal of re-sequencing is to determine how
the genome of an individual varies from the reference genome. Approximately three percent of the
human genome consists of exons, which are estimated to number more than 150 thousand (Ng et al.,
2009). Inherited disorders are sometimes due to the abnormal behavior of a certain protein. The
reason behind this abnormality is a mutation in the coding region of an exon
(Baralle et al., 2005). Next generation sequencing is helping researchers to identify these
variants by sequencing only the exome (Ng et al., 2010), so they do not need to sequence the
whole genome of a patient to search for a pathogenic variant.
2.2 High Throughput Sequencing Platforms:
After the successful completion of the first human genome project,
companies such as 454 and Solexa began launching their sequencing systems in 2005 (Bennett et
al., 2005). Later, the SOLiD platform was released as another parallel sequencing system. These
more powerful technologies, known as next generation sequencing, performed very well in
delivering accurate sequence results compared with Sanger sequencing (Zhao and Grant 2011). The
pioneers SOLiD and 454 were then purchased by Applied Biosystems and Roche,
respectively, while Illumina purchased Solexa (McPherson 2014). The three companies soon
improved the performance of their systems.
2.2.1 Roche 454 System: Pyrosequencing
Roche 454 is one of the pioneers of commercially successful NGS systems and is based on
pyrosequencing technology (Capobianchi et al., 2013). In the pyrosequencing procedure, one type
of nucleotide at a time is washed over many copies of the target sequence. If the nucleotide is
complementary to the DNA template, the polymerase incorporates it (Huse et al., 2007). The
polymerase extends the strand through the longest possible run of that complementary
nucleotide, after which incorporation stops until the next nucleotide is supplied (Kircher and
Kelso 2010). In 2005, Roche-454 parallelized this technique on a picotiter plate for
high-throughput sequencing (Mardis 2008). Each of the two million wells of the plate has room
for exactly one 28-µm diameter bead coated with copies of the molecule to be read (Figure 2.2)
(Margulies et al., 2008).
The main prerequisite of the pyrosequencing method is to cover each bead with many
copies of the same molecule (Kircher and Kelso 2010). This is achieved by making libraries in
which every molecule carries two different adapter sequences, one at the 5′ end and one at the
3′ end (Metzker 2010). Ligation of two synthesized oligonucleotides is required to prepare the
454/Roche sequencing library (Kircher 2011). The adapters are complementary to oligonucleotides
on the beads; consequently, molecules attach to the beads by hybridization (Dressman et
al., 2003). Empty beads can then be separated from the loaded ones by means of the other
adapter, and the loaded beads are used in the sequencing process (Gansauge and Meyer 2013).
With the updated version of the 454/Roche platform it is now possible to sequence 1.5 million
beads in a single reaction and to read about 500 nucleotides from each (Casals et al., 2012).
Read length is determined by the number of flow cycles and by the composition and order of
bases in the DNA being sequenced (Haydock et al., 2015). The number of flow cycles is currently
limited to 200, which produces reads about 400 nucleotides long (Buermans and Dunnen 2014). It
is estimated that the available Roche platforms can generate 750 Mb of sequence per day at a
cost of US$20 per Mb (Hui 2014).
Figure 2.2: The pyrosequencing process (Kircher 2011).
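The relation between flow cycles, base order and read length can be illustrated with an idealised flowgram. The following is a minimal sketch, assuming a fixed nucleotide flow order of T, A, C, G and noise-free signals that are exactly proportional to the number of bases incorporated; the function names and the example sequence are invented for illustration.

```python
def flowgram(synthesised, flow_order="TACG", cycles=2):
    """Idealised signal per nucleotide flow for the strand being synthesised.

    Each flow extends the strand through the whole run of the flowed base,
    so a homopolymer of length n gives a signal of n in a single flow.
    """
    signals, pos = [], 0
    for _ in range(cycles):
        for nt in flow_order:
            run = 0
            while pos < len(synthesised) and synthesised[pos] == nt:
                run += 1
                pos += 1
            signals.append((nt, run))
    return signals


def base_call(signals):
    """Convert flow signals back into a nucleotide sequence."""
    return "".join(nt * run for nt, run in signals)


signals = flowgram("TTACGGGA", cycles=2)
print(signals)
print(base_call(signals))
```

In this example eight bases are read in two cycles because of the homopolymer runs, which shows why the read length obtained from a fixed number of cycles depends on the sequence itself. In real data the difficulty lies in telling a signal of, say, 5 from 6, which is the characteristic homopolymer error of this chemistry.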
2.2.2 AB SOLiD System: Sequencing by Ligation
Applied Biosystems (ABI) bought the SOLiD technology in 2006 and released it for
commercial use in late 2007 (Coombs 2008). The system was developed at Harvard University,
which also produced an upgraded, cheaper version known as the Polonator in a joint effort with
Dover Systems (Datta et al., 2010). Later, a company named Complete Genomics Inc. was
established and started a human genome sequencing service (Kircher and Kelso 2010). It also
uses the technology developed at Harvard, with a modified library preparation strategy. In
SOLiD, the clonal sequencing features are created by emulsion polymerase chain
reaction rather than bridge PCR (Figure 2.3) (Voelkerding et al., 2009). It uses a di-base
technique that interrogates two DNA bases at each step, whereas the Illumina platform
reads the nucleotide sequence directly (Park 2009). The ABI SOLiD uses only four dyes, so
that each color represents a set of dinucleotides rather than a single base. Each base is
interrogated twice as the machine moves along the read, which makes it possible to remove
problematic calls generated by the system during the sequencing process. The updated SOLiD
systems are able to produce about 1 billion 50-bp reads per run and about 100 Gb of data in a
day (John and Grody 2008).
Figure 2.3: Applied Biosystems’ SOLiD sequencing by ligation (Kircher 2011).
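The di-base (two-base) encoding can be made concrete with a small example. The sketch below assumes the commonly described four-color scheme in which the color of a dinucleotide equals the bitwise XOR of two-bit base codes (A=0, C=1, G=2, T=3), and it writes a read as its first base followed by the color digits. The function names and the example sequences are illustrative and are not taken from the cited sources.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = "ACGT"


def to_color_space(seq):
    """Encode a sequence as its first base followed by one color per dinucleotide."""
    colors = [CODE[a] ^ CODE[b] for a, b in zip(seq, seq[1:])]
    return seq[0] + "".join(str(c) for c in colors)


def from_color_space(encoded):
    """Decode a color-space read, starting from the known first base."""
    bases = [encoded[0]]
    for color in encoded[1:]:
        bases.append(BASE[CODE[bases[-1]] ^ int(color)])
    return "".join(bases)


reference_colors = to_color_space("ATGGC")
variant_colors = to_color_space("ATCGC")  # single base change G>C at the third base
print(reference_colors, variant_colors)
print(from_color_space(reference_colors))
```

Because every base contributes to two adjacent colors, a true single-base variant changes two neighbouring colors, whereas an isolated color change is more likely a measurement error. This is the basis of the double interrogation of each base described above.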
2.2.3 Illumina-Solexa System: Sequencing with Reversible Terminators
The Genome Analyzer was first introduced in 2006 by Solexa, which was then acquired by
Illumina in 2007 (Ansorge 2009). The amplified sequencing features in this system are created
by bridge PCR, and the chemistry is based on sequencing by synthesis, which is more similar to
Sanger technology (Ross and Cronin 2011). Each incorporated nucleotide is recorded by imaging
during this procedure and is then converted into a base call (Branton et al., 2008). The process
starts with library preparation and amplification for sequencing. The double-stranded library is
then converted into single strands, which are loaded into the flow cell according to
the protocol. Oligonucleotides in the flow cell then hybridize to the library molecules (Malone
and Oliver 2011). Amplified copies of each DNA template are clustered together on the surface.
Approximately 1000 copies of the template are present in each cluster (Lagally et al., 2001). With
the HiSeq 2000, about 30 million regions can be amplified (Gilbert et al., 2010).
The eight lanes of the flow cell can sequence eight independent libraries in parallel.
Single-stranded molecules are generated in each cluster (Figure 2.4), and the sequencing primer
is hybridized to the adapters. The images obtained are then analyzed, poor-quality reads are
filtered out and the final output data files are in FASTQ format (Martin 2011). The Illumina
machines are capable of reading 100 base pairs with a comparatively low error rate. A single run
can produce almost 20 Gb of data in less than 24 hours (Quail et al., 2008).
Figure 2.4: Reversible terminator chemistry utilized in the Illumina platforms (Kircher 2011).
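The FASTQ output and the filtering of poor-quality reads mentioned above can be illustrated as follows. This is a minimal sketch, assuming four-line FASTQ records with Phred+33 quality encoding (older Illumina pipelines used a different offset) and an arbitrary mean-quality threshold of 20; the two records and the helper names are invented for illustration.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, qualities) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            return
        seq = handle.readline().rstrip()
        handle.readline()  # '+' separator line
        qual = handle.readline().rstrip()
        # Phred+33 encoding is assumed: quality = ASCII code minus 33.
        yield header[1:], seq, [ord(c) - 33 for c in qual]


def mean_quality(qualities):
    """Arithmetic mean of the per-base Phred scores of one read."""
    return sum(qualities) / len(qualities)


example = "@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\n!!!!\n"
kept = [name for name, seq, quals in read_fastq(io.StringIO(example))
        if mean_quality(quals) >= 20]
print(kept)
```

Here `read1` carries quality 40 at every base and is kept, while `read2` carries quality 0 and is discarded. Real filters usually also trim low-quality ends rather than only discarding whole reads.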
2.2.4 Ion Torrent: Semiconductor Sequencing
In 2010, Life Technologies released its Personal Genome Machine (Ion Torrent PGM)
(Quail et al., 2012). It is a benchtop high-throughput sequencer that uses semiconductor
sequencing technology and is used for genome re-sequencing (Egan et al., 2012). Its low cost
and easy sample preparation help to reduce the burden on core facilities and encourage the use
of NGS in medical fields (Gullapalli et al., 2012). The Ion Torrent PGM is commercially
available and can analyze medical samples with high productivity and accuracy. Using
semiconductor-based technology, the Ion Torrent produces direct sequence reads without an
optical interface (Delseny et al., 2010). A pH sensor detects the protons released when a
nucleotide is incorporated (Toumazou et al., 2013). The Ion Torrent PGM is the first commercial
sequencer that does not need fluorescence or a camera for scanning, which explains its high
speed, lower price and smaller instrument size (Rothberg et al., 2011). The error rate is
comparatively high and tends to increase in genomic regions where real polymorphism is also
higher (Derrien et al., 2012). Reducing these errors is therefore the biggest challenge for
analysts. The per-base accuracy was validated by the company in 2011 and reported as 99.6%,
based on fifty-base reads with one hundred Mb per run (Westerfield 2013). The accuracy was then
verified repeatedly by the company itself, but these figures have never been verified by
research groups outside the manufacturing company (Yeo et al., 2012).
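To put the reported per-base accuracy into perspective, it can be converted into a Phred quality score and into the expected fraction of error-free reads. The following is a minimal sketch, assuming independent errors with the same probability at every base, which is a simplification since real errors cluster, for example in homopolymers.

```python
import math


def phred(accuracy):
    """Phred quality score Q = -10 * log10(error probability)."""
    return -10 * math.log10(1 - accuracy)


def error_free_fraction(accuracy, read_length):
    """Probability that a read has no errors, assuming independent base errors."""
    return accuracy ** read_length


q = phred(0.996)                       # accuracy figure quoted in the text
clean = error_free_fraction(0.996, 50)  # fifty-base reads, as quoted in the text
print(f"Q = {q:.1f}, error-free 50-base reads: {clean:.0%}")
```

Under these assumptions 99.6% accuracy corresponds to a quality of about Q24, and roughly four out of five 50-base reads would be free of errors.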
2.2.5 The Third Generation Sequencer
With the increasing demand for NGS technology, another generation of sequencing
has been introduced. Third-generation sequencing has a couple of important features, e.g.
PCR is not required before sequencing, which saves scientists time (Liu et al., 2012). The
PacBio or Nanopore signals are captured in real time, meaning that they are observed
during the enzymatic process of incorporating a nucleotide into a chain (El-Metwally et al.,
2014). One such methodology, known as single-molecule real-time (SMRT) sequencing, is a
third-generation technology introduced by Pacific Biosciences (Raley et al., 2014). SMRT needs a
lower starting quantity of DNA (< 1 μg) than other platforms and yields significantly longer
read lengths (Wall et al., 2009). This technology is not yet as common among researchers as the
second-generation sequencers.
Nanopore sequencing, developed by researchers at Oxford University, is known to the public as
the Nanopore sequencer (Laszlo et al., 2014). It is a third-generation platform with reads
orders of magnitude longer than those of existing technologies. The developers are trying to
bring the cost well below current market rates. The most interesting thing about Nanopore is
that it is a futuristic USB-powered sequencer costing only one thousand dollars, with an
easy-to-use protocol (Pabinger et al., 2014). The company later stopped production and came back
after two years with a new beta version of the sequencer known as MinION (Mikheyev and Tin
2014). The initial results were immature and of limited quality, but they are improving quickly.
2.3 Genetic Variants in the Human Genome
Genetic variants are the differences in the genetic makeup of individuals and populations.
A single gene may have multiple variants at different positions across the whole human
population, which then constitute polymorphisms (Cargill et al., 1999). No two humans are
identical, even if they developed from one zygote (Scott et al., 2000). The differences are due
to alterations that happen during development, but it is estimated that human individuals are
99.9 percent similar to one another (Check 2005). These variations are the key information
generally used in DNA fingerprinting and personal/population identification (Edwards et al.,
1992). Different populations have different allele frequencies, which sometimes make them unique
for a certain character (Wright 1949). The more geographically distant two populations are, the
more their genetic makeup differs.
Genetic changes in individuals arise during meiosis, when genes are exchanged during
crossing over of the chromosomes (Campbell et al., 2014). Other causes of genetic alteration
are natural selection and environmental factors (Williams 2008). This is why some genes or
alleles are expressed only where the geographic region gives them a chance to be expressed
(Cavalli-Sforza et al., 1994). Sometimes the cause of change is genetic drift, the effect of
random changes in the gene pool, which is of great importance in ancestry-related studies, e.g.
of when modern humans migrated from Africa (Tishkoff and Kidd 2004).
Human genetic variation has both genealogical and medical importance (Burchard et al.,
2003). It helps researchers learn about ancient migrations and about how genetically similar
diverse human populations are to each other (Sachidanandam et al., 2001). Genetic
polymorphism is useful in disease association studies because certain disease-causing variants
occur more frequently in some populations. On average, an individual carries 60 unique mutations
compared with his or her parents (Taillon-Miller et al., 1999). Genetic differentiation in
humans takes many forms, ranging from chromosome-level changes to point mutations.
2.3.1 Single Nucleotide Variants / Polymorphisms
A single nucleotide polymorphism (SNP, pronounced ‘snip’) is a change of a single
nucleotide among members of a group that occurs in about one percent of the group. To date,
more than 30 million SNPs have been reported in humans (Frazer et al., 2007). These are the
most common variations used in genomic studies. Single nucleotide variants are the major
source of heterogeneity and occur about every 100 to 300 bases on average (Batra et al.,
2014). There are two types of SNPs, i.e. synonymous and non-synonymous. Non-synonymous or
functional SNPs are those variants that alter the function of a gene and cause a phenotypic
difference between humans (Haller et al., 2014). Out of 30 million SNPs, only three to five
percent have an associated function (Hinds et al., 2005). Synonymous SNPs are also important
and can be used as genetic markers in different genome-based studies.
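The distinction between synonymous and non-synonymous SNPs in a coding sequence can be illustrated by translating the affected codon before and after the change. The following is a minimal sketch using the standard genetic code; the function name and the three example codons are chosen for illustration, and real annotation tools additionally take transcripts, strand and splice sites into account.

```python
from itertools import product

# Standard genetic code in the conventional TCAG ordering ('*' marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_snp(codon, position, alt_base):
    """Classify a single-base change within one codon (position is 0, 1 or 2)."""
    mutated = codon[:position] + alt_base + codon[position + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutated]
    if before == after:
        return "synonymous"
    if after == "*":
        return "non-synonymous (nonsense)"
    return "non-synonymous (missense)"


print(classify_snp("GAA", 2, "G"))  # GAA -> GAG, both encode glutamate
print(classify_snp("GAG", 1, "T"))  # GAG -> GTG, glutamate to valine
print(classify_snp("TGG", 2, "A"))  # TGG -> TGA, tryptophan to stop
```

Only the second and third changes alter the protein, which is why such variants are the first candidates examined when a functional effect is suspected.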
2.3.2 Structural variation
Structural variation is another kind of human genetic variation, which arises from structural
changes in the chromosomes of an organism (Feuk et al., 2006). It includes microscopic
chromosomal regions that are deleted, duplicated, inverted or inserted (Redon et al., 2006).
Structural variants were first studied in two personal genomes in 2007. Structural variation
makes a great contribution to genome variation and has been investigated by researchers for its
association with complex diseases (Frazer et al., 2009).
2.3.3 Copy Number Alteration / Variation
Copy number variations are genetic polymorphisms in which a structural segment
of DNA that is 1 kilobase (i.e. 1000 nucleotide bases) or larger is present in a variable number
of copies compared with the reference genome (Redon et al., 2006). These are mutations and
include deletions, insertions, and duplications (Freeman et al., 2006). Other definitions
encompass even larger swaths of DNA. The Wellcome Trust Sanger Institute, which heads the Copy
Number Variation Project (Conrad et al., 2010), defines a CNV as a variable number of
repetitions of 10 kb (10,000 base pairs) to 5,000 kb (5 million base pairs) sequences
(micro-duplications) (St Clair 2009).
Copy number variants may cover about 12% of the human genome (Redon et
al., 2006), with ~12 CNVs per individual (Feuk et al., 2006), and
the cumulative result of CNV inheritance may constitute more than 10% of the human genome
(Lupski et al., 2011). Recent research suggests that the average human genome contains more
than 1000 CNVs, encompassing approximately four million base pairs (Conrad et al., 2010), and
that new CNVs arise at a rate of 0.07-0.12 per generation (Itsara et al., 2010). CNVs are either
inherited from the parents or arise de novo. In both cases, the functional consequences occur at
the translational level through a gene dosage effect and include truncated protein sequences,
eliminated or reduced protein expression (typically the result of deletions), or increased
protein expression (typically caused by duplications), and hence affect the individual’s
phenotype (Connolly et al., 2014).
A large number of algorithms have been developed to identify CNVs from sequencing
data, including CNVnator, cnvHiTSeq and XHMM (Tan et al., 2014). Different CNV algorithms
have different strengths and weaknesses (Li and Olivier 2012), and the most effective strategy
in terms of minimizing erroneous CNV calls is to incorporate multiple toolsets, which can be
validated computationally via local de novo assembly (Wong et al., 2010).
2.3.4 Lineage Markers for Population Study
The paternally inherited Y chromosome and the maternally inherited mitochondrial DNA have been
used extensively for understanding human history and the movement of anatomically modern humans
(Richards et al., 2000). The non-recombining region of the Y chromosome (NRY) and the
mitochondrial genome are the only two haploid parts of the complete human genome, since they
are transmitted uniparentally, without recombination in each generation during meiotic cell
division (Jobling and Tyler-Smith 2003). These two haploid systems are passed down from
generation to generation without change (unless a mutation alters the haplotype), and therefore
preserve records of genetic history better than autosomal nuclear DNA (Hellenthal et al., 2008).
Autosomal nuclear DNA, by contrast, is shuffled in each generation, i.e. 50% of an individual’s
genetic information comes from his or her father and 50% from his or her mother (Helgason et
al., 2003).
2.3.5 Variable number tandem repeats
VNTRs are another type of variant used in DNA fingerprinting and forensic
science (Luczak-Kadlubowska et al., 2008). They are variations in the number of tandem repeats
of short sequences in the human genome. VNTRs can be found on many chromosomes and differ in
length between individuals (Nakamura et al., 1987). They are used in forensic
science for personal or parental identification, crime scene investigation, etc. (Lewontin and
Hartl 1991).
2.3.6 Short tandem repeats (STRs)
Short tandem repeats with units of up to about five base pairs are microsatellites, while those
with longer units are known as minisatellites. Currently, STR measurement is based on
electrophoresis, which requires dye-labeled primers and very careful analysis of results because
of technical artifacts (Chung et al., 2004). An approach was previously introduced for STR
typing that uses terminator nucleotides to terminate polymerization at the shortest allele
(Sanchez et al., 2006). This helped to sequence samples that are heterozygous for STRs.
2.4 Applications of Genome Variants
Every individual in this world differs genetically at some level of the
sequence, which is the reason behind the diversity of human beings (Tooby and Cosmides 1990).
Understanding human genetic diversity is a major objective for scientists worldwide,
because it provides knowledge about the evolutionary history of this important species
(Jobling et al., 2013). It will then be possible to know where human populations came from
and where they are heading. Knowledge about the genetic diversity of humans is also necessary so
that researchers can understand different diseases and how we respond to specific drugs at the
individual or cohort level (Price et al., 2010). Millions of SNPs in genes that might be
associated with diseases were discovered in four world populations by the HapMap consortium
using genome-wide microarray chips (McCarroll et al., 2006). Moreover, scientists are trying to
understand population history, which will help in discovering disease genes. Improvements in
our understanding of patterns of human genetic variation have also informed our view of the
history of modern human populations (Cavalli-Sforza 2005). New methodologies for visualizing
and interpreting genetic data have helped in understanding human evolution.
Personal genetic information can be used to investigate population structure and
to assign a person to groups that frequently match his or her geographical lineage (Shaer et
al., 2014). With the development of new techniques and algorithms, it is now possible to
accurately estimate the genetic relationship among individuals (Lange et al., 2014).
2.4.1 Genetic Ancestry and Admixture Mapping
Admixture analysis is used to study how genotypic information changes the disease rate
in human populations. Admixture occurs when diverse populations start interbreeding and their
progeny carry a combination of alleles from different ancestral groups (Mendelson et al., 2014).
Estimating a person’s admixture ratio is a helpful tool in population genetics and
epidemiology. Admixture analyses enable scientists to assign individuals with no information
about their ancestry to distinct populations (Ruiz-Linares et al., 2014). This technique has
been applied effectively to different populations to learn about their genetics. Present-day
admixed ethnic groups, which trace their ancestry to several regions, are suitable for
investigating genes for diseases and other phenotypes that vary in occurrence between parental
populations (Race and Group 2005).
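How an admixture ratio can be estimated from allele frequencies is illustrated below with the classical two-source estimator m = (p_admixed − p_B) / (p_A − p_B), averaged over markers. This is a minimal sketch and not the method of the studies cited above; the allele frequencies are invented, and modern tools rely on likelihood-based models over many markers.

```python
def admixture_proportion(p_admixed, p_source_a, p_source_b):
    """Fraction of ancestry from source A implied by one allele's frequencies."""
    if p_source_a == p_source_b:
        raise ValueError("marker is uninformative: source frequencies are equal")
    return (p_admixed - p_source_b) / (p_source_a - p_source_b)


# Invented frequencies: (admixed population, source A, source B) for three markers.
markers = [(0.35, 0.80, 0.20), (0.40, 0.70, 0.30), (0.30, 0.90, 0.10)]
estimates = [admixture_proportion(*m) for m in markers]
mean_estimate = sum(estimates) / len(estimates)
print(f"estimated contribution of source A: {mean_estimate:.2f}")
```

Markers whose allele frequencies differ strongly between the source populations are the most informative, because the denominator is large and sampling noise in the frequencies has less influence on the estimate.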
Some monogenic disorders vary in allele frequency between populations, and this
variation usually correlates with ancestry, whether ethnic or geographical (Via et al.,
2009). Health-care experts generally take such information into account when making
decisions. Common diseases like diabetes, obesity, heart problems, blood pressure and
neurological disorders involve many genetic variants and environmental factors (O’Donnell et
al., 1998). Scientists are investigating the involvement of pathogenic alleles with low or
moderate effect.
2.4.2 Medical and Clinical Implications
Whole-genome sequencing technology is now capable of identifying disease variants in
patients accurately and at lower cost. Nevertheless, researchers and policy makers are still
trying to handle the issues involved in using and interpreting genotype data (Kaye and Hawkins
2014). Until now, genetic variants have been used in molecular diagnostic testing, but only at a
limited number of loci (Yip et al., 2008). With cheaper, faster and more accurate sequencing
technology, diagnostic tests can be done at the single-nucleotide level.
Using human genome data generated by the 1000 Genomes Project and other genome research
groups, investigators all around the world have come up with more advanced tools to study the
role of variants, along with their associated environment, in complex diseases (Cirulli and
Goldstein 2010). Genome-wide studies are already helping clinical researchers to improve
diagnostics and to develop better decision-making tools for patients (Houdayer et al., 2008).
This is how the role of genomics in health care initiated the era of genetic medicine, which is
also known as personalized medicine.
Moving discoveries from the laboratory to the clinic takes considerable time
and funding. Recently the American government announced that it would invest over 200 million
dollars in genetic health care and Precision Medicine, another term for personalized genomic
medicine (McCarthy 2015). According to genetics professionals, it generally takes more than ten
years for industry to complete medical studies, owing to policies designed by the FDA
(Ciociola et al., 2014).
Genome-wide studies are contributing to our understanding of an individual’s risk of developing
diseases that are common in world populations. These common diseases include diabetes, cancer,
hypertension and cardiovascular disorders (Eyre et al., 2004). A profound understanding of the
genetic basis of such diseases will help us reveal the essential mechanisms of cells and will
eventually increase our knowledge of how various elements work together to affect an
individual’s health (Lander 2011).
2.4.3 Pharmacogenomics
Pharmacogenomics is the field of genomics in which advanced molecular and genetic techniques
are used to better understand a patient’s genetic abnormality and to prescribe better medicine
for that patient (Goldman et al., 2007). Rational medicines save millions of lives every day.
Yet a drug may not help one patient even if it works for others (Edwards
and Aronson 2000). In some cases it may cause severe side effects in one person but not in
another. Many scientists have realized that most prescribed medicines do not work for
most of the patients who take them (Vermeire et al., 2001). It has long been an open secret
within pharmaceutical companies that most of their drugs are ineffective for most patients, but
only recently has this become public. After the decoding of the human genome, many ideas were
developed for finding the causes and cures of human diseases (Jobling et al., 2013). The
clinical application of individual genetic information has led to a new era of personalized
drugs, which has created challenges and opportunities for biomedical researchers and health
care professionals (Guttmacher et al., 2007).
Individualized medicine uses information from a person’s genetic profile to
identify disease-related gene expression levels, choose a drug and start a preventive measure
that is appropriate for that particular patient (Chobanian et al., 2003). Computational
analysis of an individual’s genetic data for predisposition to disease is changing the way
medicines are discovered and prescribed. It is indeed a bold new research effort to
revolutionize how we improve the health care system (Collins et al., 2003). This type of
innovation is associated with considerable scientific uncertainty and financial risk, but
recently countries like the USA have been spending millions of dollars to launch personalized
medicine initiatives that promise to accelerate biomedical inventions and provide medical
professionals with new tools and skills to select which treatments will work best for which
patients (Hamburg and Collins 2010).
With the development of personalized medicine it will be possible to produce more
effective drugs with a lower chance of adverse effects than rational drugs (Okimoto
and Bivona 2014). Healthcare management will soon be capable of delivering more targeted
drug therapy to patients, with fewer errors and a lower rate of drug-related side
effects (Whirl-Carrillo et al., 2012). With the development of biotechnology, healthcare
professionals are now familiar with the fact that the same drug does not work in the same way in
every patient, and that some patients do not respond positively to treatment (Fletcher et al.,
2012). The idea behind individualized medicine is that every patient has a unique genetic
makeup, and that this should be taken into account in the choice of medical treatment, resulting
in improved efficacy and fewer side effects (Gamma 2013). Precision medicine can be regarded as
the current era’s answer to rational drug usage. Together with these novel molecular diagnostic
procedures, it will give physicians an objective means of improving medical treatment in many
disease areas (Pauwels et al., 2014).
Personalized health care has the capability to revolutionize how we prevent,
diagnose and treat human diseases (Snyderman and Williams 2003). It is the beginning of a
journey that holds much promise, but it will require thoughtful and joint research among
scientists, health-care professionals, ethicists, policy makers, patient advocates and the general
public to chart the wisest course (Roberts and Ostergren 2013).
2.5 Personal and Population Genome Projects
Scientists are contributing to the field of genomics through several personal and
population genome projects run by different collaborative research groups around the globe.
Many renowned personalities have donated their DNA to research communities so that they can
understand the hidden genetic information and use it for the betterment of humankind
(Hellenthal et al., 2014). These renowned individuals include Craig Venter (American
geneticist), James D. Watson (Nobel Laureate), Stephen Quake (bioengineer) and George Church
(Harvard professor). Atta ur Rehman (former Education Minister) from Pakistan has also
contributed to the field by providing his genome. Similarly, other individual genomes belonging to
different ethnic groups from different countries have been reported, including the first Indian
genome from a male, a female genome from South Asian India (SAIF), and genomes from Sri
Lanka, Ireland, Turkey, Australia and Africa.
Due to the sudden decrease in the cost of sequencing and analyzing a whole human genome,
many research groups have established consortia to study genomes from geographically
different regions and ethnic groups, to understand the biology of genetic disorders and to understand
how these populations migrated from one place to another (Kidd et al., 2004).
2.5.1 Personal Genome Project (PGP): PGP was started by Prof. George Church of Harvard
Medical School in 2005 (www.personalgenomes.org). It is a long-term project that aims
to analyze the personal genomes of donors who consent to their genomes being made publicly
available to the world. It was intended to enroll 100,000 donors from America, but later many
other semi-consortia from the UK, Korea and elsewhere also joined the effort (Church 2005).
2.5.2 1000 Genomes Project (1KGP): The project was announced in 2008 by the Wellcome Trust Sanger
Institute, the Beijing Genomics Institute and the National Human Genome Research Institute (Siva 2008).
The main objective of the project was to catalogue
human genetic variants by sequencing one thousand human genomes belonging to diverse ethnic
groups around the world. The project was then divided into three phases, and more genomes from
other populations were also included. The first phase was completed and reported in 2012.
Currently, 2,577 genome samples from 26 populations are available online on the official 1000
Genomes Project website (www.1000genomes.org/).
2.5.3 Pan-Asian Population Genomics Initiative (PAPGI): PAPGI is the second version of the
Pan-Asian SNP consortium, which was successfully completed by scientists from China, India, Japan,
South Korea, Singapore and Thailand, with additional support from Indonesia, Malaysia, the
Philippines and Taiwan. The current version is being assisted by Middle Eastern countries, including
Saudi Arabia, Kuwait and the UAE, which are participating generously in data production
and analysis (Ranganathan et al., 2012). The goals of this project are to study Asian genomes
and to correlate them with local adaptation, population migration, and genetic variation related
to phenotypes and genetic disorders (Ngamphiw et al., 2011). The consortium is helping the
research community to understand human evolution and medical applications (www.papgi.org).
2.5.4 One Million Genomes: The American government is going to spend $215 million on a
“personalized medicines” initiative, which will include genetic health care information from
volunteers (Insel et al., 2015). The funds allocated for this project also cover the study of
cancer and other rare diseases. A biobank will be created where millions of genomes from
Americans will be stored. Later, these genotypes will be used for establishing precision medicine
(Collins and Varmus 2015).
2.5.5 Human Genome Diversity Project (HGDP): Stanford researchers started this project in
collaboration with the Centre d’Etude du Polymorphisme Humain (CEPH) in Paris to study human genetic
evolution. Approximately 1,043 samples from 53 diverse populations were studied (Bryc et al.,
2010). About 650K single nucleotide variants were genotyped in these samples using a microarray chip
developed by Illumina. The data were collected from Africa, Europe, Asia and the USA (Rosenberg 2006).
Some of the HGDP samples have been whole-genome sequenced (WGS) at ~30X coverage by the Simons
Foundation.
2.5.6 Billion Genomes Project: BiG was started by the Theragen BiO Institute, San Diego. The idea is to
sequence every individual human living on earth and to understand the unknown genetic
information of human populations around the globe (http://billiongenome.com).
2.5.7 Other Genome Consortiums: Many other genome projects and consortia have also been
established by different countries, including the Singapore Genome Variation Project (Teo et
al., 2009), the Indian Genome Variation Project (www.igvdb.res.in), the Malaya genome project (Wong
et al., 2013), the Korean Personal Genome Project (Zhang et al., 2014), the African Genome
Variation Project (Gurdasani et al., 2015), the Genome Arabia Project and the Iranian Genome Project
(irangenes.com).
Chapter 3
MATERIAL AND METHODS
3. Materials and Methods
3.1 Subject Selection and enrollment of participant and ethical statement:
This study was performed in accordance with the Declaration of Helsinki and
was approved by the Institutional Review Board of the Genome Research Foundation
(GRF) under reference IRB-REC-2011-10-003. Signed informed consent was obtained from the
participant in this study to publish the entire content of his genome, as well as personal
identifying information (such as age, sex and location).
There are documented cases of hypertension, heart problems, neurological disorders,
diabetes and obesity among his family members. His father has been diagnosed with
cardiovascular disorder, hypertension and Alzheimer’s disease. His mother has osteoarthritis, and
his grandparents died of heart attack, cancer and hypertension.
Figure 3.1: Family pedigree of donor (red), with members having genetic disorders.
3.2 Data sources:
The UCSC reference genome (hg19, February 2009), dbSNP version 137 and
genome annotations were retrieved from the UCSC database (www.genome.ucsc.edu). Variant
call format (VCF) files were retrieved from several publicly available databases: 41
samples from 9 diverse populations (African ancestry in Southwest USA; Utah residents
with Northern and Western European ancestry from the CEPH collection; Han Chinese in
Beijing, China; Gujarati Indian in Houston, Texas, USA; Japanese in Tokyo, Japan;
Luhya in Webuye, Kenya; Maasai in Kinyawa, Kenya; Toscani in Italy and Yoruba in
Ibadan, Nigeria) were collected from Complete Genomics Inc.
(www.completegenomics.com), and five samples were taken from the Korean Personal
Genome Project (KPGP) (www.kpgp.kr). Data for twelve South Asian populations from the
CEPH-HGDP panel, genotyped on 650K SNP arrays, were also downloaded from
the public databases of Stanford University (http://www.hagsc.org/hgdp/files.html).
3.3 DNA Extraction:
Genomic DNA was extracted from the arterial blood lymphocytes of a 30-year-old
healthy male individual, who was reported to be of Pakistani Pakhtun ethnicity for
at least three generations. A consent form was signed prior to the collection of the blood
sample from which genomic DNA was extracted. The QIAamp DNA Blood
Mini Kit (Qiagen) was used for DNA extraction from the blood. Tecan’s Infinite F200
nanodrop was used to assess DNA purity, 1.7% agarose gel electrophoresis to confirm
DNA size (presence of high molecular weight DNA) and Invitrogen’s Qubit fluorometer
to determine the DNA concentration.
3.4 Cytogenetic Analysis:
Karyotyping was carried out on cultured peripheral blood lymphocytes using
standard techniques, and GTG banding was used to identify chromosomal aberrations.
This method is useful for identifying genetic diseases through the photographic representation
of the entire chromosome complement (Speicher et al., 2005). The blood sample was frozen
and stained using trypsin, and was then observed under a microscope. The bands
were pronounced, and we were able to mark the normal male genetic traits while also
recording any slight abnormalities.
Figure 3.2: Cytogenetic analysis through GTG banding karyotype and legends.
3.5 Library preparation and Whole Genome Sequencing:
A total of 1.1 μg of gDNA was used to generate two paired-end libraries suitable for the
HiSeq sequencing platform (Illumina), prepared using the TruSeq DNA Preparation
Kit following Illumina’s standard protocol (Paired-End Library Preparation Kit, Illumina,
San Diego, CA, USA). Quality control analysis of the library using an Agilent 2100
Bioanalyzer indicated that the library was of acceptable quality, containing the expected
fragment size and yield, for continued sample processing. The library was loaded into
three flow cell lanes; cluster generation was performed on an Illumina cBot System,
and the libraries were sequenced on an Illumina HiSeq
2000 following the paired-end protocol. Low-quality reads were eliminated from
the final output of the sequencing machine.
Figure 3.3: Illumina HiSeq2000 Machine and accessories. (http://qbi.uq.edu.au)
Shearing of gDNA was done using a Covaris S series instrument (Covaris, MA, USA).
Following end repair, A-tailing and adaptor ligation, DNA in the 500-600 bp range was
purified from a 2% agarose gel. Polymerase chain reaction (PCR) was performed using
the following cycling profile: initial denaturation at 98°C for 30 sec, followed by 10
cycles of 98°C for 30 sec and 60°C for 30 sec, and a final extension step at 72°C for 5 min.
Proper DNA size was then confirmed with the Agilent Bioanalyzer, followed by qPCR
quantification with the Roche LightCycler 480 II and Kapa Biosystems reagents. The
remainder of our analyses was initiated from the FASTQ files provided by Illumina's
downstream analysis CASAVA software suite.
Figure 3.4: Library quality generated by BioAnalyzer.
3.6 Workflow for Genomic Data Analysis:
A custom workflow was created for the analysis of the genome. This included
calling variations from the alignments and comparing them with other variant databases, including
dbSNP (Sherry et al., 2001), the Database of Genomic Variants (Iafrate et al., 2004, Feuk et al.,
2006) and those from the 1000 Genomes Consortium (www.1000genomes.org). The
workflow further included the mapping and comparison of markers associated with
damaging variants and pharmacogenomic traits. Multiple analytical approaches were
added to the workflow to assess the influence of ancestral contributions within a personal
genome along with the historical background of the region. The detailed components of
the analysis workflow are given in Figure 3.5.
The NGS data analysis pipeline was developed in Python. It was designed to run
on UNIX systems and was tested on a Red Hat
Enterprise Linux (RHEL) server v5.6. It uses the Modules package to provide dynamic
modification of a user's environment (e.g. changing the path and version of Python) via
module files. Its MapReduce approach was implemented mainly on top of a custom
Simple Job Management framework (SJM), which currently supports Sun Grid Engine but
can easily be extended to support other batch systems. Each step in the pipeline was
implemented in a separate Python script, and the job description file generated for SJM is
in a human-readable format.
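The idea of running each step only after the steps it depends on can be sketched as follows. This is a minimal illustration and not the thesis pipeline or the SJM job file format: the step names are paraphrased from this chapter, while `PIPELINE_STEPS` and `execution_order` are hypothetical names, and a plain topological sort stands in for the job manager.

```python
from collections import deque

# Hypothetical step graph: each step maps to the steps it depends on.
PIPELINE_STEPS = {
    "align_bwa": [],
    "sam_to_bam": ["align_bwa"],
    "sort_bam": ["sam_to_bam"],
    "remove_duplicates": ["sort_bam"],
    "local_realignment": ["remove_duplicates"],
    "call_variants": ["local_realignment"],
    "annotate_variants": ["call_variants"],
}


def execution_order(steps):
    """Return an order in which every step runs after its dependencies."""
    remaining = {name: set(deps) for name, deps in steps.items()}
    ready = deque(sorted(name for name, deps in remaining.items() if not deps))
    order = []
    while ready:
        current = ready.popleft()
        order.append(current)
        # Release any step whose last outstanding dependency just finished.
        for name in sorted(remaining):
            deps = remaining[name]
            if current in deps:
                deps.remove(current)
                if not deps:
                    ready.append(name)
    if len(order) != len(steps):
        raise ValueError("cyclic or unknown dependency in pipeline steps")
    return order


if __name__ == "__main__":
    for step in execution_order(PIPELINE_STEPS):
        print(step)
```

A batch system would submit each released step as a job rather than printing it; the ordering logic is the same.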
Figure 3.5: Workflow of the next generation sequencing and bioinformatics data analysis (Koboldt et al.,
2010).
3.7 Sequence alignment:
The input reads are generally in FASTQ format. BWA version 0.5.9 was used for
sequence alignment against the human reference genome hg19 (Li and Durbin, 2009).
BWA is a software package for mapping low-divergent reads against a
reference genome. It comprises three algorithms: BWA-backtrack, BWA-SW and
BWA-MEM. BWA-backtrack is designed for Illumina reads of up to 100 bp, while BWA-SW
and BWA-MEM are intended for longer sequences (70 bp to 1 Mbp). BWA-MEM also generally
performs better than BWA-backtrack for the 70-100 bp reads generated with Illumina.
Illumina quality scores were converted into Sanger quality scores by BWA. The
multithreading option was enabled with two concurrent threads for generating the suffix
array (SA) coordinates during mapping. The original alignment output, which was in SAM
format, was converted into BAM using SAMtools version 0.1.14. SAMtools is a package that
supports variant calling and alignment visualization along with other operations such as sorting,
indexing, data extraction and file conversion. SAM files are usually large and are therefore
compressed into BAM format to save hard disk space. Because BAM files are binary, they
cannot be read directly as text; SAMtools makes it possible to work directly with a
compressed BAM file without having to uncompress the complete file (Li et al., 2009). SAM
and BAM files hold detailed information about the reads, along with references, alignments,
quality information and user-specified annotations, which can be removed with SAMtools.
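The quality score conversion mentioned above can be illustrated with a short sketch. It assumes the Illumina 1.3+ encoding (Phred score plus ASCII offset 64) as input and the Sanger encoding (offset 33) as output; the thesis does not state the exact encoding version, and the function name `illumina_to_sanger` is hypothetical. BWA performs this conversion internally, so this is only an illustration of the arithmetic.

```python
ILLUMINA_OFFSET = 64  # assumed Illumina 1.3+ (Phred+64) encoding
SANGER_OFFSET = 33    # Sanger (Phred+33) encoding


def illumina_to_sanger(quality: str) -> str:
    """Re-encode a Phred+64 quality string as Phred+33."""
    converted = []
    for char in quality:
        phred = ord(char) - ILLUMINA_OFFSET
        if phred < 0:
            raise ValueError(f"{char!r} is not a valid Phred+64 character")
        converted.append(chr(phred + SANGER_OFFSET))
    return "".join(converted)


if __name__ == "__main__":
    # 'h' encodes Q40, '@' encodes Q0 and 'B' encodes Q2 in Phred+64.
    print(illumina_to_sanger("h@B"))
```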
Sorting of the BAM files was done with Picard (http://picard.sourceforge.net)
version 1.32, and binning of the BAM files by chromosome was performed using SAMtools.
Picard was used to remove duplicates in alignments, whereas GATK version 1.0.5506
was used for local realignment and base quality checking (McKenna et al., 2010).
GATK (the Genome Analysis Toolkit) is a toolkit for analyzing high-throughput sequencing data. It
offers a wide variety of packages, with special emphasis on variant calling,
genotyping and data quality control (Figure 3.6).
3.8 SNP and Indel detection:
The Unified Genotyper in GATK version 1.0.5506 was used for SNP and indel
detection, with the call confidence threshold set to 30.0 and the emit confidence threshold set to 10.0.
The Dindel model was enabled for indel calling. Filter labels were applied using the Variant Filtration
program in GATK to variants with allele balance (AB) greater than 0.75, quality score (QUAL) less
than 50.0, depth of coverage (DP) greater than 360, strand bias (SB) greater than -0.1, or
mapping quality zero reads (MQ0) greater than or equal to 4. The mpileup function in
SAMtools/BCFtools version 0.1.14 was also used for SNP and indel detection. The
generated VCF files were concatenated and merged using VCFtools version 0.1.5 and
indexed using Tabix version 0.2.4 (Danecek et al., 2011).
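The filter criteria listed above amount to a set of independent threshold checks, which can be sketched as below. This is an illustration only and not the GATK Variant Filtration program: the function name and argument layout are hypothetical, and the strand-bias cut-off of -0.1 is an assumed reading of the value given in the text.

```python
def filter_labels(ab, qual, dp, sb, mq0):
    """Return the filter labels a variant would receive (empty list = pass)."""
    labels = []
    if ab is not None and ab > 0.75:   # allele balance (absent for some sites)
        labels.append("AB")
    if qual < 50.0:                    # variant quality score
        labels.append("QUAL")
    if dp > 360:                       # depth of coverage
        labels.append("DP")
    if sb > -0.1:                      # strand bias (assumed cut-off)
        labels.append("SB")
    if mq0 >= 4:                       # reads with mapping quality zero
        labels.append("MQ0")
    return labels


if __name__ == "__main__":
    # Hypothetical annotation values for two variants.
    print(filter_labels(ab=0.50, qual=120.0, dp=30, sb=-5.0, mq0=0))
    print(filter_labels(ab=0.80, qual=40.0, dp=400, sb=0.0, mq0=4))
```

Returning labels rather than a single pass/fail flag mirrors the way failed criteria are recorded in the FILTER column of a VCF file.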
3.9 Copy Number Variation Detection:
Copy number variations have traditionally been studied using array-based technologies, but
their resolution is limited, hybridization reduces accuracy, and their predefined probes
are incompatible with novel CNV detection. Next generation sequencing is an emerging
technology with rapidly falling costs that detects CNVs with higher resolution and
accuracy. ReadDepth 0.9.7 was used for identification of copy number variations with bin
size 0.01 (Miller et al., 2011). Copy number calls smaller than 1.3 were taken as losses and
those greater than 2.6 as gains. ReadDepth is a tool developed in the R programming language for CNV
discovery. It calls CNVs on the basis of sequence depth and then invokes a circular
binary segmentation algorithm to call segment boundaries. It also allows explicit
control of the false discovery rate (FDR), which minimizes the number of false positive
CNVs detected.
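The gain and loss thresholds stated above can be expressed as a small classification rule. The sketch below is illustrative and not part of ReadDepth: the function name, the "neutral" label and the example segments are hypothetical; only the cut-offs 1.3 and 2.6 come from the text.

```python
LOSS_BELOW = 1.3   # copy number calls smaller than this are losses
GAIN_ABOVE = 2.6   # copy number calls greater than this are gains


def classify_copy_number(copy_number: float) -> str:
    """Label a segment's estimated copy number as loss, gain or neutral."""
    if copy_number < LOSS_BELOW:
        return "loss"
    if copy_number > GAIN_ABOVE:
        return "gain"
    return "neutral"


if __name__ == "__main__":
    # Hypothetical segments: (chromosome, start, end, estimated copy number).
    segments = [
        ("chr1", 100000, 150000, 0.9),
        ("chr2", 500000, 620000, 2.1),
        ("chr7", 300000, 410000, 3.4),
    ]
    for chrom, start, end, copy_number in segments:
        print(chrom, start, end, classify_copy_number(copy_number))
```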
3.10 Functional annotation
Functional annotation of genome variants is the process of attaching
biological information to sequences, which includes identifying elements on the
genome and assigning biological meaning to these elements. This process is also called
gene prediction. Automatic annotation tools such as ANNOVAR perform these
analyses computationally (Wang et al., 2010).
All the detected variants obtained in VCF format using SAMtools were then
annotated with ANNOVAR. The UCSC known genes and RepeatMasker databases were
used for gene and repeat annotations, respectively. DGV
(http://projects.tcag.ca/variation/), SIFT (Ng and Henikoff, 2003), PolyPhen2 (Jordan et
al., 2011) and ClinVar (Landrum et al., 2013) were used for functional annotation.
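Selecting damaging nsSNVs from annotated output can be sketched as a simple filter. The SIFT cut-off of 0.05 follows the header of Table 4.4; requiring both predictors to agree is an assumption, since the text does not state how the SIFT and PolyPhen2 calls were combined. The record layout and the second example variant are hypothetical.

```python
SIFT_CUTOFF = 0.05  # SIFT scores at or below this are treated as damaging


def is_damaging(sift_score: float, polyphen_call: str) -> bool:
    """True when SIFT and PolyPhen2 both predict a damaging change (assumed rule)."""
    return sift_score <= SIFT_CUTOFF and polyphen_call.strip().lower() == "damaging"


if __name__ == "__main__":
    variants = [
        # Taken from Table 4.4.
        {"gene": "AP4B1", "aa": "E232G", "sift": 0.00, "polyphen2": "Damaging"},
        # Hypothetical tolerated variant, for contrast only.
        {"gene": "GENE_X", "aa": "A10V", "sift": 0.40, "polyphen2": "Benign"},
    ]
    damaging = [v for v in variants if is_damaging(v["sift"], v["polyphen2"])]
    for variant in damaging:
        print(variant["gene"], variant["aa"])
```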
Figure 3.6: Schematic representation of the pipeline developed.
3.11 Pharmacogenomics Analysis
Functionally damaging nonsynonymous SNVs were used to identify the affected genes involved in
drug transport and metabolism, and drug targets were retrieved from DrugBank and PharmGKB
(Hewett et al., 2002, Wishart et al., 2008). Variants associated with pharmacogenomic
traits were collected manually from the literature and other data sources. A Perl script was used
to find the overlaps between the two sets (Figure 3.7). The clinically associated variants have been
recommended for testing. The methodology used for the pharmacogenomics analysis has
been reported previously (Salleh et al., 2013).
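The overlap step described above is a set intersection keyed on variant identifiers. The sketch below shows the idea in Python rather than Perl, so it is not the script used in the study. The two dictionaries are illustrative: the rsIDs for CYP2D6 and TPMT and their drugs appear in Table 4.6, while the entries marked hypothetical are placeholders.

```python
def overlapping_variants(damaging, pharmacogenomic):
    """Return rsIDs present in both sets, with the gene and drug for each."""
    shared = set(damaging) & set(pharmacogenomic)
    return {rsid: (damaging[rsid], pharmacogenomic[rsid]) for rsid in sorted(shared)}


if __name__ == "__main__":
    # rsID -> gene for damaging nsSNVs.
    damaging = {
        "rs1065852": "CYP2D6",
        "rs1142345": "TPMT",
        "rs0000001": "GENE_X",      # hypothetical
    }
    # rsID -> drug for pharmacogenomic variants collected from the literature.
    pharmacogenomic = {
        "rs1065852": "debrisoquine",
        "rs1142345": "thioguanine",
        "rs0000002": "drug_y",      # hypothetical
    }
    for rsid, (gene, drug) in overlapping_variants(damaging, pharmacogenomic).items():
        print(rsid, gene, drug)
```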
Figure 3.7: Schema of the Pharmacogenomics analysis (Salleh et al., 2013).
3.12 Multidimensional Scaling and ADMIXTURE:
A total of 52 samples from 13 different ethnic groups, including the Pakistani (Pathan) genome,
were used for admixture, phylogenetic and MDS analyses. Complete genome variant files from
Complete Genomics Inc., USA, were downloaded from publicly available data. The samples
include Africans in the USA, European individuals from the CEPH collection, Han Chinese,
Gujarati Indian, Japanese, Puerto Rican, Luhya Kenyan, Maasai Kenyan, Mexican ancestry,
Italian and Yoruba (Drmanac et al., 2010), and five genomes were obtained from the Korean
Personal Genome Project (www.kpgp.kr). VCFtools was used to merge all the samples. The dataset
was restricted to the 607,578 SNVs that were available in all samples and that also passed quality
control. PLINK was then used to prepare the data for admixture studies (Purcell et al., 2007).
Admixture analysis was performed using the program ADMIXTURE to identify the
ancestral relationships of the Pathan genome with the others (Alexander et al., 2009). We explored
values of K from K = 2 to K = 13. Ancestry painting was performed with the help of a
publicly available tool, INTERPRETOME, by analyzing individual genome information
(Karczewski et al., 2012). To describe how our genome clustered with the other populations,
a multidimensional scaling (MDS) plot was constructed using PLINK. Pairwise identity-by-state (IBS)
distances were calculated between all individuals using the 607,578 SNV markers, and MDS
components were obtained using the --mds-plot option based on the IBS matrix.
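The pairwise IBS distance underlying the MDS analysis can be sketched in a few lines. This is a minimal illustration and not PLINK's implementation: genotypes are assumed to be coded as counts of the alternate allele (0, 1 or 2, with `None` for missing calls), the distance is taken as one minus the proportion of alleles shared identical by state, and the sample names and genotypes are hypothetical. The MDS step itself is not shown.

```python
def ibs_distance(genotypes_a, genotypes_b):
    """Return 1 - (proportion of alleles shared IBS) between two samples."""
    shared = 0
    compared = 0
    for a, b in zip(genotypes_a, genotypes_b):
        if a is None or b is None:   # skip sites missing in either sample
            continue
        shared += 2 - abs(a - b)     # 2, 1 or 0 alleles shared at this site
        compared += 1
    if compared == 0:
        raise ValueError("no SNVs genotyped in both samples")
    return 1.0 - shared / (2 * compared)


def ibs_matrix(samples):
    """Return a dict mapping (sample, sample) pairs to IBS distances."""
    names = list(samples)
    return {
        (first, second): ibs_distance(samples[first], samples[second])
        for first in names
        for second in names
    }


if __name__ == "__main__":
    # Hypothetical genotypes at four SNVs.
    samples = {
        "PTN": [0, 1, 2, 2],
        "sample_A": [0, 1, 1, 2],
        "sample_B": [2, 1, 0, None],
    }
    for pair, distance in ibs_matrix(samples).items():
        print(pair, round(distance, 3))
```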
3.13 Pairwise Sequentially Markovian Coalescent Analysis
We conducted a PSMC (Pairwise Sequentially Markovian Coalescent) analysis to reconstruct
the demographic history of the Pathan population (Li and Durbin 2012). We compared the Pathan
genome to a set of 11 HGDP genomes from around the world (as published by Meyer et al.). We
first used SAMtools to extract the diploid genomes from their BAM files aligned to hg19, and
excluded sex chromosomes and mitochondrial genomes because they are haploid. In PSMC, we
used the command line options -N25 -t15 -r5 -p "4+25*2+4+6", which have been used successfully
in previous similar analyses of humans and great apes (Prado-Martinez et al., 2013).
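The `-p` pattern above describes how PSMC groups atomic time intervals into free parameters: a term `n` is one parameter spanning `n` intervals, and a term `a*b` is `a` parameters each spanning `b` intervals. The sketch below decodes such a pattern; the helper name `parse_psmc_pattern` is hypothetical and is not part of PSMC.

```python
def parse_psmc_pattern(pattern: str):
    """Return (atomic time intervals, free interval parameters) for a -p pattern."""
    n_intervals = 0
    n_parameters = 0
    for term in pattern.split("+"):
        if "*" in term:
            repeats, width = (int(part) for part in term.split("*"))
        else:
            repeats, width = 1, int(term)
        n_intervals += repeats * width
        n_parameters += repeats
    return n_intervals, n_parameters


if __name__ == "__main__":
    intervals, parameters = parse_psmc_pattern("4+25*2+4+6")
    print(intervals, parameters)
```

For the pattern used here, the first parameter spans 4 intervals, the next 25 span 2 intervals each, and the last two span 4 and 6 intervals.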
3.14 Phylogenomics Analysis
One of the most important aspects of evolutionary biology is understanding the relationships
among species. Single nucleotide variants (SNVs), also known as SNPs, generated
through sequencing, genotyping and related technologies enable phylogeny
reconstruction by providing extraordinary numbers of characters for investigation (Miller et al.,
2013). In the current study, a SNP-based phylogeny was constructed after SNPs had been identified
in all individuals and compiled. The neighbor-joining tree was generated using pairwise FST
calculated for all ethnic samples from the population allele frequencies across all autosomal
variants. The program “Neighbor” from PHYLIP was used to construct all bootstrap trees (Saitou
and Nei, 1987), and MEGA5 was then used to visualize them (Tamura et al., 2011). The Yoruba
population was used as an out-group to root the phylogenetic tree.
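A pairwise FST value can be computed from population allele frequencies as sketched below. The text does not say which FST estimator was used, so this sketch assumes a simple heterozygosity-based form, FST = (HT - HS) / HT, summed over SNVs with equal weight for the two populations and no sample-size correction. The function name and example frequencies are hypothetical. The resulting distances would then be passed to a neighbor-joining program, which is not shown.

```python
def pairwise_fst(freqs_a, freqs_b):
    """Heterozygosity-based FST between two populations (assumed estimator).

    freqs_a and freqs_b hold the alternate-allele frequency of each SNV
    in the two populations, in the same SNV order.
    """
    numerator = 0.0
    denominator = 0.0
    for p_a, p_b in zip(freqs_a, freqs_b):
        p_mean = (p_a + p_b) / 2
        h_total = 2 * p_mean * (1 - p_mean)                       # pooled heterozygosity
        h_within = (2 * p_a * (1 - p_a) + 2 * p_b * (1 - p_b)) / 2  # mean within-population
        numerator += h_total - h_within
        denominator += h_total
    return numerator / denominator if denominator > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical allele frequencies at three SNVs in two populations.
    population_1 = [0.10, 0.50, 0.80]
    population_2 = [0.30, 0.45, 0.20]
    print(round(pairwise_fst(population_1, population_2), 4))
```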
Chapter 4 RESULTS
4. Results
4.1 Genome Sequencing and Variants Identification:
DNA extracted from blood was sequenced with paired-end reads of 90 bp using the
Illumina HiSeq 2000 sequencer, producing 1,069,127,687 reads. A total of 83.3 Gb of
sequence was generated and aligned to the human reference genome (without Ns,
2,861,343,702 bp), covering 98.2% of the reference genome at an average depth of 28.5× (Table
4.1).
Table 4.1: Summary of data production and mapping results
Read length (bp)         90
No. of reads             1,069,127,687
No. of mapped reads      992,124,335
Mapped reads (%)         92.80
No. of nucleotides       89,385,267,060 (83.25 Gb)
Mapping depth            28.5×
We identified a total of 3,813,440 SNVs, of which 3,683,999 (96.6%) were reported in
the dbSNP database (Sherry et al., 2001) and 129,441 were novel (Table 4.2). The novel SNV count was
further compared with the novel variant counts of other individual genomes from the literature
(Figure 4.1). There were 1,272,912 homozygous and 2,540,528 heterozygous SNVs. A total
of 18,547 SNVs were found in coding DNA sequence (CDS) regions, 25,481 in 3’
untranslated regions (UTRs), and 4,969 in 5’ UTRs. A total of 10,315 SNVs in 5,344 genes
were non-synonymous (nsSNVs).
Table 4.2: Summary of SNVs found in Pathan’s genome and overlaps with dbSNP137
Total SNVs                        3,813,440
Homozygous SNVs                   1,272,912
Heterozygous SNVs                 2,540,528
SNVs mapped to dbSNP (v137)       3,683,999
% of SNVs mapped to dbSNP         96.6%
Novel SNVs                        129,441
% of Novel SNVs                   3.39%
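The counts in Table 4.2 can be cross-checked with a few lines of arithmetic. The sketch below uses only the values reported in the table; the variable names are hypothetical.

```python
# Counts as reported in Table 4.2.
total_snvs = 3_813_440
homozygous = 1_272_912
heterozygous = 2_540_528
in_dbsnp = 3_683_999
novel = 129_441

# Zygosity classes and dbSNP status should each partition the total.
zygosity_sum = homozygous + heterozygous
status_sum = in_dbsnp + novel

percent_in_dbsnp = 100 * in_dbsnp / total_snvs
percent_novel = 100 * novel / total_snvs

if __name__ == "__main__":
    print("homozygous + heterozygous equals total:", zygosity_sum == total_snvs)
    print("dbSNP + novel equals total:", status_sum == total_snvs)
    print(f"reported in dbSNP: {percent_in_dbsnp:.1f}%")
    print(f"novel: {percent_novel:.2f}%")
```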
A total of 504,276 short indels (up to ±20 bases) were observed, of which 306,128
were found in intergenic regions, 237 in CDS regions, and 193,308 in intron regions.
Additionally, 1,503 CNVRs were found, 713 of which were classed as duplicated and 790 as
deleted, affecting 2,364 overlapped genes (Table 4.3).
Table 4.3: Variants (SNVs, Indels and CNVRs) identified in Pakistani (PTN) genome.
                      SNVs                Indels     CNVRs
Total                 3,813,440           504,276    1,503
Intergenic            2,376,933           306,128    866
Novel                 129,441             ---        65
Homozygous            1,272,912           190,463    ---
Heterozygous          2,540,528           313,813    ---
Synonymous SNVs       9,639               ---        ---
nonSynonymous SNVs    10,315              ---        ---
CDS                   18,547              237        253
Intron                1,387,430           193,308    220
3' UTR                25,481              4,149      5
5' UTR                4,969               399        17
Reported              3,683,999 (dbSNP)   ---        1,438 (DGV)
A total of 65 CNVRs had not previously been described in the database of genomic
variants (DGV; http://projects.tcag.ca/variation/). Figure 4.2 shows the number of gained and
lost CNVRs in each chromosome. ANNOVAR was used for detailed annotation analysis of
CNVRs to identify genes associated with these regions.
Figure 4.1: Novel SNVs in personal genomes in thirteen different ethnic groups. Scatter plot showing novel
variants reported in personal genomes. Data collected from literature.
Figure 4.2: Copy number variation counts distributed in each chromosome.
4.2 Functional Classification and Clinical Relevance of Variants:
All 10,315 nsSNVs found in the Pakistani (PTN) genome were further scrutinized for
their possible functional effects using computational prediction methods (SIFT and
PolyPhen2), resulting in 43 nsSNVs in 43 genes being classified as functionally damaging
(Table 4.4). Additionally, nsSNVs were annotated using ClinVar for their clinical relevance,
and we found that 31 coding SNVs are associated with several diseases (Table 4.5). Of
particular note are an SNV (rs1049296, Pro570Ser) in the TF gene, which affects
Alzheimer’s susceptibility (Wang et al., 2013), and Ser217Leu (rs4792311) in the ELAC2 gene,
which is implicated in genetic susceptibility to hereditary prostate cancer (Alvarez-Cubero et al.,
2013). The rate of prostate cancer is low in Pakistan (3.8%) (Aziz et al., 2003) compared
with Americans and Caucasians (Bhurgri et al., 2009). Three coding SNVs, in GHRLOS
(rs696217, Leu72Met), SERPINE1 (rs6092, Ala15Thr) and PPARG (rs1801282, Pro12Ala),
all have links with obesity (Gueorguiev et al., 2009, Bouchard et al., 2010, Galbete et
al., 2013). About 22.2% of Pakistanis are reported to be obese, which is close to the rates in European
(~24%) and United States (~19%) populations (Flegal et al., 2010, Kopelman et al., 2009).
We also found three pathogenic SNVs in genes associated with hair, skin and
pigmentation: EDAR (rs3827760, Val370Ala), SLC45A2 (rs16891982, Phe374Leu), and TYR
(rs1042602, Ser192Tyr) (Tan et al., 2013, Spichenok et al., 2011, Sulem et al., 2007). In
addition, we detected an SNV (rs17822931, Gly180Arg) in ABCC11, which is responsible for
wet earwax and was also found in the Pakistani PK1 genome (Yoshiura et al., 2006).
Figure 4.3: Comparative variant count of other reported individual genomes with Pakistani (PTN) genome.
Graphical representation of comparative study of PTN SNVs with other personal genomes reported previously.
One of the variants (rs1065852, Pro34Ser) in the CYP2D6 gene is responsible for
poor metabolism of debrisoquine, an adrenergic-blocking medication used for the treatment
of hypertension (Zheng et al., 2013). Also, two SNVs in TPMT (rs1142345, Tyr240Cys
and rs1800460, Ala154Thr) are known to have a pathogenic effect and lead to thiopurine
methyltransferase (TPMT) deficiency (Li et al., 2013, Corrigan et al., 2013). Moreover, two
nsSNVs (rs2056899 and rs140980900) in the CYP4A22 and GGT5 genes of the arachidonic acid
metabolism pathway were found. Arachidonic acid in the human body usually comes from
dietary animal sources, such as meat, eggs, and dairy. Meat is an important part of the diet of
people living in northwestern Pakistan; it is usually consumed at least once a day, often in the
form of kabab (minced meat fried in oil) or curry (Lindholm 2004).
Table 4.4: Functionally damaged novel nsSNVs.
CHR    POS        REF ALT AA       GENE       SIFT (≤ 0.05)  Polyphen2
chr1   114442945  T   C   E232G    AP4B1      0.00           Damaging
chr1   235976331  G   C   L75V     LYST       0.00           Damaging
chr1   113253928  C   T   G336R    PPM1J      0.01           Damaging
chr1   156242159  G   T   A222E    SMG5       0.01           Damaging
chr10  73475893   G   A   R68C     C10orf105  0.02           Damaging
chr11  128839275  C   G   G1931R   ARHGAP32   0.00           Damaging
chr11  46388863   C   T   L251F    DGKZ       0.04           Damaging
chr11  607617     G   A   G720R    PHRF1      0.01           Damaging
chr12  46757591   C   A   M324I    SLC38A2    0.03           Damaging
chr12  21457414   C   A   G179V    SLCO1A2    0.00           Damaging
chr12  8327035    C   G   H42Q     ZNF705A    0.00           Damaging
chr14  71445083   C   T   R677W    PCNX       0.01           Damaging
chr15  45426095   G   A   R31Q     DUOX1      0.04           Damaging
chr15  42041072   T   C   L1817P   MGA        0.00           Damaging
chr16  70524280   C   T   V555M    COG4       0.04           Damaging
chr16  27782929   A   G   E1385G   KIAA0556   0.05           Damaging
chr16  75147696   A   G   L324P    LDHD       0.00           Damaging
chr17  36003399   G   C   D17E     DDX52      0.01           Damaging
chr17  78082104   C   A   P324Q    GAA        0.00           Damaging
chr17  2995813    T   G   T160P    OR1D2      0.00           Damaging
chr17  7324288    C   A   D98E     SPEM1      0.00           Damaging
chr18  10487685   G   A   G399S    APCDD1     0.01           Damaging
chr18  55143927   C   G   S496C    ONECUT2    0.00           Damaging
chr19  4513548    C   T   G128R    PLIN4      0.01           Damaging
chr2   42990263   C   T   V353M    OXER1      0.01           Damaging
chr2   179439827  G   C   Q23678E  TTN        0.00           Damaging
chr2   98779387   C   G   I354M    VWA3B      0.05           Damaging
chr21  34924337   C   G   P934A    SON        0.02           Damaging
chr22  50307056   G   A   S91F     ALG12      0.01           Damaging
chr4   69796409   G   A   P387S    UGT2A3     0.00           Damaging
chr5   65290677   G   A   D98N     ERBB2IP    0.04           Damaging
chr5   154320687  T   A   L6Q      MRPL22     0.00           Damaging
chr5   140475629  T   A   Y419N    PCDHB2     0.00           Damaging
chr6   56879992   G   T   K120N    BEND6      0.00           Damaging
chr6   32188296   C   T   G349S    NOTCH4     0.00           Damaging
chr6   84234199   G   A   G347S    PRSS35     0.00           Damaging
chr7   73634930   G   C   R94S     LAT2       0.05           Damaging
chr8   28989961   C   G   E936Q    KIF13B     0.02           Damaging
chr8   81897091   C   T   D266N    PAG1       0.05           Damaging
chr8   110476498  C   A   H2479Q   PKHD1L1    0.01           Damaging
chr8   142228631  C   T   D319N    SLC45A4    0.03           Damaging
chr9   135863848  G   T   C168F    GFI1B      0.02           Damaging
chrX   152801794  C   T   T30M     ATP2B3     0.00           Damaging
Comparative genomic analysis was performed between the Pakistani genome sequenced here, symbolized as
“PTN”, and the previously published Pakistani (PK1) genome. Non-synonymous
variants from the PK1 genome were annotated to investigate associated diseases.
Out of ~8,000 nsSNVs, only 37 variants (three novel) were found to be linked with certain
disorders. Eight clinically relevant SNVs overlapped with the PTN genome. In the PK1 genome
we found no damaging variants responsible for Alzheimer’s, obesity and heart-related diseases
of the kind found in the PTN genome. An SNV (rs1057910; CYP2C9) known to affect
warfarin response was observed in the PK1 genome. Moreover, a pathogenic mutation (rs1169305)
was seen in the HNF1A gene, which may become a cause of diabetes in the PK1 individual.
Most of the clinically relevant variants adopted in this study were originally described
in Caucasian populations. While this result might be a consequence of the genomic affinities
of the PTN genome with other Caucasian populations, it might also reflect a bias due to most
of the GWAS work being carried out on Caucasian populations (Ayub and Tyler-Smith
2009). Therefore a cohort study in the Pakistani population will be required for
authentication.
4.3 Pharmacogenomics Analysis:
Damaging nsSNVs were annotated using the PharmGKB and DrugBank databases
(Hewett et al., 2002, Thorn et al., 2013, Wishart et al., 2008). A number of
variants were found to be linked with susceptibility to drug toxicity, while the remaining nsSNVs
were associated with the efficacy of drugs used in the treatment of diseases such as depression
and diabetes mellitus (Table 4.6).
Table 4.5: Clinically relevant coding SNVs in Pakistani PTN whole genome.
Chr    Position   rsID         Ref Alt Clinical Significance  Description
chr1   115236057  rs17602729   G   A   Pathogenic             Muscle AMP deaminase deficiency (MMDD)
chr2   49189921   rs6166       C   T   Association            Ovarian hyperstimulation syndrome (OHSS)
chr2   49191041   rs6165       C   T   drug response          Ovarian response to FSH stimulation
chr2   109513601  rs3827760    A   G   Pathogenic             Hair morphology
chr2   215813331  rs726070     C   T   Pathogenic             Autosomal recessive congenital ichthyosis 4B (ARCI4B)
chr3   10331457   rs696217     G   T   Pathogenic             Obesity
chr3   12393125   rs1801282    C   G   Pathogenic             Obesity
chr3   15686693   rs13078881   G   C   Pathogenic             Biotinidase deficiency
chr3   133494354  rs1049296    C   T   risk factor            susceptibility to Alzheimer disease
chr4   102751076  rs10516487   G   A   Pathogenic             association with Systemic lupus erythematosus
chr5   33951693   rs16891982   C   G   Pathogenic             Skin/hair/eye pigmentation, variation in, 5 (SHEP5)
chr5   35861068   rs1494558    T   C   Pathogenic             Severe combined immunodeficiency
chr5   35871190   rs1494555    G   A   Pathogenic             Severe combined immunodeficiency
chr6   18130918   rs1142345    T   C   Pathogenic             Thiopurine methyltransferase deficiency (TPMT)
chr6   18139228   rs1800460    C   T   Pathogenic             Thiopurine methyltransferase deficiency (TPMT)
chr7   100771717  rs6092       G   A   Pathogenic             Plasminogen activator inhibitor type 1 deficiency
chr7   138417791  rs3807153    A   G   Pathogenic             Renal tubular acidosis, distal, autosomal recessive (RTADR)
chr8   18258103   rs1799930    G   A   drug response          Slow acetylator due to N-acetyltransferase enzyme variant
chr10  54531235   rs1800450    C   T   Pathogenic             Mannose-binding protein deficiency
chr10  70645376   rs10509305   A   C   Pathogenic             Preeclampsia/eclampsia 4 (PEE4)
chr11  5255582    rs35152987   C   A   Pathogenic             delta Thalassemia
chr11  88911696   rs1042602    C   A   Pathogenic             Skin/hair/eye pigmentation, variation in, 3 (SHEP3)
chr11  113270828  rs1800497    G   A   Pathogenic             Dopamine receptor d2, reduced brain density of
chr12  14993439   rs11276      C   T   Pathogenic             DOMBROCK BLOOD GROUP
chr14  21790040   rs10151259   G   T   Pathogenic             Cone-rod dystrophy 13 (CORD13)
chr15  28228553   rs74653330   C   T   Pathogenic             Tyrosinase-positive oculocutaneous albinism (OCA2)
chr16  48258198   rs17822931   C   T   Pathogenic             Colostrum secretion, Ear wax
chr17  12915009   rs4792311    G   A   Pathogenic             Prostate cancer, hereditary, 2 (HPC2)
chr20  43043159   rs142204928  G   A   likely pathogenic      Maturity-onset diabetes of the young, type 1 (MODY1)
chr20  43280227   rs73598374   C   T   Pathogenic             Adenosine deaminase 2 allozyme
chr22  42526694   rs1065852    G   A   Pathogenic             poor metabolism of Debrisoquine
Table 4.6: Damaged nsSNVs and the drugs.
rsID | Position | Ref | Alt | AA | Category | Gene
rs1065852 | 22:42526694 | G | A | P34S | ENZ | CYP2D6
Drugs: amitriptyline; antipsychotics; atomoxetine; carvedilol; chlorpheniramine; chlorpromazine;
citalopram; clomipramine; clozapine; codeine; debrisoquine; desipramine; dextromethorphan; doxepin;
escitalopram; flecainide; fluoxetine; fluvoxamine; gefitinib; haloperidol; iloperidone; imipramine;
maprotiline; metoprolol; mexiletine; mianserin; morphine; nortriptyline; paroxetine; perhexiline;
perphenazine; propafenone; propranolol; risperidone; sparteine; tamoxifen; thioridazine; timolol;
tolterodine; tramadol; yohimbine; zuclopenthixol
Diseases: Breast Neoplasms; Cystic Fibrosis; Depression; Depressive Disorder; Hypertension; Neoplasms;
Pain; Parkinson Disease; Schizophrenia; tardive dyskinesia
rs1142345 | 6:18130918 | T | C | Y240C | ENZ | TPMT
Drugs: azathioprine; cisplatin; mercaptopurine; methotrexate; purine analogues; s-adenosylmethionine;
thioguanine
Diseases: Drug Toxicity; Neoplasms; Ototoxicity; Precursor Cell Lymphoblastic Leukemia-Lymphoma
rs12210538 | 6:110760008 | A | G | M409T | TRANS | SLC22A16
Drugs: cyclophosphamide; doxorubicin
Diseases: Breast Neoplasms; Drug Toxicity
rs1799930 | 8:18258103 | G | A | R197Q | ENZ | NAT2
Drugs: clonazepam; Drugs For Treatment Of Tuberculosis; ethambutol; isoniazid; pyrazinamide; rifampin;
sulfamethoxazole; trimethoprim
Diseases: Drug Toxicity; Hepatitis; Hypersensitivity; Infection; Maculopapular Exanthema; Pneumonia;
Toxic liver disease; Tuberculosis
rs1800460 | 6:18139228 | C | T | A154T | TAR | TPMT
Drugs: azathioprine; cisplatin; mercaptopurine; purine analogues; s-adenosylmethionine; thioguanine
Diseases: Drug Toxicity; Neoplasms; Ototoxicity; Precursor Cell Lymphoblastic Leukemia-Lymphoma
rs1800566 | 16:69745145 | G | A | P187S | TAR | NQO1
Drugs: 1-methyloxy-4-sulfone-benzene; Analgesics and anesthetics; anthracyclines and related substances;
Antibiotics; antiepileptics; Antifungals For Systemic Use;
antiinflammatory and antirheumatic products, non-steroids; Antimycobacterials; Antithyroid Preparations;
cisplatin; cyclophosphamide; dicumarol; doxorubicin; Drugs For Treatment Of Tuberculosis; epirubicin;
etoposide; fluorouracil; warfarin
Diseases: Breast Neoplasms; Carcinoma, Non-Small-Cell Lung; Heart Failure; Leukemia; Lung Neoplasms;
Toxic liver disease
rs1801133 | 1:11856378 | G | A | A222V | ENZ | MTHFR
Drugs: antineoplastic agents; antipsychotics; benazepril; busulfan; capecitabine; carboplatin; cisplatin;
cyclophosphamide; cyclosporine; dactinomycin; dexamethasone; disulfiram; docetaxel; doxorubicin;
fluorouracil; folic acid; gemcitabine; hormonal contraceptives for systemic use; hydroxychloroquine;
leucovorin; mercaptopurine; methotrexate; nitrous oxide; oxaliplatin; paclitaxel; pemetrexed; pravastatin;
sulfasalazine; vincristine; vinorelbine; vitamin b-complex, plain
Diseases: Alopecia; Alzheimer Disease; Arthritis, Juvenile Rheumatoid; Arthritis, Psoriatic;
Arthritis, Rheumatoid; Breast Neoplasms; Carcinoma, Non-Small-Cell Lung; Cardiovascular Diseases; Cleft Lip;
Cleft Palate; Cocaine-Related Disorders; Colonic Neoplasms; Colorectal Neoplasms; Artery Disease;
Down Syndrome; Drug Toxicity; Graft vs Host Disease; Hyperhomocysteinemia; Hypertension; Leukemia;
Leukemia, Lymphocytic, Chronic, B-Cell; Leukemia, Myelogenous, Chronic, BCR-ABL Positive; Leukopenia;
Lymphoma, Non-Hodgkin; metabolic syndrome; Migraine with Aura; Myocardial Infarction; Neoplasms;
Neoplasms, Second Primary; Neural Tube Defects; Neutropenia; Osteonecrosis; Osteosarcoma; Pre-Eclampsia;
Precursor Cell Lymphoblastic Leukemia-Lymphoma; Psoriasis; Schizophrenia; Thrombocytopenia;
Toxic liver disease; Transplantation; venous thromboembolism
rs1801394 | 5:7870973 | A | G | I49M | TAR | MTRR
Drugs: folic acid; leucovorin; methotrexate; tegafur; vitamin b-complex, plain
Diseases: Arthritis, Rheumatoid; Colorectal Neoplasms; Migraine with Aura;
Precursor Cell Lymphoblastic Leukemia-Lymphoma; Stomatitis
rs2228570 | 12:48272895 | A | G | M51T | TAR | VDR
Drugs: 1,25-dihydroxyvitamin d3; calcipotriol; calcitriol; dexamethasone; vitamin d and analogues
Diseases: Breast Neoplasms; Fractures, Bone; Osteonecrosis; Precursor Cell Lymphoblastic Leukemia-Lymphoma;
Prostatic Neoplasms; Tuberculosis
rs4149056 | 12:21331549 | T | C | V174A | TRANS | SLCO1B1
Drugs: Arsenic compounds; atorvastatin; atrasentan; axitinib; bosentan; capecitabine; caspofungin;
cerivastatin; cytarabine; enalapril; erythromycin; fludarabine; fluorouracil; fluvastatin;
gemtuzumab ozogamicin; hmg coa reductase inhibitors; idarubicin; irinotecan; leucovorin; lopinavir;
lovastatin; methotrexate; mycophenolate mofetil; nateglinide; olmesartan; penicillin g; pitavastatin;
pravastatin; repaglinide; rifampin; rosuvastatin; simvastatin; SN-38; troglitazone; valsartan
Diseases: Carcinoma, Non-Small-Cell Lung; Colorectal Neoplasms; Coronary Disease; Coronary Stenosis;
Diabetes Mellitus, Type 2; Diarrhea; Hypercholesterolemia; Hyperlipidemias; Hyperlipoproteinemia Type II;
Kidney Transplantation; Leukemia, Myeloid, Acute; Muscular Diseases; Myocardial Infarction;
Myopathy, Central Core; Neoplasms; Neutropenia; Obesity; Precursor Cell Lymphoblastic Leukemia-Lymphoma;
Rhabdomyolysis; Toxic liver disease; Transplantation
rs4646487 | 1:47279175 | C | T | R173W | ENZ | CYP4B1
Drugs: docetaxel; thalidomide
Diseases: Prostatic Neoplasms
After identifying the possibly pathogenic variants predicted by SIFT and PolyPhen2, the
consensus of both datasets was analyzed further to find the most probable impact of
these deleterious variants on drug targeting, transport, and metabolism. We found 11
nsSNVs that affect drug function (two in transporters, five in enzymes, and four in drug
targets; Table 4.6). One variant, rs1801133 (A222V in the MTHFR gene), has been associated
with an increased risk of metabolic syndrome in patients treated with antipsychotics
(Ellingrod et al., 2008). Our donor also has a high chance of decreased diastolic blood
pressure if treated with benazepril (Jiang et al., 2004). Another variant (rs1799930, R197Q
in the NAT2 gene) has been associated with an increased risk of toxic liver disease in patients
treated with ethambutol, isoniazid, pyrazinamide, and rifampin (Çetintaş et al., 2008). We also
observed an SNV (rs1065852, Chr22:42526694 G > A) that may influence this individual's response
to escitalopram, which is used for depression and anxiety (Han et al., 2013). The detailed list
of these drugs can be found in Table 4.7.
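The breakdown into transport, enzymatic, and target variants can be checked directly against the rows of Table 4.6. The following minimal Python sketch tallies the Category column; the tuple layout, the variable names, and the reading of ENZ, TRANS, and TAR as enzyme, transporter, and drug target are illustrative assumptions, not part of the original analysis pipeline.

```python
from collections import Counter

# (rsID, category, gene) rows copied from Table 4.6.
# Assumed meaning of the codes: ENZ = enzyme, TRANS = transporter, TAR = drug target.
TABLE_4_6 = [
    ("rs1065852", "ENZ", "CYP2D6"),
    ("rs1142345", "ENZ", "TPMT"),
    ("rs12210538", "TRANS", "SLC22A16"),
    ("rs1799930", "ENZ", "NAT2"),
    ("rs1800460", "TAR", "TPMT"),
    ("rs1800566", "TAR", "NQO1"),
    ("rs1801133", "ENZ", "MTHFR"),
    ("rs1801394", "TAR", "MTRR"),
    ("rs2228570", "TAR", "VDR"),
    ("rs4149056", "TRANS", "SLCO1B1"),
    ("rs4646487", "ENZ", "CYP4B1"),
]


def tally_categories(rows):
    """Count how many variants fall into each pharmacological category."""
    return Counter(category for _rsid, category, _gene in rows)


category_counts = tally_categories(TABLE_4_6)
for category, count in sorted(category_counts.items()):
    print(category, count)
```

The same tally could be extended to group the drug lists by category if the Drugs column were added to each tuple.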
Table 4.7: List of drugs (PharmGKB) in the PTN Genome. VIP: Very Important Pharmacogenes; PD:
Pharmacodynamic; PK: Pharmacokinetic
Prot ID | Symbol | Genotyped | VIP | PD | PK | Variant Annotation
Q96J66 | ABCC11 | TRUE | FALSE | - | PK | FALSE
Q9BWD1 | ACAT2 | FALSE | FALSE | PD | - | FALSE
B0ZBD3 | ADRA1A | FALSE | FALSE | PD | PK | FALSE
A2RU49 | AGPHD1 | FALSE | FALSE | - | - | TRUE
P50995 | ANXA11 | FALSE | FALSE | PD | - | TRUE
P04114 | APOB | TRUE | FALSE | PD | - | TRUE
P38398 | BRCA1 | FALSE | TRUE | PD | - | TRUE
Q9UIR0 | BTNL2 | FALSE | FALSE | - | - | TRUE
P56545 | CTBP2 | FALSE | FALSE | - | - | TRUE
Q6NWU0 | CYP2D6 | TRUE | TRUE | PD | PK | TRUE
Q5TCH4 | CYP4A22 | TRUE | FALSE | - | - | FALSE
P13584 | CYP4B1 | TRUE | FALSE | PD | PK | TRUE
Q14246 | EMR1 | FALSE | FALSE | - | - | TRUE
P04626 | ERBB2 | FALSE | FALSE | PD | PK | TRUE
Q2V2M9 | FHOD3 | FALSE | FALSE | - | - | TRUE
Q08379 | GOLGA2 | FALSE | FALSE | PD | - | FALSE
P34931 | HSPA1L | FALSE | FALSE | PD | - | TRUE
Q70Z44 | HTR3D | FALSE | FALSE | PD | - | FALSE
P42858 | HTT | FALSE | FALSE | - | - | TRUE
P05107 | ITGB2 | FALSE | FALSE | PD | - | FALSE
P98164 | LRP2 | FALSE | FALSE | PD | - | TRUE
Q9Y6C9 | MTCH2 | FALSE | FALSE | - | - | TRUE
Q6UB35 | MTHFD1L | FALSE | FALSE | - | - | TRUE
P42898 | MTHFR | TRUE | TRUE | PD | PK | TRUE
Q9Y2K3 | MYH15 | FALSE | FALSE | PD | - | TRUE
Q99466 | NOTCH4 | FALSE | FALSE | PD | - | FALSE
Q14980 | NUMA1 | TRUE | FALSE | - | - | FALSE
Q5JQS5 | OR2B11 | FALSE | FALSE | - | - | TRUE
Q9P1Y6 | PHRF1 | FALSE | FALSE | - | - | TRUE
Q9Y2K2 | SIK3 | FALSE | FALSE | - | - | TRUE
P46721 | SLCO1A2 | TRUE | FALSE | PD | PK | TRUE
P50226 | SULT1A2 | TRUE | FALSE | - | - | TRUE
P51580 | TPMT | TRUE | TRUE | PD | PK | TRUE
O75445 | USH2A | FALSE | FALSE | - | - | TRUE
P11473 | VDR | TRUE | TRUE | PD | PK | TRUE
Q709C8 | VPS13C | FALSE | FALSE | - | - | TRUE
Q502W6 | VWA3B | FALSE | FALSE | - | - | TRUE
4.4 Comparison of PTN genome to worldwide populations:
Multidimensional scaling (MDS) of the PTN genome together with 10 other diverse
populations from the Complete Genomics Inc. dataset was carried out using 46,946 common
variants. The Pakistani Pathan individual (PTN) fell near the Gujarati Indians (GIH),
consistent with their geographical and cultural proximity (Figure 4.4). This
whole-genome-scale study of the PTN also revealed a strong Caucasian influence in the
north-west province of Pakistan. East Asian and African populations formed their own
clusters in the MDS plot, distinct from each other.
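An MDS plot is computed from a matrix of pairwise genetic distances between samples. As a minimal sketch of that input (not the PLINK implementation itself), the Python code below computes an identity-by-state (IBS) distance from genotypes coded as 0, 1, or 2 copies of the alternate allele. The coding scheme, the function names, and the three toy genotype vectors are assumptions made for illustration only and are not data from this study.

```python
def ibs_distance(g1, g2):
    """Return 1 - IBS similarity for two genotype vectors coded 0/1/2.

    Sites where either genotype is None (missing) are skipped.
    """
    shared_alleles = 0
    sites = 0
    for a, b in zip(g1, g2):
        if a is None or b is None:
            continue
        # Two genotypes share 2, 1, or 0 alleles identical by state.
        shared_alleles += 2 - abs(a - b)
        sites += 1
    if sites == 0:
        raise ValueError("no sites genotyped in both samples")
    return 1.0 - shared_alleles / (2.0 * sites)


def distance_matrix(genotypes):
    """Build a symmetric {(sample, sample): distance} lookup table."""
    names = sorted(genotypes)
    return {
        (p, q): ibs_distance(genotypes[p], genotypes[q])
        for p in names
        for q in names
    }


# Toy genotypes (placeholders, four SNVs per sample).
toy_genotypes = {
    "sample_A": [0, 1, 2, 0],
    "sample_B": [0, 1, 2, 1],
    "sample_C": [2, 1, 0, 2],
}
distances = distance_matrix(toy_genotypes)
```

A matrix of this kind is what an MDS routine would then reduce to two dimensions for plotting.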
Figure 4.4: Multidimensional scaling (MDS) plot generated by PLINK based on 46,946 SNVs data to
show the ancestry of the PTN genome. Two-dimensional visualization of genotype data, with samples from
ten different ethnic populations (ASW: African ancestry in Southwest USA, CEU: Utah residents of Northern
and Western European ancestry, KOR: Korean, CHB: Han Chinese in Beijing, GIH: Gujarati Indians in
Houston, Texas, JPT: Japanese in Tokyo, Japan, LWK: Luhya in Webuye, Kenya, MKK: Maasai in Kinyawa,
Kenya, TSI: Toscani in Italia, YRI: Yoruba in Ibadan, Nigeria) collected by the HapMap Consortium and our
donor Pathan (PTN) individual.
The same 46,946 SNVs were used to perform model-based cluster analysis with the
software ADMIXTURE. We ran the analysis assuming K = 2 to K = 13 distinct ancestral
populations. For K = 3, the PTN genome corresponds mainly to the Caucasian ancestry, which
accounts for 85% of the overall ancestry in the PTN Pakistani individual and 74% in Gujarati
Indians (Figure 4.5). For K = 4, the Caucasian, African and East Asian ancestral populations
were the same as those seen for K = 3. Comparing results from K = 3 and K = 4, we see remarkable
agreement in the relative proportions of Caucasian and Asian ancestry across all Indian and
Pakistani individuals. However, K = 4 shows a very clear separation of South Asian ancestry
into distinct groups. Results from K = 5 to K = 13 suggest further separation of the ancestral
populations. Moreover, ancestry chromosome painting was performed using
INTERPRETOME, which supports the admixture of the Pakistani individual's SNVs with
Caucasians and Asians (Figure 4.6). The admixture results agree with the MDS
plots and suggest shared common ancestry of Pakistanis and Caucasians.
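ADMIXTURE reports, for each individual, a row of K membership coefficients that sum to one, and population-level figures such as those quoted above are averages of these rows. The Python sketch below shows that averaging step under stated assumptions: the input is a list of (population label, coefficient row) pairs, the function name is hypothetical, and the toy coefficients are placeholders rather than results from this study.

```python
def mean_ancestry(samples, tolerance=1e-6):
    """Average per-individual membership coefficients within each population.

    samples: iterable of (population_label, [q_1, ..., q_K]) pairs.
    Returns {population_label: [mean q_1, ..., mean q_K]}.
    """
    sums = {}
    counts = {}
    for population, row in samples:
        if abs(sum(row) - 1.0) > tolerance:
            raise ValueError(f"coefficients for {population} do not sum to 1")
        if population not in sums:
            sums[population] = [0.0] * len(row)
            counts[population] = 0
        elif len(row) != len(sums[population]):
            raise ValueError("all rows must use the same K")
        for k, value in enumerate(row):
            sums[population][k] += value
        counts[population] += 1
    return {
        population: [total / counts[population] for total in totals]
        for population, totals in sums.items()
    }


# Placeholder rows for K = 3 (not real ADMIXTURE output).
toy_samples = [
    ("POP1", [0.75, 0.25, 0.0]),
    ("POP1", [0.25, 0.75, 0.0]),
    ("POP2", [0.0, 0.0, 1.0]),
]
averages = mean_ancestry(toy_samples)
```

For a single-individual group such as PTN, the average is simply that individual's own row.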
Figure 4.5: ADMIXTURE results for K = 2 and K = 3 for the PTN individual combined with 46 selected
whole-genomes from Complete Genomics Inc. dataset (ASW: African ancestry in Southwest USA, CEU: Utah
residents of Northern and Western European ancestry, KOR: Korean, CHB: Han Chinese in Beijing, GIH:
Gujarati Indians in Houston, Texas, JPT: Japanese in Tokyo, Japan, LWK: Luhya in Webuye, Kenya, MKK:
Maasai in Kinyawa, Kenya, TSI: Toscani in Italia, YRI: Yoruba in Ibadan, Nigeria) and PTN: Pakistani Pathan.
The analysis was based on 46,946 SNVs. Each individual is represented by a vertical line, divided into colored
segments that represent membership coefficients in the subgroups.
Figure 4.6: Chromosome painting of possible genomic admixture, with Caucasians, Africans and Asians.
INTERPRETOME was used to create the chromosome ancestry painting.
4.5 Comparison with other Pakistani Individuals:
We investigated how representative our Pakistani PTN genome was of its ethnic
group by comparing it to 190 other Pakistani individuals in the HGDP-CEPH panel
(Rosenberg 2006, Li et al., 2008), which had been typed for ~650k SNVs. Admixture
analysis was performed on 643,281 SNVs (thinned to avoid LD). Considering the
cluster membership from ADMIXTURE and STRUCTURE (from K = 2 to K = 5), the
composition of the PTN genome fell within the variability observed within the Pathan
sample from the HGDP (Figure 4.7). Similarly, in a multidimensional scaling (MDS) plot,
the PTN genome fell among the other Pathan individuals (Figure 4.8). Taken together, these
two results confirm that the Pakistani genome presented in this thesis, designated “PTN”,
is representative of the Pathan ethnic group. These results are also in line with the
self-reported ancestry of the subject, all of whose grandparents came from Afghanistan to
Khyber Pakhtunkhwa (Pakistan).
Figure 4.7: Admixture results of Pakistani Pathan (PTN) individual to other ethnic groups in South Asia.
Admixture results for K = 2 and K = 5 for the Pathan individual combined with eight ethnic genomes from
HGDP dataset. The analysis was based on 643,281 SNVs. Each individual is represented by a vertical line,
divided into colored segments that represent membership coefficients in the subgroups.
Figure 4.8: Relationship of Pakistani Pathan individual to other ethnic groups in South Asia. Twelve different groups from South Asia were compared with PTN. The
analysis was based on 643,281 SNVs.
4.6 Demographic History Analysis:
We inferred the demographic history of the Pakistani Pathan using the pairwise
sequentially Markovian coalescent (PSMC) model (Li and Durbin 2012) (Figure 4.9), and
compared it to a panel of worldwide populations based on a number of HGDP genomes (Meyer
et al., 2012). As previously reported, all populations share a similar demographic history
between 1 million and 200 kyr ago. From 200 kyr to 20 kyr ago, the PTN follows a trajectory
similar to other Asian and European populations, with an inferred effective population size
smaller than that of African populations, reflecting the out-of-Africa bottleneck. Over the
last 20 kyr, the PTN shows an explosion in effective population size, contemporaneous with
that of other Eurasian populations but much greater in magnitude. This very large effective
population size likely reflects admixture between European and Asian lineages giving rise to
modern Pathans in Pakistan (as also suggested by the analysis of mtDNA and Y-chromosome),
rather than an actual increase in census size.
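PSMC reports time and population size in scaled units, which must be converted to years and effective population size before curves such as those in Figure 4.9 can be read. The Python sketch below applies the scaling commonly used with PSMC output: N0 = theta0 / (4 * mu * s), time in generations = 2 * N0 * t, and Ne = N0 * lambda. The default mutation rate, generation time, and bin size are widely used choices and are assumptions here, since the values used for this analysis are not stated in the text; the function name and the example inputs are hypothetical.

```python
def scale_psmc(theta0, intervals, mu=2.5e-8, generation_years=25.0, bin_size=100):
    """Convert scaled PSMC output to years and effective population size.

    theta0:    scaled mutation rate per bin reported by PSMC.
    intervals: iterable of (t_k, lambda_k) pairs in PSMC's scaled units.
    mu:        assumed per-site, per-generation mutation rate.
    Returns (N0, [(years_ago, effective_size), ...]).
    """
    if mu <= 0 or bin_size <= 0 or generation_years <= 0:
        raise ValueError("mu, bin_size and generation_years must be positive")
    n0 = theta0 / (4.0 * mu * bin_size)
    scaled = []
    for t_k, lambda_k in intervals:
        years_ago = 2.0 * n0 * t_k * generation_years
        effective_size = n0 * lambda_k
        scaled.append((years_ago, effective_size))
    return n0, scaled


# Placeholder inputs chosen for easy arithmetic, not values from this study.
n0, trajectory = scale_psmc(0.001, [(0.0, 1.0), (0.5, 2.0)])
for years_ago, effective_size in trajectory:
    print(round(years_ago), round(effective_size))
```

Because both axes scale with the assumed mutation rate, a different choice of mu shifts the whole curve without changing its shape.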
Figure 4.9: Pairwise Sequentially Markovian Coalescent (PSMC) model for reconstructing Pakistan’s demographic
history.
4.7 mtDNA and Y-chromosome analyses
The full mitochondrial genome of the Pakistani individual was generated by mapping its
reads to the revised Cambridge reference sequence (rCRS) (Andrews et al., 1999). The adenine and
thymine (AT) content of the genome was 55.5%, while the guanine and cytosine (GC) content was
44.5%. A total of 57 SNVs were found in the PTN mitochondrial genome, 13 of which had not
been previously reported. The variants were then analyzed with HaploGrep (Kloss-Brandstätter et
al., 2011) to identify the mitochondrial haplogroup of our PTN individual. A total of 14 SNVs
were diagnostic of the C4a1a1 haplogroup, which is more prevalent in southern Siberian
populations and has also been reported in Pakistani Pathans (Rakha et al., 2011, Derenko et al., 2010).
The AT and GC contents of the Y-chromosome were 39.87% and 60.13%, respectively.
A total of 13,724 SNVs were identified, of which 4,423 were novel. The observed
Y-chromosomal SNVs were annotated as markers for the L1 haplotype of clade L. Haplogroup L
has a high frequency in Pakistan (14%) as compared to India (6.3%), Turkey (~4%) and Caucasians
(~6%) (Mohyuddin et al., 2001, Firasat et al., 2007).
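The AT and GC percentages reported above are simple base-composition statistics. A minimal Python sketch of the calculation follows; the function name is hypothetical, and the choice to ignore ambiguous bases such as N is an assumption, since the text does not say how they were handled.

```python
def base_composition(sequence):
    """Return (AT percent, GC percent) over unambiguous bases only."""
    seq = sequence.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    called = at + gc  # N and other ambiguity codes are excluded
    if called == 0:
        raise ValueError("sequence contains no A, C, G or T bases")
    return 100.0 * at / called, 100.0 * gc / called


# Toy sequence for illustration; a real run would read the consensus FASTA.
at_percent, gc_percent = base_composition("ATGCatgcNN")
print(at_percent, gc_percent)
```

By construction the two percentages always sum to 100 when only unambiguous bases are counted.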
4.8 Phylogenomic Analysis:
A phylogenetic tree was constructed using 46 unrelated individuals, in which genomes
belonging to the same population and geographic region were found together in the same clade.
The PTN genome was closest to the Indian genomes, which were the most similar and
geographically nearest to it compared with the representative genomes of other
Asian individuals. Although Pakistan borders China to the north-east, the Chinese genomes
formed a separate branch with genetically similar East Asian groups such as Japanese and
Koreans (Figure 4.10). Genomes from East Asia were placed close to each other. African genomes,
which include those from the Yoruba (YRI), Maasai (MKK), and Luhya (LWK) populations as well as
Africans from the USA (ASW), formed one clade clearly separated from the Asian and Caucasian
genomes. Utah genomes (CEU) were grouped together, separated from those of Italy (TSI). Only
the Indian (GIH) and Pakistani (PTN) genomes were used from South Asia for this study.
Together they formed a clade; however, they also showed a rather clear separation from each other.
Figure 4.10: Phylogenomic tree of Pakistani PTN genome with other world ethnic genomes.
Chapter 5 DISCUSSION
5. Discussion
Globally, human populations show structured genetic diversity as a result of geographical
dispersion, selection and drift (Gurdasani et al., 2015). Understanding this variation can provide
insights into evolutionary processes that shape both human adaptation and variation in disease
susceptibility (Ding and Kullo 2009). Although the HapMap (Gibbs et al., 2003), HGDP (Cann
et al., 2002), PanAsia (Ngamphiw et al., 2011) and 1000 Genomes (Siva, 2008) projects have
greatly enhanced our understanding of genetic variation globally, Pakistani populations have
not yet been characterized in detail. Efforts such as the Human Genome Diversity
Panel have examined Pakistani genetic diversity but are limited by variant density (Cann et al., 2002).
The Pakistani population consists of four major ethnic groups (Punjabis, Pakhtuns, Sindhis,
Balochis), each with a unique cultural, dietary, environmental and ancestral heritage (Mehdi et al.,
1999). Genetic inferences about these ethnic groups have mostly focused on uniparental
lineage markers, which indicate ancient admixture of Pakistanis with Caucasians (Mohyuddin et al.,
2001). Clarifying the admixture of the Pakistani population provides fundamental
knowledge pertinent to the interpretation of any genetic study of prevalent diseases in Pakistani
groups and to correspondingly improved healthcare. Prevalent diseases in Pakistan include
cancer, diabetes, hypertension, and cardiovascular and neurological disorders (Dennis et al., 2006;
Rizvi et al., 2004; Whiting et al., 2011; Shera et al., 2007; Jafar et al., 2005; Jafar et al., 2003;
Nanan 2009; Shah et al., 2001; Mirza and Jenkins 2004). For example, it is estimated that 10%
of the population is afflicted with neurological diseases (Husain et al., 2000).
As a consequence of genetic diversity arising from dispersion, selection and drift,
and complicated by admixture, disease prevalence, severity, and resistance vary considerably
among ethnic groups. These factors are further complicated by inheritance patterns and by
non-inherited and environmental causes, such as poverty, unequal access to care, lifestyle, and
health-related cultural practices (Chin et al., 2007). The genetic makeup of populations from
Pakistan is important for understanding specific diseases and is of interest to
scientists around the globe because of the increased likelihood of congenital diseases unique in
prevalence to Pakistani populations. Consequently, this research was conducted to sequence the
first whole genome from northwest Pakistan, to discover disease variants, and to provide a
foundation for complex disease studies. The current research not only provides new
approaches for exploring population admixture dynamics, but also allowed us to conduct the first
genetic study of diseases and pharmacogenes in the northwestern population of Pakistan. The
ultimate goal of this study is to extend these results to interpretation and
translation that improve healthcare for the Pakistani people.
5.1 Clinical Relevance and Variant Characterization:
Studying complex diseases and gene mapping is often difficult because of sampling from
genetically heterogeneous populations. This complexity can be circumvented in isolated
populations, where both genetic and environmental homogeneity will likely produce fewer
variants of the disease and the extent of linkage disequilibrium is generally larger than in outbred
populations (Race and Group 2005). Genomic variations including single nucleotide variations
(SNVs), small insertions and deletions (indels), and copy number variations (CNVs) were
identified. Variants were then annotated and scanned for associated biological and physiological
functions, along with SNVs that could modulate drug response.
Overall, 3.8 million single nucleotide variations (SNVs), 1,503 copy number variation
regions (CNVRs) and 0.5 million small indels were identified by comparing the PTN genome with
the human reference genome (hg19). Among the SNVs, 129,441 were novel, and 10,315 non-synonymous
SNVs were found in 5,344 genes. SNVs were annotated for genealogical study, high-risk
diseases, and possible influences on drug efficacy. Functional classification of all the
non-synonymous variants was performed using computational prediction methods. Clinical
variants were investigated, and 31 coding SNVs were found to be associated with several
diseases. Our analysis suggests that the donor may be susceptible to Alzheimer's disease (AD),
based on the SNV rs1049296 in the TF gene, where proline changes to serine at position 570
(Wang et al., 2013). This AD-associated SNV decreases the affinity of TF for iron, leading to
iron accumulation in brain cells, which results in memory loss. Another variant, rs4792311 in the
ELAC2 gene, which has been reported to be associated with prostate cancer, was observed in the
Pakistani genome (PTN). As a result of this SNV, serine at position 217 is replaced by leucine
(Alvarez-Cubero et al., 2013). The rate of prostate cancer is low in Pakistan (3.8%) (Aziz et al.,
2003) compared with Americans and Caucasians (Bhurgri et al., 2009). The donor's family
medical history includes documented cases of obesity, hypertension and heart
disease. Therefore, we specifically investigated the genes responsible for these
disorders. Three variants associated with obesity were found in the genes GHRLOS (rs696217,
Leu72Met), SERPINE1 (rs6092, Ala15Thr), and PPARG (rs1801282, Pro12Ala) (Gueorguiev et
al., 2009; Bouchard et al., 2010; Galbete et al., 2013). About 22.2% of Pakistanis are reported to
be obese, which is close to the figures for European (~24%) and United States (~19%) populations
(Flegal et al., 2010; Kopelman et al., 2009; Streib 2007). We also found three pathogenic SNVs
in genes associated with hair, skin and pigmentation: EDAR (rs3827760, Val370Ala), SLC45A2
(rs16891982, Phe374Leu), and TYR (rs1042602, Ser192Tyr) (Tan et al., 2013; Spichenok et al.,
2011; Sulem et al., 2007). In addition, we detected an SNV (rs17822931, Gly180Arg) in
ABCC11, which is responsible for wet earwax and which was also found in the Pakistani PK1
genome (Yoshiura et al., 2006).
One of the variants (rs1065852, Pro34Ser) in the CYP2D6 gene is responsible for poor
metabolism of debrisoquine, an adrenergic-blocking medication used for the treatment of
hypertension (Zheng et al., 2013). Also, two SNVs are known to have a pathogenic effect and
lead to thiopurine methyltransferase (TPMT) deficiency (Li et al., 2013; Corrigan et al., 2013).
Moreover, two nsSNVs in the Arachidonic acid metabolism pathway were found. Arachidonic
acid in the human body usually comes from dietary animal sources, such as meat, eggs, and dairy
products. Meat is an important part of diet for the people living in Khyber Pakhtunkhwa, usually
consumed at least once a day, often in the form of kabab (minced meat fried in oil), or curry
(Lindholm, 2004).
Comparative genomic analysis was carried out using the genome from the northwest (PTN) and the
previously published Pakistani (PK1) genome (Azim et al., 2013). The PK1 genome was
reported to be of Sindhi ethnicity. Non-synonymous variants from the PK1 genome were
annotated and screened against disease and drug databases and prediction tools, for example SIFT,
PolyPhen, OMIM, ClinVar, PharmGKB and DrugBank (Ng and Henikoff 2003, Jordan et al., 2011,
Landrum et al., 2013, Amberger et al., 2011, Thorn et al., 2013, Wishart et al., 2008), to
investigate associated diseases. Out of ~8,000 nsSNVs, only 37 variants (three novel) were
found to be linked with certain disorders. Eight clinically relevant SNVs overlapped
with the PTN genome. Unlike in the PTN genome, we found no damaging variants associated with
Alzheimer's, obesity and heart-related diseases in PK1. An SNV known to affect warfarin
response was observed in the PK1 genome (Schwarz et al., 2008). Moreover, a pathogenic
mutation (rs1169305) was seen in the HNF1A gene, which may become a cause of diabetes in the
PK1 individual (Bonnycastle et al., 2006). As noted above, the ABCC11 SNV (rs17822931,
Gly180Arg), which is responsible for wet earwax, was found in both Pakistani
genomes (Yoshiura et al., 2006).
5.2 Pharmacogenomic Profile:
The genetic map of the PTN individual was further used to identify possible influences on drug
efficacy. A large number of variants were associated with susceptibility to drug toxicity,
while other nsSNVs were linked to the efficacy of medicines used in the treatment of diseases
such as depression, diabetes mellitus, Alzheimer disease, and arthritis. One variant has been
associated with an increased risk of metabolic syndrome in patients treated with antipsychotics
(Ellingrod et al., 2008). Our donor also has a high chance of decreased diastolic blood pressure
if treated with benazepril (Jiang et al., 2004). Another variant has been associated with an
increased risk of toxic liver disease in patients treated with ethambutol, isoniazid, pyrazinamide,
and rifampin (Çetintaş et al., 2008). We also observed an SNV that may influence this individual's
response to escitalopram, which is used for depression and anxiety (Han et al., 2013).
Most of the clinically relevant variants considered in this study were originally described in
Caucasian populations. While this might be a consequence of the genomic affinities of the
Pakistani genome with Caucasian populations, it might also reflect a bias due to most
GWAS work being carried out on Caucasian populations (Ayub et al., 2009). Therefore, a
cohort study in the Pakistani population will be required for validation.
The methodology, technology and infrastructure that we developed and used are equally
applicable to the study of other global ethnic populations and the diseases most prevalent in those
populations. Most importantly, we successfully created a DNA variation dataset for the Pakistani
population and made it available to researchers for understanding human biology with respect to
disease predisposition, adverse drug reactions, and other genetically informed healthcare
interpretation.
5.3 Genealogical and Admixture Analysis:
For many years researchers have been trying to clarify the origins and stratification,
as well as the intra- and inter-population relationships, of ethnic groups in Pakistan. Originally
the focus was on uniparental lineage markers passed through the Y chromosome and mtDNA in
the male and female lines, respectively (Mohyuddin et al., 2001, Firasat et al., 2007, Rakha et al.,
2011, Metspalu et al., 2004). We therefore analyzed the first-ever whole genome of a Pathan /
Pakhtun from a north-west province (Khyber Pakhtunkhwa) of Pakistan to explore what
additional information can be learnt. Other analytical approaches were also used to assess the
influence of ancestral contributions within Pakistani Pakhtuns along with the historical
background of the region. Our analysis of 46 unrelated human genomes from 10 different
populations provides a comprehensive view of the PTN genome. We found that the Pakistani
Pathan appears along the Indian cline in our MDS plot, beside Caucasians and East Asians. At
K = 4 the Pakistani Pathan and the Indians formed their own component, making them better
representatives of South Asia. This was additionally confirmed by comparing our
representative genome with other individuals from South Asia in the HGDP-CEPH panel (Li et
al., 2008), which were studied using Illumina Omnichips of ~650k SNVs. Considering the
cluster membership (from K = 2 to K = 5), the PTN genome composition fell within the
variability observed within the Pathan sample from the HGDP (Figure 4.7). Similarly, in a
multidimensional scaling (MDS) plot, the PTN genome fell among the other Pathan/Pakistani
individuals (Figure 4.8). African populations were the most distant and differentiated
from the Pathan population. Being the only neighboring genomes in the dataset, the Indian genomes
showed the closest genetic relationship with the Pakistani PTN genome. Both types of ethnic
genomes formed a separate clade distant from other Asian genomes, as supported by the MDS plot
and phylogenetic tree analysis.
Based on our results we confirmed that our PTN genome is representative of the Pathan
ethnic group. These results are also in line with the self-reported ancestry of the subject, all of
whose grandparents came from Afghanistan to Khyber Pakhtunkhwa (Pakistan). We found that
the Pathan genome has more than 80% Caucasian ancestry, with mitochondrial haplogroup C4a1a1
and Y-chromosome haplogroup L, suggesting that Pathans are probably an admixture of Caucasians
and South Asians at the genomic level. Haplogroup L has a high frequency in Pakistan (14%) as
compared to India (6.3%), Turkey (~4%) and Caucasians (~6%) (Mohyuddin et al., 2001, Firasat et al.,
2007).
5.4 Demographic History Analysis and Ancestral Population Size:
We inferred the demographic history of the Pakistani genome (PTN) using the pairwise
sequentially Markovian coalescent (PSMC) model (Li and Durbin 2012) (Figure 4.9), and
compared it to a panel of worldwide populations based on a number of HGDP genomes (Meyer
et al., 2012). As previously reported, all populations share a similar demographic history
between 1 million and 200 kyr ago. From 200 kyr to 20 kyr ago, the PTN follows a trajectory
similar to other Asian and European populations, with an inferred effective population size
smaller than that of African populations, reflecting the out-of-Africa bottleneck. Over the
last 20 kyr, the PTN shows an explosion in effective population size, contemporaneous with
that of other Eurasian populations but much greater in magnitude. This very large effective
population size likely reflects admixture between European and Asian lineages giving rise to
modern Pathans (as also suggested by the analysis of mtDNA and Y-chromosome), rather than an
actual increase in census size.
5.5 Conclusion:
Here we present, for the first time, the whole genome of a Pakistani individual from a
north-west province (Khyber Pakhtunkhwa). This research not only provides new
approaches for exploring population admixture dynamics, but also allowed us to conduct the first
genetic study of diseases and pharmacogenes in the northwestern population of Pakistan. The
ultimate goal of this research is to extend these results to interpretation and
translation that improve healthcare for the Pakistani people. Our analysis provides a detailed view
of the diversity of the PTN genome, the functional classification of its variants, and their impact
on pharmacogenomics. A large-scale analysis of diverse genomes is needed to help researchers
around the world understand genetic diversity and the functional classification of variants,
along with pharmacogenomic traits and associated drugs that could be used in personalized
medicine.
5.6 Recommendations and Future Plans:
• A genetic resource for all Pakistani populations should be established for computing their
allele sharing as a measure of linkage disequilibrium, admixture, and migration.
• A cohort study in the Pakistani population is required for validation, which will help
us conduct genetic disease studies.
• Rare and common diseases, and their susceptibility and association with the genetic makeup
of the Pakistani population, should be investigated.
• Patients, physicians and science journalists should be educated on interpreting genomic
results.
• Genomics applications and implications should be openly discussed through conferences
and workshops. This will encourage interaction between experts, academicians,
researchers, students, and policy makers.
CHAPTER 6
6. References
Ahn, S.-M., Kim, T.-H., Lee, S., Kim, D., Ghang, H., Kim, D.-S., et al. (2009). The first Korean
genome sequence and analysis: full genome sequencing for a socio-ethnic group. Genome
research, 19(9), 1622-1629.
Alexander, D. H., Novembre, J., & Lange, K. (2009). Fast model-based estimation of ancestry in
unrelated individuals. Genome research, 19(9), 1655-1664.
Alvarez-Cubero, M. J., Saiz, M., Martinez-Gonzalez, L. J., Alvarez, J. C., Lorente, J. A., &
Cozar, J. M. (2013). Genetic analysis of the principal genes related to prostate cancer: a
review. Paper presented at the Urologic Oncology: Seminars and Original Investigations.
Amberger, J., Bocchini, C., & Hamosh, A. (2011). A new face and new challenges for Online
Mendelian Inheritance in Man (OMIM®). Human mutation, 32(5), 564-567.
Andrews, R. M., Kubacka, I., Chinnery, P. F., Lightowlers, R. N., Turnbull, D. M., & Howell, N.
(1999). Reanalysis and revision of the Cambridge reference sequence for human
mitochondrial DNA. Nature genetics, 23(2), 147-147.
Ayub, Q., & Tyler-Smith, C. (2009). Genetic variation in South Asia: assessing the influences of
geography, language and ethnicity for understanding history and disease risk. Briefings in
functional genomics & proteomics, 8(5), 395-404.
Azim, M. K., Yang, C., Yan, Z., Choudhary, M. I., Khan, A., Sun, X., et al. (2013). Complete
genome sequencing and variant analysis of a Pakistani individual. Journal of human
genetics, 58(9), 622-626.
Aziz, Z., Sana, S., Saeed, S., & Akram, M. (2003). Institution based tumor registry from Punjab:
five year data based analysis. Journal of the Pakistan Medical Association, 53(8),
350-353.
Bentley, D. R., Balasubramanian, S., Swerdlow, H. P., Smith, G. P., Milton, J., Brown, C. G., et
al. (2008). Accurate whole human genome sequencing using reversible terminator
chemistry. Nature, 456(7218), 53-59.
Bhurgri, Y., Kayani, N., Pervez, S., Ahmed, R., Tahir, I., Afif, M., et al. (2009). Incidence and
Trends of Prostate Cancer in Karachi South. Asian Pacific Journal of Cancer Prevention,
10, 45-48.
Bodmer, W., & Bonilla, C. (2008). Common and rare variants in multifactorial susceptibility to
common diseases. Nature genetics, 40(6), 695-701.
Bonnycastle, L. L., Willer, C. J., Conneely, K. N., Jackson, A. U., Burrill, C. P., Watanabe, R.
M., et al. (2006). Common variants in maturity-onset diabetes of the young genes
contribute to risk of type 2 diabetes in Finns. Diabetes, 55(9), 2534-2540.
Bouchard, L., Vohl, M.-C., Lebel, S., Hould, F.-S., Marceau, P., Bergeron, J., et al. (2010).
Contribution of genetic and metabolic syndrome to omental adipose tissue PAI-1 gene
mRNA and plasma levels in obesity. Obesity surgery, 20(4), 492-499.
Cann, H. M., De Toma, C., Cazes, L., Legrand, M.-F., Morel, V., Piouffre, L., et al. (2002). A
human genome diversity cell line panel. Science (New York, NY), 296(5566), 261.
Cavalli-Sforza, L. L. (2005). The human genome diversity project: past, present and future.
Nature Reviews Genetics, 6(4), 333-340.
Çetintaş, V. B., Erer, O. F., Kosova, B., Özdemir, İ., Topçuoğlu, N., Aktoğu, S., et al. (2008).
Determining the relation between N-acetyltransferase-2 acetylator
phenotype and antituberculosis drug induced hepatitis by molecular biologic tests. Tuberk
Toraks, 56, 81-86.
Chin, M. H., Walters, A. E., Cook, S. C., & Huang, E. S. (2007). Interventions to reduce racial
and ethnic disparities in health care. Medical Care Research and Review, 64(5 suppl), 7S-
28S.
Collins, F. S., & Mansoura, M. K. (2001). The Human Genome Project. Revealing the shared
inheritance of all humankind. Cancer, 91(1 Suppl), 221-225.
Collins, F. S., Brooks, L. D., & Chakravarti, A. (1998). A DNA polymorphism discovery
resource for research on human genetic variation. Genome research, 8(12), 1229-1231.
Corrigan, A., Lal, R., Wickramasinghe, S., Whelan, S., Sanderson, J., Marinaki, A., et al. (2013).
Testing for association between TPMT, COMT and NOX3 variants and the onset of
ototoxicity in lung cancer patients treated with platinum chemotherapy. Lung Cancer, 79,
S11.
Danecek, P., Auton, A., Abecasis, G., Albers, C. A., Banks, E., DePristo, M. A., et al. (2011).
The variant call format and VCFtools. Bioinformatics, 27(15), 2156-2158.
Dennis, B., Aziz, K., She, L., Faruqui, A., Davis, C., Manolio, T. A., et al. (2006). High rates of
obesity and cardiovascular disease risk factors in lower middle class community in
Pakistan: the Metroville Health Study. J Pak Med Assoc, 56(6), 267-272.
Derenko, M., Malyarchuk, B., Grzybowski, T., Denisova, G., Rogalla, U., Perkova, M.,
Dambueva, I., & Zakharov, I. (2010). Origin and post-glacial dispersal of mitochondrial
DNA haplogroups C and D in northern Asia. PloS one, 5(12), e15214.
Ding, K., & Kullo, I. J. (2009). Evolutionary genetics of coronary heart disease. Circulation,
119(3), 459-467.
Dissanayake, V. H., Samarakoon, P. S., Scaria, V., Patowary, A., Sivasubbu, S., & Gokhale, R.
S. (2011). The Sri Lankan Personal Genome Project. The Sri Lankan Personal Genome
Project, 2(1), 4-8.
Do, R., Balick, D., Li, H., Adzhubei, I., Sunyaev, S., & Reich, D. (2015). No evidence that
selection has been less effective at removing deleterious mutations in Europeans than in
Africans. Nature genetics.
Dogan, H., Can, H., & Otu, H. H. (2014). Whole Genome Sequence of a Turkish Individual.
PloS one, 9(1).
Drmanac, R., Sparks, A. B., Callow, M. J., Halpern, A. L., Burns, N. L., Kermani, B. G., et al.
(2010). Human genome sequencing using unchained base reads on self-assembling DNA
nanoarrays. Science, 327(5961), 78-81.
Elingarami, S., Li, X., & He, N. (2013). Applications of nanotechnology, next generation
sequencing and microarrays in biomedical research. Journal of nanoscience and
nanotechnology, 13(7), 4539-4551.
Ellingrod, V. L., Miller, D. D., Taylor, S. F., Moline, J., Holman, T., & Kerr, J. (2008).
Metabolic syndrome and insulin resistance in schizophrenia patients receiving
antipsychotics genotyped for the methylenetetrahydrofolate reductase (MTHFR) 677C/T
and 1298A/C variants. Schizophrenia research, 98(1), 47-54.
Feero, W. G., & Guttmacher, A. E. (2014). Genomics, personalized medicine, and pediatrics.
Academic pediatrics, 14(1), 14-22.
Felsenstein, J. (2002). PHYLIP (Phylogeny Inference Package) version 3.6a3.
Feuk, L., Carson, A. R., & Scherer, S. W. (2006). Structural variation in the human genome.
Nature Reviews Genetics, 7(2), 85-97.
Firasat, S., Khaliq, S., Mohyuddin, A., Papaioannou, M., Tyler-Smith, C., Underhill, P. A., et al.
(2007). Y-chromosomal evidence for a limited Greek contribution to the Pathan
population of Pakistan. European Journal of Human Genetics, 15(1), 121-126.
Flegal, K. M., Carroll, M. D., Ogden, C. L., & Curtin, L. R. (2010). Prevalence and trends in
obesity among US adults, 1999-2008. JAMA, 303(3), 235-241.
Fujimoto, A., Nakagawa, H., Hosono, N., Nakano, K., Abe, T., Boroevich, K. A., et al. (2010).
Whole-genome sequencing and comprehensive variant analysis of a Japanese individual
using massively parallel sequencing. Nature genetics, 42(11), 931-936.
Galbete, C., Toledo, J., Martínez-González, M. Á., Martínez, J. A., Guillén-Grima, F., & Marti,
A. (2013). Lifestyle factors modify obesity risk linked to PPARG2 and FTO variants in
an elderly population: a cross-sectional analysis in the SUN Project. Genes & nutrition,
8(1), 61-67.
Gibbs, R. A., Belmont, J. W., Hardenbol, P., Willis, T. D., Yu, F., Yang, H., et al. (2003). The
international HapMap project. Nature, 426(6968), 789-796.
Gueorguiev, M., Lecoeur, C., Meyre, D., Benzinou, M., Mein, C. A., Hinney, A., et al. (2009).
Association studies on ghrelin and ghrelin receptor gene polymorphisms with obesity.
Obesity, 17(4), 745-754.
Gupta, R., Ratan, A., Rajesh, C., Chen, R., Kim, H. L., Burhans, R., et al. (2012). Sequencing
and analysis of a South Asian-Indian personal genome. BMC genomics, 13(1), 440.
Gurdasani, D., Carstensen, T., Tekola-Ayele, F., Pagani, L., Tachmazidou, I., Hatzikotoulas, K.,
et al. (2015). The African Genome Variation Project shapes medical genetics in Africa.
Nature, 517(7534), 327-332.
Han, K.-M., Chang, H. S., Choi, I.-K., Ham, B.-J., & Lee, M.-S. (2013). CYP2D6 P34S
polymorphism and outcomes of escitalopram treatment in Koreans with major
depression. Psychiatry investigation, 10(3), 286-293.
Hewett, M., Oliver, D. E., Rubin, D. L., Easton, K. L., Stuart, J. M., Altman, R. B., et al. (2002).
PharmGKB: the pharmacogenetics knowledge base. Nucleic acids research, 30(1), 163-
165.
Hudson, M. E. (2008). Sequencing breakthroughs for genomic ecology and evolutionary biology.
Molecular ecology resources, 8(1), 3-17.
Husain, N., Creed, F., & Tomenson, B. (2000). Depression and social stress in Pakistan.
Psychological medicine, 30(2), 395-402.
Iafrate, A. J., Feuk, L., Rivera, M. N., Listewnik, M. L., Donahoe, P. K., Qi, Y., et al. (2004).
Detection of large-scale variation in the human genome. Nature genetics, 36(9), 949-951.
Jafar, T. H., Jessani, S., Jafary, F. H., Ishaq, M., Orkazai, R., Orkazai, S., et al. (2005). General
Practitioners’ Approach to Hypertension in Urban Pakistan: Disturbing Trends in Practice.
Circulation, 111(10), 1278-1283.
Jafar, T. H., Levey, A. S., Jafary, F. H., White, F., Gul, A., Rahbar, M. H., et al. (2003). Ethnic
subgroup differences in hypertension in Pakistan. Journal of hypertension, 21(5), 905-
912.
Jiang, S., Hsu, Y.-H., Xu, X., Xing, H., Chen, C., Niu, T., et al. (2004). The C677T
polymorphism of the methylenetetrahydrofolate reductase gene is associated with the
level of decrease on diastolic blood pressure in essential hypertension patients treated by
angiotensin-converting enzyme inhibitor. Thrombosis research, 113(6), 361-369.
Jordan, D. M., Kiezun, A., Baxter, S. M., Agarwala, V., Green, R. C., Murray, M. F., et al.
(2011). Development and validation of a computational method for assessment of
missense variants in hypertrophic cardiomyopathy. The American Journal of Human
Genetics, 88(2), 183-192.
Karczewski, K. J., Tirrell, R. P., Cordero, P., Tatonetti, N. P., Dudley, J. T., Salari, K., et al.
(2012). Interpretome: a freely available, modular, and secure personal genome
interpretation engine. Paper presented at the Pac Symp Biocomput.
Kelly, A. D., Hill, K. E., Correll, M., Hu, L., Wang, Y. E., Rubio, R., et al. (2013). Next-
generation sequencing and microarray-based interrogation of microRNAs from formalin-
fixed, paraffin-embedded tissue: preliminary assessment of cross-platform concordance.
Genomics, 102(1), 8-14.
Kim, J.-I., Ju, Y. S., Park, H., Kim, S., Lee, S., Yi, J.-H., et al. (2009). A highly annotated
whole-genome sequence of a Korean individual. Nature, 460(7258), 1011-1015.
Kircher, M. (2011). Understanding and improving high-throughput sequencing data production
and analysis. PhD Thesis. (http://www.qucosa.de)
Kitzman, J. O., MacKenzie, A. P., Adey, A., Hiatt, J. B., Patwardhan, R. P., Sudmant, P. H., et
al. (2011). Haplotype-resolved genome sequencing of a Gujarati Indian individual.
Nature biotechnology, 29(1), 59-63.
Kitzmann, K. M., Dalton III, W. T., Stanley, C. M., Beech, B. M., Reeves, T. P., Buscemi, J., et
al. (2010). Lifestyle interventions for youth who are overweight: a meta-analytic review.
Health Psychology, 29(1), 91.
Kloss-Brandstätter, A., Pacher, D., Schönherr, S., Weissensteiner, H., Binna, R., Specht, G., &
Kronenberg, F. (2011). HaploGrep: a fast and reliable algorithm for automatic
classification of mitochondrial DNA haplogroups. Human mutation, 32(1), 25-32.
Koboldt, D. C., Ding, L., Mardis, E. R., & Wilson, R. K. (2010). Challenges of sequencing
human genomes. Briefings in bioinformatics, 11(5), 484-498.
Kopelman, P. G., Caterson, I. D., & Dietz, W. H. (2009). Clinical obesity in adults and children:
John Wiley & Sons.
Lander, E. S., Linton, L. M., Birren, B., Nusbaum, C., Zody, M. C., Baldwin, J., et al. (2001).
Initial sequencing and analysis of the human genome. Nature, 409(6822), 860-921.
Landrum, M. J., Lee, J. M., Riley, G. R., Jang, W., Rubinstein, W. S., Church, D. M., et al.
(2013). ClinVar: public archive of relationships among sequence variation and human
phenotype. Nucleic acids research, gkt1113.
Levy, S., Sutton, G., Ng, P. C., Feuk, L., Halpern, A. L., Walenz, B. P., et al. (2007). The diploid
genome sequence of an individual human. PLoS biology, 5(10), e254.
Li, H., & Durbin, R. (2009). Fast and accurate short read alignment with Burrows–Wheeler
transform. Bioinformatics, 25(14), 1754-1760.
Li, H., & Durbin, R. (2012). Inference of human population history from whole genome
sequence of a single individual. Nature, 475(7357), 493.
Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., et al. (2009). The sequence
alignment/map format and SAMtools. Bioinformatics, 25(16), 2078-2079.
Li, J. Z., Absher, D. M., Tang, H., Southwick, A. M., Casto, A. M., Ramachandran, S., et al.
(2008). Worldwide human relationships inferred from genome-wide patterns of variation.
Science, 319(5866), 1100-1104.
Li, X., Lian, F.-M., Guo, D., Fan, L., Tang, J., Peng, J.-B., et al. (2013). The rs1142345 in TPMT
Affects the Therapeutic Effect of Traditional Hypoglycemic Herbs in Prediabetes.
Evidence-Based Complementary and Alternative Medicine, 2013.
Lindholm, C. (2004). Swat Pathan. In Encyclopedia of Sex and Gender (pp. 833-840): Springer.
Mansoor, S., Amin, I., Hussain, M., Zafar, Y., Bull, S., Briddon, R., et al. (2001). Association of
a disease complex involving a begomovirus, DNA 1 and a distinct DNA beta with leaf
curl disease of okra in Pakistan. Plant Disease, 85(8), 922-922.
Mardis, E. R. (2008). Next-generation DNA sequencing methods. Annu. Rev. Genomics Hum.
Genet., 9, 387-402.
McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A., et al. (2010).
The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation
DNA sequencing data. Genome research, 20(9), 1297-1303.
Mehdi, S., Qamar, R., Ayub, Q., Khaliq, S., Mansoor, A., Ismail, M., et al. (1999). The Origins
of Pakistani Populations. In Genomic Diversity (pp. 83-90): Springer.
Metspalu, M., Kivisild, T., Metspalu, E., Parik, J., Hudjashov, G., Kaldma, K., et al. (2004).
Most of the extant mtDNA boundaries in south and southwest Asia were likely shaped
during the initial settlement of Eurasia by anatomically modern humans. BMC genetics,
5(1), 26.
Metzker, M. L. (2010). Sequencing technologies—the next generation. Nature Reviews Genetics,
11(1), 31-46.
Meyer, F. (2006). Genome Sequencing vs. Moore's Law: Cyber Challenges for the Next Decade.
CTWatch Quarterly, 2(3).
Meyer, M., Kircher, M., Gansauge, M.-T., Li, H., Racimo, F., Mallick, S., et al. (2012). A high-
coverage genome sequence from an archaic Denisovan individual. Science, 338(6104),
222-226.
Miller, A. J., Matasci, N., Schwaninger, H., Aradhya, M. K., Prins, B., Zhong, G.-Y., et al.
(2013). Vitis phylogenomics: hybridization intensities from a SNP array outperform
genotype calls. PloS one, 8(11), e78680.
Miller, C. A., Hampton, O., Coarfa, C., & Milosavljevic, A. (2011). ReadDepth: a parallel R
package for detecting copy number alterations from short sequencing reads. PloS one,
6(1), e16327.
Mirza, I., & Jenkins, R. (2004). Risk factors, prevalence, and treatment of anxiety and depressive
disorders in Pakistan: systematic review. BMJ, 328(7443), 794.
Mohyuddin, A., Ayub, Q., Qamar, R., Zerjal, T., Helgason, A., Mehdi, S. Q., et al. (2001). Y-
chromosomal STR haplotypes in Pakistani populations. Forensic science international,
118(2), 141-146.
Nanan, D. (2002). The obesity pandemic-implications for Pakistan. JPMA, 52(342).
Ng, P. C., & Henikoff, S. (2003). SIFT: Predicting amino acid changes that affect protein
function. Nucleic acids research, 31(13), 3812-3814.
Ngamphiw, C., Assawamakin, A., Xu, S., Shaw, P. J., Yang, J. O., Ghang, H., et al. (2011).
PanSNPdb: the Pan-Asian SNP genotyping database. PloS one, 6(6), e21451.
Park, P. J. (2008). Epigenetics meets next-generation sequencing. Epigenetics, 3(6), 318-321.
Patowary, A., Purkanti, R., Singh, M., Chauhan, R. K., Bhartiya, D., Dwivedi, O. P., et al.
(2012). Systematic analysis and functional annotation of variations in the genome of an
Indian individual. Human mutation, 33(7), 1133-1140.
Patwari, P., & Lee, R. T. (2008). Mechanical control of tissue morphogenesis. Circulation
research, 103(3), 234-243.
Prado-Martinez, J., Sudmant, P. H., Kidd, J. M., Li, H., Kelley, J. L., Lorente-Galdos, B., et al.
(2013). Great ape genetic diversity and population history. Nature, 499(7459), 471-475.
Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M. A., Bender, D., et al. (2007).
PLINK: a tool set for whole-genome association and population-based linkage analyses.
The American Journal of Human Genetics, 81(3), 559-575.
Pushkarev, D., Neff, N. F., & Quake, S. R. (2009). Single-molecule sequencing of an individual
human genome. Nature biotechnology, 27(9), 847-850.
Race, Ethnicity, and Genetics Working Group. (2005). The use of racial, ethnic, and ancestral categories in human
genetics research. The American Journal of Human Genetics, 77(4), 519-532.
Rakha, A., Shin, K.-J., Yoon, J. A., Kim, N. Y., Siddique, M. H., Yang, I. S., et al. (2011).
Forensic and genetic characterization of mtDNA from Pathans of Pakistan. International
journal of legal medicine, 125(6), 841-848.
Rasmussen, M., Guo, X., Wang, Y., Lohmueller, K. E., Rasmussen, S., Albrechtsen, A., et al.
(2011). An Aboriginal Australian genome reveals separate human dispersals into Asia.
Science, 334(6052), 94-98.
Redon, R., Ishikawa, S., Fitch, K. R., Feuk, L., Perry, G. H., Andrews, T. D., et al. (2006).
Global variation in copy number in the human genome. Nature, 444(7118), 444-454.
Rizvi, S., Khan, M., Kundi, A., Marsh, D., Samad, A., & Pasha, O. (2004). Status of rheumatic
heart disease in rural Pakistan. Heart, 90(4), 394-399.
Rosenberg, N. A. (2006). Standardized subsets of the HGDP‐CEPH Human Genome Diversity
Cell Line Panel, accounting for atypical and duplicated samples and pairs of close
relatives. Annals of human genetics, 70(6), 841-847.
Saitou, N., & Nei, M. (1987). The neighbor-joining method: a new method for reconstructing
phylogenetic trees. Molecular biology and evolution, 4(4), 406-425.
Salleh, M. Z., Teh, L. K., Lee, L. S., Ismet, R. I., Patowary, A., Joshi, K., et al. (2013).
Systematic pharmacogenomics analysis of a Malay whole genome: proof of concept for
personalized medicine. PloS one, 8(8), e71554.
Sankararaman, S., Mallick, S., Dannemann, M., Prüfer, K., Kelso, J., Pääbo, S., et al. (2014).
The genomic landscape of Neanderthal ancestry in present-day humans. Nature,
507(7492), 354-357.
Schork, N. J., Murray, S. S., Frazer, K. A., & Topol, E. J. (2009). Common vs. rare allele
hypotheses for complex diseases. Current opinion in genetics & development, 19(3), 212-
219.
Schwarz, U. I., Ritchie, M. D., Bradford, Y., Li, C., Dudek, S. M., Frye-Anderson, A., et al.
(2008). Genetic determinants of response to warfarin during initial anticoagulation. New
England Journal of Medicine, 358(10), 999-1008.
Sebastiani, P., Hadley, E. C., Province, M., Christensen, K., Rossi, W., Perls, T. T., et al. (2009).
A family longevity selection score: ranking sibships by their longevity, size, and
availability for study. American journal of epidemiology, kwp309.
Shah, S., Luby, S., Rahbar, M., Khan, A., & McCormick, J. (2001). Hypertension and its
determinants among adults in high mountain villages of the Northern Areas of Pakistan.
Journal of human hypertension, 15(2), 107-112.
Shera, A., Jawad, F., & Maqsood, A. (2007). Prevalence of diabetes in Pakistan. Diabetes
research and clinical practice, 76(2), 219-222.
Sherry, S. T., Ward, M.-H., Kholodov, M., Baker, J., Phan, L., Smigielski, E. M., et al. (2001).
dbSNP: the NCBI database of genetic variation. Nucleic acids research, 29(1), 308-311.
Siva, N. (2008). 1000 Genomes project. Nature biotechnology, 26(3), 256-256.
Speicher, M. R., & Carter, N. P. (2005). The new cytogenetics: blurring the boundaries with
molecular biology. Nature Reviews Genetics, 6(10), 782-792.
Spichenok, O., Budimlija, Z. M., Mitchell, A. A., Jenny, A., Kovacevic, L., Marjanovic, D., et al.
(2011). Prediction of eye and skin color in diverse populations using seven SNPs.
Forensic Science International: Genetics, 5(5), 472-478.
Streib, L. (2007). World’s fattest countries. Forbes.com. Online:
http://www.forbes.com/2007/02/07/worlds-fattest-countriesforbeslife-cx_ls_0208worldfat_5.html
[Accessed 6 March 2013].
Sulem, P., Gudbjartsson, D. F., Stacey, S. N., Helgason, A., Rafnar, T., Magnusson, K. P., et al.
(2007). Genetic determinants of hair, eye and skin pigmentation in Europeans. Nature
genetics, 39(12), 1443-1452.
Tamura, K., Peterson, D., Peterson, N., Stecher, G., Nei, M., & Kumar, S. (2011). MEGA5:
molecular evolutionary genetics analysis using maximum likelihood, evolutionary
distance, and maximum parsimony methods. Molecular biology and evolution, 28(10),
2731-2739.
Tan, J., Yang, Y., Tang, K., Sabeti, P. C., Jin, L., & Wang, S. (2013). The adaptive variant
EDARV370A is associated with straight hair in East Asians. Human genetics, 132(10),
1187-1191.
Taus-Bolstad, S. (2008). Pakistan in pictures: Lerner Books [UK].
Thorn, C. F., Klein, T. E., & Altman, R. B. (2013). PharmGKB: the pharmacogenomics
knowledge base. In Pharmacogenomics (pp. 311-320): Springer.
Veeramah, K. R., & Hammer, M. F. (2014). The impact of whole-genome sequencing on the
reconstruction of human population history. Nature Reviews Genetics, 15(3), 149-162.
Wang, J., Wang, W., Li, R., Li, Y., Tian, G., Goodman, L., et al. (2008). The diploid genome
sequence of an Asian individual. Nature, 456(7218), 60-65.
Wang, K., Li, M., & Hakonarson, H. (2010). ANNOVAR: functional annotation of genetic
variants from high-throughput sequencing data. Nucleic acids research, 38(16), e164-
e164.
Wang, Y., Xu, S., Liu, Z., Lai, C., Xie, Z., Zhao, C., et al. (2013). Meta-analysis on the
association between the TF gene rs1049296 and AD. The Canadian Journal of
Neurological Sciences, 40(05), 691-697.
Wheeler, D. A., Srinivasan, M., Egholm, M., Shen, Y., Chen, L., McGuire, A., et al. (2008). The
complete genome of an individual by massively parallel DNA sequencing. Nature,
452(7189), 872-876.
Whiting, D. R., Guariguata, L., Weil, C., & Shaw, J. (2011). IDF diabetes atlas: global estimates
of the prevalence of diabetes for 2011 and 2030. Diabetes research and clinical practice,
94(3), 311-321.
Wishart, D. S., Knox, C., Guo, A. C., Cheng, D., Shrivastava, S., Tzur, D., et al. (2008).
DrugBank: a knowledgebase for drugs, drug actions and drug targets. Nucleic acids
research, 36(suppl 1), D901-D906.
Wong, L.-P., Ong, R. T.-H., Poh, W.-T., Liu, X., Chen, P., Li, R., et al. (2013). Deep whole-
genome sequencing of 100 southeast Asian Malays. The American Journal of Human
Genetics, 92(1), 52-66.
Wood, M. (2001). In the footsteps of Alexander the Great: a journey from Greece to Asia: Univ
of California Press.
Yoshiura, K.-i., Kinoshita, A., Ishida, T., Ninokata, A., Ishikawa, T., Kaname, T., et al. (2006).
A SNP in the ABCC11 gene is the determinant of human earwax type. Nature genetics,
38(3), 324-330.
Zheng, T., Su, C., Zhao, J., Zhang, X., Zhang, T., Zhang, L., et al. (2013). Effects of CYP3A5
and CYP2D6 genetic polymorphism on the pharmacokinetics of diltiazem and its
metabolites in Chinese subjects. Die Pharmazie-An International Journal of
Pharmaceutical Sciences, 68(4), 257-260.
PUBLICATIONS
Muhammad Ilyas, Jong-Soo Kim, Jesse Cooper, Young-Ah Shin, Hak-Min Kim, Yun Sung
Cho, Seungwoo Hwang, Hyunho Kim, Jaewoo Moon, Oksung Chung, JeHoon Jun, Achal
Rastogi, Sanghoon Song, Junsu Ko, Andrea Manica, Ziaur Rahman, Tayyab Husnain and Jong
Bhak. 2015. Whole genome sequencing of an ethnic Pathan (Pakhtun) from the north-west of
Pakistan. BMC Genomics. 16:172
Muhammad Ilyas, Ziaur Rahman, Tayyab Husnain and Jong Bhak. 2015. Pharmacogenomic
Profile of a Pakistani Individual. Sci. Tech. and Dev. 33 (4): 183-187
APPENDIX-I
WEBSITES USED
1000 Genome Project                       http://www.1000genomes.org
The Personal Genome Project               http://www.personalgenomes.org
Simons Genome Diversity Project           http://www.simonsfoundation.org
Korean Personal Genomes Project           http://kpgp.kr
Complete Genomics                         http://www.completegenomics.com
Iranian Genome Project                    http://www.irangenes.com
Human Genome Organisation                 http://www.hugo-international.org
Harvard E-commons                         http://ecommons.med.harvard.edu
PubMed                                    http://www.ncbi.nlm.nih.gov/pubmed
Omictools                                 http://omictools.com
RNASeqBlog                                http://www.rna-seqblog.com
Seqanswers                                http://seqanswers.com
SNPedia                                   http://www.snpedia.com
Biobase                                   http://www.biobase-international.com
Biocomputing Platforms Ltd                http://www.bcplatforms.com
Bioinformatics Solutions                  http://www.bioinformaticssolutions.com
Gataca                                    http://www.gatacallc.com
Genoptix Medical Laboratory               http://www.genoptix.com
Golden Helix                              http://www.goldenhelix.com
Microsoft                                 http://www.microsoft.com
Unipro                                    http://ugene.unipro.ru/
23 and me                                 http://www.23andme.com
Ancestry.com                              http://www.ancestry.com
Personal Genome Diagnostics               http://www.personalgenome.com
Beijing Genomics (BGI)                    http://www.genomics.cn
Illumina                                  http://www.illumina.com
InterpretOmics                            http://www.interpretomics.co
Population Genomics Initiative (PAPGI)    http://www.papgi.org
Billion Genomes Project                   http://billiongenome.com
Indian Genome Variation Project           http://www.igvdb.res
APPENDIX-II
INSTITUTIONAL REVIEW BOARD (IRB) APPROVAL