WHOLE-GENOME GENETIC DIVERSITY AND FUNCTIONAL CLASSIFICATION OF VARIATIONS OF A PAKISTANI INDIVIDUAL

MUHAMMAD ILYAS

National Centre of Excellence in Molecular Biology

UNIVERSITY OF THE PUNJAB, LAHORE PAKISTAN

2015

Whole-Genome Genetic Diversity and Functional Classification of Variations of a Pakistani Individual

A THESIS SUBMITTED TO

UNIVERSITY OF THE PUNJAB

In Partial Fulfillment of the Requirement for the Degree of

DOCTORATE OF PHILOSOPHY

in MOLECULAR BIOLOGY

(Human Genomics, Bioinformatics)

Submitted by MUHAMMAD ILYAS

Supervisors DR. ZIAUR RAHMAN

PROF. DR. JONG BHAK

National Centre of Excellence in Molecular Biology

University of the Punjab, Lahore, Pakistan

“IN THE NAME OF ALLAH, THE MOST BENEFICENT, THE MOST MERCIFUL”

DEDICATED TO

MY MOTHER AND FATHER

WHOSE AFFECTION, LOVE, ENCOURAGEMENT AND PRAYERS, DAY AND NIGHT,

ENABLED ME TO ACHIEVE SUCH SUCCESS AND HONOR,

ALONG WITH ALL MY HARD-WORKING AND RESPECTED TEACHERS

CERTIFICATE

This is to certify that the experimental work described in the thesis submitted by MUHAMMAD ILYAS has been carried out under my direct supervision. The data and results reported in this manuscript are duly recorded in the Centre’s official notebook(s). I have personally gone through the raw data and certify the authenticity of all the results reported herein. I further certify that these data have not been used, in part or in full, in a manuscript already submitted or in the process of submission in partial or complete fulfillment of the award of any other degree from any other institution at home or abroad. I also certify that the enclosed manuscript has been prepared under my supervision, and I endorse its evaluation for the award of the PhD degree through the official procedures of the Centre/University.

In accordance with the rules of the Centre, data book No. 1078 is declared an unexpendable document that will be kept in the registry of the Centre for a minimum of three years from the date of the thesis defense examination.

Signature of Supervisor ___________________________

Name of Supervisor: Dr. Ziaur Rahman

Signature of Co-Supervisor: ________________________

Name of Co-Supervisor: Prof. Dr. Jong Bhak

SUMMARY

Pakistan covers a key geographic area in human history, being both part of the Indus River region that acted as one of the cradles of civilization and a link between Western Eurasia and Eastern Asia. This region is inhabited by a number of distinct ethnic groups, the largest being the Punjabi, Pathan (Pakhtun), Sindhi, and Baloch. We analyzed the first male Pakistani genome (PTN) from the north-west province of Pakistan by sequencing it to 29.7-fold coverage on the Illumina HiSeq2000 platform. A total of 3.8 million single nucleotide variations (SNVs) and 0.5 million small indels were identified by comparison with the human reference genome. Among the SNVs, 129,441 were novel, and 10,315 nonsynonymous SNVs were found in 5,344 genes. SNVs were annotated for health consequences and high-risk diseases, as well as for possible influences on drug efficacy. We confirmed that the PTN genome presented here is representative of the Pathan/Pakhtun ethnic group by comparing it to a panel of Central Asians from the HGDP-CEPH panels typed for ~650k SNPs. The mtDNA (C4a1a1) and Y haplogroup (L1) of this individual were also typical of his geographic region of origin. The demographic history reconstructed with PSMC highlights a recent increase in effective population size, compatible with the admixture between European and Asian lineages expected in this geographic region. The genome is a useful resource for understanding genetic variation and human migration across the whole Asian continent. Finally, it was concluded that modern Pathans/Pakhtuns are an admixture of European and Asian lineages, which distinguishes them from other world populations. Their genetic makeup will help in discovering rare variants and facilitate the development of personalized medicine.

ACKNOWLEDGEMENTS

At the outset, I bow my head to the Omnipotent, the most merciful, the Compassionate

and the Omniscient Al-Mighty Allah, who showered upon me all HIS blessings throughout my

life and especially for giving me the strength for the completion of this research work.

I wish to acknowledge the remarkable contribution of Prof. Dr. Sheikh Riazuddin (S.I.,

T.I., HI) founder and Ex-Director, and Prof. Dr. Tayyab Husnain (I.F., T.I.) Director, Centre of

Excellence in Molecular Biology in the establishment and strengthening of the prestigious

institute CEMB, where I began learning research and science.

I am also grateful to my supervisor, Dr. Ziaur Rahman, for his guidance, energy, time and other forms of contribution. I am deeply grateful to him for his confidence in me, for being a constant source of inspiration and for always supporting me in pursuing my own ideas, and I am most indebted to him for the extremely friendly atmosphere on both a professional and a personal level. Without his support, my research work would have been impossible.

Foremost, I would like to express my deep and sincere gratitude to my co-supervisor, Prof. Dr. Jong Bhak, Director and CEO of the Personal Genomics Institute, Genome Research Foundation, South Korea. His vision, patience and motivation at every step of the study made it possible for me to work in this exciting and emerging field of research. His encouragement and support helped me understand and carry out my research project in South Korea. The positive atmosphere and excellent working facilities in his laboratory strengthened my devotion to learning and knowledge. My colleagues at Jong’s lab (JongSo Kim, Yunsung Cho, Hakmin, Jesse Cooper and Jaewoo Moon) helped me to complete this research successfully.

I am indebted to the members of the Tonellato lab at Harvard Medical School for providing a stimulating environment for intellectual development and research. From the day I joined the group, Prof. Dr. Peter J Tonellato played a crucial role in getting me up to speed with biomedical informatics and personalized medicine. I benefited constantly from his continuous support and guidance throughout my work. Informal discussions with Michiyo Yamada, Sheida Nabavi, Latrice Landry and Yassine Souilmi were crucial for the success of my research project. My whole stay at Harvard has been a rewarding and most agreeable experience, and Boston is one of the most enjoyable cities I have lived in.

I wish to express my deepest gratitude to my senior colleagues at CEMB: Khalid Masood, who always motivated me to do my best in bioinformatics, Sobia Ahsan Halim, Muhammad Israr, Aneela Yasmin and Shahid ur Rahman, for their helpful guidance. I also acknowledge my lab members Atif Anwar Mirza and Zulfiqar Ali Mir for their kind cooperation during my PhD.

I would also like to express my indebtedness to Prof. Dr. Andrea Manica (Cambridge University, UK), Prof. Dr. Qasim Ayub (Wellcome Trust Sanger Institute, UK), Prof. Dr. Sultan-e-Rome (Government Jehanzeb College Swat), Khwaja Aftab Ahmad (Swat) and Dr. Muhammad Fahim (IBGE Peshawar), who encouraged me by showing interest in my work. They generously provided reading material and shared their knowledge with me. Special thanks are due to Prof. Dr. Habib Ahmad and Prof. Dr. Mukhtar Alam for their kind words and continuous guidance.

I additionally appreciate the support of my friends Ziaur Rahman, Sulaiman Shams,

Imtiaz Ali, Sahib Zar and Inamullah. Their endless help and support allowed me to overcome all

of the difficult times.

Finally, I thank those who are dearest to me, who have loved me unconditionally and stood by me during times of confusion and frustration: my mother and father, my brother Muhammad Abbas, my sisters and my loving wife, who helped me get through some of the most difficult challenges that I have faced to date. I thank my wife for her patience and understanding over the past few years. Last but not least, I am grateful to the rest of my family for their endless love, support and encouragement throughout my entire academic career. My family has been far away from me these years, but they were closer than ever in my mind and heart.

I would like to acknowledge the financial support of the Genome Research Foundation while I was working in South Korea. Thanks are also due to the Higher Education Commission of Pakistan for providing me with the fellowship that enabled me to receive advanced training in personalized genomics and biomedical informatics at Harvard University, Boston, USA.

Many people, especially my classmates and team members, have made valuable comments and suggestions on this project, which inspired me to improve my research. I thank everyone who helped me, directly or indirectly, to complete this dissertation.

Muhammad Ilyas

Lahore, 2015

LIST OF ABBREVIATIONS

BAC: Bacterial Artificial Chromosome
BGI: Beijing Genomics Institute
ddNTP: dideoxynucleotide triphosphate
dNTP: deoxynucleotide triphosphate
EST: Expressed Sequence Tag
FISH: Fluorescent in situ Hybridization
GWAS: Genome Wide Association Study
NGS: Next Generation Sequencing
qPCR: quantitative PCR
SNP: Single Nucleotide Polymorphism
TGS: Third Generation Sequencing
WGA: Whole Genome Amplification
KPGP: Korean Personal Genomes Project
1KGP: 1000 Genomes Project
SNV: Single Nucleotide Variant
CAMDA: Critical Assessment of Massive Data Analysis
CDS: Coding DNA Sequence
UTR: Untranslated Region
NMD: Nonsense-Mediated Decay
PTN: Pathan Genome
PK1: Pakistani Genome (Sindhi)
SJK: First Korean Genome
PGP: Personal Genome Project

TABLE OF CONTENTS

SUMMARY I

ACKNOWLEDGEMENT II

LIST OF ABBREVIATIONS V

LIST OF FIGURES IX

LIST OF TABLES X

CHAPTER 1

1. INTRODUCTION 1

CHAPTER 2

2. LITERATURE REVIEW 7

2.1. SEQUENCING TECHNIQUES 11

2.1.1. HIGH-THROUGHPUT SEQUENCING 11

2.1.2. DE NOVO SEQUENCING 12

2.1.3. RE-SEQUENCING 13

2.1.4. EXOME SEQUENCING 13

2.2. HIGH THROUGHPUT SEQUENCING PLATFORMS 14

2.2.1. ROCHE 454 SYSTEM: PYROSEQUENCING 14

2.2.2. AB SOLID SYSTEM: SEQUENCING BY LIGATION 16

2.2.3. ILLUMINA/SOLEXA SYSTEM: SEQUENCING WITH REVERSIBLE TERMINATORS 17

2.2.4. ION TORRENT: SEMICONDUCTOR SEQUENCING 19

2.2.5. THE THIRD GENERATION SEQUENCER 19

2.3. GENETIC VARIANTS IN HUMAN GENOME 20

2.3.1. SINGLE NUCLEOTIDE VARIANTS/POLYMORPHISMS 21

2.3.2. STRUCTURAL VARIATIONS 22

2.3.3. COPY NUMBER VARIATIONS 22

2.3.4. LINEAGE MARKERS FOR POPULATION STUDY 23

2.3.5. VARIABLE NUMBER TANDEM REPEATS 24

2.3.6. SHORT TANDEM REPEATS (STRS) 24

2.4. APPLICATIONS OF GENOME VARIANTS 25

2.4.1. GENETIC ANCESTRY AND ADMIXTURE MAPPING 26

2.4.2. MEDICAL AND CLINICAL IMPLICATIONS 26

2.4.3. PHARMACOGENOMICS 28

2.5. PERSONAL AND POPULATION GENOME PROJECTS 30

2.5.1. PERSONAL GENOME PROJECT (PGP) 30

2.5.2 1000 GENOMES PROJECT (1KGP) 31

2.5.3 PAN-ASIAN POPULATION GENOMICS INITIATIVE (PAPGI) 31

2.5.4 ONE MILLION GENOMES 31

2.5.5 HUMAN GENOME DIVERSITY PROJECT (HGDP) 32

2.5.6 BILLION GENOMES PROJECT 32

2.5.7 OTHER GENOME CONSORTIUMS 32

CHAPTER 3

3. MATERIALS AND METHODS 33

3.1. SUBJECT SELECTION AND ETHICAL STATEMENT 33

3.2. DATA SOURCES 34

3.3. DNA EXTRACTION 34

3.4. CYTOGENETIC ANALYSIS 35

3.5. LIBRARY PREPARATION AND WHOLE GENOME SEQUENCING 35

3.6. WORKFLOW FOR GENOMIC DATA ANALYSIS 37

3.7. SEQUENCE ALIGNMENT 39

3.8. SNP AND INDEL DETECTION 40

3.9. COPY NUMBER VARIATION DETECTION 40

3.10. FUNCTIONAL ANNOTATION 41

3.11. PHARMACOGENOMICS ANALYSIS 43

3.12. MULTIDIMENSIONAL SCALING AND ADMIXTURE 43

3.13. PAIRWISE SEQUENTIALLY MARKOVIAN COALESCENT ANALYSIS 44

3.14. PHYLOGENOMIC ANALYSIS 45

CHAPTER 4

4. RESULTS 46

4.1. GENOME SEQUENCING AND VARIANTS IDENTIFICATION 46

4.2. FUNCTIONAL CLASSIFICATION AND CLINICAL RELEVANCE OF VARIANTS 49


4.4. COMPARISON OF PTN GENOME TO WORLDWIDE POPULATIONS 58

4.5. COMPARISON WITH OTHER PAKISTANI INDIVIDUALS 61

4.6. DEMOGRAPHIC HISTORY ANALYSIS 64

4.7. MTDNA AND Y-CHROMOSOME ANALYSES 65

CHAPTER 5

5. DISCUSSION 67

5.1. CLINICAL RELEVANCE AND VARIANT CHARACTERIZATION 68

5.2. PHARMACOGENOMIC PROFILE 71

5.3. GENEALOGICAL AND ADMIXTURE ANALYSIS 72

5.4. DEMOGRAPHIC HISTORY ANALYSIS AND ANCESTRAL POPULATION SIZE 73

5.5. CONCLUSION 74

CHAPTER 6

6. REFERENCES 75

LIST OF PUBLICATIONS 92

APPENDIX-I WEBSITE USED 93

APPENDIX-II IRB APPROVAL 94

LIST OF FIGURES

FIGURE 2.1: THE DROP IN THE COST OF SEQUENCING A COMPLETE HUMAN GENOME. 8

FIGURE 2.2: THE PYROSEQUENCING PROCESS. 16

FIGURE 2.3: APPLIED BIOSYSTEMS’ SOLID SEQUENCING BY LIGATION. 17

FIGURE 2.4: REVERSIBLE TERMINATOR CHEMISTRY UTILIZED IN THE ILLUMINA PLATFORMS. 18

FIGURE 3.1: FAMILY PEDIGREE OF DONOR WITH MEMBERS HAVING GENETIC DISORDERS. 33

FIGURE 3.2: CYTOGENETIC ANALYSIS THROUGH GTG BANDING KARYOTYPE AND LEGENDS. 35

FIGURE 3.3: ILLUMINA HISEQ2000 MACHINE AND ACCESSORIES. 36

FIGURE 3.4: LIBRARY QUALITY GENERATED BY BIOANALYZER. 37

FIGURE 3.5: WORKFLOW OF THE NEXT GENERATION SEQUENCING AND BIOINFORMATICS DATA ANALYSIS. 38

FIGURE 3.6: SCHEMATIC REPRESENTATION OF THE PIPELINE DEVELOPED. 42

FIGURE 3.7: SCHEMA OF THE PHARMACOGENOMICS ANALYSIS. 43

FIGURE 4.1: NOVEL SNVS IN PERSONAL GENOMES IN THIRTEEN DIFFERENT ETHNIC GROUPS. 48

FIGURE 4.2: COPY NUMBER VARIATION COUNTS DISTRIBUTED IN EACH CHROMOSOME. 48

FIGURE 4.3: COMPARATIVE VARIANT COUNT OF OTHER REPORTED INDIVIDUAL GENOMES WITH PAKISTANI (PTN) GENOME. 50

FIGURE 4.4: MULTIDIMENSIONAL SCALING (MDS) PLOT GENERATED BY PLINK. 59

FIGURE 4.5: ADMIXTURE RESULTS FOR K = 2 AND K = 3 FOR THE PTN INDIVIDUAL. 60

FIGURE 4.6: CHROMOSOME PAINTING OF POSSIBLE GENOMIC ADMIXTURE. 61

FIGURE 4.7: ADMIXTURE RESULTS OF PAKISTANI PATHAN (PTN) INDIVIDUAL TO OTHER ETHNIC GROUPS IN SOUTH ASIA. 62

FIGURE 4.8: RELATIONSHIP OF PAKISTANI PATHAN INDIVIDUAL TO OTHER ETHNIC GROUPS IN SOUTH ASIA. 63

FIGURE 4.9: PAIRWISE SEQUENTIALLY MARKOVIAN COALESCENT (PSMC) MODEL FOR RECONSTRUCTING PAKISTAN’S DEMOGRAPHIC HISTORY. 64

FIGURE 4.10: PHYLOGENOMIC TREE OF PAKISTANI PTN GENOME WITH OTHER WORLD ETHNIC GENOMES. 66

LIST OF TABLES

TABLE 4.1. SUMMARY OF DATA PRODUCTION AND MAPPING RESULTS 46

TABLE 4.2. SUMMARY OF SNVS FOUND IN PATHAN’S GENOME AND OVERLAPS WITH DBSNP137 47

TABLE 4.3. VARIANTS (SNVS, INDELS AND CNVRS) IDENTIFIED IN PAKISTANI (PTN) GENOME 47

TABLE 4.4. FUNCTIONALLY DAMAGED NOVEL NSSNVS. 51

TABLE 4.5. CLINICAL RELEVANCE CODING SNVS IN PAKISTANI PTN WHOLE GENOME. 53

TABLE 4.6. DAMAGED NSSNVS AND THE DRUGS. 54

TABLE 4.7. LIST OF DRUGS (PHARMGKB) IN THE PTN GENOME. VIP 57

CHAPTER 1

1. Introduction

Next generation sequencing (NGS) technology has become one of the most exciting scientific achievements for the research community. It refers to a set of new DNA sequencing procedures that deliver a remarkable advance in sequencing capacity by running massively parallel reactions on millions of genomic fragments (Mardis, 2008). The cost of sequencing comparatively short DNA fragments is now at least two orders of magnitude lower than that of the usual Sanger procedure (Hudson, 2008). Costs have plummeted in recent years, rapidly outpacing Moore’s law, the traditional benchmark for the decreasing cost of technology (Mayer, 2006). Many techniques, including the latest chemistries, amplification methodologies, and efficient high-resolution microscopy, were remodeled to make this development possible (Park, 2008).
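To make the comparison with Moore’s law concrete, the short Python sketch below models a cost that halves at a fixed interval. The two-year halving time used for the Moore’s-law-like trend, the six-year window and the starting cost are illustrative assumptions, not figures from the cited studies; only the drop of two orders of magnitude comes from the text above.

```python
import math


def cost_after(initial_cost, years, halving_time_years):
    """Cost after `years` if the cost halves every `halving_time_years`."""
    return initial_cost * 0.5 ** (years / halving_time_years)


def implied_halving_time(initial_cost, final_cost, years):
    """Halving time implied by a drop from initial_cost to final_cost over `years`."""
    return years * math.log(2) / math.log(initial_cost / final_cost)


# Assumption: a Moore's-law-like trend halves the cost roughly every 2 years.
moore_cost = cost_after(1_000_000, years=6, halving_time_years=2)

# A 100-fold (two orders of magnitude) drop over the same hypothetical 6 years.
ngs_halving = implied_halving_time(100.0, 1.0, years=6)

print(f"Moore-like cost after 6 years: {moore_cost:,.0f}")
print(f"Halving time for a 100-fold drop in 6 years: {ngs_halving:.2f} years")
```

Under these assumed numbers, a 100-fold drop in six years corresponds to a halving time of less than one year, which is what "outpacing Moore’s law" means in quantitative terms.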

Genome-wide studies using microarray technology have brought important developments over many years (Kelly et al., 2013). Initially, microarray chip technologies were used for gene expression analysis, but they later found extensive use in the estimation of copy number alterations, microRNA studies, genotyping of single nucleotide variants and mapping of binding sites for protein-protein and DNA-protein interactions (Mardis, 2008). However, NGS technology provides important advances and is likely to replace many of the microchip platforms in the near future (Elingarami et al., 2013). Sequencing equipment is still unavailable or unaffordable for many researchers at this time, but due to competing market forces a substantial decrease in prices is expected in the coming years.

A new era of personalized genomics has begun following the advances in sequencing technologies. To date, genome sequences of many individuals from distinct regions have been reported. Venter was the first to have his personal genome sequenced, using the Sanger dideoxy method, which is still the method of choice for de novo sequencing because of its per-base accuracy (99.9%) over long reads of almost 1000 bp (Levy et al., 2007). With the Sanger method, diploid sequences were assembled with phase information, which has not been done for other published genomes (Bentley et al., 2008, Kitzman et al., 2010, Pushkarev et al., 2009). Despite its limitations in read length, which is extremely important for the assembly of contigs and final genomes, it is NGS technology that has made personal genomics possible by dramatically reducing the cost and increasing the efficiency. To date, more than ten individual genome sequences, analyzed by NGS, have been published, including two individuals of northwest European origin (Levy et al., 2007, Wheeler et al., 2008), a Yoruba (Bentley et al., 2008), an Indian Gujarati (Kitzman et al., 2010) as well as an Indian female and a male (Gupta et al., 2012, Patowary et al., 2012), a person from China (Wang et al., 2008), Korean individuals (Kim et al., 2009, Ahn et al., 2009), an Aboriginal Australian (Rasmussen et al., 2011), a Japanese (Fujimoto et al., 2010), a Pakistani (Azim et al., 2013), a Sri Lankan (Dissanayake et al., 2011) and a Turkish individual (Dogan et al., 2014). NGS allows researchers to map short reads to a known reference genome and hence to circumvent expensive and laborious long-fragment-based de novo assembly (Metzker, 2010). However, as demonstrated by the large percentage of unmapped data in previous human genome re-sequencing projects, a re-sequenced genome may not fully reflect ethnic and individual genetic differences, because its assembly depends on the previously sequenced genome. Since the introduction of NGS, the bottleneck in sequencing the genomes of a whole population is no longer the sequencing process itself but the bioinformatics: fast and accurate mapping to the available data, structural variation analyses, phylogenetic analyses, association studies, and application to phenotypes such as diseases (Ahn et al., 2009).
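The per-base accuracy quoted above for Sanger reads can be restated on the Phred quality scale, Q = -10 log10(p), where p is the per-base error probability. The following minimal Python sketch does this conversion for the 99.9% accuracy and roughly 1000 bp read length mentioned in the text; the function names are hypothetical and chosen for illustration only.

```python
import math


def phred_quality(accuracy):
    """Phred-scaled quality: Q = -10 * log10(per-base error probability)."""
    error_probability = 1.0 - accuracy
    return -10.0 * math.log10(error_probability)


def expected_errors(read_length_bp, accuracy):
    """Expected number of erroneous bases in a read of the given length."""
    return read_length_bp * (1.0 - accuracy)


q = phred_quality(0.999)             # accuracy quoted in the text
errors = expected_errors(1000, 0.999)  # read length quoted in the text

print(f"Phred quality: Q{q:.0f}")
print(f"Expected errors per 1000 bp read: {errors:.1f}")
```

An accuracy of 99.9% therefore corresponds to about Q30, or on average one erroneous base in a 1000 bp read.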

Sequencing technology is improving fast, with a drastic reduction in its costs (Lander et al. 2001). Due to these advances, the knowledge of human genetic diversity and population history has greatly expanded (Veeramah and Hammer 2014), enabling us to investigate variants with health consequences and paving the way to personalized medicine (Feero and Guttmacher, 2014). Genome-wide association studies (GWAS) have characterized the function of thousands of common SNVs, but there are still millions of variants left unexplored (Sebastiani et al. 2009). Therefore, whole genome sequencing is necessary for a detailed study of rare genomic variants. A number of international consortia have started sequencing the whole genomes of large panels, including the 1000 Genomes Project, which covers populations from Nigeria, Japan, China, Europe, Kenya, Italy, Peru, India and the United States (www.1000genomes.org); the PGP consortium (www.personalgenomes.org); the Simons Genome Diversity Project (www.simonsfoundation.org), which consists of data from 260 genomes from 127 populations (Africans, Native Americans, Central Asians or Siberians, East Asians, Oceanians, South Asians and West Eurasians); the Korean Personal Genomes Project (kpgp.kr); Complete Genomics (www.completegenomics.com); the Iranian Genome Project (www.irangenes.com/); and the 100 Malay genomes (Wong et al. 2013). These consortia, as well as several geographically more restricted projects, aim to understand the functional aspects of both common and unique variants in humans. On average, any two humans are about 99.9% similar (Collins and Mansoura, 2001); genetic variants are the remaining genetic differences between individuals or populations. Even identical twins, who develop from one zygote, are not genetically identical: they carry genetic variations due to mutations occurring during development (Patwari and Lee, 2008). This information makes each person unique. The study of genetic variation, also known as variomics, has important applications in ancestry and clinical studies. Researchers are using these variants to understand ancient humans, their migrations and the admixed genetic structure that relates them to other diverse populations in the world (Schork et al., 2009). Some disease-associated variants occur more frequently in individuals from certain geographic regions. Researchers around the globe are searching for such rare and common variants to solve the mystery of different diseases (Bodmer and Bonilla, 2008). Genetic variations are of many kinds, ranging from point mutations, e.g. SNPs, to large chromosomal alterations, e.g. CNVs. A SNP is a single-nucleotide difference between members of a species that occurs in at least about 1% of the population (Collins et al., 1998). Approximately 30 million polymorphic positions have been reported in humans so far. A CNV is a large chromosomal region that has been deleted or duplicated; CNVs have also been reported to have strong associations with diseases such as cancer, autism and other neurological disorders (Rendon et al., 2006).
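Two of the quantities in this paragraph lend themselves to a short worked example: the roughly 1% frequency threshold used to call a variant a polymorphism, and the number of differences implied by 99.9% similarity. The Python sketch below is illustrative only; the genome size of 3.1 Gb, the allele counts and the function names are assumptions introduced here, not values from this study.

```python
def minor_allele_frequency(allele_count, total_alleles):
    """Frequency of the less common allele at a biallelic site."""
    freq = allele_count / total_alleles
    return min(freq, 1.0 - freq)


def is_polymorphism(maf, threshold=0.01):
    """True if the minor allele reaches the (assumed) 1% frequency threshold."""
    return maf >= threshold


def expected_differences(genome_size_bp, similarity):
    """Number of differing positions implied by a given sequence similarity."""
    return round(genome_size_bp * (1.0 - similarity))


# Hypothetical site: alternative allele seen 30 times among 2000 chromosomes.
common_site = minor_allele_frequency(30, 2000)
# Hypothetical site: alternative allele seen 5 times among 2000 chromosomes.
rare_site = minor_allele_frequency(5, 2000)

ASSUMED_GENOME_SIZE_BP = 3_100_000_000  # assumption: approximate haploid genome
differences = expected_differences(ASSUMED_GENOME_SIZE_BP, 0.999)

print(common_site, is_polymorphism(common_site))
print(rare_site, is_polymorphism(rare_site))
print(f"{differences:,} differing positions at 99.9% similarity")
```

With the assumed genome size, 99.9% similarity leaves on the order of three million differing positions between two genomes, the same order of magnitude as the SNV counts reported for individual genomes.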

Besides their value for biomedicine, individual genome sequences are a rich source of information about human evolution (Sankararaman et al., 2014). Human DNA can help us explore the history and peopling of a region, and various groups have undertaken studies in this regard (Do et al., 2015). It has previously been reported that the people living in the northwest province of Pakistan carry a minor genetic contribution from Iranians, Arabs, Turks and Greeks (Firasat et al., 2007). The claim of a Greek contribution is mainly based on the invasion of the Indian sub-continent by Alexander the Great in 327-323 BC and the subsequent stay of Greek soldiers in the area (Mansoor et al., 2001). Other historians mention that, when Afghanistan and present-day Pakistan were the eastern provinces of Xerxes’s Persian kingdom, Greek slaves were brought to and kept in this region about 150 years before Alexander’s arrival (Wood, 2001).

Pakistan lies at an important junction between the Indian sub-continent in the East and the Central Asian States in the West, while China lies to the North. Due to its particular geography, climate and socio-religio-cultural record, a number of ethnic and linguistic groups, such as the Punjabi, Pathan, Sindhi and Baloch, live in the country (Bolstad, 2010). A number of these groups have been included in genetic panels typed for uniparental microsatellites and SNPs (Cavalli-Sforza 2005). The Human Genome Diversity Panel included 190 individuals belonging to eight different ethnic groups from Pakistan, who were typed for ~650K SNPs. This left much genetic information unexplored, so whole genome sequencing was needed to uncover the hidden information in the genetic makeup of Pakistani populations. Until now, only one male Pakistani individual, of Sindhi ethnic origin, has been sequenced (Azim et al. 2013).

Here we report the whole genome sequence of an individual from Khyber Pakhtunkhwa, the north-west province of Pakistan. The genome was aligned to the human reference genome, which is a composite of several ethnic populations. We identified a number of variants, including SNPs, indels and CNVs, in this northwestern Pakistani genome. Established methods were used to obtain highly reliable variants for medical consideration. ns-SNPs, exonic indels and copy number alterations were screened for potential clinical phenotypes. Several other complete genome sequences reported from different ethnic populations were used to understand the genetic ancestry, migration patterns and population bottlenecks of the Pakistani population. Variants were then annotated and scanned for associated functions, along with SNVs that could modulate drug response. Possibly deleterious non-synonymous SNVs (nsSNVs) were investigated for their potential effect on the pharmacokinetics and pharmacodynamics of drugs. Additionally, multiple analytical approaches were used to assess the influence of ancestral contributions within the Pakistani genome. This genome is a useful resource for understanding genetic variation and human migration across the whole of Asia. The genetic data and variant functions for the Pakistani individual genome (PTN) will provide an important public resource, which will be helpful for clinical genetics research and diagnostics.
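The distinction between synonymous and non-synonymous coding SNVs mentioned above can be illustrated with the standard genetic code. The Python sketch below is a minimal illustration under stated assumptions, not the annotation pipeline used in this thesis: it takes a reference codon on the coding strand, a 0-based position within that codon and an alternative base, and the function name and category labels are hypothetical. Stop-gain and stop-loss changes are reported separately here, although they are often counted among non-synonymous variants.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snv(ref_codon, position, alt_base):
    """Classify a single-base change within a codon (position is 0-based)."""
    alt_codon = ref_codon[:position] + alt_base + ref_codon[position + 1:]
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "stop-gain"
    if ref_aa == "*":
        return "stop-loss"
    return "nonsynonymous"


print(classify_coding_snv("GAA", 2, "G"))  # GAA (Glu) -> GAG (Glu)
print(classify_coding_snv("GAG", 1, "T"))  # GAG (Glu) -> GTG (Val)
print(classify_coding_snv("TGG", 2, "A"))  # TGG (Trp) -> TGA (stop)
```

Only changes that alter the encoded amino acid are candidates for the functional and pharmacogenomic screening described in the paragraph above.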

CHAPTER 2

2. Literature Review

Genomics is the field of biological sciences that deals with recombinant DNA, DNA sequencing methodology, and computational analysis of the structure and function of the genome, the entire set of DNA within a single cell of an organism (Bild et al., 2014). Developments in the field of genomics have enabled revolutionary research into even the most complex biological systems, such as the brain (Biswal et al., 2010). Intragenomic phenomena such as heterosis, epistasis and pleiotropy are also included in this field of biology (Ragunath et al., 2014). In contrast, the search for the function and role of single genes is the primary focus of molecular genetics and is a common area of interest for modern medical and biological research (Carroll 2003).

A draft of the human genome sequence was produced through collaboration among many international institutes (Collins et al., 2003). They presented the primary analysis of their data, showing some features that can be observed through the analysis of a sequence (Sachidanandam et al., 2001). A draft chimpanzee genome sequence was presented and compared with the human genome, marking the differences between the two (Prüfer 2012). The population genetics and phylogenetic relationships of humans were also investigated through the chimpanzee genome (The Chimpanzee Sequencing and Analysis Consortium 2005). HapMap-III helped in characterizing 3.1 million SNVs in 270 human individuals from 4 diverse populations of different geographical origin (Pemberton et al., 2010). The regions shared among different populations were also defined (Li et al., 2008). An accurate, economical and rapid approach for studying intra-species genetic variation has been described (Bentley et al., 2008). The low-cost experimental method of reversible terminator chemistry was used to decode the genome of a Yoruba male from Ibadan, Nigeria. Four million single nucleotide polymorphisms were characterized, along with 400,000 structural variants (Figure 2.1) (Manolio and Collins 2009).

Figure 2.1: The drop in the cost of sequencing a complete human genome using Next Generation Sequencing technologies. (http://www.meragenome.com)

A snapshot of the Next Generation Sequencing approach to understanding the properties and functions of a genome was provided (Marguerat et al., 2008). Microarray-based assays are being superseded by sequencing-based assays, and the data obtained from these distinct approaches were contrasted and compared (Laird 2010). The first Asian individual genome was sequenced using massively parallel sequencing technology (Wang et al., 2008). Three million single nucleotide polymorphisms were identified in this genome with high accuracy and consistency (Li et al., 2009). Through these results, the potential importance of high-throughput sequencing technology for individual genomics was described (Wang et al., 2008). The individual genome of James Watson was reported within a couple of months through massively parallel sequencing in picolitre-size reaction vessels (Wadman 2008). This was the first time a genome was sequenced via NGS, which made it possible to obtain a personal genome sequence in a very short time (Wheeler et al., 2008). A single-molecule method was reported for the sequencing of an individual human genome (Pushkarev et al., 2009). The genome of an anonymous African individual was sequenced using a ligation-based sequencing assay (McKernan et al., 2009). This method was used because it improves the accuracy of results through a unique error correction method (Zhang et al., 2011). The first male Korean individual genome was sequenced using Illumina paired-end sequencing methods (Ahn et al., 2009). The results were analyzed and compared with the Chinese genome (YH), the only other available Asian genome, to observe significant differences between the genomes of these closely related ethnic groups (Li et al., 2009). A combined approach was used to decode the other Korean genome sequence, AK1. The approach included complete genome sequencing by the shotgun method, targeted BAC sequencing and high-resolution comparative genomic hybridization via traditional microchips (Kim et al., 2009).

A genome sequencer with efficient imaging and lower reagent consumption was

developed, which used cPAL (combinatorial probe-anchor ligation) chemistry and assayed each

base from self-assembling DNA nanoballs or patterned nanoarrays (Drmanac et al., 2010).

Researchers used this technology to sequence three human genomes, owing to its high accuracy

and the affordable cost of sequencing consumables (Liu et al., 2012).

An era of personalized genomics has been initiated by the advancements in

sequencing technologies. Many individual genomes have been reported from distinct regions,

such as two individuals of northwest European origin (Wheeler et al., 2008; Levy et al.,

2007), a Yoruba (Bentley et al., 2008), an Indian Gujarati (Kitzman et al., 2010) as well as an

Indian female and a male (Gupta et al., 2012; Patowary et al., 2012), a person from China

(Wang et al., 2008), Korean individuals (Kim et al., 2009; Ahn et al., 2009), an Aboriginal

Australian (Rasmussen et al., 2011), a Japanese (Fujimoto et al., 2010), and 1,000 genomes

from a consortium (Dits, 2010). The complete genome sequences derived from numerous

diverse ethnic populations are helping us to understand genetic ancestry, migration patterns

and population bottlenecks.

Venter was the first to have his personal genome sequenced, using the Sanger dideoxy method,

which is still the method of choice for de novo sequencing (Levy et al., 2007). With the Sanger

method, diploid sequences were assembled with phase information, which has not been done

for other published genomes (Bentley et al., 2008; Kitzman et al., 2010; Pushkarev et al.,

2009). Despite limitations in read length, which is extremely important for the assembly of

contigs and final genomes, it is next generation sequencing (NGS) technology that has made

personal genomics possible by dramatically reducing the cost and increasing the efficiency

(Metzker 2010). Scientists can simply map short reads from an NGS machine to a reference

sequence to re-sequence a genome, avoiding expensive and laborious long-fragment-based

de novo assembly (Goto et al., 2011). It should be noted, however, that a re-sequenced

genome may not fully reflect ethnic and individual genetic differences because its assembly

depends on the previously sequenced genome, as demonstrated by the large percentage of

unmapped data in previous human genome re-sequencing projects (Halaschek-Wiener et al.,

2009). Since the introduction of NGS, the bottleneck in sequencing the genomes of a whole

population is no longer the sequencing process itself but the bioinformatics: fast and accurate

mapping to known data, structural variation analyses, phylogenetic analyses, association studies,

and application to phenotypes such as diseases (Veltman et al., 2013).
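
Mapping short reads to a reference, and the way reads that differ too much from the reference remain unmapped, can be illustrated with a minimal sketch. The code below is a toy seed-and-extend mapper written for illustration only; the function names, the seed length k and the mismatch threshold are assumptions and do not correspond to any production aligner.

```python
def build_kmer_index(reference, k):
    """Index every k-mer of the reference by its start positions."""
    index = {}
    for i in range(len(reference) - k + 1):
        index.setdefault(reference[i:i + k], []).append(i)
    return index


def map_read(read, reference, index, k, max_mismatches=1):
    """Return (position, mismatches) of the best hit, or None if unmapped.

    The first k bases of the read are used as an exact seed; each seed hit
    is extended without gaps and accepted if it has few enough mismatches.
    """
    best = None
    for pos in index.get(read[:k], []):
        window = reference[pos:pos + len(read)]
        if len(window) < len(read):
            continue  # read would run off the end of the reference
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (pos, mismatches)
    return best


reference = "ACGTTGCATGCCGATAGCTA"
index = build_kmer_index(reference, k=4)
print(map_read("GCATGCCG", reference, index, k=4))  # exact match
print(map_read("GCATGACG", reference, index, k=4))  # one mismatch (candidate variant)
print(map_read("TTTTTTTT", reference, index, k=4))  # absent from reference: unmapped
```

A read carrying sequence that is absent from the reference is simply reported as unmapped, which is why a re-sequenced genome can under-represent individual-specific sequence.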

2.1 Sequencing Techniques:

NGS technologies have made it possible for researchers to generate large amounts of

sequence data at high speed and at a cost reduced to between 0.1% and 4% of that of the Sanger

system, although the platforms differ in their error profiles and limitations (Kircher 2012). The

choice of an appropriate sequencing platform depends on the research project (Ekblom and Wolf

2014; Meldrum et al., 2011). In the last few years, the balance of effort has shifted from

sequencing to computational analysis of the generated data (Bielejec et al., 2014). In future,

researchers are expected to spend more time, expertise and funds on analyzing the generated data

(Burrows and Savage 2014). Comparatively small research teams will find it hard to arrange

and manage the infrastructure needed to store and analyze hundreds of terabits of raw and

processed sequencing data (Sathi 2014). Even well-established genome centers face the same

problems in the ongoing use of NGS platforms (Eisenstein 2012). Current equipment is therefore

likely to be improved to further increase throughput and lower the price of decoding DNA

molecules (Glenn 2011). All these advancements will be helpful for future research in biological

data analysis.

2.1.1 High-throughput sequencing

High-throughput sequencing techniques have expanded vastly over the last few

years. Initial determination of a draft of the human genome took ten years, at an estimated cost of

US$ 3 × 10^9 (Del Giacco and Cattaneo 2012). Instruments exist that can produce 250 Gb per

week (Lesk 2011). The largest dedicated institution in the field, the BGI (formerly the Beijing

Genomics Institute, but currently in Shenzhen), has 128 such instruments (Rubenstein 2010).

Each can produce 25 × 10^9 bp per day. This corresponds to one human genome at over 8X

coverage (Del Giacco and Cattaneo 2012). Running at full capacity, these resources could

produce 10,000 human genomes per year.
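
The throughput figures quoted above can be checked with a short calculation. The sketch below assumes a haploid human genome size of 3.1 × 10^9 bp and 365 running days per year; both values, and the function names, are assumptions introduced here for illustration.

```python
GENOME_SIZE_BP = 3.1e9             # assumed haploid human genome size
BP_PER_INSTRUMENT_PER_DAY = 25e9   # figure quoted in the text
INSTRUMENTS = 128                  # figure quoted in the text
GENOMES_PER_YEAR = 10_000          # figure quoted in the text


def daily_coverage(bp_per_day, genome_size):
    """Fold coverage of one genome produced by one instrument in one day."""
    return bp_per_day / genome_size


def coverage_per_genome(instruments, bp_per_day, genomes_per_year,
                        genome_size, days_per_year=365):
    """Average fold coverage if the yearly output is split over the genomes."""
    total_bp = instruments * bp_per_day * days_per_year
    return total_bp / (genomes_per_year * genome_size)


print(round(daily_coverage(BP_PER_INSTRUMENT_PER_DAY, GENOME_SIZE_BP), 2))
print(round(coverage_per_genome(INSTRUMENTS, BP_PER_INSTRUMENT_PER_DAY,
                                GENOMES_PER_YEAR, GENOME_SIZE_BP), 1))
```

Under these assumptions one instrument yields slightly more than 8X per day, and 10,000 genomes per year corresponds to an average depth of roughly 38X per genome.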

Moreover, there is no reason to think that technical progress will not continue to

accelerate. There are two aspects of a large-scale sequencing project (Lander et al., 2001). One is

the generation of the raw data (Ramos et al., 2011); the other is the assembly of these data. Most

methods sequence long DNA molecules by fragmenting them and partially sequencing the pieces

(Alberts et al., 2002). To determine the first genome from a species, these short sequences must

be assembled into the whole sequence, using overlaps between the individual fragments (Li et

al., 2010). The typical length of the individual short sequences reported is called the read length

of the method. The goals of contemporary technical development are to increase not only the

number of bases sequenced per unit time and per unit cost, but also the read length (DePristo et

al., 2011). Both the generation of raw data and assembly depend crucially on effective and

efficient computer programs. Some contemporary genome centres have as many computational

biologists on their staffs as 'wet-lab' scientists (Sloot et al., 2006). The very high throughput

of new instruments allows several types of biological questions to be addressed (Mardis 2010).
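
How overlaps between fragments allow short reads to be assembled into a longer sequence can be sketched with a toy greedy assembler. This is an illustrative simplification, not the algorithm of any assembler cited here; the function names and the minimum overlap length are assumptions, and real assemblers must additionally handle sequencing errors, repeats and both strands.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if the longest such overlap is shorter than min_len.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # no remaining overlaps: leave separate contigs
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTTAG"]))
```

Longer reads give longer and less ambiguous overlaps, which is one reason read length matters for assembly.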

2.1.2 De novo sequencing

De novo sequencing of a genome is a challenging task because there is no reference to

compare it with (Davey et al., 2011; Elshire et al., 2011). Researchers working on such projects

obtain millions of short DNA fragments of about 200 bp in size (Robasky et al., 2014;

Grabherr et al., 2011; Butler et al., 2008). They therefore need high-coverage data to

assemble a complete genome. Designing new and advanced bioinformatics algorithms

and computational tools for efficient de novo assembly is currently an emerging field of science

(Schatz et al., 2010).

2.1.3 Re-Sequencing

Re-sequencing the genome of a species is much easier than de novo sequencing

(Bentley et al., 2008). The DNA fragments generated are compared to a reference genome that

has already been successfully assembled by de novo analysis (Del Giacco and Cattaneo 2012).

Sequencing coverage must be sufficient to avoid errors in sequence determination and variant

calling.
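
The relationship between the number of reads and the resulting coverage can be made concrete with a short calculation. The sketch below uses the standard expected-coverage formula C = N × L / G and, as a simplifying assumption, a Poisson model of per-base depth (the Lander-Waterman idealization); the function names and example numbers are illustrative only.

```python
import math


def expected_coverage(num_reads, read_length, genome_size):
    """Average fold coverage C = N * L / G."""
    return num_reads * read_length / genome_size


def fraction_uncovered(coverage):
    """Expected fraction of bases with zero reads under a Poisson model."""
    return math.exp(-coverage)


def prob_depth_below(coverage, min_depth):
    """P(depth < min_depth) for a base, assuming Poisson-distributed depth."""
    return sum(math.exp(-coverage) * coverage ** k / math.factorial(k)
               for k in range(min_depth))


c = expected_coverage(900_000_000, 100, 3_000_000_000)
print(c)                        # average depth for 900 million 100 bp reads
print(prob_depth_below(c, 10))  # chance that a base has fewer than 10 reads
print(prob_depth_below(8, 10))  # the same at only 8X average depth
```

Under this model most positions fall below a depth of 10 when the average is 8X, whereas at 30X almost none do, which illustrates why low coverage leads to unreliable variant calls.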

2.1.4 Exome sequencing

The exome comprises the regions of the human genome that encode the proteins

necessary for the human body (Ng et al., 2008). One goal of re-sequencing is to determine how

the genome of an individual varies from the reference genome. Approximately three percent of the

human genome consists of exons, which are estimated to number more than 150 thousand (Ng et

al., 2009). Inherited disorders are sometimes due to the abnormal behavior of a certain protein.

The reason behind this abnormality is a mutation in the coding region of an exon

(Baralle et al., 2005). Next generation sequencing is helping researchers to identify these

variants by sequencing only the exome (Ng et al., 2010), so they do not need to sequence the

whole genome of a patient to investigate a pathogenic variant.

2.2 High Throughput Sequencing Platforms:

After the successful completion of the first human genome project, companies

such as 454 and Solexa launched their sequencing systems, beginning in 2005 (Bennett et al.,

2005). Later, the SOLiD parallel sequencing system was released. This more powerful technology,

known as next generation sequencing, performed very well in producing accurate sequence

results compared with Sanger sequencing (Zhao and Grant 2011). The pioneering SOLiD and 454

technologies were then acquired by Applied Biosystems and Roche, respectively, while Illumina

purchased Solexa (McPherson 2014). The three companies soon improved the performance of

their systems.

2.2.1 Roche 454 System: Pyrosequencing

Roche 454 is one of the pioneers of commercially successful NGS systems and is based on

pyrosequencing technology (Capobianchi et al., 2013). In the pyrosequencing procedure, one type

of nucleotide at a time is washed over many copies of the sequence to be determined. If the

nucleotide is complementary to the DNA template, the polymerase incorporates it (Huse et al.,

2007). The polymerase extends the longest possible run of complementary nucleotides, after which

incorporation stops until the next nucleotide is supplied (Kircher and Kelso 2010). In 2005,

Roche-454 parallelized this technique on a picotiter plate for high-throughput sequencing (Mardis

2008). Each of the two million wells of the plate has room for exactly one 28-µm diameter bead

covered with copies of the sequence to be read (Figure 2.2) (Margulies et al., 2008).

The main prerequisite of the pyrosequencing method is to cover single beads with many

copies of the same molecule (Kircher and Kelso 2010), by making libraries in which every

molecule receives two different adapter sequences, one at the 5′ end and one at the 3′ end of the

chain (Metzker 2010). Ligation of two synthesized oligonucleotides is required to prepare the

454/Roche sequencing library (Kircher 2011). The adapters are complementary to

oligonucleotides on the beads; consequently, molecules attach to the beads by hybridization

(Dressman et al., 2003). Empty beads can then be separated from the others by means of the

second adapter, and the loaded beads are used in the process (Gansauge and Meyer 2013).

It is now possible to sequence 1.5 million beads in a single reaction and to determine about

500 nucleotides per read using the updated version of the 454/Roche platform (Casals et al.,

2012). Read length is determined by the number of flow cycles, the base chemistry and the

pattern of bases in the DNA being sequenced (Haydock et al., 2015). The number of flow cycles

is currently limited to 200, which produces reads about 400 nucleotides long (Buermans and

Dunnen 2014). It is estimated that the available Roche platforms can generate 750 Mb of

sequence in a day at a cost of $20 per Mb (Hui 2014).
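
The dependence of read length on both the number of flow cycles and the order of bases in the template can be illustrated with an idealized simulation. The sketch below assumes a repeating T, A, C, G flow order and noise-free signals; the flow order, the function name and the example sequences are assumptions made for illustration, not a description of the instrument software.

```python
def simulate_flows(read, flow_order="TACG", cycles=200):
    """Simulate ideal pyrosequencing signals for the strand being synthesized.

    In each flow one nucleotide type is offered; the signal equals the
    length of the homopolymer run incorporated at the current position.
    Returns (signals, bases_read), where signals is a list of (base, run).
    """
    signals = []
    pos = 0
    for _ in range(cycles):
        for base in flow_order:
            run = 0
            while pos < len(read) and read[pos] == base:
                run += 1
                pos += 1
            signals.append((base, run))
        if pos == len(read):
            break  # whole read has been synthesized
    return signals, pos


signals, n = simulate_flows("TTACGGA")
print(signals, n)
print(simulate_flows("TACG", cycles=1)[1])  # bases read in one cycle
print(simulate_flows("GCAT", cycles=1)[1])  # same length, different order
```

A sequence that follows the flow order is read several bases per cycle, while one that runs against it advances only one base per cycle, so the same number of cycles yields different read lengths.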

Figure 2.2: The pyrosequencing process (Kircher 2011).

2.2.2 AB SOLiD System: Sequencing by Ligation

Applied Biosystems (ABI) bought the SOLiD technology in 2006 and released it for

commercial use in late 2007 (Coombs 2008). Harvard University developed this system and

later upgraded it into a cheaper version known as the Polonator, in joint work with Dover System

(Datta et al., 2010). Later, a company named Complete Genomics Inc. was established, which

started a human genome sequencing service (Kircher and Kelso 2010). It also uses the

technology developed at Harvard, but with a modified library preparation strategy. The clonal

sequencing features are created by emulsion polymerase chain reaction rather than bridge PCR

(Figure 2.3) (Voelkerding et al., 2009). SOLiD uses a di-base technique that interrogates two

DNA bases at each step, whereas the Illumina platform reads the nucleotide sequence directly

(Park 2009). The ABI SOLiD uses only four dyes, each of which represents four of the sixteen

possible di-base combinations. Each base is interrogated twice as the machine moves along the

read, which makes it possible to identify and remove errors generated by the system during the

sequencing process. The updated SOLiD systems are able to produce about 1 billion 50 bp reads

per run, with up to 100 Gb of data in a day (John and Grody 2008).
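
The di-base principle can be illustrated with a small encoder and decoder for two-base (colour-space) encoding. The sketch below uses the commonly described scheme in which four colours each stand for four of the sixteen dinucleotides and the colour equals the XOR of two-bit base codes; the function names are assumptions, and the mapping of colour numbers to physical dyes is not modelled.

```python
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def encode_colors(seq):
    """Encode each overlapping pair of bases as one of four colours (0-3)."""
    return [BASE_TO_BITS[a] ^ BASE_TO_BITS[b] for a, b in zip(seq, seq[1:])]


def decode_colors(first_base, colors):
    """Recover the base sequence from a known first base and the colours."""
    bases = [first_base]
    for color in colors:
        bases.append(BITS_TO_BASE[BASE_TO_BITS[bases[-1]] ^ color])
    return "".join(bases)


reference = encode_colors("ATGGC")
variant = encode_colors("ATCGC")   # single-base change G -> C at position 3
print(reference, variant)
print(decode_colors("A", reference))
```

Because every base contributes to two adjacent colours, a true single-base variant changes two neighbouring colours, whereas an isolated colour change is more likely a measurement error; this is the basis of the error checking described above.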

Figure 2.3: Applied Biosystems' SOLiD sequencing by ligation (Kircher 2011).

2.2.3 Illumina-Solexa System: Sequencing with Reversible Terminators

The Genome Analyzer was first introduced in 2006 by Solexa, which was then acquired by

Illumina in 2007 (Ansorge 2009). The amplified sequencing features in this system are created

by bridge PCR, and sequencing is based on sequencing by synthesis, which is more similar to

Sanger technology (Ross and Cronin 2011). Each incorporated nucleotide is recorded by imaging

during this procedure and is then converted into a base call (Branton et al., 2008). The process

starts with library preparation and amplification for sequencing. The double-stranded library is

then converted into single-stranded molecules, which are loaded into the flow cell according to

the protocol. Oligonucleotides in the flow cell then hybridize to the library molecules (Malone

and Oliver 2011). Amplified regions of the DNA template are then clustered together on the

surface. Approximately 1000 copies of the template are present in each cluster (Lagally et al.,

2001). With the HiSeq 2000, about 30 million regions can be amplified (Gilbert et al., 2010).

The eight lanes of the flow cell can sequence eight independent libraries in parallel. Single-

stranded molecules are generated in each cluster (Figure 2.4), and the sequencing primer is

hybridized to the adaptors. The images obtained are then analyzed, poor-quality reads are

filtered out, and the final output data files are in FASTQ format (Martin 2011). Illumina

machines are capable of decoding 100 base pairs per read with a comparatively low error rate.

Each run can produce almost 20 Gb of data in less than 24 hours (Quail et al., 2008).
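
The FASTQ output and the filtering of poor-quality reads can be illustrated with a minimal parser. The sketch below assumes well-formed four-line records and Phred+33 quality encoding; the function names and the mean-quality threshold of 20 are assumptions for illustration and are not the filter applied by the instrument software.

```python
def parse_fastq(lines):
    """Yield (name, sequence, quality) from an iterable of FASTQ lines.

    Assumes each record has exactly four lines: '@name', sequence,
    '+' separator, and a quality string of the same length as the sequence.
    """
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # skip blank lines between records
        seq = next(it).rstrip("\n")
        next(it)  # '+' separator line
        qual = next(it).rstrip("\n")
        yield header[1:], seq, qual


def mean_quality(qual, offset=33):
    """Mean Phred score of a quality string (Phred+33 assumed)."""
    return sum(ord(ch) - offset for ch in qual) / len(qual)


def filter_reads(records, min_mean_q=20):
    """Keep only reads whose mean Phred quality reaches the threshold."""
    return [rec for rec in records if mean_quality(rec[2]) >= min_mean_q]


example = ["@read1", "ACGT", "+", "IIII",
           "@read2", "ACGT", "+", "!!!!"]
print(filter_reads(parse_fastq(example)))
```

In this example the character I encodes a Phred score of 40 and the character ! a score of 0, so only the first read passes the filter.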

Figure 2.4: Reversible terminator chemistry utilized in the Illumina platforms (Kircher 2011).

2.2.4 Ion Torrent: Semiconductor Sequencing

In 2010, Life Technologies released its Personal Genome Machine (Ion Torrent PGM)

(Quail et al., 2012). It is a benchtop high-throughput sequencer that uses semiconductor

sequencing technology and is used for genome re-sequencing (Egan et al., 2012). Its low cost and

easy-to-use sample preparation method help to reduce the burden on core facilities and

encourage the use of NGS in medical fields (Gullapalli et al., 2012). The Ion Torrent PGM is

commercially available and can analyze medical samples with high productivity and accuracy.

Using semiconductor-based technology, the Ion Torrent produces direct sequence reads without

an optical interface (Delseny et al., 2010). A pH sensor detects the protons released when a

nucleotide is incorporated (Toumazou et al., 2013). The Ion Torrent PGM is the first commercial

sequencer that does not need fluorescence or a camera for scanning, which accounts for its high

speed, lower price and smaller equipment size (Rothberg et al., 2011). The error rate is

comparatively high and tends to increase in genomic regions where real polymorphism is also

higher (Derrien et al., 2012). Reducing these errors is therefore the biggest challenge for

analysts. The per-base accuracy was validated by the company in 2011 as 99.6%, based on

fifty-base reads with one hundred Mb per run (Westerfield 2013). The accuracy was subsequently

verified repeatedly by the company itself, but these figures have never been verified by

research groups outside the manufacturing company (Yeo et al., 2012).
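
To put the reported per-base accuracy in perspective, it can be converted into a Phred-scaled quality value and into the expected number of errors per read. The sketch below uses the standard relation Q = -10 log10(error rate) and assumes that errors are independent across bases, which is a simplification; the function names are illustrative.

```python
import math


def phred_from_accuracy(accuracy):
    """Phred-scaled quality Q = -10 * log10(error probability)."""
    return -10 * math.log10(1 - accuracy)


def expected_errors(accuracy, read_length):
    """Expected number of erroneous bases in a read."""
    return (1 - accuracy) * read_length


def prob_error_free(accuracy, read_length):
    """Probability that a read has no errors, assuming independent errors."""
    return accuracy ** read_length


print(round(phred_from_accuracy(0.996), 1))
print(round(expected_errors(0.996, 50), 2))
print(round(prob_error_free(0.996, 50), 3))
```

Under these assumptions an accuracy of 99.6% corresponds to a quality value of about Q24, or roughly one error in every five 50-base reads.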

2.2.5 The Third Generation Sequencer

With the increasing demand for NGS technology, another generation of sequencing

has been introduced. Third-generation sequencing has a couple of important features, e.g.

PCR is not required before sequencing, which saves scientists time (Liu et al., 2012). The

PacBio or Nanopore signals are captured in real time, meaning that they are observed during

the catalytic process of incorporating a nucleotide into a chain (El-Metwally et al., 2014).

One such methodology, known as single-molecule real-time (SMRT) sequencing, is a third-

generation technology introduced by Pacific Biosciences (Raley et al., 2014). SMRT needs a

lower starting quantity of DNA (< 1 μg) than other platforms and results in significantly longer

read lengths (Wall et al., 2009). This technology is not yet as common among researchers as the

second-generation sequencers.

Nanopore technology, developed by Oxford University researchers, is known to the public

as the Nanopore sequencer (Laszlo et al., 2014). It is a third-generation platform with read

lengths an order of magnitude longer than those of existing technologies. The developers are

trying to bring the cost well below current market rates. The most interesting thing about

Nanopore is that it is a futuristic USB-powered sequencer priced at only one thousand dollars,

with an easy-to-use protocol (Pabinger et al., 2014). The company later stopped production and

returned after two years with a new beta version of the sequencer known as MinION (Mikheyev

and Tin 2014). The initial results from the product were immature and unsatisfactory, but it is

improving quickly.

2.3 Genetic Variants in the Human Genome

Genetic variants are the differences in genetic makeup within and between individuals and

populations. A single gene may have multiple variants at different positions across the human

population, which then constitute polymorphisms (Cargill et al., 1999). No two humans are

identical, even if they developed from one zygote (Scott et al., 2000). Such differences are due

to alterations that occur during development, but it is estimated that human individuals are 99.9

percent similar (Check 2005). These variations are the key information generally used

in DNA fingerprinting and personal or population identification (Edwards et al., 1992). Different

populations have different allele frequencies, which sometimes make them unique for certain

characters (Wright 1949). The more geographically distant two populations are, the more their

genetic makeup differs.

Genetic variation among individuals arises during meiosis, when genes are exchanged during

crossing over of the chromosomes (Campbell et al., 2014). Another cause of genetic change

is natural selection and environmental factors (Williams 2008). That is why some genes or

alleles are expressed only when geographic conditions favor their expression (Cavalli-

Sforza et al., 1994). Sometimes the cause of variation is genetic drift, the effect of random

changes in the gene pool, which is of great importance in ancestry-related studies, e.g. when

modern humans migrated from Africa (Tishkoff and Kidd 2004).

Human genetic variation has both genealogical and medical importance (Burchard et al.,

2003). It helps researchers to learn about ancient migrations and about how genetically similar

diverse human populations are to each other (Sachidanandam et al., 2001). Genetic

polymorphism is useful in disease association studies because certain disease-causing variants

occur more frequently in particular populations. On average, an individual carries 60 unique

mutations compared with his or her parents (Taillon-Miller et al., 1999). Genetic differentiation

in humans takes many forms, ranging from chromosome-level changes to point mutations.

2.3.1 Single Nucleotide Variants / Polymorphisms

A single nucleotide polymorphism (SNP, pronounced 'snip') is a difference in a single

nucleotide among members of a group that occurs in at least about one percent of the group.

To date, more than 30 million SNPs have been reported in humans (Frazer et al., 2007). These

are the most common variations used in genomic studies. Single nucleotide variants are the major

source of heterogeneity and occur about every 100 to 300 bases on average (Batra et al.,

2014). There are two types of SNPs, synonymous and non-synonymous. Non-synonymous or

functional SNPs are variants that alter the function of a gene and cause phenotypic

differences between humans (Haller et al., 2014). Out of 30 million SNPs, only three to five

percent have an associated function (Hinds et al., 2005). Synonymous SNPs are also important

and can be used as genetic markers in different genome-based studies.
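
The distinction between synonymous and non-synonymous SNPs in a coding region can be illustrated by translating the reference and variant codons. The sketch below builds the standard genetic code table and compares the encoded amino acids; the function name and the example codons are assumptions for illustration, and the sketch ignores strand, splicing and non-coding variants.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G
# (third position varying fastest); '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def classify_snp(codon, position, alt_base):
    """Classify a single-base change within a codon.

    position is 0, 1 or 2. Returns (label, ref_amino_acid, alt_amino_acid).
    """
    mutated = codon[:position] + alt_base + codon[position + 1:]
    ref_aa, alt_aa = CODON_TABLE[codon], CODON_TABLE[mutated]
    label = "synonymous" if ref_aa == alt_aa else "non-synonymous"
    return label, ref_aa, alt_aa


print(classify_snp("GAA", 2, "G"))  # GAA -> GAG, both glutamate
print(classify_snp("GAG", 1, "T"))  # GAG -> GTG, glutamate to valine
```

The second example changes the encoded amino acid and is therefore non-synonymous, while the first leaves the protein sequence unchanged.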

2.3.2 Structural variation

Structural variation is another kind of human genetic variation, which arises from structural

changes in the chromosomes of an organism (Feuk et al., 2006). It includes microscopic

chromosomal regions that are deleted, duplicated, inverted or inserted (Redon et al., 2006).

Structural variants were first studied in two personal genomes in 2007. They contribute

greatly to genome variation and have been investigated by researchers for association

with complex diseases (Frazer et al., 2009).

2.3.3 Copy Number Alteration / Variation

Copy number variations are genetic polymorphisms in which a structural segment

of DNA that is 1 kilobase (i.e. 1,000 nucleotide bases) or larger is present in a variable number

of copies compared with the reference genome (Redon et al., 2006). They are mutations and

include deletions, insertions, and duplications (Freeman et al., 2006). Other definitions

encompass even larger swaths of DNA. The Wellcome Trust Sanger Institute, which heads the

Copy Number Variation Project (Conrad et al., 2010), defines a CNV as a variable number of

repetitions of sequences of 10 kb (10,000 base pairs) to 5,000 kb (5 million base pairs)

(micro-duplications) (St Clair 2009).

Copy number variants may cover about 12% of the human genome (Redon et

al., 2006), with an estimated ~12 CNVs per individual (Feuk et al., 2006), and the

cumulative result of CNV inheritance may constitute more than 10% of the human genome

(Lupski et al., 2011). Recent research suggests that the average human genome contains more

than 1,000 CNVs encompassing approximately four million base pairs (Conrad et al., 2010), and

that CNVs arise at a rate of 0.07-0.12 per generation (Itsara et al., 2010). CNVs are either

inherited from parents or arise de novo. In both cases, functional consequences occur at the

translational level through altered gene dosage and include truncated protein sequences,

eliminated or reduced protein expression (typically the result of deletions), or increased protein

expression (typically caused by duplications), and hence affect the individual's phenotype

(Connolly et al., 2014).

A large number of algorithms have been developed to identify CNVs from sequencing

data, including CNVnator, cnvHiTSeq and XHMM (Tan et al., 2014). Different CNV algorithms

have different strengths and weaknesses (Li and Olivier 2012), and the most effective strategy

in terms of minimizing erroneous CNV calls is to incorporate multiple toolsets, which can be

validated computationally via local de novo assembly (Wong et al., 2010).
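
The read-depth principle that underlies several CNV callers can be illustrated with a toy example. The sketch below is not the algorithm of CNVnator, cnvHiTSeq or XHMM; it simply compares the depth in fixed 1 kb windows with the median depth and merges adjacent windows of the same state. The function name, window size and ratio thresholds are assumptions chosen for illustration.

```python
from statistics import median


def call_cnv_segments(window_depths, window_size=1000,
                      loss_ratio=0.5, gain_ratio=1.5):
    """Call candidate losses and gains from per-window read depth.

    Returns a list of (state, start_bp, end_bp) with half-open coordinates.
    """
    baseline = median(window_depths)
    segments = []
    current = None  # [state, first_window, last_window]
    for i, depth in enumerate(window_depths):
        ratio = depth / baseline
        if ratio <= loss_ratio:
            state = "loss"
        elif ratio >= gain_ratio:
            state = "gain"
        else:
            state = None
        if current and current[0] == state:
            current[2] = i  # extend the running segment
        else:
            if current:
                segments.append(current)
            current = [state, i, i] if state else None
    if current:
        segments.append(current)
    return [(s, a * window_size, (b + 1) * window_size) for s, a, b in segments]


depths = [30, 31, 29, 30, 15, 14, 30, 32, 60, 61, 59, 30]
print(call_cnv_segments(depths))
```

Real callers additionally correct for GC content and mappability and use statistical segmentation, which is one reason their calls differ and why combining several tools is recommended.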

2.3.4 Lineage Markers for Population Study

The paternally inherited Y chromosome and maternally inherited mitochondrial DNA have

been used extensively for understanding human history and the movement of anatomically

modern humans (Richards et al., 2000). The non-recombining region of the Y chromosome (NRY)

and the mitochondrial genome are the only two haploid parts of the complete human genome,

since they are transmitted uniparentally, without recombination in each generation during

meiotic cell division (Jobling and Tyler-Smith 2003). These two haploid systems are passed down

from generation to generation unchanged (unless a mutation alters the haplotype), and

therefore preserve records of genetic history better than autosomal nuclear DNA (Hellenthal et

al., 2008). Autosomal nuclear DNA, by contrast, is shuffled in each generation, i.e. 50% of an

individual's genetic information comes from his or her father and 50% from his or her mother

(Helgason et al., 2003).

2.3.5 Variable number tandem repeats

VNTRs are another type of variant used in DNA fingerprinting and forensic

science (Luczak-Kadlubowska et al., 2008). They are variations in the number of tandem repeats

of short sequences in the human genome. VNTRs can be found on many chromosomes and

differ in length between individuals (Nakamura et al., 1987). They are used in forensic

science for personal or parental identification, crime scene investigation, etc. (Lewontin and

Hartl 1991).

2.3.6 Short tandem repeats (STRs)

Tandem repeats with units of up to about five base pairs are microsatellites, while those with

longer units are known as minisatellites. Currently, STR measurement is based on electrophoresis,

which requires dye-labeled primers and very careful analysis of results because of technical

artifacts (Chung et al., 2004). A method was previously introduced for STR typing that uses a

terminator nucleotide to terminate polymerization at the shortest allele (Sanchez et al., 2006).

This made it possible to sequence samples that are heterozygous for STRs.
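
When sequence data are available, short tandem repeats can also be located computationally. The sketch below uses a regular expression with a back-reference to find repeat units of two to five bases repeated at least three times; the function name, the unit sizes and the minimum repeat count are assumptions for illustration, and the sketch reports only perfect, uninterrupted repeats.

```python
import re


def find_strs(sequence, min_unit=2, max_unit=5, min_repeats=3):
    """Find perfect tandem repeats in a DNA sequence.

    Returns a list of (start, unit, repeat_count). The lazy quantifier
    makes the shortest possible repeat unit match first.
    """
    pattern = re.compile(r"([ACGT]{%d,%d}?)\1{%d,}"
                         % (min_unit, max_unit, min_repeats - 1))
    results = []
    for match in pattern.finditer(sequence):
        unit = match.group(1)
        results.append((match.start(), unit, len(match.group(0)) // len(unit)))
    return results


print(find_strs("ACGATAGATAGATAGATACT"))  # four copies of GATA
print(find_strs("ACGTACGT"))              # only two copies: not reported
```

The repeat count at a locus is the quantity that differs between alleles, so two alleles of a heterozygous sample would yield different counts for the same unit.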

2.4 Applications of Genome Variants

Every individual in the world differs genetically from others at some level of the genetic

sequence, which is the reason behind the diversity of human beings (Tooby and Cosmides 1990).

An important objective for scientists worldwide is to understand human genetic diversity in

order to learn about the evolutionary history of our species (Jobling et al., 2013). It will then

be possible to know where human populations came from and where they are heading.

Knowledge of human genetic diversity is also necessary for researchers to understand different

diseases and how we respond to specific drugs at the individual or cohort level (Price et al.,

2010). Millions of SNPs in genes that might be associated with diseases were discovered in four

world populations by the HapMap consortium using genome-wide microarray chips (McCarroll

et al., 2006). Moreover, scientists are trying to understand population history, which will help

in discovering disease genes. Improvements in our understanding of patterns of human genetic

variation have also informed our view of the history of modern human populations (Cavalli-

Sforza 2005). New methodologies for visualizing and interpreting genetic data have helped in

understanding human evolution.

Personal genetic information can be used to investigate population structure and to

assign a person to groups that frequently match their geographical lineage (Shaer et al.,

2014). With the development of new techniques and algorithms, it is now possible to accurately

estimate the genetic relationships among individuals (Lange et al., 2014).

2.4.1 Genetic Ancestry and Admixture Mapping

Admixture analysis is used to study how genotypic information changes the disease rate

in human populations. Admixture occurs when diverse populations start interbreeding and their

progeny carry a combination of alleles from different ancestral groups (Mendelson et al., 2014).

Estimating the admixture ratio of a person is a helpful tool in population genetics and

epidemiology. Admixture analyses enable scientists to assign individuals with no ancestry

information to distinct populations (Ruiz-Linares et al., 2014). This technique has been applied

effectively to different populations to learn about their genetics. Current admixed ethnic

groups, which trace their ancestry to numerous regions, are suitable for investigating genes for

diseases and other phenotypes that vary in occurrence between parental populations (Race and

Group 2005).
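
The idea of an admixture ratio can be illustrated with a simple two-source model. The sketch below assumes that the allele frequency in the admixed group at each marker is a weighted average of the frequencies in two parental populations, p_admixed = m × p_source1 + (1 - m) × p_source2, and estimates m by least squares across markers. This is a textbook simplification introduced for illustration; the function name and the example frequencies are hypothetical, and dedicated software uses likelihood-based models on individual genotypes.

```python
def admixture_proportion(p_admixed, p_source1, p_source2):
    """Least-squares estimate of the contribution m of source population 1.

    Each argument is a list of allele frequencies at the same markers.
    The estimate is clipped to the interval [0, 1].
    """
    numerator = sum((pa - p2) * (p1 - p2)
                    for pa, p1, p2 in zip(p_admixed, p_source1, p_source2))
    denominator = sum((p1 - p2) ** 2
                      for p1, p2 in zip(p_source1, p_source2))
    if denominator == 0:
        raise ValueError("source populations do not differ at any marker")
    return min(1.0, max(0.0, numerator / denominator))


# Hypothetical allele frequencies at three markers
source1 = [0.9, 0.2, 0.6]
source2 = [0.1, 0.8, 0.4]
admixed = [0.30, 0.65, 0.45]
print(round(admixture_proportion(admixed, source1, source2), 3))
```

Markers whose frequencies differ strongly between the parental populations carry the most weight in the estimate, which is why ancestry-informative markers are preferred for such analyses.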

Some monogenic disorders vary in frequency between populations because of differences in

allele frequency, which are usually associated with ethnic or geographical ancestry (Via et al.,

2009). Health-care experts generally take such information into account when making

decisions. Common diseases like diabetes, obesity, heart problems, blood pressure and

neurological disorders involve many genetic variants and environmental factors (O'Donnell et

al., 1998). Scientists are investigating the involvement of pathogenic alleles with low or

moderate effect.

2.4.2 Medical and Clinical Implications

Whole-genome sequencing technology is now capable of identifying disease variants in

patients accurately and at lower cost. Researchers and policy makers are nevertheless still

trying to handle the issues involved in using and interpreting genotype data (Kaye and Hawkins

2014). Until now, genetic variants have been used in molecular diagnostic testing, but only at a

limited number of loci (Yip et al., 2008). With cheaper, faster and more accurate sequencing

technology, diagnostic tests can be done at the single-nucleotide level.

Using human genome data generated by the 1000 Genomes Project and other genome research

groups, investigators all around the world have come up with more advanced tools to study the

role of variants, along with the associated environment, in complex diseases (Cirulli and

Goldstein 2010). Genome-wide studies are already helping clinical researchers to improve

diagnostics and decision-making tools for patients (Houdayer et al., 2008). In this way, the

role of genomics in health care initiated the era of genetic medicine, which is also known as

personalized medicine.

Moving discoveries from the laboratory to the clinic takes considerable time

and funding. Recently, the American government announced an investment of over 200 million

dollars in genetic health care and Precision Medicine, another term for personalized genomic

medicine (McCarthy 2015). According to genetics professionals, it generally takes more than ten

years for industry to complete medical studies, owing to policies designed by the FDA

(Ciociola et al., 2014).

Genome-wide studies are contributing to our understanding of an individual's risk of

developing diseases that are common in world populations. These common diseases include

diabetes, cancer, hypertension and cardiovascular disorders (Eyre et al., 2004). A profound

understanding of the genetic basis of such diseases will help us to reveal the essential

mechanisms of cells and will eventually increase our knowledge of how various elements work

together to affect an individual's health (Lander 2011).

2.4.3 Pharmacogenomics

Pharmacogenomics is the field of genomics in which advanced molecular and genetic techniques

are used to better understand a patient's genetic abnormality and to prescribe better medicine

for that patient (Goldman et al., 2007). Rational medicines are saving millions of lives every

day. Yet a given drug may not help one patient even if it works for others (Edwards

and Aronson 2000). In some cases it may cause severe side effects in one person but not in

another. Many scientists have realized that most prescribed medicines do not work for

most patients who take them (Vermeire et al., 2001). It is an open secret within

pharmaceutical companies that most of their drugs are ineffective for most patients, but such

news has only recently become public. After the decoding of the human genome, many ideas were

developed to find the causes and cures of human diseases (Jobling et al., 2013). The clinical

application of individual genetic information has led to a new era of personalized drugs, which

has created challenges and opportunities for biomedical researchers and health care

professionals (Guttmacher et al., 2007).

Individualized medicine uses information from a person's genetic profile to

identify gene expression levels related to a disease, to choose a drug and to start preventive

measures appropriate for that patient (Chobanian et al., 2003). Computational analysis of an

individual's genetic data for predisposition to disease is changing the way medicines are

discovered and prescribed. It is indeed a bold new research effort to revolutionize how we can

improve the health care system (Collins et al., 2003). This type of innovation is associated with

considerable scientific uncertainty and financial risk, but countries like the USA have recently

been spending millions of dollars to initiate personalized medicine, which promises to accelerate

biomedical inventions and provide medical professionals with new tools and skills to select

which treatments will work best for which patients (Hamburg and Collins 2010).

With the development of personalized medicine it will be possible to produce more effective drugs with a lower chance of adverse effects than conventional drugs (Okimoto and Bivona 2014). Healthcare providers will soon be able to deliver more targeted drug therapy with fewer errors and a lower rate of drug-related side effects (Whirl-Carrillo et al., 2012). With advances in biotechnology, healthcare professionals now recognize that the same drug does not work in the same way in every patient, and that some patients do not respond to the prescribed treatment (Fletcher et al., 2012). The idea behind individualized medicine is that every patient has a unique genetic makeup that should inform the choice of medical treatment, improving efficacy and minimizing side effects (Gamma 2013). Precision medicine can be regarded as the current era’s answer to rational drug use. Together with novel molecular diagnostic procedures, it will give physicians an objective basis for improving medical treatment in many disease areas (Pauwels et al., 2014).

Personalized health care has the potential to revolutionize how we prevent, diagnose and treat human diseases (Snyderman and Williams 2003). It is the beginning of a journey that holds much promise, but it will require thoughtful, joint research among scientists, health-care professionals, ethicists, policy makers, patient advocates and the general public to chart the wisest course (Roberts and Ostergren 2013).


2.5 Personal and Population Genome Projects

Scientists are contributing to the field of genomics through several personal and population genome projects run by collaborative research groups around the globe. Many renowned personalities have donated their DNA to the research community so that the hidden genetic information can be understood and used for the betterment of humankind (Hellenthal et al., 2014). These individuals include Craig Venter (American geneticist), James D. Watson (Nobel Laureate), Stephen Quake (bioengineer) and George Church (Harvard professor). Atta ur Rehman (former Education Minister) from Pakistan has also contributed to the field by providing his genome. Individual genomes from other ethnic groups and countries have likewise been reported, including the first Indian genome from a male, a female genome from South Asian India (SAIF), and genomes from Sri Lanka, Ireland, Turkey, Australia and Africa.

Owing to the sharp decrease in the cost of sequencing and analyzing a whole human genome, many research groups have established consortia to study genomes from geographically distinct regions and ethnic groups, in order to understand the biology of genetic disorders and how these populations migrated from one place to another (Kidd et al., 2004).

2.5.1 Personal Genome Project (PGP): The PGP was started by Prof. George Church of Harvard Medical School in 2005 (www.personalgenomes.org). It is a long-term project that aims to analyze the personal genomes of donors who consent to making their genomes publicly available. The initial goal was to enroll 100,000 donors from America, but affiliated projects in the UK, Korea and other countries later joined the effort (Church 2005).


2.5.2 1000 Genomes Project (1KGP): The project was announced in 2008 by the Wellcome Trust Sanger Institute, the Beijing Genomics Institute and the National Human Genome Research Institute (Siva 2008). Its main objective was to catalogue human genetic variants by sequencing one thousand human genomes from diverse ethnic groups around the world. The project was later divided into three phases, and genomes from additional populations were included. The first phase was completed and reported in 2012. Currently, 2,577 genome samples from 26 populations are available online on the official 1000 Genomes Project website (www.1000genomes.org/).

2.5.3 Pan-Asian Population Genomics Initiative (PAPGI): PAPGI is the second phase of the Pan-Asian SNP Consortium, whose first phase was successfully completed by scientists from China, India, Japan, South Korea, Singapore and Thailand, with additional support from Indonesia, Malaysia, the Philippines and Taiwan. The current phase is assisted by Middle Eastern countries, including Saudi Arabia, Kuwait and the UAE, which are participating in data production and analysis (Ranganathan et al., 2012). The goals of the project are to study Asian genomes and to relate them to local adaptation, population migration, and genetic variation associated with phenotypes and genetic disorders (Ngamphiw et al., 2011). The consortium is helping the research community to understand human evolution and its medical applications (www.papgi.org).

2.5.4 One Million Genomes: The American government plans to spend $215 million on a “personalized medicine” initiative, which will include genetic and health care information from volunteers (Insel et al., 2015). The funds allocated to this project also cover the study of cancer and rare diseases. A biobank will be created in which genomes from a million or more Americans will be stored; these genotypes will later be used to establish precision medicine (Collins and Varmus 2015).

2.5.5 Human Genome Diversity Project (HGDP): Stanford researchers started this project in collaboration with the Centre d’Étude du Polymorphisme Humain (CEPH) in Paris to study human genetic evolution. Approximately 1,043 samples from 53 diverse populations were studied (Bryc et al., 2010), and about 650K single nucleotide variants were genotyped in each using a microarray chip developed by Illumina. The samples were collected from Africa, Europe, Asia and the Americas (Rosenberg 2006). Some of the HGDP samples have since been whole-genome sequenced (WGS) at ~30X coverage by the Simons Foundation.

2.5.6 Billion Genomes Project: The Billion Genomes Project (BiG) was started by the Theragen BiO Institute, San Diego. The idea is to sequence every living human and to uncover the unknown genetic information of human populations around the globe (http://billiongenome.com).

2.5.7 Other Genome Consortia: Many other genome projects and consortia have been established in different countries, including the Singapore Genome Variation Project (Teo et al., 2009), the Indian Genome Variation Project (www.igvdb.res.in), the Malay genome project (Wong et al., 2013), the Korean Personal Genome Project (Zhang et al., 2014), the African Genome Variation Project (Gurdasani et al., 2015), the Genome Arabia Project and the Iranian Genome Project (irangenes.com).


Chapter 3

MATERIAL AND METHODS


CHAPTER 3 3. Materials and Methods

3.1 Subject Selection and enrollment of participant and ethical statement:

This study was performed in accordance with the Declaration of Helsinki and was approved by the Institutional Review Board of the Genome Research Foundation (GRF) under approval number IRB-REC-2011-10-003. Signed informed consent was obtained from the participant to publish the entire content of his genome, as well as personal identifying information (such as age, sex and location).

There are documented cases of hypertension, heart problems, neurological disorders, diabetes and obesity among his family members. His father has been diagnosed with cardiovascular disease, hypertension and Alzheimer’s disease. His mother has osteoarthritis, and his grandparents died of heart attack, cancer and hypertension.

Figure 3.1: Family pedigree of donor (red), with members having genetic disorders.


3.2 Data sources:

The UCSC reference genome (hg19, February 2009), dbSNP version 137 and genome annotations were retrieved from the UCSC database (www.genome.ucsc.edu). Variant call format (VCF) files were retrieved from several publicly available databases: 41 samples from 9 diverse populations (African ancestry in Southwest USA; Utah residents with Northern and Western European ancestry from the CEPH collection; Han Chinese in Beijing, China; Gujarati Indians in Houston, Texas, USA; Japanese in Tokyo, Japan; Luhya in Webuye, Kenya; Maasai in Kinyawa, Kenya; Toscani in Italy; and Yoruba in Ibadan, Nigeria) were collected from Complete Genomics Inc. (www.completegenomics.com), and five samples were taken from the Korean Personal Genome Project (KPGP) (www.kpgp.kr). Data for twelve South Asian populations from the CEPH-HGDP panel, genotyped on 650K SNP arrays, were also downloaded from the public database at Stanford University (http://www.hagsc.org/hgdp/files.html).

3.3 DNA Extraction:

Genomic DNA was extracted from the arterial blood lymphocytes of a healthy 30-year-old male whose family was reported to be of Pakistani Pakhtun ethnicity for at least three generations. A consent form was signed before the blood sample was collected. The QIAamp DNA Blood Mini Kit (Qiagen) was used to extract DNA from the blood. DNA purity was assessed with Tecan’s Infinite F200 nanodrop, DNA size (presence of high-molecular-weight DNA) was confirmed by 1.7% agarose gel electrophoresis, and DNA concentration was determined with Invitrogen’s Qubit fluorometer.


3.4 Cytogenetic Analysis:

Karyotyping was carried out on cultured peripheral blood lymphocytes using standard techniques, and GTG banding was used to identify chromosomal aberrations. This technique is useful for identifying genetic diseases through a photographic representation of the entire chromosome complement (Speicher et al., 2005). The blood sample was frozen, treated with trypsin and stained with Giemsa, and then observed under a microscope. The bands were pronounced, allowing us to mark the normal male chromosomal features while also recording any slight abnormalities.

Figure 3.2: Cytogenetic analysis through GTG banding karyotype and legends.

3.5 Library preparation and Whole Genome Sequencing:

A total of 1.1 μg of gDNA was used to generate two paired-end libraries suitable for the HiSeq sequencing platform (Illumina), prepared with the TruSeq DNA Preparation Kit following Illumina’s standard protocol (Paired-End Library Preparation Kit, Illumina, San Diego, CA, USA). Quality control of the library on an Agilent 2100 Bioanalyzer indicated that it was of acceptable quality, with the expected fragment size and yield, for continued sample processing. Cluster generation was performed on an Illumina cBot system in three flow cell lanes, and the libraries were sequenced on an Illumina HiSeq 2000 following the paired-end protocol. Low-quality reads were removed from the final output of the sequencing machine.

Figure 3.3: Illumina HiSeq2000 Machine and accessories. (http://qbi.uq.edu.au)

Shearing of gDNA was done using a Covaris S series instrument (Covaris, MA, USA). Following end repair, A-tailing and adaptor ligation, DNA in the 500-600 bp range was purified from a 2% agarose gel. Polymerase chain reaction (PCR) was performed using the following cycling profile: initial denaturation at 98°C for 30 sec, followed by 10 cycles of 98°C for 30 sec and 60°C for 30 sec, and a final extension step at 72°C for 5 min. Correct DNA size was then confirmed with the Agilent Bioanalyzer, followed by qPCR quantification with the Roche LightCycler 480 II and Kapa Biosystems reagents. The remainder of our analyses started from the FASTQ files produced by Illumina’s CASAVA downstream analysis software suite.


Figure 3.4: Library quality generated by BioAnalyzer.

3.6 Workflow for Genomic Data Analysis:

A custom workflow was created for the analysis of the genome. It included calling variants from the alignments and comparing them with other variant databases, including dbSNP (Sherry et al., 2001), the Database of Genomic Variants (Iafrate et al., 2004, Feuk et al., 2006) and the 1000 Genomes Consortium (www.1000genomes.org). The workflow further included the mapping and comparison of markers associated with damaging variants and pharmacogenomic traits. Multiple analytical approaches were added to the workflow to assess the influence of ancestral contributions within a personal genome, together with the historical background of the region. The detailed components of the analysis workflow are given in Figure 3.5.

The NGS data analysis pipeline was developed in the Python programming language. It was designed to run on UNIX systems and was tested on a Red Hat Enterprise Linux (RHEL) server v5.6. It uses the Modules package to modify a user’s environment dynamically (e.g. changing the path and version of Python) via module files. Its MapReduce approach was implemented mainly on top of a custom Simple Job Management framework (SJM), which currently supports Sun Grid Engine but can easily be extended to support other batch systems. Each step in the pipeline was implemented in a separate Python script, and the job description file generated for SJM is in a human-readable format.

Figure 3.5: Workflow of the next generation sequencing and bioinformatics data analysis (Koboldt et al.,

2010).


3.7 Sequence alignment:

The input reads were in FASTQ format. BWA version 0.5.9 was used to align the sequences against the human reference genome hg19 (Li and Durbin, 2009). BWA is a software package for mapping low-divergent reads against a large reference genome. It comprises three algorithms: BWA-backtrack, BWA-SW and BWA-MEM. BWA-backtrack is designed for Illumina reads of up to 100 bp, while BWA-SW and BWA-MEM are designed for longer sequences (70 bp to 1 Mbp); BWA-MEM also performs better than BWA-backtrack on 70-100 bp Illumina reads. Illumina quality scores were converted into Sanger quality scores by BWA. Multithreading was enabled with two concurrent threads for generating the suffix array (SA) coordinates during mapping. The original alignment output, in SAM format, was converted into BAM using SAMtools version 0.1.14. SAMtools is a package that supports variant calling and alignment visualization, along with other operations such as sorting, indexing, data extraction and file conversion. SAM files are large text files and are therefore compressed into the binary BAM format to save disk space. Because BAM files are binary, they cannot be read directly as text; SAMtools makes it possible to work directly with a compressed BAM file without uncompressing the whole file (Li et al., 2009). SAM and BAM files hold detailed information about the reads, along with references, alignments, quality information and user-specified annotations, which can be removed with SAMtools.
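The quality-score conversion mentioned above can be sketched as follows. This is a minimal illustration and not the code BWA itself runs: it assumes the input uses the Phred+64 encoding of older Illumina pipelines (an assumption, since the encoding of the reads is not stated here), and the function name is hypothetical.

```python
def illumina_to_sanger(quality: str) -> str:
    """Re-encode a Phred+64 (older Illumina) quality string as Phred+33 (Sanger).

    Assumption: every symbol encodes a non-negative Phred score with offset 64.
    """
    converted = []
    for symbol in quality:
        phred = ord(symbol) - 64           # decode with the Illumina offset
        if phred < 0:
            raise ValueError(f"{symbol!r} is not a valid Phred+64 symbol")
        converted.append(chr(phred + 33))  # re-encode with the Sanger offset
    return "".join(converted)


if __name__ == "__main__":
    # 'h' encodes Phred 40 and '@' encodes Phred 0 under the +64 offset.
    print(illumina_to_sanger("h@"))  # prints I!
```

The Phred score itself is unchanged; only the ASCII offset used to store it differs between the two encodings.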

The BAM files were sorted with Picard (http://picard.sourceforge.net) version 1.32 and binned by chromosome with SAMtools. Picard was also used to remove duplicates from the alignments, whereas GATK version 1.0.5506 was used for local realignment and base quality checking (McKenna et al., 2010). GATK, the Genome Analysis Toolkit, is designed for analyzing high-throughput sequencing data. It offers a wide variety of tools, with special emphasis on variant calling, genotyping and data quality control (Figure 3.6).

3.8 SNP and Indel detection:

The Unified Genotyper in GATK version 1.0.5506 was used for SNP and indel detection, with the call confidence threshold set to 30.0 and the emit confidence threshold set to 10.0. The Dindel model was enabled for indel calling. Filter labels were applied with the Variant Filtration program in GATK to variants with allele balance (AB) greater than 0.75, quality score (QUAL) less than 50.0, depth of coverage (DP) greater than 360, strand bias (SB) greater than −0.1, or mapping quality zero reads (MQ0) greater than or equal to 4. The mpileup function in SAMtools/BCFtools version 0.1.14 was also used for SNP and indel detection. The generated VCFs were concatenated and merged using VCFtools version 0.1.5 and indexed using Tabix version 0.2.4 (Danecek et al., 2011).
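The filter criteria listed above can be expressed as a small predicate. The sketch below is illustrative only and is not the GATK Variant Filtration program: the function name and argument layout are hypothetical, and the strand-bias cutoff is taken as −0.1, which is an assumption about the intended value.

```python
def failed_filters(ab, qual, dp, sb, mq0, sb_cutoff=-0.1):
    """Return the labels of all filter criteria that a variant call fails.

    ab   : allele balance (None when not reported, e.g. homozygous calls)
    qual : variant quality score (QUAL)
    dp   : depth of coverage (DP)
    sb   : strand bias (SB); sb_cutoff=-0.1 is an assumed threshold
    mq0  : number of mapping-quality-zero reads (MQ0)
    """
    failed = []
    if ab is not None and ab > 0.75:
        failed.append("AB")
    if qual < 50.0:
        failed.append("QUAL")
    if dp > 360:
        failed.append("DP")
    if sb > sb_cutoff:
        failed.append("SB")
    if mq0 >= 4:
        failed.append("MQ0")
    return failed


if __name__ == "__main__":
    print(failed_filters(ab=0.5, qual=99.0, dp=30, sb=-5.0, mq0=0))   # []
    print(failed_filters(ab=0.9, qual=20.0, dp=400, sb=0.0, mq0=4))
```

Because the criteria are joined by "or", a call is flagged as soon as any one of them is met; returning every failed label makes it easy to see which criterion was responsible.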

3.9 Copy Number Variation Detection:

Copy number variations (CNVs) have traditionally been studied with array-based technologies, but their resolution is limited, hybridization reduces accuracy, and their predefined probes are incompatible with the detection of novel CNVs. Next-generation sequencing is an emerging technology with rapidly falling costs that detects CNVs with higher resolution and accuracy. ReadDepth 0.9.7 was used to identify copy number variations with a bin size of 0.01 (Miller et al., 2011). Copy number calls smaller than 1.3 were taken as losses and those greater than 2.6 as gains. ReadDepth is a tool developed in the R programming language for CNV discovery. It calls CNVs on the basis of sequence depth and then invokes a circular binary segmentation algorithm to define segment boundaries. It also allows explicit control of the false discovery rate (FDR), which minimizes the number of false-positive CNVs detected.
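The gain and loss thresholds described above amount to a simple classification of each segment’s estimated copy number. A minimal sketch follows; the function name and the "neutral" label are hypothetical, and this is not ReadDepth’s own code.

```python
def classify_copy_number(copy_number, loss_below=1.3, gain_above=2.6):
    """Label a segment as 'loss', 'gain' or 'neutral' from its copy number estimate."""
    if copy_number < loss_below:
        return "loss"
    if copy_number > gain_above:
        return "gain"
    return "neutral"


if __name__ == "__main__":
    for estimate in (0.9, 2.0, 3.1):
        print(estimate, classify_copy_number(estimate))
```

Values exactly equal to a threshold are treated as neutral here, following the "smaller than" and "greater than" wording of the criteria.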

3.10 Functional annotation

Functional annotation of genome variants is the process of attaching biological information to sequences, which includes identifying elements on the genome and assigning biological meaning to them. The identification step is also called gene prediction. Automatic annotation tools such as ANNOVAR perform these analyses computationally (Wang et al., 2010).

All detected variants, obtained in VCF format using SAMtools, were then annotated with ANNOVAR. The UCSC Known Genes and RepeatMasker databases were used for gene and repeat annotations, respectively. DGV (http://projects.tcag.ca/variation/), SIFT (Ng and Henikoff, 2003), PolyPhen2 (Jordan et al., 2011) and ClinVar (Landrum et al., 2013) were used for functional annotation.


Figure 3.6: Schematic representation of the pipeline developed.


3.11 Pharmacogenomics Analysis

Functionally damaging nonsynonymous SNVs were used to retrieve the genes involved in drug transport and metabolism, and drug targets were retrieved from DrugBank and PharmGKB (Hewett et al., 2002, Wishart et al., 2008). Variants associated with pharmacogenomic traits were collected manually from the literature and other data sources. A Perl script was used to find the overlap between the two sets (Figure 3.7). The clinically associated variants have been recommended for testing. The methodology used for the pharmacogenomics analysis has been reported previously (Salleh et al., 2013).
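The overlap step can be illustrated with a short sketch. The original script was written in Perl and is not reproduced here; the Python version below only shows the idea of intersecting two variant sets by rsID. The function name and the toy input dictionaries are hypothetical and do not represent the study’s data.

```python
def overlap_by_rsid(personal_variants, pharmacogenomic_variants):
    """Return (rsID, gene, drug) for every rsID present in both inputs.

    personal_variants        : dict mapping rsID -> gene symbol
    pharmacogenomic_variants : dict mapping rsID -> associated drug
    """
    shared = sorted(set(personal_variants) & set(pharmacogenomic_variants))
    return [(rsid, personal_variants[rsid], pharmacogenomic_variants[rsid])
            for rsid in shared]


if __name__ == "__main__":
    # Toy inputs for illustration only.
    personal = {"rs1065852": "CYP2D6", "rs1142345": "TPMT", "rs3827760": "EDAR"}
    curated = {"rs1065852": "debrisoquine", "rs1057910": "warfarin"}
    print(overlap_by_rsid(personal, curated))
    # [('rs1065852', 'CYP2D6', 'debrisoquine')]
```

Sorting the shared identifiers keeps the output order stable between runs, which makes the resulting list easier to compare across analyses.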

Figure 3.7: Schema of the Pharmacogenomics analysis (Salleh et al., 2013).

3.12 Multidimensional Scaling and ADMIXTURE:

A total of 52 samples from 13 different ethnic groups, including the Pakistani (Pathan) genome, were used for admixture, phylogenetic and MDS analyses. Complete genome variant files were downloaded from the publicly available data of Complete Genomics Inc., USA. The samples include Africans in the USA, European individuals from the CEPH collection, Han Chinese, Gujarati Indian, Japanese, Puerto Rican, Luhya Kenyan, Maasai Kenyan, Mexican ancestry, Italian and Yoruba (Drmanac et al., 2010), and five genomes were obtained from the Korean Personal Genome Project (www.kpgp.kr). VCFtools was used to merge all the samples. The dataset was restricted to the 607,578 SNVs that were available in all samples and passed quality control. PLINK was then used to prepare the data for admixture studies (Purcell et al., 2007). Admixture analysis was performed with the program ADMIXTURE to identify the ancestral relationships of the Pathan genome with the others (Alexander et al., 2009). We explored values of K from K = 2 to K = 13. Ancestry painting was performed with the publicly available tool INTERPRETOME, by analyzing individual genome information (Karczewski et al., 2012). To describe how our genome clustered with the other populations, a multidimensional scaling (MDS) analysis was carried out in PLINK. Pairwise identity-by-state (IBS) distances were calculated between all individuals using the 607,578 SNV markers, and MDS components were obtained using the mds-plot option based on the IBS matrix.
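To make the IBS distance concrete, the sketch below computes it for genotypes coded as 0, 1 or 2 copies of the alternate allele. It is a simplified stand-in for the PLINK calculation, taking distance as one minus the proportion of alleles shared identical by state; the function names and sample labels are hypothetical.

```python
from itertools import combinations


def ibs_distance(genotypes_a, genotypes_b):
    """Return 1 - (proportion of alleles shared IBS) over non-missing markers.

    Genotypes are coded 0, 1 or 2 (alternate allele count); None marks missing data.
    """
    shared = 0
    compared = 0
    for a, b in zip(genotypes_a, genotypes_b):
        if a is None or b is None:
            continue
        shared += 2 - abs(a - b)  # 2, 1 or 0 alleles identical by state
        compared += 2
    if compared == 0:
        raise ValueError("no markers with data in both samples")
    return 1.0 - shared / compared


def pairwise_ibs(samples):
    """Return {(name_a, name_b): distance} for every pair of samples."""
    return {(a, b): ibs_distance(samples[a], samples[b])
            for a, b in combinations(sorted(samples), 2)}


if __name__ == "__main__":
    # Toy genotypes for illustration only.
    toy = {"sampleA": [0, 1, 2, None], "sampleB": [1, 1, 0, 2]}
    print(pairwise_ibs(toy))  # {('sampleA', 'sampleB'): 0.5}
```

The resulting distance matrix is the kind of input on which multidimensional scaling operates, placing samples that share more alleles closer together.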

3.13 Pairwise Sequentially Markovian Coalescent Analysis

We conducted a PSMC (Pairwise Sequentially Markovian Coalescent) analysis to reconstruct the demographic history of the Pathan population (Li and Durbin 2012). We compared the Pathan genome to a set of 11 HGDP genomes from around the world (as published by Meyer et al). We first used SAMtools to extract the diploid genomes from their BAM files aligned to hg19, and excluded the sex chromosomes and the mitochondrial genome because they are haploid. In PSMC, we used the command-line options -N25 -t15 -r5 -p "4+25*2+4+6", which have been used successfully in previous similar analyses of humans and great apes (Prado-Martinez et al., 2013).
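The -p pattern above specifies how atomic time intervals are grouped into free parameters: a plain term such as 4 is one parameter spanning four intervals, and a term such as 25*2 is 25 parameters spanning two intervals each. The helper below is a hypothetical sketch, not part of PSMC, that counts both quantities for a given pattern.

```python
def parse_psmc_pattern(pattern):
    """Return (atomic_intervals, free_parameters) for a PSMC -p pattern string."""
    atomic_intervals = 0
    free_parameters = 0
    for term in pattern.split("+"):
        if "*" in term:
            repeats, width = (int(part) for part in term.split("*"))
        else:
            repeats, width = 1, int(term)
        atomic_intervals += repeats * width  # intervals covered by this term
        free_parameters += repeats           # one parameter per repeat
    return atomic_intervals, free_parameters


if __name__ == "__main__":
    print(parse_psmc_pattern("4+25*2+4+6"))  # (64, 28)
```

For the pattern used here this gives 64 atomic intervals described by 28 free interval parameters.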


3.14 Phylogenomics Analysis

A central aim of evolutionary biology is to understand the relationships among species. Single nucleotide variants (SNVs), also known as SNPs, generated through sequencing, genotyping and related technologies enable phylogeny reconstruction by providing extraordinary numbers of characters for investigation (Miller et al., 2013). In the current study, an SNP-based phylogeny was constructed after SNPs had been identified in all individuals and compiled. The neighbor-joining tree was generated from pairwise FST values calculated for all ethnic samples using the population allele frequencies across all autosomal variants. The “Neighbor” program from PHYLIP was used to construct all bootstrap trees (Saitou and Nei, 1987), and MEGA5 was then used to visualize them (Tamura et al., 2011). The Yoruba population was used as an outgroup to root the phylogenetic tree.

Chapter 4 RESULTS


CHAPTER 4 4. Results

4.1 Genome Sequencing and Variants Identification:

DNA extracted from blood was sequenced with paired-end reads of 90 bp on the Illumina HiSeq 2000 sequencer, producing 1,069,127,687 reads. A total of 83.3 Gb of sequence was generated and aligned to the human reference genome (without Ns, 2,861,343,702 bp), covering 98.2% of the reference genome at an average depth of 28.5× (Table 4.1).

Table 4.1: Summary of data production and mapping results

Read length (bp)        90
No. of reads            1,069,127,687
No. of mapped reads     992,124,335
Mapped reads (%)        92.80%
No. of nucleotides      83.25 Gb (89,385,267,060)
Mapping depth           28.5

We identified a total of 3,813,440 SNVs, of which 3,683,999 (96.6%) were reported in the dbSNP database (Sherry et al., 2001) and 129,441 were novel (Table 4.2). The novel variant count was further compared with those of other individual genomes from the literature (Figure 4.1). There were 1,272,912 homozygous and 2,540,528 heterozygous SNVs. A total of 18,547 SNVs were found in coding DNA sequence (CDS) regions, 25,481 in 3’ untranslated regions (UTRs), and 4,969 in 5’ UTRs. A total of 10,315 SNVs in 5,344 genes were non-synonymous (nsSNVs).


Table 4.2: Summary of SNVs found in Pathan’s genome and overlaps with dbSNP137

Total SNVs                       3,813,440
Homozygous SNVs                  1,272,912
Heterozygous SNVs                2,540,528
SNVs mapped to dbSNP (v137)      3,683,999
% of SNVs mapped to dbSNP        96.6%
Novel SNVs                       129,441
% of Novel SNVs                  3.39%
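As a quick consistency check, the percentages in Table 4.2 can be recomputed from the reported counts. The snippet below is a minimal sketch using only the figures given above; the variable and function names are ours.

```python
TOTAL_SNVS = 3_813_440
HOMOZYGOUS_SNVS = 1_272_912
HETEROZYGOUS_SNVS = 2_540_528
DBSNP_SNVS = 3_683_999
NOVEL_SNVS = 129_441


def percent(part, whole, digits):
    """Return part/whole as a percentage rounded to the given number of digits."""
    return round(100.0 * part / whole, digits)


if __name__ == "__main__":
    # Zygosity classes and known/novel classes should each partition the total.
    print(HOMOZYGOUS_SNVS + HETEROZYGOUS_SNVS == TOTAL_SNVS)  # True
    print(DBSNP_SNVS + NOVEL_SNVS == TOTAL_SNVS)              # True
    print(percent(DBSNP_SNVS, TOTAL_SNVS, 1))                 # 96.6
    print(percent(NOVEL_SNVS, TOTAL_SNVS, 2))                 # 3.39
```

Both partitions sum to the reported total, and the recomputed percentages agree with the table.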

A total of 504,276 short indels (up to ±20 bases) were observed, of which 306,128

were found in intergenic regions, 237 in CDS regions, and 193,308 in intron regions.

Additionally, 1,503 CNVRs were found, 713 of which were classed as duplicated and 790 as

deleted, affecting 2,364 overlapped genes (Table 4.3).

Table 4.3: Variants (SNVs, Indels and CNVRs) identified in Pakistani (PTN) genome.

                       SNVs                 Indels     CNVRs
Total                  3,813,440            504,276    1,503
Intergenic             2,376,933            306,128    866
Novel                  129,441              ---        65
Homozygous             1,272,912            190,463    ---
Heterozygous           2,540,528            313,813    ---
Synonymous SNVs        9,639                ---        ---
nonSynonymous SNVs     10,315               ---        ---
CDS                    18,547               237        253
Intron                 1,387,430            193,308    220
3’ UTR                 25,481               4,149      5
5’ UTR                 4,969                399        17
Reported               3,683,999 (dbSNP)    ---        1,438 (DGV)

A total of 65 CNVRs had not previously been described in the database of genomic

variants (DGV; http://projects.tcag.ca/variation/). Figure 4.2 shows the number of gained and

lost CNVRs in each chromosome. ANNOVAR was used for detailed annotation analysis of

CNVRs to identify genes associated with these regions.


Figure 4.1: Novel SNVs in personal genomes from thirteen different ethnic groups. Scatter plot showing novel variants reported in personal genomes. Data collected from the literature.

Figure 4.2: Copy number variations counts distributed in each chromosome.


4.2 Functional Classification and Clinical Relevance of Variants:

All 10,315 nsSNVs found in the Pakistani (PTN) genome were further scrutinized for their possible functional effects using computational prediction methods (SIFT and PolyPhen2), resulting in 43 nsSNVs in 43 genes being classified as functionally damaging (Table 4.4). Additionally, nsSNVs were annotated using ClinVar for their clinical relevance, and we found that 31 coding SNVs are associated with several diseases (Table 4.5). Of particular note are an SNV (rs1049296, Pro570Ser) in the TF gene, which affects Alzheimer’s susceptibility (Wang et al., 2013), and Ser217Leu in the ELAC2 gene (rs4792311), which is implicated in genetic susceptibility to hereditary prostate cancer (Alvarez-Cubero et al., 2013). The rate of prostate cancer is low in Pakistan (3.8%) (Aziz et al., 2003) compared with Americans and Caucasians (Bhurgri et al., 2009). Three coding SNVs, in GHRLOS (rs696217, Leu72Met), SERPINE1 (rs6092, Ala15Thr) and PPARG (rs1801282, Pro12Ala), all have links with obesity (Gueorguiev et al., 2009, Bouchard et al., 2010, Galbete et al., 2013). About 22.2% of Pakistanis are reported to be obese, which is close to the rates in European (~24%) and United States (~19%) populations (Flegal et al., 2010, Kopelman et al., 2009).

We also found three pathogenic SNVs in genes associated with hair, skin and pigmentation: EDAR (rs3827760, Val370Ala), SLC45A2 (rs16891982, Phe374Leu), and TYR (rs1042602, Ser192Tyr) (Tan et al., 2013, Spichenok et al., 2011, Sulem et al., 2007). In addition, we detected an SNV (rs17822931, Gly180Arg) in ABCC11 that is responsible for wet earwax and was also found in the Pakistani PK1 genome (Yoshiura et al., 2006).


Figure 4.3: Comparative variant counts of previously reported individual genomes and the Pakistani (PTN) genome. Graphical representation of a comparison of PTN SNVs with other personal genomes reported previously.

One of the variants (rs1065852, Pro34Ser) in the CYP2D6 gene is responsible for poor metabolism of debrisoquine, an adrenergic-blocking medication used for the treatment of hypertension (Zheng et al., 2013). Also, two SNVs in TPMT (rs1142345, Tyr240Cys and rs1800460, Ala154Thr) are known to have a pathogenic effect and lead to thiopurine methyltransferase (TPMT) deficiency (Li et al., 2013, Corrigan et al., 2013). Moreover, two nsSNVs (rs2056899 and rs140980900) in the CYP4A22 and GGT5 genes of the arachidonic acid metabolism pathway were found. Arachidonic acid in the human body usually comes from dietary animal sources, such as meat, eggs and dairy. Meat is an important part of the diet of people living in northwestern Pakistan; it is usually consumed at least once a day, often in the form of kabab (minced meat fried in oil) or curry (Lindholm 2004).


Table 4.4: Functionally damaged novel nsSNVs.

CHR     POS          REF   ALT   AA         GENE        SIFT (≤ 0.05)   Polyphen2
chr1    114442945    T     C     E232G      AP4B1       0.00            Damaging
chr1    235976331    G     C     L75V       LYST        0.00            Damaging
chr1    113253928    C     T     G336R      PPM1J       0.01            Damaging
chr1    156242159    G     T     A222E      SMG5        0.01            Damaging
chr10   73475893     G     A     R68C       C10orf105   0.02            Damaging
chr11   128839275    C     G     G1931R     ARHGAP32    0.00            Damaging
chr11   46388863     C     T     L251F      DGKZ        0.04            Damaging
chr11   607617       G     A     G720R      PHRF1       0.01            Damaging
chr12   46757591     C     A     M324I      SLC38A2     0.03            Damaging
chr12   21457414     C     A     G179V      SLCO1A2     0.00            Damaging
chr12   8327035      C     G     H42Q       ZNF705A     0.00            Damaging
chr14   71445083     C     T     R677W      PCNX        0.01            Damaging
chr15   45426095     G     A     R31Q       DUOX1       0.04            Damaging
chr15   42041072     T     C     L1817P     MGA         0.00            Damaging
chr16   70524280     C     T     V555M      COG4        0.04            Damaging
chr16   27782929     A     G     E1385G     KIAA0556    0.05            Damaging
chr16   75147696     A     G     L324P      LDHD        0.00            Damaging
chr17   36003399     G     C     D17E       DDX52       0.01            Damaging
chr17   78082104     C     A     P324Q      GAA         0.00            Damaging
chr17   2995813      T     G     T160P      OR1D2       0.00            Damaging
chr17   7324288      C     A     D98E       SPEM1       0.00            Damaging
chr18   10487685     G     A     G399S      APCDD1      0.01            Damaging
chr18   55143927     C     G     S496C      ONECUT2     0.00            Damaging
chr19   4513548      C     T     G128R      PLIN4       0.01            Damaging
chr2    42990263     C     T     V353M      OXER1       0.01            Damaging
chr2    179439827    G     C     Q23678E    TTN         0.00            Damaging
chr2    98779387     C     G     I354M      VWA3B       0.05            Damaging
chr21   34924337     C     G     P934A      SON         0.02            Damaging
chr22   50307056     G     A     S91F       ALG12       0.01            Damaging
chr4    69796409     G     A     P387S      UGT2A3      0.00            Damaging
chr5    65290677     G     A     D98N       ERBB2IP     0.04            Damaging
chr5    154320687    T     A     L6Q        MRPL22      0.00            Damaging
chr5    140475629    T     A     Y419N      PCDHB2      0.00            Damaging
chr6    56879992     G     T     K120N      BEND6       0.00            Damaging
chr6    32188296     C     T     G349S      NOTCH4      0.00            Damaging
chr6    84234199     G     A     G347S      PRSS35      0.00            Damaging
chr7    73634930     G     C     R94S       LAT2        0.05            Damaging
chr8    28989961     C     G     E936Q      KIF13B      0.02            Damaging
chr8    81897091     C     T     D266N      PAG1        0.05            Damaging
chr8    110476498    C     A     H2479Q     PKHD1L1     0.01            Damaging
chr8    142228631    C     T     D319N      SLC45A4     0.03            Damaging
chr9    135863848    G     T     C168F      GFI1B       0.02            Damaging
chrX    152801794    C     T     T30M       ATP2B3      0.00            Damaging
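The selection criterion reflected in Table 4.4 (a SIFT score of at most 0.05 together with a damaging PolyPhen2 call) can be sketched as a simple predicate. This is an illustration under assumptions: the function name is hypothetical, and matching any PolyPhen2 label that contains the word "damaging" is our simplification of how the prediction labels were handled.

```python
def is_functionally_damaging(sift_score, polyphen_call, sift_cutoff=0.05):
    """Return True when both predictors classify the variant as damaging.

    sift_score    : SIFT score (lower means more likely deleterious)
    polyphen_call : PolyPhen2 prediction label, e.g. "Damaging" or "benign"
    """
    sift_damaging = sift_score <= sift_cutoff
    polyphen_damaging = "damaging" in polyphen_call.lower()
    return sift_damaging and polyphen_damaging


if __name__ == "__main__":
    # Two rows from Table 4.4, followed by two made-up negative examples.
    print(is_functionally_damaging(0.00, "Damaging"))  # AP4B1 E232G -> True
    print(is_functionally_damaging(0.05, "Damaging"))  # KIAA0556 E1385G -> True
    print(is_functionally_damaging(0.20, "Damaging"))  # False
    print(is_functionally_damaging(0.01, "benign"))    # False
```

Requiring agreement between the two predictors is a conservative choice: it reduces the number of candidates at the cost of missing variants flagged by only one tool.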


A comparative genomic analysis was performed between the Pakistani genome reported here, denoted “PTN”, and the previously published Pakistani (PK1) genome. Non-synonymous variants from the PK1 genome were annotated to investigate associated diseases. Out of ~8,000 nsSNVs, only 37 variants (three novel) were found to be linked with certain disorders. Eight clinically relevant SNVs overlapped with the PTN genome. In PK1 we found no damaging variants associated with Alzheimer’s, obesity or heart-related diseases of the kind found in the PTN genome. An SNV (rs1057910; CYP2C9) known to affect warfarin response was observed in the PK1 genome. Moreover, a pathogenic mutation (rs1169305) was seen in the HNF1A gene, which may become a cause of diabetes in the PK1 individual.

Most of the clinically relevant variants adopted in this study were originally described

in Caucasian populations. While this result might be a consequence of the genomic affinities

of the PTN genome with other Caucasian populations, it might also reflect a bias due to most

of the GWAS work being carried out on Caucasian populations (Ayub and Tyler-Smith

2009). Therefore a cohort study in the Pakistani population will be required for

authentication.

4.3 Pharmacogenomics Analysis:

Damaging nsSNVs were annotated using the PharmGKB and DrugBank databases (Hewett et al., 2002, Thorn et al., 2013, Wishart et al., 2008). A considerable number of variants were found to be linked with susceptibility to drug toxicity, while the remaining nsSNVs were associated with the efficacy of drugs used in the treatment of diseases such as depression and diabetes mellitus (Table 4.6).


Table 4.5: Clinically relevant coding SNVs in the Pakistani PTN whole genome.

Chr    Position   rsID         Ref  Alt  Clinical Significance  Description
chr1   115236057  rs17602729   G    A    Pathogenic             Muscle AMP deaminase deficiency (MMDD)
chr2   49189921   rs6166       C    T    Association            Ovarian hyperstimulation syndrome (OHSS)
chr2   49191041   rs6165       C    T    drug response          Ovarian response to FSH stimulation
chr2   109513601  rs3827760    A    G    Pathogenic             Hair morphology
chr2   215813331  rs726070     C    T    Pathogenic             Autosomal recessive congenital ichthyosis 4B (ARCI4B)
chr3   10331457   rs696217     G    T    Pathogenic             Obesity
chr3   12393125   rs1801282    C    G    Pathogenic             Obesity
chr3   15686693   rs13078881   G    C    Pathogenic             Biotinidase deficiency
chr3   133494354  rs1049296    C    T    risk factor            susceptibility to Alzheimer disease
chr4   102751076  rs10516487   G    A    Pathogenic             association with Systemic lupus erythematosus
chr5   33951693   rs16891982   C    G    Pathogenic             Skin/hair/eye pigmentation, variation in, 5 (SHEP5)
chr5   35861068   rs1494558    T    C    Pathogenic             Severe combined immunodeficiency
chr5   35871190   rs1494555    G    A    Pathogenic             Severe combined immunodeficiency
chr6   18130918   rs1142345    T    C    Pathogenic             Thiopurine methyltransferase deficiency (TPMT)
chr6   18139228   rs1800460    C    T    Pathogenic             Thiopurine methyltransferase deficiency (TPMT)
chr7   100771717  rs6092       G    A    Pathogenic             Plasminogen activator inhibitor type 1 deficiency
chr7   138417791  rs3807153    A    G    Pathogenic             Renal tubular acidosis, distal, autosomal recessive (RTADR)
chr8   18258103   rs1799930    G    A    drug response          Slow acetylator due to N-acetyltransferase enzyme variant
chr10  54531235   rs1800450    C    T    Pathogenic             Mannose-binding protein deficiency
chr10  70645376   rs10509305   A    C    Pathogenic             Preeclampsia/eclampsia 4 (PEE4)
chr11  5255582    rs35152987   C    A    Pathogenic             delta Thalassemia
chr11  88911696   rs1042602    C    A    Pathogenic             Skin/hair/eye pigmentation, variation in, 3 (SHEP3)
chr11  113270828  rs1800497    G    A    Pathogenic             Dopamine receptor d2, reduced brain density of
chr12  14993439   rs11276      C    T    Pathogenic             DOMBROCK BLOOD GROUP
chr14  21790040   rs10151259   G    T    Pathogenic             Cone-rod dystrophy 13 (CORD13)
chr15  28228553   rs74653330   C    T    Pathogenic             Tyrosinase-positive oculocutaneous albinism (OCA2)
chr16  48258198   rs17822931   C    T    Pathogenic             Colostrum secretion, Ear wax
chr17  12915009   rs4792311    G    A    Pathogenic             Prostate cancer, hereditary, 2 (HPC2)
chr20  43043159   rs142204928  G    A    likely pathogenic      Maturity-onset diabetes of the young, type 1 (MODY1)
chr20  43280227   rs73598374   C    T    Pathogenic             Adenosine deaminase 2 allozyme
chr22  42526694   rs1065852    G    A    Pathogenic             poor metabolism of Debrisoquine

Table 4.6: Damaging nsSNVs and their associated drugs. ENZ: enzyme; TRANS: transporter; TAR: drug target.

rsID Position Ref Alt AA Category Gene

rs1065852 22:42526694 G A P34S ENZ CYP2D6

Drugs amitriptyline; antipsychotics; atomoxetine; carvedilol; chlorpheniramine; chlorpromazine; citalopram; clomipramine; clozapine; codeine; debrisoquine; desipramine; dextromethorphan; doxepin; escitalopram; flecainide; fluoxetine; fluvoxamine; gefitinib; haloperidol; iloperidone; imipramine; maprotiline; metoprolol; mexiletine; mianserin; morphine; nortriptyline; paroxetine; perhexiline; perphenazine; propafenone; propranolol; risperidone; sparteine; tamoxifen; thioridazine; timolol; tolterodine; tramadol; yohimbine; zuclopenthixol

Diseases Breast Neoplasms; Cystic Fibrosis; Depression; Depressive Disorder; Hypertension; Neoplasms; Pain; Parkinson Disease; Schizophrenia; tardive dyskinesia

rs1142345 6:18130918 T C Y240C ENZ TPMT

Drugs azathioprine; cisplatin; mercaptopurine; methotrexate; purine analogues; s-adenosylmethionine; thioguanine

Diseases Drug Toxicity; Neoplasms; Ototoxicity; Precursor Cell Lymphoblastic Leukemia-Lymphoma

rs12210538 6:110760008 A G M409T TRANS SLC22A16

Drugs cyclophosphamide; doxorubicin

Diseases Breast Neoplasms; Drug Toxicity

rs1799930 8:18258103 G A R197Q ENZ NAT2

Drugs clonazepam; Drugs For Treatment Of Tuberculosis;ethambutol;isoniazid;pyrazinamide;rifampin;sulfamethoxazole;trimethoprim

Diseases Drug Toxicity; Hepatitis; Hypersensitivity; Infection; Maculopapular Exanthema; Pneumonia; Toxic liver disease; Tuberculosis

rs1800460 6:18139228 C T A154T TAR TPMT

Drugs azathioprine; cisplatin; mercaptopurine; purine analogues;s-adenosylmethionine;thioguanine

Diseases Drug Toxicity;Neoplasms;Ototoxicity;Precursor Cell Lymphoblastic Leukemia-Lymphoma

rs1800566 16:69745145 G A P187S TAR NQO1

Drugs 1-methyloxy-4-sulfone-benzene; Analgesics and anesthetics; anthracyclines and related substances; Antibiotics; antiepileptics; Antifungals For Systemic Use; antiinflammatory and antirheumatic products, non-steroids; Antimycobacterials; Antithyroid Preparations; cisplatin; cyclophosphamide; dicumarol; doxorubicin; Drugs For Treatment Of Tuberculosis; epirubicin; etoposide; fluorouracil; warfarin

Diseases Breast Neoplasms; Carcinoma, Non-Small-Cell Lung; Heart Failure; Leukemia; Lung Neoplasms; Toxic liver disease

rs1801133 1:11856378 G A A222V ENZ MTHFR

Drugs antineoplastic agents; antipsychotics; benazepril; busulfan; capecitabine; carboplatin; cisplatin; cyclophosphamide; cyclosporine; dactinomycin; dexamethasone; disulfiram; docetaxel; doxorubicin; fluorouracil; folic acid; gemcitabine; hormonal contraceptives for systemic use; hydroxychloroquine; leucovorin; mercaptopurine; methotrexate; nitrous oxide; oxaliplatin; paclitaxel; pemetrexed; pravastatin; sulfasalazine; vincristine; vinorelbine; vitamin b-complex, plain

Diseases Alopecia; Alzheimer Disease; Arthritis, Juvenile Rheumatoid; Arthritis, Psoriatic; Arthritis, Rheumatoid; Breast Neoplasms; Carcinoma, Non-Small-Cell Lung; Cardiovascular Diseases; Cleft Lip; Cleft Palate; Cocaine-Related Disorders; Colonic Neoplasms; Colorectal Neoplasms; Artery Disease; Down Syndrome; Drug Toxicity; Graft vs Host Disease; Hyperhomocysteinemia; Hypertension; Leukemia; Leukemia, Lymphocytic, Chronic, B-Cell; Leukemia, Myelogenous, Chronic, BCR-ABL Positive; Leukopenia; Lymphoma, Non-Hodgkin; metabolic syndrome; Migraine with Aura; Myocardial Infarction; Neoplasms; Neoplasms, Second Primary; Neural Tube Defects; Neutropenia; Osteonecrosis; Osteosarcoma; Pre-Eclampsia; Precursor Cell Lymphoblastic Leukemia-Lymphoma; Psoriasis; Schizophrenia; Thrombocytopenia; Toxic liver disease; Transplantation; venous thromboembolism

rs1801394 5:7870973 A G I49M TAR MTRR

Drugs folic acid; leucovorin; methotrexate; tegafur; vitamin b-complex, plain

Diseases Arthritis, Rheumatoid; Colorectal Neoplasms; Migraine with Aura; Precursor Cell Lymphoblastic Leukemia-Lymphoma; Stomatitis

rs2228570 12:48272895 A G M51T TAR VDR

Drugs 1,25-dihydroxyvitamin d3; calcipotriol; calcitriol; dexamethasone; vitamin d and analogues

Diseases Breast Neoplasms; Fractures, Bone; Osteonecrosis; Precursor Cell Lymphoblastic Leukemia-Lymphoma; Prostatic Neoplasms; Tuberculosis

rs4149056 12:21331549 T C V174A TRANS SLCO1B1

Drugs Arsenic compounds; atorvastatin; atrasentan; axitinib; bosentan; capecitabine; caspofungin; cerivastatin; cytarabine; enalapril; erythromycin; fludarabine; fluorouracil; fluvastatin; gemtuzumab ozogamicin; hmg coa reductase inhibitors; idarubicin; irinotecan; leucovorin; lopinavir; lovastatin; methotrexate; mycophenolate mofetil; nateglinide; olmesartan; penicillin g; pitavastatin; pravastatin; repaglinide; rifampin; rosuvastatin; simvastatin; SN-38; troglitazone; valsartan

Diseases Carcinoma, Non-Small-Cell Lung; Colorectal Neoplasms; Coronary Disease; Coronary Stenosis; Diabetes Mellitus, Type 2; Diarrhea; Hypercholesterolemia; Hyperlipidemias; Hyperlipoproteinemia Type II; Kidney Transplantation; Leukemia, Myeloid, Acute; Muscular Diseases; Myocardial Infarction; Myopathy, Central Core; Neoplasms; Neutropenia; Obesity; Precursor Cell Lymphoblastic Leukemia-Lymphoma; Rhabdomyolysis; Toxic liver disease; Transplantation

rs4646487 1:47279175 C T R173W ENZ CYP4B1

Drugs docetaxel; thalidomide

Diseases Prostatic Neoplasms

After determining the possibly pathogenic variants found in SIFT and Polyphen2, the

consensus of both datasets was further analyzed in order to find the most probable impact of

these deleterious variants in terms of drug targeting, transport, and metabolism. We found

nsSNVs that affect the function of drugs (two transport, five enzymatic, and four drug

targets). A variant rs1801133 (A222V in MTHFR gene) was found associated with increased

risk of metabolic syndrome when treated with antipsychotics (Ellingrod et al., 2008). Our

donor has a high chance of having decreased diastolic blood pressure if treated with benazepril

(Jiang et al., 2004). One of the variants (rs1799930, R197Q in NAT2 gene) was associated

with increased risk of toxic liver disease when treated with ethambutol, isoniazid,

pyrazinamide, and rifampin (Çetintaş et al., 2008). We also observed an SNV (rs1065852,

Chr22:42526694 G > A) which may affect this individual’s response to escitalopram when used for

depression and anxiety (Han et al., 2013). The detailed list of those drugs can be found in Table 4.7.
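The consensus of the two predictors can be sketched as a simple filter. The example assumes the SIFT cutoff of 0.05 shown in the header of the damaging-nsSNV table and a PolyPhen2 call of "Damaging"; the first two rows come from that table, while `GENE_X` and `GENE_Y` are invented rows added only to show variants that fail one of the two criteria.

```python
SIFT_CUTOFF = 0.05  # cutoff as given in the table header, "SIFT (<= 0.05)"

variants = [
    # (gene, aa_change, sift_score, polyphen2_call)
    ("AP4B1", "E232G", 0.00, "Damaging"),      # row from the table
    ("KIAA0556", "E1385G", 0.05, "Damaging"),  # row from the table
    ("GENE_X", "A1B", 0.30, "Damaging"),       # hypothetical: fails SIFT
    ("GENE_Y", "C2D", 0.01, "Benign"),         # hypothetical: fails PolyPhen2
]


def consensus_damaging(rows, cutoff=SIFT_CUTOFF):
    """Keep variants called damaging by both predictors."""
    return [r for r in rows if r[2] <= cutoff and r[3] == "Damaging"]


consensus = consensus_damaging(variants)
for gene, aa, sift, pph2 in consensus:
    print(gene, aa, sift, pph2)
```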

Table 4.7: List of drugs (PharmGKB) in the PTN Genome. VIP: Very Important Pharmacogenes; PD:

Pharmacodynamic; PK: Pharmacokinetic

Prot ID  Symbol    Genotyped  VIP    PD  PK  Variant Annotation
Q96J66   ABCC11    TRUE       FALSE  -   PK  FALSE
Q9BWD1   ACAT2     FALSE      FALSE  PD  -   FALSE
B0ZBD3   ADRA1A    FALSE      FALSE  PD  PK  FALSE
A2RU49   AGPHD1    FALSE      FALSE  -   -   TRUE
P50995   ANXA11    FALSE      FALSE  PD  -   TRUE
P04114   APOB      TRUE       FALSE  PD  -   TRUE
P38398   BRCA1     FALSE      TRUE   PD  -   TRUE
Q9UIR0   BTNL2     FALSE      FALSE  -   -   TRUE
P56545   CTBP2     FALSE      FALSE  -   -   TRUE
Q6NWU0   CYP2D6    TRUE       TRUE   PD  PK  TRUE
Q5TCH4   CYP4A22   TRUE       FALSE  -   -   FALSE
P13584   CYP4B1    TRUE       FALSE  PD  PK  TRUE
Q14246   EMR1      FALSE      FALSE  -   -   TRUE
P04626   ERBB2     FALSE      FALSE  PD  PK  TRUE
Q2V2M9   FHOD3     FALSE      FALSE  -   -   TRUE
Q08379   GOLGA2    FALSE      FALSE  PD  -   FALSE
P34931   HSPA1L    FALSE      FALSE  PD  -   TRUE
Q70Z44   HTR3D     FALSE      FALSE  PD  -   FALSE
P42858   HTT       FALSE      FALSE  -   -   TRUE
P05107   ITGB2     FALSE      FALSE  PD  -   FALSE
P98164   LRP2      FALSE      FALSE  PD  -   TRUE
Q9Y6C9   MTCH2     FALSE      FALSE  -   -   TRUE
Q6UB35   MTHFD1L   FALSE      FALSE  -   -   TRUE
P42898   MTHFR     TRUE       TRUE   PD  PK  TRUE
Q9Y2K3   MYH15     FALSE      FALSE  PD  -   TRUE
Q99466   NOTCH4    FALSE      FALSE  PD  -   FALSE
Q14980   NUMA1     TRUE       FALSE  -   -   FALSE
Q5JQS5   OR2B11    FALSE      FALSE  -   -   TRUE
Q9P1Y6   PHRF1     FALSE      FALSE  -   -   TRUE
Q9Y2K2   SIK3      FALSE      FALSE  -   -   TRUE
P46721   SLCO1A2   TRUE       FALSE  PD  PK  TRUE
P50226   SULT1A2   TRUE       FALSE  -   -   TRUE
P51580   TPMT      TRUE       TRUE   PD  PK  TRUE
O75445   USH2A     FALSE      FALSE  -   -   TRUE
P11473   VDR       TRUE       TRUE   PD  PK  TRUE
Q709C8   VPS13C    FALSE      FALSE  -   -   TRUE
Q502W6   VWA3B     FALSE      FALSE  -   -   TRUE

4.4 Comparison of PTN genome to worldwide populations:

Multidimensional scaling (MDS) for the PTN genome with 10 other diverse

populations from the Complete Genomics Inc dataset was carried out using 46,946 common

variants. The Pakistani Pathan individual (PTN) was observed near Gujarati Indians (GIH),

consistent with their geographical and cultural proximity (Figure 4.4). This

whole genome scale study of the PTN revealed a strong influence of Caucasians in the North-

West province of Pakistan. East Asian and African populations formed their own

clusters in the MDS, distinct from each other.
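MDS operates on a matrix of pairwise distances between individuals. As a rough illustration of that input, the sketch below computes an identity-by-state style distance from genotypes coded as 0/1/2 copies of the alternate allele. The coding, the function names and the toy genotypes are assumptions made for the example; it does not reproduce the PLINK run behind Figure 4.4.

```python
def ibs_distance(g1, g2):
    """Mean allele-sharing distance between two genotype vectors coded 0/1/2.

    Sites where either genotype is None (missing) are skipped.
    """
    pairs = [(a, b) for a, b in zip(g1, g2) if a is not None and b is not None]
    if not pairs:
        raise ValueError("no shared genotyped sites")
    return sum(abs(a - b) for a, b in pairs) / (2 * len(pairs))


def distance_matrix(samples):
    """Return sorted sample names and the symmetric pairwise distance matrix."""
    names = sorted(samples)
    matrix = [[ibs_distance(samples[a], samples[b]) for b in names] for a in names]
    return names, matrix


# Invented toy genotypes at six sites (not real data).
samples = {
    "PTN": [0, 1, 2, 1, 0, 2],
    "GIH_1": [0, 1, 2, 1, 1, 2],
    "YRI_1": [2, 0, 0, 1, 2, 0],
}
names, D = distance_matrix(samples)
for name, row in zip(names, D):
    print(name, [round(x, 3) for x in row])
```

A matrix of this kind, computed over all individuals and shared variants, is what a scaling method then projects onto two dimensions.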

Figure 4.4: Multidimensional scaling (MDS) plot generated by PLINK based on 46,946 SNVs data to

show the ancestry of the PTN genome. Two-dimensional visualization of genotype data, with samples from

ten different ethnic populations (ASW: African ancestry in Southwest USA, CEU: Utah residents of Northern

and Western European ancestry, KOR: Korean, CHB: Han Chinese in Beijing, GIH: Gujarati Indians in

Houston, Texas, JPT: Japanese in Tokyo, Japan, LWK: Luhya in Webuye, Kenya, MKK: Maasai in Kinyawa,

Kenya, TSI: Toscani in Italia, YRI: Yoruba in Ibadan, Nigeria) collected by the HapMap Consortium and our

donor Pathan (PTN) individual.

http://hapmap.ncbi.nlm.nih.gov/

The same 46,946 SNVs were used to perform model-based cluster analysis using the

software ADMIXTURE. We performed analysis for K = 2 to K = 13 distinct ancestral

populations. For K = 3, the PTN genome corresponds to the Caucasian ancestry, accounting

for 85% of ancestry overall in PTN Pakistani individual and 74% in Gujarati Indians (Figure

4.5). For K = 4, the Caucasian, African and East Asian ancestral populations were

the same as those seen for K = 3. Comparing results from K = 3 and K = 4, we see remarkable

agreement in the relative proportions of Caucasian and Asian ancestry across all Indian and

Pakistani individuals. However, K = 4 shows a very clear separation of South Asian ancestry

into distinct groups. Results from K = 5 to K = 13 suggest further separation in the ancestral

populations. Moreover, the ancestry chromosome painting was performed using

INTERPRETOME, which verifies the admixture SNVs of the Pakistani individual with

Caucasians and Asians (Figure 4.6). The admixture results are in agreement with the MDS

plots and suggest shared common ancestry of Pakistanis and Caucasians.
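Per-population ancestry fractions such as the 74% quoted for Gujarati Indians are averages of individual membership coefficients. The sketch below shows that averaging for K = 3, assuming one row of K coefficients per individual (each row summing to 1). Apart from the 85% PTN value from the text, the numbers are invented, and the column order (Caucasian, East Asian, African) is an assumption, since clustering output does not label its components.

```python
from collections import defaultdict


def mean_ancestry(q_rows, labels):
    """Average the membership coefficients of individuals sharing a population label."""
    grouped = defaultdict(list)
    for row, label in zip(q_rows, labels):
        grouped[label].append(row)
    return {
        label: [sum(col) / len(rows) for col in zip(*rows)]
        for label, rows in grouped.items()
    }


# Assumed column order: Caucasian, East Asian, African.
q_rows = [
    [0.85, 0.10, 0.05],  # PTN: 85% as reported in the text, remainder invented
    [0.76, 0.20, 0.04],  # GIH individual (invented)
    [0.72, 0.24, 0.04],  # GIH individual (invented)
]
labels = ["PTN", "GIH", "GIH"]

means = mean_ancestry(q_rows, labels)
for pop in sorted(means):
    print(pop, [round(x, 2) for x in means[pop]])
```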

Figure 4.5: ADMIXTURE results for K = 2 and K = 3 for the PTN individual combined with 46 selected

whole-genomes from Complete Genomics Inc. dataset (ASW: African ancestry in Southwest USA, CEU: Utah

residents of Northern and Western European ancestry, KOR: Korean, CHB: Han Chinese in Beijing, GIH:

Gujarati Indians in Houston, Texas, JPT: Japanese in Tokyo, Japan, LWK: Luhya in Webuye, Kenya, MKK:

Maasai in Kinyawa, Kenya, TSI: Toscani in Italia, YRI: Yoruba in Ibadan, Nigeria) and PTN: Pakistani Pathan.

The analysis was based on 46,946 SNVs. Each individual is represented by a vertical line, divided into colored

segments that represent membership coefficients in the subgroups.

Figure 4.6: Chromosome painting of possible genomic admixture, with Caucasians, Africans and Asians.

INTERPRETOME was used to create the chromosome ancestry painting.

4.5 Comparison with other Pakistani Individuals:

We investigated how representative our Pakistani PTN genome was of its ethnic

group by comparing it to 190 other Pakistani individuals in the HGDP-CEPH panel

(Rosenberg 2006, Li et al., 2008), which had been typed for ~650k SNVs. Admixture

analysis was performed based on 643,281 SNVs (thinned to avoid LD). Considering the

cluster membership from ADMIXTURE and STRUCTURE (from K=2 to K=5), the

Pakistani (PTN) genome composition was within the variability observed within the Pathan

sample from the HGDP (Figure 4.7). Similarly, in a multi-dimensional scaling (MDS) plot,

the PTN genome fell within the other Pathan individuals (Figure 4.8). Taken together, these

two results confirm that the Pakistani genome presented in this thesis (“PTN”)

is representative of the Pathan ethnic group. These results are also in line with the self-

reported ancestry of the subject, with all his grandparents coming from Afghanistan to

Khyber Pakhtunkhwa (Pakistan).
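One simple way to express "within the variability observed" is to check, for each ancestry component, that the PTN coefficient lies between the minimum and maximum seen in the reference Pathan sample. The sketch below does this with invented coefficients; the criterion itself is an assumption made for illustration, not necessarily the one applied to Figure 4.7.

```python
def within_observed_range(value, reference_values):
    """True if value lies between the reference minimum and maximum (inclusive)."""
    return min(reference_values) <= value <= max(reference_values)


def representative(sample, reference_rows):
    """Check every component of `sample` against the matching reference column."""
    columns = list(zip(*reference_rows))
    return all(within_observed_range(v, col) for v, col in zip(sample, columns))


# Invented K = 2 membership coefficients for five reference Pathans.
hgdp_pathan = [
    [0.62, 0.38],
    [0.70, 0.30],
    [0.66, 0.34],
    [0.74, 0.26],
    [0.68, 0.32],
]
ptn = [0.69, 0.31]  # invented value for the PTN genome

print(representative(ptn, hgdp_pathan))
```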

Figure 4.7: Admixture results of Pakistani Pathan (PTN) individual to other ethnic groups in South Asia.

Admixture results for K = 2 and K = 5 for the Pathan individual combined with eight ethnic genomes from

HGDP dataset. The analysis was based on 643,281 SNVs. Each individual is represented by a vertical line,

divided into colored segments that represent membership coefficients in the subgroups.

Figure 4.8: Relationship of Pakistani Pathan individual to other ethnic groups in South Asia. Twelve different groups from South Asia were compared with PTN. The

analysis was based on 643,281 SNVs.

4.6 Demographic History Analysis:

We inferred the demographic history of the Pakistani Pathan using the pairwise

sequentially Markovian coalescent (PSMC) model (Li and Durbin 2012) (Figure 4.9), and

compared it to a panel of worldwide populations based on a number of HGDP genomes (Meyer

et al., 2012). As previously reported, all populations share a similar demographic history

between 1 million and 200 kyr ago. From 200 kyr ago to 20 kyr ago, the PTN follows a similar

trajectory to other Asian and European populations, with an inferred effective population size

smaller than African populations, reflecting the out of Africa bottleneck. Over the last 20k years,

the PTN shows an explosion in effective population size, contemporaneous to other Eurasian

populations but much greater in magnitude. The very large effective population size likely

reflects admixture between European and Asian lineages giving rise to modern Pathans in

Pakistan (as also suggested by the analysis of mtDNA and Y-chromosome), rather than an actual

increase in census sizes.
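PSMC reports time and population size in scaled units, which must be converted before they can be read as years and effective population sizes. A commonly used conversion, to our understanding the one applied by the plotting script distributed with PSMC, is N0 = θ0 / (4μs), time in years = 2·N0·t·g and Ne = λ·N0, where s is the bin size. The mutation rate (2.5e-8 per site per generation), generation time (25 years) and bin size (100 bp) below are assumed defaults and are not stated in this thesis; the input values are invented.

```python
def scale_psmc(theta0, intervals, mu=2.5e-8, bin_size=100, generation_years=25):
    """Convert scaled PSMC intervals to (years before present, effective size).

    intervals: iterable of (t_k, lambda_k) pairs in PSMC's scaled units.
    mu, bin_size and generation_years are assumed values, not taken from the thesis.
    """
    n0 = theta0 / (4 * mu * bin_size)
    return [(2 * n0 * t * generation_years, lam * n0) for t, lam in intervals]


# Invented example values, chosen only to make the arithmetic easy to follow.
scaled = scale_psmc(theta0=0.001, intervals=[(0.5, 2.0), (1.0, 1.0)])
for years, ne in scaled:
    print(round(years), round(ne))
```

Because every point is multiplied by the same constants, a different choice of μ or generation time shifts the whole curve along both axes without changing its shape.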

Figure 4.9: Pairwise Sequentially Markovian Coalescent (PSMC) model for reconstructing Pakistan’s demographic

history.

4.7 mtDNA and Y-chromosome analyses

The full mitochondrial genome of the Pakistani individual was generated by mapping its

reads to the revised Cambridge reference sequence (rCRS) (Andrews et al., 1999). Adenine and

thymine (AT) content of the genome was 55.5%, while guanine and cytosine (GC) content was

44.5%. A total of 57 SNVs were found in the PTN mitochondrial genome, 13 of which had not

been previously reported. The variants were then mapped with HaploGrep (Kloss-Brandstätter et

al., 2011) to identify the mitochondrial haplogroup of our PTN individual. A total of 14 SNVs

were diagnostic of the C4a1a1 haplogroup, which is more prevalent in the southern Siberian

populations, and is also reported in Pakistani Pathans (Rakha et al., 2011, Derenko et al., 2010).

The AT and GC contents of the Y-chromosome were 39.87% and 60.13%, respectively.

A total of 13,724 SNVs were identified, of which 4,423 were novel. The observed Y-

chromosomal SNVs were annotated as markers for the L1 haplotype of clade L. Haplogroup L

has high frequency in Pakistan (14%) as compared to India (6.3%), Turkey (~4%) and Caucasians

(~6%) (Mohyuddin et al., 2001, Firasat et al., 2007).
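The AT and GC percentages quoted in this section are simple base counts over the called sequence. A minimal sketch follows, assuming that ambiguous bases such as N are excluded from the denominator; the thesis does not say how such bases were treated.

```python
def base_composition(seq):
    """Return (AT %, GC %) over unambiguous bases; N and other symbols are ignored."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    called = at + gc
    if called == 0:
        raise ValueError("sequence contains no A/C/G/T bases")
    return 100 * at / called, 100 * gc / called


# Toy sequences, not real mitochondrial or Y-chromosome data.
print(base_composition("ATATGCGC"))
print(base_composition("aatgNN"))
```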

4.8 Phylogenomic Analysis:

A phylogenetic tree was constructed using 46 unrelated individuals, in which genomes

belonging to the same population and geographic region were found together in the same clade.

The PTN genome was closest to the Indian genomes, which were more similar and

geographically nearer to it than the representative genomes from other

Asian individuals. Pakistan borders China on its north-east side, but the Chinese genomes

formed a separate branch with genetically similar groups such as Japanese and Koreans (Figure

4.10). Genomes from East Asia were placed close to each other. African genomes, which include the

genomes from Yoruba (YRI), Maasai (MKK), and Luhya (LWK) populations as well as Africans

from USA (ASW), were on one clade, clearly separated from the Asian and Caucasian

genomes. Utah genomes (CEU) were grouped together, separated from those of Italy (TSI). Only

the Indian (GIH) and Pakistani (PTN) genomes were used from South Asia for this study.

Together they formed a clade. However, they also showed a rather clear separation from each other.

Figure 4.10: Phylogenomic tree of Pakistani PTN genome with other world ethnic genomes.

Chapter 5 DISCUSSION

CHAPTER 5

5. Discussion

Globally, human populations show structured genetic diversity as a result of geographical

dispersion, selection and drift (Gurdasani et al., 2015). Understanding this variation can provide

insights into evolutionary processes that shape both human adaptation and variation in disease

susceptibility (Ding and Kullo 2009). Although the Hapmap (Gibbs et al., 2003), HGDP (Cann

et al., 2002), PanAsia (Ngamphiw et al., 2011) and 1000 Genomes Projects (Siva, 2008) have

greatly enhanced our understanding of genetic variation globally, the detailed characterization of

Pakistani populations remains unexplored. Efforts such as the Human Genome Diversity

Panel examine Pakistani genetic diversity but are limited by variant density (Cann et al., 2002).

The Pakistan population consists of four major ethnic groups (Punjabis, Pakhtuns, Sindhis,

Balochis) each with unique cultural, dietary, environmental and ancestral heritage (Mehdi et al.,

1999). Genetic inferences about these ethnic groups have mostly focused on the uniparental

lineage markers, indicating the Pakistanis’ ancient admixture with Caucasians (Mohyuddin et al.,

2001). Clarification and study of the Pakistani population’s admixture provide fundamental

knowledge pertinent to interpretation of any genetic study of prevalent disease in Pakistani

groups and corresponding improved healthcare. Prevalent diseases in Pakistan include

cancer, diabetes, hypertension, and cardiovascular and neurological disorders (Dennis et al., 2006;

Rizvi et al., 2004; Whiting et al., 2011; Shera et al., 2007; Jafar et al., 2005; Jafar et al., 2003;

Nanan 2009; Shah et al., 2001; Mirza and Jenkins 2004). For example, it is estimated that 10%

of the population is afflicted with neurological diseases (Husain et al., 2000).

As a consequence of genetic diversity associated with dispersion, selection and drift,

and complicated by admixture, disease prevalence, severity, and resistance vary considerably

among ethnic groups. These factors are further complicated by inheritance issues and

noninherited and environmental causes, such as poverty, unequal access to care, lifestyle, and

health-related cultural practices (Chin et al., 2007). Genetic makeup of populations from

Pakistan is important for the knowledge contribution to specific diseases and is important to

scientists around the globe due to increased likelihood of congenital diseases unique in

prevalence to Pakistani populations. Consequently, this research was conducted to sequence the

first whole genome from northwest Pakistan for discovering disease variants as well as provide a

foundation for complex disease studies. The current research not only provides new

approaches in exploring population admixture dynamics, but also helps us conduct the first

genetic study of diseases and pharmacogenes in the northwestern population of Pakistan. The

ultimate goal of this study was to extend the results of these studies to the interpretation and

translation needed to improve healthcare for the Pakistani people.

5.1 Clinical Relevance and Variant Characterization:

Studying complex diseases and gene mapping is often difficult due to sampling from

genetically heterogeneous populations. This complexity can be circumvented in isolated

populations where both genetic and environmental homogeneity will likely produce fewer

variants of the disease and the extent of linkage disequilibrium is generally larger than in outbred

populations (Race and Group 2005). Genomic variations including single nucleotide variations

(SNVs), small insertions and deletions (indels), and copy number variations (CNVs) were

identified. Variants were then annotated and scanned for associated biological and physiological

function along with SNVs that could modulate drug response.

Overall, 3.8 million single nucleotide variations (SNVs), 1,503 copy number variation

regions (CNVRs) and 0.5 million small indels were identified by comparing it with the human

reference genome (hg19). Among the SNVs, 129,441 were novel, and 10,315 non-synonymous

SNVs were found in 5,344 genes. SNVs were annotated for genealogical study, high risk

diseases, as well as possible influences on drug efficacy. Functional classification of all the non-

synonymous variants obtained was performed using computational prediction methods. Clinical

variants were investigated, and it was found that 31 coding SNVs are associated with several

diseases. Our analysis suggests that the donor may be susceptible to Alzheimer’s disease (AD), based on

an SNV (rs1049296) in the TF gene in which proline changes into serine at position 570

(Wang et al., 2013). This AD-associated SNV decreases the affinity of iron for TF, leading to

iron accumulation in brain cells, which results in memory loss. Another variant, rs4792311 in the

ELAC2 gene, was observed in the Pakistani genome (PTN) and is reported to be associated with

prostate cancer. As a result of this SNV, serine at position 217 is replaced by leucine

(Alvarez-Cubero et al., 2013). The rate of prostate cancer is low in Pakistan (3.8%) (Aziz et al.,

2003) compared to Americans and Caucasians (Bhurgri et al., 2009). The donor’s family

medical history showed that there are documented cases of obesity, hypertension and heart

diseases. Therefore, we specifically investigated those genes which are responsible for the said

disorders. Three variants associated with obesity were found in the genes GHRLOS (rs696217,

Leu72Met), SERPINE1 (rs6092, Ala15Thr), and PPARG (rs1801282, Pro12Ala) (Gueorguiev et

al., 2009; Bouchard et al., 2010; Galbete et al., 2013). About 22.2% of Pakistanis are reported to

be obese which is close to European (~24%) and United States populations (~19%) (Flegal et al.,

2010; Kopelman et al., 2009; Streib 2007). We also found three pathogenic SNVs in genes

associated with hair, skin and pigmentation: EDAR (rs3827760, Val370Ala), SLC45A2

(rs16891982, Phe374Leu), and TYR (rs1042602, Ser192Tyr) (Tan et al., 2013; Spichenok et al.,

2011; Sulem et al., 2007). In addition, we detected an SNV (rs17822931, Gly180Arg) in

ABCC11, which is responsible for wet earwax and was also found in the Pakistani PK1

genome (Yoshiura et al., 2006).

One of the variants (rs1065852, Pro34Ser) in the CYP2D6 gene is responsible for poor

metabolism of debrisoquine, an adrenergic-blocking medication used for the treatment of

hypertension (Zheng et al., 2013). Also, two SNVs are known to have a pathogenic effect and

lead to thiopurine methyltransferase (TPMT) deficiency (Li et al., 2013; Corrigan et al., 2013).

Moreover, two nsSNVs in the Arachidonic acid metabolism pathway were found. Arachidonic

acid in the human body usually comes from dietary animal sources, such as meat, eggs, and dairy

products. Meat is an important part of diet for the people living in Khyber Pakhtunkhwa, usually

consumed at least once a day, often in the form of kabab (minced meat fried in oil), or curry

(Lindholm, 2004).

Comparative genomic analysis was done using genome from the northwest (PTN) and the

other previously published Pakistani (PK1) genome (Azim et al., 2013). The PK1 genome was

reported to be of Sindhi ethnicity. Non-synonymous variants from the Pakistani (PK1) genome were

annotated and screened against disease and drug databases and prediction tools, for example SIFT, PolyPhen,

OMIM, ClinVar, PharmGKB and DrugBank (Ng and Henikoff, 2003, Jordan et al., 2011,

Landrum et al., 2013, Amberger et al., 2011, Thorn et al., 2013, Wishart et al., 2008) for

investigating associated diseases. Out of ~8,000 nsSNVs only 37 variants (three novel) were

found linked with certain disorders. Eight clinically relevant SNVs overlapped

with those of the PTN genome. We found no damaging variants responsible for Alzheimer’s, obesity or

heart-related diseases in PK1 of the kind found in the PTN genome. An SNV was observed in the PK1

genome which is known to affect warfarin response (Schwarz et al., 2008). Moreover, a pathogenic

mutation (rs1169305) was seen in the HNF1A gene which may become a cause of diabetes in the

PK1 individual (Bonnycastle et al., 2006). In addition, we detected an SNV (rs17822931,

Gly180Arg) in ABCC11, which is responsible for wet earwax which was found in both Pakistani

genomes (Yoshiura et al., 2006).

5.2 Pharmacogenomic Profile:

The genetic map of the PTN individual was further used for finding possible influences on drug

efficacy. A large number of variants were associated with susceptibility to drug toxicity,

while other nsSNVs were linked to the efficacy of medicines used in the treatment of diseases

such as depression, diabetes mellitus, Alzheimer disease, arthritis and so on. A variant was found

associated with increased risk of metabolic syndrome when treated with antipsychotics

(Ellingrod et al., 2008). Our donor has a high chance of having decreased diastolic blood pressure

if treated with benazepril (Jiang et al., 2004). One of the variants was associated with increased

risk of toxic liver disease when treated with ethambutol, isoniazid, pyrazinamide, and rifampin

(Çetintaş et al., 2008). We also observed an SNV which may affect this individual’s response to escitalopram

when used for depression and anxiety (Han et al., 2013).

Most of the clinically relevant variants adopted in this study were originally described in

Caucasian populations. While this result might be a consequence of the genomic affinities of the

Pakistani genome with other Caucasian populations, it might also reflect a bias due to most of

the GWAS work being carried out on Caucasian populations (Ayub et al., 2009). Therefore a

cohort study in the Pakistani population will be required for authentication.

The methodology, technology and infrastructure that we developed and used are equally

powerful to study other global ethnic populations and the diseases most prevalent in those

populations. Most importantly we successfully created a DNA variation dataset of the Pakistani

population and made it available to researchers for understanding human biology with respect to

disease predisposition, adverse drug reaction, and other genetically valuable healthcare

interpretation.

5.3 Genealogical and Admixture Analysis:

For many years researchers have been trying to clarify the origins and stratification

as well as intra and inter-population relationships of ethnic groups in Pakistan. Originally the

focus was on uniparental lineage markers passed through the Y chromosome and mtDNA in

male and female, respectively (Mohyuddin et al., 2001, Firasat et al., 2007, Rakha et al., 2011,

Metspalu et al., 2004). Therefore we analyzed the first ever whole genome of a Pathan /

Pakhtun from a North West province (Khyber Pakhtunkhwa) of Pakistan, to explore what

additional information can be learnt. Other analytical approaches were also used to assess the

influence of ancestral contributions within Pakistani Pakhtuns along with the historical

background of the region. Our analysis of 46 unrelated human genomes from 10 different

populations provides a comprehensive view of the PTN genome. We found that the Pakistani

Pathan appears along the Indian cline in our MDS, beside Caucasians and East Asians. We saw

that at K = 4 the Pakistani Pathans and Indians formed their own component, becoming better

representatives of South Asia. This was additionally confirmed by comparing our

representative genome with other individuals from South Asia in the HGDP-CEPH panel (Li et

al., 2008), which were studied using Illumina Omnichips of ~650k SNVs. Considering the

cluster membership (from K=2 to K=5), the PTN genome composition was within the

variability observed within the Pathan sample from the HGDP (Figure 4.7). Similarly, in a

multi-dimensional scaling (MDS) plot, the PTN genome fell within the other Pathan/Pakistani

individuals (Figure 4.8). African populations were found to be the most distant and differentiated

from the Pathan population. Being the only neighboring genomes in the dataset, the Indian genomes showed the

closest genetic relationship with the Pakistani PTN genome. Both types of ethnic genomes formed

a separate clade distant from other Asian genomes, as supported by the MDS plot and phylogenetic

tree analysis.

Based on our results we confirmed that our genome PTN is representative of the Pathan

ethnic group. These results are also in line with the self-reported ancestry of the subject, with all

his grandparents coming from Afghanistan to Khyber Pakhtunkhwa (Pakistan). We found that

the Pathan genome has more than 80% of Caucasian ancestry with C4a1a1 mito group and L Y-

chromosome group, suggesting that Pathans are probably an admixture of Caucasian and South

Asians at the genomic level. Haplogroup L has high frequency in Pakistan (14%) as compared

to India (6.3%), Turkey (~4%) and Caucasians (~6%) (Mohyuddin et al., 2001, Firasat et al.,

2007).

5.4 Demographic History Analysis and Ancestral Population Size:

We inferred the demographic history of the Pakistani genome (PTN) using the pairwise sequentially Markovian coalescent (PSMC) model (Li & Durbin, 2012) (Figure 4.9), and compared it to a panel of worldwide populations based on a number of HGDP genomes (Meyer et al., 2012). As previously reported, all populations share a similar demographic history between 1 million and 200 kyr ago. From 200 kyr to 20 kyr ago, the PTN genome follows a trajectory similar to other Asian and European populations, with an inferred effective population size smaller than that of African populations, reflecting the out-of-Africa bottleneck. Over the last 20 kyr, the PTN genome shows an explosion in effective population size, contemporaneous with other Eurasian populations but much greater in magnitude. This very large effective population size likely reflects admixture between the European and Asian lineages that gave rise to modern Pathans (as also suggested by the analysis of mtDNA and the Y-chromosome), rather than an actual increase in census size.
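For readers unfamiliar with how PSMC output is placed on the axes of a plot such as Figure 4.9, the sketch below applies the usual PSMC rescaling: N0 = θ0 / (4μs), time in years T_k = 2·N0·t_k·g, and effective size N_k = N0·λ_k. The mutation rate μ, generation time g, and the toy θ0 and interval values are assumptions chosen for illustration; they are not the parameters or results reported in this thesis.

```python
def scale_psmc(theta0, intervals, mu=2.5e-8, gen_years=25, bin_size=100):
    """Convert scaled PSMC estimates into years and effective population size.

    theta0    : scaled mutation rate per bin, as estimated by PSMC
    intervals : list of (t_k, lambda_k) pairs in PSMC's scaled units
    mu        : per-site, per-generation mutation rate (assumed value)
    gen_years : generation time in years (assumed value)
    bin_size  : number of bases per PSMC bin (assumed value)
    """
    n0 = theta0 / (4 * mu * bin_size)  # reference effective population size
    scaled = []
    for t_k, lambda_k in intervals:
        years = 2 * n0 * t_k * gen_years  # interval start time, in years
        ne = n0 * lambda_k                # effective size in that interval
        scaled.append((years, ne))
    return scaled


# Hypothetical scaled values (assumption, for illustration only).
toy_theta0 = 0.1
toy_intervals = [(0.0, 1.0), (0.5, 2.0)]

history = scale_psmc(toy_theta0, toy_intervals)
for years, ne in history:
    print(f"{years:12.0f} years ago  Ne = {ne:10.0f}")
```

Because both axes depend linearly on the assumed μ and g, a different choice shifts the whole curve without changing its shape, which is one reason comparisons across populations are usually made with a single shared set of parameters.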

5.5 Conclusion:

Here we present, for the first time, the whole genome of a Pakistani individual from a north-west province (Khyber Pakhtunkhwa). This research not only provides new approaches for exploring population admixture dynamics, but also helps us conduct the first genetic study of disease and pharmacogenes in the northwestern population of Pakistan. The ultimate goal of this research is to extend the results of these studies to interpretation and translation, in order to improve healthcare for the Pakistani people. Our analysis provides a detailed view of the diversity of the PTN genome, the functional classification of its variants, and their impact on pharmacogenomics. A large-scale analysis of diverse genomes is needed to help researchers around the world understand genetic diversity and the functional classification of variants, along with pharmacogenomic traits and the associated drugs that could be used in personalized medicine.

5.6 Recommendations and Future Plans:

• A genetic resource for all Pakistani populations should be established for computing their allele sharing as a measure of linkage disequilibrium, admixture, and migration.

• Cohort studies in the Pakistani population are required for authentication, which will help us in conducting genetic disease studies.

• Rare and common diseases, and their susceptibility and association with the genetic makeup of the Pakistani population, should be investigated.

• Patients, physicians and science journalists should be educated on interpreting genomic results.

• Genomics applications and implications should be openly discussed through conferences and workshops. This will encourage interaction between experts, academicians, researchers, students and policy makers.

CHAPTER 6

6. References

Ahn, S.-M., Kim, T.-H., Lee, S., Kim, D., Ghang, H., Kim, D.-S., et al. (2009). The first Korean

genome sequence and analysis: full genome sequencing for a socio-ethnic group. Genome

research, 19(9), 1622-1629.

Alexander, D. H., Novembre, J., & Lange, K. (2009). Fast model-based estimation of ancestry in

unrelated individuals. Genome research, 19(9), 1655-1664.

Alvarez-Cubero, M. J., Saiz, M., Martinez-Gonzalez, L. J., Alvarez, J. C., Lorente, J. A., &

Cozar, J. M. (2013). Genetic analysis of the principal genes related to prostate cancer: a

review. Urologic Oncology: Seminars and Original Investigations.

Amberger, J., Bocchini, C., & Hamosh, A. (2011). A new face and new challenges for Online

Mendelian Inheritance in Man (OMIM®). Human mutation, 32(5), 564-567.

Andrews, R. M., Kubacka, I., Chinnery, P. F., Lightowlers, R. N., Turnbull, D. M., & Howell, N.

(1999). Reanalysis and revision of the Cambridge reference sequence for human

mitochondrial DNA. Nature genetics, 23(2), 147-147.

Ayub, Q., & Tyler-Smith, C. (2009). Genetic variation in South Asia: assessing the influences of

geography, language and ethnicity for understanding history and disease risk. Briefings in

functional genomics & proteomics, 8(5), 395-404.

Azim, M. K., Yang, C., Yan, Z., Choudhary, M. I., Khan, A., Sun, X., et al. (2013). Complete

genome sequencing and variant analysis of a Pakistani individual. Journal of human

genetics, 58(9), 622-626.

Aziz, Z., Sana, S., Saeed, S., & Akram, M. (2003). Institution based tumor registry from Punjab:

five year data based analysis. JOURNAL-PAKISTAN MEDICAL ASSOCIATION, 53(8),

350-353.

Bentley, D. R., Balasubramanian, S., Swerdlow, H. P., Smith, G. P., Milton, J., Brown, C. G., et

al. (2008). Accurate whole human genome sequencing using reversible terminator

chemistry. Nature, 456(7218), 53-59.

Bhurgri, Y., Kayani, N., Pervez, S., Ahmed, R., Tahir, I., Afif, M., et al. (2009). Incidence and

Trends of Prostate Cancer in Karachi South. Asian Pacific Journal of Cancer Prevention,

10, 45-48.

Bodmer, W., & Bonilla, C. (2008). Common and rare variants in multifactorial susceptibility to

common diseases. Nature genetics, 40(6), 695-701.

Bonnycastle, L. L., Willer, C. J., Conneely, K. N., Jackson, A. U., Burrill, C. P., Watanabe, R.

M., et al. (2006). Common variants in maturity-onset diabetes of the young genes

contribute to risk of type 2 diabetes in Finns. Diabetes, 55(9), 2534-2540.

Bouchard, L., Vohl, M.-C., Lebel, S., Hould, F.-S., Marceau, P., Bergeron, J., et al. (2010).

Contribution of genetic and metabolic syndrome to omental adipose tissue PAI-1 gene

mRNA and plasma levels in obesity. Obesity surgery, 20(4), 492-499.

Cann, H. M., De Toma, C., Cazes, L., Legrand, M.-F., Morel, V., Piouffre, L., et al. (2002). A

human genome diversity cell line panel. Science (New York, NY), 296(5566), 261.

Cavalli-Sforza, L. L. (2005). The human genome diversity project: past, present and future.

Nature Reviews Genetics, 6(4), 333-340.

Çetintaş, V. B., Erer, O. F., Kosova, B., Özdemir, İ., Topçuoğlu, N., Aktoğu,

S., et al. (2008). Determining the relation between N-acetyltransferase-2 acetylator

phenotype and antituberculosis drug induced hepatitis by molecular biologic tests. Tuberk

Toraks, 56, 81-86.

Chin, M. H., Walters, A. E., Cook, S. C., & Huang, E. S. (2007). Interventions to reduce racial

and ethnic disparities in health care. Medical Care Research and Review, 64(5 suppl), 7S-

28S.

Collins, F. S., & Mansoura, M. K. (2001). The Human Genome Project. Revealing the shared

inheritance of all humankind. Cancer, 91(1 Suppl), 221-225.

Collins, F. S., Brooks, L. D., & Chakravarti, A. (1998). A DNA polymorphism discovery

resource for research on human genetic variation. Genome research, 8(12), 1229-1231.

Corrigan, A., Lal, R., Wickramasinghe, S., Whelan, S., Sanderson, J., Marinaki, A., et al. (2013).

31 Testing for association between TPMT, COMT and NOX3 variants and the onset of

ototoxicity in lung cancer patients treated with platinum chemotherapy. Lung Cancer, 79,

S11.

Danecek, P., Auton, A., Abecasis, G., Albers, C. A., Banks, E., DePristo, M. A., et al. (2011).

The variant call format and VCFtools. Bioinformatics, 27(15), 2156-2158.

Dennis, B., Aziz, K., She, L., Faruqui, A., Davis, C., Manolio, T. A., et al. (2006). High rates of

obesity and cardiovascular disease risk factors in lower middle class community in

Pakistan: the Metroville Health Study. J Pak Med Assoc, 56(6), 267-272.

Derenko, M., Malyarchuk, B., Grzybowski, T., Denisova, G., Rogalla, U., Perkova, M., Dambueva, I.,

& Zakharov, I. (2010). Origin and post-glacial dispersal of mitochondrial DNA haplogroups

C and D in northern Asia. PloS one, 5(12), e15214.

Ding, K., & Kullo, I. J. (2009). Evolutionary genetics of coronary heart disease. Circulation,

119(3), 459-467.

Dissanayake, V. H., Samarakoon, P. S., Scaria, V., Patowary, A., Sivasubbu, S., & Gokhale, R.

S. (2011). The Sri Lankan Personal Genome Project. The Sri Lankan Personal Genome

Project, 2(1), 4-8.

Do, R., Balick, D., Li, H., Adzhubei, I., Sunyaev, S., & Reich, D. (2015). No evidence that

selection has been less effective at removing deleterious mutations in Europeans than in

Africans. Nature genetics.

Dogan, H., Can, H., & Otu, H. H. (2014). Whole Genome Sequence of a Turkish Individual.

PloS one, 9(1).

Drmanac, R., Sparks, A. B., Callow, M. J., Halpern, A. L., Burns, N. L., Kermani, B. G., et al.

(2010). Human genome sequencing using unchained base reads on self-assembling DNA

nanoarrays. Science, 327(5961), 78-81.

Elingarami, S., Li, X., & He, N. (2013). Applications of nanotechnology, next generation

sequencing and microarrays in biomedical research. Journal of nanoscience and

nanotechnology, 13(7), 4539-4551.

Ellingrod, V. L., Miller, D. D., Taylor, S. F., Moline, J., Holman, T., & Kerr, J. (2008).

Metabolic syndrome and insulin resistance in schizophrenia patients receiving

antipsychotics genotyped for the methylenetetrahydrofolate reductase (MTHFR) 677C/T

and 1298A/C variants. Schizophrenia research, 98(1), 47-54.

Feero, W. G., & Guttmacher, A. E. (2014). Genomics, personalized medicine, and pediatrics.

Academic pediatrics, 14(1), 14-22.

Felsenstein, J. (2002). PHYLIP (Phylogeny Inference Package) version 3.6a3.

Feuk, L., Carson, A. R., & Scherer, S. W. (2006). Structural variation in the human genome.

Nature Reviews Genetics, 7(2), 85-97.

Firasat, S., Khaliq, S., Mohyuddin, A., Papaioannou, M., Tyler-Smith, C., Underhill, P. A., et al.

(2007). Y-chromosomal evidence for a limited Greek contribution to the Pathan

population of Pakistan. European Journal of Human Genetics, 15(1), 121-126.

Flegal, K. M., Carroll, M. D., Ogden, C. L., & Curtin, L. R. (2010). Prevalence and trends in

obesity among US adults, 1999-2008. Jama, 303(3), 235-241.

Fujimoto, A., Nakagawa, H., Hosono, N., Nakano, K., Abe, T., Boroevich, K. A., et al. (2010).

Whole-genome sequencing and comprehensive variant analysis of a Japanese individual

using massively parallel sequencing. Nature genetics, 42(11), 931-936.

Galbete, C., Toledo, J., Martínez-González, M. Á., Martínez, J. A., Guillén-Grima, F., & Marti,

A. (2013). Lifestyle factors modify obesity risk linked to PPARG2 and FTO variants in

an elderly population: a cross-sectional analysis in the SUN Project. Genes & nutrition,

8(1), 61-67.

Gibbs, R. A., Belmont, J. W., Hardenbol, P., Willis, T. D., Yu, F., Yang, H., et al. (2003). The

international HapMap project. Nature, 426(6968), 789-796.

Gueorguiev, M., Lecoeur, C., Meyre, D., Benzinou, M., Mein, C. A., Hinney, A., et al. (2009).

Association studies on ghrelin and ghrelin receptor gene polymorphisms with obesity.

Obesity, 17(4), 745-754.

Gupta, R., Ratan, A., Rajesh, C., Chen, R., Kim, H. L., Burhans, R., et al. (2012). Sequencing

and analysis of a South Asian-Indian personal genome. BMC genomics, 13(1), 440.

Gurdasani, D., Carstensen, T., Tekola-Ayele, F., Pagani, L., Tachmazidou, I., Hatzikotoulas, K.,

et al. (2015). The African Genome Variation Project shapes medical genetics in Africa.

Nature, 517(7534), 327-332.

Han, K.-M., Chang, H. S., Choi, I.-K., Ham, B.-J., & Lee, M.-S. (2013). CYP2D6 P34S

polymorphism and outcomes of escitalopram treatment in Koreans with major

depression. Psychiatry investigation, 10(3), 286-293.

Hewett, M., Oliver, D. E., Rubin, D. L., Easton, K. L., Stuart, J. M., Altman, R. B., et al. (2002).

PharmGKB: the pharmacogenetics knowledge base. Nucleic acids research, 30(1), 163-

165.

Hudson, M. E. (2008). Sequencing breakthroughs for genomic ecology and evolutionary biology.

Molecular ecology resources, 8(1), 3-17.

Husain, N., Creed, F., & Tomenson, B. (2000). Depression and social stress in Pakistan.

Psychological medicine, 30(2), 395-402.

Iafrate, A. J., Feuk, L., Rivera, M. N., Listewnik, M. L., Donahoe, P. K., Qi, Y., et al. (2004).

Detection of large-scale variation in the human genome. Nature genetics, 36(9), 949-951.

Jafar, T. H., Jessani, S., Jafary, F. H., Ishaq, M., Orkazai, R., Orkazai, S., et al. (2005). General

Practitioners’ Approach to Hypertension in Urban Pakistan: Disturbing Trends in Practice.

Circulation, 111(10), 1278-1283.

Jafar, T. H., Levey, A. S., Jafary, F. H., White, F., Gul, A., Rahbar, M. H., et al. (2003). Ethnic

subgroup differences in hypertension in Pakistan. Journal of hypertension, 21(5), 905-

912.

Jiang, S., Hsu, Y.-H., Xu, X., Xing, H., Chen, C., Niu, T., et al. (2004). The C677T

polymorphism of the methylenetetrahydrofolate reductase gene is associated with the

level of decrease on diastolic blood pressure in essential hypertension patients treated by

angiotensin-converting enzyme inhibitor. Thrombosis research, 113(6), 361-369.

Jordan, D. M., Kiezun, A., Baxter, S. M., Agarwala, V., Green, R. C., Murray, M. F., et al.

(2011). Development and validation of a computational method for assessment of

missense variants in hypertrophic cardiomyopathy. The American Journal of Human

Genetics, 88(2), 183-192.

Karczewski, K. J., Tirrell, R. P., Cordero, P., Tatonetti, N. P., Dudley, J. T., Salari, K., et al.

(2012). Interpretome: a freely available, modular, and secure personal genome

interpretation engine. Paper presented at the Pac Symp Biocomput.

Kelly, A. D., Hill, K. E., Correll, M., Hu, L., Wang, Y. E., Rubio, R., et al. (2013). Next-

generation sequencing and microarray-based interrogation of microRNAs from formalin-

fixed, paraffin-embedded tissue: preliminary assessment of cross-platform concordance.

Genomics, 102(1), 8-14.

Kim, J.-I., Ju, Y. S., Park, H., Kim, S., Lee, S., Yi, J.-H., et al. (2009). A highly annotated

whole-genome sequence of a Korean individual. Nature, 460(7258), 1011-1015.

Kircher, M. (2011). Understanding and improving high-throughput sequencing data production

and analysis. PhD Thesis. (http://www.qucosa.de)

Kitzman, J. O., MacKenzie, A. P., Adey, A., Hiatt, J. B., Patwardhan, R. P., Sudmant, P. H., et

al. (2011). Haplotype-resolved genome sequencing of a Gujarati Indian individual.

Nature biotechnology, 29(1), 59-63.

Kitzmann, K. M., Dalton III, W. T., Stanley, C. M., Beech, B. M., Reeves, T. P., Buscemi, J., et

al. (2010). Lifestyle interventions for youth who are overweight: a meta-analytic review.

Health Psychology, 29(1), 91.

Kloss-Brandstätter, A., Pacher, D., Schönherr, S., Weissensteiner, H., Binna, R., Specht, G., &

Kronenberg, F. (2011). HaploGrep: a fast and reliable algorithm for automatic classification of

mitochondrial DNA haplogroups. Human mutation, 32(1), 25-32.

Koboldt, D. C., Ding, L., Mardis, E. R., & Wilson, R. K. (2010). Challenges of sequencing

human genomes. Briefings in bioinformatics, 11(5), 484-498.

Kopelman, P. G., Caterson, I. D., & Dietz, W. H. (2009). Clinical obesity in adults and children:

John Wiley & Sons.

Lander, E. S., Linton, L. M., Birren, B., Nusbaum, C., Zody, M. C., Baldwin, J., et al. (2001).

Initial sequencing and analysis of the human genome. Nature, 409(6822), 860-921.

Landrum, M. J., Lee, J. M., Riley, G. R., Jang, W., Rubinstein, W. S., Church, D. M., et al.

(2013). ClinVar: public archive of relationships among sequence variation and human

phenotype. Nucleic acids research, gkt1113.

Levy, S., Sutton, G., Ng, P. C., Feuk, L., Halpern, A. L., Walenz, B. P., et al. (2007). The diploid

genome sequence of an individual human. PLoS biology, 5(10), e254.

Li, H., & Durbin, R. (2009). Fast and accurate short read alignment with Burrows–Wheeler

transform. Bioinformatics, 25(14), 1754-1760.

Li, H., & Durbin, R. (2012). Inference of human population history from whole genome

sequence of a single individual. Nature, 475(7357), 493.

Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., et al. (2009). The sequence

alignment/map format and SAMtools. Bioinformatics, 25(16), 2078-2079.

Li, J. Z., Absher, D. M., Tang, H., Southwick, A. M., Casto, A. M., Ramachandran, S., et al.

(2008). Worldwide human relationships inferred from genome-wide patterns of variation.

Science, 319(5866), 1100-1104.

Li, X., Lian, F.-M., Guo, D., Fan, L., Tang, J., Peng, J.-B., et al. (2013). The rs1142345 in TPMT

Affects the Therapeutic Effect of Traditional Hypoglycemic Herbs in Prediabetes.

Evidence-Based Complementary and Alternative Medicine, 2013.

Lindholm, C. (2004). Swat Pathan. In Encyclopedia of Sex and Gender (pp. 833-840): Springer.

Mansoor, S., Amin, I., Hussain, M., Zafar, Y., Bull, S., Briddon, R., et al. (2001). Association of

a disease complex involving a begomovirus, DNA 1 and a distinct DNA beta with leaf

curl disease of okra in Pakistan. Plant Disease, 85(8), 922-922.

Mardis, E. R. (2008). Next-generation DNA sequencing methods. Annu. Rev. Genomics Hum.

Genet., 9, 387-402.

McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A., et al. (2010).

The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation

DNA sequencing data. Genome research, 20(9), 1297-1303.

Mehdi, S., Qamar, R., Ayub, Q., Khaliq, S., Mansoor, A., Ismail, M., et al. (1999). The Origins

of Pakistani Populations. In Genomic Diversity (pp. 83-90): Springer.

Metspalu, M., Kivisild, T., Metspalu, E., Parik, J., Hudjashov, G., Kaldma, K., et al. (2004).

Most of the extant mtDNA boundaries in south and southwest Asia were likely shaped

during the initial settlement of Eurasia by anatomically modern humans. BMC genetics,

5(1), 26.

Metzker, M. L. (2010). Sequencing technologies—the next generation. Nature Reviews Genetics,

11(1), 31-46.

Meyer, F. (2006). Genome Sequencing vs. Moore's Law: Cyber Challenges for the Next Decade.

CTWatch Quarterly, 2(3).

Meyer, M., Kircher, M., Gansauge, M.-T., Li, H., Racimo, F., Mallick, S., et al. (2012). A high-

coverage genome sequence from an archaic Denisovan individual. Science, 338(6104),

222-226.

Miller, A. J., Matasci, N., Schwaninger, H., Aradhya, M. K., Prins, B., Zhong, G.-Y., et al.

(2013). Vitis phylogenomics: hybridization intensities from a SNP array outperform

genotype calls. PloS one, 8(11), e78680.

Miller, C. A., Hampton, O., Coarfa, C., & Milosavljevic, A. (2011). ReadDepth: a parallel R

package for detecting copy number alterations from short sequencing reads. PloS one,

6(1), e16327.

Mirza, I., & Jenkins, R. (2004). Risk factors, prevalence, and treatment of anxiety and depressive

disorders in Pakistan: systematic review. Bmj, 328(7443), 794.

Mohyuddin, A., Ayub, Q., Qamar, R., Zerjal, T., Helgason, A., Mehdi, S. Q., et al. (2001). Y-

chromosomal STR haplotypes in Pakistani populations. Forensic science international,

118(2), 141-146.

Nanan, D. (2002). The obesity pandemic-implications for Pakistan. JPMA, 52(342).

Ng, P. C., & Henikoff, S. (2003). SIFT: Predicting amino acid changes that affect protein

function. Nucleic acids research, 31(13), 3812-3814.

Ngamphiw, C., Assawamakin, A., Xu, S., Shaw, P. J., Yang, J. O., Ghang, H., et al. (2011).

PanSNPdb: the Pan-Asian SNP genotyping database. PloS one, 6(6), e21451.

Park, P. J. (2008). Epigenetics meets next-generation sequencing. Epigenetics, 3(6), 318-321.

Patowary, A., Purkanti, R., Singh, M., Chauhan, R. K., Bhartiya, D., Dwivedi, O. P., et al.

(2012). Systematic analysis and functional annotation of variations in the genome of an

Indian individual. Human mutation, 33(7), 1133-1140.

Patwari, P., & Lee, R. T. (2008). Mechanical control of tissue morphogenesis. Circulation

research, 103(3), 234-243.

Prado-Martinez, J., Sudmant, P. H., Kidd, J. M., Li, H., Kelley, J. L., Lorente-Galdos, B., et al.

(2013). Great ape genetic diversity and population history. Nature, 499(7459), 471-475.

Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M. A., Bender, D., et al. (2007).

PLINK: a tool set for whole-genome association and population-based linkage analyses.

The American Journal of Human Genetics, 81(3), 559-575.

Pushkarev, D., Neff, N. F., & Quake, S. R. (2009). Single-molecule sequencing of an individual

human genome. Nature biotechnology, 27(9), 847-850.

Race, Ethnicity, and Genetics Working Group. (2005). The use of racial, ethnic, and ancestral categories in human

genetics research. The American Journal of Human Genetics, 77(4), 519-532.

Rakha, A., Shin, K.-J., Yoon, J. A., Kim, N. Y., Siddique, M. H., Yang, I. S., et al. (2011).

Forensic and genetic characterization of mtDNA from Pathans of Pakistan. International

journal of legal medicine, 125(6), 841-848.

Rasmussen, M., Guo, X., Wang, Y., Lohmueller, K. E., Rasmussen, S., Albrechtsen, A., et al.

(2011). An Aboriginal Australian genome reveals separate human dispersals into Asia.

Science, 334(6052), 94-98.

Redon, R., Ishikawa, S., Fitch, K. R., Feuk, L., Perry, G. H., Andrews, T. D., et al. (2006).

Global variation in copy number in the human genome. Nature, 444(7118), 444-454.

Rizvi, S., Khan, M., Kundi, A., Marsh, D., Samad, A., & Pasha, O. (2004). Status of rheumatic

heart disease in rural Pakistan. Heart, 90(4), 394-399.

Rosenberg, N. A. (2006). Standardized subsets of the HGDP‐CEPH Human Genome Diversity

Cell Line Panel, accounting for atypical and duplicated samples and pairs of close

relatives. Annals of human genetics, 70(6), 841-847.

Saitou, N., & Nei, M. (1987). The neighbor-joining method: a new method for reconstructing

phylogenetic trees. Molecular biology and evolution, 4(4), 406-425.

Salleh, M. Z., Teh, L. K., Lee, L. S., Ismet, R. I., Patowary, A., Joshi, K., et al. (2013).

Systematic pharmacogenomics analysis of a Malay whole genome: proof of concept for

personalized medicine. PloS one, 8(8), e71554.

Sankararaman, S., Mallick, S., Dannemann, M., Prüfer, K., Kelso, J., Pääbo, S., et al. (2014).

The genomic landscape of Neanderthal ancestry in present-day humans. Nature,

507(7492), 354-357.

Schork, N. J., Murray, S. S., Frazer, K. A., & Topol, E. J. (2009). Common vs. rare allele

hypotheses for complex diseases. Current opinion in genetics & development, 19(3), 212-

219.

Schwarz, U. I., Ritchie, M. D., Bradford, Y., Li, C., Dudek, S. M., Frye-Anderson, A., et al.

(2008). Genetic determinants of response to warfarin during initial anticoagulation. New

England Journal of Medicine, 358(10), 999-1008.

Sebastiani, P., Hadley, E. C., Province, M., Christensen, K., Rossi, W., Perls, T. T., et al. (2009).

A family longevity selection score: ranking sibships by their longevity, size, and

availability for study. American journal of epidemiology, kwp309.

Shah, S., Luby, S., Rahbar, M., Khan, A., & McCormick, J. (2001). Hypertension and its

determinants among adults in high mountain villages of the Northern Areas of Pakistan.

Journal of human hypertension, 15(2), 107-112.

Shera, A., Jawad, F., & Maqsood, A. (2007). Prevalence of diabetes in Pakistan. Diabetes

research and clinical practice, 76(2), 219-222.

Sherry, S. T., Ward, M.-H., Kholodov, M., Baker, J., Phan, L., Smigielski, E. M., et al. (2001).

dbSNP: the NCBI database of genetic variation. Nucleic acids research, 29(1), 308-311.

Siva, N. (2008). 1000 Genomes project. Nature biotechnology, 26(3), 256-256.

Speicher, M. R., & Carter, N. P. (2005). The new cytogenetics: blurring the boundaries with

molecular biology. Nature Reviews Genetics, 6(10), 782-792.

Spichenok, O., Budimlija, Z. M., Mitchell, A. A., Jenny, A., Kovacevic, L., Marjanovic, D., et al.

(2011). Prediction of eye and skin color in diverse populations using seven SNPs.

Forensic Science International: Genetics, 5(5), 472-478.

Streib, L. (2007). World’s fattest countries. Forbes.com. Online:

http://www.forbes.com/2007/02/07/worlds-fattest-countriesforbeslife-cx_ls_0208worldfat_5.html

[Accessed 6 March 2013].

Sulem, P., Gudbjartsson, D. F., Stacey, S. N., Helgason, A., Rafnar, T., Magnusson, K. P., et al.

(2007). Genetic determinants of hair, eye and skin pigmentation in Europeans. Nature

genetics, 39(12), 1443-1452.

Tamura, K., Peterson, D., Peterson, N., Stecher, G., Nei, M., & Kumar, S. (2011). MEGA5:

molecular evolutionary genetics analysis using maximum likelihood, evolutionary

distance, and maximum parsimony methods. Molecular biology and evolution, 28(10),

2731-2739.

Tan, J., Yang, Y., Tang, K., Sabeti, P. C., Jin, L., & Wang, S. (2013). The adaptive variant

EDARV370A is associated with straight hair in East Asians. Human genetics, 132(10),

1187-1191.

Taus-Bolstad, S. (2008). Pakistan in pictures: Lerner Books [UK].

Thorn, C. F., Klein, T. E., & Altman, R. B. (2013). PharmGKB: the pharmacogenomics

knowledge base. In Pharmacogenomics (pp. 311-320): Springer.

Veeramah, K. R., & Hammer, M. F. (2014). The impact of whole-genome sequencing on the

reconstruction of human population history. Nature Reviews Genetics, 15(3), 149-162.

Wang, J., Wang, W., Li, R., Li, Y., Tian, G., Goodman, L., et al. (2008). The diploid genome

sequence of an Asian individual. Nature, 456(7218), 60-65.

Wang, K., Li, M., & Hakonarson, H. (2010). ANNOVAR: functional annotation of genetic

variants from high-throughput sequencing data. Nucleic acids research, 38(16), e164-

e164.

Wang, Y., Xu, S., Liu, Z., Lai, C., Xie, Z., Zhao, C., et al. (2013). Meta-analysis on the

association between the TF gene rs1049296 and AD. The Canadian Journal of

Neurological Sciences, 40(05), 691-697.

Wheeler, D. A., Srinivasan, M., Egholm, M., Shen, Y., Chen, L., McGuire, A., et al. (2008). The

complete genome of an individual by massively parallel DNA sequencing. Nature,

452(7189), 872-876.

Whiting, D. R., Guariguata, L., Weil, C., & Shaw, J. (2011). IDF diabetes atlas: global estimates

of the prevalence of diabetes for 2011 and 2030. Diabetes research and clinical practice,

94(3), 311-321.

Wishart, D. S., Knox, C., Guo, A. C., Cheng, D., Shrivastava, S., Tzur, D., et al. (2008).

DrugBank: a knowledgebase for drugs, drug actions and drug targets. Nucleic acids

research, 36(suppl 1), D901-D906.

Wong, L.-P., Ong, R. T.-H., Poh, W.-T., Liu, X., Chen, P., Li, R., et al. (2013). Deep whole-

genome sequencing of 100 southeast Asian Malays. The American Journal of Human

Genetics, 92(1), 52-66.

Wood, M. (2001). In the footsteps of Alexander the Great: a journey from Greece to Asia: Univ

of California Press.

Yoshiura, K.-i., Kinoshita, A., Ishida, T., Ninokata, A., Ishikawa, T., Kaname, T., et al. (2006).

A SNP in the ABCC11 gene is the determinant of human earwax type. Nature genetics,

38(3), 324-330.

Zheng, T., Su, C., Zhao, J., Zhang, X., Zhang, T., Zhang, L., et al. (2013). Effects of CYP3A5

and CYP2D6 genetic polymorphism on the pharmacokinetics of diltiazem and its

metabolites in Chinese subjects. Die Pharmazie-An International Journal of

Pharmaceutical Sciences, 68(4), 257-260.

PUBLICATIONS

Muhammad Ilyas, Jong-Soo Kim, Jesse Cooper, Young-Ah Shin, Hak-Min Kim, Yun Sung

Cho, Seungwoo Hwang, Hyunho Kim, Jaewoo Moon, Oksung Chung, JeHoon Jun, Achal

Rastogi, Sanghoon Song, Junsu Ko, Andrea Manica, Ziaur Rahman, Tayyab Husnain and Jong

Bhak. 2015. Whole genome sequencing of an ethnic Pathan (Pakhtun) from the north-west of

Pakistan. BMC Genomics. 16:172

Muhammad Ilyas, Ziaur Rahman, Tayyab Husnain and Jong Bhak. 2015. Pharmacogenomic

Profile of a Pakistani Individual. Sci. Tech. and Dev. 33 (4): 183-187

APPENDIX-I

WEBSITES USED

1000 Genome Project    http://www.1000genomes.org
The Personal Genome Project    http://www.personalgenomes.org
Simons Genome Diversity Project    http://www.simonsfoundation.org
Korean Personal Genomes Project    http://kpgp.kr
Complete Genomics    http://www.completegenomics.com
Iranian Genome Project    http://www.irangenes.com
Human Genome Organisation    http://www.hugo-international.org
Harvard E-commons    http://ecommons.med.harvard.edu
PubMed    http://www.ncbi.nlm.nih.gov/pubmed
Omictools    http://omictools.com
RNASeqBlog    http://www.rna-seqblog.com
Seqanswers    http://seqanswers.com
SNPedia    http://www.snpedia.com
Biobase    http://www.biobase-international.com
Biocomputing Platforms Ltd    http://www.bcplatforms.com
Bioinformatics Solutions    http://www.bioinformaticssolutions.com
Gataca    http://www.gatacallc.com
Genoptix Medical Laboratory    http://www.genoptix.com
Golden Helix    http://www.goldenhelix.com
Microsoft    http://www.microsoft.com
Unipro    http://ugene.unipro.ru/
23 and me    http://www.23andme.com
Ancestry.com    http://www.ancestry.com
Personal Genome Diagnostics    http://www.personalgenome.com
Beijing Genomics (BGI)    http://www.genomics.cn
Illumina    http://www.illumina.com
InterpretOmics    http://www.interpretomics.co
Population Genomics Initiative (PAPGI)    http://www.papgi.org
Billion Genomes Project    http://billiongenome.com
Indian Genome Variation Project    http://www.igvdb.res

APPENDIX-II

INSTITUTIONAL REVIEW BOARD (IRB) APPROVAL