Algorithmically Optimized Gene Selection for Targeted Clinical Sequencing Panels
Emily Williams, Yuan Tian, Yun Zhu, Carol Munroe, John Bucci, Yutao Fu, Fiona Hyland, and Corina Shtir, Clinical Next-Gen Sequencing Division, Thermo Fisher Scientific Inc., 5781 Van Allen Way, Carlsbad, CA, U.S.A, 92008.
RESULTS
• The Disease Association Database organizes diseases into a hierarchical structure for
efficient lookup, using the disease parent-child relationships established by the NIH’s
Unified Medical Language System (UMLS) [2].
• For any disease in the hierarchical tree, the gene scoring algorithm computes scores that
summarize the strength of each gene’s association with all of that disease’s child diseases.
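The roll-up described in the second bullet can be sketched as follows. This is a minimal illustration, not the IGC implementation: the parent disease "mental disorders", the gene names, the scores, and the choice of taking the maximum score across descendants are all assumptions, since the poster does not state the aggregation rule.

```python
from collections import defaultdict

# Hypothetical parent -> children map (a stand-in for UMLS parent-child relations).
CHILDREN = {
    "mental disorders": ["schizophrenia", "bipolar disorder"],
    "schizophrenia": [],
    "bipolar disorder": [],
}

# Hypothetical gene-disease association scores in [0, 1].
SCORES = {
    "schizophrenia": {"GENE_A": 0.9, "GENE_B": 0.4},
    "bipolar disorder": {"GENE_A": 0.6, "GENE_C": 0.5},
}


def descendants(disease, children=CHILDREN):
    """Return the disease itself plus every disease below it in the tree."""
    found, stack = [], [disease]
    while stack:
        node = stack.pop()
        found.append(node)
        stack.extend(children.get(node, []))
    return found


def rollup_scores(disease, scores=SCORES, children=CHILDREN):
    """Summarize each gene's score over a disease and all of its descendants.

    Assumption: the summary is the maximum score seen for the gene;
    a sum or a mean would be equally plausible choices.
    """
    summary = defaultdict(float)
    for d in descendants(disease, children):
        for gene, score in scores.get(d, {}).items():
            summary[gene] = max(summary[gene], score)
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(summary.items(), key=lambda kv: (-kv[1], kv[0]))
```

Calling `rollup_scores("mental disorders")` ranks genes over both child diseases, while `rollup_scores("schizophrenia")` uses only that disease's own scores, so the same function serves any level of the hierarchy.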
Table 1. Disease annotation for the 28 identified gene clusters.
INTRODUCTION
Selection of genes to include in genomic studies of disease
remains a difficult task. Current methods rely on expert opinion
or manual use of search engines. With these methods, the
process and result are neither repeatable nor scalable. To
remedy this situation, we created the Informative Genetic
Content (IGC) system, which enables the algorithmic selection
of genes for inclusion in such studies, given one or more
diseases to target.
The IGC system stands on three components: a database
associating diseases with genes and other diseases, an
algorithm to rank the genes under consideration for inclusion in
a panel, and a module that clusters genes by families of
diseases. The first component, the database, maps diseases
to associated genes and scores each of these mappings
according to the strength of the relationship. The database also
maps diseases to other diseases, such that groups of diseases
or hierarchical relationships between diseases can be
identified. The second component enables the ranking of
candidate genes when multiple diseases are of interest. The
algorithm accounts for the common situation where two or
more diseases are associated with the same gene with varying
strengths of association, weighting and combining the scores
across the diseases associated with each gene. The final
component, the gene clustering module, groups genes by
pathogenic pathways, should the user want to consider
targeting a broader family of diseases affected by a closely
related set of genes.
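The second component, combining scores when several diseases share a gene, might look like the sketch below. The weighted sum is an assumption chosen because it rewards both stronger associations and associations with more of the target diseases; the poster does not give the actual formula. Gene names, scores, and weights are hypothetical.

```python
def rank_genes(disease_scores, weights=None):
    """Rank genes for a set of target diseases.

    disease_scores: {disease: {gene: association score}}
    weights: optional {disease: weight}; any disease not listed gets 1.0.

    Assumption: a gene's combined score is the weighted sum of its
    per-disease scores.
    """
    weights = weights or {}
    combined = {}
    for disease, genes in disease_scores.items():
        w = weights.get(disease, 1.0)
        for gene, score in genes.items():
            combined[gene] = combined.get(gene, 0.0) + w * score
    # Highest combined score first; ties broken alphabetically.
    return sorted(combined.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical scores for two target diseases.
example = {
    "schizophrenia": {"GENE_A": 0.5, "GENE_B": 0.25},
    "bipolar disorder": {"GENE_A": 0.25, "GENE_C": 0.5},
}
ranking = rank_genes(example, weights={"schizophrenia": 2.0})
```

Here GENE_A ranks first because it is associated with both diseases, and doubling the weight on schizophrenia lifts GENE_B to tie with GENE_C.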
We validated the IGC system through comparisons of our
automated gene selections with expertly curated gene panel
designs. We found a high degree of overlap between the IGC’s
gene selection and the gene lists chosen by experts,
supporting the viability of our system.
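One way to quantify such overlap is sketched below. The poster does not say which measure was used, so the three metrics here (recall against the curated list, precision of the automated list, and Jaccard index) are assumptions, and the gene names in the usage note are placeholders.

```python
def panel_overlap(selected, curated):
    """Compare an automated gene list with an expert-curated one."""
    selected, curated = set(selected), set(curated)
    shared = selected & curated
    union = selected | curated
    return {
        "shared": sorted(shared),
        # Fraction of the curated panel recovered by the automated selection.
        "recall": len(shared) / len(curated) if curated else 0.0,
        # Fraction of the automated selection that the experts also chose.
        "precision": len(shared) / len(selected) if selected else 0.0,
        # Size of the intersection relative to the union.
        "jaccard": len(shared) / len(union) if union else 0.0,
    }
```

For example, `panel_overlap(["A", "B", "C", "D"], ["B", "C", "E"])` reports two shared genes, with two of the three curated genes recovered.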
Together with the scalability and repeatability enabled by its
automation, the IGC system greatly improves the gene panel
selection process and therefore advances targeted genomic
studies.
CONCLUSIONS
We created a comprehensive, efficient, and informative engine, the IGC, to optimize
gene selection given diseases at any level of the disease ontology hierarchy:
• The Disease Association Database organizes diseases into an effective
hierarchical structure, and associates diseases with genes.
• The gene scoring algorithm ranks genes by disease relevance, and
summarizes the scores for diseases at any level of the hierarchy.
• The Virtual Panel Library efficiently groups genes into clusters by major
disease category, and further ranks the genes within clusters by their
relative importance to each category’s diseases.
* For Research Use only. Not for use in diagnostic procedures.
REFERENCES
1. Pinero J, Queralt-Rosinach N, Bravo A, et al. (2015) DisGeNET: a discovery platform for the dynamical exploration of human diseases and their genes. Database 2015:bav028.
2. Bodenreider O (2004) The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res 32(Database issue):D267-70.
3. Langfelder P, Horvath S (2008) WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 9:559.
TRADEMARKS/LICENSING
© 2016 Thermo Fisher Scientific Inc. All rights reserved. All trademarks are the property of Thermo
Fisher Scientific or its subsidiaries unless otherwise specified.
Thermo Fisher Scientific • 5781 Van Allen Way • Carlsbad, CA 92008 • thermofisher.com
Figure 1. Overview of IGC - Database and algorithms for identifying and
ranking gene-disease associations
Figure 2. Disease Association Database maps genes to diseases
The database establishes gene-disease relationships based on DisGeNET [1], which
scores gene-disease associations according to expert-curated sources (e.g. CTD,
CLINVAR, and ORPHANET), predicted data using mouse models, and text-mining of
publications. Blue circles: two neurological diseases, schizophrenia and bipolar
disorder. Green circles: genes associated with these two diseases.
Figure 3. Gene Scoring Algorithm
Figure 4. Gene prioritization in disease hierarchy
Figure 5. Gene clustering identified 28 Virtual Panel Libraries associated with major disease categories.
Disease Key
MeSH Category  Description
C04            Neoplasms
C05            Musculoskeletal Diseases
C06            Digestive System Diseases
C07            Stomatognathic Diseases
C08            Respiratory Tract Diseases
C09            Otorhinolaryngologic Diseases
C10            Nervous System Diseases
C11            Eye Diseases
C12            Male Urogenital Diseases
C13            Female Urogenital Diseases and Pregnancy Complications
C14            Cardiovascular Diseases
C15            Hemic and Lymphatic Diseases
C16            Congenital, Hereditary, and Neonatal Diseases and Abnormalities
C17            Skin and Connective Tissue Diseases
C18            Nutritional and Metabolic Diseases
C19            Endocrine System Diseases
C20            Immune System Diseases
The ranking score uses an unbiased gene
scoring method that accounts for both the
strength and number of gene-disease
pairs.
From the top 5,000 disease-relevant genes according to the gene scoring algorithm, 28
gene clusters were identified using the WGCNA algorithm [3]. A) Hierarchical clustering of genes
according to their association patterns with 16 high-level MeSH categories relevant to inherited
diseases. B) Gene cluster association scores with the 16 MeSH disease categories, shown
with p-values.
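The idea of grouping genes by their association patterns across disease categories can be illustrated with the sketch below. It is a simplified stand-in for WGCNA, which is an R package: the sketch keeps only the soft-thresholded correlation |r|^beta and plain average-linkage clustering, and omits WGCNA's topological overlap and dynamic tree cutting. The gene names, the profiles, and the values of `beta` and `cutoff` are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length score profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def cluster_genes(profiles, beta=6, cutoff=0.5):
    """Average-linkage clustering on the distance 1 - |r|**beta.

    profiles: {gene: [association score per disease category]}
    beta, cutoff: hypothetical tuning values, not taken from the poster.
    """
    genes = sorted(profiles)
    dist = {
        frozenset(pair): 1 - abs(pearson(profiles[pair[0]], profiles[pair[1]])) ** beta
        for pair in combinations(genes, 2)
    }
    clusters = [[g] for g in genes]

    def linkage(a, b):
        # Mean pairwise distance between the members of two clusters.
        return sum(dist[frozenset((i, j))] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        (i, j), d = min(
            (((i, j), linkage(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if d > cutoff:
            break  # remaining clusters are too dissimilar to merge
        clusters[i] = sorted(clusters[i] + clusters[j])
        del clusters[j]
    return sorted(clusters)


# Hypothetical profiles over three disease categories.
profiles = {
    "GENE_A": [0.9, 0.1, 0.0],
    "GENE_B": [0.8, 0.2, 0.1],
    "GENE_C": [0.0, 0.1, 0.9],
    "GENE_D": [0.1, 0.0, 0.7],
}
clusters = cluster_genes(profiles)
```

With these toy profiles, the two genes that score highly in the first category fall into one cluster and the two that score highly in the third category fall into another. Because the distance uses |r|, strongly anti-correlated profiles would also be grouped together, which is a property of the unsigned adjacency assumed here.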
Module #  Module Color   Gene Count  Disease Annotation
1         turquoise      530         Nervous System Diseases
2         blue           321         Nutritional and Metabolic Diseases
3         brown          307         Cardiovascular Diseases
4         yellow         280         Digestive System Diseases
5         green          253         Eye Diseases
6         red            250         Skin and Connective Tissue Diseases
7         black          229         Male and Female Urogenital Diseases
8         pink           205         Musculoskeletal Diseases
9         magenta        164         Nervous System Diseases; Nutritional and Metabolic Diseases
10        purple         150         Hemic and Lymphatic Diseases
11        greenyellow    140         Musculoskeletal Diseases; Nervous System Diseases
12        tan            137         Neoplasms
13        salmon         129         Respiratory Tract Diseases
14        cyan           111         Otorhinolaryngologic Diseases; Nervous System Diseases
15        midnightblue   90          Male Urogenital Diseases
16        lightcyan      87          Immune System Diseases; Male Urogenital Diseases; Female Urogenital Diseases and Pregnancy Complications
17        grey60         76          Stomatognathic Diseases
18        lightgreen     69          Hemic and Lymphatic Diseases; Immune System Diseases
19        lightyellow    67          Female Urogenital Diseases and Pregnancy Complications; Endocrine System Diseases
20        royalblue      63          Female Urogenital Diseases and Pregnancy Complications
21        darkred        61          Musculoskeletal Diseases; Skin and Connective Tissue Diseases
22        darkgreen      60          Musculoskeletal Diseases; Stomatognathic Diseases
23        darkgrey       55          Female and Male Urogenital Diseases; Nutritional and Metabolic Diseases
24        darkturquoise  55          Nutritional and Metabolic Diseases; Endocrine System Diseases
25        darkorange     36          Musculoskeletal Diseases; Cardiovascular Diseases
26        orange         36          Immune System Diseases
27        white          35          Endocrine System Diseases
28        skyblue        34          Immune System Diseases; Skin and Connective Tissue Diseases