Algorithmically Optimized Gene Selection for Targeted Clinical Sequencing Panels



Emily Williams, Yuan Tian, Yun Zhu, Carol Munroe, John Bucci, Yutao Fu, Fiona Hyland, and Corina Shtir, Clinical Next-Gen Sequencing Division, Thermo Fisher Scientific Inc., 5781 Van Allen Way, Carlsbad, CA, U.S.A, 92008.

RESULTS

• The Disease Association Database organizes diseases into a hierarchical structure for lookup, using the disease parent-child relationships established by the NIH’s Unified Medical Language System (UMLS) [2].
• For any disease in the hierarchical tree, the gene scoring algorithm computes scores that summarize the strength of each gene’s association with all of the disease’s child diseases.
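The hierarchical roll-up described in the bullets above can be sketched as follows. This is a minimal illustration, not the IGC implementation: the disease names, gene names, and scores are hypothetical placeholders, and both the choice to sum scores and the choice to include the queried disease's own direct associations are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical parent -> children links (in IGC these come from UMLS).
CHILDREN = {
    "disease_root": ["disease_a", "disease_b"],
    "disease_a": ["disease_a1"],
}

# Hypothetical direct gene-disease association scores.
DIRECT_SCORES = {
    "disease_a": {"GENE1": 0.5},
    "disease_a1": {"GENE1": 0.25, "GENE2": 0.75},
    "disease_b": {"GENE2": 0.125, "GENE3": 0.5},
}


def descendants(disease, children=CHILDREN):
    """Return the disease itself plus every disease below it in the tree."""
    found, seen, stack = [], set(), [disease]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        found.append(node)
        stack.extend(children.get(node, []))
    return found


def summarize_scores(disease, scores=DIRECT_SCORES):
    """Sum each gene's scores over the disease and all of its descendants.

    Summation is an assumed aggregation rule, chosen only for illustration.
    Genes are returned from highest to lowest total; ties sort by name.
    """
    totals = defaultdict(float)
    for d in descendants(disease):
        for gene, score in scores.get(d, {}).items():
            totals[gene] += score
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))


ranked = summarize_scores("disease_root")
```

Querying a node higher in the tree pulls in more child diseases, so the same function serves any level of the hierarchy.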

Table 1. Disease annotation for the 28 identified gene clusters.

INTRODUCTION

Selecting genes to include in genomic studies of disease remains a difficult task. Current methods rely on expert opinion or manual use of search engines. With these methods, neither the process nor the result is repeatable or scalable. To remedy this situation, we created the Informative Genetic Content (IGC) system, which enables the algorithmic selection of genes for inclusion in such studies, given one or more diseases to target.

The IGC system has three components: a database associating diseases with genes and with other diseases, an algorithm that ranks the genes under consideration for inclusion in a panel, and a module that clusters genes by families of diseases. The first component, the database, maps diseases to associated genes and scores each mapping according to the strength of the relationship. The database also maps diseases to other diseases, so that groups of diseases or hierarchical relationships between diseases can be identified. The second component ranks candidate genes when multiple diseases are of interest. The algorithm accounts for the common situation in which two or more diseases are associated with the same gene with varying strengths of association, by weighting and combining the scores across the diseases associated with each gene. The final component, the gene clustering module, groups genes by pathogenic pathways, should the user want to consider targeting a broader family of diseases affected by a closely related set of genes.
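As a rough illustration of how scores might be weighted and combined across several target diseases, the sketch below uses a plain weighted sum. The poster does not give the actual formula, so the weighted sum, the function name `rank_genes`, the per-disease weights, and all gene and disease names are assumptions.

```python
def rank_genes(disease_scores, disease_weights=None):
    """Rank genes by a weighted sum of their scores across target diseases.

    disease_scores:  {disease: {gene: association score}}
    disease_weights: {disease: weight}; a missing disease defaults to 1.0.
    The weighted sum is an assumed combination rule, not the IGC formula.
    """
    disease_weights = disease_weights or {}
    combined = {}
    for disease, gene_scores in disease_scores.items():
        weight = disease_weights.get(disease, 1.0)
        for gene, score in gene_scores.items():
            combined[gene] = combined.get(gene, 0.0) + weight * score
    # Highest combined score first; ties are broken alphabetically.
    return sorted(combined.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical input: GENE1 is associated with both target diseases.
targets = {
    "disease_x": {"GENE1": 0.5, "GENE2": 0.25},
    "disease_y": {"GENE1": 0.25, "GENE3": 0.5},
}
ranking = rank_genes(targets, {"disease_x": 1.0, "disease_y": 0.5})
```

With a sum, a gene gains rank both from strong individual associations and from being linked to several of the target diseases.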

We validated the IGC system by comparing our automated gene selections with expertly curated gene panel designs. We found a high degree of overlap between the IGC’s gene selection and the gene lists chosen by experts, supporting the viability of our system.
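One way to quantify the overlap between an automated selection and a curated panel is sketched below. The poster does not state which overlap measure was used; the metrics here (shared count, recall against the curated list, precision, and Jaccard index) are common choices offered as an assumption, and the gene names are placeholders.

```python
def panel_overlap(selected, curated):
    """Compare an automated gene selection with an expert-curated panel."""
    selected, curated = set(selected), set(curated)
    shared = selected & curated
    union = selected | curated
    return {
        "shared": len(shared),
        # Fraction of the curated panel recovered by the automated selection.
        "recall": len(shared) / len(curated) if curated else 0.0,
        # Fraction of the automated selection that the experts also chose.
        "precision": len(shared) / len(selected) if selected else 0.0,
        # Shared genes divided by all genes appearing in either list.
        "jaccard": len(shared) / len(union) if union else 0.0,
    }


result = panel_overlap(
    ["GENE_A", "GENE_B", "GENE_C", "GENE_D"],
    ["GENE_B", "GENE_C", "GENE_D", "GENE_E"],
)
```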

Together with the scalability and repeatability that automation provides, this agreement with expert panels indicates that the IGC system improves the gene panel selection process and can thereby advance targeted genomic studies.

CONCLUSIONS

We created a comprehensive, efficient, and informative engine, the IGC, to optimize gene selection given diseases at any level of the disease ontology hierarchy:

• The Disease Association Database organizes diseases into a hierarchical structure and associates diseases with genes.
• The gene scoring algorithm ranks genes by disease relevance and summarizes the scores for diseases at any level of the hierarchy.
• The Virtual Panel Library groups genes into clusters by major disease category and further ranks the genes within each cluster by their relative importance to that category’s diseases.

* For Research Use only. Not for use in diagnostic procedures.

REFERENCES

1. Piñero J, Queralt-Rosinach N, Bravo A, et al. (2015) DisGeNET: a discovery platform for the dynamical exploration of human diseases and their genes. Database 2015:bav028.
2. Bodenreider O (2004) The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res 32(Database issue):D267-70.
3. Langfelder P, Horvath S (2008) WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 9:559.

TRADEMARKS/LICENSING

© 2016 Thermo Fisher Scientific Inc. All rights reserved. All trademarks are the property of Thermo Fisher Scientific or its subsidiaries unless otherwise specified.

Thermo Fisher Scientific • 5781 Van Allen Way • Carlsbad, CA 92008 • thermofisher.com

Figure 1. Overview of IGC: database and algorithms for identifying and ranking gene-disease associations

Figure 2. Disease Association Database maps genes to diseases

Figure 4. Gene prioritization in disease hierarchy

The database establishes gene-disease relationships based on DisGeNET [1], which scores gene-disease associations according to expert-curated sources (e.g. CTD, ClinVar, and Orphanet), data predicted from mouse models, and text mining of publications. Blue circles: two neurological diseases, schizophrenia and bipolar disorder. Green circles: genes associated with these two diseases.
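The gene-disease map in this figure can be pictured as a table of scored (gene, disease) pairs indexed by disease. The sketch below is only an illustration of that data structure: the gene names and scores are placeholders, not DisGeNET values, and the helper names are hypothetical.

```python
# Hypothetical scored associations: (gene, disease, score).
ASSOCIATIONS = [
    ("GENE_A", "schizophrenia", 0.4),
    ("GENE_B", "schizophrenia", 0.2),
    ("GENE_B", "bipolar disorder", 0.3),
    ("GENE_C", "bipolar disorder", 0.1),
]


def genes_by_disease(rows):
    """Index the association rows as {disease: {gene: score}}."""
    index = {}
    for gene, disease, score in rows:
        index.setdefault(disease, {})[gene] = score
    return index


def shared_genes(index, disease_a, disease_b):
    """Return the genes associated with both diseases, sorted by name."""
    genes_a = set(index.get(disease_a, {}))
    genes_b = set(index.get(disease_b, {}))
    return sorted(genes_a & genes_b)


index = genes_by_disease(ASSOCIATIONS)
common = shared_genes(index, "schizophrenia", "bipolar disorder")
```

Genes returned by `shared_genes` correspond to the nodes in the figure that are linked to both disease circles.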

Figure 3. Gene Scoring Algorithm

Figure 5. Gene clustering identified 28 Virtual Panel Libraries associated with major disease categories.


Disease Key

MeSH Category  Description
C04            Neoplasms
C05            Musculoskeletal Diseases
C06            Digestive System Diseases
C07            Stomatognathic Diseases
C08            Respiratory Tract Diseases
C09            Otorhinolaryngologic Diseases
C10            Nervous System Diseases
C11            Eye Diseases
C12            Male Urogenital Diseases
C13            Female Urogenital Diseases and Pregnancy Complications
C14            Cardiovascular Diseases
C15            Hemic and Lymphatic Diseases
C16            Congenital, Hereditary, and Neonatal Diseases and Abnormalities
C17            Skin and Connective Tissue Diseases
C18            Nutritional and Metabolic Diseases
C19            Endocrine System Diseases
C20            Immune System Diseases

The ranking score uses an unbiased gene scoring method that accounts for both the strength and number of gene-disease pairs.

From the top 5,000 disease-relevant genes according to the gene scoring algorithm, 28 gene clusters were identified using the WGCNA algorithm [3]. A) Hierarchical clustering of genes according to their association patterns with 16 high-level MeSH categories relevant to inherited diseases. B) Gene cluster association scores with the 16 MeSH disease categories are shown with p-values.
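The idea of grouping genes by the similarity of their disease-category association profiles can be sketched as below. This is not WGCNA, which is an R package that also computes topological overlap and cuts a hierarchical tree. The sketch borrows only the unsigned soft-threshold adjacency, |correlation| raised to a power, and then merges genes whose adjacency exceeds a cutoff. The profiles, the power `beta`, and the `cutoff` are hypothetical values chosen for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def cluster_genes(profiles, beta=6, cutoff=0.5):
    """Group genes linked by soft-thresholded correlation >= cutoff.

    profiles: {gene: [association score per disease category]}
    Linked genes are merged with a union-find structure (single linkage).
    """
    genes = sorted(profiles)
    parent = {g: g for g in genes}

    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]  # path compression
            g = parent[g]
        return g

    for i, a in enumerate(genes):
        for b in genes[i + 1:]:
            adjacency = abs(pearson(profiles[a], profiles[b])) ** beta
            if adjacency >= cutoff:
                parent[find(a)] = find(b)

    clusters = {}
    for g in genes:
        clusters.setdefault(find(g), []).append(g)
    return sorted(clusters.values())


# Hypothetical profiles over three disease categories.
profiles = {
    "GENE1": [0.9, 0.1, 0.0],
    "GENE2": [0.8, 0.2, 0.1],
    "GENE3": [0.0, 0.1, 0.9],
    "GENE4": [0.1, 0.0, 0.8],
}
clusters = cluster_genes(profiles)
```

In this toy input, the two genes weighted toward the first category end up in one cluster and the two weighted toward the third category in another.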

Module #  Module Color   Gene Count  Disease Annotation
1         turquoise      530         Nervous System Diseases
2         blue           321         Nutritional and Metabolic Diseases
3         brown          307         Cardiovascular Diseases
4         yellow         280         Digestive System Diseases
5         green          253         Eye Diseases
6         red            250         Skin and Connective Tissue Diseases
7         black          229         Male and Female Urogenital Diseases
8         pink           205         Musculoskeletal Diseases
9         magenta        164         Nervous System Diseases; Nutritional and Metabolic Diseases
10        purple         150         Hemic and Lymphatic Diseases
11        greenyellow    140         Musculoskeletal Diseases; Nervous System Diseases
12        tan            137         Neoplasms
13        salmon         129         Respiratory Tract Diseases
14        cyan           111         Otorhinolaryngologic Diseases; Nervous System Diseases
15        midnightblue   90          Male Urogenital Diseases
16        lightcyan      87          Immune System Diseases; Male Urogenital Diseases; Female Urogenital Diseases and Pregnancy Complications
17        grey60         76          Stomatognathic Diseases
18        lightgreen     69          Hemic and Lymphatic Diseases; Immune System Diseases
19        lightyellow    67          Female Urogenital Diseases and Pregnancy Complications; Endocrine System Diseases
20        royalblue      63          Female Urogenital Diseases and Pregnancy Complications
21        darkred        61          Musculoskeletal Diseases; Skin and Connective Tissue Diseases
22        darkgreen      60          Musculoskeletal Diseases; Stomatognathic Diseases
23        darkgrey       55          Female and Male Urogenital Diseases; Nutritional and Metabolic Diseases
24        darkturquoise  55          Nutritional and Metabolic Diseases; Endocrine System Diseases
25        darkorange     36          Musculoskeletal Diseases; Cardiovascular Diseases
26        orange         36          Immune System Diseases
27        white          35          Endocrine System Diseases
28        skyblue        34          Immune System Diseases; Skin and Connective Tissue Diseases
