SCRI, Kolker Lab
XLDB October 19, 2011
Functional Annotation of the Protein Sequence Universe
Eugene Kolker
eugene.kolker@seattlechildrens.org
gnklkr@yahoo.com
Thanks to Jacek Becla and SLAC Team!

Outline
• Life sciences challenges and 4th paradigm
• Protein Functional Annotation:
  – All vs. All BLAST: 10 mln vs. 10 mln proteins
  – Revitalizing COG db and clustering
  – Visualizing Protein Sequence Universe
• Ecosystem for Life sciences:
  – Data-Enabled Life Sciences Alliance (DELSA)

Life Sciences Challenges and 4th paradigm
Past Grand Challenges & Solutions
• Green Revolution – Norman Borlaug
• Smallpox Eradication – William Foege
• Framingham Heart Study – Thomas Dawber
  – Uber-goal, approach (data-driven, inter-disciplinary, technologies, collaborative), supportive ecosystem
Present Grand Challenges, Solutions and 4th paradigm of science – Jim Gray

One Grand Challenge – Protein Annotation
Assigning functions to protein sequences
• Technologies produce huge data influx (e.g. mega-projects: 1000 Genomes, EMP, i5K)
• 30% of proteins – unknown function
• Existing annotation databases overwhelmed
• Many no longer supported (e.g. COG, Systers, CluSTr)
• Alignment (BLAST): ~96% similar (identical)
  Jacek likes vanilla ice cream with chocolate on the top
  Jacek loves vanilla ice cream with chocolate on the top
• Function Assignment
  If the sequences are similar, their functions are similar too
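The ~96% figure in the ice cream example can be checked with a short calculation. The sketch below is only an illustration: it computes ungapped percent identity between two equal-length strings, which is far simpler than what BLAST does (BLAST scores amino acid alignments with substitution matrices and gaps). The function name `percent_identity` is hypothetical.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of positions at which two equal-length strings agree."""
    if len(a) != len(b):
        raise ValueError("this simplified sketch handles equal-length strings only")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


s1 = "Jacek likes vanilla ice cream with chocolate on the top"
s2 = "Jacek loves vanilla ice cream with chocolate on the top"

# 53 of the 55 characters match (only "ik" vs "ov" differ).
print(round(percent_identity(s1, s2), 1))  # 96.4
```

Spaces are counted as characters here; the two sentences differ in exactly two positions out of 55.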

COG – Clusters of Orthologous Groups (of proteins)
• Developed by Eugene Koonin, David Lipman and NCBI Team – over 4500 citations
• Classification of proteins into groups with common function
• Prokaryotes (COG): 66 genomes, 200K proteins, 5K clusters
• Eukaryotes (KOG): 7 genomes, 113K proteins, 5K clusters
• Last updated: 2008, last curated: 2006

Our approach
• Protein sequence databases grew from ~1 mln sequences to ~10 mln sequences (2010) and now ~14 mln sequences
• Thanks to Roger Barga and Microsoft Research Team: BLAST aligned All vs All (10 mln vs 10 mln) sequences and generated 3 billion filtered records
• Our Team (Roger Higdon): developed a robust alignment model based on length-normalized bit score (LNBS)

COGs and Protein functional groups (PFGs)

Hadoop and Clustering
• Clustering
  – Our Team: Natali Kolker and Bill Broomall
  – Implemented Hadoop/Hive to handle the data
  – Thanks to the Hadoop community
  – Use COG db as starting point; LNBS > 4.3: sensitivity & specificity ~98%
  – COGs expanded ~30-fold
  – Cluster remaining 1.8M proteins into PFGs by single linkage clustering
  – Tested results on Amazon Elastic MapReduce
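Single linkage clustering at a fixed score cutoff amounts to finding connected components: two proteins share a cluster if a chain of above-cutoff pairs connects them. The sketch below shows that idea in memory with a union-find structure. It is not the team's Hadoop/Hive implementation, which the slides do not describe. The protein IDs and scores are made up, and the 4.3 cutoff is borrowed from the LNBS threshold above purely for illustration.

```python
def single_linkage(pairs, threshold):
    """Group items into clusters; items joined by any chain of
    pairs scoring above `threshold` end up in the same cluster."""
    parent = {}

    def find(x):
        # Register unseen items as their own root, then walk to the root
        # with path halving to keep the trees shallow.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, score in pairs:
        root_a, root_b = find(a), find(b)  # registers both items
        if score > threshold and root_a != root_b:
            parent[root_a] = root_b  # merge the two clusters

    clusters = {}
    for item in list(parent):
        clusters.setdefault(find(item), set()).add(item)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical (protein_a, protein_b, score) records.
records = [
    ("P1", "P2", 5.1),
    ("P2", "P3", 4.6),
    ("P3", "P4", 3.9),  # below the cutoff: does not link P3 and P4
    ("P4", "P5", 4.8),
]
print(single_linkage(records, threshold=4.3))
# [['P1', 'P2', 'P3'], ['P4', 'P5']]
```

P1 and P3 land in the same cluster even though they were never compared directly. This chaining is the defining property of single linkage, and also its main risk when the cutoff is too permissive.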

Single linkage clustering on Hadoop
[Figure panels: COG, KOG]

Further COG development
• Working with NCBI
• Using PSI-BLAST instead of BLAST
• Evaluating the modified approach
  – Bacterial genomes
  – COG test/train data
  – Compare to all vs. all BLAST approach
• Assign new proteins to COG clusters
• Cluster unassigned proteins

Protein Sequence Universe
• Visualization and analysis framework
  – Explore the relationship between proteins
• Multidimensional Scaling
  – Project sequence data into 3D
• Analyze and visualize protein data
• Benefit: Expand universe without All vs All
  – Interpolation
• Thanks to Geoffrey Fox and Indiana University Team
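To make the multidimensional scaling step concrete, here is a minimal sketch of classical (Torgerson) MDS, which turns a matrix of pairwise distances into 3D coordinates. The slides do not say which MDS variant or software the Indiana University team used, so this is a stand-in technique, not their method. It assumes the distances are roughly Euclidean, and it finds eigenvectors by plain power iteration, which is only practical for tiny inputs. All names are hypothetical.

```python
import math
import random


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def matvec(m, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


def classical_mds(dist, dims=3, iters=500, seed=0):
    """Embed n items in `dims` dimensions from an n x n distance matrix."""
    n = len(dist)
    # Double-center the squared distances: B = -1/2 * J * D^2 * J.
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]

    rng = random.Random(seed)
    coords = [[0.0] * dims for _ in range(n)]
    for d in range(dims):
        # Power iteration for the current largest eigenpair of B.
        v = [rng.random() - 0.5 for _ in range(n)]
        for _ in range(iters):
            w = matvec(b, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # nothing left to extract
                break
            v = [x / norm for x in w]
        eigval = sum(vi * wi for vi, wi in zip(v, matvec(b, v)))
        scale = math.sqrt(max(eigval, 0.0))
        for i in range(n):
            coords[i][d] = scale * v[i]
            # Deflate B so the next pass finds the next eigenpair.
            for j in range(n):
                b[i][j] -= eigval * v[i] * v[j]
    return coords


# Toy check: distances between known 3D points are preserved by the embedding.
points = [(0, 0, 0), (4, 0, 0), (0, 3, 0), (0, 0, 2), (1, 1, 1)]
dist = [[euclidean(p, q) for q in points] for p in points]
embedding = classical_mds(dist, dims=3)
print(round(euclidean(embedding[0], embedding[1]), 6))  # 4.0
```

The recovered coordinates match the originals only up to rotation and reflection, so only the pairwise distances are compared. Placing new proteins by interpolation, the benefit noted on the slide, is not shown here.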

COG Protein Universe

Close-up of related COGs

Why do YOU care?
• This protein universe is inside and outside you!
• President/Office of Science and Technology Policy’s Request on 21st Century BIOECONOMY – to meet grand challenges in health, food, energy, and environment in lean budget times
[Diagram: DATA, Knowledge, Action]

Why DELSA Now? What is DELSA?
Data-Enabled Life Sciences Alliance
• The research necessity of the LS community to work across diverse domains and with computer, data and cyberinfrastructure experts to leverage LS opportunities
• Scientific progress and accelerated rate of LS result in a pressing need for reproducibility
• A perceived gap between the needs of data-enabled LS and current funding initiatives
• An urgent need to create a supporting ecosystem of industry, academia, federal agencies, and foundations
DELSA includes: Bio-sciences, (global) health, ecology, environment, (gen)omics, evolution, computer sciences, policies, cyberinfrastructure, clinical research, management.

DELSA: Vision and Mission
• Proposed in May 2011 at the NSF’s DISW2 workshop
• Next meeting: Nov. 13, 2011, SuperComputing SC11, Seattle
DELSA Vision: Sustainable and shared access to data, knowledge, tools, and services to find solutions for the pressing needs of our global society. DELSA will advance data-enabled life sciences by moving from “one scientist-one project” to “collective innovation”.
DELSA Mission: The mission of DELSA is to become the leading voice and coordinating framework to accelerate data-enabled research in the life sciences community.
Thanks!
Team: Beth Stewart, Bill Broomall, Carey Sheu, Courtney MacNealy-Koch, Larissa Stanberry, Natali Kolker, Olga Castaner, Roger Higdon, Winn Haynes
Collaborators: Ken Zaret, Gerald van Belle, Gordon Cohen, Michael Portman, Roger Barga
Support: NSF, NIH, McMillen Foundation, Dean Welch, Jim Hendricks, Tom Hansen

Summary
• Life sciences challenges and 4th paradigm
• Protein Functional Annotation:
  – All vs. All BLAST: 10 mln vs. 10 mln proteins
  – Revitalizing COG db and clustering
  – Visualizing Protein Sequence Universe
• Ecosystem for Life sciences:
  – Data-Enabled Life Sciences Alliance (DELSA)

Get Involved
Life scientists, engineers, computer scientists, cyberinfrastructure, data and analysis experts, companies & institutions:
• Get involved in DELSA
• Participate in building supporting ecosystem
• Attend our workshop @SC11, Nov. 13: www.delsall.org
• Contact: [email protected]
Chief Data Officer at Seattle Children’s: predictive analytics