Using Existing Products And Technologies For Scientific Research
Dan Fay
Director – North America, Technical Computing
Microsoft Corporation
Can “Here And Now” Technologies Reduce Time To Insight?
Can “Business” Tools and techniques for dealing with
be used in scientific research to raise the bar and allow researchers to be scientists and not computer scientists?
The Problem For The e-Scientist
Data ingest
Managing a petabyte
Common schema
How to organize it?
How to reorganize it?
How to coexist and cooperate with others?
Data Query and Visualization tools
Support/training
Performance
Execute queries in a minute
Batch (big) query scheduling
[Diagram labels: Experiments & Instruments; Simulations; Literature; Other Archives; facts; questions; answers]
[Diagram labels: Computational Modeling; Real-world Data; Interpretation & Insight; Persistent Distributed Data; Workflow, Data Mining & Algorithms; Visual Programming; Persistent Distributed Storage; Distributed Computation; Interoperability & Legacy Support via Web Services; Live Documents; Searching & Visualization; Reputation & Influence]
The Scripps Research Institute: Peter Kuhn Lab
Research Focus:
Early detection and therapy management of cancer patients
Modulation of protein interactions for therapeutic intervention
Projects:
Cancer bioengineering partnership
Structural Proteomics of SARS
TSRI Goals
Improve Collaboration:
Complex experimental data
Within Scripps and with outside organizations
Capture more data electronically:
Images
Discussions
Structured Data
To provide project data and decisions in context – e.g. annotations on 2D and 3D objects
Leverage existing productivity applications
The Collaborative Molecular Environment
Application:
Allows the user to establish context among projects, entities, and annotations
Easily collect data from multiple sources (notes, files, URLs, Screen Clipping)
Provides for Annotation on pictures, data, and molecules
Very simple reporting (not yet implemented)
Windows Presentation Foundation:
Application container
Controls for annotating 2D and 3D images
Rapid application environment for images and 3D data
SharePoint 2007:
Supports the Application with standard Web Services
Provides the security context for project teams and external collaborators
Enables search of annotations in order to find relevant images
Provides a single repository for collaboration with internal and external (SSL) collaborators
Office 2007:
Captures metadata to describe application context (image, investigator, etc.)
External Research & Programs
C-ME And 2D Annotation
Annotating Protein Data
Data acquisition from source systems and integration
Data transformation and synthesis
Data enrichment, with business logic, hierarchical views
Data discovery via data mining
Data presentation and distribution
Data access for the masses
Integrate, Analyze, Report
Research
Comparison Of Soil Moisture
[Figure: Water Content at 5 cm; axes: Tonzi and Vaira, both 0.0 to 0.6; fitted line y = 0.4712x, R² = 0.7039]
[Figure: Water Content at 20 cm; axes: Tonzi and Vaira, both 0.0 to 0.6; fitted line y = 0.5854x, R² = 0.9163]
Thanks to Gretchen Miller – UC Berkeley & Catharine Van Ingen (MSR)
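The fitted lines on the soil-moisture plots have the form y = mx, that is, a least-squares line forced through the origin. The sketch below shows one way such a slope and an R² value can be computed in plain Python. The slides do not say which tool or which R² definition produced the published numbers, so the function names, the R² definition (1 - SS_res / SS_tot about the mean), and the sample readings are all assumptions for illustration only.

```python
def fit_through_origin(xs, ys):
    """Least-squares slope m for the model y = m * x (no intercept term)."""
    sum_xx = sum(x * x for x in xs)
    if sum_xx == 0:
        raise ValueError("need at least one non-zero x value")
    return sum(x * y for x, y in zip(xs, ys)) / sum_xx


def r_squared(xs, ys, slope):
    """One common definition: 1 - SS_res / SS_tot, with SS_tot about mean(y)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - slope * x) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Hypothetical paired water-content readings (NOT the Tonzi/Vaira data).
site_a = [0.1, 0.2, 0.3, 0.4]
site_b = [0.05, 0.1, 0.15, 0.2]

slope = fit_through_origin(site_a, site_b)
fit_quality = r_squared(site_a, site_b, slope)
```

Other tools may report a different R² for a zero-intercept fit, so values from this sketch are not guaranteed to match a spreadsheet trendline.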
[Figure: Temperature at North American Sites; x-axis: Latitude (20 to 80); y-axis: Average Temperature in °C (-10 to 30)]
Other Applications
Dynameomics
Goal: Perform MD simulations of representatives of all fold families (unique structures) to maximize sampling of fold and sequence space
Protein Folding
Protein Unfolding (more tractable)
In Protein Databank: ~17,000 structures, >35,000 domains
CATH, SCOP, Dali fold classification methods consensus → 1130 non-redundant protein folds
www.dynameomics.org
Day et al, Prot. Sci., 2003
Thanks to Valerie Daggett – Univ of Washington
Top 30 folds (fold: PDB ID, protein)
IGG-like: 1fna, fibronectin
Rossmann: 3chy, CheY
TIM barrel: 1ypi, TIM
Jelly Roll: 1sac, SAP
-plait: 1ris, S6
3-helix bundle: 1enh, engrailed homeodomain
Globin: 1a6n, myoglobin
4-helix bundle: 2a0b, phosphotransfer domain
β-grasp: 1pgb, protein G
EF-hand: 4icb, calbindin
Trypsin-like serine protease: 1qq4, α-lytic protease
Thioredoxin-like: 1ev4, GST A1-1
OB fold: 1mjc, CspA
IGG-like: 1e65, azurin
Cytochrome C: 1hrc, cytochrome C
Rossmann: 1ght, transposon resolvase
SH3 barrel: 1shg, α-spectrin SH3
FAD/NAD(P) binding domain: 1ebd, oxidoreductase
knottin: 1snb, neurotoxin BMK M8
C-type lectin: 2afp, type II antifreeze prot.
lipocalin: 1ifc, fatty acid binding prot.
trefoil: 1tld, bovine trypsin
Zn finger: 2adr, Zn finger (ADR)
snake toxin: 1ntn, cobra neurotoxin
acid protease: 1g6l, HIV-1 protease
Rossmann: 2pth, peptidyl tRNA hydrolase
GST (C-term): 1ev4, GST A1-1
IL-8 like (OB): 1bf4, Sso7d
PLP dep. transferase: 1e5f, methionine γ-lyase
Laminin-like: 1edm, coagulation factor IX
Represent ~50% of the structures in Protein Databank
Day et al, Prot. Sci., 2003
MD is the time-dependent integration of the classical equations of motion for a molecular system. Our MD methods have been qualitatively and quantitatively benchmarked against experiment for more than 50 proteins in the past 15 years.
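To make "time-dependent integration of the classical equations of motion" concrete, here is a minimal sketch of the velocity Verlet scheme, a widely used MD integrator, applied to a single particle on a harmonic spring. The slides do not say which integrator or force field the Dynameomics runs use, so the integrator choice, the function names, and the toy spring force are assumptions, not a description of the project's software.

```python
import math


def velocity_verlet(x, v, force, mass, dt, steps):
    """Integrate Newton's equation m * a = F(x) with the velocity Verlet scheme."""
    a = force(x) / mass
    trajectory = [(x, v)]
    for _ in range(steps):
        x = x + v * dt + 0.5 * a * dt * dt   # advance position
        a_new = force(x) / mass              # force at the new position
        v = v + 0.5 * (a + a_new) * dt       # advance velocity with averaged acceleration
        a = a_new
        trajectory.append((x, v))
    return trajectory


def spring_force(x, k=1.0):
    """Toy stand-in for a molecular force field: a harmonic spring, F = -k * x."""
    return -k * x


# One particle, unit mass, released at x = 1 with zero velocity; 1000 steps of 0.01.
trajectory = velocity_verlet(x=1.0, v=0.0, force=spring_force, mass=1.0, dt=0.01, steps=1000)
x_final, v_final = trajectory[-1]
energy_final = 0.5 * v_final ** 2 + 0.5 * x_final ** 2
```

For this toy system the exact answer is x(t) = cos(t), which gives a quick check that the integrator tracks the true motion and keeps the total energy close to its starting value of 0.5.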
Fibronectin, a representative from the top ranked fold (IGG like) is prepared for molecular dynamics (MD) simulation by adding hydrogens (not shown) to the PDB structure and solvating it with explicit waters (red and white in ball & stick).
Example simulation system (1fna)
Unfolding of fibronectin, a representative from the top ranked fold (IGG like), from its biologically active state (N) to a denatured, inactive state (D). During unfolding, it loses a critical hydrophobic contact in its core between a valine and a tyrosine.
[Figure labels: Native (N); Denatured (D); Starting structure; 10 nanoseconds; 21 ns]
Example unfolding simulation (1fna)
200 targets complete – 6 simulations of each
1 native, 5 unfolding
DOE INCITE Award: 3,300,000 CPU hours
On NERSC
250 GB every 48 hours
Projecting that we will have 100 TB (compressed)
Now a database is required
Dynameomics
OLAP – On-line Analytical Processing
MOLAP – Multi-dimensional OLAP
SQL Server 2005 For Their Purposes
Suite of applications:
Relational database engine
OLAP engine
Performance tools
Extraction, Transformation and Loading tools
Integrated development environment
MOLAP For Scientific Analysis
Why A Multidimensional Database Is Desired
It is efficient: most of the time, only two or three dimensions are actively in play
The multidimensionality allows the user to select properties of interest and sideline the rest
Better than SQL: eliminates the need for complicated joins
Sparsity tolerant
Faster Time to Insight
Better integration to existing Windows infrastructure
Integrated and familiar development environment
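The idea of keeping a few dimensions in play and sidelining the rest can be sketched in a few lines of plain Python. This is only an illustration of the roll-up concept and of sparse storage, not SQL Server Analysis Services. The dimension names, the measure values, and the `roll_up` helper are hypothetical; only the PDB IDs 1fna and 1enh come from the slides.

```python
from collections import defaultdict

# Hypothetical cube dimensions for simulation results.
DIMENSIONS = ("target", "simulation", "time_ns")

# Sparse cube: only cells that were actually measured are stored.
# The measure is a made-up per-frame value, not real Dynameomics data.
CELLS = {
    ("1fna", "native", 0): 1.0,
    ("1fna", "native", 10): 3.0,
    ("1fna", "unfolding", 0): 2.0,
    ("1fna", "unfolding", 10): 8.0,
    ("1enh", "native", 0): 4.0,
}


def roll_up(cells, keep):
    """Average the measure over every dimension that is not listed in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(float)
    counts = defaultdict(int)
    for coords, value in cells.items():
        key = tuple(coords[p] for p in positions)
        totals[key] += value
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


# Keep two dimensions in play and sideline time.
by_target_and_run = roll_up(CELLS, keep=("target", "simulation"))
# Keep only time in play and sideline everything else.
by_time = roll_up(CELLS, keep=("time_ns",))
```

No join is needed to change the view: the same cells are simply regrouped by a different subset of coordinates, and combinations that were never measured (for example an unfolding run of 1enh) take no storage.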
Fighting HIV With Computer Science
Nebojsa Jojic and David Heckerman - MSR
A major problem: Over 40 million infected
Drug treatments are effective but are an expensive life commitment
Vaccine needed for third world countries
Effective vaccine could eradicate disease
Methods from computer science are helping with the design of vaccine
Machine learning: Finding biological patterns that may stimulate the immune system to fight the HIV virus
Optimization methods: Compressing these patterns into a small, effective vaccine
Developed Set Of Specialist Tools
Chromatogram deconvolution
Pathway analysis/association/causal models
Clustering/Trees (phylo, haplotypes etc.)
Protein binding and folding
Sequence diversity models (epitomes)
Image analysis/classification
Evolution modeling and inference
Epitope prediction
HIV: The Diabolical Virus
The train-and-kill mechanism doesn’t work for HIV – the virus adapts through rapid mutation. As soon as the killer cells get the upper hand, the epitopes start changing
Strategy:
Find peptides or epitopes that occur commonly across a *population* of HIV viruses
Compact the known or potential immune targets into a small vaccine
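The first step of the strategy, finding short peptides shared across a population of viral sequences, can be illustrated with a simple k-mer count. This is a deliberately naive sketch and not the machine-learning or epitome models developed by the MSR group. The sequences are invented toy strings, and the function names, the peptide length `k`, and the `min_fraction` threshold are assumptions chosen for illustration.

```python
from collections import Counter


def kmers(sequence, k):
    """Return the set of distinct length-k substrings of one sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def common_kmers(population, k, min_fraction):
    """Map each k-mer to the number of sequences containing it,
    keeping only those present in at least min_fraction of the population."""
    counts = Counter()
    for sequence in population:
        counts.update(kmers(sequence, k))  # a set, so each sequence counts once
    threshold = min_fraction * len(population)
    return {kmer: n for kmer, n in counts.items() if n >= threshold}


# Toy amino-acid strings standing in for viral protein variants (hypothetical).
population = ["ACDEFGHIK", "ACDEFGHLK", "MCDEFGHIK"]

conserved = common_kmers(population, k=4, min_fraction=1.0)
frequent = common_kmers(population, k=4, min_fraction=0.6)
```

With these toy inputs, `conserved` holds the 4-mers found in every sequence, and `frequent` also admits those present in two of the three. Choosing a compact set of such fragments is the separate optimization step the slide describes.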
HPC and HIV Vaccine Design
Carl Kadie and David Heckerman
Machine Learning and Applied Statistics, Microsoft Research
Developed Software: 8 or so new research programs. Most in .NET (C# & C++/CLI), one in R, one in native C++.
Hardware: Cluster of 25 IBM eServer 326, 2 processors per machine
Cluster Software: Windows Compute Cluster Server 2003
Fusion Events
Integrated Discovery in Gene Networks
Integrate genome-scale data for discovery and prediction
Incorporate Disease, multiple organisms
Create applied systems network standard
Thanks to: Mehmet Dalkilic – Indiana University
James Costello (PhD Candidate), Rupali Patwardhan, Sumit Middha, Brian Eads, John Colbourne, Scott Beason, Junguk Hur
Andrews-Dalkilic Laboratory
Microarray Co-Expression
Arbeitman – “Life Cycle of Drosophila”
Parisi – “Incyte Drosophila LifeArray v1.0”
White – “Larval Tissues-Specific Transcripts”
Protein-Protein Interaction
FlyGRID (Fly General Repository for Interaction Dataset)
DIP (Drosophila Interaction Database by CuraGen)
MINT (Molecular Interaction Database)
BIND (Biomolecular Interaction Network Database)
Genetic Interaction
Flybase
Phenotypic Data
Flybase
Binding Site
DNase I Footprint Database and Patser3 using PWM
RNAi Screens
Harvard RNAi Screen – Norbert Perrimon
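One simple way to picture integrating the evidence types listed above is to merge per-source gene-pair lists into a single network and record which sources support each pair. The sketch below assumes undirected pairs and uses placeholder gene names and source labels. It does not reflect the actual pipeline or file formats used by the Andrews-Dalkilic Laboratory.

```python
from collections import defaultdict


def integrate(sources):
    """Merge per-source edge lists into {gene pair: set of supporting sources}."""
    evidence = defaultdict(set)
    for source_name, edges in sources.items():
        for gene_a, gene_b in edges:
            # frozenset makes (A, B) and (B, A) the same undirected edge.
            evidence[frozenset((gene_a, gene_b))].add(source_name)
    return evidence


def supported_by(evidence, min_sources):
    """Keep only edges backed by at least min_sources independent sources."""
    return {
        tuple(sorted(pair)): sorted(names)
        for pair, names in evidence.items()
        if len(names) >= min_sources
    }


# Hypothetical edge lists; gene names and pairs are placeholders.
sources = {
    "coexpression": [("geneA", "geneB"), ("geneB", "geneC")],
    "protein_interaction": [("geneB", "geneA")],
    "genetic_interaction": [("geneA", "geneB"), ("geneC", "geneD")],
}

network = integrate(sources)
high_confidence = supported_by(network, min_sources=2)
```

Requiring agreement between sources is only one possible scoring rule. Weighting each source by its reliability would be a natural refinement.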
xl-caBIG Smart Client
How to give scientists a graphical interface for accessing cancer Biomedical Informatics Grid (caBIG) data-services
http://xl-cabig-client.sourceforge.net/
PI - Katarzyna Macura, Johns Hopkins, caBIG In Vivo Imaging Workspace Subject Matter Expert
Reproducible Research Document
Broad Institute
Infusion Development
SharePoint Products And Technologies: Microsoft Office SharePoint Server 2007
Business Intelligence: Server-based Excel spreadsheets and data visualization, Report Center, BI Web Parts, KPIs/Dashboards
Content Management: Integrated document management, records management, and Web content management with policies and workflow
Business Forms: Rich and Web forms based front-ends, LOB actions, enterprise SSO
Collaboration: Docs/tasks/calendars, blogs, wikis, e-mail integration, project management “lite”, Outlook integration, offline docs/lists
Search: Enterprise scalability, contextual relevance, rich people and business data search
Portal: Enterprise Portal template, Site Directory, My Sites, social networking, privacy control
Platform Services: Workspaces, Mgmt, Security, Storage, Topology, Site Model
Excel Services Overview
Excel 2007: Design and author
Publish Spreadsheets
SharePoint platform and Excel services:
Spreadsheets stored in document libraries
Spreadsheet calculation and rendering
External data retrieval and caching
100% calculation fidelity
View and Interact (Browser):
High quality web rendering
Zero-footprint
Interactive: Set parameters, sort, filter, explore
Limit to browser access
Programmatic Access (Custom applications):
Set values, perform calculations, get updated values via web services
Retrieve full workbook file
Export/Snapshot into Excel (Excel 2007):
Open in Excel for rich exploration and analysis
Open snapshots
Development, Data, Workflow, Collaboration, Publications
.NET & Visual Studio
F#
IronPython
SQL Server
SQL Server Analysis Services
Windows Workflow
SharePoint Server 2007
Knowledge Network
Instant Messenger
ConferenceXP
Academic Live, Onfolio, etc…
Resources
Windows Compute Cluster Server
Tuesday 12-1 Baker
High-Performance Computing with Windows: http://windowshpc.net/
Data mining: www.sqlserverdatamining.com/
Develop without Borders Challenge: www.developwithoutborders.com
Technical Computing Blogs: http://blogs.msdn.com/dan_fay and http://blogs.msdn.com/eScience
© 2006 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.
The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.