io-Chem-BD, a solution for managing Big Data in Computational Chemistry
30/11/14
A solution for managing Big Data in Computational Chemistry
TSIUC'14, Universitat Autònoma de Barcelona, 2-XII-2014
Carles Bo, ICIQ - URV
Computational Chemistry
Computational Chemistry: Taking experiment to cyberspace
Nobel Prize Chemistry 2013 (1981, 1998)
NOBEL PRIZE IN CHEMISTRY 2013: POPULAR SCIENCE BACKGROUND
Taking the experiment to cyberspace
Chemical reactions occur at lightning speed; electrons jump between atoms hidden from the prying eyes of scientists. The Nobel Laureates in Chemistry 2013 have made it possible to map the mysterious ways of chemistry by using computers. Detailed knowledge of chemical processes makes it possible to optimize catalysts, drugs and solar cells.
Chemists all over the world devise and carry out experiments on their computers on a daily basis. With the help of the methods that Martin Karplus, Michael Levitt and Arieh Warshel began to develop in the 1970s, they examined every tiny little step in complex chemical processes that are invisible to the naked eye.
In order for you, the reader, to get an idea of how mankind can benefit from this, we begin with an example. Put your lab coat on, because we have a challenge for you: to create artificial photosynthesis. The chemical reaction occurring in green leaves fills the atmosphere with oxygen and is one prerequisite for life on Earth. But it is also interesting from an environmental perspective. If you can mimic photosynthesis you will be able to create more efficient solar cells. When water molecules are split, oxygen is created, but also hydrogen that could be used to power our vehicles. So there is ample reason for you to get engaged in this project. If you succeed, you could contribute to solving the problem of the greenhouse effect.
Nobel Prize® is a registered trademark of the Nobel Foundation.
Figure 1. Today chemists experiment just as much on their computers as they do in their labs. Theoretical results from computers are confirmed by real experiments that yield new clues to how the world of atoms works. Theory and practice cross-fertilize each other.
Permanent storage. Certify results. Re-use results.
Our Big Data Problem (1)
Help researchers in their daily tasks (manage & store results, apps & tools)
Our Big Data Problem (2)
Manage files of former group members
Our Big Data Problem (3)
Supporting Information files
Certify results - Reuse results
Yes, Comp Chem is a Big Data Problem
5 ★ Open Data (Tim Berners-Lee)
OL: Open license; OF: Open format; LD: Linked; RE: Readable data; URI: Accessible
Present
Scientists
Submit jobs
Data collection: manual
Reports (pdf files): manual
HPC
Files: terabytes, >95% waste
Publishers
Files
Public
Information
Future
Scientists
Submit jobs: workflows
Data collection: automated
Reports (XML): automated
Cloud HPC, HPC on demand
Results databases (XML)
Publishers
Information
Public
Files
Information
ioChem-BD
Scientists
Submit jobs
Data collection: manual
Reports (XML): automated
HPC
HPC
Results databases (XML)
Publishers
Files
Public
Files
Information
5 ★ Open Data (Tim Berners-Lee)
Present
ioChem-BD
Definition: ioChem-BD is a digital repository for managing and storing Computational Chemistry files (inputs & outputs). It fills the gap between results generation and manuscript publication, and raises data to 5★ quality.
Created by the fusion of previous projects:
Goals
• Build a distributed database of computational chemistry results: reduce size and increase value.
• Set a common data standard among all quantum chemistry legacy formats (XML - CML).
• Become a daily tool in data management, search and manipulation
• Redefine workflows: store results and publishing, open data
• Be open to add future functionalities for data manipulation and analysis
ioChem-BD features
• Dynamic independent templates for data extraction and data display
• Data representation set on top of priorities (XML-CML)
• Responsive design (any device is able to render our content)
• Data easily exportable to other formats
• Secure connections
• Fully compliant with latest web standards
Performance of our new extraction library
[Figure: Conversion time vs file size, plain text to CompChem CML. x-axis: file size (kB), from 112.73 to 68,328.04; y-axis: parsing time (s), 0 to 450. Series: jumbo-converters, jumbo-saxon, jumbo-saxon with keep field. Annotated speed-ups: ≈14x and ≈4x.]
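To illustrate what a plain-text to CML conversion involves, here is a minimal Python sketch. It is not the jumbo-based extraction library benchmarked on the slide, and it does not use the ioChem-BD conversion templates. It assumes the input is a simple XYZ geometry (atom count, comment line, then one atom per line) and writes only a CML `molecule` with an `atomArray`. The function name `xyz_to_cml` and the water geometry are hypothetical.

```python
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"  # namespace used by CML documents


def xyz_to_cml(xyz_text: str, molecule_id: str = "m1") -> str:
    """Convert a plain-text XYZ geometry into a minimal CML <molecule> string."""
    lines = xyz_text.strip().splitlines()
    n_atoms = int(lines[0])             # first line: number of atoms
    atom_lines = lines[2:2 + n_atoms]   # second line is a free-text comment

    ET.register_namespace("", CML_NS)   # serialize CML as the default namespace
    molecule = ET.Element(f"{{{CML_NS}}}molecule", {"id": molecule_id})
    atom_array = ET.SubElement(molecule, f"{{{CML_NS}}}atomArray")
    for i, line in enumerate(atom_lines, start=1):
        element, x, y, z = line.split()[:4]
        ET.SubElement(
            atom_array,
            f"{{{CML_NS}}}atom",
            {"id": f"a{i}", "elementType": element, "x3": x, "y3": y, "z3": z},
        )
    return ET.tostring(molecule, encoding="unicode")


# Hypothetical input: an approximate water geometry in XYZ format.
water_xyz = """3
water
O 0.000 0.000 0.117
H 0.000 0.757 -0.469
H 0.000 -0.757 -0.469
"""

cml = xyz_to_cml(water_xyz)
print(cml)
```

Real quantum chemistry outputs hold far more than a geometry (energies, basis sets, convergence data), which is presumably why conversion time grows with file size in the benchmark above.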
[Diagram: ioChem-BD architecture. User files (input/output) are uploaded, converted with conversion templates, and stored. User interfaces: web and shell. Modules: Create and Browse. Functions: search, manage, publish, share, convert.]
Workflow steps (1): Create
Result files are uploaded from the user's disk space
- Create shell client
- Create web interface
- Certify results (True Data)
- Validation (Convergence WF, Geometries)
Create: Shell client
Basic commands
Command          Description
start-rep-shell  Connect to repository (mandatory)
exit-rep         Disconnect from repository
lspro            List current path contents
pwdpro           Print current path

Project related commands
Command          Description
catpro           Display project information
cdpro            Change to project
cpro             Create a new project
mpro             Modify a project
dpro             Delete a project
findpro          Find project by its name (regex allowed)

Calculation related commands
Command          Description
loadcalc         Load calculation into repository
viewcalc         View calculation information
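The table above notes that `findpro` accepts a regular expression. The Python sketch below shows the kind of name matching this implies. It is not the shell client's implementation. The helper `find_projects`, the project names, and the choice of substring-style matching (`re.search`) are all assumptions made for illustration.

```python
import re


def find_projects(pattern: str, project_names: list[str]) -> list[str]:
    """Return the project names that match a regular expression."""
    regex = re.compile(pattern)
    return [name for name in project_names if regex.search(name)]


# Hypothetical project names, for illustration only.
projects = ["pd-catalysis-2013", "pom-clusters", "pd-suzuki-2014", "co2-reduction"]

# Projects whose name starts with "pd-" and ends with a four-digit year.
matches = find_projects(r"^pd-.*\d{4}$", projects)
print(matches)
```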
Create: Web interface
Workflow steps (2): Create
The Create module manages results and facilitates advanced data treatment
• Manage
  – Post-processing
  – Organize project collections
  – Enrich data: description, keywords, additional files
  – Reports: generate Sup. Info. files (pdf) for publishing
  – Reaction energy paths
  – Consistency (level of theory)
  – Thermodynamic corrections
  – Kinetic analysis (TOF, % e.e.)
  – Molecular descriptors (QSAR)
  – etc …
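Two of the post-processing items listed above, reaction energy paths and % e.e., reduce to short calculations once the energies are stored. The Python sketch below is illustrative only and is not the Create module's code. It assumes energies in hartree, a conversion factor of 627.509 kcal/mol per hartree, and, for the enantiomeric excess, kinetic control, so that the product ratio equals the ratio of rate constants and e.e. = tanh(ΔΔG‡ / 2RT). Function names and input values are hypothetical.

```python
import math

HARTREE_TO_KCAL_MOL = 627.509  # approximate conversion factor
R_KCAL = 1.987204e-3           # gas constant in kcal/(mol K)


def relative_profile(energies_hartree):
    """Energies along a path, in kcal/mol relative to the first structure."""
    reference = energies_hartree[0]
    return [(e - reference) * HARTREE_TO_KCAL_MOL for e in energies_hartree]


def enantiomeric_excess(ddg_kcal_mol, temperature=298.15):
    """% e.e. from the free-energy gap between the two competing transition states.

    Assumes kinetic control: k_major / k_minor = exp(ddG / RT),
    hence e.e. = tanh(ddG / 2RT).
    """
    return 100.0 * math.tanh(ddg_kcal_mol / (2.0 * R_KCAL * temperature))


# Hypothetical path: reactant, transition state, product (hartree).
path = [-100.000, -99.980, -100.010]
profile = relative_profile(path)
print([round(e, 2) for e in profile])

# Hypothetical gap of 2.0 kcal/mol between diastereomeric transition states.
ee = enantiomeric_excess(2.0)
print(round(ee, 1))
```

With these inputs the barrier is about 12.6 kcal/mol and the gap of 2.0 kcal/mol corresponds to roughly 93 % e.e. at room temperature. A consistent level of theory across all structures, as the list above requires, is what makes such differences meaningful.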
Workflow steps (3): Browse
Results can then be published and made available for viewing and downloading by the general public in the Browse module
- Handle URL generator
- Rich XML Supporting Information files
- Linked to a published manuscript
Browse: Web interface
Current project status
• Private & Demo servers up (www.iochem-bd.org)
• Supported formats:
  – Gaussian, ADF, VASP
  – Molcas (50%)
• Testing integrity (user-driven tests)
• Checking data captured & displayed
• Reports module (50%)
• To do: syndicate distributed browsers, links to external databases, …
Acknowledgements
Moises Álvarez
N. Lopez, F. Maseras, J. M. Poblet, C. De Graaf