The NOMAD (Novel Materials Discovery) Laboratory – a European Centre of Excellence

First 40 conversion layers

Deliverable No: 1.3

Expected Delivery Date: 30/04/2018, M30
Actual Delivery Date: 15/06/2018, M32

Lead Beneficiary: Fritz Haber Institute of the Max Planck Society (MPG-FHI) [1]

Contributing Beneficiaries: MPG-MPSD, Aalto University (AALTO), King’s College London (KCL), University of Cambridge (CAM), Danmarks Tekniske Universitet (DTU), Humboldt-Universitaet zu Berlin (HUB), University of Barcelona (UB)

[1] Beneficiary MPG includes three groups active in the NOMAD Laboratory CoE: FHI, MPCDF and MPSD.


NoMaD – Project No. 676580*

*The acronym has been changed from NoMaD to NOMAD - NoMaD is used here in reference to the acronym used in the Grant Agreement.

Copyright 2016/2017/2018 by the NOMAD Consortium. The information in this document is proprietary to the NOMAD Consortium. This document contains preliminary information and is not subject to any license agreement or any other agreement with the NOMAD Consortium. This document contains only intended strategies, developments, and functionalities and is not intended to be binding upon to any particular course of business, product strategy, and/or development of the NOMAD Consortium. The NOMAD Consortium assumes no responsibility for errors or omissions in this document. Furthermore, the NOMAD Consortium does not warrant the accuracy or completeness of the information, text, graphics, links, or other items contained within this material. This document is provided without a warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability, fitness for a particular purpose, or non-infringement. The NOMAD Consortium shall have no liability for damages of any kind including without limitation direct, special, indirect, or consequential damages that may result from the use of these materials. This limitation shall not apply in cases of intent or gross negligence. The statutory liability for personal injury and defective products is not affected. In addition, the materials presented and views expressed here are the responsibility of the author(s) only. The EU Commission takes no responsibility for any use made of the information set out.

D1.3 First 40 conversion layers

Executive Summary

We developed converters (parsers) for the 40 leading simulation codes, extending support to force field simulation codes. These were (and continue to be) run on the constantly growing open access data available in the NoMaD Repository (http://nomad-repository.eu). The parsing results are used to populate the NOMAD Archive with code-independent data. This code-independent data is the fundamental data source for all other work packages (WPs) in the NOMAD Laboratory Centre of Excellence (CoE). The source code for all the developed parsers is freely available on https://gitlab.rzg.mpg.de/nomad-lab.


TABLE OF CONTENTS

1 Introduction

2 Results
2.1 Description of a Parser
2.2 Parsing library
2.3 NOMAD Meta Info changes
2.4 Parsers List
2.5 Parsing Results

3 Conclusion

Revision History

Version 1.0, submitted 15/06/2018

Original version


1 Introduction

The goal of work package 1 (WP1) is to make the numerous results of the various simulation codes used in the computational materials science community available for analysis by the wider scientific community and, in particular, by the various WPs of the NOMAD Laboratory CoE.

WP1 constitutes the core of the NOMAD Database. At a low level, it appears to be quite simple: it is a growing collection of files. What makes this collection of files special is the content and how it is organized as code-independent files. The files are organized according to the BagIt raw data archive they come from, and they have a unique identifier according to their provenance.

These code-independent files are written in JSON (human readable and easily used in conjunction with web applications) and HDF5 (efficient binary representation, indexed access). Their content is organized according to the NOMAD Meta Info, the metadata structure described at length at https://www.nomad-coe.eu/index.php?page=nomadmetainfo and in D1.1.

To create these files, the data in the input and output of all the different codes needs to be organized according to the NOMAD Meta Info structure; this is the task of the parser. As each simulation code saves its output in a different way, every code requires its own, independent parser.

WP1 developed the tools and infrastructure to help NOMAD developers write parsers, run them in parallel, and test them. However, the heterogeneity of the codes currently used by the computational materials science community meant that the parser developers needed to have an excellent scientific knowledge and understanding of the code for which the parser was being developed.

Deliverable D1.2 ‘First 30 conversion layers’ described the development of parsers for ab initio codes. Here, in D1.3, we focus on the last 10 parsers, developed for classical force field codes. Thus, this deliverable increases not only the number of simulation codes supported, but also the types of simulation codes supported. To support classical force field codes, we had to improve the Meta Info description (D1.1 and section 2.3) to better accommodate force field simulation codes, in particular improving the way we represent Molecular Dynamics (MD). In addition, force field codes often use binary and/or complex data formats, which are not easy to parse. Therefore, WP1 created support for several commonly used file formats in a Python library available in the pymolfile (http://gitlab.mpcdf.mpg.de/nomad-lab/pymolfile) repository, leveraging the VMD molfile plugins.

The goal of this deliverable is to document our support for the 40 simulation codes identified as most widely used by the modeling community, which consequently makes all the data generated with these codes and uploaded to the NOMAD Repository available for analysis with the NOMAD Analytics Toolkit and for use by the NOMAD Encyclopedia.

2 Results

2.1 Description of a Parser

All parsers have some common features, as described below.


The Python and Scala code of each parser lives in a separate git [2] repository (see Table 1 for the actual address). Using git describe (a command that builds a human readable version number for each git commit [3]), each parser always has a unique version, which is stored in the parsed and code-independent (normalized) files and can be used to check out [4] the code that will reproduce exactly that normalized file. In this way, the files can always be regenerated even if we delete old normalized files.
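To illustrate how such a version string can be read back, the following minimal Python sketch splits a `git describe` style string into its parts. It assumes the usual `<tag>-<commits since tag>-g<abbreviated hash>` layout, with an optional `-dirty` suffix. The names `DescribedVersion` and `parse_git_describe` and the example tag are hypothetical and are not taken from the NOMAD code base.

```python
import re
from typing import NamedTuple, Optional


class DescribedVersion(NamedTuple):
    tag: str               # most recent tag reachable from the commit
    commits_since: int     # number of commits made after that tag
    commit: Optional[str]  # abbreviated commit hash, None if exactly on the tag
    dirty: bool            # True if the working tree had uncommitted changes


# Assumed layout: <tag>[-<n>-g<hash>][-dirty]
_DESCRIBE_RE = re.compile(
    r"^(?P<tag>.+?)"
    r"(?:-(?P<n>\d+)-g(?P<hash>[0-9a-f]+))?"
    r"(?P<dirty>-dirty)?$"
)


def parse_git_describe(text: str) -> DescribedVersion:
    """Split a `git describe` style string into tag, distance, hash and dirty flag."""
    match = _DESCRIBE_RE.match(text.strip())
    if match is None:
        raise ValueError("not a git describe string: %r" % text)
    return DescribedVersion(
        tag=match.group("tag"),
        commits_since=int(match.group("n") or 0),
        commit=match.group("hash"),
        dirty=match.group("dirty") is not None,
    )


if __name__ == "__main__":
    # Hypothetical version string, as it might be stored in a normalized file.
    print(parse_git_describe("v1.0-14-g2414721"))
```

The abbreviated hash recovered this way is what one would pass to `git checkout` to retrieve the exact parser code that produced a given file.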

Parsers receive the name of the file to parse, and should emit a series of events (values found, Meta Info sections start and end). The use of a stream of events (which should replay the calculation run) means that these parsers are normally also suited to parsing a calculation in progress (or could be adapted to support it). This capability is not currently required by the NOMAD Laboratory CoE but might be very useful in other contexts.
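The event-stream idea can be sketched with a Python generator. This is only an illustration under stated assumptions: the two-line "toy" output format, the event tuple layout, and the function name `parse_events` are hypothetical, and the section and value names are used as illustrative labels rather than as a statement of the actual Meta Info definitions.

```python
import re
from typing import Any, Iterable, Iterator, Tuple

# An event is ("openSection", name), ("addValue", name, value) or ("closeSection", name).
Event = Tuple[Any, ...]

# Patterns for a made-up output format, used only for this sketch.
VERSION_RE = re.compile(r"^Program version:\s*(\S+)")
ENERGY_RE = re.compile(r"^Total energy\s*=\s*(-?\d+(?:\.\d+)?)")


def parse_events(lines: Iterable[str]) -> Iterator[Event]:
    """Yield events in the order in which the toy output file was written."""
    yield ("openSection", "section_run")
    for line in lines:
        match = VERSION_RE.match(line)
        if match:
            yield ("addValue", "program_version", match.group(1))
            continue
        match = ENERGY_RE.match(line)
        if match:
            # Each energy evaluation gets its own section, emitted as soon as
            # the value is read, so a still-growing file can be followed.
            yield ("openSection", "section_single_configuration_calculation")
            yield ("addValue", "energy_total", float(match.group(1)))
            yield ("closeSection", "section_single_configuration_calculation")
    yield ("closeSection", "section_run")


if __name__ == "__main__":
    sample = ["Program version: 1.2", "some unparsed line", "Total energy = -1.5"]
    for event in parse_events(sample):
        print(event)
```

Because the generator consumes lines lazily, the same function could in principle be fed from a file that is still being written, which is the property the paragraph above points to.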

We tested all the parsers with both Python 3 and Python 2, but in production we exclusively use Python 3. Python is widely used in the materials science community, and is the best choice for the parsers even if it is not the fastest option and adds overhead when integrating with the Java Virtual Machine (JVM, the runtime used by Java and Scala) that runs several of the big data tools like Flink [5]. To integrate with these tools, every parser also has a Scala [6] wrapper that can execute the corresponding Python command and that can identify the files of that code. This is used by the Tree Parser, a program that we developed to scan the directory tree of the Raw Data Archive with the goal of associating files with their corresponding simulation code (and parser). The Calculation Parser program then calls the correct parser for each file identified and generates the Parsed Files. Finally, the normalizer applies the common transformations (see Figure 1) to end up with the code-independent files that feed the NOMAD Archive and drive the whole NOMAD Laboratory CoE.
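The association step performed by the Tree Parser can be pictured as matching the beginning of each file against a per-code signature. The actual wrappers are written in Scala; the sketch below uses Python only for illustration. The code names, the signature patterns, and the helpers `identify_code` and `scan_tree` are all hypothetical and do not describe the real implementation.

```python
import os
import re
from typing import Iterator, Optional, Tuple

# Hypothetical signatures: a pattern expected near the top of a main output file.
SIGNATURES = {
    "toycode_a": re.compile(r"^\s*Welcome to ToyCode A", re.MULTILINE),
    "toycode_b": re.compile(r"^\s*ToyCode B version", re.MULTILINE),
}

HEAD_CHARS = 4096  # only the start of each file is inspected


def identify_code(head: str) -> Optional[str]:
    """Return the name of the first code whose signature matches, else None."""
    for code, pattern in SIGNATURES.items():
        if pattern.search(head):
            return code
    return None


def scan_tree(root: str) -> Iterator[Tuple[str, str]]:
    """Walk a directory tree and yield (path, code) for every recognized file."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            # errors="replace" keeps binary files from raising decode errors.
            with open(path, "r", errors="replace") as handle:
                head = handle.read(HEAD_CHARS)
            code = identify_code(head)
            if code is not None:
                yield path, code
```

The resulting (path, code) pairs correspond to the "list of files to parse" that a calculation parser would then work through, calling the parser registered for each code.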

Figure 1: Overview of how the parsing infrastructure works. This is available for all parser developers on servers of the MPCDF, along with a mailing list to help the communication between developers.

[2] A version control system (i.e. a program to keep various versions of files), see https://git-scm.com/
[3] A commit is a version that has been registered (committed) in git.
[4] Get all files exactly as registered in a given version.
[5] Apache Flink is an open source platform for distributed stream and batch data processing, https://flink.apache.org/.
[6] A programming language, http://www.scala-lang.org/

[Figure 1 diagram. Data preparation: the NOMAD Repository provides the raw data as BagIt archives (unique name, manageable size, verifiable consistency, fixed group of files) and the list of archives to parse. Parsing: Tree Parser 1 to Tree Parser N produce the list of files to parse; Calculation Parser 1 to Calculation Parser N produce the Parsed Files and the list of files to normalize (parsers: clear versioning, reproducible results, can use Docker containers or HPC clusters, or both). Normalization: the Normalizer applies common transformations and combines the results of a raw data archive into the Normalized Files. Parsed and Normalized Files use the Meta Info, in JSON and HDF5.]


2.2 Parsing library

Parsers should be optimized for the format used in the files. The best way to parse free-form textual output is quite different from the one for XML, and binary outputs are different again.

As explained in D1.2, we have developed a parsing library for loosely formatted human readable text.

To simplify the support of force field codes, and of long trajectories in general, we developed a Python library called pymolfile, available at https://gitlab.mpcdf.mpg.de/nomad-lab/pymolfile. It leverages the open source VMD molfile plugins (http://www.ks.uiuc.edu/Research/vmd/plugins/molfile/). VMD is a molecular visualization program for displaying, animating, and analyzing large biomolecular systems using 3-D graphics and built-in scripting, and is actively developed and maintained. The plugins allow the parsing of structures and trajectories in several formats. This was seen as the best way to support a wide range of formats and to simplify the parsing of force field codes.

2.3 NOMAD Meta Info changes

To accommodate long trajectories and arbitrary quantities with irregular writing frequencies (force field codes are highly scriptable), we had to slightly adjust the representation of MD-like data in the NOMAD Meta Info.
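To make the notion of irregular writing frequencies concrete, the sketch below stores each quantity together with the simulation steps at which it was actually written, instead of assuming one value per step. This is a generic illustration, not the Meta Info layout itself; the class `SparseSeries` and its methods are hypothetical.

```python
from bisect import bisect_right
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SparseSeries:
    """Values of one quantity, each paired with the MD step at which it was written."""

    steps: List[int] = field(default_factory=list)
    values: List[float] = field(default_factory=list)

    def record(self, step: int, value: float) -> None:
        """Append a value; steps must arrive in strictly increasing order."""
        if self.steps and step <= self.steps[-1]:
            raise ValueError("steps must be strictly increasing")
        self.steps.append(step)
        self.values.append(value)

    def latest_at(self, step: int) -> Optional[float]:
        """Return the most recent value written at or before `step`, if any."""
        index = bisect_right(self.steps, step)
        return self.values[index - 1] if index else None


if __name__ == "__main__":
    # Made-up numbers: energy written every 100 steps, pressure every 500 steps.
    energy, pressure = SparseSeries(), SparseSeries()
    for step in range(0, 1001, 100):
        energy.record(step, -1.0 * step)
    for step in range(0, 1001, 500):
        pressure.record(step, 1.0 + step)
    print(energy.latest_at(250), pressure.latest_at(250))
```

Keeping an explicit step index per quantity lets quantities with different output intervals coexist in one trajectory without padding.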

Figure 2 shows the main sections involved.

We are now adapting the parsers we had already developed to this new structure.

Figure 2: Main NOMAD Meta Info sections used to represent molecular dynamics and classical force field codes.


2.4 Parsers List

A parser should ideally parse all the content of the inputs and outputs of the code. Unfortunately, the output of these simulation codes often varies depending on the input parameters. Given that both the information in the output and its specific format can change, it is difficult to ensure that a parser will always extract all information contained in the input and output files. For parsers written with our library, we can detect the parts of the file that have been ignored.
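One simple way to picture the detection of ignored parts is to record which lines no pattern has matched. The sketch below is an assumption about the general idea, not a description of how the NOMAD parsing library does it; the function `find_ignored_lines` and the example patterns are hypothetical.

```python
import re
from typing import Iterable, List, Pattern, Sequence, Tuple


def find_ignored_lines(
    lines: Iterable[str], patterns: Sequence[Pattern]
) -> List[Tuple[int, str]]:
    """Return (line number, text) for every non-blank line matched by no pattern."""
    ignored = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # blank lines carry no information
        if not any(pattern.search(line) for pattern in patterns):
            ignored.append((number, line.rstrip("\n")))
    return ignored


if __name__ == "__main__":
    # Hypothetical patterns and output lines.
    patterns = [re.compile(r"Program version"), re.compile(r"Total energy")]
    output = [
        "Program version: 1.2",
        "",
        "Dipole moment = 0.3",
        "Total energy = -1.5",
    ]
    for number, text in find_ignored_lines(output, patterns):
        print(number, text)
```

A report of this kind can be reviewed by hand to decide whether an unmatched line contains a quantity worth adding to the parser.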

For this deliverable, we decided to declare a parser ready if the following conditions were met:

• The parser can parse all the parameters/information contained in the input and output files of the corresponding code.
• The important quantities are successfully parsed:
  o code name, version, geometry, total energy, forces, and
  o for electronic structure codes also: electronic-structure method, XC functional, basis set, Kohn-Sham energies and k-points.
• It passes manual inspection of the generated files.
• No obvious useful quantities are included in the ignored parts of the output.
• There is a set of relevant input/output data of that code in the NOMAD Repository.
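The "important quantities" condition lends itself to an automated check. The following sketch assumes the parsed result is available as a plain dictionary; the key names are hypothetical labels for the quantities listed above (not actual Meta Info names), and reading "basis set" and "Kohn-Sham energies" as two separate items is an assumption.

```python
from typing import Any, List, Mapping

# Hypothetical keys standing in for the quantities required of every code.
COMMON_QUANTITIES = (
    "code_name", "code_version", "geometry", "total_energy", "forces",
)

# Additional hypothetical keys required of electronic structure codes.
ELECTRONIC_STRUCTURE_QUANTITIES = (
    "electronic_structure_method", "xc_functional", "basis_set",
    "kohn_sham_energies", "k_points",
)


def missing_quantities(parsed: Mapping[str, Any], electronic_structure: bool) -> List[str]:
    """List the required quantities that are absent (or None) in a parsed result."""
    required = COMMON_QUANTITIES
    if electronic_structure:
        required = required + ELECTRONIC_STRUCTURE_QUANTITIES
    return [key for key in required if parsed.get(key) is None]


if __name__ == "__main__":
    # Made-up parsed result of a force field calculation.
    parsed = {
        "code_name": "toycode",
        "code_version": "1.0",
        "geometry": [[0.0, 0.0, 0.0]],
        "total_energy": -1.0,
        "forces": [[0.0, 0.0, 0.0]],
    }
    print(missing_quantities(parsed, electronic_structure=False))
```

Such a check covers only the quantity list; the manual inspection and the review of ignored output remain separate steps.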

Table 1 lists the parsers that satisfy these conditions. Their code can be inspected at https://gitlab.mpcdf.mpg.de/nomad-lab/<repository> (click on the Repository column in the table in the digital version).

Table 1: List of the parsers that are ready and have been used to generate the NOMAD Archive. More parsers are in development. The repository with the code of the parser is available at https://gitlab.rzg.mpg.de/nomad-lab/<repository>.

No.  Code  Code URL  Repository
1.  abinit  http://www.abinit.org/  parser-abinit
2.  AMBER  http://ambermd.org/  parser-amber
3.  asap  https://wiki.fysik.dtu.dk/ase/index.html  parser-asap
4.  ATK  http://quantumwise.com/  parser-atk
5.  BigDFT  http://bigdft.org/  parser-big-dft
6.  CASTEP  http://www.castep.org/  parser-castep
7.  CHARMM  https://www.charmm.org/  parser-charmm
8.  cp2k  http://www.cp2k.org/  parser-cp2k
9.  CPMD  http://www.cpmd.org/  parser-cpmd
10.  crystal  http://www.crystal.unito.it/index.php  parser-crystal
11.  DFTB+  http://www.dftb-plus.info/  parser-dftb-plus
12.  DL_POLY  http://www.stfc.ac.uk/SCD/research/app/44516.aspx  parser-dl-poly
13.  Elk  http://elk.sourceforge.net/  parser-elk
14.  exciting  http://exciting-code.org/  parser-exciting
15.  FHI-aims  https://aimsclub.fhi-berlin.mpg.de/  parser-fhi-aims
16.  FLEUR  http://www.flapw.de/pm/  parser-fleur
17.  FPLO  http://www.fplo.de/  parser-fplo
18.  GAMESS  http://www.msg.ameslab.gov/gamess/  parser-gamess
19.  Gaussian  http://www.gaussian.com/  parser-gaussian
20.  GPAW  https://wiki.fysik.dtu.dk/gpaw/  parser-gpaw
21.  GROMACS  http://www.gromacs.org/  parser-gromacs
22.  GROMOS  http://www.gromos.net/  parser-gromos
23.  GULP  http://nanochemistry.curtin.edu.au/gulp/  parser-gulp
24.  Libatoms/QUIP  http://www.libatoms.org/Home/LibAtomsQUIP  parser-lib-atoms
25.  Molcas  http://www.flapw.de/pm/  parser-molcas
26.  MOPAC  http://openmopac.net/  parser-mopac
27.  NAMD  http://www.ks.uiuc.edu/Research/namd/  parser-namd
28.  NWChem  http://elk.sourceforge.net/  parser-nwchem
29.  octopus  http://www.tddft.org/programs/octopus/wiki/index.php/Main_Page  parser-octopus
30.  onetep  http://www2.tcm.phy.cam.ac.uk/onetep/  parser-onetep
31.  ORCA  http://cec.mpg.de/forum/  parser-orca
32.  phonopy [7]  https://atztogo.github.io/phonopy/  parser-phonopy
33.  qBox  http://qboxcode.org/  parser-qbox
34.  Quantum Espresso  http://www.quantum-espresso.org/  parser-quantum-espresso
35.  SIESTA  http://departments.icmab.es/leem/siesta/  parser-siesta
36.  Smeagol  https://www.tcd.ie/Physics/Smeagol/index.html  parser-siesta
37.  Tinker  https://dasher.wustl.edu/tinker/  parser-tinker
38.  turbomole  http://www.turbomole.com/  parser-turbomole
39.  VASP  https://www.vasp.at/  parser-vasp
40.  WIEN2k  http://www.wien2k.at/  parser-wien2k

2.5 Parsing Results

The parsers were used to scan the open access data of the NOMAD Repository, leading to:

• parsing and normalization of input and output files of more than 50M calculations,
• total energies of more than 37M geometries,
• more than 300k different materials (different chemical elements, different compositions), and
• band structures of more than 1.9M different materials.

[7] Phonopy supports several codes to actually calculate energy and forces. Parsing the phonopy output has two parts: one specific to phonopy, the other specific to how the given code was connected to phonopy. Currently we parse and handle the code-specific part only for FHI-aims. Adding other codes should be relatively straightforward once data for them are added to the NOMAD Repository.


Figure 3 shows how much data each parser has generated. Obviously, there is a chicken-and-egg problem: without a parser, data cannot be recognized in the NOMAD Repository, but without data for testing, developing a parser is not particularly easy. For some codes (VASP, FHI-aims, exciting), we already had a large number of calculations in the NOMAD Archive when developing the parser. For other codes, we started developing parsers with a “build it and they will come” approach.

3 Conclusion

We developed converters (parsers) for the 40 leading simulation codes and used them to populate the NOMAD Archive with code-independent data. These 40 parsers were tested and are considered ready for use as their output data is reliable. The work on parsers will continue to ensure that possible bugs are fixed in a timely manner and that parsers are adapted to updates of the simulation code and its output. Furthermore, computational optimizations to the parsing library that could speed up most parsers are being undertaken. Still, the work presented here represents a robust basis not only for NOMAD, but also for the whole community, as robust parsers can simplify analysis of calculations in many contexts.

This work has been possible thanks to the hard work of all parser developers:

Wael Chibiani, Adriel Dominiguez, Andrea Droghetti, Adam Fekete, Henning Glawe, Lauri Himannen, Arvid Ihrig, Sami K. Kivisto, Franz Knuth, Ask Hjorth Larsen, Aliaksei Mazheika, Fawzi Mohamed, Micael Oliviera, Berk Onat, Carl Poelking, Massimo Riello, Lorenzo Pardini, Honghui Shang, Martina Stella, Mikkel Strange, Daria Tomecka, Rosenndo Valero, Sebastián Alarcón Villaseca.

Figure 3: Amount of energies (single point calculations) and separate runs available in the NOMAD Archive on April 15, 2018 for each simulation code.