The NOMAD (Novel Materials Discovery) Laboratory – a European Centre of Excellence

First 30 conversion layers

Deliverable No: D1.2

Lead Beneficiary: Fritz Haber Institute of the Max Planck Society (FHI-MPG)

Contributing Beneficiaries: Max Planck Institute for the Structure and Dynamics of Matter (MPSD-MPG), Aalto University (AALTO), King’s College London (KCL), University of Cambridge (CAM), Danmarks Tekniske Universitet (DTU), Humboldt-Universitaet zu Berlin (HUB), Universitat de Barcelona (UB), Pintail Ltd (PT)


NoMaD* – Project No. 676580

* The acronym has been changed from NoMaD to NOMAD; NoMaD is used here in reference to the acronym used in the Grant Agreement.

Copyright 2016 by the NOMAD Consortium. The information in this document is proprietary to the NOMAD Consortium. This document contains preliminary information and is not subject to any license agreement or any other agreement with the NOMAD Consortium. This document contains only intended strategies, developments, and functionalities and is not intended to be binding upon to any particular course of business, product strategy, and/or development of the NOMAD Consortium. The NOMAD Consortium assume no responsibility for errors or omissions in this document. Furthermore, the NOMAD Consortium does not warrant the accuracy or completeness of the information, text, graphics, links, or other items contained within this material. This document is provided without a warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability, fitness for a particular purpose, or non-infringement. The NOMAD Consortium shall have no liability for damages of any kind including without limitation direct, special, indirect, or consequential damages that may result from the use of these materials. This limitation shall not apply in cases of intent or gross negligence. The statutory liability for personal injury and defective products is not affected. In addition, the materials presented and views expressed here are the responsibility of the author(s) only. The EU Commission takes no responsibility for any use made of the information set out.


Executive Summary

We developed converters (parsers) for the 30 leading simulation codes. These were (and continue to be) run on the constantly growing open access data available in the NoMaD Repository (http://nomad-repository.eu). The parsing results are used to populate the NOMAD Archive with code-independent data. This code-independent data is the fundamental data source for all other work packages (WPs) in the NOMAD Laboratory Centre of Excellence (CoE). The source code for all the developed parsers is freely available on https://gitlab.rzg.mpg.de/nomad-lab.


First 30 conversion layers

1 Introduction
2 Goals
3 Results
3.1 Description of a Parser
3.2 Parsing library
3.3 Parsers List
3.4 Parsing Results
3.5 Dependencies
4 Conclusion

Revision History

Version 1.0, submitted 07/11/2016

Original version


1 Introduction

The goal of work package 1 (WP1) is to make the numerous results of the various simulation codes used in the computational materials science community available for analysis by the wider scientific community and, in particular, by the various WPs of the NOMAD Laboratory CoE.

WP1 constitutes the core of the NOMAD Database, but at a low level it appears to be quite simple: it is a growing collection of files. What makes this collection of files special is its content and how it is organized as code-independent files.

The files are organized according to the BagIt raw data archive they come from, and they have a unique identifier according to their provenance.

These code-independent files are written in JSON (human readable and easily used in conjunction with web applications) and HDF5 (efficient binary representation, indexed access). Their content is organized according to NOMAD MetaInfo, the metadata structure described at length at https://www.nomad-coe.eu/index.php?page=nomadmetainfo and in D1.1.

To create these files, the data in the input and output of all the different codes needs to be organized according to the NOMAD metadata structure; this is the task of the parser. As each simulation code saves its output in a different way, every code requires its own, independent parser.

We developed the tools and infrastructure to help NOMAD developers write parsers, run them in parallel, and test them. However, the heterogeneity of the codes currently used by the computational materials science community means that parser developers must have excellent scientific knowledge and understanding of the code for which the parser is being developed.

2 Goals

The goal of this deliverable is to have support for the 30 simulation codes identified as most widely used by the modeling community, and consequently have all the data generated with these codes available for analysis with the NOMAD Analytics Toolkit and for use by the NOMAD Encyclopedia.

3 Results

3.1 Description of a Parser

All parsers have some common features, as described below.

The python and scala code of each parser lives in a separate git¹ repository (see Table 1 for the actual address). Using git describe (a command that builds a human readable version number for each git commit²) each parser always has a unique version, which is stored in the parsed and normalized files and can be used to check out³ the code that will reproduce exactly that normalized file. In this way, the files can always be regenerated even if we delete old normalized files.
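The versioning idea can be illustrated with a minimal sketch, assuming the parser is run from inside its own git checkout. The function names, the `parser_info` key, and the `total_energy` key are hypothetical and are not taken from the NOMAD code base; only the `git describe` command itself comes from the text.

```python
import json
import subprocess


def parser_version(repo_dir):
    """Return the `git describe` string for the checkout in repo_dir."""
    # --always falls back to the abbreviated commit hash when no tag exists;
    # --dirty appends "-dirty" if the working tree has uncommitted changes.
    result = subprocess.run(
        ["git", "describe", "--tags", "--always", "--dirty"],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def stamp(parsed_data, parser_name, version):
    """Return a copy of parsed_data that records which parser version made it."""
    stamped = dict(parsed_data)
    # "parser_info" is a hypothetical key, not a NOMAD MetaInfo name.
    stamped["parser_info"] = {"name": parser_name, "version": version}
    return stamped


if __name__ == "__main__":
    # Only works when executed inside a git repository.
    record = stamp({"total_energy": -1.0}, "parser-example", parser_version("."))
    print(json.dumps(record, indent=2))
```

Given the stored version string, the matching code can be retrieved later with an ordinary `git checkout` of that version, which is what makes regeneration of deleted files possible.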

1 A version control system (i.e. a program to keep various versions of files), see https://git-scm.com/
2 A commit is a version that has been registered (committed) in git.
3 Get all files exactly as registered in a given version.

Parsers receive the name of the file to parse, and should emit a series of events (values found, metadata sections start and end). The use of a stream of events (which should replay the calculation run) means that these parsers are normally also suited to parse a calculation in progress (or could be adapted to support it). This capability is not currently required by the NOMAD Laboratory CoE but might be very useful in other contexts.
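The event-stream idea can be sketched as follows. This is only an illustration under stated assumptions: the toy output format (`BEGIN`/`END` lines and `name = value` lines) and the event names are invented for the example and do not correspond to any real simulation code or to the actual NOMAD event interface.

```python
from typing import Iterable, Iterator, Tuple

Event = Tuple[str, str, object]


def parse_events(lines: Iterable[str]) -> Iterator[Event]:
    """Yield (kind, name, value) events for a made-up toy output format."""
    for raw in lines:
        line = raw.strip()
        if line.startswith("BEGIN "):
            yield ("open_section", line[6:].lower(), None)
        elif line.startswith("END "):
            yield ("close_section", line[4:].lower(), None)
        elif "=" in line:
            name, _, value = line.partition("=")
            yield ("value", name.strip(), float(value))
        # Any other line is skipped in this sketch.


sample = ["BEGIN RUN", "energy = -1.5", "END RUN"]
events = list(parse_events(sample))
```

Because the generator consumes its input one line at a time and emits events as it goes, the same function can be fed an open file that is still being written, which is the property that makes this style suitable for calculations in progress.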

We tested all the parsers with both python 3 and python 2, but in production we exclusively use python 3. Python is widely used in the community and is the best choice for the parsers, even if it is not the fastest option and adds overhead when integrating with the Java Virtual Machine (JVM, the runtime used by java and scala) that runs several of the big data tools like Flink⁴. To integrate with these tools, every parser also has a scala⁵ wrapper that can execute the corresponding python command and that can identify the files of that code. This is used by the Tree Parser, a program that we developed to scan the directory tree of the Raw Data Archive with the goal of associating files with their corresponding simulation code (and parser). The Calculation Parser program then calls the correct parser for each file identified and generates the Parsed Files. Finally, the normalizer applies the common transformations (see Figure 1) to end up with the code-independent files that feed the NOMAD Archive and drive the whole NOMAD Laboratory CoE.
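How a directory scan might associate files with parsers can be sketched in a few lines. This is an illustrative assumption, not the Tree Parser itself (which, per the text, relies on the scala wrappers): the parser names, the signature patterns, and the idea of inspecting only the first few kilobytes of each file are all hypothetical.

```python
import os
import re

# Hypothetical signatures: regular expressions expected near the top of a
# code's main output file. Real parsers use their own, more careful patterns.
SIGNATURES = {
    "parser-code-a": re.compile(r"^\s*Program CODE-A", re.MULTILINE),
    "parser-code-b": re.compile(r"<code-b-output"),
}


def identify(header_text):
    """Return the name of the first parser whose signature matches, or None."""
    for parser_name, pattern in SIGNATURES.items():
        if pattern.search(header_text):
            return parser_name
    return None


def scan_tree(root, header_bytes=4096):
    """Walk root and yield (path, parser_name) for every recognized file."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in sorted(filenames):
            path = os.path.join(dirpath, filename)
            with open(path, "rb") as handle:
                header = handle.read(header_bytes).decode("utf-8", errors="replace")
            parser_name = identify(header)
            if parser_name is not None:
                yield path, parser_name
```

The resulting list of (file, parser) pairs plays the role of the "list of files to parse" that is handed to the calculation parsing step.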

Figure 1: Overview of how the parsing infrastructure works. This is available for all parser developers on servers of the MPCDF, along with a mailing list to help the communication between developers.

3.2 Parsing library

Parsers should be optimized to the format used in the files. The best way to parse a free-form textual output is quite different from the one for xml, and binary outputs are different again.

We left the parser developers free to use their preferred strategy. Still, when writing so many parsers, it did make sense to make them as simple and as general as possible. The output format changes a lot from code to code but many have a loosely formatted human readable output. Therefore, we developed a general library to make parsing of this kind of output simpler.

It uses hierarchical regular expression matchers. Together with triggers that can execute arbitrary code, this declarative definition allowed us to define efficient and flexible parsers that can also give detailed information on the regions of the files that are ignored. This is very useful to ensure that our parsers really extract the information present in the files.
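The combination of hierarchical matchers, triggers, and reporting of ignored regions can be illustrated with a deliberately small sketch. It is not the NOMAD parsing library: the `Matcher` class, the `parse` function, and the sample output lines are hypothetical, and the context handling is much simpler than what a production library needs.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Matcher:
    """A regex with sub-matchers that are only tried once it has matched."""
    pattern: str
    children: List["Matcher"] = field(default_factory=list)
    on_match: Optional[Callable[..., None]] = None  # trigger, called with the match


def first_match(candidates, line):
    """Return (matcher, match) for the first candidate that matches, else None."""
    for matcher in candidates:
        match = re.search(matcher.pattern, line)
        if match:
            return matcher, match
    return None


def parse(lines, root_matchers):
    """Apply the matcher tree line by line and return the ignored lines."""
    ignored = []
    stack = []  # matchers whose context is open, outermost first
    for number, line in enumerate(lines, start=1):
        # Innermost context first, then outwards, then the top level.
        levels = [m.children for m in reversed(stack)] + [root_matchers]
        for depth, candidates in enumerate(levels):
            hit = first_match(candidates, line)
            if hit is not None:
                matcher, match = hit
                del stack[len(stack) - depth:]  # close the deeper contexts
                stack.append(matcher)
                if matcher.on_match is not None:
                    matcher.on_match(match)
                break
        else:
            ignored.append((number, line))
    return ignored


energies = []
tree = [
    Matcher(r"^SCF cycle", children=[
        Matcher(r"total energy\s*=\s*(-?\d+\.\d+)",
                on_match=lambda m: energies.append(float(m.group(1)))),
    ]),
]
text = ["total energy = 9.9", "SCF cycle 1", "  total energy = -1.25", "  timing: 3 s"]
ignored = parse(text, tree)
```

In this example the first line is ignored because the energy matcher is only active inside an "SCF cycle" context, and the timing line is ignored because no matcher describes it. Returning the ignored lines with their line numbers is what allows a developer to check whether useful quantities are being missed.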

4 Apache Flink is an open source platform for distributed stream and batch data processing, https://flink.apache.org/
5 A programming language, http://www.scala-lang.org/

[Figure 1, diagram labels: NOMAD Repository. Raw Data (BagIt archives): unique name, manageable size, verifiable consistency, fixed group of files. Data preparation: list of archives to parse. Parsing: Tree Parser 1 … Tree Parser N, list of files to parse, Calculation Parser 1 … Calculation Parser N, Parsed Files. Parsers: clear versioning, reproducible results, can use Docker containers or HPC clusters, or both. Normalization: list of files to normalize, Normalizer (common transformations, combines results of a raw data archive), Normalized Files. Parsed and Normalized Files: use meta info, JSON & HDF5.]


3.3 Parsers List

A parser should ideally parse all the content of the inputs and outputs of the code. Unfortunately, the output of these simulation codes often varies depending on the input parameters. Given that both the information in the output as well as its specific format can change, it is difficult to ensure that a parser will always extract all information contained in the input and output files. For parsers written with our library, we can detect the part of the file that has been ignored.

For this deliverable, we decided to declare a parser ready if the following conditions were met:

• The parser can parse all the parameters/information contained in the input and output files of the corresponding code.

• The important quantities (code name, version, electronic-structure method, XC functional, basis set, geometry, total energy, forces, Kohn-Sham energies and k-points) are successfully parsed.

• It passes manual inspection of the generated files.

• No obvious useful quantities are included in the ignored parts of the output.

• There is a set of relevant input/output data of that code in the NoMaD Repository.
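The second condition lends itself to an automated check. The following is a minimal sketch, assuming the parsed result is available as a plain dictionary; the key names are hypothetical stand-ins for the quantities listed above and are not NOMAD MetaInfo names.

```python
# Hypothetical keys standing in for the important quantities listed above.
REQUIRED = (
    "code_name", "code_version", "electronic_structure_method",
    "xc_functional", "basis_set", "geometry", "total_energy",
    "forces", "kohn_sham_energies", "k_points",
)


def missing_quantities(parsed):
    """Return the required quantities that are absent or None in parsed."""
    return [name for name in REQUIRED if parsed.get(name) is None]


example = {"code_name": "code-a", "code_version": "1.0", "total_energy": -1.5}
report = missing_quantities(example)
```

Such a check only covers the presence of values; the manual inspection and the review of ignored output regions remain separate steps.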

Table 1 lists the parsers that satisfy these conditions. Their code can be inspected at https://gitlab.mpcdf.mpg.de/nomad-lab/<repository> (click on the Repository column in the table in the digital version).


Table 1: List of the parsers that are ready and have been used to generate the NOMAD Archive. More parsers are in development. The repository with the code of the parser is available at https://gitlab.rzg.mpg.de/nomad-lab/<repository>.

No. Code              Code URL                                                        Repository
1.  abinit            http://www.abinit.org/                                          parser-abinit
2.  asap              https://wiki.fysik.dtu.dk/ase/index.html                        parser-asap
3.  ATK               http://quantumwise.com/                                         parser-atk
4.  CASTEP            http://www.castep.org/                                          parser-castep
5.  cp2k              http://www.cp2k.org/                                            parser-cp2k
6.  CPMD              http://www.cpmd.org/                                            parser-cpmd
7.  crystal           http://www.crystal.unito.it/index.php                           parser-crystal
8.  DL_POLY           http://www.stfc.ac.uk/SCD/research/app/44516.aspx               parser-dl-poly
9.  exciting          http://exciting-code.org/                                       parser-exciting
10. FHI-aims          https://aimsclub.fhi-berlin.mpg.de/                             parser-fhi-aims
11. FLEUR             http://www.flapw.de/pm/                                         parser-fleur
12. GAMESS            http://www.msg.ameslab.gov/gamess/                              parser-gamess
13. Gaussian          http://www.gaussian.com/                                        parser-gaussian
14. GPAW              https://wiki.fysik.dtu.dk/gpaw/                                 parser-gpaw
15. GULP              http://nanochemistry.curtin.edu.au/gulp/                        parser-gulp
16. LAMMPS            http://lammps.sandia.gov/                                       parser-lammps
17. Molcas            http://www.flapw.de/pm/                                         parser-molcas
18. MOPAC             http://openmopac.net/                                           parser-mopac
19. NWChem            http://elk.sourceforge.net/                                     parser-nwchem
20. octopus           http://www.tddft.org/programs/octopus/wiki/index.php/Main_Page  parser-octopus
21. onetep            http://www2.tcm.phy.cam.ac.uk/onetep/                           parser-onetep
22. ORCA              http://cec.mpg.de/forum/                                        parser-orca
23. qBox              http://qboxcode.org/                                            parser-qbox
24. Quantum Espresso  http://www.quantum-espresso.org/                                parser-quantum-espresso
25. QUIP/libatoms     http://www.libatoms.org/Home/LibAtomsQUIP                       parser-lib-atoms
26. SIESTA            http://departments.icmab.es/leem/siesta/                        parser-siesta
27. Smeagol           https://www.tcd.ie/Physics/Smeagol/index.html                   parser-siesta
28. turbomole         http://www.turbomole.com/                                       parser-turbomole
29. VASP              https://www.vasp.at/                                            parser-vasp
30. WIEN2k            http://www.wien2k.at/                                           parser-wien2k

3.4 Parsing Results

The parsers were used to scan the open access data of the NoMaD Repository, leading to:

• parsing and normalization of input and output files of more than 2.6M calculations,

• total energies of more than 15M geometries,

• more than 288k different materials (different chemical elements, different compositions), and

• band structures of more than 260k different materials.

Figure 2 shows how much data each parser has generated. Obviously, there is a bit of a chicken-egg problem: without a parser, data cannot be recognized in the NoMaD Repository, but without data for testing, developing a parser is not particularly easy.

For some codes (VASP, FHI-aims, exciting), we already had a large number of calculations in the NOMAD Database when developing the parser. For other codes, we started developing parsers with a “build it and they will come” approach.


Figure 2: Statistics on the content of the NOMAD Archive on 31 October 2016


3.5 Dependencies

While not depending directly on the parsers, the encyclopedia (WP2), visualization (WP3), and data analysis (WP4) WPs depend fully on the data generated by the parsers. Thus, delivering these parsers is an important precondition for the success of the NOMAD project. The data is standardized, meaning that it is described and can be accessed in a uniform way through the NOMAD metadata, but not necessarily normalized in the ideal code-independent format. For example, the system geometry is described by three lattice vectors, and the atom positions are given in Cartesian coordinates. No special constraints are applied to these values. However, for analysis it might be convenient to have the lattice vectors chosen in a standard way: the shortest could be the first and lie along the x axis, atom positions could be folded inside the first cell, and so on. For this reason, as shown in Figure 1, parsing is followed by a normalization step that can produce derived data that is useful or more code-independent.
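One of the normalizations mentioned above, folding atom positions into the first cell, can be sketched without external libraries. This is an illustration only, not the NOMAD normalizer; the function names are hypothetical and the lattice is assumed to be given as three non-coplanar vectors in the same Cartesian units as the positions.

```python
def dot(u, v):
    """Scalar product of two 3-vectors."""
    return u[0] * v[0] + u[1] * v[1] + u[2] * v[2]


def cross(u, v):
    """Vector product of two 3-vectors."""
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]


def fold_into_cell(lattice, positions):
    """Return Cartesian positions folded into the cell spanned by lattice."""
    a1, a2, a3 = lattice
    volume = dot(a1, cross(a2, a3))
    if volume == 0:
        raise ValueError("lattice vectors are coplanar")
    # Dual vectors b_i with b_i . a_j = 1 if i == j else 0, so that
    # r . b_i is the fractional coordinate of r along a_i.
    duals = [[c / volume for c in cross(a2, a3)],
             [c / volume for c in cross(a3, a1)],
             [c / volume for c in cross(a1, a2)]]
    folded = []
    for r in positions:
        # Wrap each fractional coordinate into [0, 1), up to rounding.
        frac = [dot(r, b) % 1.0 for b in duals]
        folded.append([sum(f * a[k] for f, a in zip(frac, lattice))
                       for k in range(3)])
    return folded


cubic = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
wrapped = fold_into_cell(cubic, [[2.5, -0.5, 1.0]])
```

For the cubic cell of edge 2.0 used here, the position (2.5, -0.5, 1.0) has fractional coordinates (1.25, -0.25, 0.5), which wrap to (0.25, 0.75, 0.5). Choosing a standard orientation of the lattice vectors would be a separate step and is not shown.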

At the CECAM Workshop ‘Towards a Common Format for Computational Materials Science Data’ in Lausanne (Jan 2016), it was decided how to normalize different quantities. Only total energies are not yet normalized using the energy of reference solid systems as baseline. This requires reference calculations for each simulation code and will be done in M13 for over 80% of the data.

Essentially no wavefunctions have been stored in the data of the repository; therefore, their normalization has been postponed. In fact, it is quite possible that wavefunctions will not be uploaded in the future, and it may be faster to recalculate them than to store and retrieve them. We will observe the development of this situation over the remainder of the project.

More than 260K band structures that were evaluated along the high symmetry path defined in the publication by W. Setyawan and S. Curtarolo [High-throughput electronic band structure calculations: Challenges and tools, Comp. Mater. Sci. 49 (2010) 299] are stored as such in the NOMAD Archive, as described in D1.1. For phonons, we plan to store the hessian, and recalculate spectra, zero point energies, and free energies with phonopy⁶. Other derived quantities like elastic constants and thermal transport, which are often computed with multiple calculations, need a strategy to detect them and decide to which object they are associated. It is expected that development of the best way to store them will go hand in hand with the development of tools to analyze and visualize them.

4 Conclusion

We developed converters (parsers) for the 30 leading simulation codes and used them to populate the NOMAD Archive with code-independent data. These 30 parsers were tested and are considered ready for use as their output data is reliable. The work on the parsers will continue to ensure that possible bugs are fixed in a timely manner and that parsers are adapted to updates of the simulation code and its output. Furthermore, computational optimizations to the parsing library that could speed up most parsers are being undertaken. Still, the work presented here represents a robust basis not only for the NOMAD project, but also for the whole electronic-structure theory community, as robust parsers can simplify analysis of calculations in many contexts.

6 https://atztogo.github.io/phonopy/


This work has been possible thanks to the hard work of all parser developers:

Wael Chibiani, Adriel Dominiguez, Andrea Droghetti, Adam Fekete, Henning Glawe, Lauri Himannen, Sami K. Kivisto, Franz Knuth, Ask Hjorth Larsen, Aliaksei Mazheika, Fawzi Mohamed, Micael Oliviera, Carl Poelking, Massimo Riello, Lorenzo Pardini, Honghui Shang, Martina Stella, Mikkel Strange, Daria Tomecka, Rosenndo Valero, Sebastián Alarcón Villaseca.