Introduction of New Paradigms to Materials Science ... · The Novel Materials Discovery (NOMAD)...
7/5/2017
Introduction of New Paradigms to Materials Science
1st paradigm: Empirical science (experiments)
2nd paradigm: Model-based theoretical science (laws of classical mechanics, electrodynamics, thermodynamics, quantum mechanics)
3rd paradigm: Computational science (simulations): density-functional theory, molecular dynamics
4th paradigm: Big-data-driven science: machine learning, compressed sensing, relationship mining, anomaly detection
[Figure: scientific knowledge and starting times of the paradigms along a year axis (1600, 1960, 2010)]
First law of thermodynamics: ΔU = Q − W (change in internal energy = heat added to system − work done by system)
The Novel Materials Discovery (NOMAD) Laboratory maintains the largest Repository for input and output files of all important computational materials science codes.
From its open-access data it builds several Big-Data Services helping to advance materials science and engineering.
Watch a 3-minute summary on the NOMAD Laboratory CoE
NOMAD Scope and Overview
Data is the raw material of the 21st century.
Surprisingly, extreme-scale aspects of Big-Data are very much under-explored in materials science and engineering, one reason being that ‘towards exascale’ computing initiatives typically focus on standard hardware and software challenges. Clearly, much of the value of high-throughput calculations is wasted without deeper Big-Data driven analysis of the results. This is the extreme-scale computing challenge addressed by NOMAD.
https://youtu.be/yawM2ThVlGw
Currently, NOMAD holds >30 million total-energy calculations, and the amount is rapidly increasing.
Additional high-throughput calculations are needed for creating important “missing data”.
The NOMAD Scope
https://repository.nomad-coe.eu also described at youtube.com
The NOMAD Laboratory: A European Centre of Excellence
There are 30-40 important codes used in computational materials science.
Nomenclature, data representation, and file formats of the input and output files of these codes are different. The heterogeneity could hardly be worse.
https://repository.nomad-coe.eu/
https://youtu.be/UcnHGokl2Nc
The NOMAD Repository accepts and requests input and output files of all important codes. Currently, the NOMAD Repository contains >30 million total-energy calculations.
The NOMAD Archive stores, in a code-independent format, calculations performed with all the most important and widely used electronic-structure and force-field codes.
Summary statistics of the Archive content, updated June 1st, 2017:
Total-energy calculations, bulk crystals, surfaces, molecules, different geometries, chemical compositions, band structures, phonon calculations
The NOMAD Archive
90% of the VASP files are from AFLOWlib and OQMD
[Figure: number of geometries per composition (axis ticks 1, 100, 10,000) and number of compositions (axis ticks 1, 1,000, 1 Mio)]
The Big-Data Challenge
Until recently: query and read out what was stored; high-throughput screening.
Volume (amount of data),
Variety (heterogeneity of form and meaning of data),
Velocity at which data may change or new data arrive,
Veracity (uncertainty of quality).
The Big-Data Challenge
(Big) data of materials does not only provide direct information; the data is also structured.
How can we find (and understand) this structure?
The Big-Data Challenge of Computational Materials Science
Code-independent representation of the computed properties
• Unique classification and labeling of the data
• Data normalization
Generic and code-specific metadata
Veracity
Analyze the influence of xc functionals, basis sets, pseudopotentials, force-fields, various other approximations …
Define error bars and confidence levels
Metadata for Computational Material Science
Purposes of metadata
• When storing “key”-“value” pairs, metadata is the “key”. E.g., “program_name” is a metadata and “VASP” is a possible “value”.
• Describe and organize data
• Keep track of relations between data
section_run
    program_name FHI-aims
    program_version 081912
    section_system
        simulation_cell [[1.4e-9 ...]]
        atom_positions [[0.0,....]...]
        atom_labels ["Cu",...]
    section_method
        basis_set fhi_aims_tight
        XC_method DFT_GGA_PBE
    section_single_configuration_calculation
        ...
Structures and names: metadata
Values: data
SI units:
• length: m
• energy: J
• …
Nested sections (hierarchical tree structure)
References to other sections
At this point, NOMAD has defined 434 metadata for the code-independent representation and 1,736 code-specific “keys”.
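The hierarchical key-value layout can be sketched in Python as nested dictionaries. This is only an illustration, not the NOMAD implementation: the section and key names come from the slide, while the nesting of the sections under section_run, the helper names `angstrom_to_m` and `flatten`, and all numerical values are assumptions made for this example.

```python
# Illustrative sketch only: nested sections as dictionaries, lengths stored in SI units.
ANGSTROM_IN_M = 1.0e-10  # 1 Å = 1e-10 m


def angstrom_to_m(value):
    """Convert a length, or a nested list of lengths, from Å to metres."""
    if isinstance(value, list):
        return [angstrom_to_m(v) for v in value]
    return value * ANGSTROM_IN_M


def flatten(section, prefix=""):
    """Yield (path, value) pairs; the path is the chain of metadata keys."""
    for key, value in section.items():
        path = f"{prefix}/{key}" if prefix else key
        if isinstance(value, dict):
            yield from flatten(value, path)
        else:
            yield path, value


# Hypothetical cubic cell in Å, converted to metres before storing.
cell_in_angstrom = [[3.6, 0.0, 0.0], [0.0, 3.6, 0.0], [0.0, 0.0, 3.6]]

run = {
    "section_run": {
        "program_name": "FHI-aims",
        "program_version": "081912",
        "section_system": {
            "simulation_cell": angstrom_to_m(cell_in_angstrom),
            "atom_labels": ["Cu"],
        },
        "section_method": {
            "basis_set": "fhi_aims_tight",
            "XC_method": "DFT_GGA_PBE",
        },
    }
}

paths = dict(flatten(run))
print(paths["section_run/section_method/XC_method"])
```

Flattening the tree into key paths is one simple way to query a value independently of the code that produced it.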
Storing and Exchanging Data
With the NOMAD Metadata we have a conceptual model, flexible enough to be useful and support several file formats:
JSON: human readable & web ready
HDF5: indexed and optimized for multidimensional arrays
Parquet: a modern format for Big-Data storage (immutable and efficient columnar storage for hierarchical data)
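As a small illustration of the JSON option, the sketch below serializes a metadata record and reads it back using only the Python standard library. The record is hypothetical (key names borrowed from the metadata example, values invented). HDF5 and Parquet would typically require third-party libraries and are not shown.

```python
import json

# Hypothetical record; key names follow the metadata example, values are illustrative.
record = {
    "program_name": "FHI-aims",
    "XC_method": "DFT_GGA_PBE",
    "atom_labels": ["Cu"],
    "simulation_cell": [
        [3.6e-10, 0.0, 0.0],
        [0.0, 3.6e-10, 0.0],
        [0.0, 0.0, 3.6e-10],
    ],
}

text = json.dumps(record, indent=2, sort_keys=True)  # human-readable, web-ready text
restored = json.loads(text)                          # parse back into Python objects
print(text)
```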
Si Lattice Parameter Calculated with DFT-PBE
Fig. 1. Historical evolution of the predicted equilibrium lattice parameter for silicon. All calculations within the DFT-PBE framework. Values from […] (15, 16, […]) are compared with (i) predictions from the different codes used in this study (2016 data points, magnified in […] or calculations with lower numerical settings) and (ii) the […] 0 K and zero-point effects (red line).
Science 351, aad3000 (2016)
Δ-Value Project in Materials Science
Science 351, aad3000 (2016)
Just the first step: bulk crystals and PBE xc functional
An exhaustive test set (71 elemental solids)
Test methods/codes (pseudopotentials, relativistic treatment, etc.) and numerical accuracy (basis sets, k-points, etc.)
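A minimal sketch of how a Δ-value between two codes could be computed for one element, assuming the commonly used definition: the root-mean-square difference between the two energy-versus-volume curves, E(V), over a ±6% volume interval around equilibrium. The Birch-Murnaghan form of the equation of state, the function names, and the parameter values below are assumptions for illustration and are not taken from the slides.

```python
import math


def birch_murnaghan(volume, v0, b0, b0_prime):
    """Third-order Birch-Murnaghan energy relative to the minimum (E0 = 0)."""
    eta = (v0 / volume) ** (2.0 / 3.0)
    return 9.0 * v0 * b0 / 16.0 * (
        (eta - 1.0) ** 3 * b0_prime + (eta - 1.0) ** 2 * (6.0 - 4.0 * eta)
    )


def delta_value(eos_a, eos_b, n_steps=2000):
    """RMS energy difference between two equations of state over +/-6% volume."""
    v_mid = 0.5 * (eos_a[0] + eos_b[0])
    v_lo, v_hi = 0.94 * v_mid, 1.06 * v_mid
    h = (v_hi - v_lo) / n_steps
    total = 0.0
    for i in range(n_steps):  # midpoint rule for the integral of the squared difference
        v = v_lo + (i + 0.5) * h
        diff = birch_murnaghan(v, *eos_b) - birch_murnaghan(v, *eos_a)
        total += diff * diff * h
    return math.sqrt(total / (v_hi - v_lo))


# Hypothetical parameters (V0 in Å^3/atom, B0 in eV/Å^3, B0') for two codes.
code_a = (20.45, 0.553, 4.31)
code_b = (20.50, 0.550, 4.30)
delta_mev = 1000.0 * delta_value(code_a, code_b)  # eV/atom -> meV/atom
print(f"Delta = {delta_mev:.3f} meV/atom")
```

Averaging such per-element values over the whole test set gives a single number per pair of codes, which is how the table entries in meV/atom can be read.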
Comparing Solid-State DFT Codes, Potentials, and Basis Sets (K. Lejaeghere et al.)
Code              Version              Basis                     Electron treatment                                Δ-value (meV/atom)
WIEN2k            13.1                 LAPW/APW+lo               all-electron                                      0
FHI-aims          081213               tier2 numerical orbitals  all-electron (relativistic atomic_zora scalar)    0.2
Exciting          development version  LAPW+xlo                  all-electron                                      0.2
Quantum ESPRESSO  5.1                  plane waves               SSSP Accuracy (mixed NC/US/PAW potential lib.)    0.3
Elk               3.1.5                APW+lo                    all-electron                                      0.3
VASP              5.2.12               plane waves               PAW 2015                                          0.4
FHI-aims          081213               tier2 numerical orbitals  all-electron (relativistic zora scalar 1e-12)     0.4
CASTEP            9.0                  plane waves               OTFG CASTEP 9.0                                   0.5
https://molmod.ugent.be/deltacodesdft
Science 351, aad3000 (2016)
slide from S. Cottenier
https://nomad-coe.eu
Request by colleagues from industry
Give guidelines about …
Date for special issues missing
Analyzing and estimating error bars from high-accuracy references
(Big)-Data Analytics
(Big) data of materials does not only provide direct information; the data is also structured.
How can we find (and understand) this structure?
(*) Work performed in collaboration with
Luca Ghiringhelli, Jan Vybiral, Claudia Draxl, Mario Boley,
Bryan Goldsmith, Runhai Ouyang, et al.
o crystal-structure prediction
o figure of merit of thermoelectrics (as function of T)
o turn-over frequency of catalytic materials (as function of T and p)
o efficiency of photovoltaic systems
o etc.
Dmitri Mendeleev (1834-1907)
From the periodic table of the elements to a chart (a map) of materials: organize materials according to their properties and functions.
Learning Descriptors for Materials-Science (Big) Data
Crystal-Structure Prediction
Classification: “Zincblende/Wurtzite or Rocksalt?”
A map of materials
82 octet AB binary compounds
Only DFT-LDA: Can we predict not yet calculated LDA structures from Z_A and Z_B?
We need to arrange the data such that statistical learning is efficient. We need a good set of descriptive parameters.
[Figure: map of materials with axes d1 and d2; labels: RS, ?]
How to find d1, d2? In reality the representation will be higher than 2-dimensional.
Find Structure in Big Data That Is A Priori “Not Visible”
Data Fitting, Statistical Learning, Machine Learning
[Figure: two panels showing the target property, one plotted against the calculation number and one against the descriptor D]
Arrange/organize materials with respect to a property and a set of simple descriptive parameters (a descriptor).
The descriptor can be designed: Rupp, von Lilienfeld, Behler, Csanyi, Seko, Tsuda, Tanaka, …
The descriptor can be selected out of a large set of candidates: Ozolins, Ghiringhelli, Ouyang.
More data means a better representation. Will we ever have enough data?
Kernel Regression
We have data {P_i} at “coordinates” {x_i}; x_i = set of descriptive parameters (descriptor).
$P_i = P(x_i) = \sum_{k=1}^{N} c_k K(x_i, x_k)$
Linear regression: $K(x_i, x_k) = x_i \cdot x_k$, so $P(x_i) = x_i \cdot c^*$
Polynomial kernel: $K(x_i, x_k) = (x_i \cdot x_k + c)^d$
Gaussian kernel: $K(x_i, x_k) = \exp\left(-\sum_j (x_{i,j} - x_{k,j})^2 / 2\sigma_j^2\right)$
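The kernel expansion P(x) = Σ_k c_k K(x, x_k) with a Gaussian kernel can be sketched in plain Python as follows. This is an illustrative sketch, not the method used in the cited work: it assumes a single width `sigma` for all descriptor components, adds a small ridge term `lam` to the kernel matrix (as in kernel ridge regression), and uses invented toy data.

```python
import math


def gaussian_kernel(xi, xk, sigma):
    """K(xi, xk) = exp(-sum_j (xi_j - xk_j)^2 / (2 sigma^2)), one sigma for all j."""
    sq = sum((a - b) ** 2 for a, b in zip(xi, xk))
    return math.exp(-sq / (2.0 * sigma ** 2))


def solve(matrix, rhs):
    """Solve matrix @ c = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    coeffs = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * coeffs[c] for c in range(r + 1, n))
        coeffs[r] = (aug[r][n] - tail) / aug[r][r]
    return coeffs


def fit(xs, ps, sigma, lam):
    """Solve (K + lam * I) c = P for the expansion coefficients c_k."""
    n = len(xs)
    k = [[gaussian_kernel(xs[i], xs[j], sigma) + (lam if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    return solve(k, ps)


def predict(x, xs, coeffs, sigma):
    """P(x) = sum_k c_k K(x, x_k)."""
    return sum(c * gaussian_kernel(x, xk, sigma) for c, xk in zip(coeffs, xs))


# Toy data (hypothetical): a one-component descriptor and a quadratic property.
xs = [[0.0], [1.0], [2.0], [3.0]]
ps = [0.0, 1.0, 4.0, 9.0]
coeffs = fit(xs, ps, sigma=1.0, lam=1e-8)
print(predict([1.5], xs, coeffs, sigma=1.0))
```

With a very small `lam` the fit reproduces the training values almost exactly, which illustrates why such a flexible fit needs a good descriptor and regularization to generalize.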
... the Gaussian Kernel
“With five Gaussians you can fit an elephant, and if you use a sixth one, the animal will wave its tail.”
1975 (and probably earlier): Experimentalists used “Gaussian least-squares fits” to fit their results (to fit any curve they produced): photoemission, vibrational spectroscopy, etc.
Now we are using a sum over thousands of Gaussians to fit our data (KRR with Gaussian kernels; used in nearly all ML work).
For successful learning, we need a “good” descriptor: $P(x_i) \rightarrow P(d_i)$
For materials science and quantum mechanics we need knowledge-based, domain-specific approaches. Even fitting with >10,000 parameters may not capture the enormous variety and intricate nature of materials phenomena.
Statistical Learning (Machine Learning) vs. Compressed Sensing
fit and/or interpolation of known data points { Pi } and building a function P(d)
the key scientific challenge: find a reliable, low-dimensional descriptor d.
Kernel ridge regression and the linear model: in both cases, minimize a sum of two terms (fitting error + regularization).
least absolute shrinkage and selection operator (LASSO) for feature selection
R. Tibshirani, J. Royal Statist. Soc. B 58, 267 (1996)
ℓ2 norm: $\sqrt{x_1^2 + y_1^2}$ (Euclidean distance)
ℓ1 norm: $|x_1| + |y_1|$ (Manhattan or taxicab distance)
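To make the role of the ℓ1 norm concrete, here is a minimal LASSO sketch using cyclic coordinate descent. It assumes the standard objective (1/2n) Σ_i (y_i − x_i·c)² + λ Σ_j |c_j|; the function names and the toy data are hypothetical and chosen only to show how the ℓ1 penalty drives irrelevant coefficients to exactly zero.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; exactly zero inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso(rows, targets, lam, n_sweeps=200):
    """Minimize (1/2n) * sum_i (y_i - x_i . c)^2 + lam * sum_j |c_j|."""
    n, m = len(rows), len(rows[0])
    coeffs = [0.0] * m
    for _ in range(n_sweeps):
        for j in range(m):
            # Correlation of feature j with the residual that ignores feature j.
            rho = 0.0
            for x, y in zip(rows, targets):
                partial = sum(x[k] * coeffs[k] for k in range(m) if k != j)
                rho += x[j] * (y - partial)
            rho /= n
            z = sum(x[j] ** 2 for x in rows) / n
            coeffs[j] = soft_threshold(rho, lam) / z
    return coeffs


# Hypothetical data: the target depends only on the first of three features.
rows = [[1.0, 0.5, -1.0], [2.0, -0.5, 0.3], [3.0, 0.2, 0.8], [4.0, -0.1, -0.4]]
targets = [2.0 * x[0] for x in rows]
coeffs = lasso(rows, targets, lam=0.1)
print(coeffs)
```

The nonzero coefficient is slightly smaller than the true value of 2 because the penalty also shrinks the selected features; this is the "shrinkage" in the method's name.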
Primary Features for the Rock Salt/Zincblende Structure Prediction
free atoms
free dimers
Enabling Feature Spaces with Billions of Elements by Sure Independence Screening
Runhai Ouyang
1. Systematically construct a huge feature space ($10^{11}$ elements) from $\Phi_0$ (23 primary features) using the operator set $R = \{+, -, \cdot, ^{-1}, ^{2}, ^{3}, \sqrt{\ }, \exp, \log, |\ |\}$.
2. Select top-ranked features using Sure Independence Screening (SIS) [1] (correlation learning). Select the n features corresponding to the n largest projections on the target property, i.e., the largest components of the vector $D^T y$.
D: matrix of the feature space (82 x 100 billion elements)
y: target property (here: rock salt-zincblende energy differences; 82 elements)
[1] J. Fan and J. Lv, J. R. Statist. Soc. B 70, 849 (2008)
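The screening step can be sketched as follows: standardize every candidate feature and the target, compute the components of D^T y, and keep the features with the largest absolute values. The primary features `a` and `b`, the handful of candidate features, and the target below are invented for illustration; the real feature space is vastly larger.

```python
import math


def standardize(column):
    """Shift to zero mean and scale to unit Euclidean norm (zeros if constant)."""
    mean = sum(column) / len(column)
    centered = [v - mean for v in column]
    norm = math.sqrt(sum(v * v for v in centered))
    if norm == 0.0:
        return [0.0] * len(column)
    return [v / norm for v in centered]


def sis(features, target, n_keep):
    """Return names of the n_keep features with the largest |D^T y| components."""
    y = standardize(target)
    scores = {}
    for name, column in features.items():
        d = standardize(column)
        scores[name] = abs(sum(a * b for a, b in zip(d, y)))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n_keep]


# Hypothetical primary features a, b and a few candidates built with simple operators.
a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [2.0, 1.0, 5.0, 3.0, 7.0]
candidates = {
    "a": a,
    "b": b,
    "a+b": [x + y for x, y in zip(a, b)],
    "a^2": [x ** 2 for x in a],
    "exp(a)": [math.exp(x) for x in a],
    "|a-b|": [abs(x - y) for x, y in zip(a, b)],
}
target = [x ** 2 + 0.1 * y for x, y in zip(a, b)]
selected = sis(candidates, target, n_keep=2)
print(selected)
```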
Iterative Application of Sure Independence Screening and ℓ0
Huge feature space (100 billion elements) constructed from $\Phi_0$ (23 elements)
1-dim. descriptor: SIS(y) → $S_1$ (~1,000 elements) → ℓ0
2-dim. descriptor: SIS($R_1$) → $S_2 \cup S_1$ → ℓ0
…
n-dim. descriptor: SIS($R_{n-1}$) → $S_n \cup \bigcup_{i=1}^{n-1} S_i$ → ℓ0
SIS = sure independence screening; $S_i$ = feature subspace
$R_n$: residual of the n-dim. descriptor w.r.t. y; e.g., $R_1 = y - (c_1 d_1 + c_0)$
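The iteration can be sketched in a few lines: at each dimension, SIS ranks the remaining features against the current residual, the selected features are added to the union of subspaces, and an exhaustive (ℓ0) search over all feature tuples in that union picks the best least-squares model. This is a toy sketch under stated assumptions, not the authors' code: the function names, the tiny feature set, and the data are hypothetical, and the linear solver assumes the selected features are not collinear.

```python
import itertools
import math


def solve(matrix, rhs):
    """Gaussian elimination with partial pivoting (assumes a non-singular matrix)."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (aug[r][n] - sum(aug[r][c] * x[c] for c in range(r + 1, n))) / aug[r][r]
    return x


def residual_of_fit(columns, y):
    """Least-squares fit y ~ c0 + sum_j c_j d_j; return the residual vector."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    m = len(rows[0])
    normal = [[sum(r[p] * r[q] for r in rows) for q in range(m)] for p in range(m)]
    rhs = [sum(r[p] * t for r, t in zip(rows, y)) for p in range(m)]
    c = solve(normal, rhs)
    return [t - sum(ci * v for ci, v in zip(c, r)) for r, t in zip(rows, y)]


def sis(features, vector, n_keep, skip):
    """Names of the n_keep features (not in skip) most correlated with vector."""
    def score(name):
        col = features[name]
        mean_c, mean_v = sum(col) / len(col), sum(vector) / len(vector)
        dc = [v - mean_c for v in col]
        dv = [v - mean_v for v in vector]
        norm = math.sqrt(sum(v * v for v in dc) * sum(v * v for v in dv))
        return abs(sum(p * q for p, q in zip(dc, dv))) / norm if norm else 0.0
    names = [n for n in features if n not in skip]
    return sorted(names, key=score, reverse=True)[:n_keep]


def iterative_sis_l0(features, y, max_dim, n_keep):
    """Grow the subspace with SIS on the residual, then search it exhaustively."""
    subspace, residual, best = [], y, ()
    for dim in range(1, max_dim + 1):
        subspace += sis(features, residual, n_keep, subspace)

        def rss(names):
            res = residual_of_fit([features[n] for n in names], y)
            return sum(v * v for v in res)

        best = min(itertools.combinations(subspace, dim), key=rss)  # l0 step
        residual = residual_of_fit([features[n] for n in best], y)
    return best


# Hypothetical toy data: the target is built from two of the five candidate features.
a = [-2.0, -1.0, 0.0, 1.0, 2.0]
b = [-1.0, 2.0, 0.0, -2.0, 1.0]
features = {
    "a": a,
    "b": b,
    "a^2": [v ** 2 for v in a],
    "|b|": [abs(v) for v in b],
    "exp(a)": [math.exp(v) for v in a],
}
y = [3.0 * p ** 2 + 2.0 * q for p, q in zip(a, b)]
best = iterative_sis_l0(features, y, max_dim=2, n_keep=2)
print(best)
```

The exhaustive search is only feasible because SIS keeps each subspace small; applying ℓ0 directly to the full feature space would be combinatorially out of reach.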
“The Map”: Compressed Sensing, 2-Dimensional Descriptor
The complexity and the science are in the descriptor, identified from >10,000 features.
Using no information on BN and C, we would have predicted the existence and unusual stability of these materials.
L.M. Ghiringhelli, J. Vybiral, S.V. Levchenko, C. Draxl, and M. Scheffler, PRL 114 (2015).
Crystal structure prediction (probably the most fundamental and important challenge in materials science)
Predicting energy differences between crystal structures
Building structure maps for crystal-structure classification
Topological Insulators (Quantum Spin Hall Systems)
[Figure: energy versus momentum for a trivial insulator and a topological insulator, with labels: topological transition (spin-orbit coupling, electric field, strain), surface states, Fermi level]
2D topological insulators: no backscattering in edge states.
Promising materials in spintronics applications.
Characterized by a Z2 topological index.
Towards an Understanding of Topological Insulators
Functionalized Honeycomb (2D) Systems
ABC2 Compounds
Calculate geometry, band structure, and Z2 for 220 materials.
Compounds functionalized with F, Cl, Br, I are represented by diamonds, squares, circles, and triangles, respectively.
Metals: green
Functionalization-dependent QSHIs: red
Functionalization-independent QSHIs: blue
Trivial insulators: white/grey
Colored background: compressed sensing
The identified descriptors only include properties of the free atoms!
The descriptor identifies materials that do not follow commonly believed rules of thumb.
Success Stories and Ongoing Projects (at the NOMAD home page)
Catalysis
[Figure: energy versus reaction coordinate, from the energy of the free reactants (e.g. CO and O2) to the energy of the products (e.g. CO2), with the energy difference ΔE]
Jöns Jakob Berzelius (1779-1848)
Wilhelm Ostwald (1853-1932), Nobel Prize in Chemistry 1909
A catalyst is a material that accelerates the kinetics of a chemical reaction but is not part of the final chemical product.
Catalysis
Transform CO2 or CO Into Something Useful
Carbonaceous Acid
Formaldehyde
Methanol
Methane
We need better catalysts!
Our understanding of the dynamics and kinetics of heterogeneous catalysis is very shallow. The structure and composition of the material that is catalytically active (at realistic p and high T conditions) are largely unknown.
Significant Progress … But Our Knowledge Is Still Close to Zero
About 240,000 inorganic compounds have been synthesized so far. Many more are possible.
And what do we know?
Elastic constants: about 200 compounds
Superconductors: ≈ 1000
Dielectric constant: ≈ 300-400
For almost every property we are below 1% in coverage.
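The coverage claim can be checked with simple arithmetic on the numbers quoted above. The snippet is only a worked example; it takes the upper end (400) of the quoted range for dielectric constants.

```python
TOTAL_COMPOUNDS = 240_000  # synthesized inorganic compounds, as quoted on the slide

known = {
    "elastic constants": 200,
    "superconductors": 1000,
    "dielectric constant": 400,  # upper end of the quoted 300-400 range
}

# Percentage of known compounds for which each property is available.
coverage_percent = {
    name: 100.0 * count / TOTAL_COMPOUNDS for name, count in known.items()
}

for name, percent in coverage_percent.items():
    print(f"{name}: {percent:.2f}%")
```

This gives roughly 0.08%, 0.42%, and 0.17%, all well below 1%.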
The NOMAD Laboratory: A European Centre of Excellence
We need to develop domain-specific compressed-sensing and other machine-learning tools to identify causal descriptors (physical models).
The number of different materials is huge. However, the number of materials that exhibit a certain function is rather small, i.e., the space of chemical and structural compounds is sparsely populated.
[Figure: relevance of a new technology versus time, with a curve labeled “Reality”, for big-data analytics in materials science]
[Figure labels added in the final build: “Perception”; “we are probably here”]