

Finding patterns, correlations and descriptors in materials-science data using subgroup discovery

Bryan R. Goldsmith, Mario Boley, Jilles Vreeken, Luca M. Ghiringhelli, Matthias Scheffler Fritz Haber Institute of the Max Planck Society, Theory Department, Berlin, Germany

Introduction

Conclusions

Pollution abatement

Catalyst stability and dynamics under reaction conditions

Gas phase gold clusters

Octet binary compounds

Support affects cluster reactivity and other properties

[Figure: experimental and predicted spectra at 100 K, intensity (arb. u.) versus frequency; 200 K panel labeled P = 68%, 26%, 1%, 6%]

Compute features A and target features Y for each structure

Subgroup discovery

Define binary selectors and utility functions

Subgroup discovery

[Figure labels: Zr, Nb, Mo, Tc, Ru, Rh, Pd, Ag]

Patterns between material properties and material behavior are typically discovered from experience

[1] E. Fernández, J. M. Soler, I. L. Garzón, L. C. Balbás, Phys. Rev. B 70, 165403 (2004)

[2] J. K. Nørskov, T. Bligaard, J. Rossmeisl, C. H. Christensen, Nat. Chem. 1, 37 (2009)

Can we automatically discover materials insight using big-data analytics tools?

Comprehensible Intuitive

Knowledge discovery

Regions in Europe that have $T_{\mathrm{max,March}} \le 7.97$ °C and $T_{\mathrm{max,Sep.}} \le 17.5$ °C

Alps

Pyrenees

Northern Europe

[1] W. Duivesteijn, A. J. Feelders, and A. Knobbe, Data Mining and Knowledge Discovery 30, 47 (2016)

Subgroup discovery (SGD) aims to identify descriptors of statistically unusual subgroups having some property of interest

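[1] Material features, $a_1, \dots, a_m \in A$, e.g., energy, bonding topology, number of atoms

[2] Target material features, $y_1, \dots, y_n \in Y$, e.g., HOMO-LUMO gap

[3] Binary selectors, $c_1, \dots, c_k \in X \to \{\mathrm{false}, \mathrm{true}\}$, e.g., "Is there an even number of atoms in the gold cluster?" or "Is the median coordination number ≥ four?"

[4] Find the selector $\sigma(\cdot) = c_1(\cdot) \wedge \dots \wedge c_l(\cdot)$ that maximizes the quality

$$q(\sigma) = \left( \frac{|\mathrm{ext}(\sigma)|}{|P|} \right)^{\alpha} u(Y_\sigma)^{1-\alpha}$$

where $|\mathrm{ext}(\sigma)| / |P|$ is the coverage of points where $\sigma$ is true and $u(Y_\sigma)$ is the utility function (optimization criterion).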
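The four steps above can be sketched in a few lines of Python. This is only a minimal illustration, not the realKD or CREEDO implementation: the helper names (`conjunction`, `extension`, `variance_reduction`, `quality`) are hypothetical, the toy population is synthetic, and the specific form of the utility function (relative reduction of the standard deviation) is an assumption chosen for the example.

```python
from typing import Callable, Dict, List, Sequence

Row = Dict[str, float]
Selector = Callable[[Row], bool]


def conjunction(propositions: Sequence[Selector]) -> Selector:
    """Build sigma(x) = c_1(x) AND ... AND c_l(x) from basic propositions."""
    return lambda row: all(c(row) for c in propositions)


def extension(sigma: Selector, population: List[Row]) -> List[Row]:
    """ext(sigma): the members of the population for which sigma is true."""
    return [row for row in population if sigma(row)]


def std(values: Sequence[float]) -> float:
    """Population standard deviation (0.0 for an empty sequence)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5


def variance_reduction(target: str, subgroup: List[Row], population: List[Row]) -> float:
    """Assumed utility u(Y_sigma): relative reduction in spread, clipped at 0."""
    total = std([row[target] for row in population])
    if total == 0.0 or not subgroup:
        return 0.0
    return max(0.0, 1.0 - std([row[target] for row in subgroup]) / total)


def quality(sigma: Selector, population: List[Row], target: str, alpha: float = 0.5) -> float:
    """q(sigma) = coverage^alpha * u(Y_sigma)^(1 - alpha)."""
    subgroup = extension(sigma, population)
    coverage = len(subgroup) / len(population)
    return coverage ** alpha * variance_reduction(target, subgroup, population) ** (1 - alpha)


# Synthetic toy data (not from the poster): cluster size N and a made-up gap value.
population = [
    {"N": 6, "gap": 2.0},
    {"N": 8, "gap": 2.0},
    {"N": 7, "gap": 1.0},
    {"N": 9, "gap": 0.6},
]
n_even = lambda row: row["N"] % 2 == 0  # "Is there an even number of atoms?"
sigma = conjunction([n_even])
q = quality(sigma, population, target="gap", alpha=0.5)
```

In this toy case the selector covers half of the population and removes all spread in the target, so the quality reduces to the square root of the coverage. The exponent `alpha` trades off subgroup size against utility: values near 1 favor large subgroups, values near 0 favor small but very homogeneous ones.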

[2] www.realkd.org

Material features: The features computed for each cluster geometry are total energy ($\Delta E$), normalized radius of gyration ($R_{g0}$), ionization potential (IP), electron affinity (EA), HOMO-LUMO energy gap ($E_{\mathrm{HL}}$), chemical hardness ($\Delta\eta$), cluster size ($N$), atom coordination histogram, and intramolecular van der Waals energy ($\Delta E_{\mathrm{vdW}}$), among others.

Coexistence of isomers

'N even'

'N odd'

B. R. Goldsmith, M. Boley, J. Vreeken, L. M. Ghiringhelli, M. Scheffler, In progress


Density-functional approximation and settings used for all replica-exchange simulations: PBE + many-body dispersion; atomic ZORA scalar relativistic correction; spin polarized; ‘light-tier 1’ settings

Replica-exchange molecular dynamics (REMD): an enhanced, unbiased sampling method

Simulated in the generalized canonical (NVT) ensemble with 10 replicas spanning 100-850 K
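As a rough illustration of how replica exchange works, the sketch below builds a ladder of 10 temperatures between 100 K and 850 K and applies the standard Metropolis criterion for swapping neighboring replicas, accepting with probability min(1, exp[(β_i − β_j)(E_i − E_j)]). The geometric spacing of the ladder, the function names, and the example energies are assumptions for illustration; the poster does not state how the temperatures were spaced or how the exchanges were scheduled.

```python
import math
import random

K_B = 8.617333262e-5  # Boltzmann constant in eV/K


def temperature_ladder(t_min: float, t_max: float, n: int) -> list:
    """Geometric temperature ladder (the spacing is an assumption)."""
    ratio = (t_max / t_min) ** (1.0 / (n - 1))
    return [t_min * ratio ** i for i in range(n)]


def swap_probability(e_i: float, e_j: float, t_i: float, t_j: float) -> float:
    """Metropolis acceptance probability for exchanging replicas i and j."""
    beta_i = 1.0 / (K_B * t_i)
    beta_j = 1.0 / (K_B * t_j)
    delta = (beta_i - beta_j) * (e_i - e_j)
    return 1.0 if delta >= 0.0 else math.exp(delta)


def attempt_swaps(energies, temps, rng, offset=0):
    """Try to exchange neighboring pairs (i, i+1) starting at `offset`.

    Returns the configuration index held at each temperature slot afterwards.
    """
    order = list(range(len(temps)))
    energies = list(energies)  # work on a copy
    for i in range(offset, len(temps) - 1, 2):
        p = swap_probability(energies[i], energies[i + 1], temps[i], temps[i + 1])
        if rng.random() < p:
            order[i], order[i + 1] = order[i + 1], order[i]
            energies[i], energies[i + 1] = energies[i + 1], energies[i]
    return order


temps = temperature_ladder(100.0, 850.0, 10)

# Made-up potential energies (eV): the colder replica sits higher in energy,
# so the exchange is always accepted.
order = attempt_swaps([-1.0, -2.0], [100.0, 200.0], random.Random(0))
```

Alternating `offset` between 0 and 1 on successive sweeps lets every neighboring pair be tried in turn, which is a common scheduling choice rather than something specified here.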

Figure reprinted from G. Pilania, J. E. Gubernatis, and T. Lookman, Sci. Rep. 5, 17504 (2015)

L. M. Ghiringhelli, J. Vybiral, S. V. Levchenko, C. Draxl, M. Scheffler, Phys. Rev. Lett. 114, 105503 (2015)

Zincblende (ZB)

Rocksalt (RS)

https://www.nomad-coe.eu/

Subgroup discovery is a useful data-mining tool

Finds physicochemical descriptions of gold clusters

Finds descriptors that classify binary octet compounds

Big data in materials science contains hidden structures and correlations that may not be detectable by standard tools. Subgroup discovery can find interesting local patterns in materials-science data.

RS

ZB/WZ

[1] [2]

Two exemplary applications of SGD are shown here:

1) Find descriptors of gold clusters (of size 5-14) to discern structure-property relationships.

2) Find descriptors that classify 82 octet AB-type binary compounds.

Ab initio replica-exchange molecular dynamics to generate 25 000 neutral gold cluster (of size 5-14) configurations from the canonical distribution

Gas phase clusters serve as interesting model systems to probe the fundamental physicochemical properties of condensed matter

Subgroup discovery can find intuitive rule-based models ($\sigma_1^{\mathrm{ZB}}$, $\sigma_1^{\mathrm{RS}}$) that classify 79 of the 82 octet AB-type binary compounds as either RS or ZB

RS structures more readily form in compounds with ionic character, whereas ZB structures tend to form in compounds with more covalent character.

The octet AB-type binary compounds have generated sustained interest in finding descriptors that can classify their crystal structures as rocksalt or zincblende

SGD to find relations between geometrical and physicochemical properties

Funding and acknowledgements: B. R. Goldsmith thanks his colleagues at the Fritz Haber Institute for stimulating discussions and acknowledges the Humboldt Foundation for a postdoctoral fellowship.

+ four other nonplanar structures

[1]

FHI-aims code with CREEDO + realKD library [2]

The HOMO-LUMO gap oscillates between high and low for small nanoclusters with an even or odd number of atoms

The binding energy of oxygen to a material is linearly correlated with the d-band center’s position with respect to the Fermi level

subgroup (pattern) random noise

random noise + pattern

Gold cluster configurations are examined for subgroups of the HOMO-LUMO energy gap with a low-variation distribution. (a) The HOMO-LUMO energy gap of each cluster is shown against its relative total energy. (b) The probability distribution of the HOMO-LUMO energy gap is shown with the three main subgroups labeled. (c) The HOMO-LUMO energy gap of each cluster configuration is shown as a function of cluster size. Planar/quasi-planar and nonplanar (compact, three-dimensional) gold cluster configurations are denoted by circles and squares, respectively. Red: subgroup described by $\sigma_2^{\mathrm{HL}}$.

Gold cluster configurations are examined for patterns involving intra-cluster van der Waals interactions. (a) SGD finds subgroup selectors using the variation reduction utility function with the intramolecular vdW interaction energy (referenced to its maximum value) as the target variable ($\Delta E_{\mathrm{vdW}}$). Red: subgroup described by $\sigma_1^{\mathrm{vdW}}$; Purple: additional points included in the generalized variant, $\mathrm{ext}(\sigma_2^{\mathrm{vdW}} \wedge \neg\sigma_1^{\mathrm{vdW}}) = \mathrm{ext}(\sigma_2^{\mathrm{vdW}}) \setminus \mathrm{ext}(\sigma_1^{\mathrm{vdW}})$; Grey: points described by neither selector, $\neg(\sigma_1^{\mathrm{vdW}} \vee \sigma_2^{\mathrm{vdW}}) = \neg\sigma_2^{\mathrm{vdW}}$. (b) The difference in intramolecular vdW energy per atom between the lowest-energy planar and nonplanar gold cluster geometries as a function of gold cluster size.
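The set relations in the caption above hold whenever one selector generalizes another, that is, whenever every point described by the first is also described by the second. The short check below illustrates this on a synthetic population. The two selectors are hypothetical stand-ins (the actual vdW selectors are not given in this text), and `ext` is an illustrative helper that returns the indices of the covered points.

```python
def ext(selector, population):
    """Indices of the points in the population for which the selector is true."""
    return {i for i, row in enumerate(population) if selector(row)}


# Synthetic population: one planar and one nonplanar entry per cluster size 5-14.
population = [{"N": n, "planar": p} for n in range(5, 15) for p in (True, False)]

# Hypothetical selectors: sigma_2 drops one condition of sigma_1,
# so it is a generalization and ext(sigma_1) is a subset of ext(sigma_2).
sigma_1 = lambda row: row["planar"] and row["N"] <= 8
sigma_2 = lambda row: row["planar"]

# "Additional points": covered by sigma_2 but not by sigma_1.
additional = ext(lambda row: sigma_2(row) and not sigma_1(row), population)
difference = ext(sigma_2, population) - ext(sigma_1, population)

# "Neither selector": because sigma_1 implies sigma_2, this equals NOT sigma_2.
neither = ext(lambda row: not (sigma_1(row) or sigma_2(row)), population)
not_sigma_2 = ext(lambda row: not sigma_2(row), population)
```

Here `additional` and `difference` are the same set, as are `neither` and `not_sigma_2`, which is why the grey points in the figure can be labeled by the negation of the more general selector alone.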

Gold cluster configurations are examined for linear correlations involving chemical hardness and the total energy of the cluster. (a) The electronic hardness is shown against the cluster stability. (b) The hardness is shown as a function of cluster size. Red: subgroup described by $\sigma_1^{\mathrm{hd}}$; Blue: additional points included in the generalized variant, $\mathrm{ext}(\sigma_2^{\mathrm{hd}} \wedge \neg\sigma_1^{\mathrm{hd}}) = \mathrm{ext}(\sigma_2^{\mathrm{hd}}) \setminus \mathrm{ext}(\sigma_1^{\mathrm{hd}})$, which is displayed in Figure 3a only; Grey: points described by neither selector, $\neg(\sigma_1^{\mathrm{hd}} \vee \sigma_2^{\mathrm{hd}})$.

Application of subgroup discovery to the 82 octet binary semiconductors helps us identify interpretable selectors that describe subgroups of the rocksalt (RS) and zincblende (ZB) structures. (a) The subgroups described by selectors $\sigma_1^{\mathrm{RS}}$ (describing the RS subgroup) and $\sigma_1^{\mathrm{ZB}}$ (describing the ZB subgroup) with the highest quality value for a two-dimensional descriptor. (b) The subgroups described by selectors consisting of St. John and Bloch’s $r_\sigma$ and $r_\pi$ descriptors. (c) The subgroups described by selectors consisting of the two-dimensional descriptor found by Ghiringhelli and coworkers using LASSO+$\ell_0$. The dashed black line denotes the linear separating hyperplane that the two-dimensional descriptor was originally optimized to describe (via LASSO+$\ell_0$). The squares and circles denote zincblende and rocksalt crystal structures, respectively. Green: rocksalt subgroup described by $\sigma_i^{\mathrm{RS}}$; Blue: zincblende subgroup described by $\sigma_i^{\mathrm{ZB}}$; Grey: points described by neither selector. The dashed blue and green lines denote the (non-linear) intersection of axis-parallel hyperplanes that contain subgroups.


NOMAD data analytics tool kit https://labdev-nomad.esc.rzg.mpg.de/home/