Materials Project Validation, Provenance, and Sandboxes by Dan Gunter


Upload: dan-gunter

Post on 20-Jun-2015


DESCRIPTION

Summary of goals, progress, and next steps for these three aspects of the Materials Project (materialsproject.org) infrastructure:
* Validation: constantly guard against bugs in core data and imported data
* Provenance: know how data came to be
* Sandboxes: combine public and non-public data; "good fences make good neighbors"
Presenter: Dan Gunter, LBNL

TRANSCRIPT

Page 1: Materials Project Validation, Provenance, and Sandboxes by Dan Gunter

The Materials Project Validation, Provenance and Sandboxes

Page 2: Materials Project Validation, Provenance, and Sandboxes by Dan Gunter

Goals

• Validation: constantly guard against bugs in core data and imported data

• Provenance: know how data came to be

• Sandboxes: combine public and non-public data; "good fences make good neighbors"

Page 3: Materials Project Validation, Provenance, and Sandboxes by Dan Gunter

Validation

[Figure labels: (Internal) Database ID; External ID; What we expected; What we got]

Page 4: Materials Project Validation, Provenance, and Sandboxes by Dan Gunter

Validation runs all the time

• Rules with "constraints" for every database (and sandbox)

• Test constraints against the entire DB every night; email reports

• Validation engine, etc. are all open-source software in pymatgen-db

[Diagram labels: Remote server; Validation engine; Rules; MP Databases; Reports (email, web pages, ..)]
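The "email reports" step could look roughly like the sketch below. It is not taken from pymatgen-db: the function name `build_report`, the tuple layout of `failures`, and the example addresses are all assumptions made for illustration. It only builds the message with Python's standard library; sending it is left out.

```python
from email.message import EmailMessage


def build_report(db_name, failures, sender, recipient):
    """Build (but do not send) an email summarizing constraint failures.

    `failures` is a list of (record_id, constraint, actual_value) tuples.
    This layout is a hypothetical choice, not the pymatgen-db format.
    """
    msg = EmailMessage()
    msg["Subject"] = f"[validation] {db_name}: {len(failures)} failure(s)"
    msg["From"] = sender
    msg["To"] = recipient
    # One line per failure: what we expected versus what we got.
    lines = [
        f"{record_id}: expected {constraint}, got {actual!r}"
        for record_id, constraint, actual in failures
    ]
    msg.set_content("\n".join(lines) or "All constraints passed.")
    return msg


# Example with made-up data and placeholder addresses.
report = build_report(
    "materials",
    [("mp-540081", "icsd_id size> 10", 2)],
    "validator@example.org",
    "admin@example.org",
)
print(report["Subject"])
```

A nightly scheduler would call something like this once per database and hand the result to a mail transport.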

Page 5: Materials Project Validation, Provenance, and Sandboxes by Dan Gunter

Rules have a simple syntax

_aliases:
  - snl_id = mps_id
  - energy = analysis.e_above_hull
materials:
  -
    filter:
    constraints:
      - final_energy_per_atom <= 0
      - initial_structure.lattice.volume > 0
      - initial_structure.lattice.a > 0
      - initial_structure.lattice.b > 0
      - initial_structure.lattice.c > 0
      - initial_structure.lattice.matrix size 3
      - formation_energy_per_atom <= 5
      - formation_energy_per_atom > -5
      - cpu_time > 5
      - e_above_hull > -0.000001
      - final_energy < 0
      - reduced_cell_formula size$ nelements
  # Check num. ICSD sources for selected compounds
  -
    filter:
      - task_id = "mp-540081"
    constraints:
      - icsd_id size> 10
  -
    filter:
      - task_id = "mp-20379"
    constraints:
      - icsd_id size 1
  -
    filter:
      - task_id = "mp-13634"
    constraints:
      - icsd_id size> 0
  -
    filter:
      - task_id = "mp-600022"
    constraints:
      - icsd_id size 0
  # NiO2 phases should never become stable
  -
    filter:
      - e_above_hull = 0
    constraints:
      - pretty_formula != 'NiO2'
tasks:
  -
    filter:
      - state = "successful"
    constraints:
      - output.final_energy_per_atom <= 0
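To make the rule syntax concrete, here is a minimal sketch of how one "field op value" constraint from the slide might be evaluated against a single record. It is an illustration only, not the pymatgen-db validation engine: the helper names (`get_path`, `parse_value`, `check`, `violations`) are made up, and the sketch handles only literal comparisons and the `size`, `size>`, `size<` forms. It does not handle `_aliases`, filters, or the `size$` form that compares against another field.

```python
import operator
import re

# Plain comparison operators that appear in the slide's rules.
OPS = {
    "<=": operator.le, ">=": operator.ge, "!=": operator.ne,
    "<": operator.lt, ">": operator.gt, "=": operator.eq,
}


def get_path(doc, path):
    """Follow a dotted path such as 'initial_structure.lattice.a' into a nested dict."""
    for key in path.split("."):
        doc = doc[key]
    return doc


def parse_value(text):
    """Turn the right-hand side of a rule into a float or an unquoted string."""
    text = text.strip()
    if text and text[0] in "\"'":
        return text[1:-1]
    return float(text)


def check(doc, constraint):
    """Return True if `doc` satisfies one 'field op value' constraint."""
    m = re.match(r"(\S+)\s+(size[<>]?|<=|>=|!=|<|>|=)\s*(.+)", constraint)
    if m is None:
        raise ValueError(f"cannot parse constraint: {constraint!r}")
    field, op, rhs = m.group(1), m.group(2), parse_value(m.group(3))
    actual = get_path(doc, field)
    if op.startswith("size"):
        # 'size' tests the exact length; 'size>' and 'size<' compare the length.
        return OPS[op[4:] or "="](len(actual), rhs)
    return OPS[op](actual, rhs)


def violations(doc, constraints):
    """Return the constraints that `doc` fails."""
    return [c for c in constraints if not check(doc, c)]


# Made-up record, checked against a few rules copied from the slide.
doc = {
    "final_energy_per_atom": -3.2,
    "icsd_id": [1, 2],
    "initial_structure": {
        "lattice": {"a": 4.1, "matrix": [[1, 0, 0], [0, 1, 0], [0, 0, 1]]}
    },
}
rules = [
    "final_energy_per_atom <= 0",
    "initial_structure.lattice.a > 0",
    "initial_structure.lattice.matrix size 3",
    "icsd_id size> 10",
]
print(violations(doc, rules))  # ['icsd_id size> 10']
```

Keeping each rule as a one-line string is what lets new checks be added without writing code, which matches the slide's "simple syntax" point.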

Page 6: Materials Project Validation, Provenance, and Sandboxes by Dan Gunter

Validation summary

Easy-to-use, integrated, efficient tools to report errors

Next steps

– Record all check results in DB

– More sophisticated checks (Map/Reduce)

– Make it easier to add new checks internally

– Make it easier for anyone to add new checks

• per-sandbox or even per-user ("MP Alerts")

Page 7: Materials Project Validation, Provenance, and Sandboxes by Dan Gunter

Provenance: How do I know that the data is correct?

Page 8: Materials Project Validation, Provenance, and Sandboxes by Dan Gunter

Types of provenance in the system

1) Calculation workflows

– FireWorks records calculation inputs, .. results in great detail

2) External datasets

– Structure Notation Language standardizes the naming of data sources and publications

3) Post-calculation data transformations

– New "builders" provide a framework for tracking creation of final database products

Page 9: Materials Project Validation, Provenance, and Sandboxes by Dan Gunter

Provenance is available for every material

Page 10: Materials Project Validation, Provenance, and Sandboxes by Dan Gunter

Provenance in DB: Structure Notation Language

"snl_final": {
  "about": {
    "created_at": {
      "string": "2014-02-22 19:07:00.383869",
      "@class": "datetime",
      "@module": "datetime"
    },
    "_materialsproject": {
      "submission_id": 52621,
      "snl_id": 398676,
      "spacegroup": {
        "lattice_type": "tetragonal",
        "symbol": "P4_2/mmc",
        "number": 131,
        "point_group": "4/mmm",
        "crystal_system": "tetragonal",
        "hall": "-P 4c 2"
      }
    },
    "_cedergroup": {
      "BURP_sids": [
        409544,
        409545,
        409546
      ],
      "icsd_ids": [
      ],
      "e_above_hull": 0.075125350000000423734
    },
    "references": "",
    "authors": [
      {
        "name": "Geoffroy Hautier",
        "email": "[email protected]"
      },
      {
        "name": "Bo Xu",
        "email": "[email protected]"
      }
    ],
    "remarks": [
      "supplementary compounds from MIT matgen database"
    ],
    "projects": [
      "MIT matgen"
    ],
    "history": [
      {
        "url": "http://www.fiz-karlsruhe.de/icsd_home.html",
        "name": "Inorganic Crystal Structure Database",
        "description": {
          "Collection code": 24692
        }
      },
      {
        "url": "",
        "name": "",
        "description": {
          "source": null,
          "orig_name": "Basic substitution code.",
          "formula": "O1 Pd1"
        }
      },
      {
        "url": "http://ceder.mit.edu/",
        "name": "MIT Ceder group research database",
        "description": {
          "source": 105986,
          "orig_name": "",
          "formula": "FeO"
        }
      },
      {
        "url": "http://www.materialsproject.org",
        "name": "Materials Project structure optimization",
        "description": {
          "fw_id": 820305,
          "task_type": "GGA optimize structure (2x)",
          "task_id": "mp-753682"
        }
      },
      {
        "url": "http://www.materialsproject.org",
        "name": "Materials Project structure optimization",
        "description": {
          "fw_id": 820308,
          "task_type": "GGA+U optimize structure (2x)",
          "task_id": "mp-776678"
        }
      }
    ]
  },

[Slide callouts on the record: Metadata; Crystal; DB sources; References; History of structure optimizations]
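As a rough illustration of how the "history" list in the record above can be read as a provenance chain, the sketch below turns each history node into a one-line summary. The function name `summarize_history` and the order in which description fields are preferred are assumptions for this example; the record layout is the one shown on the slide, abbreviated here.

```python
def summarize_history(snl_about):
    """Return one line per step in an SNL 'about.history' list, in stored order."""
    steps = []
    for node in snl_about.get("history", []):
        desc = node.get("description") or {}
        # Prefer a task id for MP calculations; otherwise fall back to whatever
        # identifying detail the description carries (assumed preference order).
        detail = (
            desc.get("task_id")
            or desc.get("Collection code")
            or desc.get("orig_name")
            or desc.get("source")
            or ""
        )
        name = node.get("name") or "(unnamed step)"
        steps.append(f"{name}: {detail}" if detail else name)
    return steps


# Abbreviated copy of the slide's record.
about = {
    "history": [
        {
            "url": "http://www.fiz-karlsruhe.de/icsd_home.html",
            "name": "Inorganic Crystal Structure Database",
            "description": {"Collection code": 24692},
        },
        {
            "url": "",
            "name": "",
            "description": {
                "source": None,
                "orig_name": "Basic substitution code.",
                "formula": "O1 Pd1",
            },
        },
        {
            "url": "http://www.materialsproject.org",
            "name": "Materials Project structure optimization",
            "description": {
                "fw_id": 820305,
                "task_type": "GGA optimize structure (2x)",
                "task_id": "mp-753682",
            },
        },
    ]
}
for line in summarize_history(about):
    print(line)
```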

Page 11: Materials Project Validation, Provenance, and Sandboxes by Dan Gunter

Future work: unified view of provenance

[Diagram labels: ICSD; Data import processing; Computation; VASP result (three boxes); Post-processing; Material properties; e.g., Defects]

Page 12: Materials Project Validation, Provenance, and Sandboxes by Dan Gunter

Sandbox example: Multivalent

JCESR users

Non-JCESR users

Page 13: Materials Project Validation, Provenance, and Sandboxes by Dan Gunter

Multivalent app

Page 14: Materials Project Validation, Provenance, and Sandboxes by Dan Gunter

Sandboxes = Database + Apps

Non-JCESR users: Core data

JCESR users: Core data + multivalent materials
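One way to picture "core data plus sandbox data" is a per-user merged view, sketched below. This is an assumption-laden illustration, not the Materials Project implementation: the function name `visible_materials`, the dict-of-records layout, the material ids, and the choice that sandbox records override core records with the same id are all hypothetical.

```python
def visible_materials(core, sandboxes, user_groups):
    """Merge core records with records from each sandbox the user belongs to.

    `core` and each sandbox map a material id to a record. Sandbox entries
    extend the core view and, by assumption, override entries with the same id.
    """
    view = dict(core)
    for name, records in sandboxes.items():
        if name in user_groups:
            view.update(records)
    return view


# Made-up ids and records for illustration.
core = {"mp-1": {"formula": "FeO"}}
sandboxes = {"jcesr": {"mv-1": {"formula": "MgMn2O4"}}}

public_view = visible_materials(core, sandboxes, user_groups=set())
jcesr_view = visible_materials(core, sandboxes, user_groups={"jcesr"})
print(sorted(public_view), sorted(jcesr_view))
```

The core dictionary is never modified, so non-members keep seeing only the public data.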

Page 15: Materials Project Validation, Provenance, and Sandboxes by Dan Gunter

Technical challenges

• Pre-process data for real-time search

• Interfaces for per-user access control

– https://materialsproject.org/materials/1234?sandbox=jcesr

– Web UI elements
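The `?sandbox=jcesr` query parameter in the URL above suggests a per-request check along the following lines. This is a minimal sketch using only Python's standard `urllib.parse`; the function names and the rule that a request without a `sandbox` parameter is treated as public are assumptions, not documented Materials Project behavior.

```python
from urllib.parse import parse_qs, urlparse


def requested_sandbox(url):
    """Return the value of the 'sandbox' query parameter, or None if absent."""
    values = parse_qs(urlparse(url).query).get("sandbox")
    return values[0] if values else None


def can_access(url, user_sandboxes):
    """Allow public requests, and sandbox requests only for members.

    Treating a missing parameter as a public request is an assumption.
    """
    name = requested_sandbox(url)
    return name is None or name in user_sandboxes


url = "https://materialsproject.org/materials/1234?sandbox=jcesr"
print(requested_sandbox(url))          # jcesr
print(can_access(url, {"jcesr"}))      # True
print(can_access(url, set()))          # False
```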

Page 16: Materials Project Validation, Provenance, and Sandboxes by Dan Gunter

Future: dynamic sandbox creation

Current:

– Large & significant additional data / apps

• e.g., JCESR

– Longer-term connections to MP data

• e.g., porous materials

– Companies

• e.g., VW/Stanford

Future:

– small collab.

– per-user?

– CoD?

Page 17: Materials Project Validation, Provenance, and Sandboxes by Dan Gunter

Summary

• Validation: guard against bugs by checking all data daily and at data import/creation time

• Provenance: universal standard for annotating data provenance

• Sandboxes: unified view of distinct databases; onramp for new collaborations and data