Materials Project Validation, Provenance, and Sandboxes by Dan Gunter

Post on 20-Jun-2015

DESCRIPTION

Summary of goals, progress, and next steps for these three aspects of the Materials Project (materialsproject.org) infrastructure:

• Validation: constantly guard against bugs in core data and imported data
• Provenance: know how data came to be
• Sandboxes: combine public and non-public data; "good fences make good neighbors"

Presenter: Dan Gunter, LBNL

TRANSCRIPT

The Materials Project Validation, Provenance and Sandboxes

Goals

• Validation
– constantly guard against bugs in core data and imported data

• Provenance
– know how data came to be

• Sandboxes
– combine public and non-public data; "good fences make good neighbors"

Validation

[Figure: table with columns (Internal) Database ID, External ID, What we expected, What we got]

Validation runs all the time

• Rules with "constraints" for every database (and sandbox)
• Test constraints against the entire DB every night; email reports
• Validation engine, etc. are all open-source software in pymatgen-db

[Diagram labels: Remote server, Validation engine, Rules, MP Databases, Reports (email, web pages, ..)]

Rules have a simple syntax

_aliases:
  - snl_id = mps_id
  - energy = analysis.e_above_hull
materials:
  -
    filter:
    constraints:
      - final_energy_per_atom <= 0
      - initial_structure.lattice.volume > 0
      - initial_structure.lattice.a > 0
      - initial_structure.lattice.b > 0
      - initial_structure.lattice.c > 0
      - initial_structure.lattice.matrix size 3
      - formation_energy_per_atom <= 5
      - formation_energy_per_atom > -5
      - cpu_time > 5
      - e_above_hull > -0.000001
      - final_energy < 0
      - reduced_cell_formula size$ nelements
  # Check num. ICSD sources for selected compounds
  -
    filter:
      - task_id = "mp-540081"
    constraints:
      - icsd_id size> 10
  -
    filter:
      - task_id = "mp-20379"
    constraints:
      - icsd_id size 1
  -
    filter:
      - task_id = "mp-13634"
    constraints:
      - icsd_id size> 0
  -
    filter:
      - task_id = "mp-600022"
    constraints:
      - icsd_id size 0
  # NiO2 phases should never become stable
  -
    filter:
      - e_above_hull = 0
    constraints:
      - pretty_formula != 'NiO2'
tasks:
  -
    filter:
      - state = "successful"
    constraints:
      - output.final_energy_per_atom <= 0
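Each constraint in the rules above reads as a "field operator value" triple applied to every record that matches the filter. The following is a minimal Python sketch of how such checks might be evaluated against nested records. It is not the pymatgen-db implementation: the helper names (`get_path`, `check`, `violations`) and the sample records are hypothetical, and only numeric comparisons and the `size` family of operators are handled (string comparisons, aliases, and `size$` are left out).

```python
import operator

# Hypothetical helper names; an illustration only, not pymatgen-db's API.
OPS = {
    "<=": operator.le, ">=": operator.ge, "<": operator.lt,
    ">": operator.gt, "=": operator.eq, "!=": operator.ne,
}

def get_path(doc, dotted):
    """Follow a dotted path such as 'output.final_energy_per_atom'."""
    for key in dotted.split("."):
        doc = doc[key]
    return doc

def check(doc, constraint):
    """Return True if `doc` satisfies one 'field op value' constraint."""
    field, op, raw = constraint.split(None, 2)
    value = get_path(doc, field)
    if op.startswith("size"):
        # 'size' compares the length of a list field; 'size>' uses '>'.
        compare = OPS[op[4:] or "="]
        return compare(len(value), int(raw))
    return OPS[op](value, float(raw))

def violations(docs, constraints):
    """Yield (task_id, constraint) for every failed check."""
    for doc in docs:
        for c in constraints:
            if not check(doc, c):
                yield doc.get("task_id"), c

# Made-up records for illustration; not real Materials Project data.
docs = [
    {"task_id": "mp-1", "final_energy": -3.2, "icsd_id": [1, 2]},
    {"task_id": "mp-2", "final_energy": 0.4, "icsd_id": []},
]
rules = ["final_energy < 0", "icsd_id size> 0"]
failed = list(violations(docs, rules))
print(failed)
```

A nightly job could run a loop like `violations` over each collection and turn the resulting pairs into the emailed report.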

Validation summary

Easy-to-use, integrated, efficient tools to report errors

Next steps
– Record all check results in DB
– More sophisticated checks (Map/Reduce)
– Make it easier to add new checks internally
– Make it easier for anyone to add new checks
  • per-sandbox or even per-user ("MP Alerts")

Provenance: How do I know that the data is correct?

Types of provenance in the system

1) Calculation workflows
– FireWorks records calculation inputs, .. results in great detail

2) External datasets
– Structure Notation Language standardizes the naming of data sources and publications

3) Post-calculation data transformations
– New "builders" provide a framework for tracking the creation of final database products
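The third type, post-calculation transformations, comes down to recording which inputs and which processing step produced each derived record. Here is a small Python sketch of that idea, under stated assumptions: the function `build_material`, the `_provenance` field, and the sample task records are all hypothetical and do not reflect the actual Materials Project builders framework.

```python
from datetime import datetime, timezone

def build_material(tasks, builder_name="example_builder"):
    """Derive one material record from task records and note its origin.

    Hypothetical sketch: the '_provenance' layout is an assumption.
    """
    # Pick the lowest-energy task as the source of the final value.
    best = min(tasks, key=lambda t: t["final_energy_per_atom"])
    return {
        "final_energy_per_atom": best["final_energy_per_atom"],
        "_provenance": {
            "builder": builder_name,
            "source_task_ids": sorted(t["task_id"] for t in tasks),
            "chosen_task_id": best["task_id"],
            "built_at": datetime.now(timezone.utc).isoformat(),
        },
    }

# Made-up task records for illustration only.
tasks = [
    {"task_id": "task-a", "final_energy_per_atom": -4.1},
    {"task_id": "task-b", "final_energy_per_atom": -4.3},
]
material = build_material(tasks)
print(material["_provenance"]["chosen_task_id"])
```

Storing the source identifiers next to the derived value means a later reader can trace a database product back to the calculations it came from.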

Provenance is available for every material

[Figure: callouts (1), (2), and (3)]

Provenance in DB: Structure Notation Language

"snl_final": {
  "about": {
    "created_at": {
      "string": "2014-02-22 19:07:00.383869",
      "@class": "datetime",
      "@module": "datetime"
    },
    "_materialsproject": {
      "submission_id": 52621,
      "snl_id": 398676,
      "spacegroup": {
        "lattice_type": "tetragonal",
        "symbol": "P4_2/mmc",
        "number": 131,
        "point_group": "4/mmm",
        "crystal_system": "tetragonal",
        "hall": "-P 4c 2"
      }
    },
    "_cedergroup": {
      "BURP_sids": [
        409544,
        409545,
        409546
      ],
      "icsd_ids": [
      ],
      "e_above_hull": 0.075125350000000423734
    },
    "references": "",
    "authors": [
      {
        "name": "Geoffroy Hautier",
        "email": "geoffroy.hautier@uclouvain.be"
      },
      {
        "name": "Bo Xu",
        "email": "boxu14@mit.edu"
      }
    ],
    "remarks": [
      "supplementary compounds from MIT matgen database"
    ],
    "projects": [
      "MIT matgen"
    ],
    "history": [
      {
        "url": "http://www.fiz-karlsruhe.de/icsd_home.html",
        "name": "Inorganic Crystal Structure Database",
        "description": {
          "Collection code": 24692
        }
      },
      {
        "url": "",
        "name": "",
        "description": {
          "source": null,
          "orig_name": "Basic substitution code.",
          "formula": "O1 Pd1"
        }
      },
      {
        "url": "http://ceder.mit.edu/",
        "name": "MIT Ceder group research database",
        "description": {
          "source": 105986,
          "orig_name": "",
          "formula": "FeO"
        }
      },
      {
        "url": "http://www.materialsproject.org",
        "name": "Materials Project structure optimization",
        "description": {
          "fw_id": 820305,
          "task_type": "GGA optimize structure (2x)",
          "task_id": "mp-753682"
        }
      },
      {
        "url": "http://www.materialsproject.org",
        "name": "Materials Project structure optimization",
        "description": {
          "fw_id": 820308,
          "task_type": "GGA+U optimize structure (2x)",
          "task_id": "mp-776678"
        }
      }
    ]
  },

[Figure annotations on the record above: Metadata, Crystal, DB sources, References, History of structure optimizations]
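The `history` list in the record above can be walked to print a readable provenance trail. The sketch below uses two entries copied from that record; the function name `summarize_history` is hypothetical, and treating the list order as chronological is an assumption, not something the slides state.

```python
import json

# Two history entries copied from the snl_final record shown above.
record = json.loads("""
{"about": {"history": [
  {"url": "http://www.fiz-karlsruhe.de/icsd_home.html",
   "name": "Inorganic Crystal Structure Database",
   "description": {"Collection code": 24692}},
  {"url": "http://www.materialsproject.org",
   "name": "Materials Project structure optimization",
   "description": {"fw_id": 820305,
                   "task_type": "GGA optimize structure (2x)",
                   "task_id": "mp-753682"}}
]}}
""")

def summarize_history(snl):
    """Return one line per history entry, in list order (assumed chronological)."""
    lines = []
    for step, node in enumerate(snl["about"]["history"], start=1):
        # Some entries in the source record have an empty name.
        name = node.get("name") or "(unnamed source)"
        details = ", ".join(f"{k}={v}" for k, v in node["description"].items())
        lines.append(f"{step}. {name}: {details}")
    return lines

summary = summarize_history(record)
print("\n".join(summary))
```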

Future work: unified view of provenance

[Diagram labels: ICSD, Data import processing, Computation, VASP result (x3), Post-processing, Material properties (e.g., Defects)]

Sandbox example: Multivalent

[Diagram labels: JCESR users, Non-JCESR users, Multivalent app]

Sandboxes = Database + Apps

[Diagram labels: Core data (Non-JCESR users); Core data + multivalent materials (JCESR users)]

Technical challenges

• Pre-process data for real-time search

• Interfaces for per-user access control
– https://materialsproject.org/materials/1234?sandbox=jcesr
– Web UI elements
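The URL form listed under per-user access control suggests that the sandbox is selected by a query parameter. A minimal Python sketch of how a server might read that parameter and decide access follows. The membership table, the user names, and the rule that requests without a sandbox see only public core data are assumptions made for illustration; the slides do not describe the actual mechanism.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical membership table; a real system would query a user database.
SANDBOX_MEMBERS = {"jcesr": {"alice"}}

def requested_sandbox(url):
    """Return the sandbox named in the URL, or None for core data."""
    values = parse_qs(urlparse(url).query).get("sandbox")
    return values[0] if values else None

def can_view(user, url):
    """Assume core data is public and sandbox data needs membership."""
    sandbox = requested_sandbox(url)
    if sandbox is None:
        return True
    return user in SANDBOX_MEMBERS.get(sandbox, set())

url = "https://materialsproject.org/materials/1234?sandbox=jcesr"
print(can_view("alice", url), can_view("bob", url))
```

Keeping the sandbox name in the URL lets the same page serve both audiences, with the access check deciding which database view backs it.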

Future: dynamic sandbox creation

Current:
– Large & significant additional data / apps
  • e.g., JCESR
– Longer-term connections to MP data
  • e.g., porous materials
– Companies
  • e.g., VW/Stanford

Future:
– small collab.
– per-user?
– CoD?

Summary

• Validation
– guard against bugs by checking all data daily and at data import/creation time

• Provenance
– universal standard for annotating data provenance

• Sandboxes
– unified view of distinct databases
– onramp for new collaborations and data
