Materials Project Validation, Provenance, and Sandboxes by Dan Gunter
DESCRIPTION
Summary of goals, progress, and next steps for these three aspects of the Materials Project (materialsproject.org) infrastructure:
* Validation: constantly guard against bugs in core data and imported data
* Provenance: know how data came to be
* Sandboxes: combine public and non-public data; "good fences make good neighbors"
Presenter: Dan Gunter, LBNL
TRANSCRIPT
The Materials Project Validation, Provenance and Sandboxes
Goals
• Validation: constantly guard against bugs in core data and imported data
• Provenance: know how data came to be
• Sandboxes: combine public and non-public data; "good fences make good neighbors"
Validation
(Internal) Database ID
External ID
What we expected
What we got
Validation runs all the time
• Rules with "constraints" for every database (and sandbox)
• Test constraints against the entire DB every night and email the reports
• Validation engine and related tools are all open-source software in pymatgen-db
Remote server
Validation engine
Rules
MP Databases
Reports (email, web pages, ...)
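The nightly loop described on this slide (apply rules to every record, then report what was expected versus what was found) can be sketched roughly as follows. This is a minimal illustration and not the pymatgen-db engine: the `Rule` and `Violation` classes, the `validate` and `format_report` functions, and the sample records are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List


@dataclass
class Rule:
    field: str                    # field to inspect in each record
    expected: str                 # human-readable form of the constraint
    check: Callable[[Any], bool]  # returns True when the value is acceptable


@dataclass
class Violation:
    db_id: Any        # internal database ID
    external_id: Any  # external ID (here assumed to be task_id)
    expected: str     # what we expected
    got: Any          # what we got


def validate(docs: Iterable[dict], rules: List[Rule]) -> List[Violation]:
    """Test every rule against every record and collect the failures."""
    violations = []
    for doc in docs:
        for rule in rules:
            value = doc.get(rule.field)
            # A missing field counts as a failure in this sketch.
            if value is None or not rule.check(value):
                violations.append(
                    Violation(doc.get("_id"), doc.get("task_id"),
                              f"{rule.field} {rule.expected}", value)
                )
    return violations


def format_report(violations: List[Violation]) -> str:
    """Render a plain-text report suitable for an email body."""
    lines = [f"{len(violations)} violation(s) found"]
    for v in violations:
        lines.append(
            f"db_id={v.db_id} external_id={v.external_id} "
            f"expected: {v.expected} got: {v.got}"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical records, not real Materials Project data.
    docs = [
        {"_id": 1, "task_id": "mp-1", "final_energy": -3.2},
        {"_id": 2, "task_id": "mp-2", "final_energy": 0.5},
    ]
    rules = [Rule("final_energy", "< 0", lambda v: v < 0)]
    print(format_report(validate(docs, rules)))
```

In a real deployment the records would come from the MP databases and the report would be mailed or published as a web page; both steps are left out here.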
Rules have a simple syntax
_aliases:
  - snl_id = mps_id
  - energy = analysis.e_above_hull
materials:
  -
    filter:
    constraints:
      - final_energy_per_atom <= 0
      - initial_structure.lattice.volume > 0
      - initial_structure.lattice.a > 0
      - initial_structure.lattice.b > 0
      - initial_structure.lattice.c > 0
      - initial_structure.lattice.matrix size 3
      - formation_energy_per_atom <= 5
      - formation_energy_per_atom > -5
      - cpu_time > 5
      - e_above_hull > -0.000001
      - final_energy < 0
      - reduced_cell_formula size$ nelements
  # Check num. ICSD sources for selected compounds
  -
    filter:
      - task_id = "mp-540081"
    constraints:
      - icsd_id size> 10
  -
    filter:
      - task_id = "mp-20379"
    constraints:
      - icsd_id size 1
  -
    filter:
      - task_id = "mp-13634"
    constraints:
      - icsd_id size> 0
  -
    filter:
      - task_id = "mp-600022"
    constraints:
      - icsd_id size 0
  # NiO2 phases should never become stable
  -
    filter:
      - e_above_hull = 0
    constraints:
      - pretty_formula != 'NiO2'
tasks:
  -
    filter:
      - state = "successful"
    constraints:
      - output.final_energy_per_atom <= 0
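To make the rule syntax concrete, here is a small sketch of how a single "field operator value" constraint could be evaluated against one record. It is an illustration only, not the pymatgen-db parser: the `check`, `get_path`, and `parse_value` functions are hypothetical, only a subset of the operators shown on the slide is handled (`size$` is not), and the alias handling assumes that an alias maps a short name to a full field path.

```python
import operator
import re

# Operators handled by this sketch (a subset of those on the slide).
OPS = {
    "<=": operator.le,
    ">=": operator.ge,
    "!=": operator.ne,
    "<": operator.lt,
    ">": operator.gt,
    "=": operator.eq,
    "size": lambda value, n: len(value) == n,
    "size>": lambda value, n: len(value) > n,
}

# Longer operators are listed first so "size>" and "<=" win over "size" and "<".
PATTERN = re.compile(r"^\s*(\S+)\s+(size>|size|<=|>=|!=|<|>|=)\s+(.+?)\s*$")


def get_path(doc, path):
    """Follow a dotted path such as 'initial_structure.lattice.a'."""
    for key in path.split("."):
        doc = doc[key]
    return doc


def parse_value(text):
    """Quoted text becomes a string; anything else is read as a number."""
    if text[0] in "'\"" and text[-1] == text[0]:
        return text[1:-1]
    return float(text)


def check(doc, constraint, aliases=None):
    """Return True if the record satisfies the constraint string."""
    match = PATTERN.match(constraint)
    if match is None:
        raise ValueError(f"cannot parse constraint: {constraint!r}")
    field, op, raw = match.groups()
    field = (aliases or {}).get(field, field)  # assumed alias semantics
    return OPS[op](get_path(doc, field), parse_value(raw))


if __name__ == "__main__":
    # Hypothetical record with the same field names as the rules above.
    doc = {
        "final_energy_per_atom": -1.5,
        "icsd_id": [],
        "pretty_formula": "FeO",
        "analysis": {"e_above_hull": 0.0},
    }
    print(check(doc, "final_energy_per_atom <= 0"))
    print(check(doc, "icsd_id size> 0"))
    print(check(doc, "energy > -0.000001", {"energy": "analysis.e_above_hull"}))
```

Keeping each constraint as a one-line string is what lets non-programmers add checks; the evaluator only needs a field lookup and an operator table.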
Validation summary
Easy-to-use, integrated, efficient tools to report errors
Next steps
– Record all check results in DB
– More sophisticated checks (Map/Reduce)
– Make it easier to add new checks internally
– Make it easier for anyone to add new checks
  • per-sandbox or even per-user ("MP Alerts")
Provenance: How do I know that the data is correct?
Types of provenance in the system
1) Calculation workflows
   – FireWorks records calculation inputs, ..., results in great detail
2) External datasets
   – Structure Notation Language standardizes the naming of data sources and publications
3) Post-calculation data transformations
   – New "builders" provide a framework for tracking creation of final database products
Provenance is available for every material
Provenance in DB: Structure Notation Language
"snl_final": {
  "about": {
    "created_at": {
      "string": "2014-02-22 19:07:00.383869",
      "@class": "datetime",
      "@module": "datetime"
    },
    "_materialsproject": {
      "submission_id": 52621,
      "snl_id": 398676,
      "spacegroup": {
        "lattice_type": "tetragonal",
        "symbol": "P4_2/mmc",
        "number": 131,
        "point_group": "4/mmm",
        "crystal_system": "tetragonal",
        "hall": "-P 4c 2"
      }
    },
    "_cedergroup": {
      "BURP_sids": [
        409544,
        409545,
        409546
      ],
      "icsd_ids": [
      ],
      "e_above_hull": 0.075125350000000423734
    },
    "references": "",
    "authors": [
      {
        "name": "Geoffroy Hautier",
        "email": "[email protected]"
      },
      {
        "name": "Bo Xu",
        "email": "[email protected]"
      }
    ],
    "remarks": [
      "supplementary compounds from MIT matgen database"
    ],
    "projects": [
      "MIT matgen"
    ],
    "history": [
      {
        "url": "http://www.fiz-karlsruhe.de/icsd_home.html",
        "name": "Inorganic Crystal Structure Database",
        "description": {
          "Collection code": 24692
        }
      },
      {
        "url": "",
        "name": "",
        "description": {
          "source": null,
          "orig_name": "Basic substitution code.",
          "formula": "O1 Pd1"
        }
      },
      {
        "url": "http://ceder.mit.edu/",
        "name": "MIT Ceder group research database",
        "description": {
          "source": 105986,
          "orig_name": "",
          "formula": "FeO"
        }
      },
      {
        "url": "http://www.materialsproject.org",
        "name": "Materials Project structure optimization",
        "description": {
          "fw_id": 820305,
          "task_type": "GGA optimize structure (2x)",
          "task_id": "mp-753682"
        }
      },
      {
        "url": "http://www.materialsproject.org",
        "name": "Materials Project structure optimization",
        "description": {
          "fw_id": 820308,
          "task_type": "GGA+U optimize structure (2x)",
          "task_id": "mp-776678"
        }
      }
    ]
  },
Metadata
Crystal
DB sources
References
History of structure optimizations
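Because the "history" list in the record above is ordered, a provenance trail can be read off by walking it from first entry to last. The sketch below shows one way to do that on a plain dictionary with the same keys as the slide's "about" block. The `provenance_chain` function is a hypothetical helper for this example, not part of pymatgen, and the sample record is an abbreviated copy of the one shown.

```python
def provenance_chain(about):
    """Summarize each history step as 'source name: key=value, ...'."""
    chain = []
    for step in about.get("history", []):
        name = step.get("name") or "(unnamed source)"
        description = step.get("description") or {}
        # Skip empty or null fields so the summary stays readable.
        details = ", ".join(
            f"{key}={value}"
            for key, value in description.items()
            if value not in (None, "")
        )
        chain.append(f"{name}: {details}" if details else name)
    return chain


# Abbreviated copy of the "about" block from the slide.
about = {
    "history": [
        {
            "url": "http://www.fiz-karlsruhe.de/icsd_home.html",
            "name": "Inorganic Crystal Structure Database",
            "description": {"Collection code": 24692},
        },
        {
            "url": "",
            "name": "",
            "description": {
                "source": None,
                "orig_name": "Basic substitution code.",
                "formula": "O1 Pd1",
            },
        },
        {
            "url": "http://www.materialsproject.org",
            "name": "Materials Project structure optimization",
            "description": {
                "fw_id": 820305,
                "task_type": "GGA optimize structure (2x)",
                "task_id": "mp-753682",
            },
        },
    ]
}

if __name__ == "__main__":
    for number, line in enumerate(provenance_chain(about), start=1):
        print(f"{number}. {line}")
```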
Future work: unified view of provenance
VASP result
ICSD
VASP result
VASP result
Post-processing
Material properties
Computation
Data import processing
e.g., Defects
Sandbox example: Multivalent
JCESR users
Non-JCESR users
Multivalent app
Sandboxes = Database + Apps
Non-JCESR users: core data
JCESR users: core data + multivalent materials
Technical challenges
• Pre-process data for real-time search
• Interfaces for per-user access control
  – https://materialsproject.org/materials/1234?sandbox=jcesr
  – Web UI elements
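One way to picture the per-user access control behind a URL such as the one above: read the `sandbox` query parameter, confirm the user belongs to that sandbox, and only then add the sandbox data to the core data being searched. The sketch below is a guess at the general shape, not the Materials Project implementation; the `requested_sandbox` and `visible_collections` functions, the "core" label, and the membership sets are assumptions made for this example.

```python
from urllib.parse import parse_qs, urlparse

CORE = "core"  # hypothetical label for the public core data


def requested_sandbox(url):
    """Return the value of the 'sandbox' query parameter, or None."""
    query = parse_qs(urlparse(url).query)
    values = query.get("sandbox")
    return values[0] if values else None


def visible_collections(url, user_sandboxes):
    """List the data sets a request may read.

    Everyone sees the core data. Sandbox data is added only when the
    request names a sandbox and the user is a member of it.
    """
    sandbox = requested_sandbox(url)
    if sandbox is None:
        return [CORE]
    if sandbox not in user_sandboxes:
        raise PermissionError(f"user is not a member of sandbox {sandbox!r}")
    return [CORE, sandbox]


if __name__ == "__main__":
    url = "https://materialsproject.org/materials/1234?sandbox=jcesr"
    print(visible_collections(url, {"jcesr"}))  # a JCESR user
    print(visible_collections("https://materialsproject.org/materials/1234", set()))
```

Rejecting the request outright, rather than silently falling back to core data, makes a missing membership visible to the user; either choice would be defensible.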
Future: dynamic sandbox creation
Current:
– Large & significant additional data / apps
  • e.g., JCESR
– Longer-term connections to MP data
  • e.g., porous materials
– Companies
  • e.g., VW/Stanford
Future:
– Small collaborations
– Per-user?
– CoD?
Summary
• Validation: guard against bugs by checking all data daily and at data import/creation time
• Provenance: universal standard for annotating data provenance
• Sandboxes
  – unified view of distinct databases
  – onramp for new collaborations and data