Crystallography Open Database (COD)
Saulius Gražulis, Andrius Merkys, and Antanas Vaitkus
Contents
1 Introduction
2 A Short History of COD
3 Scope and Contents of the COD
4 COD Data Semantics and Selection
5 Accessing the COD
5.1 Web Access to the COD
5.2 Using the RESTful Interfaces
5.3 Querying SQL Database
6 COD Applications
7 Conclusions
References
Abstract
The Crystallography Open Database (COD, http://crystallography.net/) is, as of the time of writing, the largest open-access collection of mineral, metal-organic, organometallic, and small organic crystal structures, excluding biomolecules, which are stored separately in the Protein Data Bank (http://wwpdb.org/). Unlike other existing chemical crystal structure databases, the COD is fully open: all its structures may be downloaded, used, and redistributed without restriction, along with the results derived from them. Currently, the COD contains >385,000 records and is growing constantly, encompassing most structures published in the peer-reviewed academic press as well as donations by individual researchers. This article describes how data are organized in the COD and how the database can be queried, downloaded, and processed for various purposes.

S. Gražulis · A. Merkys · A. Vaitkus
Department of Protein-DNA Interactions, Vilnius University Institute of Biotechnology, Vilnius, Lithuania
e-mail: [email protected]

© Springer International Publishing AG, part of Springer Nature 2018
W. Andreoni, S. Yip (eds.), Handbook of Materials Modeling, https://doi.org/10.1007/978-3-319-42913-7_66-1
1 Introduction
X-ray crystallography is an extremely powerful method for determining the inner structure of condensed matter. Soon after the discovery of X-rays (Röntgen 1896) and the first records of their diffraction on crystalline samples (Friedrich et al. 1912a,b), the number of structures determined by this technique started to grow. An explanation of X-ray scattering from first principles (Bragg and Bragg 1913; Bragg 1913) allowed structural models to be determined in a uniform way for a vast variety of solid materials, from simple inorganics to very large biomolecules. As more and more crystal structures appeared, it became evident that the numbers in their descriptions (such as crystal unit cell parameters and atomic coordinates), made uniform by the availability of a common scattering theory, possess great value in themselves, and efforts to collect them systematically were started. The first collections were in paper form (Hermann and Ewald 1931; IUCr 2017c; A. I. Kitajgorodskij 1955), and numeric data accompanied crystallographic publications in journals dedicated to this field from the very first publications (for instance, in the Acta Crystallographica journal started by the IUCr in 1948 (Clews and Cochran 1948)).
The growing availability and power of electronic computers allowed crystallographers to use them for structure determination and prompted the idea that crystal structure data can also be handled automatically (Brown and McMahon 2002). A first dedicated crystallographic database, the CSD, was established by the CCDC in 1965 (Groom and Allen 2014) to collect structures of small organic molecules, and it embraced computer-assisted methods for information storage and retrieval (Allen et al. 1979). Data about inorganic crystals (Kaduk 2002), alloys (White et al. 2002), and powder diffraction data (Kabekkodu et al. 2002) were historically kept in separate archives. Today, we have a whole range of databases, differing in scope, size, and licensing model, covering various aspects of crystallographic data (Table 1).
As seen from Table 1, various licensing models have been employed to support the operation of the databases. About a third of all resources, including some of the oldest and largest ones, use a subscription-based model, where a user must agree to a license and is restricted in what he or she may do with the data obtained from the resource. As long as the main vehicles of database dissemination were paper editions or magnetic tape reels that could be used only in computer centers, this situation seemed fairly acceptable. In the epoch of ubiquitous computer access and with the advent of the Internet, however, researchers expressed concerns that certain licensing clauses are overly restrictive. In particular, the restriction on disseminating derived results was mentioned as an impediment to scientific work (Baldi et al. 2011; Andronico et al. 2011). As a result, several modern databases were created anew, following an open-access dissemination model, and in certain cases can be used in situations where licensing requirements are too restrictive (Sadowski and Baldi 2013). Among them, the Crystallography Open Database (COD) is currently the largest and the oldest open resource of small molecule crystal structures, providing access to data in mineralogy and chemical crystallography and placing its entire collection in the public domain.

Table 1 Overview of largest crystallographic databases, their subject areas, sizes and licensing models

| No. | Database | Records | License | Current URL | Est. | Reference |
|-----|----------|---------|---------|-------------|------|-----------|
| 1 | PDF | 380,000 | Subscription based | http://www.icdd.com/products/pdf4.htm | 1941 | Faber and Fawcett (2002) |
| 2 | CSD | 800,000 | Subscription based | http://www.ccdc.cam.ac.uk/solutions/csd-system/components/csd/ | 1965 | Groom et al. (2016) |
| 3 | PDB | 124,000 | Open access | http://www.rcsb.org/pdb | 1971 | Protein Data Bank (1971); Berman et al. (2012) |
| 4 | ICSD | 200,000 | Subscription based | https://icsd.fiz-karlsruhe.de/ | 1987 | Belsky et al. (2002) |
| 5 | NDB | 8600 | Open access | http://ndbserver.rutgers.edu/ | 1992 | Berman et al. (1992); Narayanan et al. (2014) |
| 6 | Pauling file | 290,000 | Subscription based | http://paulingfile.com http://crystdb.nims.go.jp/index_en.html | 1995 | Villars et al. (1998, 2004) |
| 7 | IZA Zeolite database | 176 | Open access | http://www.iza-structure.org/databases/ | 1996 | Baerlocher et al. (2007) |
| 8 | CRYSTMET | 170,000 | Subscription based | http://www.TothCanada.com https://cds.dl.ac.uk/cgi-bin/news/disp?crystmet | 1996 | White et al. (2002) |
| 9 | Bilbao server | | | http://www.cryst.ehu.es | 1997 | Aroyo et al. (2011) |
| 10 | AMCSD | 20,000 | Open access | http://rruff.geo.arizona.edu/AMS/amcsd.php | 2003 | Downs and Hall-Wallace (2003); Rajan et al. (2006) |
| 11 | COD | 367,000 | Public domain | http://www.crystallography.net/cod | 2003 | Gražulis et al. (2009, 2012) |
| 12 | PCOD | 1,000,000 | Public domain | http://www.crystallography.net/pcod | 2003 | Le Bail (2005) |
| 13 | MPOD | 300 | Public domain | http://mpod.cimav.edu.mx | 2010 | Pepponi et al. (2012) |
| 14 | B-IncStrDB (Bilbao Incommensurate Structures Database) | 140 | Open access | http://webbdcrista1.ehu.es/incstrdb/ | 2010 | Aroyo et al. (2006) |
| 15 | TCOD | 2,600 | Public domain | http://www.crystallography.net/tcod | 2013 | Merkys et al. (2017); Chateigner et al. (2015) |
| 16 | RRUFF | 47,000 | Open access | http://rruff.info/ | 2015 | Lafuente et al. (2015) |
| 17 | MAGNDATA (Bilbao Magnetic Structure Database) | 428 | Open access | http://webbdcrista1.ehu.es/magndata/ | 2015 | Perez-Mato et al. (2015) |
2 A Short History of COD
The COD project started as a community initiative, when crystallographers on the SDPD (Structure Determination by Powder Diffraction) discussed possible modes of crystallographic data dissemination. It was 2003: computers were becoming cheap, Internet connections widely available, and free/libre open source software (F/LOSS) ubiquitous. Armel Le Bail raised the question of whether it is possible to build, by joining community efforts, a crystallographic database that is entirely open and free for everyone to use. Answering that question, Michael Berndt (1964–2003) listed three conditions that were necessary and sufficient for community resource creation and curation: “A small team of engaged scientists with some experience in database and software design to coordinate the project; the authors (i.e., the scientific community = you) who provide the project with database entries /. . . /; free software (a) for maintaining the database, (b) for data evaluation and calculation of derived data.” With this plan in mind, the COD project started and turned out to be a viable alternative to top-down, heavily funded database projects. From 2003 to 2007, the COD database master copy was maintained by Armel Le Bail at the Le Mans University in France. In 2007 its collection of 50,000 records was moved to the Institute of Biotechnology in Vilnius, Lithuania, where software development for the COD and database maintenance were continued. When the Institute of Biotechnology was merged with Vilnius University in 2011, COD development was continued by a joint team from the Vilnius University Institute of Biotechnology and the Faculty of Mathematics and Informatics.
Despite the several transfers of maintainership, the COD is governed by an international COD Advisory Board (AB), listed on the COD Web site and operating via the mailing list. The COD AB establishes the COD data management policies and sets inclusion criteria for the COD data. In this way, continuity of database quality is maintained.
In the 10 years since 2007, the COD grew constantly and reached >385,000 records in 2017 (Fig. 1). This was made possible by the introduction of the new data deposition Web site (Fig. 2), which allows both manual and automatic uploads of data to the COD, and by the development of automated data collection and deposition software that deposits available structures to the COD automatically. This automation in turn is greatly facilitated by the introduction of the Crystallographic Information Framework (CIF) (Hall et al. 1991; IUCr 2017b). The CIF was initially used to facilitate crystallographic paper publication and to reduce typing errors in data by providing automated means of crystallographic data processing (Brown and McMahon 2002). Introduction of electronic data handling in the publication process significantly reduced typing errors in data publication, a significant step towards reliable data reuse. Moreover, the availability of crystal structure descriptions in a standardized, machine-readable form as supplementary material for scientific publications greatly facilitated reuse of that data. As a result, the COD data acquisition subsystem can automatically ingest all necessary values and formulate structure description records, using information publicly available with the IUCr publications and from journals of some other publishers that make the necessary information publicly available. The same CIF framework makes it possible for the COD to present its entire data collection in a widely accepted, standardized form, so that researchers can use the same software to process the COD CIFs as for the outputs of structure determination programs or files from journal Web pages.

Fig. 1 Growth of the COD database by year (horizontal axis: year, 2008–2017; vertical axis: COD record number, 0–400,000)
The data collection procedure conducted by the COD is not completely straightforward, though. Virtually all structures, even though represented in standard CIF format next to the publications, lack essential metadata such as publication bibliography; sometimes computed items such as cell volumes or space group names are missing or presented in a non-standardized form. Such information is automatically inserted by the COD data processing pipeline (Gražulis et al. 2009). Moreover, a non-negligible fraction of supplementary files, although containing the necessary data in a form similar to CIF, does not strictly follow the CIF syntax. Since the number of such cases was too large to be corrected manually, an error-correcting CIF parser was implemented (Merkys et al. 2016). The same procedure is followed when data are deposited by researchers into the COD using the Web deposition interface (Gražulis et al. 2012). In this way the COD ensures that all structure descriptions that enter its collection are syntactically correct, i.e., conform to the syntax defined by the IUCr (2017a).
Fig. 2 The Crystallography Open Database Web site
With this setup, the COD is ready to grow further, to provide open access to crystal structure data for researchers and all interested parties, and to evolve to meet the challenges of the new millennium. The computing landscape changes rapidly, with new techniques, languages, formats, and protocols coming and going every day, and computer architectures changing so fast that any reasonable scientific archive must outlive many generations of computer software and hardware. The basic principles of the COD design and the successful operation of the COD for more than a decade suggest that the methods chosen by the COD founders were sound and that the COD will successfully evolve into the future.
3 Scope and Contents of the COD
The COD collects machine-readable descriptions of crystal structures for inorganic compounds, minerals, small organic molecules, and metal-organic and organometallic compounds. Proteins, nucleic acids and their complexes, glycoproteins, and the like are as a rule excluded from the COD, since they are systematically collected in an open-access database, the Protein Data Bank (PDB) (Berman et al. 2012). Most of the “small molecule” structures in the COD are refined under the assumption of independent atom parameters (using full-matrix least squares refinement) and a spherical atom model. This makes the COD suitable, for example, for generating restraints on molecular geometries and for refining larger molecules or molecular assemblies (Long et al. 2017a,b). We must note, however, that this assumption does not necessarily hold for all COD entries. For larger entries, or when disorder is present, restraints can be put by the authors on the thermal displacement parameters. For structures solved using powder diffraction techniques, restraints on bond lengths and angles can also be used. Finally, some structures in the COD are solved by hybrid methods, using powder diffraction to carry out Rietveld refinement and DFT to further refine atomic parameters; some structures are reported entirely based on DFT calculations. Obviously, determining bond length and angle parameters from restrained structures would result in circular reasoning, since the same restraints were already used during the structure refinement process. Thus, the user is advised to inspect structure determination parameters and to select those structures that are suitable for his or her work.
4 COD Data Semantics and Selection
To facilitate structure selection, the COD maintains a set of flags that describe the experimental and refinement techniques used for structure determination. In the COD SQL database, the “method” column of the “data” table describes the experimental technique, which can be “single crystal”, “powder diffraction”, or “theoretical”. If the value of this column is NULL, the method is most probably single crystal diffraction. Unfortunately, in many structures the most popular method, “single crystal”, is not mentioned explicitly, so this assumption is to some extent a guess; but the structures solved by “powder diffraction” or “theoretical” methods are usually marked more accurately and are less numerous, so the guess should be reasonably safe. Structures marked as “theoretical” are in fact solved by DFT computations without using any structure-specific experimental data. These structures are of course more appropriate for a different database, the TCOD, which is dedicated to theoretical structures, and most of them are likely deposited there as well. They ended up in the COD because they were provided as supplementary material to some papers and were not marked as theoretical; only later data curation revealed their determination method. Several important theoretical structures, e.g., from the DFT method error estimate studies (Lejaeghere et al. 2014), were deposited to the COD before the TCOD was fully operational, since they were deemed important enough to require permanent storage in a database. Since the COD policy is not to delete any records, so that COD IDs remain stable once assigned, such entries are marked with appropriate flags rather than removed.
Further, the COD database tables contain several fields describing experimental techniques, taken from the IUCr Core CIF dictionary. The “radiation”, “radType”, and “radSymbol” columns of the “data” table are derived directly from the CIF data items _diffrn_radiation_probe, _diffrn_radiation_type, and _diffrn_radiation_xray_symbol, respectively. These data items allow distinguishing between structures obtained from X-ray, neutron, and electron diffraction data (the “radiation” column can have values “x-ray”, “neutron”, or “electron” for the respective radiation types). Again, as with the “single crystal” value, the most popular radiation type, “x-ray”, is often not marked and is thus represented as a NULL value. We can expect that authors are more attentive when they submit a structure obtained by a less common method, but a certain caution is of course appropriate.
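The NULL conventions described above for the “method” and “radiation” columns can be made explicit in a query. The following is a minimal sketch rather than part of the COD tooling: it assumes the server, user, table, and column names that appear in this chapter's own listings, while the function names, the COD_DRY_RUN variable, and the "(assumed)" labels are hypothetical.

```shell
#!/bin/bash
# Sketch only: function names, COD_DRY_RUN, and the "(assumed)" labels are
# illustrative; server, user, table, and column names follow the chapter.

# Print a query that tabulates entries by method and radiation type,
# replacing NULLs with the defaults the text describes as most probable.
cod_method_radiation_sql() {
    cat <<'EOF'
select coalesce(method, "single crystal (assumed)") as method_guess,
       coalesce(radiation, "x-ray (assumed)") as radiation_guess,
       count(*) as n
from data
group by method_guess, radiation_guess
order by n desc
EOF
}

# Run the query; with COD_DRY_RUN set, only print the command.
cod_method_radiation() {
    ${COD_DRY_RUN:+echo} mysql -h sql.crystallography.net cod -u cod_reader -t \
        -e "$(cod_method_radiation_sql)"
}
```

Labeling the substituted values as "(assumed)" keeps the guessed rows distinguishable from rows where the authors stated the method or radiation type explicitly.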
When selecting records from the COD, one must keep in mind certain bookkeeping data items. Some structures deposited to the COD are deliberately worse than the best possible interpretation; this is usually done in publications to demonstrate that the main interpretation of the data offered by the authors is correct or indeed the best one. COD policy is to include such structures (so that the claims of the paper can be easily verified) but to mark them as “suboptimal.” In COD CIFs, such structures are marked with the _cod_suboptimal_structure yes and _cod_related_optimal_struct data items, and in the COD “data” table they have a non-NULL value in the “optimal” column pointing to the related optimal structure. Unless explicit comparison of suboptimal and optimal structures is sought, only structures with NULL “optimal” values should be selected.
Another issue is structures that contain known problems. Again, the COD policy is not to remove such structures once they have been included in the COD, but to flag them appropriately. This flag is recorded in the “status” column of the COD database “data” table. Possible values for this column are “warnings”, “errors”, and “retracted”. The “warnings” level indicates that the structure might be correct after all but contains strange features, an unusual description, or wrong metadata. The “errors” level marks structures that either have been proven wrong by subsequent published observations or authors’ corrigenda, or contain serious data consistency problems that prevent correct interpretation of the structure. In all cases, _cod_error_description gives a human-readable description of the problem. Finally, the “retracted” value in the “status” column indicates that the structure was retracted and should not be used under any circumstances. The reasons for retraction may vary, but usually this flag indicates very serious problems, up to outright scientific fraud, as was the case discovered in one IUCr investigation (Harrison et al. 2009); in such cases, the original publications are retracted as well.
The last thing to take care of is the presence of duplicated entries in the COD. Unfortunately, due to less stringent admission procedures in the earliest days of the COD, or due to programming or data encoding errors, the same structure is sometimes deposited to the COD more than once. Once again, when such a situation is detected, neither entry is removed from the COD; instead, one entry, usually the most complete one, is declared to be the “main” entry describing this structure, and the others are marked as “duplicates” using the _cod_duplicate_entry data item. If the main entry is missing some information that is present in the duplicates, this information is merged into the main entry and committed as a new revision. Duplicate entries are marked by a non-NULL “duplicateof” column in the “data” table. Thus, to select only those entries that are not marked as duplicates, one needs to select entries that have the “duplicateof” column set to NULL.
It must be noted that only technical duplicates are flagged as such in the COD, i.e., only structures that originate from the same original description and from the same publication. Two structures of the same compound reported in different publications are not considered duplicates and are stored as different COD records. Even when the same data file is published as supplementary material to two different publications, it is deposited under two different COD identifiers. The rationale here is that a COD record reports an instance of a crystal structure solution reported somewhere, and all such cases must be represented in the database. Further reduction of the multiple records is the responsibility of the COD user, and, indeed, different tasks will require different uniqueness criteria: in some cases these will be based on chemical identity, in other cases on crystal structure identity, and the COD must provide sufficient data for all such queries.
Collecting all the above considerations into one SQL query, we can select all non-retracted experimental structures that are not marked as duplicates and have atomic coordinates with the query displayed in Listing 1; the query there reports the number of such entries in the current COD SQL database and can be used as a basis for further narrowing down the selection based on crystal parameters.
Listing 1 Number of non-retracted experimental structures with coordinates in the COD that are not marked as duplicates

#!/bin/bash
mysql -h sql.crystallography.net cod -u cod_reader -t -e \
'select count(*), current_timestamp() from data
where duplicateof is null
and flags like "%has coordinates%"
and (status is null or status != "retracted")
and (method is null or method != "theoretical")'

+----------+---------------------+
| count(*) | current_timestamp() |
+----------+---------------------+
|   383573 | 2017-12-04 14:07:01 |
+----------+---------------------+
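Listing 1 does not test the “optimal” column, although the text above recommends selecting only entries where it is NULL unless suboptimal structures are explicitly wanted. A possible variant is sketched here under stated assumptions: the column names and status values are those named in this section, whereas the function names, the "strict" switch, and the COD_DRY_RUN variable are hypothetical.

```shell
#!/bin/bash
# Sketch only: function names, the "strict" switch, and COD_DRY_RUN are
# illustrative; the column names and values are those named in the chapter.

# Print the WHERE conditions of Listing 1 plus the "optimal is null" test.
# With the argument "strict", entries flagged with status "errors" are
# dropped as well, leaving only unflagged entries and those with "warnings".
cod_clean_where() {
    if [ "${1:-}" = "strict" ]; then
        status_cond='(status is null or status = "warnings")'
    else
        status_cond='(status is null or status != "retracted")'
    fi
    printf '%s\n' \
        'duplicateof is null' \
        'and optimal is null' \
        'and flags like "%has coordinates%"' \
        "and $status_cond" \
        'and (method is null or method != "theoretical")'
}

# Count the selected entries; with COD_DRY_RUN set, only print the command.
cod_count_clean() {
    ${COD_DRY_RUN:+echo} mysql -h sql.crystallography.net cod -u cod_reader -t \
        -e "select count(*), current_timestamp() from data where $(cod_clean_where "${1:-}")"
}
```

Keeping the conditions in one function means that the same selection can be reused in other queries, for example by appending further conditions on crystal parameters.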
5 Accessing the COD
5.1 Web Access to the COD
The COD offers several methods to access its structure collection. The one that requires the least effort to learn is probably the Web query form (see Fig. 3). Multiple parameters can be specified, most of which should be self-explanatory; their exact meaning is the same as for the RESTful query fields and can be looked up in Table 2.
Fig. 3 The Web query form of the Crystallography Open Database

Table 2 RESTful interface search parameters and their descriptions

| Parameter | Description |
|-----------|-------------|
| format | The format in which the results will be returned |
| formula | The empirical chemical formula of the crystal. Chemical element symbols in the formula must be ordered according to the Hill notation and separated by a space symbol, e.g., “C8 H10 N4 O2” |
| el1, el2, . . . , el8 | Chemical element symbols that must appear in the chemical formula |
| nel1, nel2, . . . , nel4 | Chemical element symbols that must not appear in the chemical formula |
| strictmin, strictmax | The minimum/maximum number of distinct chemical elements that must appear in the chemical formula |
| amin, amax | The minimum/maximum value of the lattice parameter a |
| bmin, bmax | The minimum/maximum value of the lattice parameter b |
| cmin, cmax | The minimum/maximum value of the lattice parameter c |
| minZ, maxZ | The minimum/maximum Z value of the lattice |
| year | The year of publication of the crystal structure |

Results from a Web query are displayed in a separate browser page as an HTML table (Fig. 4); in addition, options are provided to download the list of resulting structures as a list of COD identifiers, a list of download URLs, or a CSV format table. For a small number of hits, a ZIP archive of all found CIFs is offered, but for a larger number of structures (typically more than several thousand) this option is not available, in order to avoid excessive stress on the COD servers; instead, the user is advised to download the COD structures in full and pick the desired CIFs using the COD identifier list resulting from the search.
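Picking CIFs from a full local copy using an identifier list, as advised above, can be sketched as follows. This is an illustration under assumptions, not a COD-provided tool: it assumes that COD identifiers are seven-digit numbers and that the local copy stores each file as D/DD/DD/DDDDDDD.cif under its cif/ directory, keyed by the leading digits of the identifier; the function names and all paths are hypothetical.

```shell
#!/bin/bash
# Sketch only: the function names and the mirror layout
# (D/DD/DD/DDDDDDD.cif under cif/, keyed by the COD ID digits) are assumptions.

# Map a 7-digit COD ID to its assumed path relative to the mirror's cif/ root.
cod_cif_path() {
    id=$1
    case $id in
        [1-9][0-9][0-9][0-9][0-9][0-9][0-9]) ;;
        *) echo "not a 7-digit COD ID: $id" >&2; return 1 ;;
    esac
    d1=${id%??????}      # first digit
    rest=${id#?}
    d23=${rest%????}     # digits 2 and 3
    rest=${rest#??}
    d45=${rest%??}       # digits 4 and 5
    printf '%s/%s/%s/%s.cif\n' "$d1" "$d23" "$d45" "$id"
}

# Read COD IDs (one per line) from standard input and copy the matching
# CIFs from a local mirror's cif/ directory into a destination directory.
cod_pick_cifs() {
    mirror=$1
    dest=$2
    mkdir -p "$dest" || return 1
    while IFS= read -r id; do
        [ -n "$id" ] || continue
        path=$(cod_cif_path "$id") || continue
        if [ -f "$mirror/$path" ]; then
            cp "$mirror/$path" "$dest/"
        else
            echo "not in mirror: $id" >&2
        fi
    done
}
```

With an identifier list saved from a search as ids.lst (a hypothetical file name), a call such as `cod_pick_cifs /path/to/cod/cif ./selected < ids.lst` would collect the selected files; identifiers missing from the local copy are reported on standard error rather than aborting the run.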
5.2 Using the RESTful Interfaces
The COD offers a RESTful interface that allows one to retrieve information about COD entries based on certain criteria, as well as the crystal structure files themselves. REST (REpresentational State Transfer) is an architectural style for network-based software that was outlined in the doctoral dissertation of Roy Fielding (2000).
Fig. 4 An example result page from a Crystallography Open Database Web query
The main ideas of this architecture relevant for the COD are to use a client-server design (the COD server serves multiple clients), to make the COD server as stateless as possible (thus the same request to the COD server should yield identical results if repeated several times), to use standard connections based on the HTTP protocol and stable Web URIs, and to use standard formats (CIF, HTML) to exchange information. An interface based on the ideas of REST, a so-called RESTful interface, has the benefit of not requiring a specialized client program, since the queries can be executed by any piece of software capable of resolving URIs, including, but not limited to, most Internet browsers.
COD RESTful search query URIs adhere to the HTTP GET query format, taking http://www.crystallography.net/cod/result as the base URI. For example, a query that returns a list of COD IDs associated with structures that contain Li and O atoms and were published in 2017 would take the form: http://www.crystallography.net/cod/result?el1=Li&el2=O&year=2017&format=lst
As mentioned above, specialized software is not required, but it can ease the construction of the query strings. An example of the same request rewritten to use the cURL program is given in Listing 2.
Listing 2 Querying the RESTful interface using cURL

#!/bin/bash
curl 'http://www.crystallography.net/cod/result' \
    -d 'el1=Li' \
    -d 'el2=O' \
    -d 'year=2017' \
    -d 'format=lst'
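Note that cURL's -d option sends the parameters in the body of an HTTP POST request rather than in the URI. To send the same parameters as an HTTP GET query, matching the URI form shown above, cURL's -G option can be combined with --data-urlencode. The wrapper below is a minimal sketch: cod_get and COD_DRY_RUN are hypothetical names, and how the COD server treats percent-encoded values has not been verified here.

```shell
#!/bin/bash
# Sketch only: cod_get and COD_DRY_RUN are illustrative names; the base URI
# and the parameter names are those given in the chapter.
COD_RESULT_URL='http://www.crystallography.net/cod/result'

# Send the given key=value pairs as an HTTP GET query.
# -G moves the data into the URL query string instead of a POST body;
# --data-urlencode percent-encodes each value (e.g., spaces in a formula).
cod_get() {
    n=$#
    while [ "$n" -gt 0 ]; do
        # Append "--data-urlencode <pair>" and drop the original argument.
        set -- "$@" --data-urlencode "$1"
        shift
        n=$((n - 1))
    done
    ${COD_DRY_RUN:+echo} curl -sS -G "$COD_RESULT_URL" "$@"
}

# Dry run: print the command for the Listing 2 query without contacting
# the server.
COD_DRY_RUN=1 cod_get el1=Li el2=O year=2017 format=lst
```

Without COD_DRY_RUN the function performs the request; with it, the assembled cURL command is printed so that it can be inspected first.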
Several more examples of COD RESTful interface queries using cURL are given in Listings 3, 4, 5, and 6. The query parameters used are described in Table 2. The full list of supported parameters and formats can be found at http://wiki.crystallography.net/RESTful_API/.
Listing 3 Count of structures that contain Fe atoms, but no O atoms

#!/bin/bash
curl 'http://www.crystallography.net/cod/result' \
    -d 'el1=Fe' \
    -d 'nel1=O' \
    -d 'format=count'

Listing 4 Information about entries that contain only Fe and N atoms in JSON format

#!/bin/bash
curl 'http://www.crystallography.net/cod/result' \
    -d 'el1=Fe' \
    -d 'el2=N' \
    -d 'strictmin=2' \
    -d 'strictmax=2' \
    -d 'format=json'

Listing 5 Text file with URLs of entries that have the “C O2” chemical formula

#!/bin/bash
curl 'http://www.crystallography.net/cod/result' \
    -d 'formula=C O2' \
    -d 'format=urls'

Listing 6 ZIP archive containing CIF files of entries that have cell lengths between 30 Å and 35 Å and Z number between 3 and 4

#!/bin/bash
curl 'http://www.crystallography.net/cod/result' \
    -d 'amin=30&amax=35' \
    -d 'bmin=30&bmax=35' \
    -d 'cmin=30&cmax=35' \
    -d 'minZ=3&maxZ=4' \
    -d 'format=zip'
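Listing 5 passes a formula that is already in Hill notation, as Table 2 requires for the “formula” parameter. When element counts are at hand in arbitrary order, the ordering can be produced mechanically. The helper below is a hypothetical sketch, not a COD tool; it assumes the usual Hill convention (with carbon present: C first, H second, the remaining elements alphabetically; without carbon: all elements alphabetically) and tokens of the form element symbol followed by an optional count.

```shell
#!/bin/bash
# Sketch only: hill_formula is an illustrative helper, not a COD tool.

# Arrange element tokens (e.g., "N4" "O2" "H10" "C8") in Hill order:
# if carbon is present, C first, H second, the rest alphabetically;
# without carbon, all elements alphabetically (H included).
hill_formula() {
    # In the C locale digits sort before letters, so sorting the tokens
    # also sorts them by element symbol (N4 before Na2, C8 before Cl2).
    tokens=$(printf '%s\n' "$@" | LC_ALL=C sort)
    if printf '%s\n' "$tokens" | grep -Eq '^C[0-9.]*$'; then
        c=$(printf '%s\n' "$tokens" | grep -E '^C[0-9.]*$' || true)
        h=$(printf '%s\n' "$tokens" | grep -E '^H[0-9.]*$' || true)
        rest=$(printf '%s\n' "$tokens" | grep -Ev '^[CH][0-9.]*$' || true)
        # Unquoted on purpose: split the lists back into separate tokens.
        set -- $c $h $rest
    else
        set -- $tokens
    fi
    echo "$*"
}
```

For example, `hill_formula N4 O2 H10 C8` prints `C8 H10 N4 O2`, which is the formula used as the example in Table 2; the result could then be passed to cURL as `-d "formula=$(hill_formula N4 O2 H10 C8)"`.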
5.3 Querying SQL Database
SQL (Structured Query Language) is arguably the most powerful method of interrogating relational databases and offers more features than the COD Web page or even the COD RESTful interface. The Crystallography Open Database offers read-only access to its data tables so that SQL queries can be carried out by users or by third-party software. Covering SQL language syntax and its use is beyond the scope of this chapter, but numerous textbooks and online references on SQL exist, as well as excellent documentation for several F/LOSS implementations of SQL (MySQL is one of them). In this text we provide just a few examples that demonstrate how SQL queries can be used for querying the COD out of the box.
The COD SQL tables are constructed automatically from the COD CIF collection. Tables are updated by the post-commit hooks of the Subversion repository; thus the SQL tables should always be in sync with the CIF collection. In the COD, the dataflow is always from CIFs to the SQL database; all changes in tables must therefore first be recorded and versioned in the main repository. MySQL thus acts essentially as a fast search cache for the COD, making use of index tables and the query optimizer. The COD MySQL “data” table also contains the “svnrevision” column, which records the Subversion revision from which each row was produced. In addition, all COD MySQL tables are dumped nightly in text form and committed to the same Subversion repository as the CIF collection. These archives provide a means to reproduce queries that were run some time ago, should the need arise for the reproducibility of scientific computations.
The simplest query counts the number of records in the current revision of the COD (Listing 7). A more elaborate form of this query, which filters out structures that are usually unwanted, is provided in Listing 1. Further examples (Listings 8, 9, and 10) demonstrate how various chemical features can be queried. Specifically, Listing 9 shows how the COD MySQL server can be queried using regular expressions, an extension of the SQL language. These queries permit selections based on atom chemical types, among other possibilities.
Listing 7 Number of entries in the COD

#!/bin/bash
mysql -h sql.crystallography.net cod -u cod_reader -t -e \
'select count(*), current_timestamp() from data'

+----------+---------------------+
| count(*) | current_timestamp() |
+----------+---------------------+
|   387948 | 2017-12-04 14:07:02 |
+----------+---------------------+
Listing 8 DOIs and publication years of structures of cucurbituril

#!/bin/bash
mysql -h sql.crystallography.net cod -u cod_reader -t -e \
'select file, doi, year from data
where chemname like "%cucurbituril%"'

+---------+---------------------------+------+
| file    | doi                       | year |
+---------+---------------------------+------+
| 2200062 | 10.1107/S1600536800019498 | 2001 |
| 4320271 | 10.1021/ic015520p         | 2001 |
| 4320272 | 10.1021/ic015520p         | 2001 |
| 4320689 | 10.1021/ic010362n         | 2001 |
| 4320690 | 10.1021/ic010362n         | 2001 |
| 4508668 | 10.1021/cg060062m         | 2006 |
| 4508669 | 10.1021/cg060062m         | 2006 |
+---------+---------------------------+------+
Listing 9 Number of hydrocarbons
#!/bin/bash
mysql -h sql.crystallography.net cod -u cod_reader -t -e \
    'select count(*), current_timestamp() from data
     where formula regexp "- C[[:digit:]]* H[[:digit:]]* -"'
+----------+---------------------+
| count(*) | current_timestamp() |
+----------+---------------------+
|     1250 | 2017-12-04 14:07:02 |
+----------+---------------------+
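The pattern of Listing 9 can be tried out offline before it is sent to the server. The sketch below applies the same expression with grep -E, which understands the same POSIX bracket classes. The formula strings are made-up test inputs that merely imitate the dash-delimited form implied by the pattern, and grep only stands in for the MySQL matcher, which may differ in details such as case sensitivity.

```shell
#!/bin/bash
# Offline check of the Listing 9 pattern. The sample formula strings are
# illustrative assumptions, not rows retrieved from the COD.
pattern='- C[[:digit:]]* H[[:digit:]]* -'

is_hydrocarbon() {
    # "--" stops option parsing, because the pattern itself begins with a dash
    printf '%s\n' "$1" | grep -Eq -- "$pattern"
}

for formula in '- C6 H6 -' '- C H4 -' '- C2 H6 O -' '- C6 H5 Cl -'; do
    if is_hydrocarbon "$formula"; then
        echo "match:    $formula"
    else
        echo "no match: $formula"
    fi
done
```

With these inputs the first two strings are reported as matches and the last two are not. Since `[[:digit:]]*` also accepts an empty run of digits, an element symbol without an explicit count still matches, while any further element between the hydrogen count and the closing dash prevents a match.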
Listing 10 Five most voluminous MOFs
#!/bin/bash
mysql -h sql.crystallography.net cod -u cod_reader -t -e \
    'select file, chemname, vol from data
     where chemname like "%MOF%"
     order by vol desc
     limit 5'
+---------+-------------+--------+
| file    | chemname    | vol    |
+---------+-------------+--------+
| 4111295 | mesoMOF-1   | 122163 |
| 1519417 | Y-ftw-MOF-3 | 111361 |
| 1519416 | Y-ftw-MOF-2 |  64231 |
| 7032763 | MOF-205-NO2 |  27851 |
| 7032762 | MOF-205-NH2 |  27846 |
+---------+-------------+--------+
6 COD Applications
Even though the COD is not as large as some older crystallographic databases, it has numerous applications due to its open nature. One immediate area where the COD excels is teaching. Using the COD, one can give students real-life data search and crystallographic applications, illustrate structures of various compounds, and provide insights into modern chemical research areas (Gražulis et al. 2015). Advantages of the COD are its extremely rapid release cycle (the database is updated daily), a permissive license that allows students to download arbitrary parts or even the whole database to their computers, and its availability on the Internet, where it can be accessed from inside or outside the classroom.
Another widely accepted application of the COD is material identification using the powder diffraction method and the search-match procedure. The largest diffractometer vendors (among them Bruker, PANalytical, and Rigaku) have adapted the COD collection for their software and ship it with their equipment, providing regular updates on the COD Web site or on their own pages. Since the COD is an open database, these updates are free of charge for end users. The COD has now accumulated enough mineral structures to be used in the SOLSA project (http://solsa-mining.eu), where the database serves, together with other information sources, as a tool for material identification and data dissemination.
In bioinformatics and drug design, the COD is used as a source of open data for restraint libraries (Long et al. 2017a,b). It is also used in DataWarrior (Sander et al. 2015) as one of the sources of chemical information and in the OpenMolecules Web site (http://www.openmolecules.org/). Software testing benefits from the large collection of COD data when many different cases need to be examined and data need to be stored in regression tests. Finally, the COD is used in fundamental research to answer different questions about matter (see, e.g., recent works on MOFs (First and Floudas 2013), hydrogen storage (Breternitz and Gregory 2015), or characterization of 2D materials (Mounet et al. 2018)).
7 Conclusions
The more than decade-long history of the COD has demonstrated that it is possible to build a lasting, high-quality scientific database using an open-access licensing model. In its current state, the COD is useful for a range of academic and industrial applications. Most importantly, this open database provides everyone with access to knowledge in its own field of small-molecule crystallography. At the same time, there are many obvious improvements that can be made. Clearly, the COD needs a more comprehensive data collection. More community organization effort is needed to involve more people in data collection, correction, and quality assurance of the COD records. More links with other Internet data resources should be made, integrating the COD more closely into the Linked Open Data Cloud. None of these tasks seems to be beyond current capabilities, and so one can expect that, in due time, the COD will be expanded to include all these features.
Acknowledgements This project has received funding from the European Union’s Horizon 2020 research and innovation program under grant agreement No 689868.
References
Allen FH, Bellard S, Brice MD, Cartwright BA, Doubleday A, Higgs H, Hummelink T, Hummelink-Peters BG, Kennard O, Motherwell WDS, Rodgers JR, Watson DG (1979) The Cambridge crystallographic data centre: computer-based search, retrieval, analysis and display of information. Acta Crystallogr Sect B Struct Crystallogr Crystal Chem 35(10):2331–2339
Andronico A, Randall A, Benz RW, Baldi P (2011) Data-driven high-throughput prediction of the 3-D structure of small molecules: review and progress. J Chem Inf Model 51:760–776
Aroyo MI, Perez-Mato JM, Capillas C, Kroumova E, Ivantchev S, Madariaga G, Kirov A, Wondratschek H (2006) Bilbao crystallographic server: I. Databases and crystallographic computing programs. Zeitschrift für Kristallographie – Crystalline Materials 221(1):15–27
Aroyo MI, Perez-Mato JM, Orobengoa D, Tasci E, de la Flor G, Kirov A (2011) Crystallography online: Bilbao crystallographic server. Bulg Chem Commun 43(2):183–197
Baerlocher C, McCusker LB, Olson DH (2007) Atlas of zeolite framework types, 6th revised edn. Elsevier, Amsterdam/London/New York/Oxford/Paris/Shannon/Tokyo
Baldi P (2011) Data-driven high-throughput prediction of the 3-D structure of small molecules: review and progress. A response to the letter by the Cambridge crystallographic data centre. J Chem Inf Model 51:3029
Belsky A, Hellenbrandt M, Karen VL, Luksch P (2002) New developments in the Inorganic Crystal Structure Database (ICSD): accessibility in support of materials research and design. Acta Crystallogr B 58:364–369
Berman HM, Olson WK, Beveridge DL, Westbrook J, Gelbin A, Demeny T, Hsieh SH, Srinivasan AR, Schneider B (1992) The nucleic acid database: a comprehensive relational database of three-dimensional structures of nucleic acids. Biophys J 63:751–759
Berman HM, Kleywegt GJ, Nakamura H, Markley JL (2012) The protein data bank at 40: reflecting on the past to prepare for the future. Structure 20:391–396
Bragg WH (1913) The reflection of x-rays by crystals. (II) Proc R Soc A Math Phys Eng Sci 89(610):246–248
Bragg WH, Bragg WL (1913) The reflection of x-rays by crystals. Proc R Soc Lond A Math Phys Eng Sci 88:428–438
Breternitz J, Gregory D (2015) The search for hydrogen stores on a large scale; a straightforward and automated open database analysis as a first sweep for candidate materials. Crystals 5:617–633
Brown ID, McMahon B (2002) CIF: the computer language of crystallography. Acta Crystallogr B 58:317–324
Chateigner D, Grazulis S, Pérez O, Pepponi G, Lutterotti L (2015) COD, PCOD, TCOD, MPOD... open structure and property databases. http://www.ecole.ensicaen.fr/~chateign/danielc/abstracts/Chateigner_abstract_JNCO2013.pdf, accessed 2018-10-03
Clews CJB, Cochran W (1948) The structures of pyrimidines and purines. I. A determination of the structures of 2-amino-4-methyl-6-chloropyrimidine and 2-amino-4,6-dichloropyrimidine by x-ray methods. Acta Crystallogr 1(1):4–11
Downs RT, Hall-Wallace M (2003) The American mineralogist crystal structure database. Am Miner 88:247–250
Faber J, Fawcett T (2002) The powder diffraction file: present and future. Acta Crystallogr B 58(3 Part 1):325–332
Fielding RT (2000) Architectural Styles and the design of network-based software architectures. Ph.D. thesis, University of California, Irvine
First EL, Floudas CA (2013) Mofomics: computational pore characterization of metal-organic frameworks. Microporous Mesoporous Mater 165:32–39
Friedrich W, Knipping P, Laue M (1912) Interferenzerscheinungen bei Röntgenstrahlen. Eine quantitative Prüfung der Theorie für die Interferenz-Erscheinungen bei Röntgenstrahlen. Bayerische Akademie der Wissenschaften, Mathematisch-Physikalische Klasse, Sitzungsberichte, pp 303–322
Friedrich W, Knipping P, Laue M (1912) Interferenzerscheinungen bei Röntgenstrahlen. Eine quantitative Prüfung der Theorie für die Interferenz-Erscheinungen bei Röntgenstrahlen, II. Bayerische Akademie der Wissenschaften, Mathematisch-Physikalische Klasse, Sitzungsberichte, pp 363–373
Gražulis S, Chateigner D, Downs RT, Yokochi AFT, Quirós M, Lutterotti L, Manakova E, Butkus J, Moeck P, Le Bail A (2009) Crystallography open database: an open-access collection of crystal structures. J Appl Crystallogr 42(4):726–729
Gražulis S, Daškevic A, Merkys A, Chateigner D, Lutterotti L, Quirós M, Serebryanaya NR, Moeck P, Downs RT, Le Bail A (2012) Crystallography open database (COD): an open-access collection of crystal structures and platform for world-wide collaboration. Nucleic Acids Res 40(D1):D420–D427
Gražulis S, Sarjeant AA, Moeck P, Stone-Sundberg J, Snyder TJ, Kaminsky W, Oliver AG, Stern CL, Dawe LN, Rychkov DA, Losev EA, Boldyreva EV, Tanski JM, Bernstein J, Rabeh WM, Kantardjieff KA (2015) Crystallographic education in the 21st century. J Appl Crystallogr 48(6):1964–1975
Groom CR, Allen FH (2014) The Cambridge structural database in retrospect and prospect. Angew Chem Int Ed 53:662–671
Groom CR, Bruno IJ, Lightfoot MP, Ward SC (2016) The Cambridge structural database. Acta Crystallogr B 72(2):171–179
Hall SR, Allen FH, Brown ID (1991) The crystallographic information file (CIF): a new standard archive file for crystallography. Acta Crystallogr A 47(6):655–685
Harrison WTA, Simpson J, Weil M (2009) Editorial. Acta Crystallogr E Struct Rep Online 66(1):e1–e2
Hermann C, Ewald PP (1931) Strukturbericht 1913-1928: Zeitschrift für Kristallographie, Kristallgeometrie, Kristallphysik, Kristallchemie. Akademische Verlagsgesellschaft, Leipzig
IUCr (2017) A formal grammar for CIF. https://www.iucr.org/resources/cif/spec/version1.1/cifsyntax, accessed 2018-10-03
IUCr (2017) Crystallographic information framework. https://www.iucr.org/resources/cif, accessed 2018-10-03
IUCr (2017) Structure reports. https://www.iucr.org/publications/other/structure-reports, accessed 2018-10-03
Kabekkodu SN, Faber J, Fawcett T (2002) New powder diffraction file (pdf-4) in relational database format: advantages and data-mining capabilities. Acta Crystallogr B 58:333–337
Kaduk JA (2002) Use of the inorganic crystal structure database as a problem solving tool. Acta Crystallogr B 58(Pt 3 Pt 1):370–379
Lafuente B, Downs RT, Yang H, Stone N (2015) The power of databases: the RRUFF project. In: Highlights in mineralogical crystallography. W. De Gruyter, Berlin, pp 1–30
Le Bail A (2005) Inorganic structure prediction with grinsp. J Appl Crystallogr 38:389–395
Lejaeghere K, Van Speybroeck V, Van Oost G, Cottenier S (2014) Error estimates for solid-state density-functional theory predictions: an overview by means of the ground-state elemental crystals. Crit Rev Solid State Mater Sci 39:1–24
Long F, Nicholls RA, Emsley P, Gražulis S, Merkys A, Vaitkus A, Murshudov GN (2017) ACEDRG: a stereo-chemical description generator for ligands. Acta Crystallogr D 73(2):112–122
Long F, Nicholls RA, Emsley P, Gražulis S, Merkys A, Vaitkus A, Murshudov GN (2017) Validation and extraction of stereochemical information from small molecular databases. Acta Crystallogr D 73(2):103–111
Merkys A, Vaitkus A, Butkus J, Okulic-Kazarinas M, Kairys V, Gražulis S (2016) COD::CIF::Parser: an error-correcting CIF parser for the Perl language. J Appl Crystallogr 49(1):292–301
Merkys A, Mounet N, Cepellotti A, Marzari N, Gražulis S, Pizzi G (2017) A posteriori metadata from automated provenance tracking: integration of AiiDA and TCOD. J Cheminform 9(1):56
Mounet N, Gibertini M, Schwaller P, Campi D, Merkys A, Marrazzo A, Sohier T, Castelli IE, Cepellotti A, Pizzi G, Marzari N (2018) Novel two-dimensional materials from high-throughput computational exfoliation of experimentally known compounds. Nature Nanotechnology 13(3):246–252
Narayanan BC, Westbrook J, Ghosh S, Petrov AI, Sweeney B, Zirbel CL, Leontis NB, Berman HM (2014) The nucleic acid database: new features and capabilities. Nucleic Acids Res 42:D114–D122
Pepponi G, Gražulis S, Chateigner D (2012) MPOD: a material property open database linked to structural information. Nucl Instrum Methods Phys Res Sect B: Beam Interact Mater Atoms 284(0):10–14. E-MRS 2011 Spring Meeting, Symposium M: X-ray techniques for materials research-from laboratory sources to free electron lasers
Perez-Mato JM, Gallego SV, Tasci ES, Elcoro L, de la Flor G, Aroyo MI (2015) Symmetry-based computational tools for magnetic crystallography. Annu Rev Mater Res 45(1):217–248
Protein Data Bank (1971) Protein data bank. Nat New Biol 233:22–23
Rajan H, Uchida H, Bryan DL, Swaminathan R, Downs RT, Hall-Wallace M (2006) Building the American mineralogist crystal structure database: a recipe for construction of a small internet database. In: Sinha AK (ed) Geoinformatics: data to knowledge, Geological Society of America, Boulder, vol 397, 73–80
Röntgen WC (1896) On a new kind of rays. Nature 53:274–276
Sadowski P, Baldi P (2013) Small-molecule 3d structure prediction using open crystallography data. J Chem Inf Model 53:3127–3130
Sander T, Freyss J, von Korff M, Rufener C (2015) DataWarrior: an open-source program for chemistry aware data visualization and analysis. J Chem Inf Model 55(2):460–473
Villars P, Onodera N, Iwata S (1998) The linus pauling file (LPF) and its application to materials design. J Alloys Compd 279:1–7
Villars P, Cenzual K, Daams J, Chen Y, Iwata S (2004) Data-driven atomic environment prediction for binaries using the mendeleev number: part 1. Composition AB. J Alloys Compd 367(1–2):167–175. Proceedings of the VIII international conference on crystal chemistry of intermetallic compounds
White PS, Rodgers JR, Le Page Y (2002) Crystmet: a database of the structures and powder patterns of metals and intermetallics. Acta Crystallogr B 58(Pt 3 Pt 1):343–348
Kitajgorodskij AI (1955) Organicheskaya kristallokhimiya (Organic crystal chemistry), vol 1. Izdatel'stvo Akademii Nauk SSSR