utility data analysis tools file standard format · conversions of real data files using the...

1
Design and implementations of the new HUPO Proteomics Standards Initiative’s mass spectrometer output file standard format: mzML 1.0 Eric W Deutsch 1 , Pierre-Alain Binz 2 , Darren Kessner 3 , Matt Chambers 4 , Luisa Montecchi-Palazzi 5 , Jim Shofstahl 6 , Josh Tasman 1 , Randall K Julian 7 , Fredrik Levander 8 , Puneet Souda 9 , and Lennart Martens 5 1 Institute for Systems Biology , Seattle, WA; 2 Swiss Institute for Bioinformatics and Geneva Bioinformatics, Geneva, Switzerland; 3 Cedars-Sinai Medical Center, Los Angeles, CA; 4 Vanderbilt University, Nashvillie, TN, 5 European Bioinformatics Institute, Hinxton, UK, 6 Thermo Fisher, San Jose, CA, 7 Indigo Biosystems, Carmel, IN, 8 Lund University, Lund, Sweden, 9 University of California Los Angeles, Los Angeles, CA Overview mzML is a new data format for the storage and exchange of mass spectrometer output files. It follows on the successful mzXML and mzData formats. mzML has been designed by merging the best aspects of both previous formats into a single unified format that is intended to replace all earlier formats. Version 1.0.0 just released Accompanied by a controlled vocabulary and semantic validation rules Many implementations of the format already exists, insuring quick adoption of the format Developed with full participation of academic researchers, hardware and software vendors Vendors have committed to supporting the new format once released. Format has been tested with several instance documents and many implementations of the format during beta testing mzML is expected to replace mzXML and mzData, but not expected to completely replace vendor binary formats. History mzML has been under development for two years with full participation of academic researchers, hardware and software vendors. It was first conceived at the PSI meeting in 2006, two years after mzXML and mzData were released. The original working name of dataXML was changed to the final mzML name in Lyon. The format was submitted to the PSI document process in November 2007 wherein it passed through formal internal and then community review. Version 1.0.0 of the format was completed just prior to this conference. mzData 1.05 mzXML 3.0 mzML 0.90 SFO 2006-05 dataXML 0.6 DC 2006-09 ISB 2006-11 Lyon 2007-04 EBI 2007-06 mzML 0.91 PSI Doc Proc 2007-11 mzML 0.99 RC Toledo 2008-04 mzML 1.0.0 Done! 2008-06 Early Development Final Development Schema Outline The mzML schema is designed to contain all the information for a single MS run, including meta data about the spectra plus all the spectra themselves, either in centroided (peak list) or profile mode. The header at the top of the file encodes information about the source of the data as well as information about the sample, instrument and software that processed the data. Semantic Validator One of the benefits of the previous mzData format was its considerable flexibility in allowing writers of the format to encode additional information relevant to the specific instrument or setup, even if it cannot be handled by all software. However, this considerable flexibility led to different dialects as the same information could be encoded in different ways. We have solved this problem with the semantic validator tool for mzML. The semantic validator can: ProteoWizard Software Implementations The best way to test a new format is by implementing it in software. Inevitably as a format is implemented, one finds minor inconsistencies or missing features. The initial release of mzML is strengthened by the breadth of implementations that already exist and have exercised the various use cases: Ensure that an mzML document is well formed and conforms to the xsd XML schema Ensure that controlled vocabulary terms are used in the correct places in the document Allow alternate rules based on the type of data being written Allow different levels of compliance (e.g., basic mzML, MIAPE-MS compliant mzML) Semantic rules can be updated along with the controlled vocabulary without changing the schema Available as a web page (see below) or as a standalone tool Conclusions The mzML format is now complete and mzML 1.0.0 is released. We encourage all authors and vendors to begin supporting this new format in new and updated software. The format includes the best features from pre-existing open formats and has additional support for chromatograms and some other features deemed highly desirable. It is expected that the schema will remain stable for at least a year, hopefully more. However, the controlled vocabulary and semantic validation rules will continue to be updated and refined as all authors and vendors finish implementing their software for mzML. To learn more, see the mzML Development Page: The mzML effort has involved many people in the PSI and in the community. We gratefully acknowledge the contributions of: Jari Häkkinen (Lund) Brian Pratt (Insilicos) Erik Nilsson (Insilicos) Mike Coleman (Stowers) Luis Mendoza (ISB) David Shteynberg (ISB) Lars Nilse (Manchester) Benito Cañas (Madrid) Lola Gutierrez (Madrid) Alberto Medina (Madrid) Trish Wheztel (U Penn) Eva Duchoslav (MDS Sciex) Henning Hermjakob (EBI) Angel Pizarro (U Penn) Phil Jones (EBI) Jimmy Eng (U Washington) Kent Laursen (Indigo) Sandra Orchard (EBI) Chris Taylor (EBI) Patrick Pedrioli (ETHZ) Sean Seymour (ABI) David Creasy (Matrix Science) Howard Read (Waters) Jim Langridge (Waters) Jayson Falkner (U Michigan) David Horn (Agilent) Ruth McNally (Cardiff) Ron Beavis (UBC) Norman Paton (Manchester) Ruedi Aebersold (ETHZ) Marc Sturm (U Tuebingen) Parag Mallick (CSHS) Rune Philosof David Sparkman (U Pacific) Wilfred Tang (ABI) Marius Kallhardt (Bruker) PSI Steering Group PSI Participants http://psidev.info/index.php?q=node/257 proprietary format mass spectrometer B mass spectrometer A converter mzXML mzData mzML search engine A search engine B Native format pepXML analysisXML Public repository mzML is a common open format to record the output of mass spectrometers prior to database searching or other downstream processing of the spectra. It is expected that by 2009: Instrument vendors will write out or convert to mzML Search engines or other spectrum processing software will read and process mzML Data repositories will accept, process, and store mzML documents Natively mass spectrometers store output in a variety of proprietary formats Hinders data sharing Makes writing vendor-neutral software difficult Solution: develop a vendor-neutral open format that all vendor software can write vendor-neutral software can read and write Controlled Vocabulary Much of the metadata encoded in the mzML is in the form of cvParams, an XML element that provides a reference to a specific concept within the PSI MS controlled vocabulary. Each term has an explicit and detailed definition, and may have information about its data type and what kind of units it requires, if any. The controlled vocabulary is edited in OBO format with the OBO-Edit software and is read in by most readers and writers of mzML. The controlled vocabulary can be easily adjusted and extended without modifying the schema. Example Instance Documents In order to exercise the schema and demonstrate that the various use cases have been adequately modeled, we have developed several example instance documents. Some of the documents are hand-crafted with an ordinary editor, while others are written out as a software test as part of the ProteoWizard reference implementation. In addition, several instance documents are conversions of real data files using the reference or other implementations of converters. Documentation The full specification of the format is presented in a specification document that described various aspects of the format as well as all details of the format elements. HTML and PDF documentation are generated by programmatically combining 4 different components: xsd schema, controlled vocabulary, semantic validation mapping file, example documents. ProteoWizard C++ library reference implementation: reads mzML and writes mzML. Can convert mzXML and RAW to mzML. RAMP (Random Access Minimal Parser) C library can read mzML, mzXML, mzData files via same API. mzML reading performed with ProteoWizard Trans Proteomic Pipeline (TPP) can read and process data in mzML format via RAMP ISB format converters: ReAdW (Thermo), Wolf (Waters), mzWiff (ABI/MDS), Trapper (Agilent) Thermo Fisher beta RAW → mzML converter Semantic Validator and Java library reads and validates that a document is semantically correct Phenyx search engine can read and search spectra in mzML format NCBI C++ mzML reader classes Insilicos Viewer – file browser and spectrum viewer can read and display spectra from mzML, mzXML, mzData, RAW formats SeeMS file browser, spectrum viewer, chromatogram viewer, annotater for mzML, mzXML Proteios Software Environment includes converters for peak lists of various formats to mzML and performs reading of mzML files InSilicoSpectro open source library (Perl) has a spectrum file format conversion tool that reads mzML Hermes mzML ↔ mzData ↔ mzXML converter (Java) Online validator at the ProDaC site allows anyone to upload any file and perform semantic validation on any mzML file A downloadable version of the semantic validator can be run locally on any platform supporting Java to validate files without needing to transmit them to a remote web site. On-line validator Local validator New term requests may be emailed to: [email protected] Semantic validator: concept Ontology Access component Validator layer Actual validator implementation CV rule reader component OLS – http://www.ebi.ac.uk/ols OBO file CV mapping file Xpath based XML indexer component XML file to validate Ontology config file Objectrules file Semantic all y correc t file Se mant i c a lly inc orrec t f i le OBO-Edit is used to maintain the controlled vocabulary: organize the structure, add new terms, update definitions. The controlled vocabulary is easily parsed by software, such as reader and writer software, as well as central vocabulary services, like the Ontology Lookup Service web site. Maintenance on mzML will continue with the PSI Mass Spectrometry Standards Working Group. It is expected that the schema will remain stable, but minor updates to the controlled vocabulary and semantic validation rules may be necessary. mzML run spectrum spectrumDescription binaryDataArray binaryDataArray ••• precursorList scan spectrumList ••• spectrum spectrum cvList referenceableParamGroupList sampleList acquisitionSettingsList dataProcessingList softwareList instrumentConfigurationList chromatogramList ••• chromatogram chromatogram chromatogram binaryDataArray binaryDataArray Each spectrum contains a header with scan information and optionally precursor information, followed by two or more base64-encoded binary data arrays. Chromatograms may be encoded in mzML in a special element that contains one or more cvParams to describe the type of chromatogram, followed by two base64- encoded binary data arrays. mzML may be enclosed in a special indexing wrapper schema to allow random access into the file, allowing software to pull out one or more arbitrary spectra. C++ library, with modular design for testability and extensibility builds with native compilers on all major platforms (MSVC on Windows, gcc on Linux, XCode on OSX) open source license suitable for both academic and commercial projects (Apache v2) internal data model is a one-to-one translation of mzML data elements to C++ data structures plug-in Reader interface for reading of both open and vendor proprietary data formats: mzXML, Thermo RAW, MGF, with more Readers in development msconvert tool provides general file format conversion, including native centroiding and zlib compression SeeMS and mspicture visualization tools allows visualization of mass spec data used by RAMP and TPP for mzML support CLI binding allows use from .NET languages (C++/CLI, C#, VB.NET); SWIG bindings for scripting (from Java, Python, Perl, R) in development. an "Application Note" describing ProteoWizard has been accepted for publication by the journal Bioinformatics http://proteowizard.sourceforge.net -- developer contributions are welcome! The ProteoWizard software project, initiated by the Spielberg Family Center for Applied Proteomics at the Cedars-Sinai Medical Center, provides a modular and extensible set of open-source, cross-platform tools and libraries. The tools perform proteomics data analyses; the libraries enable rapid tool creation by providing a robust, pluggable development framework that simplifies and unifies data file access, and performs standard chemistry and LCMS dataset computations. During the final stages of mzML development, refinement, and testing, the ProteoWizard library has provided the testing and reference implementation of mzML. analysis peak detection 1 preprocessing tools SeeMS msconvert data data structures data format abstraction utility math parsing encoding testing data extraction peak detection 2 msaccess mspicture High level architecture of ProteoWizard Hyperlinks to various example documents on the mzML development web site. A sample snippet of an example mzML document, showing the top header portion of the file. Part of the HTML documentation page for mzML, showing the available documentation for the <sourceFile> element. Title page of the full mzML Specification Document NativeID references back to the original scan references in the source data

Upload: doanxuyen

Post on 19-Jul-2018

225 views

Category:

Documents


0 download

TRANSCRIPT

De

sig

n a

nd

im

ple

menta

tio

ns o

f th

e n

ew

HU

PO

Pro

teo

mic

s S

tan

da

rds In

itia

tive

’s m

ass s

pe

ctr

om

ete

r o

utp

ut file

sta

nda

rd fo

rma

t: mzML 1.0

Eric W

Deuts

ch

1,

Pie

rre-A

lain

Bin

z2,

Darr

en K

essner3

, M

att C

ham

bers

4,

Luis

a M

onte

cchi-P

ala

zzi5

, Jim

Shofs

tahl6

, Josh T

asm

an

1,

Randall

K J

ulia

n7,

Fre

drik L

evander8

, P

uneet

Souda

9,

and L

ennart

Mart

ens

5

1In

stitu

te f

or

Syste

ms B

iolo

gy ,

Seatt

le, W

A;

2S

wis

s I

nstitu

te f

or

Bio

info

rmatics a

nd G

eneva B

ioin

form

atics,

Geneva,

Sw

itzerland;

3C

edars

-Sin

ai M

edic

al C

ente

r, L

os A

ngele

s,

CA

; 4

Vanderb

ilt U

niv

ers

ity,

Nashvill

ie,

TN

, 5

Euro

pean B

ioin

form

atics I

nstitu

te,

Hin

xto

n,

UK

,6

Therm

o F

isher,

San J

ose,

CA

, 7

Indig

o B

iosyste

ms,

Carm

el, I

N,

8Lund U

niv

ers

ity,

Lund,

Sw

eden,

9U

niv

ers

ity o

f C

alif

orn

ia L

os A

ngele

s,

Los A

ngele

s,

CA

Overv

iew

mzM

L is a

ne

w d

ata

form

at fo

r th

e s

tora

ge

and

exchan

ge

of

ma

ss s

pe

ctr

om

ete

r ou

tpu

t file

s.

It f

ollo

ws o

n t

he

su

cce

ssfu

l m

zX

ML

and

mzD

ata

fo

rma

ts.

mzM

L h

as b

een

de

sig

ned

by m

erg

ing t

he

be

st

aspe

cts

of

bo

th

pre

vio

us fo

rma

ts in

to a

sin

gle

un

ifie

d f

orm

at

tha

t is

in

tend

ed

to

rep

lace

all

ea

rlie

r fo

rma

ts.

Ve

rsio

n 1

.0.0

just

rele

ased

Accom

pa

nie

d b

y a

con

trolle

d v

ocab

ula

ry a

nd

sem

antic

va

lida

tio

n r

ule

s

Ma

ny im

ple

me

nta

tion

s o

f th

e fo

rma

t a

lre

ad

y e

xis

ts,

insu

ring

quic

k a

do

ptio

n o

f th

e fo

rma

t

De

ve

lop

ed

with

full

pa

rtic

ipa

tion

of a

cad

em

ic r

ese

arc

he

rs,

ha

rdw

are

and

so

ftw

are

ve

ndo

rs

Ve

nd

ors

ha

ve

co

mm

itte

d to

sup

po

rtin

g th

e n

ew

fo

rmat

on

ce

re

lease

d.

Fo

rmat

has b

ee

n t

este

d w

ith

se

ve

ral in

sta

nce

do

cu

men

ts

an

d m

an

y im

ple

me

nta

tion

s o

f th

e f

orm

at d

urin

g b

eta

testin

g

mzM

L is e

xp

ecte

d to

re

pla

ce

mzX

ML

an

d m

zD

ata

, b

ut n

ot

exp

ecte

d to

co

mple

tely

re

pla

ce

ve

nd

or

bin

ary

fo

rma

ts. H

isto

ry

mzM

L has been under

develo

pm

ent

for

two years

w

ith fu

ll part

icip

ation of

academ

ic re

searc

hers

, hard

ware

and soft

ware

vendors

. It w

as f

irst

conceiv

ed a

t th

e P

SI

meeting i

n 2

006,

two y

ears

aft

er

mzX

ML a

nd m

zD

ata

were

re

leased.

The o

rigin

al

work

ing n

am

e o

f data

XM

Lw

as c

hanged t

o t

he f

inal

mzM

L nam

e in

L

yon.

The f

orm

at

was subm

itte

d to

th

e P

SI

docum

ent

pro

cess i

n N

ovem

ber

2007 w

here

in i

t passed t

hro

ugh f

orm

al

inte

rnal

and t

hen c

om

munity r

evie

w.

Vers

ion 1

.0.0

of

the f

orm

at

was c

om

ple

ted just prior

to this

confe

rence.

mzD

ata

1.0

5

mzX

ML

3.0

mzM

L

0.9

0

SF

O

2006-0

5

data

XM

L

0.6

DC

2006-0

9

ISB

2006-1

1

Lyon

2007-0

4

EB

I

2007-0

6

mzM

L

0.9

1

PS

I D

oc P

roc

2007-1

1

mzM

L

0.9

9 R

C

Tole

do

2008-0

4

mzM

L

1.0

.0

Done!

2008-0

6

Earl

y D

evelo

pm

ent

Fin

al D

evelo

pm

ent

Schem

a O

utlin

e

The m

zM

L s

chem

a is d

esig

ned t

o c

onta

in a

ll th

e info

rmation f

or

asin

gle

MS

run,

inclu

din

g m

eta

data

abo

ut

the s

pectr

a p

lus a

ll

the s

pectr

a t

hem

selv

es,

either

in c

entr

oid

ed

(peak l

ist)

or

pro

file

mode.

The h

eader

at

the t

op o

f th

e f

ile e

ncodes i

nfo

rmation

about th

e s

ourc

e o

f th

e d

ata

as w

ell

as info

rmation

about th

e s

am

ple

, in

str

um

ent and s

oft

ware

that pro

cessed the d

ata

.

Sem

antic V

alid

ato

r

One o

f th

e b

enefits

of

the p

revio

us m

zD

ata

form

at

was i

ts c

onsid

era

ble

fle

xib

ility

in a

llow

ing w

rite

rs o

f th

e f

orm

at

to e

ncode

additio

nal

info

rmation r

ele

vant

to t

he s

pecific

instr

um

ent

or

setu

p,

even i

f it c

annot

be h

andle

d b

y a

ll soft

ware

. H

ow

ever,

this

consid

era

ble

fle

xib

ility

led t

o d

iffe

rent

dia

lects

as t

he s

am

e info

rmation c

ould

be e

ncoded in d

iffe

rent

wa

ys.

We h

ave s

olv

ed t

his

pro

ble

m w

ith the s

em

antic v

alid

ato

r to

ol fo

r m

zM

L.

The s

em

antic v

alid

ato

r can:

Pro

teoW

izard

Softw

are

Im

ple

menta

tions

The b

est

wa

y t

o t

est

a n

ew

form

at

is b

y i

mple

menting i

t in

soft

ware

. In

evitably

as a

form

at

is i

mple

mente

d,

one f

inds m

inor

inconsis

tencie

s o

r m

issin

g f

eatu

res.

The initia

l re

lease o

f m

zM

Lis

str

ength

ened b

y t

he b

readth

of

imple

menta

tions t

hat

alread

y

exis

t and h

ave e

xerc

ised the v

arious u

se c

ases:

Ensure

that

an m

zM

L d

ocum

ent

is w

ell

form

ed a

nd c

onfo

rms t

o t

he x

sd

XM

L s

chem

a

Ensure

that

contr

olle

d v

ocab

ula

ry t

erm

s a

re u

sed in t

he c

orr

ect

pla

ces in t

he d

ocum

ent

Allo

w a

ltern

ate

rule

s b

ased o

n t

he

type o

f d

ata

bein

g w

ritt

en

Allo

w d

iffe

rent

levels

of

com

plia

nce (

e.g

., b

asic

mzM

L,

MIA

PE

-MS

com

plia

nt m

zM

L)

Sem

antic r

ule

s c

an b

e u

pdate

d a

long w

ith t

he c

ontr

olle

d v

ocab

ula

ry w

ith

out

cha

ngin

g

the s

chem

a

Availa

ble

as a

web p

age (

see b

elo

w)

or

as a

sta

nd

alo

ne t

ool

Conclu

sio

ns

The m

zM

L f

orm

at

is n

ow

com

ple

te a

nd m

zM

L 1

.0.0

is r

ele

ased.

We e

ncoura

ge a

ll auth

ors

and v

endors

to b

egin

support

ing t

his

new

fo

rmat

in new

and update

d soft

ware

. T

he fo

rmat

inclu

des th

e best

featu

res fr

om

pre

-exis

ting open fo

rmats

and has

additio

nal support

for

chro

mato

gra

ms a

nd s

om

e o

ther

featu

res d

eem

ed h

ighly

de

sirable

.

It i

s e

xpecte

d t

hat

the s

chem

a w

ill r

em

ain

sta

ble

for

at

least

ayear,

hopefu

lly m

ore

. H

ow

ever,

the c

ontr

olle

d v

ocabula

ry a

nd

sem

antic v

alid

ation r

ule

s w

ill c

ontinue to b

e u

pdate

d a

nd r

efined a

s a

ll auth

ors

and v

endors

fin

ish im

ple

menting t

heir s

oft

ware

for

mzM

L.

To

le

arn

more

, se

e th

e m

zM

L D

eve

lopm

en

t P

ag

e:

The m

zM

L e

ffort

has involv

ed m

an

y p

eople

in the P

SI

and in the c

om

munity. W

e g

rate

fully

acknow

ledge th

e c

ontr

ibutions o

f:

Ja

ri H

äkkin

en

(Lu

nd

)

Bri

an

Pra

tt (

Insili

co

s)

Eri

k N

ilsso

n (

Insili

cos)

Mik

e C

ole

ma

n (

Sto

we

rs)

Lu

is M

en

do

za

(IS

B)

Da

vid

Sh

teynb

erg

(IS

B)

La

rs N

ilse

(M

an

ch

este

r)

Be

nito

Ca

ña

s (

Ma

dri

d)

Lo

la G

utie

rre

z (

Ma

dri

d)

Alb

ert

o M

ed

ina

(M

ad

rid

)

Tri

sh

Whe

zte

l (U

Pen

n)

Eva

Du

ch

osla

v (

MD

S S

cie

x)

He

nn

ing

He

rmja

kob

(E

BI)

An

ge

l P

iza

rro

(U

Pe

nn

)

Ph

il Jo

ne

s (

EB

I)

Jim

my E

ng

(U

Wash

ing

ton

)

Ke

nt

La

urs

en

(In

dig

o)

Sa

nd

ra O

rch

ard

(E

BI)

Ch

ris T

aylo

r (E

BI)

Pa

tric

k P

ed

rioli

(ET

HZ

)

Se

an

Se

ym

ou

r (A

BI)

Da

vid

Cre

asy (

Ma

trix

Scie

nce

)

Ho

wa

rd R

ea

d (

Wate

rs)

Jim

Lan

grid

ge

(W

ate

rs)

Ja

yson

Falk

ne

r (U

Mic

hig

an

)

Da

vid

Ho

rn (

Ag

ilen

t)

Ru

th M

cN

ally

(C

ard

iff)

Ro

n B

ea

vis

(U

BC

)

No

rma

n P

ato

n (

Ma

nch

este

r)

Ru

ed

i A

eb

ers

old

(E

TH

Z)

Ma

rc S

turm

(U

Tu

eb

ing

en

)

Pa

rag

Ma

llick (

CS

HS

)

Ru

ne

Philo

so

f

Da

vid

Sp

ark

ma

n (

U P

acific

)

Wilf

red

Ta

ng

(A

BI)

Ma

riu

s K

allh

ard

t (B

ruke

r)

PS

I S

tee

rin

g G

rou

p

PS

I P

art

icip

an

ts

htt

p://p

sid

ev.info

/index.p

hp?q=node/2

57

proprietary

format

mass

spectrometer B

mass

spectrometer A

converter

mzXML

mzData

mzML

search

engine A

search

engine B

Native format

pepXML

analysisXML

Public repository

mzM

L is a

com

mon o

pen form

at to

record

the o

utp

ut of m

ass s

pectr

om

ete

rs p

rior

to d

ata

base s

earc

hin

g o

r oth

er

dow

nstr

eam

pro

cessin

g o

f th

e s

pectr

a. It is

expecte

d that by 2

009:

Instr

um

en

t ve

nd

ors

will

wri

te o

ut

or

co

nve

rt t

o m

zM

L

Se

arc

h e

ng

ine

s o

r o

the

r sp

ectr

um

pro

ce

ssin

g

so

ftw

are

will

rea

d a

nd

pro

ce

ss m

zM

L

Da

ta r

ep

osito

rie

s w

ill a

ccep

t, p

rocess, a

nd

sto

re

mzM

L d

ocu

men

ts

Natively

mass s

pectr

om

ete

rs s

tore

outp

ut

in a

variety

of

pro

prieta

ry f

orm

ats

•H

inders

data

sharin

g

•M

akes w

riting v

en

dor-

ne

utr

al softw

are

difficult

Solu

tion:

de

velo

p a

ve

nd

or-

neutr

al ope

n f

orm

at

•th

at

all

vend

or

soft

ware

can w

rite

•ven

dor-

ne

utr

al soft

ware

can r

ead a

nd w

rite

Contr

olle

d V

ocabula

ry

Much o

f th

e m

eta

data

encoded in t

he m

zM

L is in t

he f

orm

of

cvP

ara

ms,

an X

ML e

lem

ent

that

pro

vid

es a

refe

rence t

o a

specific

concept

within

the P

SI

MS

contr

olle

d v

ocabula

ry.

Ea

ch t

erm

has a

n e

xplic

it a

nd d

eta

iled d

efinitio

n,

an

d m

ay h

ave i

nfo

rmation

about

its d

ata

type a

nd w

ha

t kin

d o

f units it

requires,

if a

ny.

The c

ontr

olle

d v

ocabula

ry is e

dited in

OB

O f

orm

at

with t

he O

BO

-Edit

soft

ware

and i

s r

ead i

n b

y m

ost

readers

and w

rite

rs o

f m

zM

L.

The

contr

olle

d v

ocabula

ry c

an b

e e

asily

ad

juste

d a

nd e

xte

nded

without m

odifyin

g the s

chem

a.

Exam

ple

Insta

nce D

ocum

ents

In o

rder

to e

xerc

ise the s

chem

a a

nd d

em

onstr

ate

that th

e v

arious

use c

ases h

ave b

een a

dequate

ly m

odele

d, w

e h

ave d

evelo

ped

severa

l exam

ple

insta

nce d

ocum

ents

. S

om

e o

f th

e d

ocum

ents

are

hand-c

rafted w

ith a

n o

rdin

ary

editor,

wh

ile o

thers

are

written

out

as a soft

ware

te

st

as part

of

the P

rote

oW

izard

refe

rence im

ple

menta

tion.

In additio

n,

severa

l in

sta

nce docum

ents

are

convers

ions o

f re

al data

file

s u

sin

g the r

efe

rence o

r oth

er

imple

menta

tions o

f convert

ers

.

Docum

enta

tion

The f

ull

specific

ation o

f th

e f

orm

at

is p

resente

d i

n a

specific

ation d

ocum

ent

that

described v

arious

aspects

of

the f

orm

at

as w

ell

as a

ll deta

ils o

f th

e f

orm

at

ele

ments

. H

TM

L a

nd P

DF

docum

enta

tion

are

genera

ted

by

pro

gra

mm

atically

com

bin

ing

4

diffe

rent

com

ponents

: xsd

schem

a,

contr

olle

d

vocabula

ry, sem

antic v

alid

ation m

appin

g file

, exam

ple

docum

ents

.

Pro

teoW

izard

C+

+ lib

rary

refe

rence im

ple

menta

tion:

reads m

zM

L a

nd w

rite

s m

zM

L.

Can c

onvert

mzX

ML a

nd R

AW

to m

zM

L.

RA

MP

(R

andom

Access M

inim

al P

ars

er)

C lib

rary

can r

ea

d m

zM

L, m

zX

ML,

mzD

ata

file

s v

ia s

am

e A

PI.

mzM

L r

eadin

g p

erf

orm

ed w

ith P

rote

oW

izard

Tra

ns P

rote

om

ic P

ipelin

e (

TP

P)

can r

ead a

nd p

rocess d

ata

in m

zM

Lfo

rmat

via

RA

MP

ISB

form

at convert

ers

: R

eA

dW

(Therm

o),

Wolf (

Wate

rs),

mzW

iff(A

BI/

MD

S),

Tra

pper

(Agile

nt)

Therm

o F

isher

beta

RA

W →

mzM

L c

onvert

er

Sem

antic V

alid

ato

r an

d J

ava lib

rary

reads a

nd v

alid

ate

s t

hat

a d

ocum

ent is

sem

antically

corr

ect

Phen

yx

searc

h e

ngin

e c

an r

ea

d a

nd s

earc

h s

pectr

a in m

zM

L f

orm

at

NC

BI

C+

+ m

zM

L r

ead

er

cla

sses

Insili

cos V

iew

er

–file

bro

wser

and

spectr

um

vie

wer

can r

ead a

nd d

ispla

y s

pectr

a f

rom

mzM

L, m

zX

ML,

mzD

ata

, R

AW

form

ats

SeeM

Sfile

bro

wser,

spectr

um

vie

wer,

chro

mato

gra

m v

iew

er,

ann

ota

ter

for

mzM

L, m

zX

ML

Pro

teio

sS

oft

ware

En

vironm

ent

inclu

des c

onvert

ers

for

peak lis

ts o

f various f

orm

ats

to m

zM

L a

nd p

erf

orm

s r

eadin

g o

f m

zM

L f

iles

InS

ilicoS

pectr

oope

n s

ourc

e lib

rary

(P

erl)

has a

spectr

um

file

form

at convers

ion t

ool th

at

reads m

zM

L

Herm

es m

zM

L ↔

mzD

ata

↔m

zX

ML c

onvert

er

(Java)

Onlin

e valid

ato

r at

the P

roD

aC

site allo

ws

an

yone

to

uplo

ad

any

file

and

perf

orm

sem

antic v

alid

ation o

n a

ny m

zM

L f

ile

A

dow

nlo

adable

vers

ion

of

the

sem

antic

valid

ato

r ca

n

be

run

locally

on

an

y

pla

tform

support

ing J

ava to v

alid

ate

file

s w

ithout needin

g t

o tra

nsm

it them

to a

rem

ote

web s

ite.

On-lin

e v

alidato

r

Local validato

r

New

term

requests

may b

e e

maile

d t

o:

psid

ev-m

s-v

ocab@

lists

.sourc

efo

rge.n

et

Sem

antic v

alidato

r: c

oncept

Onto

log

y A

ccess

com

pone

nt

Valid

ato

r la

yer

Actu

al valid

ato

r im

ple

menta

tion

CV

rule

read

er

com

pone

nt

OLS

–htt

p:/

/ww

w.e

bi.ac.u

k/o

ls

OB

O f

ile

CV

mappin

g f

ile

Xp

ath

based

XM

L in

de

xer

com

pone

nt

XM

L f

ile t

o v

alid

ate

Onto

log

y c

onfig

file

Obje

ctr

ule

sfile

Seman

tically

correc

tfile

Sem

antic

ally

inco

rrec

t file

OB

O-E

dit is

used to

m

ain

tain

th

e contr

olle

d vocabula

ry:

org

aniz

e the s

tructu

re, add n

ew

term

s, update

defin

itio

ns.

The c

ontr

olle

d v

ocabula

ry i

s e

asily

pars

ed b

y s

oft

ware

, such

as r

eader

and w

rite

r soft

ware

, as w

ell

as c

entr

al

vocabula

ry

serv

ices, lik

e the O

nto

log

y L

ookup S

erv

ice w

eb s

ite

.

Main

tenance o

n m

zM

L w

ill c

ontinue

with the P

SI M

ass S

pectr

om

etr

y

Sta

ndard

s W

ork

ing G

roup. It is

expecte

d that th

e s

chem

a w

ill r

em

ain

sta

ble

, but m

inor

update

s to the

contr

olle

d v

ocabula

ry a

nd s

em

antic

valid

ation r

ule

s m

ay b

e n

ecessary

.

mzM

L

run

spectr

um

spectr

um

Description

bin

ary

Data

Arr

ay

bin

ary

Data

Arr

ay

••

pre

curs

orL

ist

scan

spectr

um

Lis

t

••

•spectr

um

spectr

um

cvLis

t

refe

renceable

Para

mG

roupLis

t

sam

ple

Lis

t

acquis

itio

nS

ett

ingsLis

t

data

Pro

cessin

gLis

t

soft

ware

Lis

t

instr

um

entC

onfigura

tionLis

t

chro

mato

gra

mLis

t

••

•chro

mato

gra

m

chro

mato

gra

m

chro

mato

gra

m

bin

ary

Data

Arr

ay

bin

ary

Data

Arr

ay

Each s

pectr

um

conta

ins a

header

with s

can info

rmation

and o

ptionally

pre

curs

or

info

rmation, fo

llow

ed b

y t

wo o

r

more

base64-e

ncoded b

inary

data

arr

ays.

Chro

mato

gra

ms m

ay b

e e

ncoded in m

zM

L in a

specia

l

ele

ment th

at conta

ins o

ne o

r m

ore

cvP

ara

ms

to d

escri

be

the t

ype o

f chro

mato

gra

m, fo

llow

ed b

y t

wo b

ase64-

encoded b

inary

data

arr

ays.

mzM

L m

ay b

e e

nclo

sed in a

specia

l in

dexin

g w

rapper

schem

a to a

llow

random

access into

the file

, allo

win

g

soft

ware

to p

ull

out one o

r m

ore

arb

itra

ry s

pectr

a.

C+

+ lib

rary

, w

ith m

odula

r desig

n f

or

testa

bili

ty a

nd e

xte

nsib

ility

build

s w

ith n

ative c

om

pile

rs o

n a

ll m

ajo

r pla

tform

s (

MS

VC

on W

indow

s,

gcc

on L

inu

x,

XC

od

eon O

SX

)

ope

n s

ourc

e lic

ense s

uitable

for

both

acad

em

ic a

nd c

om

merc

ial pro

jects

(A

pache v

2)

inte

rnal d

ata

model is

a o

ne-t

o-o

ne t

ransla

tion o

f m

zM

L d

ata

ele

ments

to C

++

data

str

uctu

res

plu

g-in R

ead

er

inte

rface f

or

rea

din

g o

f both

op

en a

nd v

end

or

pro

prie

tary

data

form

ats

: m

zX

ML,

Therm

o R

AW

, M

GF

, w

ith m

ore

Read

ers

in d

evelo

pm

ent

msconvert

tool pro

vid

es g

enera

l file

form

at convers

ion,

inclu

din

g n

ative c

entr

oid

ing

an

d z

libcom

pre

ssio

n

SeeMS

and mspicture

vis

ualiz

ation t

ools

allo

ws v

isualiz

ation o

f m

ass s

pec d

ata

used b

y R

AM

P a

nd T

PP

for

mzM

L s

upp

ort

CLI

bin

din

g a

llow

s u

se f

rom

.N

ET

lang

ua

ges (

C+

+/C

LI, C

#,

VB

.NE

T);

SW

IG b

indin

gs f

or

scripting (

from

Java,

Pyth

on,

Perl,

R)

in d

evelo

pm

ent.

an "

Applic

ation N

ote

" d

escribin

g P

rote

oW

izard

has b

een a

ccepte

d f

or

public

ation b

y t

he journ

al B

ioin

form

atics

http://p

rote

ow

izard

.sourc

efo

rge.n

et

--develo

per

contr

ibutions a

re w

elc

om

e!

The P

rote

oW

izard

soft

ware

pro

ject, in

itia

ted b

y th

e S

pie

lberg

F

am

ily C

ente

r fo

r A

pplie

d P

rote

om

ics at

the C

edars

-Sin

ai

Medic

al

Cente

r, p

rovid

es a

modula

r and e

xte

nsib

le s

et

of

op

en-s

ourc

e,

cro

ss-p

latform

tools

and l

ibra

ries.

The

tools

perf

orm

pro

teom

ics d

ata

analy

ses;

the l

ibra

ries e

nable

rapid

too

l cre

ation b

y p

rovid

ing a

robust, p

luggable

develo

pm

ent

fram

ew

ork

that

sim

plif

ies a

nd u

nifie

s

data

file

access,

and p

erf

orm

s s

tandard

chem

istr

y a

nd

LC

MS

data

set

com

puta

tions.

During t

he f

inal

sta

ges o

f m

zM

L d

evelo

pm

ent,

refinem

ent, a

nd testing, th

e P

rote

oW

izard

libra

ry h

as p

rovid

ed the testing a

nd r

efe

rence im

ple

menta

tion

of m

zM

L.

analy

sis

pe

ak

de

tectio

n 1

pre

pro

ce

ssin

g

tools

Se

eM

Sm

sco

nve

rt

data

da

ta

str

uctu

res

da

ta f

orm

at

ab

str

actio

n

utility

ma

thp

ars

ing

en

co

din

gte

stin

g

da

ta

extr

actio

np

ea

k

de

tectio

n 2

msa

cce

ss

msp

ictu

re

Hig

h le

vel a

rch

itec

ture

of

Pro

teo

Wiz

ard

Hyperl

inks

to

various

exam

ple

docum

ents

on

the

mzM

L d

eve

lopm

ent w

eb s

ite.

A s

am

ple

snip

pet

of

an e

xam

ple

mzM

L d

ocum

ent, s

how

ing

the top h

eader

port

ion o

f th

e f

ile.

Part

of

the

HT

ML

docum

enta

tion

page

for

mzM

L,

show

ing

the

availa

ble

docum

enta

tion for

the <

sourc

eF

ile>

ele

men

t.

Title

page

of

the

full

mzM

L

Specific

ation D

ocum

ent

Na

tive

IDre

fere

nces b

ack t

o the

ori

gin

al sca

n r

efe

ren

ce

s in

th

e

so

urc

e d

ata