
Data production using CernVM and LxCloud

Dag Toppe Larsen

Warsaw, 2014-02-11


Outline

● CernVM/LxCloud data production
● Automatic data production
● Data production management
● Production database
● Web interface


CernVM cluster at LxCloud

● Requested and obtained new “NA61” project on final production LxCloud service
  ● Same quota (200 VCPUs/instances) as before
  ● Access controlled by new e-group “na61-cloud”
  ● Migration completed

● Software currently used:
  ● Legacy: 13e
  ● Shine: v0r5p0
  ● Software, databases & calibration data distributed via CvmFS

● Mass production of BeBe160 (11_040) to compare/validate against latest LxBatch mass production
  ● If results are “similar”, CernVM validation is considered successful


Test production

● Recently, a new BeBe160 test production was submitted to CernVM running on LxCloud

● Job description file created by automatic data production manager
  – But manually submitted to CernVM cluster

● Output written to
  /castor/cern.ch/na61/prod/Be_Be_158_11/040_13e_v0r5p0_pp_cvm2_phys

● To be compared to
  /castor/cern.ch/na61/11/prod/13E040

● (Same legacy, shine, global key, mode)

● For some reason, legacy software does not enter event loop (next slide)
  ● Shine part of processing appears to work OK though


CernVM production error

● Should have got:

    <StdUnmark:> Unmarking...
    DSPACK 1.602, 1 Aug 2007 (dswrite, server: dag_28311_lxplus0099)
    Staging dataset: bos:/afs/cern.ch/work/d/dag/test/run-014923x023.bos
    DSPACK 1.602, 1 Aug 2007 (dsopen, server: dag_28311_lxplus0099)
    Input file: /tmp/R.28582.fifo
    Read definitions
    DSPACK 1.602, 1 Aug 2007 (dsread, server: dag_28311_lxplus0099)
    Read one event
    ________________________________________________________________________________
    Run: 14923 Event: 1896087552
    ________________________________________________________________________________

● But got:

    <StdUnmark:> Unmarking...
    DSPACK 1.602, 1 Aug 2007 (dswrite, server: na61_31426_server-31edd847-e7c4-4a9a-a968-d52264f87fed)
    Staging dataset: bos:/home/condor/execute/dir_31365/run-014923x028/run-014923x028.bos
    DSPACK 1.602, 1 Aug 2007 (dsopen, server: na61_31426_server-31edd847-e7c4-4a9a-a968-d52264f87fed)
    Input file: /tmp/R.31686.fifo
    DS_OPEN_TOOL Error: No definition block
    Finishing....
    DSPACK 1.602, 1 Aug 2007 (dskill, server: na61_31426_server-31edd847-e7c4-4a9a-a968-d52264f87fed)

● What does “DS_OPEN_TOOL Error: No definition block” mean?

● Did not get this error when producing data on CernVM in the past

● If the exact same production script is run on LxPlus, using software from CvmFS (also mounted on LxPlus/LxBatch), it works fine

● Some missing file (that is found on AFS in the case of LxPlus)?
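A check of this kind could be automated when comparing logs from the two environments. The following is a minimal sketch, assuming the legacy log is plain text with one message per line, that a line "Read one event" indicates the event loop was entered (as in the working log), and that errors follow the "<TOOL> Error: <message>" shape seen in the failing log. The function and constant names are hypothetical and not part of the NA61 software.

```python
import re

# Marker taken from the working log; treating it as proof that the
# event loop was entered is an assumption of this sketch.
EVENT_LOOP_MARKER = "Read one event"

# Matches lines such as "DS_OPEN_TOOL Error: No definition block".
ERROR_PATTERN = re.compile(r"^(\w+) Error: (.*)$", re.MULTILINE)


def check_legacy_log(log_text):
    """Summarise a legacy-software log: event loop reached, and errors found."""
    return {
        "entered_event_loop": EVENT_LOOP_MARKER in log_text,
        "errors": ERROR_PATTERN.findall(log_text),  # list of (tool, message)
    }


if __name__ == "__main__":
    failing = (
        "Input file: /tmp/R.31686.fifo\n"
        "DS_OPEN_TOOL Error: No definition block\n"
        "Finishing....\n"
    )
    print(check_legacy_log(failing))
```

Running the same function on logs from LxPlus and from CernVM would show at a glance which jobs never reached the event loop and which error line preceded the failure.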


Automatic data production


Production DB

● Production DB has grown a bit beyond what was originally intended
  ● Difficult to work with the production information without a proper SQL database
  ● Tedious to access information from Castor and bookkeeping DB
  ● Elog data not always consistent (needed to be standardised)
    – Elog data needed as input for data production (magnetic field)

● Created an SQLite DB with three tables: runs, productions and chunkproductions
  ● To contain information about all runs as well as all produced chunks
    – After importing information from bookkeeping DB (elog), productions can be initiated without first querying the bookkeeping DB
    – For transferring information back to bookkeeping DB after production, propose to run SQL queries from bookkeeping DB to retrieve relevant information

● Imported elog information can be used to select data for processing/analysis via web interface (target in/out, trigger, run length, “ok”, etc.)


Production DB schema

● runs
  ● All information for given run
  ● Primary key: run
  ● Fields target, beam, momentum define the reaction the run belongs to
  ● Information imported from elog via bookkeeping DB
  ● Most fields are obtained by “data-mining” elog
  ● Field elog contains original elog entry

● productions
  ● All information for given production
  ● Combination of target, beam, momentum, year, key, legacy, shine, mode, os, source, type should be unique
  ● Primary key: production (automatically generated ID)

● chunkproductions
  ● All produced chunks
  ● One row per produced chunk
  ● Primary key: production, run, chunk
  ● Table has potential to contain order of ~10^6 rows
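The three-table layout can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the column names are taken from the slides, but the column types, the foreign-key references, the UNIQUE constraint and the helper names `create_db` and `register_production` are assumptions, and the size_* and error_* columns are left out.

```python
import sqlite3

# Column names follow the slides; types and constraints are assumptions.
SCHEMA = """
CREATE TABLE runs (
    run      INTEGER PRIMARY KEY,
    target   TEXT,
    beam     TEXT,
    momentum INTEGER,
    year     INTEGER,
    ok       INTEGER,
    elog     TEXT              -- original elog entry, kept for reprocessing
);
CREATE TABLE productions (
    production INTEGER PRIMARY KEY AUTOINCREMENT,
    target TEXT, beam TEXT, momentum INTEGER, year INTEGER,
    "key" TEXT, legacy TEXT, shine TEXT, mode TEXT,
    os TEXT, source TEXT, type TEXT,
    UNIQUE (target, beam, momentum, year, "key", legacy, shine,
            mode, os, source, type)
);
CREATE TABLE chunkproductions (
    production INTEGER REFERENCES productions(production),
    run        INTEGER REFERENCES runs(run),
    chunk      INTEGER,
    rerun      INTEGER DEFAULT 0,
    status     INTEGER DEFAULT 0,
    PRIMARY KEY (production, run, chunk)
);
"""


def create_db(path=":memory:"):
    """Create an empty production database and return the connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def register_production(conn, **fields):
    """Insert one production row and return its auto-generated ID."""
    cols = ", ".join(f'"{name}"' for name in fields)
    marks = ", ".join("?" for _ in fields)
    cur = conn.execute(
        f"INSERT INTO productions ({cols}) VALUES ({marks})",
        tuple(fields.values()),
    )
    return cur.lastrowid
```

With the UNIQUE constraint in place, registering the same parameter combination twice raises `sqlite3.IntegrityError`, which is one way to enforce the "should be unique" rule at the database level rather than in application code.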


runs table

● Contains all information for given run
● Fields beam, target, momentum & year define which reaction the run belongs to
● Information imported from eLog via bookkeeping database
  ● All eLog information for all runs is imported
● eLog information is processed and stored in separate fields
  ● Including fields defining the reaction
  ● Original eLog entry also stored to allow later reprocessing
● Field “ok” intended to mark runs that are “ok” or not
  ● To be set manually


chunkproductions table

● Stores all chunks produced
● Associated to production, run and chunk
● Has potential to contain order of 10^6 rows
  ● By far largest table in DB
  ● Potential performance issue
  ● Only using numerical values

Fields:

● production: e.g. 1
● run: e.g. 123456
● chunk: e.g. 123
● rerun: number of times chunk has failed and been reprocessed
● status: waiting / processing / checking / ok / failed (numeric values)
● size_*: size of output files
● error_*: number of errors of given type found in log file from latest processing
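The status, rerun and error_* fields together describe a small state machine for each chunk. The sketch below shows one way it could look; the numeric codes, the `max_reruns` limit and the function name `after_check` are assumptions made for illustration, since the slides only state that the status is stored numerically.

```python
from enum import IntEnum


class Status(IntEnum):
    # The five states come from the slides; the numeric codes are assumed.
    WAITING = 0
    PROCESSING = 1
    CHECKING = 2
    OK = 3
    FAILED = 4


def after_check(status, rerun, error_counts, max_reruns=3):
    """Return (new_status, new_rerun) for a chunk whose log was just checked.

    error_counts maps an error type to the number of occurrences found in
    the latest log, mirroring the error_* columns.
    """
    if status != Status.CHECKING:
        return status, rerun            # nothing to decide yet
    if sum(error_counts.values()) == 0:
        return Status.OK, rerun         # clean log: chunk is done
    if rerun < max_reruns:
        return Status.WAITING, rerun + 1  # queue the chunk for reprocessing
    return Status.FAILED, rerun         # give up after too many attempts
```

Storing the status as a small integer keeps the largest table compact, while the enum keeps the code that reads and writes it legible.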


productions table

● A unique combination of target, beam, momentum, year, key, legacy, shine, mode, os, source, type is a production
● Primary key: production
  ● Auto-generated unique number

Fields:

● production: e.g. 1
● target: e.g. Be
● beam: e.g. Be
● momentum: e.g. 158
● year: e.g. 11
● key: e.g. 040
● legacy: e.g. 13c
● shine: e.g. v0r5p0
● mode: e.g. pp
● os: e.g. slc5
● source: e.g. phys (sim)
● type: e.g. prod (test)
● path_in: path of the raw files used
● path_out: path for output files
● path_layout: how output files are stored under path_out
● description: free text


Automated data production system commands

./na61prod
Usage: ./na61prod <command> <key=value>

<command> one of:
    elogImport    - import all elog information from bookkeeping
    elogConvert   - process elog information and fill database
    setProduction - register new production in database
    produce       - start new production
    check         - check, resubmit and update database for errors
    setRunOk      - mark runs as OK
<key=value> any of:
    runs     - list and/or range of runs [all]
    type     - prod or test [prod]
    beam     - beam type (no default value)
    target   - target type (no default value)
    momentum - beam momentum (no default value)
    year     - year of data taking (no default value)
    key      - global key (no year) [latest]
    legacy   - version of legacy software [latest]
    shine    - version of Shine software [latest]
    mode     - pp or pA [pp]
    os       - cvm2 or slc6 [cvm2]
    source   - phys or sim [phys]
    path_in  - path to data (for sim. data) [root://castorpublic.cern.ch//castor/cern.ch/na61]
    comment  - free-text production comment []
    ok       - 0 or 1 [1]

<command> one of:
    setNameValue  - set possible value for key-value pair
<key=value> any of:
    name  - type, legacy, shine, mode, os, source, path_in, path_out or path_layout
    value - value corresponding to name []
    pref  - preferred value, 0 or 1 [1]

The system will choose [default] values for keys that are not set.


Data production command usage

● na61prod command=elogImport runs=7000-18000
  ● Obtains eLog information for all runs in this range

● na61prod command=elogConvert runs=all
  ● Processes imported eLog information and fills the relevant fields in the runs table

● na61prod command=setProduction beam=Be target=Be momentum=158 year=11 comment="New TPC calibration data."
  ● Registers a new production in the productions table using default values

● na61prod command=setProduction beam=Be target=Be momentum=158 year=11 key=044 type=test os=slc6 mode=pA path_in=root://castorpublic.cern.ch//castor/na61/path/to/simulated/data source=sim runs=14866-14880,14888,14890,15000-15011 comment="Check of simulated data before mass production."
  ● Registers a new production with fewer default values

● na61prod command=produce beam=Be target=Be momentum=158 year=11
  ● Initiates automatic job submission for the reaction (must already be registered using the setProduction command)

● na61prod command=check
  ● Checks completed jobs for errors, resubmits as needed, updates database. Intended to be run from a cron job.

● na61prod command=setNameValue name=shine value=v0r8p0
  ● Registers a new version of Shine in the name/value database, and makes it the default choice for future productions.
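The key=value convention and the runs list/range syntax used in these commands can be handled with a few lines of parsing. The following is a hedged sketch, not the actual na61prod implementation: the function names `parse_runs` and `parse_args` are hypothetical, and only the defaults that the usage text states literally are included (the "[latest]" defaults would need a database lookup).

```python
# Defaults copied from the usage text; "latest" defaults are omitted because
# resolving them would require the name/value table.
DEFAULTS = {"runs": "all", "type": "prod", "mode": "pp",
            "os": "cvm2", "source": "phys", "ok": "1"}


def parse_runs(spec):
    """Expand e.g. '14866-14868,14888' into a sorted list of run numbers.

    Returns None for 'all', meaning no restriction on runs.
    """
    if spec == "all":
        return None
    runs = set()
    for part in spec.split(","):
        if "-" in part:
            first, last = part.split("-", 1)
            runs.update(range(int(first), int(last) + 1))  # inclusive range
        else:
            runs.add(int(part))
    return sorted(runs)


def parse_args(argv):
    """Turn ['command=produce', 'beam=Be', ...] into a dict with defaults."""
    args = dict(DEFAULTS)
    for item in argv:
        key, sep, value = item.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {item!r}")
        args[key] = value
    return args


if __name__ == "__main__":
    args = parse_args(["command=setProduction", "beam=Be", "target=Be",
                       "runs=14866-14868,14888"])
    print(args["os"], parse_runs(args["runs"]))
```

Using `str.partition` keeps any further "=" characters inside the value intact, which matters for values such as the root:// path given to path_in.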


Automatic data production manager status

● Can generate the files needed for submitting jobs (both LxBatch & CernVM)

● Now uses native SQLite language bindings for better performance

● Name/value pair table implemented to store allowed/default values for production parameters

● Part being worked on:
  ● Automatic submitting/checking/resubmitting of jobs
  ● Not “difficult”, but rather “tedious”

● Plan to soon use it for productions (both LxBatch and CernVM)


Web interface


● Web interface to production DB
  ● http://cern.ch/na61cld/cgi-bin/prod
  ● Experimenting with best interface/usability for different use cases
  ● Currently can only display information
    ● Will add ability to log in for starting productions, etc.
  ● Working on script that will import information about already existing productions into database

● Can generate list of chunks from set of filtering criteria
  ● The link below generates a list of chunks for BeBe13 (2012), marked as “ok”, with target in and GTPC on, for the production with global key “12_021”, legacy “13f”, shine “v0r6p0”, mode “pp”, from physics data, with data in mini shoe format from EOS
  ● http://cern.ch/na61cld/cgi-bin/chunkli?productions.key=021&productions.legacy=13f&productions.mode=pp&productions.os=slc6&productions.shine=v0r6p0&productions.source=phys&productions.type=prod&runs.beam=Be&runs.momentum=13&runs.ok=1&runs.status_gtpc=1&runs.target=Be&runs.target_position=1&runs.year=12&format=minishoe.root&location=cern_eos
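The chunk-list link is an ordinary query string of table.field=value filters, so it can be assembled programmatically. A minimal sketch using Python's standard library follows; the parameter names and values are copied from the link above, while the helper name `chunk_list_url` is hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "http://cern.ch/na61cld/cgi-bin/chunkli"


def chunk_list_url(filters, base=BASE_URL):
    """Build a chunk-list URL from an ordered mapping of filter criteria."""
    return base + "?" + urlencode(filters)


# Filters reproducing the BeBe13 (2012) example from the slide.
BEBE13_FILTERS = {
    "productions.key": "021",
    "productions.legacy": "13f",
    "productions.mode": "pp",
    "productions.os": "slc6",
    "productions.shine": "v0r6p0",
    "productions.source": "phys",
    "productions.type": "prod",
    "runs.beam": "Be",
    "runs.momentum": "13",
    "runs.ok": "1",
    "runs.status_gtpc": "1",
    "runs.target": "Be",
    "runs.target_position": "1",
    "runs.year": "12",
    "format": "minishoe.root",
    "location": "cern_eos",
}

if __name__ == "__main__":
    print(chunk_list_url(BEBE13_FILTERS))
```

Changing a single entry, for instance `runs.momentum`, gives the corresponding list for another reaction without editing the URL by hand.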





General plan forward

● Complete the CernVM BeBe160 test production

● Finish the automatic submission/checking/resubmit of jobs for automatic data production manager

● Add possibility to submit jobs from web interface


Proposal to migrate software, calibration data & databases to CvmFS

● CvmFS is based on the HTTP protocol
  ● Distributed globally via a hierarchy of cache servers
  ● Files are compressed on server side
  ● Downloaded on demand, decompressed and semi-permanently cached on client side
    – A bit slow the first time a piece of software is run (to allow for software download), but at native speed on later runs
  ● Originally developed to distribute software to CernVM virtual machines
  ● Has gained popularity on conventional (non-virtualised) computing clusters as well
    – Is now the preferred way of distributing software for e.g. ATLAS & CMS

● Proposal: migrate NA61/SHINE software, calibration data & databases to CvmFS, also for LxBatch/LxPlus processing
  ● Tedious to maintain two parallel installations on CvmFS & AFS
  ● CvmFS is mounted on all LxPlus/LxBatch machines: /cvmfs/na61.cern.ch/
  ● External dependencies (e.g. ROOT) are available from /cvmfs/sft.cern.ch/
  ● Will make it possible to use identical production scripts on CernVM & LxBatch
  ● Will globally distribute NA61 software, i.e. NA61 processing at e.g. Belgrade will not require a separate NA61 software installation, but can mount /cvmfs/na61.cern.ch/ directly
  ● CvmFS can be mounted directly on any computer (both virtualised and non-virtualised) by installing the FUSE drivers, allowing the “official” installation of Shine to run directly on laptops anywhere
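A production script shared between CernVM and LxBatch could start by verifying that the CvmFS repositories it depends on are reachable. The sketch below assumes only the two mount points named above; the function name `missing_repositories` is hypothetical, and the check is a plain directory test rather than anything CvmFS-specific.

```python
import os

# Mount points named on the slide.
REQUIRED_REPOSITORIES = ["/cvmfs/na61.cern.ch", "/cvmfs/sft.cern.ch"]


def missing_repositories(paths=REQUIRED_REPOSITORIES):
    """Return the subset of paths that are not accessible as directories."""
    return [path for path in paths if not os.path.isdir(path)]


if __name__ == "__main__":
    missing = missing_repositories()
    if missing:
        print("CvmFS repositories not available:", ", ".join(missing))
    else:
        print("All required CvmFS repositories are mounted.")
```

Failing early with a clear message is cheaper than discovering a missing repository from an obscure error deep inside a reconstruction job.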
