the bookkeeping sql api

23
8th December 200 4 Tim Adye 1 The Bookkeeping The Bookkeeping SQL API SQL API Tim Adye Rutherford Appleton Laboratory Bookkeeping / Data Distribution Parallel BaBar Collaboration Meeting 8 th December 2004

Upload: dyllis

Post on 16-Jan-2016

40 views

Category:

Documents


0 download

DESCRIPTION

The Bookkeeping SQL API. Tim Adye Rutherford Appleton Laboratory Bookkeeping / Data Distribution Parallel BaBar Collaboration Meeting 8 th December 2004. Talk Plan. The problem “Why not just write SQL?” The BaBar SQL API User view: Perl classes and command-line tool Behind the scenes - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: The Bookkeeping SQL API

8th December 2004

Tim Adye 1

The BookkeepingThe BookkeepingSQL APISQL API

Tim AdyeRutherford Appleton Laboratory

Bookkeeping / Data Distribution ParallelBaBar Collaboration Meeting

8th December 2004

Page 2: The Bookkeeping SQL API

8th December 2004

Tim Adye 2

Talk Plan

• The problem• “Why not just write SQL?”

• The BaBar SQL API• User view: Perl classes and command-line tool• Behind the scenes• Table schema configuration classes• Summary of features

• Could this be generalised to other applications?• Possible improvements• Comparison with other DBIx packages

• Summary and references

Page 3: The Bookkeeping SQL API

8th December 2004

Tim Adye 3

User Access

• Users need to query database to find out what data to process• May also need other information

• eg. luminosity, run numbers, file sizes• Mostly select by dataset, but may need to limit further

• eg. only data available locally, taken at peak energy, excluding some problem datataking period

• Cannot expect users to know which tables to use, how to join them, or even the SQL syntax• Even worse if the schema change

• Cannot expect developers to code for all combinations of queries with all possible selections• Previous ad hoc tools (some mine!) tried to do this and

it was a nightmare, even for a simpler table structure

Page 4: The Bookkeeping SQL API

8th December 2004

Tim Adye 4

Page 5: The Bookkeeping SQL API

8th December 2004

Tim Adye 5

BaBar SQL API – user view

• Each column that users might want to query or select on is given a unique logical name – regardless of which table it lives in

• These names are used to specify query values$query->addValues('collection', 'gbytes');

and selections$query->addSelector('dataset', 'Dilepton-*');

$query->addSelector('run', '10000-19999');• Different types of data allow for different selection syntax,

eg. wildcards for names, or ranges for run numbers.

• Can also use SQL expressions (in terms of logical names)

$query->addValues('SUM(lumi)/1000');

and sorting, row limits, etc

Each of these happens to be ina different table

Page 6: The Bookkeeping SQL API

8th December 2004

Tim Adye 6

SQL API – returning results• That’s enough to generate a valid SQL SELECT

query. To return the results:-my $sta = $query->execute();

while (my $row = $sta->fetch()) {

print $row->gbytes(), $row->collection(), "\n";

}• The $query object collects the user requests• $query->execute() returns a “statement accessor” (like a

DBI statement handle).• $sta iterates over row objects, each of which has accessors

for each query value, gbytes and collection.

• That’s all there is to it!• After the usual DBI connect, and $query object

instantiation (see later), these statements form a working program

Page 7: The Bookkeeping SQL API

8th December 2004

Tim Adye 7

Command-Line Tools

• Standard BaBar tools use this API to create job configuration, create datasets, calculate luminosities, etc.• Standard tasks, but optionally allowing additional

selections

• Also provide an “expert” tool that allows access to full API functionality from the command line• This has proved very popular, with many “non-

experts” making their own unique queries

Page 8: The Bookkeeping SQL API

8th December 2004

Tim Adye 8

Examples

$ BbkUser --dataset=A0-Run4-OnPeak-R14 \ --is_local=1 --file_status=0 \ dse_lumi events gbytes file \

--style=adye --display

DSE_LUMI EVENTS GBYTES FILE

======== ====== ====== ====================================================

1250.3 526115 1.6 /store/PRskims/R14/14.4.0d/A0/02/A0_0239.01.root

1250.3 526115 0.8 /store/PRskims/R14/14.4.0d/A0/02/A0_0239.02HBCA.root

1348.4 576239 1.6 /store/PRskims/R14/14.4.0d/A0/02/A0_0240.01.root

1348.4 576239 1.0 /store/PRskims/R14/14.4.0d/A0/02/A0_0240.02HBCA.root

...

156 rows returned

$ BbkUser –-collection-file=coll.lis \

tot_gbytes collection

Page 9: The Bookkeeping SQL API

8th December 2004

Tim Adye 9

What happens behind your back

• The SQL API• translates the logical names to table columns• selects the required tables and joins

• including otherwise unused tables required for the joins

• generates and executes a valid SQL SELECT statement

• creates a statement accessor object• dynamically generates a class for the row objects

with accessors for each query value

Page 10: The Bookkeeping SQL API

8th December 2004

Tim Adye 10

Our Example

• That first BbkUser command involved 5 tables• including one that provides the join between dataset

and collection tables

SELECT dse.lumi_sum AS "dse_lumi", dse.output_nev AS "events",

file.bytes, dse.name AS "collection",

file.suffix AS "file_suffix", ds.id AS "ds_id",

dse.id AS "dse_id", dtd.id AS "dtd_id",

dtd.link_status

FROM bbk_dataset ds, bbk_dsentities dse, data_files dfile,

bbk_files file, bbk_dstodse dtd

WHERE ds.id=dtd.ds_id

AND dtd.dse_id=dse.id

AND dse.id=file.dse_id

AND file.id=dfile.file_id

AND ds.name='A0-Run4-OnPeak-R14'

AND dse.is_local='1'

AND dfile.status='0';

The SQL API can even

pretty-print it like this for you (What’s shown here is somewhat abbreviated: actual command includes full database and table names in case of ambiguities)

BbkUser --dataset=A0-Run4-OnPeak-R14 --is_local=1 --file_status=0

dse_lumi events gbytes file

Page 11: The Bookkeeping SQL API

8th December 2004

Tim Adye 11

Page 12: The Bookkeeping SQL API

8th December 2004

Tim Adye 12

Table schema configuration classes

• Mapping between logical names and table columns is defined in the configuration classes.• One class per table• Can also define special properties of each column (eg.

whether to allow ranges (“100-199”) for selection).

• Possible joins between tables defined here too• Use logical column names for join conditions, so one table

class does not need to know about column names in other classes.

• In most cases it’s just a matter of listing logical vs. column names• with a little Perl syntactic sugar

• Inheritence of config classes expresses commonalities• eg. common id and created, and modified columns

Page 13: The Bookkeeping SQL API

8th December 2004

Tim Adye 13

Example Table Configurationsub table { return 'bbk_files' }

sub tableConfig {

return {

alias => 'file',

columns => [

bytes => 'bytes‘,

uuid => 'uuid',

checksum => 'checksum',

file_suffix => 'suffix',

nfiles => 'COUNT(DISTINCT file_id)',

gbytes => '(bytes/1073741824)',

tot_gbytes => 'SUM(bytes)/1073741824',

file_dse_id => { column => 'dse_id', selectorType => 'range' },

file => { valueAction => 'addLfnValue', selectorAction => 'lfnSelector' },

],

joins => [

dse_id => 'file_dse_id',

file_id => 'dfile_file_id',

],

}}

Page 14: The Bookkeeping SQL API

8th December 2004

Tim Adye 14

Putting it all together

• Configuration classes must be registered with $query object

my $query = new BbkSqlSelect($bbkconfig);

$query->addModules(new MyTableClass($bbkconfig));

but of course it is usually simpler to provide a $query object pre-registered with all the table configs as part of a specific API.

Page 15: The Bookkeeping SQL API

8th December 2004

Tim Adye 15

Overriding and Synthetic Columns

• A crutial advantage of this system is that it allows us to override the default behaviour• Allows us to hide complexities from user• Make even complex schema changes transparent to users

• A logical name can refer to• ordinary database column name• SQL expression (in terms of database columns, or other logical names)

• Perl method to pre- or post-process selection or query value• “synthetic” query value or selection

• can return calculated value or alter behaviour

• Global post-processing• Can be triggered by value, selection, or table inclusion• Allows global filtering of returned rows

Page 16: The Bookkeeping SQL API

8th December 2004

Tim Adye 16

What happens behind your back 2

• We already used some of these features without noticing!1. Dataset names can be found in the bbk_dataset or the

bbk_aliases table• Requires a check and translation using the alias table

2. Datasets can evolve with time, with collection being added or removed• Need to query dataset for any time in the past, or use

tagged dataset alias (like a CVS tag)• Implemented by automatically including date selection in

query, and post-processing returned results to remove deleted collections

3. File names are made from a collection name + a suffix• $query->addSelector('file') splits the file name for the

query and the $row->file() accessor rejoins them

Page 17: The Bookkeeping SQL API

8th December 2004

Tim Adye 17

Features

• Supports Oracle 8 and MySQL 3.23• Most queries that can be expressed in both these dialects

can be expressed by users via the API – without breaking the paradigm of a flat namespace• aggregatation and grouping• sorting and distinct• MySQL’s LIMIT emulated in Oracle• inner and outer joins (generates Oracle or MySQL syntax)• Does not support UNION or subqueries

• Could be added, but not in MySQL 3.23

• Convenience features• automatic Getopt specification• query results display formatting and summary table

generation• Configuration class summary table generation

Page 18: The Bookkeeping SQL API

8th December 2004

Tim Adye 18

Limitations

• Assumes tables can be joined in a unique way• ie. the joins form an acyclic graph• can still select different joins with explicit switches

• Each column must have its own unique logical name• This is usually a good thing• but if the same data is held in different columns, it

would be more efficient to automatically select from tables that are already included

Page 19: The Bookkeeping SQL API

8th December 2004

Tim Adye 19

A public version• Current version has a few BaBar-specific pieces

• BaBar Connection/Configuration manager – can use DBI directly

• BaBar Options manager – can use Getopt directly• BaBar base objects – borrow required methods• BaBar table formatting class – publish this too… otherwise just uses standard Perl modules

but with different table configs could be used elsewhere• Already do this in BaBar – used for QA and TM databases• Maybe I’m making some other assumptions that are true of

our database and requirements, but not more generally so. I can’t think of any.

• Needs a better name!• This is really an SQL API creator

DBIx::SqlAbstractor ???

Page 20: The Bookkeeping SQL API

8th December 2004

Tim Adye 20

Possible improvements• Tidy up code!

• User and config APIs are OK, but in between it’s pretty ugly• Separate functionality that can be used on its own

• Already true of the DBI statement accessor class

• More SQL dialects: PostgreSQL, MS SQL?• New SQL syntax: subqueries, UNIONs,…• INSERT, UPDATE, etc

• these don’t need joins, so hand-coding not such a problem

• Automatic selection of different join possibilities• Automatic generation of default table classes from SQL

schema• Could use “The SQL Fairy”• though not much work to do it by hand

Page 21: The Bookkeeping SQL API

8th December 2004

Tim Adye 21

Why not use another package?

• More than 100 DBIx and other SQL access packages in CPAN

• Could not find any that do all (or even most) of• hide table structure from user• allow multi-table queries, taking care of joins

automatically• do not impose their own conventions on table schema• allow query values and selections to be overridden• allow transparent post-processing of query results• provide accessor functions for query results

• I believe that taken together these features provide a clear and easy to use abstraction

Page 22: The Bookkeeping SQL API

8th December 2004

Tim Adye 22

Feedback and Discussion

• Would this be useful outside BaBar?• Is it a good idea to make a public release?

• eg. on CPAN

• Does it need any improvements?• New features• Make it compatible with some other standards

• eg. sit on top of another abstraction like DBIx::Table

• A better name!

Page 23: The Bookkeeping SQL API

8th December 2004

Tim Adye 23

References• BaBar Bookkeeping project

http://slac.stanford.edu/BFROOT/www/Computing/Distributed/Bookkeeping/Documentation/

• BaBar Bookkeeping presentation and paperhttp://indico.cern.ch/contributionDisplay.py?contribId=338&sessionId=7&confId=0

D.A. Smith et al.,BaBar Book Keeping project –a distributed meta-data catalog of the BaBar event store,Proc. Computing in High Energy and Nuclear Physics 2004 (CHEP04).

• CPAN Database Interfaces (see particularly DBIx)

http://cpan.uwinnipeg.ca/chapter/Database_Interfaces