Transcript
Page 1: A Comparison of Data Analysis Packages

A comparison of data analysis packages

CHEP2000 9-Feb 2000

A Comparison of Data Analysis Packages

Irwin Gaines, Jeff Kallenbach

Fermilab

Page 2: A Comparison of Data Analysis Packages

Outline

• Introduction: a little history

• Build vs. Buy: general considerations

• User Requirements

• Basic Features

• Advanced features

• Conclusions

Page 3: A Comparison of Data Analysis Packages

Introduction

• Previous generation HEP experiments have used a ubiquitous homemade product: PAW

• Why? Commercial systems did not offer either functionality or, more important, performance

• Use of a universal product allows:

– data sharing (ntuple files)

– procedure and environment sharing (kumac files)

Page 4: A Comparison of Data Analysis Packages

Build vs. Buy

• Old days (70’s-80’s): in-house development effort is “free”; any software purchase is expensive

• More recently (90’s): attractive licensing terms; development costs should be amortized over as large a user base as possible. Support?

• Now: Consider full product lifetime costs, including development, licensing, support. Does product need to be customized or enhanced to meet HEP needs?


Page 5: A Comparison of Data Analysis Packages

Project Scope

Selecting events based on programmed selection criteria

Preparing various statistical distributions of various mathematical functions of data in the selected events

Linking in high level language programs to process event data prior to plotting

Modifying selection criteria and plotted functions interactively

Fitting the distributions

Comparing and performing calculations on different distributions

Preserving selection criteria and functions for later use or to pass to others

Saving samples of events in a variety of specialized formats for later analysis

Accessing these specially formatted event samples to make plots, fits, statistical outputs, etc.

Page 6: A Comparison of Data Analysis Packages

User Requirements

• Web reference: http://www.fnal.gov/projects/runii/pasrec/

• Data Access

• Data Analysis

• Data Presentation

• Usability

• Support and Maintenance

Page 7: A Comparison of Data Analysis Packages

User Requirements: Data Access

• Access rates (online)

• Access rates (offline)

• Serial vs. random access

• Granularity of access

• Foreign I/O Formats

• Specialized optimized output formats

Page 8: A Comparison of Data Analysis Packages

User Requirements: Data Analysis

• Scripting language

• User control

• Data selection

• Input/Output

• Numerical and mathematical functionality

• Offline compatibility

• Prototyping

Page 9: A Comparison of Data Analysis Packages

User Requirements: Data Presentation

• Interactive visualization

• Presentation quality graphical output

• Formal publication graphical output

Page 10: A Comparison of Data Analysis Packages

User Requirements: Usability

• Batch vs. interactive

• Sharing data structures

• Shared access by several clients

• Parallel processing (using distinct data streams)

• Debugging and profiling

• Modularity (user code)

• Modularity (system code)

• Access to source code

• Robustness

• Web based documentation

• Use of standards

• Portability

• Scalability

• Performance

• User Friendliness

Page 11: A Comparison of Data Analysis Packages

User Requirements: Support

• Maturity

– customer base

– product lifetime

– product survivability

• product support

• licensing

Page 12: A Comparison of Data Analysis Packages

User Requirements: Maintenance

• who provides maintenance

• what does it cost

• maintenance infrastructure

• maturity and completeness

• modularity

• portability

• standards

• reliability and security

• application specific issues

Page 13: A Comparison of Data Analysis Packages

Main Contenders

• Homemade package: ROOT

• Commercial Package: IDL (other commercial packages offer similar features; IDL appeared to be most aggressive in licensing terms)

Page 14: A Comparison of Data Analysis Packages

Basic Features

• plotting

• fitting

• event selection

• command languages

• event I/O

Page 15: A Comparison of Data Analysis Packages

Gee Whiz plots

Page 16: A Comparison of Data Analysis Packages

Plots, Fits, Event selection

• ROOT: from browser, from tree viewer, from command line

• All plots are active, can be manipulated, saved for later use, printed in a variety of formats

• IDL: command line examples on following slides

• plots can be either static or active, displayed or printed

Page 17: A Comparison of Data Analysis Packages

Displaying a Histogram

• Open a ROOT file

• Browse the file

• Display a histogram

• The Canvas

Page 18: A Comparison of Data Analysis Packages

Fitting, Coloring, and Zooming

• Adding a Gaussian fit

• Coloring the histogram

• Zooming

Page 19: A Comparison of Data Analysis Packages

The Tree Viewer

Tree Viewer buttons:

– Variables

– Slider

– XYZ

– Draw, Scan, Break

– Ilist, Olist

– Gopt

– Weight

Page 20: A Comparison of Data Analysis Packages

Scripting language

• ROOT

– CINT C++ interpreter (almost full C++ syntax)

– commands are methods of ROOT classes

– full access to compiled code (in any language)

• IDL

– “natural” control language (see examples)

– commands are part of scripting syntax

– full access to compiled code (in any language)

Page 21: A Comparison of Data Analysis Packages

IDL command language

chain=["d3_51.nhis","d3_68.nhis","d3_99.nhis","d3_19.nhis","d3_04.nhis"]

mass=htGetVar(chain,"Rmass")

cut4=where(lsig gt 5 and iso1 lt .05 and clsec gt .05 and iso2 lt .03)

plot,histogram(mass(cut4),binsize=mybin)

• concatenate several files of ntuples

• read in a variable

• event selection (cut on several variables)

• plot histogram
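
For readers without IDL, the selection and histogramming steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the column values are invented, only two of the four cut variables are used, and `where` and `histogram` are small stand-ins written here, not the IDL built-ins of the same name.

```python
# Hypothetical per-event columns; on the slide these come from htGetVar(chain, ...).
rmass = [1.72, 1.86, 1.87, 1.95, 1.88, 2.15]
lsig = [6.0, 7.5, 3.0, 9.1, 5.5, 8.0]
iso1 = [0.01, 0.02, 0.01, 0.10, 0.03, 0.04]


def where(mask):
    """Return the indices of the true entries, in the spirit of IDL's where()."""
    return [i for i, keep in enumerate(mask) if keep]


def histogram(values, binsize, lo):
    """Count values into bins of width binsize, the first bin starting at lo."""
    kept = [v for v in values if v >= lo]  # values below lo are ignored
    if not kept:
        return []
    counts = [0] * (int((max(kept) - lo) / binsize) + 1)
    for v in kept:
        counts[int((v - lo) / binsize)] += 1
    return counts


# Event selection: a cut on two variables (the slide cuts on four).
cut4 = where([s > 5 and i < 0.05 for s, i in zip(lsig, iso1)])
selected_mass = [rmass[i] for i in cut4]
counts = histogram(selected_mass, binsize=0.1, lo=1.7)
print(cut4, counts)
```

As on the slide, the cut is just a list of event indices, so the same selection can be reused for any other variable read from the same files.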

Page 22: A Comparison of Data Analysis Packages

IDL Command Language

• Fit plot and draw fit

• plot->liveplot for interactive plots

dist = histogram(mass(cut4),binsize=mybin)
x=findgen(134)*mybin+1.7
dfit=gaussfit(x,dist,a)
plot,x,dist
oplot,x,dfit,color=20
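
A rough Python counterpart to the fit-and-overlay step is sketched below. Note the swap in technique: instead of the least-squares fit that IDL's gaussfit performs, this sketch estimates the Gaussian parameters from the histogram moments, which is only a reasonable approximation for a clean, background-free peak. The function names and the assumption that `x` holds equally spaced bin centers are ours, not from the slides.

```python
import math


def gaussian_moments(x, counts):
    """Estimate (height, mean, sigma) of a Gaussian from binned counts.

    x holds equally spaced bin centers (an assumption of this sketch).
    """
    total = sum(counts)
    mean = sum(xi * c for xi, c in zip(x, counts)) / total
    variance = sum(c * (xi - mean) ** 2 for xi, c in zip(x, counts)) / total
    sigma = math.sqrt(variance)
    binsize = x[1] - x[0]
    # Peak height of a Gaussian whose area equals total * binsize.
    height = total * binsize / (sigma * math.sqrt(2.0 * math.pi))
    return height, mean, sigma


def gaussian(xi, height, mean, sigma):
    """Evaluate the Gaussian at xi, e.g. to overlay it on the histogram."""
    return height * math.exp(-0.5 * ((xi - mean) / sigma) ** 2)


x = [1.0, 2.0, 3.0]
dist = [1, 2, 1]
height, mean, sigma = gaussian_moments(x, dist)
dfit = [gaussian(xi, height, mean, sigma) for xi in x]
```

The list `dfit` plays the role of the curve drawn by oplot on the slide.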

Page 23: A Comparison of Data Analysis Packages

Page 24: A Comparison of Data Analysis Packages

Page 25: A Comparison of Data Analysis Packages

Reading ntuples with IDL

ht2IDL - An Interface between HEP Data files and IDL

As part of our investigation of the Interactive Data Language (IDL) for use in our environment, we have assembled a prototype of what we call ht2IDL (for "hepTuple to IDL"). This is a small package of C++ code and IDL procedure files which enables the user to access HEP data stores, such as HBOOK files, from the IDL session. It uses the HepTuple package from PAT.

How the package works

Like most modern tools, IDL provides the capability to interface with external functions written by the user. This is accomplished by writing some code, using a C-based interface, then compiling it and linking it into a shared-object file. Then, by creating some simple helper files for IDL, and starting IDL from the correct directory, where all of the new interface code lies, the user has access to all of the new functionality provided by the written code and the IDL "External Interface".

In our prototype, this was all accomplished on an SGI/IRIX system. In order to achieve maximum compatibility with the RunII environment, it was decided to use KCC. In principle there is no reason it should not work with CC or g++. Then, referring to the IDL External Developers' Guide, we wrote some code which uses the HepTuple library to read HBOOK files, load the data into data structures compatible with IDL, and then return them to the IDL session. The prototype provides an interface to the HBOOK files (using HepTuple), makefiles and some documentation on how to use them, and sample IDL scripts (called "procedure" files) to invoke the ht2IDL functions and display and manipulate the results.

http://patwww.fnal.gov/pas/idl/ht2idl.html

Page 26: A Comparison of Data Analysis Packages

Support Features

• Commercial products have excellent documentation, generally good support, but

– you pay for it

– hard to customize, usually don’t get source

• homemade products moving to free software support model (support by community)

– can modify source to enhance or customize

– relatively easy to use others’ code

• both require a local support organization

Page 27: A Comparison of Data Analysis Packages

ROOT How To’s

Page 28: A Comparison of Data Analysis Packages

Advanced Features

• Optimized I/O and very large data samples

• Using native user objects

• Customized GUIs

• Accessing over web

Page 29: A Comparison of Data Analysis Packages

Optimized I/O

• Two separate issues:

– data in memory vs. data on disk (efficient disk access necessary for large data files)

– can’t improve on disk speed unless objects that are read together are next to each other on disk (column wise n-tuple and generalizations)
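
The second point can be illustrated with a small Python sketch that stores the same events row-wise (one record per event) and column-wise (one block per variable). The event values, the three-variable layout, and the function names are assumptions made for the illustration; this is not the HBOOK or ROOT file format.

```python
import struct

# Hypothetical events: (Rmass, lsig, iso1) for each event.
events = [(1.86, 6.0, 0.01), (1.95, 9.1, 0.10), (1.88, 5.5, 0.03)]

RECORD = struct.Struct("<3d")  # row-wise layout: three doubles per event
DOUBLE_SIZE = struct.calcsize("<d")


def pack_row_wise(rows):
    """Store complete events one after another."""
    return b"".join(RECORD.pack(*row) for row in rows)


def pack_column_wise(rows):
    """Store all values of one variable together, variable by variable."""
    n = len(rows)
    return b"".join(struct.pack(f"<{n}d", *col) for col in zip(*rows))


def read_column_row_wise(buf, col, n):
    """Reading one variable means visiting every event record."""
    return [RECORD.unpack_from(buf, i * RECORD.size)[col] for i in range(n)]


def read_column_column_wise(buf, col, n):
    """Reading one variable is a single contiguous block of n doubles."""
    return list(struct.unpack_from(f"<{n}d", buf, col * n * DOUBLE_SIZE))


row_buf = pack_row_wise(events)
col_buf = pack_column_wise(events)
mass_from_rows = read_column_row_wise(row_buf, 0, len(events))
mass_from_cols = read_column_column_wise(col_buf, 0, len(events))
```

Both reads return the same masses, but the column-wise read touches one contiguous run of bytes while the row-wise read strides across the whole buffer, which is the access pattern that costs disk time on large files.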

Page 30: A Comparison of Data Analysis Packages

ROOT I/O

• Many years of struggle/experience to use disk based data

• optimized data formats for efficient access: CWNT (column-wise n-tuple) --> split trees

• Formats designed with HEP type data access in mind

Page 31: A Comparison of Data Analysis Packages

IDL I/O

• Basically memory based

• Associated I/O allows mapping an IDL array or structure variable onto a file:

– I/O occurs automatically when the associated variable is subscripted, accessing only the desired object

– data set size limited by file size rather than memory size

– direct access to each element in the file; including convenient event selection by indexing

– files can have multiple associated structures (full events, tracks, hits, etc)

– performance still limited by record structure
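
The idea behind associated I/O can be sketched in Python as a file-backed sequence of fixed-size records, where subscripting triggers a seek and a read of just that record. This is an analogy only, not how IDL implements it; the `AssociatedRecords` class, the record layout (an integer event id plus a double mass), and the file name are all hypothetical.

```python
import os
import struct
import tempfile


class AssociatedRecords:
    """Fixed-size records in a file, read only when subscripted."""

    def __init__(self, path, fmt):
        self._record = struct.Struct(fmt)
        self._file = open(path, "rb")
        self._count = os.path.getsize(path) // self._record.size

    def __len__(self):
        return self._count

    def __getitem__(self, index):
        if not 0 <= index < self._count:
            raise IndexError(index)
        # Seek straight to the requested record; nothing else is read.
        self._file.seek(index * self._record.size)
        return self._record.unpack(self._file.read(self._record.size))

    def close(self):
        self._file.close()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "events.dat")
    with open(path, "wb") as out:
        for event_id, mass in [(0, 1.72), (1, 1.86), (2, 1.95), (3, 2.15)]:
            out.write(struct.pack("<id", event_id, mass))

    records = AssociatedRecords(path, "<id")
    n_records = len(records)
    third = records[2]                           # direct access to one event
    selected = [records[i] for i in (0, 3)]      # event selection by indexing
    records.close()
```

The data set can be larger than memory because only the subscripted records are ever loaded, but every access still pulls in a whole record, which matches the last point on the slide.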

Page 32: A Comparison of Data Analysis Packages

Access to user objects

• ROOT: script language is C++; user classes can be used by the interpreter if their header files are run through rootcint to create a dictionary

• IDL: supports structures (collections of scalars, arrays and other structures). Needs an external structure definition file to allow use in commands; no automatic way to create these from class headers

Page 33: A Comparison of Data Analysis Packages

IDL GUI Builder

Available in IDL 5.3, the IDL GUIBuilder enables you to build intuitive GUIs with drag-and-drop ease. A convenient control palette with icons such as radio buttons, checkboxes, and horizontal and vertical sliders lets you quickly construct interfaces that users understand. Widget properties are easily editable. Pre-made bitmaps give you graphical cues for customizing buttons relevant to their function. Also, widgets are arranged in row and column geometry for on-screen consistency. At the code level, built-in comments help you understand what each widget and event will accomplish.

Page 34: A Comparison of Data Analysis Packages

What Is ION?

• An easy method for users to leverage the graphics and analysis power of IDL in web based applets and applications

• Allows users to share IDL applications with non-IDL users

• Easy set-up, use and management

Page 35: A Comparison of Data Analysis Packages

ION Overview

[Diagram: two Client Machines (Web Browser, ION Client, ION Application) connected over the Internet to a Server Machine (Web Server, ION Server, IDL). Link labels: HTTP Data, Java Classes; IDL Commands, Graphic Primitives.]

Page 36: A Comparison of Data Analysis Packages

ION Applications

• Web publishing is obvious, but what else?

• Applications based on ION

– Workgroups can develop and easily deploy data processing and visualization apps with ION

– Thin clients download fast and can be updated easily

– Applications can exist in any Java enabled machine and still access the power of IDL

Page 37: A Comparison of Data Analysis Packages

Conclusions

• Both satisfy user requirements

• Commercial products offer all basic functionality and many attractive advanced features

• Homemade products still better optimized for specific HEP use

• Support models evolving (open source model)

• Can we mix and match to get best of both worlds?

