![Page 1: Resource Predictors in HEP Applications John Huth, Harvard Sebastian Grinstein, Harvard Peter Hurst, Harvard Jennifer M. Schopf, ANL/NeSC](https://reader036.vdocuments.net/reader036/viewer/2022070401/56649f215503460f94c3995a/html5/thumbnails/1.jpg)
Resource Predictors in HEP Applications
John Huth, Harvard
Sebastian Grinstein, Harvard
Peter Hurst, Harvard
Jennifer M. Schopf, ANL/NeSC
![Page 2: Resource Predictors in HEP Applications John Huth, Harvard Sebastian Grinstein, Harvard Peter Hurst, Harvard Jennifer M. Schopf, ANL/NeSC](https://reader036.vdocuments.net/reader036/viewer/2022070401/56649f215503460f94c3995a/html5/thumbnails/2.jpg)
The Problem
• Large data sets get recreated, and scientists want to know whether they should
– Fetch a copy of the data
– Recreate it locally
• This problem can be considered in the context of a virtual data system that tracks how data is created so recreation is feasible
![Page 3: Resource Predictors in HEP Applications John Huth, Harvard Sebastian Grinstein, Harvard Peter Hurst, Harvard Jennifer M. Schopf, ANL/NeSC](https://reader036.vdocuments.net/reader036/viewer/2022070401/56649f215503460f94c3995a/html5/thumbnails/3.jpg)
To make this decision you need
• 1) Estimate of time to recreate data
– Info about data provenance, machine types, etc.
• 2) Estimate of data transfer time
• 3) Framework to allow you to take advantage of these choices by adapting the workflow accordingly
![Page 4: Resource Predictors in HEP Applications John Huth, Harvard Sebastian Grinstein, Harvard Peter Hurst, Harvard Jennifer M. Schopf, ANL/NeSC](https://reader036.vdocuments.net/reader036/viewer/2022070401/56649f215503460f94c3995a/html5/thumbnails/4.jpg)
To make this decision you need
• 1) Estimate of time to recreate data
– Info about data provenance, machine types, etc.
• 2) Estimate of data transfer time
• 3) Framework to allow you to take advantage of these choices by adapting the workflow accordingly
– OUR AREA OF CONCENTRATION
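As a minimal sketch of the decision these three ingredients feed into, the comparison could look like the following. The function name `choose_action` and the numbers passed to it are illustrative assumptions, not part of the authors' framework.

```python
def choose_action(recreate_seconds: float, transfer_seconds: float) -> str:
    """Return "recreate" or "fetch", whichever is estimated to finish sooner."""
    # Ties go to fetching; this is an arbitrary choice made for the sketch.
    if recreate_seconds < transfer_seconds:
        return "recreate"
    return "fetch"


# Illustrative estimates only (seconds).
print(choose_action(recreate_seconds=1200.0, transfer_seconds=15.0))
```

Both inputs are themselves estimates, so in practice the choice is only as good as the predictors behind it.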
![Page 5: Resource Predictors in HEP Applications John Huth, Harvard Sebastian Grinstein, Harvard Peter Hurst, Harvard Jennifer M. Schopf, ANL/NeSC](https://reader036.vdocuments.net/reader036/viewer/2022070401/56649f215503460f94c3995a/html5/thumbnails/5.jpg)
Regeneration Time Estimates
• Previous work (CHEP 2004, “Resource Predictors in HEP Applications”)
• Estimate runtime of ATLAS application
– End-to-end estimation since no low-level application model available
– Used data about input parameters (number of events, versioning, debug on/off, etc.) and benchmark data (using nbench)
• Estimates are accurate to within 10% for event generation and reconstruction, and to within 25% for event simulation
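The slides do not give the form of the runtime model. As one hedged illustration of how input parameters and a benchmark score might combine, the sketch below assumes runtime scales linearly with the number of events and inversely with a machine's benchmark index. The function `estimate_runtime` and all its parameters are hypothetical and are not the authors' estimator.

```python
def estimate_runtime(n_events: int,
                     seconds_per_event_ref: float,
                     ref_benchmark: float,
                     target_benchmark: float) -> float:
    """Scale a reference per-event time to a target machine.

    Assumption for this sketch: a higher benchmark index means a
    proportionally faster machine, and runtime is linear in n_events.
    """
    if target_benchmark <= 0:
        raise ValueError("target_benchmark must be positive")
    speed_ratio = ref_benchmark / target_benchmark
    return n_events * seconds_per_event_ref * speed_ratio


# Illustrative numbers only: a target machine twice as fast as the reference.
print(estimate_runtime(1000, 2.0, ref_benchmark=10.0, target_benchmark=20.0))
```

Other inputs the slides mention, such as software version or debug on/off, would need their own terms or separate reference timings.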
![Page 6: Resource Predictors in HEP Applications John Huth, Harvard Sebastian Grinstein, Harvard Peter Hurst, Harvard Jennifer M. Schopf, ANL/NeSC](https://reader036.vdocuments.net/reader036/viewer/2022070401/56649f215503460f94c3995a/html5/thumbnails/6.jpg)
Regeneration Time Estimate Accuracy
![Page 7: Resource Predictors in HEP Applications John Huth, Harvard Sebastian Grinstein, Harvard Peter Hurst, Harvard Jennifer M. Schopf, ANL/NeSC](https://reader036.vdocuments.net/reader036/viewer/2022070401/56649f215503460f94c3995a/html5/thumbnails/7.jpg)
File Transfer Time Estimates
• Much previous work (e.g. Vazhkudai and Schopf, IJHPCA Vol. 17, No. 3, August 2003)
• We use simple end-to-end history data from GridFTP logs to estimate behavior
– Simple approach works well on our networks/machines
– Average bandwidth used with no file-size filtering
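A minimal sketch of the average-bandwidth approach described above: every past transfer contributes to the mean regardless of file size, and the estimate is the file size divided by that mean. Parsing of real GridFTP logs is omitted; the `history` list, its values, and both function names are assumptions made for illustration.

```python
from statistics import mean


def average_bandwidth(history):
    """Mean observed bandwidth in bytes/second.

    history: iterable of (bytes_transferred, seconds) pairs, all sizes included
    (no file-size filtering).
    """
    rates = [size / secs for size, secs in history if secs > 0]
    if not rates:
        raise ValueError("no usable transfer records")
    return mean(rates)


def estimate_transfer_seconds(file_bytes, history):
    """Predict transfer time as file size over the historical mean bandwidth."""
    return file_bytes / average_bandwidth(history)


# Synthetic history, not measured data: three transfers at 10 MB/s.
history = [(100e6, 10.0), (50e6, 5.0), (200e6, 20.0)]
print(estimate_transfer_seconds(100e6, history))
```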
![Page 8: Resource Predictors in HEP Applications John Huth, Harvard Sebastian Grinstein, Harvard Peter Hurst, Harvard Jennifer M. Schopf, ANL/NeSC](https://reader036.vdocuments.net/reader036/viewer/2022070401/56649f215503460f94c3995a/html5/thumbnails/8.jpg)
Testbed
• Files transferred from BNL to Harvard and from CERN to Harvard
– BNL (aftpexp01.bnl.gov): 4x 3GHz Xeon, Linux 2.4.21-37.ELsmp, 2.0GB RAM, 1.0 GBit/s NIC
– Harvard: 2x 3.4GHz P4, Linux 2.4.20-21.EL.cernsmp, 1.5GB RAM, 1.0 GBit/s NIC
• Typical network routes:
– Harvard – NoX – ManLan – ESNet – BNL (typical latency 7.8 ms)
– Harvard – NoX – ManLan – Chicago (Abilene) – CERN (typical latency 148 ms)
• Bottlenecks are in machines at each end (e.g. disk access)
![Page 9: Resource Predictors in HEP Applications John Huth, Harvard Sebastian Grinstein, Harvard Peter Hurst, Harvard Jennifer M. Schopf, ANL/NeSC](https://reader036.vdocuments.net/reader036/viewer/2022070401/56649f215503460f94c3995a/html5/thumbnails/9.jpg)
Network Routing
![Page 10: Resource Predictors in HEP Applications John Huth, Harvard Sebastian Grinstein, Harvard Peter Hurst, Harvard Jennifer M. Schopf, ANL/NeSC](https://reader036.vdocuments.net/reader036/viewer/2022070401/56649f215503460f94c3995a/html5/thumbnails/10.jpg)
Transfer Benchmarking
• Transfer files from BNL to Harvard
– 20 files each of 25MB, 50MB, 100MB, 250MB, 500MB, 1GB
• Average file transfer times are linear with file size
• Initially quiet machines, network
– Transfers of 100MB files have variance ~5%
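The linearity claim can be checked with an ordinary least-squares fit of transfer time against file size. The sketch below uses the file sizes listed above, but the times are synthetic values generated from a made-up line, not the measurements from this testbed; `fit_line` is a hypothetical helper.

```python
def fit_line(sizes_mb, times_s):
    """Least-squares fit of times_s = slope * sizes_mb + intercept."""
    n = len(sizes_mb)
    mean_x = sum(sizes_mb) / n
    mean_y = sum(times_s) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes_mb)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes_mb, times_s))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


sizes_mb = [25, 50, 100, 250, 500, 1000]
# Synthetic times: 0.5 s fixed overhead plus 0.1 s per MB (illustrative only).
times_s = [0.5 + 0.1 * size for size in sizes_mb]

slope, intercept = fit_line(sizes_mb, times_s)
print(f"seconds per MB: {slope:.3f}, fixed overhead: {intercept:.3f} s")
```

In such a fit the slope is the inverse of the effective bandwidth, and the intercept captures per-transfer overhead such as connection setup.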
![Page 11: Resource Predictors in HEP Applications John Huth, Harvard Sebastian Grinstein, Harvard Peter Hurst, Harvard Jennifer M. Schopf, ANL/NeSC](https://reader036.vdocuments.net/reader036/viewer/2022070401/56649f215503460f94c3995a/html5/thumbnails/11.jpg)
Time vs File Size, BNL (Quiet network)
![Page 12: Resource Predictors in HEP Applications John Huth, Harvard Sebastian Grinstein, Harvard Peter Hurst, Harvard Jennifer M. Schopf, ANL/NeSC](https://reader036.vdocuments.net/reader036/viewer/2022070401/56649f215503460f94c3995a/html5/thumbnails/12.jpg)
Transfer Variance, BNL (100 MB files, quiet network)
![Page 13: Resource Predictors in HEP Applications John Huth, Harvard Sebastian Grinstein, Harvard Peter Hurst, Harvard Jennifer M. Schopf, ANL/NeSC](https://reader036.vdocuments.net/reader036/viewer/2022070401/56649f215503460f94c3995a/html5/thumbnails/13.jpg)
Transfer Benchmarking
• Some data taken during “Service Challenge 3”
• Average file transfer times are still linear with file size, but have larger variance
![Page 14: Resource Predictors in HEP Applications John Huth, Harvard Sebastian Grinstein, Harvard Peter Hurst, Harvard Jennifer M. Schopf, ANL/NeSC](https://reader036.vdocuments.net/reader036/viewer/2022070401/56649f215503460f94c3995a/html5/thumbnails/14.jpg)
Time vs File Size, BNL (Busy network)
![Page 15: Resource Predictors in HEP Applications John Huth, Harvard Sebastian Grinstein, Harvard Peter Hurst, Harvard Jennifer M. Schopf, ANL/NeSC](https://reader036.vdocuments.net/reader036/viewer/2022070401/56649f215503460f94c3995a/html5/thumbnails/15.jpg)
Transfer Variance, BNL (100 MB files, busy network)
![Page 16: Resource Predictors in HEP Applications John Huth, Harvard Sebastian Grinstein, Harvard Peter Hurst, Harvard Jennifer M. Schopf, ANL/NeSC](https://reader036.vdocuments.net/reader036/viewer/2022070401/56649f215503460f94c3995a/html5/thumbnails/16.jpg)
But our concentration was on the framework
• Given ways to estimate application run time and file transfer time, we want to plug them into an existing framework to make better resource management decisions
• Could be implemented as a post-processor to optimize DAGs produced by Chimera
![Page 17: Resource Predictors in HEP Applications John Huth, Harvard Sebastian Grinstein, Harvard Peter Hurst, Harvard Jennifer M. Schopf, ANL/NeSC](https://reader036.vdocuments.net/reader036/viewer/2022070401/56649f215503460f94c3995a/html5/thumbnails/17.jpg)
Workflow Optimization
• A script parses the DAG, looking for I/O, binaries
• I/O files indexed in Replica Location Service (RLS)
• Client queries database for execution parameters, bandwidths
• Script evaluates execution and transfer times and writes out the fastest DAG
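The selection step in the list above can be sketched as follows, under simplifying assumptions: the DAG is a linear chain, every output file has an existing replica, and the time estimates are already available. Parsing real Chimera DAGs, RLS lookups, and the database query are not shown; `Step` and `optimize_chain` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Step:
    output_file: str
    compute_seconds: float   # estimated time to regenerate the file
    transfer_seconds: float  # estimated time to fetch an existing replica


def optimize_chain(steps):
    """For each step, pick the faster of computing or transferring.

    Returns the chosen plan and its total estimated time in seconds.
    """
    plan = []
    total = 0.0
    for step in steps:
        if step.transfer_seconds < step.compute_seconds:
            plan.append((step.output_file, "transfer"))
            total += step.transfer_seconds
        else:
            plan.append((step.output_file, "compute"))
            total += step.compute_seconds
    return plan, total


# Illustrative estimates only.
chain = [Step("a.root", 10.0, 5.0), Step("b.root", 3.0, 8.0)]
print(optimize_chain(chain))
```

A fuller optimizer would also account for dependencies, since fetching a file can make the steps that produce its inputs unnecessary.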
![Page 18: Resource Predictors in HEP Applications John Huth, Harvard Sebastian Grinstein, Harvard Peter Hurst, Harvard Jennifer M. Schopf, ANL/NeSC](https://reader036.vdocuments.net/reader036/viewer/2022070401/56649f215503460f94c3995a/html5/thumbnails/18.jpg)
Our Strawman Application
• ATLAS event reconstruction jobs take ~20 min to produce a 100 MB file
• File transfer Boston to BNL takes ~15 s per 100 MB file
• We created simplified jobs that would have average execution times equal to the file transfer times in order to have a situation closer to the one originally hypothesized
• This situation is likely to become more common as data access becomes more contended and machines/calculations speed up
![Page 19: Resource Predictors in HEP Applications John Huth, Harvard Sebastian Grinstein, Harvard Peter Hurst, Harvard Jennifer M. Schopf, ANL/NeSC](https://reader036.vdocuments.net/reader036/viewer/2022070401/56649f215503460f94c3995a/html5/thumbnails/19.jpg)
Framework Tests
• Generate “Non-optimized” DAGs – linear chains which use a random mixture of transfers and calculations to instantiate 10, 20, or 40 files.
• Operate on these DAGs with our optimizer to produce “Optimized” DAGs
• Submit both “Non-optimized” and “Optimized” DAGs and compare processing times
• For our particular strawman we expect the “Optimized” DAGs to be 25% faster than the “Non-optimized” ones
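The slides do not state the time distributions behind the expected speedup. As an illustration of why picking the faster option per file helps even when the two average times are equal, the sketch below assumes compute and transfer times are independent and uniformly distributed on the same interval, and that a non-optimized chain picks one of the two at random. The interval, the function names, and the resulting numbers are assumptions and do not reproduce the authors' figure.

```python
import random


def uniform_saving(low, high):
    """Closed-form expected saving for i.i.d. uniform times on [low, high].

    E[min] = (2*low + high) / 3 and the mean is (low + high) / 2, so the
    fractional saving is (high - low) / (3 * (high + low)).
    """
    return (high - low) / (3.0 * (high + low))


def simulated_saving(low, high, n_steps=200_000, seed=0):
    """Monte Carlo estimate of the same saving over a long chain."""
    rng = random.Random(seed)
    baseline = 0.0   # random choice between compute and transfer
    optimized = 0.0  # always the faster of the two
    for _ in range(n_steps):
        compute = rng.uniform(low, high)
        transfer = rng.uniform(low, high)
        baseline += rng.choice((compute, transfer))
        optimized += min(compute, transfer)
    return 1.0 - optimized / baseline


print(uniform_saving(0.5, 1.5), simulated_saving(0.5, 1.5))
```

Under these assumptions the saving grows with the spread of the times relative to their mean, and vanishes when both times are always identical.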
![Page 20: Resource Predictors in HEP Applications John Huth, Harvard Sebastian Grinstein, Harvard Peter Hurst, Harvard Jennifer M. Schopf, ANL/NeSC](https://reader036.vdocuments.net/reader036/viewer/2022070401/56649f215503460f94c3995a/html5/thumbnails/20.jpg)
Framework Tests
![Page 21: Resource Predictors in HEP Applications John Huth, Harvard Sebastian Grinstein, Harvard Peter Hurst, Harvard Jennifer M. Schopf, ANL/NeSC](https://reader036.vdocuments.net/reader036/viewer/2022070401/56649f215503460f94c3995a/html5/thumbnails/21.jpg)
Comparison of Results
![Page 22: Resource Predictors in HEP Applications John Huth, Harvard Sebastian Grinstein, Harvard Peter Hurst, Harvard Jennifer M. Schopf, ANL/NeSC](https://reader036.vdocuments.net/reader036/viewer/2022070401/56649f215503460f94c3995a/html5/thumbnails/22.jpg)
Optimized Results
![Page 23: Resource Predictors in HEP Applications John Huth, Harvard Sebastian Grinstein, Harvard Peter Hurst, Harvard Jennifer M. Schopf, ANL/NeSC](https://reader036.vdocuments.net/reader036/viewer/2022070401/56649f215503460f94c3995a/html5/thumbnails/23.jpg)
Summary
• Implementation works
• A 28% time savings is seen
• Works with crude bandwidth predictions
– More sophisticated predictions for dynamic situations would be helpful
• Most useful when regeneration and transfer times are similar.