DOD Center for Geosciences / Atmospheric Research Colorado State University
Overview of the Data Processing and Error Analysis System (DPEAS)
Andrew S. Jones
Colorado State University (CSU)
Cooperative Institute for Research in the Atmosphere (CIRA)
DOD Center for Geosciences / Atmospheric Research (CG/AR)
Fort Collins, CO
What is it?
Data processing system for “large” data analysis tasks using common PCs
Features:
  2nd generation system (replaces an earlier system called PORTAL (Jones et al., 1995))
  Parallel implementation
  Web-based documentation and monitoring
  Incorporates a Fortran interpreter for input tasks
  Virtualized I/O subsystem (only memory-resident data structures are needed, data algorithms now function like a model)
  Able to fail over to redundant hardware
  Extensible User Module
Error Analysis code is still under development
Implemented on Windows NT/2000 OS
What Does it Do?
Global merge capabilities for numerous data sets
Current system in operational use for 2+ years at CIRA
Current average operational throughput rate using 15 processors on 8 PCs is 17 TB/yr (47 GB/day)
Measured max. throughput rate is 2.5 PB/yr (7.1 TB/day)
Simplifies
  Powerful abstraction layers allow anyone to write parallel code
  Virtual I/O subsystem reduces end-user code complexities
  Users interact using a language most already know
Easily Scales
  Limited process “cross-talk” improves scaling behavior
  Tests have shown that a 2000 machine cluster is physically feasible. Basically… just add hardware.
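As a rough check on the throughput figures, the per-year rates can be converted to per-day rates. This is only a sketch: it assumes decimal units (1 TB = 10^12 bytes) and a 365-day year, neither of which the slide states, so small differences from the quoted per-day values are to be expected.

```python
DAYS_PER_YEAR = 365

# Decimal units are an assumption; the slide does not say which convention it uses.
GB = 1e9
TB = 1e12
PB = 1e15


def per_day(bytes_per_year: float) -> float:
    """Convert a yearly data volume to a daily one."""
    return bytes_per_year / DAYS_PER_YEAR


average_gb_per_day = per_day(17 * TB) / GB   # about 46.6; the slide quotes 47 GB/day
peak_tb_per_day = per_day(2.5 * PB) / TB     # about 6.8; the slide quotes 7.1 TB/day
headroom = (2.5 * PB) / (17 * TB)            # measured maximum relative to the average

print(f"average: {average_gb_per_day:.1f} GB/day")
print(f"peak:    {peak_tb_per_day:.2f} TB/day")
print(f"peak/average: {headroom:.0f}x")
```

Under these assumptions the measured maximum is roughly 150 times the average operational rate.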
10 Data Types Are Currently Supported
Reads and writes HDF-EOS natively
GOES IMAGER (McIDAS)
NOAA AVHRR GAC and LAC (McIDAS)
NOAA AMSU-A and B (HDF-EOS)
DMSP SSM/I (Byte Stream)
DMSP SSM/T-2 (NGDC OIS)
DMSP OLS (NGDC OIS)
TRMM TMI and VIRS (HDF)
User extensible… (your format here)
The Hardware
[Figure: storage view and processor view of two clusters, the OPERATIONAL CLUSTER (24/7) and the EXPERIMENTAL CLUSTER (nights only/7). Legend: Primary, Backup, Wn = Worker (W1 to W6). Storage labels: mirrored set of 240 GB + 240 GB; 66 GB; 240 GB.]
Cluster Summary (All Ingest Processes, Most Higher Level Remapped Products): 9 Processors, 3.0 GFlops, 2.25 GB RAM
Cluster Summary (Large Global Sectors): 6 Processors, 2.5 GFlops, 2.5 GB RAM
Failover Mode
[Figure: the cluster storage and processor views from the previous slide, with the Primary marked as failed (X).]
Failover Steps (automated):
1. Synchronize states
2. Promote the Backup
Restore Steps (manually initiated):
1. Demote the Backup
2. Restore Mirror Set
3. Synchronize states
4. Reactivate Primary
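The failover and restore sequences on this slide can be sketched as two small procedures. This is an illustration only: the `Node` class, the role names, and the `state` dictionary are hypothetical, and copying state directly between objects stands in for whatever mechanism DPEAS actually uses (the slides do not say).

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a Primary or Backup machine."""
    name: str
    role: str                      # "primary", "backup", or "offline"
    state: dict = field(default_factory=dict)


def failover(primary: Node, backup: Node) -> None:
    """Automated path: synchronize states, then promote the Backup."""
    backup.state = dict(primary.state)   # 1. synchronize states
    backup.role = "primary"              # 2. promote the Backup
    primary.role = "offline"


def restore(primary: Node, backup: Node) -> None:
    """Manually initiated path, in the order listed on the slide."""
    backup.role = "backup"               # 1. demote the Backup
    # 2. restore the mirror set (a storage-level step, not modelled here)
    primary.state = dict(backup.state)   # 3. synchronize states
    primary.role = "primary"             # 4. reactivate the Primary


primary = Node("Primary", "primary", {"jobs_done": 41})
backup = Node("Backup", "backup")

failover(primary, backup)
backup.state["jobs_done"] = 42           # work continues on the promoted Backup
restore(primary, backup)

print(primary.role, backup.role, primary.state)
```

Keeping the restore path manual, as the slide does, means an operator decides when the repaired Primary is trusted again.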
This is DPEAS
[Figure: module context diagram. Labelled components: GUIs; Command Shell Interpreter; Internet Information Services; Web Browser; Other Applications; Explorer; Command Line; Operating System (Windows 2000); DPEAS Data Processing Engine; DPEAS Fortran Interpreter; DPEAS HDF-EOS Virtual I/O Subsystem; DPEAS System State; Analysis Modules; User Modules; Translation Modules; Output Modules; Batch Job Client; Batch Job Service; DPEAS Input Script; Command Line Script; Spawn Subtask; DPEAS Subtask.]
An example of a DPEAS input script file
How DPEAS Starts
Program Start
DPEAS Initialization
Interpreting DPEAS script declarations
Interpreting DPEAS script executable statements
How DPEAS Ends
Program End
DPEAS Summary
Interpreting DPEAS script executable statements
How Are Spawned Input Scripts and Jobs Created?
All spawned DPEAS jobs run machine-generated DPEAS input scripts, which the data processing engine generates from the Master DPEAS input script. (The examples shown previously were DPEAS machine-generated code.)
This is automated within DPEAS, and the user code goes along for the free ride since it is part of the DPEAS executable. (It’s like meeting a friendly virus that helps to spread your code along with it.)
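A minimal sketch of the idea, generating one stand-alone subtask script per loop iteration of a master script. The script text, the `generate_subtask_scripts` helper, and the `process_orbit` routine are all hypothetical; the real DPEAS script syntax is not shown in this transcript.

```python
def generate_subtask_scripts(loop_var: str, start: int, stop: int,
                             body: list[str]) -> list[str]:
    """Emit one stand-alone script per loop iteration.

    Each script pins the loop variable to a single value and then repeats
    the loop body, so it can run independently of the master script.
    """
    scripts = []
    for value in range(start, stop + 1):   # Fortran-style inclusive bounds
        lines = [
            f"! machine-generated subtask for {loop_var} = {value}",
            f"{loop_var} = {value}",
        ]
        lines.extend(body)
        scripts.append("\n".join(lines))
    return scripts


# Hypothetical master loop: "do orbit = 1, 3" with a one-line body.
subtasks = generate_subtask_scripts("orbit", 1, 3, ["call process_orbit(orbit)"])
print(subtasks[0])
```

Because every generated script is interpreted by the same executable, any user routine compiled into that executable is available to every subtask without extra work.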
What Does DPEAS Parallelism Look Like?
Do loop contents are sent to other resources in parallel
The new jobs run the same “DPEAS.exe”, but execute only the subtask operations
Completed jobs allow additional jobs to start
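The scheduling behaviour described here, where a finished job frees a slot for the next one, can be illustrated with a bounded worker pool. This is an analogy under stated assumptions: threads in one Python process stand in for separate copies of “DPEAS.exe” on other machines, and `run_subtask` and `run_do_loop` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def run_subtask(index: int) -> str:
    """Stand-in for launching one subtask copy of the executable."""
    return f"subtask {index} done"


def run_do_loop(iterations: range, max_parallel: int) -> list[str]:
    """Run each loop iteration as its own job, at most max_parallel at a time."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # map() queues every iteration; a queued job starts as soon as a
        # running one completes and frees its worker slot.
        return list(pool.map(run_subtask, iterations))


results = run_do_loop(range(1, 7), max_parallel=3)
print(results)
```

Since the iterations do not talk to each other, adding workers raises throughput without changing the loop itself.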
The 3 Programming Steps to Add a User Routine to DPEAS
1. Insert a program “hook”
The program hook makes the main DPEAS program aware of the existence of your wrapper routine.
2. Create a wrapper routine
The wrapper routine tells the DPEAS Fortran interpreter how to parse and interact with your application subroutine arguments.
3. Create an application routine
The application routine performs the “real” work. You can do anything you want within the application routine.
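The three steps can be sketched as a small registry pattern. This is a Python analogy rather than DPEAS Fortran: the `ROUTINES` table, the toy `interpret` function, and the `scale_channel` routine are all hypothetical names introduced for illustration.

```python
# Hypothetical names throughout; a Python analogy, not the DPEAS Fortran code.
ROUTINES = {}   # routine name -> wrapper, consulted by the toy interpreter


def scale_channel(values: list[float], factor: float) -> list[float]:
    """Step 3, application routine: does the real work."""
    return [v * factor for v in values]


def scale_channel_wrapper(args: list[str], data: dict) -> None:
    """Step 2, wrapper routine: parses script arguments, calls the application."""
    source, target, factor = args[0], args[1], float(args[2])
    data[target] = scale_channel(data[source], factor)


# Step 1, program hook: make the interpreter aware of the wrapper.
ROUTINES["scale_channel"] = scale_channel_wrapper


def interpret(line: str, data: dict) -> None:
    """Toy interpreter for lines of the form: call name(arg1, arg2, ...)."""
    name, _, arg_text = line.removeprefix("call ").partition("(")
    args = [a.strip() for a in arg_text.rstrip(")").split(",")]
    ROUTINES[name.strip()](args, data)


data = {"tb": [200.0, 250.0]}
interpret("call scale_channel(tb, tb_scaled, 0.5)", data)
print(data["tb_scaled"])
```

The split keeps argument parsing out of the science code, so the application routine can be written and tested as an ordinary function.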
How does the “User_Module.f90” relate to my DPEAS Input Scripts?
[Figure: flow diagram. User_Module.f90 (Program Hook, Wrapper Routine, Application Routine) is compiled with an ordinary Fortran compiler; "DPEAS.exe" interprets the DPEAS Input Script; automated parallelization using self-replication produces a subtask DPEAS Input Script, which a further "DPEAS.exe" interprets before it ends and returns to the master.]
User Example: The user’s application routine
Using the virtual I/O data via pointers
1. Find each MW channel
2. Allocate a new output array data structure
Your science code looks like this
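The slide's code listing did not survive in this transcript. The sketch below only illustrates the two numbered steps against memory-resident data. Everything in it is an assumption: the dictionary standing in for the virtual I/O data, the array names, and especially the placeholder ratio used as the "science", which is not the algorithm from the slide.

```python
def effective_emissivity(memory: dict[str, list[float]],
                         mw_prefix: str = "AMSU-B",
                         ir_name: str = "GOES_IR") -> dict[str, list[float]]:
    """Add one output array per microwave channel found in memory."""
    ir = memory[ir_name]
    outputs = {}
    # 1. Find each microwave (MW) channel among the memory-resident arrays.
    for name, tb in memory.items():
        if not name.startswith(mw_prefix):
            continue
        # 2. Allocate a new output array data structure.
        result = [0.0] * len(tb)
        # Science code: placeholder ratio of MW to IR brightness temperature
        # (an assumption for illustration, not the algorithm on the slide).
        for i, (mw_value, ir_value) in enumerate(zip(tb, ir)):
            result[i] = mw_value / ir_value
        outputs[name + "_emissivity"] = result
    memory.update(outputs)   # new arrays join the in-memory data set
    return outputs


memory = {"GOES_IR": [290.0, 280.0], "AMSU-B_150GHz": [261.0, 266.0]}
new_arrays = effective_emissivity(memory)
print(new_arrays)
```

The routine never opens a file: it reads and writes named arrays in memory and leaves file formats to the surrounding system, which is the point the slide makes about virtual I/O.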
User Example: The results: Complete integration
The new user routine is now fully integrated into DPEAS
User Example: The output HDF-EOS file
User Example: The output image representation
[Figure: 150 GHz Effective Emissivity, calculated from GOES-08 IMAGER and NOAA-15 AMSU-B]
User Example: Summary
Creates 2 new routines:
  Wrapper routine
  Application routine
Requires 25 lines of executable code:
  2 – Program hook
  4 – Wrapper routine
  19 – Application routine
    2 – Variable assignments
    3 – Science algorithm
    14 – Virtual I/O library calls (using only 2 Virtual I/O library routines)
Small overhead for gaining massive parallelism capabilities!
User Example: How complex would the user routine be, if written without the Virtual I/O library?
Creates 2 new routines:
  Wrapper routine
  Application routine
Requires 59 lines of executable code:
  2 – Program hook
  4 – Wrapper routine
  53 – Application routine
    2 – Variable assignments
    3 – Science algorithm
    48 – HDF-EOS library calls (using 26 HDF-EOS library routines)
Answer: Without the DPEAS Virtual I/O library there would be:
  24 additional I/O routines called by the user (+1200%)
  34 additional lines of user code (+136%, i.e. 236% of the original 25 lines)
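The comparison can be checked with a few lines of arithmetic, using only the counts given on the two summary slides. Measured as a relative increase, (new - old) / old, the routine count grows by 1200% and the line count by 136%; 59 lines is 236% of the original 25.

```python
def percent_increase(old: int, new: int) -> float:
    """Relative growth from old to new, in percent."""
    return 100.0 * (new - old) / old


# Counts taken from the two summary slides.
with_virtual_io = {"lines": 2 + 4 + 19, "io_routines": 2}
without_virtual_io = {"lines": 2 + 4 + 53, "io_routines": 26}

extra_lines = without_virtual_io["lines"] - with_virtual_io["lines"]
extra_routines = without_virtual_io["io_routines"] - with_virtual_io["io_routines"]

routine_growth = percent_increase(with_virtual_io["io_routines"],
                                  without_virtual_io["io_routines"])
line_growth = percent_increase(with_virtual_io["lines"],
                               without_virtual_io["lines"])

print(extra_routines, routine_growth)   # 24 additional routines, 1200.0
print(extra_lines, line_growth)         # 34 additional lines, 136.0
```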
User Example: Conclusions
Implementation Insights
  Minimal amount of end-user code is required
  The effort and resources involved are small (the DPEAS program recompiled in < 30 s on the user’s desktop)
Virtual I/O Insights
  The DPEAS virtual I/O access method is less complex than traditional HDF-EOS file access methods
End user’s perspective
  End users are protected from technical data format issues
  End users can develop higher quality code by leveraging shared robust common modules
  Scalability is greatly enhanced with little end user effort
Summary
DPEAS can process large data sets efficiently while maintaining centralized management controls and error handling behaviors
Parallelism of the code is automatic and runs on “cheap hardware”
Failover capabilities make the system more robust
User code is shielded from complexities of the system using software abstraction layers
Little training is needed since user interfaces are in a known scientific language
User modules directly access data from memory, which makes traditional file access methods obsolete while maintaining needed file compatibility
What did I learn about HDF-EOS in the process?
HDF-EOS is an excellent “universal” data format
  It works for all satellite sensor types I have encountered to date (10+)
HDF-EOS requires serious software design before the implementation stage
In my experience, “Time” information as a geo/time field for sectorizing is overrated, and if encouraged to be the “norm” it is likely to cause future software design headaches with the more complex sensors
My 2 cents: How HDF-EOS could be made even better
(Hopefully someone has already thought of these things, and this short list will be a reaffirmation)
Given that GOES data, for example, and other multi-detector sensors can have multiple times for each channel for the same geolocation position, and that in addition, they can and do interrupt their sensor scans at any time…
Treat “Time” as a data attribute
  Currently I associate “Time” and other associated arrays with their principal data array by nomenclature
  It would be better to use data array attribute “groups”. Then “Time”, “Calibration”, and other associated arrays could be grouped with the data array through the data format.
Why Data Attributes?
Many data channels have “associated” information
  For example, it might be very meaningful to associate the min. and max. of a grid location with its mean value
It would be better if there were a standard way of showing that group association, so we don’t have to understand each other’s unique nomenclatures or “intent”, or resort to the use of unusual “mixed” HDF/HDF-EOS data files
Data attributes should not be arbitrarily limited in scope, but have full data type ranges
Units could also be incorporated through data attributes
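One way to picture the proposal is a data array that carries its associated arrays as a named group, instead of relying on naming conventions. This is a sketch of the idea only, not an HDF-EOS feature or API: the `DataArray` class, its `attach` method, and the role names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DataArray:
    """A principal data array plus the arrays and metadata grouped with it."""
    name: str
    values: list[float]
    units: str = ""
    # Attributes are full arrays themselves, not restricted to scalars or strings.
    attributes: dict[str, "DataArray"] = field(default_factory=dict)

    def attach(self, role: str, array: "DataArray") -> None:
        """Group an associated array (e.g. Time, Calibration) with this one."""
        if len(array.values) != len(self.values):
            raise ValueError(f"{role} must match the shape of {self.name}")
        self.attributes[role] = array


mean_tb = DataArray("Tb_mean", [250.0, 260.0], units="K")
mean_tb.attach("Min", DataArray("Tb_min", [240.0, 255.0], units="K"))
mean_tb.attach("Max", DataArray("Tb_max", [262.0, 270.0], units="K"))
mean_tb.attach("Time", DataArray("scan_time", [0.0, 1.5], units="s"))

print(sorted(mean_tb.attributes))
```

With the association stored in the structure, a reader can discover the time or calibration array for a channel without knowing the producer's naming scheme.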
The End
Appendix
The following series of slides shows how a user can easily modify DPEAS
1. The user’s program hook
2. … wrapper routine
3. … application routine (using the virtual I/O data via pointers)
4. Usage of the new user routine in a DPEAS input script file
5. The Results: Complete Integration
User Example: The user’s program hook
2 lines of code
User Example: The user’s wrapper routine
4 lines of executable code
User Example: The user’s application routine
Using the virtual I/O data via pointers
1. Find each MW channel
2. Allocate a new output array data structure
Your science code looks like this
User Example: Usage of the new user routine in a DPEAS input script file
User Example: The results: Complete integration
The new user routine is now fully integrated into DPEAS
Where Do I Find DPEAS?
DPEAS Home Page:
http://luna.cira.colostate.edu/DPEAS/DPEAS_frame.htm
Please direct questions to [email protected]