The Future of MOCHA
Nick Roussopoulos
Stanford, October 5, 2001
The Problem

• Data sources for an enterprise are:
– Distributed: Internet, intranets, extranets
– Heterogeneous: web servers, relational databases, file systems
– Mission-critical: weather service, ocean temperature, stock status, …
– Costly to replace or upgrade: risk of breaking them and losing the investment

Distributed and heterogeneous data sources
The Problem

[Diagram: many clients across the Internet accessing Oracle 8i, Informix, XML data, and text data sources]

High-volume access from everywhere
Client-Server 2-tier architecture

[Diagram: clients connecting directly over the Internet to Oracle 8i, Informix, XML data, and text data sources]

Complex FAT clients: a bad idea.
Middleware 3-tier architecture

[Diagram: thin & fit clients connect to an Integration Server with a Catalog, which reaches Oracle 8i, Informix, XML data, and text data sources over the Internet through per-source Translators]

Thin & fit clients
Nice but…

• Most middleware solutions are static
– not flexible for dynamic environments
– not scalable to hundreds of client and server sites
• Development cost is high
– one site at a time, at a fixed cost
• Maintenance cost is high
– upgrades are practically redevelopments
A dynamic world needs code extensibility & auto-deployment

• Need for user-defined types and functions
– Polygon
– Composite(): image aggregation
• Porting and manual installation of code (C/C++)
– operating system
– hardware platform
• High cost of code maintenance
– updates on all platforms
– version management
• Security on hostile platforms
Code Deployment Problem

[Diagram: the 3-tier architecture again; operator code must be hand-installed at every translator and client]

Not scalable.
Query Processing

• Query execution options are limited by site-dependent software
– Composite() must be ported before use
• Most processing is done at the Integration Server
– powerful data servers are under-utilized, reduced to I/O nodes
• Excessive data movement over the network
– network bottleneck
– slow Internet access
Query Processing Problem

[Diagram: 100MB and 200MB transfers flow from each data source through the translators to the Integration Server and on to the clients]

Inefficient & not scalable.
Solution

MOCHA: Middleware Based On a Code SHipping Architecture
MOCHA Solution: Ship Java Code (Mochlets)

Select location, Composite(image)
From Rasters
Where week BETWEEN t1 AND t2
Group By location

[Diagram: the QPC ships query code (Mochlets) from the Code Repository to the DAPs at the Maryland, Virginia, and Texas sites (Oracle, Informix)]

No code porting & no maintenance.
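The point of a Mochlet is that a DAP never needs pre-installed operator code: the QPC ships the class and the DAP resolves it by name at run time. A minimal sketch of that idea using plain Java reflection; the class name `Composite` and the `apply` method are illustrative assumptions, not MOCHA's actual API, and here the operator class lives in the same file instead of arriving over the network.

```java
import java.lang.reflect.Method;

// Illustrative operator: in MOCHA this class would arrive over the
// network from the Code Repository rather than live in this file.
class Composite {
    public Object apply(Object images) {
        return "composite(" + images + ")";
    }
}

public class MochletLoader {
    // Resolve an operator purely by name, as a DAP would after the
    // QPC ships the class: no compile-time dependency on Composite.
    public static Object invoke(String className, Object arg) throws Exception {
        Class<?> cls = Class.forName(className);
        Object op = cls.getDeclaredConstructor().newInstance();
        Method m = cls.getMethod("apply", Object.class);
        return m.invoke(op, arg);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(invoke("Composite", "week1..week4"));
    }
}
```

Because the lookup is by string name, new operators can be added without reinstalling anything at the data-source sites.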
MOCHA Solution: Filter Data @ Source

Select location, Composite(image)
From Rasters
Where week BETWEEN t1 AND t2
Group By location

[Diagram: 100-200MB of tuples at each site are filtered by the DAPs down to 150-350KB of results before crossing the network to the QPC]

No bandwidth waste.
Software architecture

[Diagram: Client → QPC (with Catalog and Code Repository) → DAPs → data sources (DBMS, OS files)]
QPC: The Query Processing Coordinator

[Diagram: QPC components: Client API, Query Parser, Catalog Manager, Query Optimizer, Execution Engine, Code Loader, SQL & XML procedural interface, and DAP Access API; connected to the XML Catalog, the Code Repository, and the DAPs]

The QPC controls and coordinates query execution.
DAP: The Data Access Provider

[Diagram: DAP components: DAP Access API, Control Module, Execution Engine, Code Loader, SQL & XML procedural interface, and a Data Source Access Layer (JDBC, I/O API, DOM, JNI) over the data source]

The DAP provides the QPC with remote access to the data.
Data Server: Storage System

• Stores and manages the data sets
– database, web server, file system, XML repository
Processing a Query in MOCHA

1. Query parsing
2. Resource discovery
3. Query optimization
4. Metadata and control exchange
5. Code deployment phase
6. Query execution

Table: Rasters(location, image, week, band)

Query:
Select location, Composite(image)
From Rasters
Where week BETWEEN t1 AND t2
Group By location
Plan Generation

[Diagram: for the Composite query, the QPC spawns a coordination thread and one execution thread per DAP (Informix, Oracle)]
Automatic Code Deployment

[Diagram: the QPC ships the needed Mochlets from the Code Repository to each execution thread and DAP]
Data Processing

[Diagram: the DAPs execute the shipped code at the data sources and stream the filtered results back through the execution threads to the client]
Features of MOCHA

• Automatic code deployment ("Plug-N-Play")
– no system-wide installations
• Metadata and schema mapping framework
– XML, RDF
– easy to exchange and map schemas
– semi-automatic mapping
• Query optimization based on code shipping
– reduces data movement overhead
– filters at the source, expands at the client
– metrics for code (operator) placement
– optimization of selection, union, and join plans
MOCHA Demo: Global Land Cover Facility

• Integrates the following DAP sites:
– University of New Hampshire (Webster), NASA GSFC, UMD-CS, UMD-Geography, UMD-UMIACS SP-2 HPSS
• GLCF hosts the QPC
• Operations supported:
– coverage queries
– visualization of preview images for the MODIS, TM, and AVHRR data sets and GIS features
– dynamic sub-setting of TM scenes
– composites of GIS features and AVHRR images
Multi-Sensor Analysis of the Los Alamos Fire Event Using MOCHA

• Data synergy and multi-resolution instrument analysis using MOCHA
– access data residing at various data sources
– utilize image processing tools
• Fire analysis required a multi-resolution approach
– MOCHA is independent of instrument or resolution specifics
– high resolution: IKONOS and TM data
– moderate resolution: 250m MODIS
– coarse resolution: AVHRR and DMSP
[Screenshot slides:]
• MOCHA Search Utility (3 slides)
• MOCHA Query Results
• MOCHA ETM+ Subsetting Utility
• May 9, 2000 Los Alamos (Bands 1,2,3)
• May 9, 2000 Los Alamos (Bands 7,5,4)
• Multi-Sensor Query
• Tabular Query Results
• MODIS: May 11, 2000: During Fire
• MODIS: May 24, 2000: After Fire
• DMSP: Night Visibility of Fire
• IKONOS 4m resolution
• IKONOS 4m Subset
• IKONOS 1m resolution
• IKONOS 1m Subset
MOCHA Metadata Publishing Framework

• Provides information about system resources
– data sources
– schemas and mappings
– user-defined types and functions
• Automates the operation of MOCHA
– incremental system growth
– no fixed or hardwired parameters
– no extension by re-compilation
• Shares metadata with others over the Internet
– machine-readable form
MOCHA Catalog Organization

• Metadata about "resources"
– local and global tables
– UDF data types and operators
– schema mapping rules
– DAPs
• Each resource has a Uniform Resource Identifier (URI) in a global namespace
– e.g.: mocha://cs1.umd.edu/EarthSci/Polygon
• Modeled with RDF, serialized with XML: easy to understand, use, and exchange
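Since every resource is addressed by a mocha:// URI, the catalog reduces conceptually to a registry keyed by URI. A toy sketch of that idea; the `Resource` record and its fields mirror the metadata listed on these slides but are illustrative, not MOCHA's actual catalog schema.

```java
import java.util.HashMap;
import java.util.Map;

public class MochaCatalog {
    // Minimal resource record: the kind of metadata the slides list
    // (type name, implementing class, code repository location).
    record Resource(String type, String className, String repository) {}

    private final Map<String, Resource> byUri = new HashMap<>();

    public void register(String uri, Resource r) { byUri.put(uri, r); }
    public Resource lookup(String uri)           { return byUri.get(uri); }

    public static void main(String[] args) {
        MochaCatalog cat = new MochaCatalog();
        cat.register("mocha://cs1.umd.edu/EarthSci/Raster",
                     new Resource("Raster", "Raster.class", "cs1.umd.edu/EarthSci"));
        System.out.println(cat.lookup("mocha://cs1.umd.edu/EarthSci/Raster").type());
    }
}
```

The URI doubles as the global name the QPC uses when deciding which Mochlet to ship.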
RDF Model: Data Types

[Diagram: RDF graph for the resource mocha://cs1.umd.edu/EarthSci/Raster with properties mocha:Type = Raster, mocha:Class = Raster.class, mocha:Repository = cs1.umd.edu/EarthSci, mocha:Size = 1 megabyte, and mocha:Creator]
XML Serialization: Data Types

• W3C standards
• Easy to specify using GUI tools
• Easy to exchange
• Crawlers can harvest it
• Stored in
– a DB
– the file system

<rdf:Description about="mocha://cs1.umd.edu/EarthSci/Raster">
  <mocha:Type>Raster</mocha:Type>
  <mocha:Class>Raster.class</mocha:Class>
  <mocha:Repository>cs1.umd.edu/EarthSci</mocha:Repository>
  <mocha:Size>1 MB</mocha:Size>
  <mocha:Creator>[email protected]</mocha:Creator>
</rdf:Description>
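Because the catalog entries are plain XML, any XML toolkit can read them. A sketch of extracting one field from such an entry with the JDK's DOM parser; this is not MOCHA's actual catalog code, and the parse is deliberately non-namespace-aware so the prefixed tag names can be matched literally.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import java.io.StringReader;

public class CatalogEntryParser {
    public static String typeOf(String rdfXml) throws Exception {
        // The default DocumentBuilderFactory is not namespace-aware, so
        // "mocha:Type" is treated as a literal tag name: enough for a sketch.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(rdfXml)));
        return doc.getElementsByTagName("mocha:Type").item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String entry =
            "<rdf:Description about=\"mocha://cs1.umd.edu/EarthSci/Raster\">"
          + "<mocha:Type>Raster</mocha:Type>"
          + "</rdf:Description>";
        System.out.println(typeOf(entry));  // Raster
    }
}
```

A production system would parse namespace-aware and validate against the RDF vocabulary; the point here is only that the metadata is machine-readable with standard tools.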
Other Resources in MOCHA

• Local and global tables
– data sources + columns + types
• UDF functions
– argument types + return type
– code repository
• Schema mapping rules
• DAPs
– URL
– login information
Schema Mapping in MOCHA

[Diagram: global table Rasters(location, image, week, band) mapped onto local table RastersMD(point1, point2, photo, date, band); location = rect(point1, point2), week = week(date)]

• Direct column mappings
• Complex expressions
MOCHA Schema Mapping Rules

• Use XML to encode the mapping rules
• Schema mapping sub-plans (SMPs) become the leaf nodes of the plan tree

<MapList>
  <mi mapped="direct">
    <mocha:Column>image</mocha:Column>
    <mocha:Expr>photo</mocha:Expr>
  </mi>
  <mi mapped="expression">
    <mocha:Column>location</mocha:Column>
    <mocha:Expr>rect(point1, point2)</mocha:Expr>
  </mi>
  …
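A rule list like this drives a rewrite of global column names into local expressions before a sub-query reaches the data source. A toy version of that substitution; the rule table mirrors the Rasters/RastersMD example above, but the rewriting approach (plain word-boundary replacement) is an illustrative simplification, not MOCHA's actual rewriter.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SchemaMapper {
    // Global column -> local expression, as in the <MapList> rules.
    static final Map<String, String> RULES = new LinkedHashMap<>();
    static {
        RULES.put("image", "photo");                   // direct mapping
        RULES.put("location", "rect(point1, point2)"); // expression mapping
        RULES.put("week", "week(date)");
    }

    // Replace each global column name (as a whole word) with its
    // local expression; one pass per rule.
    static String rewrite(String expr) {
        for (Map.Entry<String, String> r : RULES.entrySet()) {
            expr = expr.replaceAll("\\b" + r.getKey() + "\\b", r.getValue());
        }
        return expr;
    }

    public static void main(String[] args) {
        System.out.println(rewrite("Select location, Composite(image) From Rasters"));
    }
}
```

A real rewriter would work on the parsed query tree (the SMP leaf nodes), not on strings, so that column names inside literals are never touched.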
MOCHA Optimization Framework

• Query optimization based on heuristics
– cost = network + CPU + I/O
• The network is the dominant factor (WAN): optimize for it first
• CPU and I/O are cheaper: optimize for them later
• Operator placement: Enhanced Hybrid Shipping
– code shipping
– data shipping
Operator Placement in MOCHA

• Data-reducing operators "filter" the data
– aggregates, predicates, projections, semi-joins
– e.g., Composite(), Overlaps(), AvgEnergy()
• Push them to the DAPs
– return distilled results
– less data movement
Operator Placement in MOCHA (cont'd)

• Data-inflating operators "expand" the data
– projections, image processing, some joins, …
– e.g., DoubleResolution(), RotateSolid()
• Pull them to the QPC
– data shipping policy [FJK96]
– only send back the raw arguments
– less data movement
Placement Metric: VRF

Volume Reduction Factor: given operator f and relation R,

VRF(f) = VDT / VDA

• VDT: volume of data transmitted after applying f to R
• VDA: volume of data originally present in R

f is data-reducing ⇒ VRF < 1 (e.g., Composite())
f is data-inflating ⇒ VRF ≥ 1 (e.g., DoubleRes())
Goal: Plans with small CVRF

Cumulative Volume Reduction Factor: given a plan P for query Q over relations R1, …, Rn,

CVRF(P) = CVDT / CVDA

• CVDT: volume of data transmitted by applying all operators in P to R1, …, Rn
• CVDA: volume of data originally present in R1, …, Rn

Search space: the optimizer searches for plans with CVRF(P) ∈ [0, 1] that move a minimal amount of data.
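With byte volumes in hand, the placement rule on the preceding slides reduces to comparing VRF against 1, and plan quality to the CVRF ratio. A minimal sketch; the concrete byte counts in main are made-up illustrations, loosely echoing the 200MB-to-200KB filtering example from earlier slides.

```java
public class PlacementHeuristic {
    // VRF(f) = VDT / VDA: bytes out of operator f over bytes into it.
    static double vrf(double bytesOut, double bytesIn) {
        return bytesOut / bytesIn;
    }

    // Data-reducing operators (VRF < 1) go to the DAP near the data;
    // data-inflating ones (VRF >= 1) stay at the QPC.
    static String place(double vrf) {
        return vrf < 1.0 ? "DAP" : "QPC";
    }

    // CVRF(P) = CVDT / CVDA: total bytes the plan moves over the
    // network, relative to the raw volume of its input relations.
    static double cvrf(double totalTransmitted, double totalAvailable) {
        return totalTransmitted / totalAvailable;
    }

    public static void main(String[] args) {
        // Composite(): 200 MB of tuples in, 0.2 MB of results out.
        System.out.println(place(vrf(0.2, 200.0)));   // DAP
        // DoubleRes(): 100 MB in, 400 MB out.
        System.out.println(place(vrf(400.0, 100.0))); // QPC
    }
}
```

This also shows why selectivity alone is not enough: a 50% selective predicate over wide tuples can still have a tiny VRF if only a few small columns survive the projection.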
MOCHA Query Optimizer

• System R style
– left-deep plans (joins at the QPC)
– cost: execution time (network + CPU + I/O)
– operator placement: VRF and plan cost
– selections, unions, and joins
• Placement policy: Enhanced Hybrid Shipping
– code shipping: operators at the DAPs
– data shipping: operators at the QPC
– generalizes Hybrid Shipping [FJK96]
Sequoia 2000 Benchmark

• Goals of the first experiment:
– measure how much code shipping can gain
– validate the proposed heuristics (VRF, CVRF)
• Configured MOCHA with plans that place operators
– at the DAP with code shipping
– at the QPC with data shipping
Reducing vs. Inflating

[Bar chart: running time (secs, 0-2000) broken into DB, CPU, NET, and MISC components, for QPC vs. DAP placement across query classes Q1-Q3]

• Query classes
– Q1: composite of all images
– Q2: clipping and sub-setting
– Q3: double resolution of images
• Performance
– composites: 99% data reduction, 4-to-1 better performance
– clipping and expansion: 80% data reduction, 3-to-1 better performance
• Validates the heuristics
VRF vs. Selectivity

[Bar chart: running time (secs, 0-800) broken into DB, CPU, NET, and MISC components, for QPC vs. DAP placement at selectivities 0, .25, .50, .75, and 1]

• Selectivity and cardinality are not enough for distributed predicate placement
• Consider 50% selectivity:
– DAP CVRF = 0.01
– QPC CVRF = 1
• VRF is a better metric
WAN Experiment

• Sites used:
– University of Maryland (QPC)
– University of Puerto Rico
– Oregon Graduate Institute
– University of North Dakota
– University of Alabama
Union with Data-Reducing

Q6:
Select landuse, location
From polygons
Where perimeter(location) > 2000.0

Sites: UPR and OGI

[Bar charts: execution time for Q6 (secs, 0-700) and resource usage for Q6 (secs, 0-1200) under the DS, QS, and EHS execution policies]

• EHS is the best option
– filters the data
– 2-to-1 better performance
– minimal resource usage
Union with Reducing and Inflating

Q5:
Select landuse, location, triangulate(location)
From Polygons
Where perimeter(location) > 2000.0

Sites: UPR and OGI

[Bar charts: execution time for Q5 (secs, 0-2000) and resource usage for Q5 (secs, 0-3500) under the DS, QS, and EHS execution policies]

• EHS is better than DS and QS
– 2-to-1 better than QS
– 6-to-1 better than DS
– consumes the least resources
Join with Data-Reducing

Q8:
Select P.landuse, R.location, R.week
From polygons P, rasters R
Where overlaps(P.location, R.location)
And perimeter(P.location) > 2000.0

Sites: UPR and OGI

[Bar charts: execution time for Q8 (secs, 0-700) and resource usage for Q8 (secs, 0-1400) under the DS, QS, and EHS execution policies]

• EHS is the best option
– 3-to-1 better performance
– minimal resource usage
• Same pattern as with unions: data movement is the key
MOCHA System Status

• Operational MOCHA prototype
– it's real!
– over 40,000 lines of 100% Java code (JDK 1.3)
– people involved:
  • Manuel Rodriguez-Martinez (lead)
  • Mike McGann
  • Steve Kelley
  • Vadim Katz
  • John Townshend, Frank Lindsay, Ben White (geographers)
  • Joseph JaJa (algorithms)
– tested with the NASA ESIP Federation
  • Los Alamos fire
– supports: Oracle, Postgres, Informix, Sybase, HPSS
Features of MOCHA

• Automatic code deployment
• Scalable middleware architecture
• Query optimization based on reducing data movement
• Metadata publishing framework [RMR00a]
– RDF and XML
– publishes schemas, mappings, types, and functions
– drives automatic code deployment
• Schema mapping rules expressed in XML
– attached as leaf nodes in the query plan
– extensible
MOCHA Publications

• Research papers and talks
– ACM SIGMOD 2000
– EDBT 2000
• Demos
– ACM SIGMOD 2000
– SSDBM 2001
– NASA ESIP meetings and workshops
– U.S. National Academy of Sciences
The Future of MOCHA

A Million-Site MOCHA

• The role of MOCHA in distributed software systems
– sensors
– satellites
– network switches and routers
– laptops, palm computers
– custom-built devices
– cars, planes, boats
– people (firemen), animals (whales)
Network of MOCHA-enabled sensors

[Diagram: many DAP-equipped sensors deployed across an area]

• Sensors are deployed in an area using ad hoc network techniques
• Sensors run Java JDK 1.3
• Lighter sensors run Java JDK 1.3 Micro Edition
Organization of sensors

[Diagram: sensor groups, each with a leader and normal sensors]

• Sensors are grouped together for a specific goal or service
– data acquisition
– data aggregation and analysis
– data streaming
• Group leaders are responsible for
– establishing themselves (broadcast, voting, …)
– coordination among the sensors
– making decisions (agents)
– participating in other, higher-level groups (hybrid P2P)
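One of the simplest realizations of "establishing themselves (broadcast, voting, …)" is a bully-style election in which every sensor announces its id and the highest id becomes leader. A toy sketch under that assumption; the ids and the max-id rule are illustrative, and a real deployment would add timeouts, failure detection, and re-election.

```java
import java.util.List;

public class SensorGroup {
    // Toy election: every sensor "broadcasts" its id (here, a list)
    // and the highest id wins the vote.
    static int electLeader(List<Integer> sensorIds) {
        return sensorIds.stream().max(Integer::compare).orElseThrow();
    }

    public static void main(String[] args) {
        System.out.println(electLeader(List.of(7, 3, 42, 19))); // 42
    }
}
```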
Concrete Example (from NASA)

• Constellation of satellites (with sensors)
• A group observes gamma radiation
– aggregates measurements
– determines an important radiation event
• The group leader tells the other peer group leaders to instruct their sensors to observe the gamma radiation event (reaction)
• The system adapts to changes in the environment
MOCHA's Code Shipping feature for:

• upgrades to fix bugs
• fresh code to gather data
– at different resolutions
– new aggregates or functions
• dynamically configured code
– application-specific security protocols
– location-dependent encryption