Big Data Spatial Analytics
Mansour Raad
Story Time...
is hereby granted to
to certify that he/she has completed to satisfaction
The CCDH Exam
Cloudera, Inc. 210 Portage Avenue Palo Alto, CA 94306 www.cloudera.com
___________________________ Date Granted
Test Date:
___________________________ Authorized Signature
Mansour Raad
March 2, 2012
Mar 09, 2012
Finally, a big nail...
Input 1
U.S.Demographic
Data
Demographic Info
• Location
• Gender
• Race
• Income
• Age
Input 2
~1000 Locations
Task...
For Each Location
For Each Demographic
50 Mile Heatmap
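A minimal Python sketch of the first step of this task, selecting the demographic records that fall within 50 miles of one location. It is only an illustration, not the method used in the talk: the haversine formula, the record layout (`lon`, `lat`, `age`) and the sample values are all assumptions.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lon1, lat1, lon2, lat2):
    """Great-circle distance in miles between two lon/lat points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def records_within(location, records, radius_miles=50.0):
    """Return the records that lie within radius_miles of a (lon, lat) location."""
    lon0, lat0 = location
    return [r for r in records
            if haversine_miles(lon0, lat0, r["lon"], r["lat"]) <= radius_miles]


# Hypothetical sample data, for illustration only.
records = [
    {"lon": -84.2, "lat": 39.4, "age": 34},
    {"lon": -84.5, "lat": 39.1, "age": 51},
    {"lon": -118.2, "lat": 34.0, "age": 27},
]
nearby = records_within((-84.2, 39.4), records)
```

Repeating this for each of the ~1000 locations and each demographic attribute gives the per-location inputs for a heatmap.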
“Traditional Way”
• 14 Days Later
• 850GB Raster
Gotta Be A Better Way !
Hadoop
$> cat input | map | sort | reduce > out
Advantage
• Parallelism
• Fast Input Stream
• Fast Computational Geometry
• Distributed Cache
Vector / Raster
Cooperative Processing
g.beginGradientFill(GradientType.RADIAL, [ 0xFF0000, 0x0000FF ], ...);
g.drawRect(x, y, 200, 200);
g.endFill();
bitmapData.draw(shape, null, null, BlendMode.SCREEN, null, true);
Where To Run 10 Nodes ?
~238 MB Vector
vs.
~850 GB Raster
Best Visualizer ?
What is Big Data ?
Great Story Telling Tool !
Data Democratizer!
Beyond Dashboard!
You can have the best ML, the best model, the best team; all useless if you cannot tell a story of the results!
What Is Big Data ?
(academic)
Beyond Traditional Means !
Traditional Processing
Traditional Database
•Too Big
•Too Fast
•Unstructured
Forcing new ways of thinking !
Big Data Sources...
Catch-all words, just like “Cloud” was 3 years ago!
WebLogs
“Internet Of Things”
Imagery
Health Records
VOLUME
VELOCITY VARIETY
Volume
• Very Large Amount
• More Parameters
• Multi Node
• Storage
• Processing
- Simple math is more effective with large parameters
- Scalable storage
- Program to data rather than data to program
Velocity
• Rate of digital flow
• Streaming
• Event Processing
• Feedback Loop
• Recommendations
- Clicks, locations
- Mobile / Smartphones
- Last 5 min snapshot of traffic is no good when crossing the street
- CERN
Velocity Engines
• IBM InfoSphere Streams
• Twitter Storm
• Apache S4
Variety
• Unstructured
• Incomplete
• Semantically Different
Data is messy
Storage Variety
• NoSQL
• Columnar (HBase)
• Key/Value (Redis)
• Document (MongoDB)
• Graph (Neo4J)
Hadoop
HDFS
• Multi-TB Storage
• Inexpensive Nodes
• Fault Tolerant
• Concurrent Reading
• Brings Programs To Data
MapReduce
• Software Framework
• Parallel Processing
• Jobs Executed on HDFS
• Java / Python / C++
• Spatial Libraries
MapReduce Job
input | map | sort | reduce | output
Java Jars packaged and sent to data nodes for execution
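The `input | map | sort | reduce | output` flow can be imitated on a single machine. The Python sketch below is an assumption-laden toy, not Hadoop itself: it counts records per key, where the key is taken to be the first tab-separated field, and the sample lines are made up.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(lines):
    """Emit (key, 1) per record; the key is assumed to be the first tab-separated field."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields and fields[0]:
            yield fields[0], 1


def reduce_phase(sorted_pairs):
    """Sum the values of each run of equal keys; the input must already be sorted by key."""
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield key, sum(value for _, value in group)


def run(lines):
    """map | sort | reduce, executed locally."""
    return list(reduce_phase(sorted(map_phase(lines))))


# Hypothetical input records, for illustration only.
sample = ["female\t34", "male\t51", "female\t27"]
result = run(sample)
```

In a real job the framework performs the sort (shuffle) between the two phases and runs many mappers and reducers in parallel on the data nodes.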
Apache Hive
“SQL”
MapReduce Job
HDFS
CSV
TSV
JSON
BINARY
MapReduce
hive> select * from cities where country='lebanon';
Spatial Storage
• CSV,TSV Lat,Lon
• Esri JSON format
• {"geometry":{"x":-123,"y":45},"attributes":{}}
• Custom
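A short Python sketch of reading the two text formats above into one common `(x, y, attributes)` shape. The column order (lat, then lon), the one-feature-per-line JSON layout, and the function names are assumptions made for illustration.

```python
import csv
import io
import json


def points_from_csv(text, delimiter=","):
    """Yield (x, y, attributes) from rows assumed to hold lat,lon in the first two columns."""
    for row in csv.reader(io.StringIO(text), delimiter=delimiter):
        if len(row) < 2:
            continue  # skip blank or short rows
        lat, lon = float(row[0]), float(row[1])
        yield lon, lat, {}


def points_from_esri_json(text):
    """Yield (x, y, attributes) from text assumed to hold one JSON point feature per line."""
    for line in text.splitlines():
        if not line.strip():
            continue
        feature = json.loads(line)
        geometry = feature["geometry"]
        yield geometry["x"], geometry["y"], feature.get("attributes", {})


# Hypothetical sample records.
csv_points = list(points_from_csv("45,-123\n39.4,-84.2\n"))
json_points = list(points_from_esri_json(
    '{"geometry":{"x":-123,"y":45},"attributes":{"name":"a"}}'))
```

For TSV input, pass `delimiter="\t"`.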
What About Spatial ?
User Defined Functions
• select tolower("ESRI");
• select * from mytable where cos(rad) < 0.1;
Spatial UDF !
select * from cities where near(x,y,-84.2,39.4);
select * from cities where contains(x,y,'#mypolys');
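What a `contains`-style predicate has to compute can be sketched in Python with a ray-casting point-in-polygon test. This is not the UDF from the talk: the function signature, the polygon representation (a list of `(x, y)` vertices) and the sample data are hypothetical.

```python
def contains(x, y, polygon):
    """Ray-casting test: True if point (x, y) is inside the polygon.

    polygon is a list of (x, y) vertices; the ring does not need to be closed.
    Points exactly on an edge may be classified either way.
    """
    inside = False
    j = len(polygon) - 1
    for i in range(len(polygon)):
        xi, yi = polygon[i]
        xj, yj = polygon[j]
        # Does edge (j -> i) straddle the horizontal line through y?
        if (yi > y) != (yj > y):
            x_cross = xi + (y - yi) * (xj - xi) / (yj - yi)
            if x < x_cross:
                inside = not inside  # the ray to the right crosses this edge
        j = i
    return inside


# Hypothetical polygon and rows, for illustration only.
square = [(-85.0, 39.0), (-84.0, 39.0), (-84.0, 40.0), (-85.0, 40.0)]
cities = [("a", -84.2, 39.4), ("b", -83.0, 39.4)]
matches = [name for name, x, y in cities if contains(x, y, square)]
```

The list comprehension plays the role of the `where` clause: each row is kept only if the predicate returns true.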
Python GeoProcessing
HDFS RDBMS
“small data” “big data”
Hadoop Tools
ArcMap Catalog
Demo Time
The “Zoo”
• Pig - high-level language for Hadoop
• HBase - real-time random access to HDFS
• Flume - streaming data flow
• Mahout - machine learning
• ZooKeeper - distributed state management
Processing Evolution
• Transactional - Batch
• Operational - Dashboard
• Analytical - Exploratory
• Intelligent - Real-time, predictive
Fixed Schema
Variable Schema
“[T]here are known knowns; there are things we know that we know. There are known unknowns; that is to say, there are things that we now know we don't know. But there are also unknown unknowns – there are things we do not know we don't know.”
—United States Secretary of Defense, Donald Rumsfeld
@mraad
Upcoming Events
Date                 Event                                              Location
March 21, 2013       Esri DC Meet Up – Big Data & Location Analytics    Washington, DC
April 18, 2013       Esri DC Meet Up                                    Washington, DC
March 23–26, 2013    Esri Partner Conference                            Palm Springs, CA
March 25–28, 2013    Esri Developer Summit                              Palm Springs, CA
July 6–9, 2013       Esri National Security Summit                      San Diego, CA
July 8–12, 2013      Esri International User Conference                 San Diego, CA