data sciences at sandia national laboratories
DESCRIPTION
Data Sciences at Sandia National Laboratories. Sept, 2014 Steven Castillo, Ph.D. ISR Systems Engineering and Decision Support Presentation to University of Illinois at Urbana-Champaign. Approved for UUR: SAND2014-17938 PE. Motivation Slide. 2. 4. 4. 2. 9. 24. 39. 24. 18. 37. - PowerPoint PPT PresentationTRANSCRIPT
1
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.
Data Sciences at Sandia National Laboratories
Sept, 2014Steven Castillo, Ph.D.
ISR Systems Engineering and Decision Support
Presentation toUniversity of Illinois at Urbana-Champaign
Approved for UUR: SAND2014-17938 PE
2
Motivation SlideMission Context• The analysis of data is a common point of success or failure• We see exponential growth in the size, complexity and information
density of data far exceeding available human analysis capabilities.• This research challenge intersects most Sandia Mission Areas.
Why Sandia? Envision, create the future Pathfinder, data-centric systems for
high-consequence decision making: Sandia delivers sensors into several mission areas. Sandia has access to relevant data sets and analysts. We live the full data science cycle at the cutting edge as part of executing
our broad national security missions.
4
18
2
24
37
92
24
4
39
3
Contributors to current limitations:• Only a small fraction of the data is ever examined by
analysts. Workflows are labor intensive and devoid of effective computational tools.
• Systems do not exploit the relationship discovery potential of the data or identify meaningful, defensible trends and patterns.
4
4
Intelligence
Information
Data
Signal
Physics
Incr
easi
ng V
alue
Dec
reas
ing
Vol
ume
Analyst (2014)
Analyst (2017)
Sensors• Increasing density• Increasing data rates• Increasing information density
5
Focus: High Consequence Decision Making
Crosscutting Capabilities: Fundamental mathematics and science transformed into advanced mission capability
Sandia World Class S&T: Research FoundationsSupporting National Security Missions: Nuclear Weapons, Defense & Intelligence
Human Analyst Centric – Transforming Information into Actionable Intelligence Discovery and Disambiguation Patterns of Life: anomaly detection Patterns of Life: predictive analytics Robust Data Analysis: incomplete, vulnerable and uncertain data Intelligent Data Collection: tasking of sensors, optimal data sets
6
Research Portfolio – An Investment in Next Generation Analytics for the Nation
Current portfolio PANTHER Grand Challenge – Pixels to
Intelligence: Pattern Analytics Counter Adversarial Data Analytics -
internal research & development Adversarial Modeling - internal
research & development DARPA sponsored graph modeling Customer sponsored streaming/graph
algorithms Customer funded cyber defense
Portfolio focus on innovative R&D• Foster high-risk
projects• Promote R&D
impact on mission stakeholders
7
Contrast NationalSecurity to Google
Google NS Data to Decisions
Data Gathering Highly Cooperative/dense, homogeneous (Android, Chrome, Google Searches, etc.)
Adversarial Environments/sparse/diverse (can’t see everything all the time)
Data Analytics Environment
Large, homogeneous computing environment (cloud)
Diverse architectures, available throughput, bandwidth
Human Analysis Consumer performing “buy” decisions
Analyst extracting actionable intelligence
Decision Consequences Small – millions of consumers making decisions related to shopping
High – tactical and strategic, UQ and validation are critical
8
Grand Challenges – Internal R&D
Laboratory Directed Research & Development (LDRD) Sandia initiates 1-2 GC LDRDs/ yr Cross laboratory, interdisciplinary High risk, urgent needs, low TRL (technology
readiness level) Three year effort
Create a Lasting National Technical Capability Required to assemble highly engaged external
advisory board Cooperative development with partners to
shape technical vision
9
PANTHER Grand ChallengePattern Analytics to Support High Performance Exploitation and Reasoning: Crosscutting Innovation in Geospatial-Temporal Analysis
PANTHER R&D: Scalable, temporal and geospatial relationship discovery for robust, mission- relevant pattern analysis.
How is PANTHER different? Develop new mathematical approaches to this problem with a deep understanding of human capabilities and information landscape.
Challenge Problem Categories:
Signature Search – enable searches for signatures that are difficult to discretize and subject to interruptions in space and time.Motion & Trajectory Analysis –discover patterns in motion datasets at multiple semantic scales under conditions of intermittent data.
10
Panther Team - Interdisciplinary
PI: Kristina Czuchlewski, PhD PM: Bill Hart, PhD Team Leads: Jim Chow, Ph.D., Randy Brost, Ph.D., Laura
McNamara, Ph.D. and David Stracuzzi, Ph.D.
11
R&D Challenge Questions
Where are chemical processing plants?
Where are active businesses?
Did someone arrive in a car and enter a building? Which one(s)?
Which aircraft flights are point-to-point?
Are any aircraft flying search patterns?
Are any aircraft flying search patterns over sensitive locations?
Is there an activity surge?
Is an aircraft flight pattern departing from normal? What might it do next?
What can we infer given limited data?
For any of the above, what is the result confidence?
Signature Search
Temporal Patterns
Limited Data
Uncertainty/Quality
Trajectory Analysis
12
Current Paradigm New ParadigmState of the Art Sandia R&D PANTHER Grand Challenge
Approach Single frame or target recognition
Statistical, pixel-level processing for specific tactical missions
Identify patterns/ relationships and activities over multiple target classes and extend to strategic missions. Achieve 107 reduction in data volume in space & time.
Spatial Single frame Specific areas (10s of km2)
Achieve 104 single frame reduction in data presented.Query and exploit 100s of km2
Temporal Minutes of data, real-time and offline.
Weeks of information, available in minutes
Pattern based query of up to 1 year’s worth of data. Available in minutes.
Example ATR, CCD, matched filters
Statistical Normalized Coherence
Object-based image analysis: features, entities, patterns and relationships. Ask questions that cannot currently be answered.
Will it work?
Limited, does not scale.
Limited, scales for specific missions.
Unproven. Scalable & extensible to other domains.
13
FY14 Highlights Unsupervised computational methods for
detecting outliers (with no a priori knowledge) within a TB-scale database… in under 30 minutes. New geometric feature vector enables
comparisons. Result: 44 seconds on our Netezza DB
machine to compute a trajectory segmentation from points.
Result: 20 minutes of wall-clock time to turn 1.2 billion points into 15 million trajectories.
Geospatial feature extraction and classification for signature relationship and temporal trending searches. Pixel statistics extracted via new superpixel
segmentation algorithm. Unique temporal attributes of coherent
changes exploited for labeling.
14
FY14 Highlights, cont. New, efficient graph algorithm formulation developed
to enable geospatial-temporal topological search complexity. Multi-source data search under one framework Representation and search under heterogeneous
temporal conditions (ephemeral and activity). Complex sensor feature data “compressed” for analysis
with pointers back to original sensor source. Intermittency/ interruption nodes implemented.
Eye-tracking experiments are enabling visual cognition/ search models – and eventual user interface design(s).
Pattern match quality, statistical and probabilistic approaches are under investigation for characterizing uncertainty in geospatial temporal graph representations.
Bright
Roof
Shadow
E21 E23 E24
E14
E32
E33
E34
E35
B1
B2B3
B4 B5
B6B7
B8
G1
G2
G3
G4
R1
R2
R3
R4
R5
R6
R7
OP1
P1
P2
P3 P4
15
GeoSpatial Semantic Graph Representations of Features & Activity
Graph includes activity: @ t=1, the graph includes objects with
location From t=1 to t=2, the graph encodes change Nodes for activity events.1
Node attributes include time observed. No persistence expected. Spatial and temporal relationship edges.1 t=1
t=2
t=3
t=1
t=2
t=3
A
A
A
A
16
Signature Search
A signature (over space/ time) encodes a desired question.
For example, “Where are buildings with nearby grass, pavement, and dirt?
Graph template:
Building
Grass
Pavement
Dirt
17
t=1
t=2
t=3
A
A
A
A
Matches
Graph search finds all matches to signature template. In this case all red nodes with adjacent
green, grey, and tan nodes. New approach to graph search,
polynomial time complexity Searches are saved
t=1
t=2
t=3
18
Advantages Efficient representations in time (only store change).
Relationship, change, and temporal analysis over multiple times and heterogeneous spatial ensembles in the same query. Change detection
Activity characterization
Efficient search algorithms – computation not limited by brittle relational databases.
Potential to take full advantage of graph topology search advances, enhanced by geospatial-temporal semantics.
Feature-based analysis Multi-modality, in a single search representation.
Sensor agnostic – PANTHER emphasizes SAR.
19
Extracting Statistical Distributions Using Superpixel Segmentation
• Superpixel Segmentation• Divides image into compact regions
containing pixels similar in spatial proximity and intensity
• Derives from large body of research in optical image processing community extended to SAR imagery
• PANTHER Approach• Superpixel segmentation algorithms
enable high quality segmentations and efficient execution
• De-noising advances enable utilization of novel image processing techniques.
Superpixel Segmentation of SAR Image
20
Initial Automated Static Feature Extraction Result
Exploit ~21 days worth of change via coherenceResult: automatically extract paved roads
21
Did someone arrive by car and visit a building?
Query algorithms must handle disconnections & interruptions
Static and ephemeral features for time slice A:*
Note: Based on hand-annotated primitive features.* Note: Hand-edited for explanation clarity.
Data semantics:BuildingFenceDirt RoadGravelDesertShadowLow VegetationHuman TracksVehicle TracksVehicle
BFDRGDSLVHTVTV
22
Example: Connected SignatureResult from star-graph algorithm:*
Correct
Correct
Correct
Correct
Why wasn’t this found?
Note: Based on hand-annotated primitive features.* Note: Hand-edited for explanation clarity.
23
Result of Disconnected Signature SearchResult from star-graph algorithm:*
Correct
Correct
Correct
Correct
Correct
Note: Based on hand-annotated primitive features.* Note: Hand-edited for explanation clarity.
24
Diversity of ProblemsPower Plant Search Tank Complex Search
Site Activity Analysis
Construction Analysis
Activity Analysis -- Interrupted Signature
All of these were solved by the same code.
25
Discovery of Flight Patterns
Collections of geometric descriptions can describe a trajectory.Extensions: impact of time. Forgot Something
Avoid
Holding Pattern
Mapping
26
Big Feature Space Advantage Clustering, leading to
unsupervised learning techniques
Previous examples showed searching the space for a specific pattern or a specific volume of the feature space
But, with clustering, the computer can group the different patterns in the feature space without knowing a priori what they are.
Perhaps most importantly, many clustering algorithms specifically identify outliers in the feature space that correspond to odd behaviors
Did not specify “find this,” only told routine to “make groups of similar flights.”This was one of many clusters that had distinctive shapes
Forgot Something, Revisited
27
Discovery of Odd flights
Clustering done based on geometric featuresMany clusters found, but what remains is…
Represents approximately 700 out of a total of 50,000 flights from one day
Note: we have ~5M points/day, ~1GB/day, currently >300GB
28
Better knowledge & deeper insights from Big Data
in minutes, not months; over months, not hours or minutes; covering hundreds, not 10s of km2
29
Technical Areas for Collaboration Graph Analytics Tensor Analysis Computational Geometry Digital Signal and Image Processing Machine Learning Scalable and Highly Distributed Architectures Human Performance and Cognitive Understanding Machine Feature Identification in Imagery (EO, SAR, LIDAR) Uncertainty Quantification and Propagation – from Sensor to
Answer Robust Data Analysis Intelligent Data Collection
30
University Collaborations
PhD Interns: Colorado State University, Utah State University Faculty have clearances and clear expertise in critical areas Students can gain clearances Year-round appointments give flexibility for Sandia work, university
requirements Win-win-win: Sandia gains valuable technical support in critical
challenge areas, Student gets a degree/publications/potential future employment, University and professor fulfill mission and gain a strong supporter.
Critical Skills Master’s Program (CSMP): University of Illinois Similar to old One-Year-on-Campus-Program
NDA’s: USU, CSU, UIUC (in progress)