Data Sciences at Sandia National Laboratories

Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000. Data Sciences at Sandia National Laboratories. Sept. 2014. Steven Castillo, Ph.D., ISR Systems Engineering and Decision Support. Presentation to University of Illinois at Urbana-Champaign. Approved for UUR: SAND2014-17938 PE.



TRANSCRIPT

Page 1: Data Sciences at Sandia National Laboratories

1

Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.

Data Sciences at Sandia National Laboratories

Sept. 2014

Steven Castillo, Ph.D.

ISR Systems Engineering and Decision Support

Presentation to University of Illinois at Urbana-Champaign

Approved for UUR: SAND2014-17938 PE

Page 2: Data Sciences at Sandia National Laboratories

2

Motivation Slide

Mission Context
• The analysis of data is a common point of success or failure.
• We see exponential growth in the size, complexity and information density of data, far exceeding available human analysis capabilities.
• This research challenge intersects most Sandia Mission Areas.

Why Sandia?
• Envision, create the future
• Pathfinder, data-centric systems for high-consequence decision making:
  • Sandia delivers sensors into several mission areas.
  • Sandia has access to relevant data sets and analysts.
  • We live the full data science cycle at the cutting edge as part of executing our broad national security missions.


Page 3: Data Sciences at Sandia National Laboratories

3

Contributors to current limitations:
• Only a small fraction of the data is ever examined by analysts. Workflows are labor intensive and devoid of effective computational tools.
• Systems do not exploit the relationship discovery potential of the data or identify meaningful, defensible trends and patterns.

Page 4: Data Sciences at Sandia National Laboratories

4

[Figure: layered diagram with levels Intelligence, Information, Data, Signal, Physics; axis labels "Increasing Value" and "Decreasing Volume"; markers "Analyst (2014)" and "Analyst (2017)".]

Sensors
• Increasing density
• Increasing data rates
• Increasing information density

Page 5: Data Sciences at Sandia National Laboratories

5

Focus: High Consequence Decision Making

Crosscutting Capabilities: Fundamental mathematics and science transformed into advanced mission capability

Sandia World Class S&T: Research Foundations

Supporting National Security Missions: Nuclear Weapons, Defense & Intelligence

Human Analyst Centric – Transforming Information into Actionable Intelligence
• Discovery and Disambiguation
• Patterns of Life: anomaly detection
• Patterns of Life: predictive analytics
• Robust Data Analysis: incomplete, vulnerable and uncertain data
• Intelligent Data Collection: tasking of sensors, optimal data sets

Page 6: Data Sciences at Sandia National Laboratories

6

Research Portfolio – An Investment in Next Generation Analytics for the Nation

Current portfolio
• PANTHER Grand Challenge – Pixels to Intelligence: Pattern Analytics
• Counter Adversarial Data Analytics - internal research & development
• Adversarial Modeling - internal research & development
• DARPA sponsored graph modeling
• Customer sponsored streaming/graph algorithms
• Customer funded cyber defense

Portfolio focus on innovative R&D
• Foster high-risk projects
• Promote R&D impact on mission stakeholders

Page 7: Data Sciences at Sandia National Laboratories

7

Contrast National Security to Google

Data Gathering
• Google: Highly cooperative/dense, homogeneous (Android, Chrome, Google Searches, etc.)
• NS Data to Decisions: Adversarial environments/sparse/diverse (can’t see everything all the time)

Data Analytics Environment
• Google: Large, homogeneous computing environment (cloud)
• NS Data to Decisions: Diverse architectures, available throughput, bandwidth

Human Analysis
• Google: Consumer performing “buy” decisions
• NS Data to Decisions: Analyst extracting actionable intelligence

Decision Consequences
• Google: Small – millions of consumers making decisions related to shopping
• NS Data to Decisions: High – tactical and strategic, UQ and validation are critical

Page 8: Data Sciences at Sandia National Laboratories

8

Grand Challenges – Internal R&D

Laboratory Directed Research & Development (LDRD)
• Sandia initiates 1-2 GC LDRDs/yr
• Cross laboratory, interdisciplinary
• High risk, urgent needs, low TRL (technology readiness level)
• Three year effort

Create a Lasting National Technical Capability
• Required to assemble highly engaged external advisory board
• Cooperative development with partners to shape technical vision

Page 9: Data Sciences at Sandia National Laboratories

9

PANTHER Grand Challenge

Pattern Analytics to Support High Performance Exploitation and Reasoning: Crosscutting Innovation in Geospatial-Temporal Analysis

PANTHER R&D: Scalable, temporal and geospatial relationship discovery for robust, mission-relevant pattern analysis.

How is PANTHER different? Develop new mathematical approaches to this problem with a deep understanding of human capabilities and information landscape.

Challenge Problem Categories:
• Signature Search – enable searches for signatures that are difficult to discretize and subject to interruptions in space and time.
• Motion & Trajectory Analysis – discover patterns in motion datasets at multiple semantic scales under conditions of intermittent data.

Page 10: Data Sciences at Sandia National Laboratories

10

Panther Team - Interdisciplinary

• PI: Kristina Czuchlewski, PhD
• PM: Bill Hart, PhD
• Team Leads: Jim Chow, Ph.D., Randy Brost, Ph.D., Laura McNamara, Ph.D. and David Stracuzzi, Ph.D.

Page 11: Data Sciences at Sandia National Laboratories

11

R&D Challenge Questions

Where are chemical processing plants?

Where are active businesses?

Did someone arrive in a car and enter a building? Which one(s)?

Which aircraft flights are point-to-point?

Are any aircraft flying search patterns?

Are any aircraft flying search patterns over sensitive locations?

Is there an activity surge?

Is an aircraft flight pattern departing from normal? What might it do next?

What can we infer given limited data?

For any of the above, what is the result confidence?

[Figure labels grouping the questions above: Signature Search, Temporal Patterns, Limited Data, Uncertainty/Quality, Trajectory Analysis]

Page 12: Data Sciences at Sandia National Laboratories

12

Current Paradigm (State of the Art; Sandia R&D) vs. New Paradigm (PANTHER Grand Challenge)

Approach
• State of the Art: Single frame or target recognition
• Sandia R&D: Statistical, pixel-level processing for specific tactical missions
• PANTHER Grand Challenge: Identify patterns/relationships and activities over multiple target classes and extend to strategic missions. Achieve 10^7 reduction in data volume in space & time.

Spatial
• State of the Art: Single frame
• Sandia R&D: Specific areas (10s of km²)
• PANTHER Grand Challenge: Achieve 10^4 single frame reduction in data presented. Query and exploit 100s of km².

Temporal
• State of the Art: Minutes of data, real-time and offline.
• Sandia R&D: Weeks of information, available in minutes
• PANTHER Grand Challenge: Pattern based query of up to 1 year’s worth of data. Available in minutes.

Example
• State of the Art: ATR, CCD, matched filters
• Sandia R&D: Statistical Normalized Coherence
• PANTHER Grand Challenge: Object-based image analysis: features, entities, patterns and relationships. Ask questions that cannot currently be answered.

Will it work?
• State of the Art: Limited, does not scale.
• Sandia R&D: Limited, scales for specific missions.
• PANTHER Grand Challenge: Unproven. Scalable & extensible to other domains.

Page 13: Data Sciences at Sandia National Laboratories

13

FY14 Highlights

• Unsupervised computational methods for detecting outliers (with no a priori knowledge) within a TB-scale database… in under 30 minutes.
  • New geometric feature vector enables comparisons.
  • Result: 44 seconds on our Netezza DB machine to compute a trajectory segmentation from points.
  • Result: 20 minutes of wall-clock time to turn 1.2 billion points into 15 million trajectories.
• Geospatial feature extraction and classification for signature relationship and temporal trending searches.
  • Pixel statistics extracted via new superpixel segmentation algorithm.
  • Unique temporal attributes of coherent changes exploited for labeling.
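The slides do not say how the trajectory segmentation from points is computed. The following is only a minimal sketch of one common approach, assuming each point is a `(track_id, t, x, y)` record and that a new trajectory starts whenever the time gap between consecutive points of the same track exceeds a threshold. The record layout, the function name `segment_trajectories` and the `max_gap` parameter are assumptions made for illustration, not the PANTHER method.

```python
from collections import defaultdict

def segment_trajectories(points, max_gap=60.0):
    """Group (track_id, t, x, y) points into trajectories.

    Assumption: a time gap larger than max_gap between consecutive
    points of one track starts a new trajectory.
    """
    by_track = defaultdict(list)
    for track_id, t, x, y in points:
        by_track[track_id].append((t, x, y))

    trajectories = []
    for track_id, pts in by_track.items():
        pts.sort()  # order each track's points by time
        current = [pts[0]]
        for prev, cur in zip(pts, pts[1:]):
            if cur[0] - prev[0] > max_gap:
                # gap too long: close the current trajectory, start a new one
                trajectories.append((track_id, current))
                current = []
            current.append(cur)
        trajectories.append((track_id, current))
    return trajectories

# Hypothetical sample data: track "a" has a long gap before t=500.
points = [
    ("a", 0, 0.0, 0.0), ("a", 10, 1.0, 0.0), ("a", 500, 5.0, 5.0),
    ("b", 5, 2.0, 2.0), ("b", 15, 2.5, 2.0),
]
trajs = segment_trajectories(points)
```

With this sample input, track "a" is split into two trajectories at the long gap and track "b" stays whole.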

Page 14: Data Sciences at Sandia National Laboratories

14

FY14 Highlights, cont.

• New, efficient graph algorithm formulation developed to enable geospatial-temporal topological search complexity.
  • Multi-source data search under one framework
  • Representation and search under heterogeneous temporal conditions (ephemeral and activity).
  • Complex sensor feature data “compressed” for analysis with pointers back to original sensor source.
  • Intermittency/interruption nodes implemented.
• Eye-tracking experiments are enabling visual cognition/search models – and eventual user interface design(s).
• Pattern match quality, statistical and probabilistic approaches are under investigation for characterizing uncertainty in geospatial-temporal graph representations.

[Figure: example graph with node classes Bright, Roof and Shadow, and node identifiers such as E21, B1, G1, R1, OP1 and P1.]

Page 15: Data Sciences at Sandia National Laboratories

15

GeoSpatial Semantic Graph Representations of Features & Activity

Graph includes activity:
• @ t=1, the graph includes objects with location
• From t=1 to t=2, the graph encodes change
  • Nodes for activity events.
  • Node attributes include time observed.
  • No persistence expected.
  • Spatial and temporal relationship edges.

[Figure: graph at time slices t=1, t=2 and t=3; nodes labeled A.]
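One way to picture this representation is a small in-memory graph whose nodes carry a label and the time slice in which they were observed, and whose edges are typed as spatial or temporal. The sketch below is only illustrative: the class names `Node` and `GeoTemporalGraph`, the field names, and the sample node identifiers are all hypothetical and are not taken from the PANTHER code.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Node:
    node_id: str
    label: str        # e.g. "building", "vehicle", or an activity event
    t_observed: int   # time slice in which the node was observed

@dataclass
class GeoTemporalGraph:
    nodes: dict = field(default_factory=dict)   # node_id -> Node
    edges: list = field(default_factory=list)   # (src_id, dst_id, kind)

    def add_node(self, node_id, label, t_observed):
        self.nodes[node_id] = Node(node_id, label, t_observed)

    def add_edge(self, src, dst, kind):
        if kind not in ("spatial", "temporal"):
            raise ValueError("kind must be 'spatial' or 'temporal'")
        self.edges.append((src, dst, kind))

    def neighbors(self, node_id, kind=None):
        """Return ids connected to node_id, optionally filtered by edge kind."""
        out = []
        for src, dst, k in self.edges:
            if kind is not None and k != kind:
                continue
            if src == node_id:
                out.append(dst)
            elif dst == node_id:
                out.append(src)
        return out

g = GeoTemporalGraph()
g.add_node("B1", "building", 1)      # present at t=1
g.add_node("V1", "vehicle", 2)       # appears at t=2: stored as a change
g.add_node("E1", "arrival", 2)       # activity event node
g.add_edge("B1", "V1", "spatial")    # vehicle is near the building
g.add_edge("V1", "E1", "temporal")   # vehicle's appearance is the event
```

Only the new vehicle and event nodes are added at t=2; the building node from t=1 is not duplicated, which mirrors the "encode change" idea on the slide.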

Page 16: Data Sciences at Sandia National Laboratories

16

Signature Search

A signature (over space/time) encodes a desired question.

For example, “Where are buildings with nearby grass, pavement, and dirt?”

Graph template:

Building

Grass

Pavement

Dirt

Page 17: Data Sciences at Sandia National Laboratories

17

Matches

• Graph search finds all matches to signature template. In this case, all red nodes with adjacent green, grey, and tan nodes.
• New approach to graph search, polynomial time complexity
• Searches are saved

[Figure: graph at time slices t=1, t=2 and t=3; nodes labeled A.]
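The example query (buildings with nearby grass, pavement, and dirt) is a star-shaped template: one center label plus a set of labels that must appear among its neighbors. The sketch below is a plain brute-force match written for illustration only. It is not the new graph search approach mentioned on the slide, which is not described in the transcript; the function name `find_star_matches` and the sample graph are hypothetical.

```python
def find_star_matches(labels, adjacency, center, required):
    """Return ids of nodes labeled `center` whose neighbors cover `required`.

    labels:    dict node_id -> label
    adjacency: dict node_id -> set of neighboring node_ids
    required:  set of labels that must appear among the neighbors
    """
    matches = []
    for node_id, label in labels.items():
        if label != center:
            continue
        neighbor_labels = {labels[n] for n in adjacency.get(node_id, ())}
        if required <= neighbor_labels:   # every required label is present
            matches.append(node_id)
    return matches

# Hypothetical graph: B1 has all three neighbor types, B2 lacks dirt.
labels = {
    "B1": "building", "B2": "building",
    "G1": "grass", "P1": "pavement", "D1": "dirt",
}
adjacency = {
    "B1": {"G1", "P1", "D1"},
    "B2": {"G1", "P1"},
}
matches = find_star_matches(labels, adjacency, "building",
                            {"grass", "pavement", "dirt"})
```

Here only B1 satisfies the template, because B2 has no adjacent dirt node.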

Page 18: Data Sciences at Sandia National Laboratories

18

Advantages

• Efficient representations in time (only store change).
• Relationship, change, and temporal analysis over multiple times and heterogeneous spatial ensembles in the same query.
  • Change detection
  • Activity characterization
• Efficient search algorithms – computation not limited by brittle relational databases.
• Potential to take full advantage of graph topology search advances, enhanced by geospatial-temporal semantics.
• Feature-based analysis
  • Multi-modality, in a single search representation.
  • Sensor agnostic – PANTHER emphasizes SAR.

Page 19: Data Sciences at Sandia National Laboratories

19

Extracting Statistical Distributions Using Superpixel Segmentation

• Superpixel Segmentation
  • Divides image into compact regions containing pixels similar in spatial proximity and intensity
  • Derives from large body of research in optical image processing community extended to SAR imagery
• PANTHER Approach
  • Superpixel segmentation algorithms enable high quality segmentations and efficient execution
  • De-noising advances enable utilization of novel image processing techniques.

[Figure: Superpixel Segmentation of SAR Image]

Page 20: Data Sciences at Sandia National Laboratories

20

Initial Automated Static Feature Extraction Result

Exploit ~21 days’ worth of change via coherence

Result: automatically extract paved roads

Page 21: Data Sciences at Sandia National Laboratories

21

Did someone arrive by car and visit a building?

Query algorithms must handle disconnections & interruptions

Static and ephemeral features for time slice A:*

Note: Based on hand-annotated primitive features.
* Note: Hand-edited for explanation clarity.

Data semantics:
• B: Building
• F: Fence
• DR: Dirt Road
• G: Gravel
• D: Desert
• S: Shadow
• LV: Low Vegetation
• HT: Human Tracks
• VT: Vehicle Tracks
• V: Vehicle

Page 22: Data Sciences at Sandia National Laboratories

22

Example: Connected Signature

Result from star-graph algorithm:*

[Figure: four matches labeled “Correct”; one location annotated “Why wasn’t this found?”]

Note: Based on hand-annotated primitive features.
* Note: Hand-edited for explanation clarity.

Page 23: Data Sciences at Sandia National Laboratories

23

Result of Disconnected Signature Search

Result from star-graph algorithm:*

[Figure: five matches labeled “Correct”.]

Note: Based on hand-annotated primitive features.
* Note: Hand-edited for explanation clarity.

Page 24: Data Sciences at Sandia National Laboratories

24

Diversity of Problems

• Power Plant Search
• Tank Complex Search
• Site Activity Analysis
• Construction Analysis
• Activity Analysis -- Interrupted Signature

All of these were solved by the same code.

Page 25: Data Sciences at Sandia National Laboratories

25

Discovery of Flight Patterns

Collections of geometric descriptions can describe a trajectory.

Extensions: impact of time.

[Figure panels: Forgot Something, Avoid, Holding Pattern, Mapping]
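The transcript does not list which geometric descriptions are used. As a hedged illustration, the sketch below computes a few simple shape descriptors for a 2-D trajectory: path length, end-to-end distance, their ratio (straightness), and total absolute turning angle. These particular features and the function name `geometric_features` are assumptions chosen for the example; a looping flight such as a holding pattern would be expected to show low straightness and high total turning.

```python
import math

def geometric_features(points):
    """Compute simple shape descriptors for a 2-D trajectory [(x, y), ...]."""
    steps = list(zip(points, points[1:]))
    path_length = sum(math.dist(p, q) for p, q in steps)
    end_to_end = math.dist(points[0], points[-1])
    headings = [math.atan2(q[1] - p[1], q[0] - p[0]) for p, q in steps]
    total_turn = 0.0
    for h0, h1 in zip(headings, headings[1:]):
        # wrap the heading change into [-pi, pi) before accumulating
        d = (h1 - h0 + math.pi) % (2 * math.pi) - math.pi
        total_turn += abs(d)
    straightness = end_to_end / path_length if path_length > 0 else 0.0
    return {
        "path_length": path_length,
        "end_to_end": end_to_end,
        "straightness": straightness,
        "total_turn": total_turn,
    }

# A straight flight versus a closed square loop (hypothetical coordinates).
straight = geometric_features([(0, 0), (1, 0), (2, 0), (3, 0)])
loop = geometric_features([(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)])
```

The resulting dictionaries can be flattened into feature vectors, so that flights can be compared or grouped by shape rather than by raw coordinates.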

Page 26: Data Sciences at Sandia National Laboratories

26

Big Feature Space Advantage

Clustering, leading to unsupervised learning techniques

• Previous examples showed searching the space for a specific pattern or a specific volume of the feature space.
• But, with clustering, the computer can group the different patterns in the feature space without knowing a priori what they are.
• Perhaps most importantly, many clustering algorithms specifically identify outliers in the feature space that correspond to odd behaviors.

Did not specify “find this,” only told routine to “make groups of similar flights.”

This was one of many clusters that had distinctive shapes.

[Figure: Forgot Something, Revisited]

Page 27: Data Sciences at Sandia National Laboratories

27

Discovery of Odd flights

Clustering done based on geometric features

Many clusters found, but what remains is…

Represents approximately 700 out of a total of 50,000 flights from one day

Note: we have ~5M points/day, ~1GB/day, currently >300GB
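The slides do not name the clustering algorithm. DBSCAN is one well-known density-based method that explicitly labels points belonging to no dense group as noise, so it serves here as a stand-in to show how "what remains" after clustering can be read off as outliers. The minimal pure-Python version below, along with the `eps` and `min_pts` values and the sample feature vectors, is an assumption for illustration only.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: one cluster label per point, NOISE for outliers."""
    def region(i):
        # indices of all points within eps of point i (including i itself)
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region(i)
        if len(neighbors) < min_pts:
            labels[i] = NOISE          # may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)   # j is a core point: keep expanding
    return labels

# Hypothetical 2-D feature vectors: two tight groups and one isolated point.
features = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
    (10.0, 0.0),
]
labels = dbscan(features, eps=0.5, min_pts=3)
outliers = [i for i, lab in enumerate(labels) if lab == NOISE]
```

For scale, the roughly 700 leftover flights out of 50,000 quoted on the slide correspond to about 1.4% of one day's flights.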

Page 28: Data Sciences at Sandia National Laboratories

28

Better knowledge & deeper insights from Big Data

in minutes, not months; over months, not hours or minutes; covering hundreds, not 10s of km²

Page 29: Data Sciences at Sandia National Laboratories

29

Technical Areas for Collaboration

• Graph Analytics
• Tensor Analysis
• Computational Geometry
• Digital Signal and Image Processing
• Machine Learning
• Scalable and Highly Distributed Architectures
• Human Performance and Cognitive Understanding
• Machine Feature Identification in Imagery (EO, SAR, LIDAR)
• Uncertainty Quantification and Propagation – from Sensor to Answer
• Robust Data Analysis
• Intelligent Data Collection

Page 30: Data Sciences at Sandia National Laboratories

30

University Collaborations

PhD Interns: Colorado State University, Utah State University
• Faculty have clearances and clear expertise in critical areas
• Students can gain clearances
• Year-round appointments give flexibility for Sandia work, university requirements
• Win-win-win: Sandia gains valuable technical support in critical challenge areas, student gets a degree/publications/potential future employment, university and professor fulfill mission and gain a strong supporter.

Critical Skills Master’s Program (CSMP): University of Illinois
• Similar to old One-Year-on-Campus-Program

NDAs: USU, CSU, UIUC (in progress)