Toward a Scalable Knowledge Space on the Cloud: Initial Integration and Evaluation

HPEC10-1 DMS 06/24/22 MIT Lincoln Laboratory Toward a Scalable Knowledge Space on the Cloud: Initial Integration and Evaluation Delsey Sherrill, Jonathan Kurz, Craig McNally, and Will Smith {dsherrill, jonkurz, cmcnally, william.smith}@ll.mit.edu High Performance Embedded Computing Workshop 15-16 September 2010


Page 1: Toward a Scalable Knowledge Space on the Cloud: Initial Integration and Evaluation

HPEC10-1DMS 04/22/23

MIT Lincoln Laboratory

Toward a Scalable Knowledge Space on the Cloud:

Initial Integration and Evaluation

Delsey Sherrill, Jonathan Kurz, Craig McNally, and Will Smith {dsherrill, jonkurz, cmcnally, william.smith}@ll.mit.edu

High Performance Embedded Computing Workshop

15-16 September 2010

Page 2: Toward a Scalable Knowledge Space on the Cloud: Initial Integration and Evaluation


Attempted Terrorist Attack 12/25/09

• August / US Intel: “meeting to plan operation”, “Nigerian”, Al Qaida of the Arabian Peninsula / Yemen

• 11 November / UK cable to US: “pledge to jihad”, “Umar Farouk”, Anwar al-Awlaki

• 19 November / CIA: UFA’s father: “son in Yemen”, “extreme religious views” (U.S. Embassy, Nigeria)

• 25 December / DHS: cash ticket, no luggage checked; NWA flight 253, Amsterdam to Detroit

• Umar Farouk Abdulmutallab

Key breakdowns:
– Dissemination and access
– Name ambiguity
– Structured/unstructured data correlation

Page 3: Toward a Scalable Knowledge Space on the Cloud: Initial Integration and Evaluation


Challenges

• Dissemination and access
– “Silos of excellence”
– Coarse-grained classification (default to “system high”)
– Varying levels of clearance among DoD, IC, Coalition partners

• Name ambiguity
– Aliases, common names
– Spelling variation (foreign names, typos)
– Partial name references
– Lack of structured data context

• Structured / unstructured data correlation
– Data volumes overwhelm capacity for human review
» Structured: 10^2 passengers x 10^4 daily flights into US = 10^6 reservations / day
» Unstructured: 10^4 new reports per day; years of archives
– Variations in dates, times, locations, etc. expressed in free text

Page 4: Toward a Scalable Knowledge Space on the Cloud: Initial Integration and Evaluation


Outline

• Introduction

• Structured Knowledge Space (SKS) Overview

• SKS-on-Cloud Integration

• SKS-on-Cloud Benchmarking

• Future Work & Summary

Page 5: Toward a Scalable Knowledge Space on the Cloud: Initial Integration and Evaluation


Structured Knowledge Space (SKS)

SKS can address key intelligence challenges by enriching unstructured documents and supporting discovery over the network to users at multiple classification levels

[Diagram components: Document Collection, Document Ingest, Indexed Text & Extracted Entities, Secure Data Store, Search Engine, Web-based Search, Real-time Alerts]

Example target document (.txt):

Target Folder 1A

Target is moving from Leander 15RWQ1545 back into JOA Bear on the evening of 17 April 2006 will stop @ OBJ1A between 2000 and 2200 hours.

Significance: AQI Leader
Sources of INTEL: OGI

Description
Extremist Operatives: Rabah Muhtadi, Alrhu Oldegi, Umar Nawaf

Challenges and the SKS capabilities that address them:

• Dissemination and sharing: secure multilevel access, web search

• Name ambiguity: named entity recognition, query expansion

• Structured/unstructured data correlation: geo/time extraction, alerting

Page 6: Toward a Scalable Knowledge Space on the Cloud: Initial Integration and Evaluation


Keyword searches are limited to exact or near matches, precluding fundamental document discovery use cases

Target document from the document collection (.txt):

Target Folder 1A

Target is moving from Leander 15RWQ1545 back into JOA Bear on the evening of 17 April 2006 will stop @ OBJ1A between 2000 and 2200 hours.

Significance: AQI Leader
Sources of INTEL: OGI

Description
Extremist Operatives: Rabah Muhtadi, Alrhu Oldegi, Umar Nawaf

Document Discovery Use Cases (Indexed Text)

• Search for “AQI Leader”

• Search at “15RWQ1545”

• Search on “17 April 2006”

• Search for “Umar Nawaf”

Avoiding Keyword Search Pitfalls (Indexed Text & Extracted Entities)

• Find people associated with AQI in Apr 2006 near 15RVQ9050 (PEOPLE, RELATIONSHIPS)

• Search within 30km of 15RVQ9050 (GEOSPATIAL COORDINATE)

• Search for “Al Qaeda in Iraq” (ORGANIZATION)

• Search between 4/12/06 – 4/18/06 (DATE)

Entity extraction enables geospatial, temporal, and entity category searches for documents
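
As a rough illustration of why extracted entities help with the date search above, the Python sketch below pulls a grid reference and a date out of the example target document with regular expressions and then answers a date-range query that a literal keyword match on "4/12/06" would miss. The patterns, function names, and dictionary layout are assumptions made for this example and are not the SKS extraction pipeline; a distance query such as "within 30km" would also need a grid-to-coordinate conversion, which is not shown.

```python
import re
from datetime import date

# Text of the example target document from the slide.
TEXT = ("Target is moving from Leander 15RWQ1545 back into JOA Bear on the "
        "evening of 17 April 2006 will stop @ OBJ1A between 2000 and 2200 hours.")

MONTHS = {name: number for number, name in enumerate(
    ["January", "February", "March", "April", "May", "June", "July",
     "August", "September", "October", "November", "December"], start=1)}

# Assumed pattern for an MGRS-style grid reference: zone number, band letter,
# two-letter 100 km square, then an even number of digits (e.g. 15RWQ1545).
MGRS_RE = re.compile(r"\b\d{1,2}[C-HJ-NP-X][A-HJ-NP-Z]{2}(?:\d\d){1,5}\b")
# Assumed pattern for dates written as "17 April 2006".
DATE_RE = re.compile(r"\b(\d{1,2}) (" + "|".join(MONTHS) + r") (\d{4})\b")


def extract_entities(text):
    """Return grid references and dates found in free text."""
    coordinates = MGRS_RE.findall(text)
    dates = [date(int(year), MONTHS[month], int(day))
             for day, month, year in DATE_RE.findall(text)]
    return {"coordinates": coordinates, "dates": dates}


def mentions_date_between(entities, start, end):
    """True if any extracted date falls inside the closed range [start, end]."""
    return any(start <= d <= end for d in entities["dates"])


entities = extract_entities(TEXT)
hit = mentions_date_between(entities, date(2006, 4, 12), date(2006, 4, 18))
```

Because the date is stored as a typed value rather than a string, the range query succeeds regardless of how the date was written in the source text.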

Page 7: Toward a Scalable Knowledge Space on the Cloud: Initial Integration and Evaluation


Web-Based Search Capabilities

Page 8: Toward a Scalable Knowledge Space on the Cloud: Initial Integration and Evaluation


Web-Based Search Capabilities

Query by keyword, phrase, fuzzy match, wildcard, geo, date, source, format, and Arabic name variant

“Facets” reveal the top 20 people, organizations, etc. within documents matching search

Search hits sorted by relevance with highlighted snippets, attributes, and download links
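
The fuzzy-match query type listed above can be sketched with Python's standard `difflib` module. This is only a minimal illustration of tolerant name matching, assuming a simple character-similarity score and a hypothetical 0.8 threshold; the spelling variant "Omar Nawaf" is invented for the example, and nothing here reflects how SKS handles fuzzy or Arabic name variant queries internally.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lower-case, treat hyphens as spaces, and collapse whitespace."""
    return " ".join(name.lower().replace("-", " ").split())


def similarity(a, b):
    """Character-level similarity in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def fuzzy_match(query, candidates, threshold=0.8):
    """Return candidates scoring at or above the threshold, best match first."""
    scored = sorted(((similarity(query, c), c) for c in candidates), reverse=True)
    return [candidate for score, candidate in scored if score >= threshold]


# "Omar Nawaf" is a made-up spelling variant used only to exercise the matcher.
names = ["Rabah Muhtadi", "Omar Nawaf", "Umar Nawaf"]
matches = fuzzy_match("Umar Nawaf", names)
```

A character-based score like this catches typos and close transliterations but not aliases, which is one reason the deck pairs fuzzy matching with entity extraction and query expansion.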

Page 9: Toward a Scalable Knowledge Space on the Cloud: Initial Integration and Evaluation


Outline

• Introduction

• Structured Knowledge Space Overview

• SKS-on-Cloud Integration

• SKS-on-Cloud Benchmarking

• Future Work & Summary

Page 10: Toward a Scalable Knowledge Space on the Cloud: Initial Integration and Evaluation


To Cloud or Not to Cloud?

Traditional

• Scale up: costly high-end HW, proprietary RDBMS*

• Centralized (move data to computation nodes)

• Relational store: defined in advance, natural data representation

• Low-level data integrity guaranteed by database

• Structured Query Language (SQL): cross-platform

• Well-established technology, large pool of expertise

Cloud

• Scale out: commodity hardware, FOSS**/GOTS

• Decentralized (move computation to data nodes)

• Key-value store: free-form, add columns on the fly, app-dependent model

• Data integrity left to application logic

• Non-standard APIs: every cloud for itself

• Still novel technology; specialized skill set

* Relational Database Management System ** Free Open Source Software
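
The schema contrast in the comparison above can be made concrete with a small Python sketch. It assumes an in-memory SQLite table as a stand-in for a relational store and a nested dictionary as a stand-in for a key-value store; the table, column, and row-key names are hypothetical, and neither stand-in represents the actual Oracle or cloud store used by SKS.

```python
import sqlite3

# Relational stand-in: the schema is declared in advance.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (doc_id TEXT PRIMARY KEY, source TEXT NOT NULL)")
conn.execute("INSERT INTO documents VALUES (?, ?)", ("doc-1", "OGI"))


def relational_rejects_unknown_column():
    """Try to write a column that was never declared; the database refuses."""
    try:
        conn.execute(
            "INSERT INTO documents (doc_id, source, person) VALUES (?, ?, ?)",
            ("doc-2", "OGI", "Umar Nawaf"),
        )
    except sqlite3.OperationalError:
        return True
    return False


# Key-value stand-in: row key -> {column name -> value}; columns appear on demand.
store = {}


def put(row_key, column, value):
    """Write one cell, creating the row and the column if they do not exist."""
    store.setdefault(row_key, {})[column] = value


put("doc-1", "meta:source", "OGI")
put("doc-1", "entity:person", "Umar Nawaf")  # new column, no schema change needed
rejected = relational_rejects_unknown_column()
```

The flexibility comes at the cost named in the comparison: nothing in the key-value stand-in checks that a row has the columns an application expects, so that integrity logic has to live in the application.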

Page 11: Toward a Scalable Knowledge Space on the Cloud: Initial Integration and Evaluation


Integration Plan

[Diagram components: Documents, Parsers & Processors, Secure Relational Store, Search Engine, Services & Interfaces, Users; added alongside: Distributed Search Engine, Secure Cloud Store]

Side-car approach mitigates risk of exploring new technologies; proven critical path remains intact

Page 12: Toward a Scalable Knowledge Space on the Cloud: Initial Integration and Evaluation


Search Components: “SKS Classic”

[Diagram components: Analysts, Web Search Interface, example query “Mullah Omar”, Lucene 2.4 MultiSearcher, Local Indexes, Oracle 10g DDM*, Local File System, Facet Computation, Facet Retrieval, Facet Results, Metadata Retrieval, Text Content Retrieval, Results Formatter, Search Results]

PL-3 Accredited System

*Dimensional Data Model

Page 13: Toward a Scalable Knowledge Space on the Cloud: Initial Integration and Evaluation


Search Components: SKS-on-Cloud

[Diagram components: Analysts, Web Search Interface, example query “Mullah Omar”, Solr RESTful Search API, Solr Nodes with Lucene Indexes, “Bigtable-like” Store, Facet Computation, Facet Retrieval, Facet Results, Metadata & Text Content Retrieval, Results Formatter, Search Results]

PL-3 Accredited System (in process)
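
To give a feel for what a request to the Solr RESTful Search API might look like, the Python sketch below builds a query URL for the example phrase with faceting turned on. The parameters `q`, `rows`, `wt`, `facet`, `facet.field`, and `facet.limit` are standard Solr request parameters; the base URL and the facet field names `person` and `organization` are assumptions for illustration, since the deck does not give the SKS schema or deployment details.

```python
from urllib.parse import urlencode


def build_solr_query(base_url, phrase, facet_fields, rows=10, facet_limit=20):
    """Build a Solr select URL for an exact-phrase query with field facets."""
    params = [
        ("q", f'"{phrase}"'),          # quoted so Solr treats it as a phrase
        ("rows", rows),                # number of search hits to return
        ("wt", "json"),                # response format
        ("facet", "true"),             # enable facet counts
        ("facet.limit", facet_limit),  # top-N values per facet field
    ]
    # One facet.field parameter per field to be counted.
    params += [("facet.field", field) for field in facet_fields]
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical host and field names; adjust to the actual deployment.
url = build_solr_query(
    "http://localhost:8983/solr",
    "Mullah Omar",
    ["person", "organization"],
)
```

The default `facet_limit` of 20 mirrors the "top 20" facets described for the web search interface; whether SKS sets it through this parameter is not stated in the deck.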

Page 14: Toward a Scalable Knowledge Space on the Cloud: Initial Integration and Evaluation


Cloud Hardware: MIT LL Compute Clusters

[Diagram components: LAN Switch (to Lincoln LAN), Network Storage, Resource Manager, Configuration Server, Compute Nodes, Service Nodes, Cluster Switch]

Cluster                  TX-2500        TX-3D        TX-X
Classification           Unclassified   Classified   External
Compute Nodes            512            306          120
Processors               1024           612          240
Total RAM                4,056 GB       1,800 GB     960 GB
Central Storage          36.0 TB        4.3 TB       4.3 TB
Total Local Disk Space   817.6 TB       90.0 TB      40.3 TB

MIT-LL owns and operates multiple state-of-the-art computing clusters for information technology and application development research

Page 15: Toward a Scalable Knowledge Space on the Cloud: Initial Integration and Evaluation


Outline

• Introduction

• Structured Knowledge Space Overview

• SKS-on-Cloud Integration

• SKS-on-Cloud Benchmarking

• Future Work & Summary

Page 16: Toward a Scalable Knowledge Space on the Cloud: Initial Integration and Evaluation


Benchmarking Method

[Diagram components: JMeter (request bot), example query “Mullah Omar”*, Web Search Interface, Lucene MultiSearcher, Lucene Indexes, Oracle, File System, Facet Computation, Facet Retrieval, Facet Results, Metadata Retrieval, Text Content Retrieval, Results Formatter, Search Results; timestamps t0, t1, t2, t3 marked along the request path]

RLS Secure Access (Accredited)

* Repeat for 200 different keywords

Page 17: Toward a Scalable Knowledge Space on the Cloud: Initial Integration and Evaluation


SKS-Classic Benchmarking Results

All three subcomponents contribute significantly to total timing, so all are worthwhile scaling targets

[Chart: timing series (t1-t0), (t2-t1), (t3-t1), and (max(t2,t3)-t0), with an arrow labeled “Better”]
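
The four timing expressions in the chart can be computed from the recorded timestamps as in the Python sketch below. It assumes t0 through t3 are times in seconds captured during a single request, and it reads the expression max(t2,t3)-t0 as the end-to-end time when the two stages that start at t1 overlap; the deck does not spell out which component each interval measures, so the dictionary keys simply repeat the chart's expressions. The sample values are made up.

```python
from dataclasses import dataclass


@dataclass
class Timestamps:
    """Timestamps (seconds) recorded for one benchmark request."""
    t0: float
    t1: float
    t2: float
    t3: float


def breakdown(ts):
    """Compute the interval expressions shown on the results chart."""
    return {
        "t1-t0": ts.t1 - ts.t0,
        "t2-t1": ts.t2 - ts.t1,
        "t3-t1": ts.t3 - ts.t1,
        "max(t2,t3)-t0": max(ts.t2, ts.t3) - ts.t0,
    }


# Illustrative values only, not measurements from the benchmark.
sample = breakdown(Timestamps(t0=0.0, t1=0.5, t2=1.25, t3=1.0))
```

Averaging each entry over the 200 benchmark keywords would give one value per series for a given document load.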

Page 18: Toward a Scalable Knowledge Space on the Cloud: Initial Integration and Evaluation


NOTIONAL Comparison Results

[Notional chart: Search Time versus # Documents Loaded for SKS-Classic and SKS-on-Cloud, with a max acceptable search time of 5 sec, a crossover point marked “10M??”, and an arrow labeled “Better”]

Goal: sufficient samples at escalating loads to estimate crossover point (if exists) and extrapolate to billion-documents regime
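
One simple way to estimate a crossover point from benchmark samples is to fit a trend to each system and solve for where the two trends meet. The Python sketch below does this with an ordinary least-squares straight line, which is an assumption of this example (real search-time curves need not be linear), and every number in it is invented purely to exercise the code. None of it is benchmark data from SKS-Classic or SKS-on-Cloud.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


def crossover(line_a, line_b):
    """x where two fitted lines intersect, or None if parallel or not positive."""
    (slope_a, intercept_a), (slope_b, intercept_b) = line_a, line_b
    if slope_a == slope_b:
        return None
    x = (intercept_b - intercept_a) / (slope_a - slope_b)
    return x if x > 0 else None


# INVENTED sample points: documents loaded (millions) vs. search time (seconds).
docs_millions = [1, 2, 3]
classic_seconds = [1.0, 1.5, 2.0]    # made up: steeper growth, lower start
cloud_seconds = [2.0, 2.125, 2.25]   # made up: flatter growth, higher start

classic_fit = fit_line(docs_millions, classic_seconds)
cloud_fit = fit_line(docs_millions, cloud_seconds)
estimate = crossover(classic_fit, cloud_fit)  # in millions of documents
```

With only a few samples the estimate is very sensitive to noise, which is why the stated goal is to collect sufficient samples at escalating loads before extrapolating.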

Page 19: Toward a Scalable Knowledge Space on the Cloud: Initial Integration and Evaluation


Outline

• Introduction

• Structured Knowledge Space Overview

• SKS-on-Cloud Integration

• SKS-on-Cloud Benchmarking

• Future Work & Summary

Page 20: Toward a Scalable Knowledge Space on the Cloud: Initial Integration and Evaluation


What Might Have Been

• August / US Intel intercept: “meeting to plan operation”, “Nigerian”, Al Qaida of the Arabian Peninsula / Yemen

• 11 November / UK cable to US: “pledge to jihad”, “Umar Farouk”, Anwar al-Awlaki

• 19 November / CIA: UFA’s father: “son in Yemen”, “extreme religious views” (U.S. Embassy, Nigeria)

• 25 December / DHS: cash ticket, no luggage checked; NWA flight 253, Amsterdam to Detroit

• Umar Farouk Abdulmutallab

With SKS:

• Analyst searching for “Umar Farouk Abdulmutallab” finds connections to Awlaki, Nigerian, planned operation

• Father’s warnings plus other derogatory evidence enough to take preventive action (revoke visa, no-fly list)

• Correlation engine alerts authorities that person of interest has suspicious reservation and is about to board plane bound for US

Page 21: Toward a Scalable Knowledge Space on the Cloud: Initial Integration and Evaluation


Future Work

• Develop Analytics Engine to leverage cloud processing capabilities

– Correlating structured with unstructured data (e.g. Entity Track Analysis)

– Clustering of entity mentions within documents to improve name disambiguation

• Operationalize SKS-on-Cloud system

• Complete comparative search benchmarks to at least 10 million documents

• Scale to 1 billion, 10 billion, …

Page 22: Toward a Scalable Knowledge Space on the Cloud: Initial Integration and Evaluation


Summary

• MIT LL has developed the Structured Knowledge Space system to extract entities and relationships from weakly structured intelligence reporting formats

– Web services and browser-based user interfaces support discovery and access over the network

• To explore the feasibility and desirability of migrating the full SKS application suite to a cloud-based distributed storage & processing platform, we integrated cloud storage as a data storage sidecar on the existing system

• Early benchmarks indicate that existing system performs adequately up to 3M documents (< 2 sec for simple searches) but timings show an upward trend

– Too early to predict Cloud-based system performance; however theoretical benchmarks are promising

Page 23: Toward a Scalable Knowledge Space on the Cloud: Initial Integration and Evaluation


Acknowledgements

• Gary Condon

• Jason Hepp

• Jeremy Kepner

• Ben Landon

• Bob Piotti

• Chuck Yee

• The LLGrid team

• The SKS-RTRG development team

Contact: Delsey Sherrill, Jonathan Kurz, Craig McNally, and Will Smith {dsherrill, jonkurz, cmcnally, william.smith}@ll.mit.edu