Letters From the Front - Special Libraries Association

Slide 1 Letters From the Front Lori Emadi Head, Taxonomy & Metadata June 14, 2015


Slide 1

Letters From the Front

Lori Emadi

Head, Taxonomy & Metadata

June 14, 2015

Slide 2

• Background

• About the Army Data Collection

• The Business Case

• Building Hermes

• Accessibility through Metadata

• Moving forward…

Slide 3

Background

• RAND acquired hundreds of thousands of classified and unclassified documents from U.S. Army units returning from Iraq and Afghanistan

• Files were copied as is from hundreds of hard drives; documents vary in content, naming, format, and structure (or lack thereof)

• Available tools were not suitable for working with such a large and diverse corpus of documents

• We lacked a good way to search and make use of these data

• The “Letters From The Front” project was approved as an FY 2013 R&D effort

• My project team developed a document indexing, search, and visualization capability called Hermes from an open source tool set that is scalable and extensible

Slide 4

• Background

• About the Army Data Collection

• The Business Case

• Building Hermes

• Accessibility through Metadata

• Moving forward…

Slide 5

What We Learned About This Very Large Data Collection

Units and Agencies (of 195, 29 submitted data)

23 Army commands, units, and Army support activities

4 DOD, Joint, and Other Government activities

2 Military Academies

Assistant Secretary of the Army, Acquisition, Logistics, and Technology - Center for Army Lessons Learned - Center of Military History - Stryker Center for Lessons Learned - United States Army Combined Arms Support Command and Fort Lee - US Army Corps of Engineers - US Army G-8 - US Army Communications Life Cycle Management Command - 101st Air Assault - 10th Mountain Division (Light) and Fort Drum - 10th Special Forces Group - 16th Engineer Brigade - 1st Cavalry Division - United States Army Europe and Seventh Army - 1st Infantry Division - 3d Infantry Division (Mechanized) and Fort Stewart - 3rd Armored Cavalry Regiment - 42nd Infantry Division - I CORPS - III ARMY - III CORPS and FT Hood - United States Army Special Forces Command (Airborne) - 75th Exploitation Task Force - Multi-National Corps-Iraq - Multi-National Force-Iraq - Multi-National Security Transition Command-Iraq/Commander, NATO Training Mission-Iraq - Office of Security Cooperation-Afghanistan - United States Army Military Police School - United States Military Academy

Diverse content and file types

SITREPS - FRAGOS - SIGACTS - INTSUMS - SPOT reports - AAR - WARNO - BDA - BDR - CONPLAN - Order of Battle - OPLAN - OPORD - Deployment orders - Daily Personnel Status - Military vehicle status - Alert Roster - Service Support Order - Mission Analysis Briefs - Decision Briefs - Mission Concept Briefs - Backbriefs - Balcony Briefs - Debriefs - …

Emails & Email collections (~63,000 files), PowerPoint (~171,000), PDF (~64,000), Excel (~84,000), Word and other text (~400,000), images and video (~300,000), …

Slide 6

Some interestingly named data folders …

SIRs & IRs\1ID\BEFORE WE GOT ORGANIZED\

1ID_G3\G3 Operations\CHOPS\G3 OPS – Battle Captains\STUFF THAT BOB JUST SAVED\

1ID_G3\G3 Operations\FRAGOSs G3 OPS\RFI Section\SSG xxxxxxxx\DA BUCKSTER FOLDER\MILITARY RELATED CRAP\WORK RELATED CRAP\

1stCAV\SJA\EXSUMS\Dan’s Super Duper FRAGO Folder\

1stCAV\G3\EOD LNO\Im Thuper Thanks Fer Athkin\

1stCAV\SJA\EXSUMS\DEAR GOD, I HOPE WE DON’T NEED THESE MONTHS\APRIL 2005\...

MNSTCI\NIPR\MNCI FOLDERS\NIPS\C2\C2_SECURITY\I.Think.This.Is.The.Template.That.You.Need.To.Use.For.Submitting.Anything.On.Me.But.I.Could.Be.Wrong.About.That.So.I.Will.Ask.About.It.Tomorrow\

Slide 7

• Background

• About the Army Data Collection

• The Business Case

• Building Hermes

• Accessibility through Metadata

• Moving forward…

Slide 8

Search Scenarios

• “What survey data exists on local (Iraqi, Afghan) population attitudes, beliefs, and information consumption patterns?”

• “[for x operation] We need to find out which brigades were deployed, when they were deployed, and who their brigade commanders were.”

• “There are a number of interesting distinctions that may be tractable, e.g., variations in communications of different entities and whether/how they coordinate…”

• “[for intelligence gathering] It takes multiple data points to create a profile. The data should be stored separately and combined to make the profile. Then you can run it across everything and look for relationships with any of the data points.”

Slide 9

Reconstructing Lost Events

• Not a scenario: Missing records on the 81st BCT (Washington State ARNG) and 82nd AB Division and their operations in Afghanistan and Iraq. (Seattle Times article, July 13, 2013)

Slide 10

• Background

• About the Army Data Collection

• The Business Case

• Building Hermes

• Accessibility through Metadata

• Moving forward…

Slide 11

There Are Many Available Tools For Dealing With Masses of Textual Data…

Slide 12

The Solr Suite Provides Needed Speed, Scalability and Extensibility, and Is Open Source…

• “Solr™ is the popular, blazing fast open source enterprise search platform from the Apache Lucene™ project. Its major features include powerful full-text search, hit highlighting, faceted search, near real-time indexing, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly reliable, scalable and fault tolerant, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration and more. Solr powers the search and navigation features of many of the world's largest internet sites.”

• (“Apache Solr,” at http://lucene.apache.org/Solr/)

…And Solr Is An Emerging Standard for Searching Large Text Databases…
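To make the faceted search and hit highlighting features quoted above more concrete, here is a minimal Python sketch that builds a request URL for Solr's standard select handler. The host, the core name "hermes", and the field names "text", "file_type", and "source_unit" are assumptions made for illustration; the slides do not describe how Hermes is actually configured.

```python
from urllib.parse import urlencode

def build_select_url(base_url, query, facet_fields=(), rows=10):
    """Build a Solr /select URL with optional faceting and highlighting."""
    params = [
        ("q", query),         # main query in Lucene/Solr query syntax
        ("rows", str(rows)),  # number of results to return
        ("wt", "json"),       # ask for a JSON response
        ("hl", "true"),       # turn on hit highlighting
    ]
    if facet_fields:
        params.append(("facet", "true"))
        # One facet.field parameter per field to facet on
        params.extend(("facet.field", field) for field in facet_fields)
    return f"{base_url}/select?{urlencode(params)}"

# Hypothetical core and field names, for illustration only
url = build_select_url(
    "http://localhost:8983/solr/hermes",
    'text:"brigade commander"',
    facet_fields=["file_type", "source_unit"],
)
print(url)
```

The sketch only constructs the URL. Sending it requires a running Solr instance whose schema defines the fields used.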

Slide 13

Application Framework

File Preparation and Parsing

(Folder navigation, file type ID, OCR, content/metadata extraction, language detection)

Search Human Interface

Modular and flexible by design, this architecture can be customized for other RAND efforts
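As a rough illustration of the folder navigation and file type ID steps named above, the following Python sketch walks a directory tree and tallies files by broad category. It is not the Hermes implementation: the extension map is hypothetical, and a production pipeline would more likely identify types from file contents than from extensions.

```python
import os
from collections import Counter
from pathlib import Path

# Hypothetical extension-to-category map, chosen for illustration only
CATEGORY_BY_EXT = {
    ".msg": "email", ".eml": "email", ".pst": "email",
    ".ppt": "powerpoint", ".pptx": "powerpoint",
    ".pdf": "pdf",
    ".xls": "excel", ".xlsx": "excel",
    ".doc": "word/text", ".docx": "word/text", ".txt": "word/text",
    ".jpg": "image/video", ".png": "image/video", ".avi": "image/video",
}

def categorize(file_name):
    """Map a file name to a broad category using its extension."""
    return CATEGORY_BY_EXT.get(Path(file_name).suffix.lower(), "other")

def tally_file_types(root):
    """Walk every folder under root and count files per category."""
    counts = Counter()
    for _dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            counts[categorize(name)] += 1
    return counts
```

Calling tally_file_types on the top-level folder of a copied hard drive would return per-category counts, similar in spirit to the file type totals reported on Slide 5.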

Slide 14

• Background

• About the Army Data Collection

• The Business Case

• Building Hermes

• Accessibility through Metadata

• Moving forward…

Slide 15

…Example search results in a similar out-of-box setting…

Columbia University Library Catalog (CLIO)

Slide 16

Metadata fields

-- customizable

-- implemented during processing
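One way metadata fields could be populated during processing is from the folder path itself. The Python sketch below is an assumption-laden illustration, not the Hermes schema: the field names "source_unit", "folder_path", "file_name", and "extension" are invented, and treating the top-level folder as the submitting unit is only a guess based on the folder names shown on Slide 6.

```python
from pathlib import PureWindowsPath

def path_metadata(relative_path):
    """Derive simple metadata fields from a Windows-style relative path."""
    path = PureWindowsPath(relative_path)
    parts = path.parts
    return {
        # Assumption: the top-level folder names the submitting unit
        "source_unit": parts[0] if len(parts) > 1 else "",
        "folder_path": "\\".join(parts[:-1]),
        "file_name": path.name,
        "extension": path.suffix.lower(),
    }

# The file name "summary.DOC" is hypothetical
record = path_metadata(r"1stCAV\SJA\EXSUMS\summary.DOC")
print(record)
```

Because the record is a plain dictionary, fields can be added or renamed without touching the rest of the pipeline, which fits the "customizable" point on the slide.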

Slide 17

Slide 18

Query Parser Syntax

Fields: field name followed by a colon ":" then term.
    title:"The Right Way" AND text:go

Wildcard Searches: single character wildcard "?"; multiple character wildcard "*"

Regular Expression Searches:
    /[mb]oat/

Fuzzy Searches: use the tilde, "~"
    roam~
    roam~1

Proximity Searches: use the tilde, "~"
    "jakarta apache"~10

Range Searches: use range queries with date and non-date fields. Inclusive range queries are denoted by square brackets. Exclusive range queries are denoted by curly brackets.
    mod_date:[20020101 TO 20030101]
    title:{Aida TO Carmen}

Boosting a Term: use the caret, "^", with a boost factor at the end of the term.
    jakarta^4 apache
    "jakarta apache"^4 "Apache Lucene"

Boolean Operators: AND, "+", OR, NOT and "-"
    "jakarta apache" OR Jakarta
    "jakarta apache" AND "Apache Lucene"
    +jakarta lucene
    "jakarta apache" NOT "Apache Lucene"
    NOT "jakarta apache"
    "jakarta apache" -"Apache Lucene"

Grouping: use parentheses to group clauses to form sub queries.
    (jakarta OR apache) AND website

Field Grouping: use parentheses to group multiple clauses.
    title:(+return +"pink panther")

Escaping Special Characters: to escape, use the \ before the character. Ex: to search for (1+1):2 use the query:
    \(1\+1\)\:2
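The escaping rule above is easy to get wrong by hand, so here is a small Python sketch of it. The helper names escape_term and field_query are hypothetical. The sketch escapes "&" and "|" one character at a time, a simplification, since the query syntax treats only the doubled forms "&&" and "||" as operators.

```python
# Characters with special meaning in the Lucene/Solr query parser syntax
SPECIAL_CHARS = set('+-&|!(){}[]^"~*?:\\/')

def escape_term(text):
    """Put a backslash before every special character in a search term."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in text)

def field_query(field, value):
    """Build a field:term clause, quoting the value if it contains spaces."""
    if " " in value:
        # Inside a quoted phrase only backslashes and quotes need escaping
        quoted = value.replace("\\", "\\\\").replace('"', '\\"')
        return f'{field}:"{quoted}"'
    return f"{field}:{escape_term(value)}"

print(escape_term("(1+1):2"))
print(field_query("title", "The Right Way") + " AND " + field_query("text", "go"))
```

The first call reproduces the escaped form shown on the slide, and the second rebuilds the Fields example from user-supplied values.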

Slide 19

• Background

• About the Army Data Collection

• The Business Case

• Building Hermes

• Accessibility through Metadata

• Moving forward…

Slide 20

Next Steps

• Security around collection, user authentication

• Natural Language Processing (NLP)

• Extracting content attachments in emails (while keeping the attachments in place)

• Additional visualization options

• Enhanced logging and tracking

• Ability for users to rank content, add or edit metadata content

• Enhance user interface

Slide 21

Thank You! Any Questions?
