letters from the front - special libraries association · slide 1 letters from the front lori emadi...
TRANSCRIPT
Slide 2
• Background
• About the Army Data Collection
• The Business Case
• Building Hermes
• Accessibility through Metadata
• Moving forward…
Slide 3
Background
• RAND acquired hundreds of thousands of classified and unclassified documents from U.S.
Army units returning from Iraq and Afghanistan
• Files were copied as-is from hundreds of hard drives – documents vary in content,
naming, format, and structure (or lack thereof)
• Available tools were not suitable for working with such a large and diverse corpus of
documents
• We lacked a good way to search and make use of these data
• “Letters From The Front” project was approved as an FY 2013 R&D effort
• My project team developed a document indexing, search, and visualization capability
called Hermes from an open source tool set that is scalable and extensible
Slide 4
• Background
• About the Army Data Collection
• The Business Case
• Building Hermes
• Accessibility through Metadata
• Moving forward…
Slide 5
What We Learned About This Very Large
Data Collection
Units and Agencies (of 195, 29 submitted data)
23 Army commands, units, and Army support activities
4 DOD, Joint, and Other Government activities
2 Military Academies
• Assistant Secretary of the Army, Acquisition, Logistics, and Technology
• Center for Army Lessons Learned
• Center of Military History
• Stryker Center for Lessons Learned
• United States Army Combined Arms Support Command and Fort Lee
• US Army Corps of Engineers
• US Army G-8
• US Army Communications Life Cycle Management Command
• 101st Air Assault
• 10th Mountain Division (Light) and Fort Drum
• 10th Special Forces Group
• 16th Engineer Brigade
• 1st Cavalry Division
• United States Army Europe and Seventh Army
• 1st Infantry Division
• 3d Infantry Division (Mechanized) and Fort Stewart
• 3rd Armored Cavalry Regiment
• 42nd Infantry Division
• I CORPS
• III ARMY
• III CORPS and FT Hood
• United States Army Special Forces Command (Airborne)
• 75th Exploitation Task Force
• Multi-National Corps-Iraq
• Multi-National Force-Iraq
• Multi-National Security Transition Command-Iraq/Commander, NATO Training Mission-Iraq
• Office of Security Cooperation-Afghanistan
• United States Army Military Police School
• United States Military Academy
Diverse content and file types
SITREPS - FRAGOS - SIGACTS - INTSUMS - SPOT reports - AAR - WARNO - BDA - BDR - CONPLAN - Order of Battle - OPLAN - OPORD - Deployment orders - Daily Personnel Status - Military vehicle status - Alert Roster - Service Support Order - Mission Analysis Briefs - Decision Briefs - Mission Concept Briefs - Backbriefs - Balcony Briefs - Debriefs - …
Emails & Email collections (~63,000 files), PowerPoint (~171,000), PDF (~64,000), Excel (~84,000), Word and other text (~400,000), images and video (~300,000), …
Slide 6
Some interestingly named data folders …
SIRs & IRs\1ID\BEFORE WE GOT ORGANIZED\
1ID_G3\G3 Operations\CHOPS\G3 OPS – Battle Captains\STUFF THAT BOB JUST SAVED\
1ID_G3\G3 Operations\FRAGOSs G3 OPS\RFI Section\SSG xxxxxxxx\DA BUCKSTER FOLDER\MILITARY RELATED CRAP\WORK RELATED CRAP\
1stCAV\SJA\EXSUMS\Dan’s Super Duper FRAGO Folder\
1stCAV\G3\EOD LNO\Im Thuper Thanks Fer Athkin\
1stCAV\SJA\EXSUMS\DEAR GOD, I HOPE WE DON’T NEED THESE MONTHS\APRIL 2005\...
MNSTCI\NIPR\MNCI FOLDERS\NIPS\C2\C2_SECURITY\I.Think.This.Is.The.Template.That.You.Need.To.Use.For.Submitting.Anything.On.Me.But.I.Could.Be.Wrong.About.That.So.I.Will.Ask.About.It.Tomorrow\
Slide 7
• Background
• About the Army Data Collection
• The Business Case
• Building Hermes
• Accessibility through Metadata
• Moving forward…
Slide 8
Search Scenarios
• “What survey data exists on local (Iraqi, Afghan) population attitudes, beliefs, and information consumption patterns?”
• “[for x operation] We need to find out which brigades were deployed, when they were deployed, and who their brigade commanders were.”
• “There are a number of interesting distinctions that may be tractable, e.g., variations in communications of different entities and whether/how they coordinate…”
• “[for intelligence gathering] It takes multiple data points to create a profile. The data should be stored separately and combined to make the profile. Then you can run it across everything and look for relationships with any of the data points.”
Slide 9
Reconstructing Lost Events
• Not a scenario: Missing records on the 81st
BCT (Washington State ARNG) and 82nd AB
Division and their operations in Afghanistan
and Iraq. (Seattle Times article, July 13,
2013)
Slide 10
• Background
• About the Army Data Collection
• The Business Case
• Building Hermes
• Accessibility through Metadata
• Moving forward…
Slide 12
The Solr Suite Provides Needed Speed, Scalability and
Extensibility, and Is Open Source…
• “Solr™ is the popular, blazing fast open source
enterprise search platform from the Apache
Lucene™ project. Its major features include
powerful full-text search, hit highlighting, faceted
search, near real-time indexing, dynamic clustering,
database integration, rich document (e.g., Word,
PDF) handling, and geospatial search. Solr is highly
reliable, scalable and fault tolerant, providing
distributed indexing, replication and load-balanced
querying, automated failover and recovery,
centralized configuration and more. Solr powers the
search and navigation features of many of the
world's largest internet sites.”
• (“Apache Solr,” at http://lucene.apache.org/Solr/)
…And Solr Is An Emerging Standard
for Searching Large Text
Databases…
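As a rough illustration of the faceted search and hit highlighting features quoted above, the sketch below builds a request URL for Solr's standard select handler. The host, the core name "hermes", and the field names "text" and "file_type" are assumptions made for this example; the slides do not describe the actual Hermes schema or deployment. The code only constructs the URL and does not contact a server.

```python
from urllib.parse import urlencode


def build_select_url(base_url, core, query, facet_field, highlight_field, rows=10):
    """Build a Solr select URL with faceting and highlighting turned on.

    base_url, core, and the field names are hypothetical placeholders.
    """
    params = [
        ("q", query),                # main query in Lucene/Solr syntax
        ("rows", rows),              # number of hits to return
        ("wt", "json"),              # ask for a JSON response
        ("facet", "true"),           # enable faceted counts
        ("facet.field", facet_field),
        ("hl", "true"),              # enable hit highlighting
        ("hl.fl", highlight_field),  # field to pull highlighted snippets from
    ]
    return f"{base_url.rstrip('/')}/{core}/select?{urlencode(params)}"


url = build_select_url(
    "http://localhost:8983/solr",  # assumed local Solr instance
    "hermes",                      # assumed core name
    'text:"spot report"',
    facet_field="file_type",
    highlight_field="text",
)
print(url)
```

Faceting on a field such as file type would let a searcher narrow a large result set by format, which seems well suited to a corpus this mixed.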
Slide 13
Application Framework
File Preparation and Parsing
(Folder navigation, file type ID, OCR, content/metadata extraction,
language detection)
Search Human Interface
Modular and flexible by design, this architecture can be customized for other RAND efforts
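The first two steps of the file preparation stage (folder navigation and file type identification) can be sketched as follows. This is a minimal illustration using only the Python standard library, not the Hermes implementation: it guesses types from file extensions, which is an assumption made for brevity, whereas a collection copied as-is from hard drives would likely need content-based detection. OCR, content/metadata extraction, and language detection are not covered here.

```python
import mimetypes
import os
from collections import Counter


def inventory_file_types(root):
    """Walk a folder tree and count files by guessed MIME type.

    The type is guessed from the file extension only; files with no
    recognizable extension are counted under "unknown".
    """
    counts = Counter()
    for _dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            mime, _encoding = mimetypes.guess_type(name)
            counts[mime or "unknown"] += 1
    return counts
```

Calling `inventory_file_types` on the root of a copied drive would give a first overview of the formats present, similar in spirit to the file type counts reported for the collection.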
Slide 14
• Background
• About the Army Data Collection
• The Business Case
• Building Hermes
• Accessibility through Metadata
• Moving forward…
Slide 15
…Example search results in a similar out-of-box setting…
Columbia University Library Catalog (CLIO)
Slide 18
Query Parser Syntax
Fields: field name followed by a colon ":" then the term.
    title:"The Right Way" AND text:go
Wildcard Searches: single character wildcard "?"; multiple character wildcard "*"
Regular Expression Searches:
    /[mb]oat/
Fuzzy Searches: use the tilde, "~".
    roam~
    roam~1
Proximity Searches: use the tilde, "~".
    "jakarta apache"~10
Range Searches: use range queries with date and non-date fields. Inclusive range queries are denoted by square brackets. Exclusive range queries are denoted by curly brackets.
    mod_date:[20020101 TO 20030101]
    title:{Aida TO Carmen}
Boosting a Term: use the caret, "^", with a boost factor at the end of the term.
    jakarta^4 apache
    "jakarta apache"^4 "Apache Lucene"
Boolean Operators: AND, "+", OR, NOT and "-"
    "jakarta apache" OR jakarta
    "jakarta apache" AND "Apache Lucene"
    +jakarta lucene
    "jakarta apache" NOT "Apache Lucene"
    NOT "jakarta apache"
    "jakarta apache" -"Apache Lucene"
Grouping: use parentheses to group clauses to form sub queries.
    (jakarta OR apache) AND website
Field Grouping: use parentheses to group multiple clauses.
    title:(+return +"pink panther")
Escaping Special Characters: to escape, use the \ before the character. Ex: to search for (1+1):2 use the query:
    \(1\+1\)\:2
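The escaping rule can be applied programmatically before user-supplied text is placed into a query. The sketch below is an illustration, not part of Hermes: the function name is hypothetical, and the character set is an assumption based on the special characters listed in the Lucene query parser documentation, escaping each one individually.

```python
import re

# Characters treated as query syntax (assumed set, escaped one at a time):
# + - & | ! ( ) { } [ ] ^ " ~ * ? : \ /
_SPECIAL_CHARS = re.compile(r'([+\-&|!(){}\[\]^"~*?:\\/])')


def escape_query_term(text):
    """Prefix every special character in text with a backslash."""
    return _SPECIAL_CHARS.sub(r"\\\1", text)


print(escape_query_term("(1+1):2"))
```

Escaping matters for this collection in particular, since folder and file names containing characters such as parentheses, colons, or slashes would otherwise be read as query operators.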
Slide 19
• Background
• About the Army Data Collection
• The Business Case
• Building Hermes
• Accessibility through Metadata
• Moving forward…
Slide 20
Next Steps
• Security around collection, user authentication
• Natural Language Processing (NLP)
• Extracting content from email attachments (while keeping the attachments in place)
• Additional visualization options
• Enhanced logging and tracking
• Ability for users to rank content, add or edit metadata content
• Enhance user interface