National Security & Intelligence Applications of Text Analytics Patrick Lam Lead Data Scientist, Thresher Visiting Fellow, Harvard IQSS March 26, 2015

Page 1: National Security & Intelligence Applications of Text ...patricklam.org/talk/ucf_thresher.pdf · NationalSecurity&Intelligence ApplicationsofTextAnalytics PatrickLam LeadDataScientist,Thresher

National Security & Intelligence Applications of Text Analytics

Patrick Lam

Lead Data Scientist, Thresher
Visiting Fellow, Harvard IQSS

March 26, 2015

What is text analytics?

"a set of linguistic, statistical, and machine learning techniques that model and structure the information content of textual sources for business intelligence, exploratory data analysis, research, or investigation"

-Wikipedia

What are some common text analysis methods?

Natural Language Processing (NLP)
Enable computers to understand human text input.

Machine Learning
Grouping, classifying, predicting

Applications

Who Authored the Federalist Papers?
(Mosteller and Wallace, 1963)

Hamilton: 43 papers
Madison: 14 papers
Jay: 5 papers
H, M, or J?: 12 papers

Model the usage of high-frequency function words for each author.

also, and, by, of, on, there, …
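The idea of profiling authors by function-word usage can be sketched in a few lines of Python. This is only a minimal illustration, assuming a simple nearest-profile rule; it is not Mosteller and Wallace's actual statistical model. The word list is the one on the slide, and the helper names (`rates`, `attribute`) and the toy texts are hypothetical.

```python
import math
import re
from collections import Counter

# Function words listed on the slide; a real study would use many more.
FUNCTION_WORDS = ["also", "and", "by", "of", "on", "there"]

def rates(text):
    """Rate per 1,000 tokens of each function word in `text`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    total = max(len(tokens), 1)  # avoid division by zero on empty text
    return [1000 * counts[w] / total for w in FUNCTION_WORDS]

def attribute(disputed, known):
    """Return the author in `known` (author -> text) whose function-word
    profile is closest, in Euclidean distance, to the disputed text."""
    target = rates(disputed)
    return min(known, key=lambda a: math.dist(target, rates(known[a])))

# Hypothetical toy texts, not quotations from the Federalist Papers.
known = {
    "author_a": "there is also a tax on trade and there is also a duty on land",
    "author_b": "by the consent of the people and by the force of law",
}
disputed = "the power of the states is secured by the will of the people"
guess = attribute(disputed, known)  # author with the closest profile
```

Function words are used because they are frequent and largely independent of topic, so their rates reflect writing habits rather than subject matter.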

Hamilton: 43 papers
Madison: 14 papers
Jay: 5 papers
Madison: 12 papers

Authorship attribution using text analysis has a large literature and can be useful for national security & intelligence purposes.

Harvard IQSS
Institute for Quantitative Social Science

iq.harvard.edu

Four recent projects using text analysis by Harvard IQSS affiliates with relevance to national security & intelligence

1. Reverse-Engineering Censorship in China

(King, Pan, and Roberts, 2013 and 2014)

What Gets Censored In China?

Monitor ~1,400 Chinese social media sites over 6 months across 85 content areas.

Download posts the instant they appear.

Revisit each post later to check if it was censored.

Analyze with new methods of text analysis.

Experiment with writing social media posts.

Censorship program targets collective action rather than criticism of the government.

2. Jihadi Radicalization of Muslim Clerics
(Nielsen, 2014)

Why Do Some Clerics Preach Jihad While Others Do Not?

Download writings of a sample of Muslim clerics.

Use machine learning methods to score clerics on level of jihad by comparing writings to known Jihadi texts.

Analyze along with other data on clerics.

Clerics with weak educational networks and connections often use Jihadi ideology to appeal to lay audiences and further their careers.

3. Predicting Crowd Behavior
(Kallus, 2014)

Can We Predict Major Protests?

Scan over 300,000 web sources (news, blogs, forums, Twitter) in 7 languages for mentions of past, current, or future events.

Extract type of event, entities involved, and timeframe using NLP methods.

Predict on each day whether a significant protest will occur over the next three days using machine learning methods.
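The prediction target in the last step can be made concrete with a small sketch. It only shows how a "protest in the next three days" label could be built from a daily record of events; the function name `next_k_day_labels` and the toy series are hypothetical, and the features and model used in the study are not shown on the slide.

```python
def next_k_day_labels(protest_days, k=3):
    """For each day t, True if a significant protest occurs on any of
    days t+1 .. t+k. `protest_days` holds one boolean per day."""
    n = len(protest_days)
    return [any(protest_days[t + 1 : t + 1 + k]) for t in range(n)]

# Hypothetical eight-day record with one protest on the fourth day.
observed = [False, False, False, True, False, False, False, False]
labels = next_k_day_labels(observed)
```

A classifier would then be trained to predict `labels[t]` from the text features available up to day `t`.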

Massive online public discourse data has the potential to predict crowd behavior using text analysis methods.

4. Anti-Americanism in the Middle East

(Jamal, Keohane, Romney, and Tingley, 2015)

Anti-Americanism on Twitter

Download Arabic Twitter posts by keywords.

Consider discourse about US in general and in reaction to specific events.

Use text analysis methods to classify proportion of posts in specified categories.

Anti-Americanism in the Middle East is directed toward the impingement of the US on other countries rather than toward American society.

All applications of text analysis have a common starting point.

How do we define the set of texts that we want to analyze?

How do we retrieve the relevant texts from the vast set of texts available?

How do we follow the relevant conversations as they evolve?

Define. Retrieve. Follow.

In some cases, the relevant set of texts is well-defined, static, and easily accessible.
Federalist Papers, complete works of Shakespeare

In most cases, we have to constantly search and retrieve the relevant texts, often using keywords and Boolean searches.
Twitter, news articles, blogs and forums
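A keyword-based Boolean search of this kind can be sketched as follows. This is a minimal illustration, assuming posts are plain strings; the function `matches`, its parameters (`any_of`, `all_of`, `none_of`), and the sample posts are hypothetical and do not correspond to any particular search product.

```python
import re

def tokenize(text):
    """Lowercased set of words and hashtags in a post."""
    return set(re.findall(r"[#\w]+", text.lower()))

def matches(text, any_of=(), all_of=(), none_of=()):
    """Boolean keyword filter: (OR over any_of) AND (AND over all_of)
    AND NOT (OR over none_of)."""
    tokens = tokenize(text)
    if any_of and not any(k in tokens for k in any_of):
        return False
    if not all(k in tokens for k in all_of):
        return False
    return not any(k in tokens for k in none_of)

# Hypothetical posts for illustration.
posts = [
    "Explosion near the finish line in Boston",
    "Boston Red Sox game tonight",
    "Praying for Boston #prayforboston",
]
hits = [
    p for p in posts
    if matches(p, all_of=("boston",),
               any_of=("explosion", "#prayforboston"),
               none_of=("sox",))
]
```

The filter retrieves only what the keyword lists anticipate, so any relevant post phrased in words outside those lists is silently missed.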

DHS Analyst's Desktop Binder

Our analyses are only as good as our keywords!

What is our current most commonly used technology for defining the relevant keywords to retrieve our texts?

Example: Think of keywords you would use to follow the Twitter conversation around the Boston Marathon Bombings.

Keywords about the event
#bostonbombings, explosion, terrorism, attack, …

Keywords about the suspects
suspect, tsarnaev, dzhokhar, tamerlan, …

Keywords about the victims
innocent, victim, collier, …

Keywords about the reaction
tragedy, prayers, #prayforboston, …

Keywords about the politics
obama, #tcot, #benghazi, …

We ran a similar experiment with 43 Harvard undergrads.

59% of the words were suggested by only 1 out of 43 undergrads.

Median number of words per respondent was 7.
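The two summary statistics above (share of words suggested by a single respondent, and median list length) could be computed from raw responses along these lines. This is a sketch under assumptions: the share is taken over distinct words, duplicates within one respondent's list are ignored, and the function name `keyword_overlap` and the toy responses are hypothetical rather than the experiment's data.

```python
from collections import Counter
from statistics import median

def keyword_overlap(responses):
    """`responses` is a list of keyword lists, one per respondent.
    Returns (share of distinct words named by exactly one respondent,
    median number of distinct words per respondent)."""
    # Count respondents per word; set() ignores repeats within one list.
    counts = Counter(w for r in responses for w in set(r))
    singletons = sum(1 for c in counts.values() if c == 1)
    share_unique = singletons / len(counts)
    median_len = median(len(set(r)) for r in responses)
    return share_unique, median_len

# Hypothetical responses from three respondents.
responses = [
    ["bomb", "boston", "marathon"],
    ["boston", "explosion"],
    ["boston", "terror", "suspect", "marathon"],
]
share_unique, median_len = keyword_overlap(responses)
```

A high singleton share together with short lists is the pattern the slides describe: each person recalls a few keywords, and different people recall different ones.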

Humans are good at recalling a small list of good keywords and recognizing a good keyword when they see it.

Humans are bad at recalling a long list of keywords that capture different ways of representing a concept.

Some Existing Options for Automated Keyword Discovery

1. Mine search queries
Google Adwords

2. Thesaurus methods
reference books, WordNet

3. Co-occurrence methods
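As an illustration of the third option, a basic co-occurrence method suggests words that appear in the same texts as a seed keyword. The sketch below is one simple variant, assuming document-level co-occurrence counts; the function name `cooccurring_terms` and the sample texts are hypothetical.

```python
import re
from collections import Counter

def cooccurring_terms(docs, seed, top_n=5):
    """Count how many texts containing `seed` also contain each other
    token, and return the `top_n` most frequent (token, count) pairs."""
    seed = seed.lower()
    counts = Counter()
    for doc in docs:
        tokens = set(re.findall(r"[#\w]+", doc.lower()))
        if seed in tokens:
            counts.update(tokens - {seed})
    return counts.most_common(top_n)

# Hypothetical texts for illustration.
docs = [
    "boston bombing suspect arrested",
    "suspect in custody after boston bombing",
    "red sox win in boston",
]
suggestions = cooccurring_terms(docs, "bombing", top_n=2)
```

Raw counts favor words that are common everywhere, which is one reason such suggestions tend to need human review.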

Enter Thresher.
(Based on King, Lam, and Roberts, 2014)

Thresher keeps humans in the loop and helps them find more and better words faster.

The Thresher Algorithm

Reference set R: texts about concept of interest

Search set S: broad set of texts

Target set T: texts in S about the same concept as in R

Goal: Estimate T and find keywords that define T.
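The slide gives only the three sets and the goal, so the following is a hedged sketch of the setup rather than the Thresher algorithm itself. It assumes a stand-in scoring rule (a smoothed log-ratio of word document frequencies in R versus S) to estimate T, then ranks keywords by how much more often they appear in the estimated T than in the rest of S. All function names and the toy texts are hypothetical.

```python
import math
import re
from collections import Counter

def tokens(text):
    """Lowercased set of words and hashtags in a text."""
    return set(re.findall(r"[#\w]+", text.lower()))

def doc_freq(docs):
    """Number of documents in which each token appears."""
    df = Counter()
    for doc in docs:
        df.update(tokens(doc))
    return df

def estimate_target(reference, search):
    """Split the search set S into an estimated target set T and the rest."""
    df_r, df_s = doc_freq(reference), doc_freq(search)
    n_r, n_s = len(reference), len(search)

    def score(doc):
        total = 0.0
        for w in tokens(doc):
            if df_s[w] == n_s:
                continue  # in every S text (e.g. the search term): uninformative
            # Smoothed log-ratio of document frequency in R versus S.
            total += math.log((df_r[w] + 1) / (n_r + 2))
            total -= math.log((df_s[w] + 1) / (n_s + 2))
        return total

    scores = [score(doc) for doc in search]
    target = [d for d, s in zip(search, scores) if s > 0]
    rest = [d for d, s in zip(search, scores) if s <= 0]
    return target, rest

def rank_keywords(inside, outside, top_n=5):
    """Tokens ranked by how much more often they appear inside than outside."""
    df_in, df_out = doc_freq(inside), doc_freq(outside)
    n_in, n_out = max(len(inside), 1), max(len(outside), 1)
    gap = {w: df_in[w] / n_in - df_out[w] / n_out for w in df_in}
    return sorted(gap, key=lambda w: (-gap[w], w))[:top_n]

# Hypothetical toy sets: R defined by a hashtag, S by a broad keyword.
R = ["#bostonbombings suspect in custody",
     "#bostonbombings fbi names suspect",
     "#bostonbombings police search for suspect"]
S = ["boston police search for suspect",
     "boston fbi has suspect in custody",
     "boston red sox game tonight",
     "boston bruins game tonight"]
T, rest = estimate_target(R, S)
target_words = rank_keywords(T, rest, top_n=1)
non_target_words = rank_keywords(rest, T, top_n=2)
```

In keeping with the human-in-the-loop idea, ranked lists like these would be shown to an analyst to accept or reject rather than applied automatically.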

Thresher Applications

The Boston Marathon Bombings on Twitter

R: #bostonbombings
S: boston

Thresher separates relevant words from irrelevant words.

Target set words

suspect, bomb, police, people, fbi, tsarnaev, arrest, die, terror, attack, kill, obama, custody, prayer, dzhokhar, god, false, #prayforboston, #tcot, #bostonmarathon, picture, identify, russia, #watertown, tamelan, islam, jihad, …

Non-target set words

game, red sox, bruins, celtics, back, tonight, fan, #mlb, night, chicago, new york, garnett, fenway, rondo, #job, playoff, yankees, blackhawks, stanley, pizza, #nhl, draft, …

The Bo Xilai Scandal on Chinese blogs and forums

R: 薄熙来 (Bo Xilai)
S: 重庆 (Chongqing)

王立军 Wang Lijun
政治 government
事件 event (euphemism for the scandal)
打黑 strike corruption
犯罪 commit a crime
民主 democracy
权力 power
文革 Cultural Revolution
领导 leader
改革 reform
群众 the masses
中央中共 Central Communist Party
社会主义 socialism
唱红 sing red songs
黑社会 black society
干部 cadre
路线 party line

Writings About Suicide Bombings in Arabic

R: عمليات الاستشهادية (martyrdom operations)
from "Haqibatu'l-Mujahid" (A Mujahid's Bookbag)

S: "Pulpit of Tawhid and Jihad"
A Jihadist web library

العدو enemy
قتل killing
والنكاية to vex or spite ("vex the infidels")
يَعْلَمُهُمْ teach them
الْخَيْلِ steed
وَأَعِدُّوا fight
تُظْلَمُونَ wronged
ترهبون terrify
الغلام boy (the story of the boy and the king, relevant to jihadis)

"And prepare against them whatever you are able of power and of steeds of war by which you may terrify the enemy of Allah and your enemy and others besides them whom you do not know [but] whom Allah knows. And whatever you spend in the cause of Allah will be fully repaid to you, and you will not be wronged."

-Quran 8:60

Can we use Thresher to define different Arabic dialects by keywords and retrieve texts from them?

Columns: Egyptian, Gulf, Levantine, MSA, Translation

What: Egyptian ايه; Gulf وش; Levantine شو; MSA ماذا
Don't know: Egyptian معرفش; Gulf مادري; Levantine ما بعرف; MSA لا يعرف
To want/I want: Egyptian عايز; Gulf يبي; Levantine بدي
To want/You want: Egyptian عاوز; Gulf يبغى; Levantine بدك
To want/You want (pl.): Egyptian عايزين; Gulf يبون; Levantine بدكم

Conversations can evolve quickly in response to certain actors and we need to be able to follow them.

Evading Censors in China

自由 Freedom
目田 Eye field

(homograph)

和谐 Harmonious [Society]
河蟹 River crab

(homophone)

Evading Authorities and the Distribution of Child Pornography

PatrickLam.org
ThresherVentures.com

iq.harvard.edu