Data Science for Computational Journalism

Chengkai Li
Associate Professor, Department of Computer Science and Engineering
Director, Innovative Database and Information Systems Research (IDIR) Laboratory
University of Texas at Arlington
PyData Dallas, April 26, 2015

Research at the Innovative Database and Information Systems Research (IDIR) Laboratory

Research areas
o Big Data and Data Science (Database, Data Mining, Web Data Management, Information Retrieval)

Theme of current research
o building large-scale human-assisting and human-assisted data and information systems with high usability, high efficiency and applications for social good

Research directions
o computational journalism
o crowdsourcing and human computation
o data exploration by ranking/skyline/preference queries
o database testing
o entity search and entity query
o graph database usability

Our Computational Journalism Project
o Started in 2010. Collaborative project with Duke, Google Research, HP Labs, Stanford
o Fact finding: finding and monitoring number-based facts pertinent to real-world events. The facts are leads to news stories.
o Fact checking: discovering and checking factual claims in debates, speeches, interviews, news

FactWatcher
Tuple t for new real-world event appended to database
Find constraint-measure pair (C, M) such that t is in the contextual skyline

Constraint                      Measure
month=Feb                       pts, ast, reb
opp_team=Nets                   ast, reb
team=Celtics ∧ opp_team=Nets    ast, reb
…                               …

Generate factual claim
Wesley had 12 points, 13 assists and 5 rebounds on February 25, 1996 to become the first player with a 12/13/5 (points/assists/rebounds) in February.

http://en.wikipedia.org/wiki/Basketball
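The contextual-skyline test on this slide can be sketched in a few lines of Python. This is a brute-force illustration, not FactWatcher's actual algorithm: the function names, the dictionary layout of a tuple, the toy rows and the assumption that larger measure values are better are all made up for the example.

```python
from itertools import combinations


def dominates(a, b, measures):
    """True if a is at least as good as b on every measure and strictly better on one.

    Assumption: larger values are better for every measure.
    """
    return (all(a[m] >= b[m] for m in measures)
            and any(a[m] > b[m] for m in measures))


def in_contextual_skyline(t, history, constraint, measures):
    """Check whether t is undominated among the past tuples that satisfy the constraint."""
    context = [r for r in history
               if all(r[attr] == value for attr, value in constraint.items())]
    return not any(dominates(r, t, measures) for r in context)


def skyline_pairs(t, history, dimensions, measure_sets):
    """Enumerate every (constraint, measures) pair for which t is in the contextual skyline.

    Constraints are conjunctions of equalities built from t's own dimension values.
    """
    pairs = []
    for size in range(len(dimensions) + 1):
        for attrs in combinations(dimensions, size):
            constraint = {attr: t[attr] for attr in attrs}
            for measures in measure_sets:
                if in_contextual_skyline(t, history, constraint, measures):
                    pairs.append((constraint, measures))
    return pairs


# Hypothetical toy data, not taken from the talk.
history = [
    {"month": "Feb", "team": "Celtics", "opp_team": "Nets", "pts": 20, "ast": 5, "reb": 4},
    {"month": "Mar", "team": "Celtics", "opp_team": "Nets", "pts": 15, "ast": 14, "reb": 6},
]
new_tuple = {"month": "Feb", "team": "Celtics", "opp_team": "Nets", "pts": 12, "ast": 13, "reb": 5}

pairs = skyline_pairs(new_tuple, history,
                      dimensions=["month", "team", "opp_team"],
                      measure_sets=[("pts", "ast", "reb"), ("ast", "reb")])
```

Each pair returned is a candidate for a generated claim of the form "first player with ... in February". The enumeration is exponential in the number of dimension attributes, so it only suits small illustrations.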

Factual Claims
Prominent streaks
o “This month the Chinese capital has experienced 10 days with a maximum temperature in around 35 degrees Celsius – the most for the month of July in a decade.”
o “The Nikkei 225 closed below 10000 for the 12th consecutive week, the longest such streak since June 2009.”
Situational facts
o “Paul George had 21 points, 11 rebounds and 5 assists to become the first Pacers player with a 20/10/5 (points/rebounds/assists) game against the Bulls since Detlef Schrempf in December 1992.”
o “The social world’s most viral photo ever generated 3.5 million likes, 170,000 comments and 460,000 shares by Wednesday afternoon.”

Domains: politics, sports, weather, crimes, transportation, finance, social media analytics, publications
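A "prominent streak" claim such as the Nikkei example can be checked with a simple scan over a time series. The sketch below is only an illustration of the idea, with hypothetical function names and made-up closing values; it is not the method used in the project.

```python
def streak_lengths(values, predicate):
    """For each position, the length of the run of consecutive matching values ending there."""
    lengths = []
    run = 0
    for value in values:
        run = run + 1 if predicate(value) else 0
        lengths.append(run)
    return lengths


def longest_since(values, predicate):
    """Return (current streak length, index where an equal or longer streak was last reached).

    The index is None when no earlier streak was at least as long as the current one.
    """
    lengths = streak_lengths(values, predicate)
    current = lengths[-1] if lengths else 0
    if current == 0:
        return 0, None
    start = len(values) - current  # first position of the current streak
    for i in range(start - 1, -1, -1):
        if lengths[i] >= current:
            return current, i
    return current, None


# Hypothetical weekly closing values, not real index data.
closes = [9900, 9800, 9700, 10100, 9950, 9900]
current, last_seen = longest_since(closes, lambda close: close < 10000)
```

Here the current streak of closes below 10000 has length 2, and a streak at least that long was last reached at index 2, which is what a phrase like "the longest such streak since ..." would report.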

http://idir.uta.edu/factwatcher/

People Make Claims All The Time
“… our Navy is smaller than it's been since 1917,” said Republican candidate Mitt Romney in the third presidential debate in 2012.

http://en.wikipedia.org/wiki/Mitt_Romney
http://www.thebrainchildgroup.com/

Fact Checking is not Easy
“… our Navy is smaller than it's been since 1917,” said Republican candidate Mitt Romney in the third presidential debate in 2012.

http://en.wikipedia.org/wiki/Mitt_Romney
http://s3.amazonaws.com/thf_media/2010/pdf/Military_chartbook.pdf
http://en.wikipedia.org/wiki/United_States_Navy

vs

Existing Fact Checking Projects
Journalists and reporters spend a good amount of time on fact checking
Politifact http://www.politifact.com/
FactCheckEU https://factcheckeu.org/
FullFact http://fullfact.org/
Snopes http://www.snopes.com/info/whatsnew.asp
Factcheck http://www.factcheck.org/

ClaimBusters
Long-term goal
o (Partly) automate fact checking process
  social media, interviews, debates, speeches, news -> classification & ranking -> factual claims ranked by importance -> checked by algorithms / journalists / citizens / crowd (e.g., Twitter users)
o Plan for Election 2016

Current progress
o Classification models for finding check-worthy factual statements
o Preliminary exploration of crowdsourcing fact-checking

Factual Claim Classification
Dataset: presidential debates
o Source: http://www.debates.org/index.php?page=debate-transcripts
o All 30 debates (11 elections) in history: 1960, 1976-2012
o 20k sentences by presidential candidates: removed very short (< 5 words) sentences

Classify each sentence into 1 of 3 classes
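The length filter described above (dropping sentences shorter than 5 words) might look like the following. This is a minimal sketch: the talk used NLTK for this step, whereas the regular-expression sentence splitter and whitespace word count here are simplifying assumptions, and the function names are hypothetical.

```python
import re


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def drop_short_sentences(sentences, min_words=5):
    """Keep only sentences with at least min_words whitespace-separated words."""
    return [s for s in sentences if len(s.split()) >= min_words]


transcript = ("Thank you. More people are unemployed today than four years ago. "
              "Why should we do that? I was in Iowa yesterday.")
kept = drop_short_sentences(split_sentences(transcript))
```

With this input, "Thank you." is dropped and the other three sentences are kept, since each has at least five words.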

Examples of Sentences
Important factual claims
“We spend less on the military today than at any time in our history.”
“The President’s position on gay marriage has changed.”
“More people are unemployed today than four years ago.”

Unimportant factual claims
“I was in Iowa yesterday.”
“My mother enjoys cooking.”
“I ran for President once before.”

Sentences with no factual claims (just opinions, questions & declarations)
“Iran must not get nuclear weapons.”
“7% unemployment is too high.”
“My opponent is wishy-washy.”
“I will be tough on crime.”
“Why should we do that?”
“Hello, New Hampshire!”
“Our plan is to reduce tax rate by 10%.”

Ground Truth Collection
Each sentence is labelled by two of many participants. The ground truth includes a sentence only if the two participants agreed on its class label.
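The agreement rule can be expressed directly in code. The sketch below assumes each sentence id maps to exactly two labels; the data layout, the function name and the sample labels are illustrative assumptions, with class names borrowed from the experiment slide (NFS, NO, YES).

```python
def build_ground_truth(labels):
    """Keep a sentence only when both participants assigned the same class label.

    labels: dict mapping sentence id -> (label from participant 1, label from participant 2)
    """
    return {sid: first for sid, (first, second) in labels.items() if first == second}


# Hypothetical labels for four sentences.
labels = {
    1: ("YES", "YES"),
    2: ("NO", "NFS"),
    3: ("NFS", "NFS"),
    4: ("YES", "NO"),
}
ground_truth = build_ground_truth(labels)
agreement_rate = len(ground_truth) / len(labels)
```

Sentences 2 and 4 are discarded because the two participants disagreed, so only half of the labelled sentences enter the ground truth in this toy example.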

How We Use Python
Data wrangling
o Use NLTK (Natural Language Toolkit) to transform debate files into a structured data format
o Use mysql-python-connector to store extracted features in a MySQL database
o Use matplotlib to plot classifiers’ performance

Feature extraction
o Use AlchemyAPI (Python wrapper) to extract rich features of sentences: keywords, POS (part-of-speech) tags, sentiments, entities, concepts, taxonomy

Classification
o Use scikit-learn to build classification models

Feature Extraction
Keywords, POS (part-of-speech) tags

    import nltk

    sentence = 'The tax policy for the middle class is bad.'
    # tokenize the sentence, then tag each token with its part of speech
    pos = nltk.pos_tag(nltk.word_tokenize(sentence))
    print(pos)

    [('The', 'DT'), ('tax', 'NN'), ('policy', 'NN'), ('for', 'IN'), ('the', 'DT'), ('middle', 'NN'), ('class', 'NN'), ('is', 'VBZ'), ('bad', 'JJ')]

Feature Extraction
Sentiments

    from alchemyapi import AlchemyAPI

    alchemyapi = AlchemyAPI()
    sentence = 'The tax policy for the middle class is bad.'
    # request document-level sentiment for the raw text
    response = alchemyapi.sentiment('text', sentence)
    sentiment = response['docSentiment']['score']
    print(sentiment)

    -0.6532

Feature Extraction
Entities

    # one combined call returns entities, concepts and taxonomy together
    response = alchemyapi.combined('text', sentence, {'sentiment': 1})
    print(response['entities'])

    [{'sentiment': {'type': 'negative', 'score': '-0.653232'}, 'count': '1', 'type': 'FieldTerminology', 'relevance': '0.33', 'text': 'tax policy'}]

Feature Extraction
Concepts

    print(response['concepts'])

    [{'opencyc': 'http://sw.opencyc.org/concept/Mx4rvViw25wpEbGdrcN5Y29ycA', 'dbpedia': 'http://dbpedia.org/resource/Middle_class', 'freebase': 'http://rdf.freebase.com/ns/m.01lbc_', 'text': 'Middle class', 'relevance': '0.921176'},
     {'dbpedia': 'http://dbpedia.org/resource/Social_class', 'freebase': 'http://rdf.freebase.com/ns/m.07714', 'text': 'Social class', 'relevance': '0.869326'}]

Feature Extraction
Taxonomy

    print(response['taxonomy'])

    /law, govt and politics / legal issues / legislation

Classification Models
Use scikit-learn to build classification models
o Naïve Bayes Classifier (NBC)
o Support Vector Machine (SVM): LinearSVC (linear kernel, multi-class classification)
o Random Forest Classifier (RFC): 200 trees in the forest (n_estimators = 200)

Preliminary Experiments
3 classes
o NFS (non-factual-statement), NO (unimportant factual claim), YES (important factual claim)

5 categories of features
o K: keyword; ET: entity type; P: POS tag; C: concept; T: taxonomy

5 combinations of features (+sentiment, +length)
o K; K+P; K+P+ET; K+P+ET+C; K+P+ET+C+T

Instances
o 1571 sentences in ground truth
o training data : test data = 3:1
o 4-fold cross validation
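The 3:1 train/test ratio follows from 4-fold cross validation: each fold holds out one quarter of the instances and trains on the remaining three quarters. The sketch below shows one way to generate such folds with the standard library only; the function name, the fixed seed and the shuffling strategy are assumptions for illustration, not the talk's code (which relied on scikit-learn).

```python
import random


def k_fold_indices(n_instances, k=4, seed=0):
    """Yield (train_indices, test_indices) for each of k folds over shuffled row indices."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    # deal the shuffled indices round-robin into k folds of near-equal size
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [idx for j, fold in enumerate(folds) if j != held_out for idx in fold]
        yield train, test


# 1571 is the number of ground-truth sentences reported on the slide.
splits = list(k_fold_indices(1571, k=4))
```

Every sentence is used for testing exactly once, and each split has roughly 1178 training and 393 test instances.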

Classification Using scikit-learn

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score  # sklearn.cross_validation in 2015-era releases

    # `data` is assumed to be a pandas DataFrame; the last column (verdict) is the class attribute
    features = data.columns[0:-1]

    # split train/test data (holdout), roughly 3:1
    msk = np.random.rand(len(data)) <= 0.75
    train = data[msk][features]
    test = data[~msk][features]
    train_verdict = data[msk].verdict
    test_verdict = data[~msk].verdict

    # build and apply the model; GaussianNB() or LinearSVC() can be swapped in
    clf = RandomForestClassifier(n_estimators=200)
    clf.fit(train, train_verdict)
    prediction = clf.predict(test)

    # 4-fold cross validation: mean accuracy over the folds
    cv = cross_val_score(clf, data[features], data.verdict, cv=4, scoring='accuracy').mean()

Results: Precision [figure: NBC, RFC, SVM]

Results: Recall [figure: NBC, RFC, SVM]

Results: F-Measure [figure: NBC, RFC, SVM]
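For reference, the three reported metrics can be computed per class as sketched below. The slides do not say which F-measure variant was used, so the balanced F1 score (the harmonic mean of precision and recall) is an assumption here, and the function name and sample labels are hypothetical.

```python
def per_class_scores(y_true, y_pred, label):
    """Return (precision, recall, F1) for one class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # correctly predicted as label
    fp = sum(1 for t, p in pairs if t != label and p == label)  # wrongly predicted as label
    fn = sum(1 for t, p in pairs if t == label and p != label)  # missed instances of label
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical predictions over the three classes from the experiments.
y_true = ["YES", "YES", "NO", "NFS", "NFS", "NFS"]
y_pred = ["YES", "NO", "NO", "NFS", "NFS", "YES"]
scores = {label: per_class_scores(y_true, y_pred, label) for label in ("NFS", "NO", "YES")}
```

In this toy example the YES class scores 0.5 on all three metrics, while NFS has perfect precision but recall of 2/3 because one NFS sentence was predicted as YES.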

You are Invited

http://bit.ly/1FSj9pt

Acknowledgment
UTA Students
o Naeemul Hassan
o Afroza Sultana
o Gensheng Zhang
o Joseph Minumol
o Jisa Sebastine

Collaborators
o Bill Adair (Duke)
o Pankaj Agarwal (Duke)
o Sarah Cohen (Columbia)
o James Hamilton (Stanford)
o Ping Luo (Chinese Academy of Sciences)
o Mark Tremayne (UTA)
o Min Wang (Google Research)
o Jun Yang (Duke)
o Cong Yu (Google Research)

Acknowledgment
Funding sponsors

Disclaimer: This material is based upon work partially supported by the National Science Foundation Grants 1018865, 1117369 and 1408928, 2011 and 2012 HP Labs Innovation Research Awards, and the National Natural Science Foundation of China Grant 61370019. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the funding agencies.

Thank You! Questions?
http://ranger.uta.edu/~cli
http://idir.uta.edu
Please help us to label the data: http://bit.ly/1FSj9pt