Classification of CNN.com Articles using a TF*IDF Metric

Marie Vans and Steven Simske
HP Labs; Fort Collins, Colorado
April 20, 2016


Page 1: Classification of CNN.com Articles using a TF*IDF Metric

Classification of CNN.com Articles using a TF*IDF Metric
Marie Vans and Steven Simske
HP Labs; Fort Collins, Colorado
April 20, 2016

Page 2: Classification of CNN.com Articles using a TF*IDF Metric

Agenda
• TF*IDF Family of Metrics
• Word Frequencies
• Data Set & Preprocessing
• Algorithms for word frequencies and classification
• An example
• Results
• Future Directions
• Conclusions

Page 3: Classification of CNN.com Articles using a TF*IDF Metric

TF*IDF – Family – Term Frequency

TF   Name
1    Power
2    Mean
3    NormLog
4    Log
5    NormLogs
6    NormMean
7    NormPower
8    NormPowers

[Equation column: not recoverable from the transcript]

Page 4: Classification of CNN.com Articles using a TF*IDF Metric

TF*IDF – Family – Inverse Document Frequency

IDF  Name              Equation
1    NormLogsOfSums    one form if LogRatio ≥ MinLogRatio, another if LogRatio < MinLogRatio
2    NormSumsOfLogs    if LogRatio ≥ MinLogRatio
3    SumOfPowers
4    PowerOfSums

[Equation bodies: not recoverable from the transcript]

Page 5: Classification of CNN.com Articles using a TF*IDF Metric

TF*IDF – Family – Inverse Document Frequency

IDF  Name
5    Mean
6    NormSumOfLogs
7    NormLogOfSums
8    NormSumOfPowers
9    NormSumsOfPowers
10   SumOfLogs
11   LogOfSums
12   NormMean

Page 6: Classification of CNN.com Articles using a TF*IDF Metric

TF*IDF – Family – Inverse Document Frequency

IDF  Name
13   NormPowerOfSums
14   NormPowersOfSums

Symbols:
• i = current word
• j = current document
• k = total words in document j
• n = total words in documents other than the current document
• N = total number of documents in the corpus
• $w_{i,j}$ = number of occurrences of word i in document j
• $w_{i,n}$ = number of occurrences of word i in other documents
• $n_i$ = number of documents in which i occurs
• LogRatio = ratio of log for individual word to log for document length
• MinLogRatio = user-settable minimum for LogRatio
• WordPower & DocPower = adjustable values

Page 7: Classification of CNN.com Articles using a TF*IDF Metric

TF*IDF – Family – Putting it together

TF_Power * IDF_NormLogsOfSums   (one form if LogRatio ≥ MinLogRatio, another if LogRatio < MinLogRatio)
TF_Power * IDF_NormSumsOfLogs
...
TF_NormPowers * IDF_NormPowersOfSums

8 TF variants × 14 IDF variants = 112 TF*IDF equations
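The count of 112 follows from pairing each of the 8 TF variants with each of the 14 IDF variants. A minimal sketch that enumerates the pairings by name only; the variant names are taken from the slides, while the `TF_x*IDF_y` label format and the function name are illustrative assumptions (the equations themselves are not reproduced here).

```python
from itertools import product

# Variant names as listed on the TF and IDF slides.
TF_NAMES = ["Power", "Mean", "NormLog", "Log",
            "NormLogs", "NormMean", "NormPower", "NormPowers"]
IDF_NAMES = ["NormLogsOfSums", "NormSumsOfLogs", "SumOfPowers", "PowerOfSums",
             "Mean", "NormSumOfLogs", "NormLogOfSums", "NormSumOfPowers",
             "NormSumsOfPowers", "SumOfLogs", "LogOfSums", "NormMean",
             "NormPowerOfSums", "NormPowersOfSums"]


def metric_names():
    """Return one label per TF/IDF pairing (label format is an assumption)."""
    return [f"TF_{tf}*IDF_{idf}" for tf, idf in product(TF_NAMES, IDF_NAMES)]


names = metric_names()
print(len(names))   # 112
print(names[0])     # TF_Power*IDF_NormLogsOfSums
```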

Page 8: Classification of CNN.com Articles using a TF*IDF Metric

Word Frequencies

$f(word_i)_j$ : the frequency of word i in file j

$T_{word_i} = \sum_{j=1}^{n_{files}} f(word_i)_j$ : the frequency of word i in nfiles

Page 9: Classification of CNN.com Articles using a TF*IDF Metric

CNN Data Set

Class Name   Total Files   Training Set Files   Test Set Files
Business     161           81                   80
Health       290           145                  145
Justice      224           112                  112
Living       98            49                   49
Opinion      192           96                   96
Politics     195           98                   97
Showbiz      241           121                  120
Sport        148           74                   74
Tech         132           66                   66
Travel       171           86                   85
US           160           80                   80
World        988           494                  494

• 12 classes
• 3,000 total files
• Each class split into 2 sets:
  • Training Set
  • Test Set
• File classes ground-truthed by CNN

Rafael Dueire Lins, Steven J. Simske, Luciano de Souza Cabral, Gabriel de Silva, Rinaldo Lima, Rafael F. Mello, and Luciano Favaro. A multi-tool scheme for summarizing textual documents. In Proceedings of the 11th IADIS International Conference WWW/INTERNET 2012, pages 1–8, July 2012.

Page 10: Classification of CNN.com Articles using a TF*IDF Metric

CNN Data Set

Class Name   Train Set Unique Words   Test Set Unique Words   Total Words Processed
Business     8278                     7851                    16129
Health       12246                    12036                   24282
Justice      9133                     9032                    18165
Living       7936                     7030                    14966
Opinion      11382                    10886                   22268
Politics     9268                     9039                    18307
Showbiz      8997                     9949                    18946
Sport        7445                     7191                    14636
Tech         7971                     7548                    15519
Travel       14931                    12612                   27543
US           8488                     8707                    17195
World        22936                    23441                   46377

• 12 classes
• Total words: 254,333
  • Training Set: 129,011
  • Test Set: 125,322

Page 11: Classification of CNN.com Articles using a TF*IDF Metric

Preprocessing
• Remove “stop words”
• Remove punctuation (hyphenation excepted)
• No lemmatization
• SharpNLP – Open Source Natural Language Processing (https://sharpnlp.codeplex.com/)
  • sentence splitter
  • tokenizer
  • part-of-speech tagger
  • chunker
  • parser
  • name finder
  • coreference tool
  • interface to the WordNet lexical database
• File parsed with each word tagged with part of speech
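The first three preprocessing bullets can be sketched in a few lines. This is only an illustration, not the SharpNLP pipeline: the stop-word list is a tiny hypothetical one, lowercasing is an assumption (the example term names later in the deck are lowercase), and part-of-speech tagging is left out.

```python
import re

# Hypothetical, deliberately tiny stop-word list; the list actually used is not given.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "it", "as", "by"}

# A token is a run of letters/digits, optionally joined by internal hyphens,
# so hyphenated words survive as single tokens while other punctuation is dropped.
TOKEN_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase, strip punctuation except hyphens, drop stop words; no lemmatization."""
    tokens = TOKEN_RE.findall(text.lower())
    return [t for t in tokens if t not in stop_words]


print(preprocess("The switchover to renewable energy, and off-shore wind farms."))
```

Apostrophes are treated as plain separators in this sketch, so a possessive such as "Germany's" yields "germany" plus a stray "s"; a real tokenizer would handle that case explicitly.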

Page 12: Classification of CNN.com Articles using a TF*IDF Metric

Program Classes (Not CNN Classes)

• Word Class
  • m_Spelling
  • m_Count (frequency of word in file)
  • m_Weight (assigned by different TF*IDF measures)
  • m_HasHyphen (hyphenated words count as a single word)
  • m_PennTags (part-of-speech tags)
  • m_Tags (number of tags associated with word)
• TermFrequencies Class
  • m_TermName;
  • int m_TermFreq;
• Classify Class
  • m_businessWords;
  • m_healthWords;
  • m_justiceWords;
  • m_livingWords;
  • m_opinionWords;
  • m_politicsWords;
  • m_showbizWords;
  • m_sportWords;
  • m_techWords;
  • m_travelWords;
  • m_usWords;
  • m_worldWords;
  • m_confusionMatrx

Page 13: Classification of CNN.com Articles using a TF*IDF Metric

Algorithm

A. Using Training Set files in each class (i.e., do this 12 times):

1.0 For each file in the set: create a word object for every unique word in the file
2.0 Count the total number of occurrences of each unique word for the entire set of documents
3.0 Calculate the weight of each word:
    weight = total occurrences of word_i in all files / total occurrences of all words in all files
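Steps 1.0 to 3.0 amount to pooling word counts over all files in a class and dividing by the pooled total. A minimal sketch, assuming each file has already been preprocessed into a list of tokens; the function name and the toy input are illustrative, not taken from the original program.

```python
from collections import Counter


def class_weights(files):
    """Map each word to (occurrences in all files) / (occurrences of all words in all files).

    `files` is a list of token lists, one list per file in the class.
    """
    counts = Counter()
    for tokens in files:          # steps 1.0 and 2.0: pooled per-word counts
        counts.update(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: n / total for word, n in counts.items()}   # step 3.0


# Toy class with two already-tokenized files (hypothetical data).
toy_class = [["nuclear", "power", "nuclear"], ["germany", "nuclear"]]
weights = class_weights(toy_class)
print(weights["nuclear"])   # 0.6
```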

Page 14: Classification of CNN.com Articles using a TF*IDF Metric

Algorithm

B. Using the Testing Set files in a specific class (e.g., business):

1.0 For each file in the set: create a word object for every unique word in the file
2.0 Count the total number of occurrences of each unique word for the entire set of documents
3.0 Calculate the weight of each word:
    Single file: weight = total occurrences of word_i in file / total occurrences of all words in file
    All files in class: weight = total occurrences of word_i in all files / total occurrences of all words in all files

$T_{word_i} = \sum_{j=1}^{n_{files}} f(word_i)_j$

Page 15: Classification of CNN.com Articles using a TF*IDF Metric

Algorithm

D. Classify each word_i in one test file by comparing it to the same word in all training classes. For a file in the Business test class:

$C_{health} = Test_{word_i} \times HealthTrain_{word_i}$
$C_{justice} = Test_{word_i} \times JusticeTrain_{word_i}$
...
$C_{world} = Test_{word_i} \times WorldTrain_{word_i}$

$Class = \max\{C_{business}, C_{health}, \dots, C_{world}\}$
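The comparison above can be sketched as follows. Each test-file word weight is multiplied by the weight of the same word in each training class, and the class with the largest score is chosen. Summing the per-word products into one score per class is an assumption here (the slides report a single value per class but do not spell out the aggregation), and all names and toy weights are hypothetical.

```python
from collections import Counter


def file_weights(tokens):
    """Per-file weights: occurrences of a word / occurrences of all words in the file."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()} if total else {}


def class_scores(test_weights, train_weights_by_class):
    """Score per class: sum over test words of Test(word) * ClassTrain(word).

    Words absent from a training class contribute 0 to that class's score.
    """
    return {
        cls: sum(w * train.get(word, 0.0) for word, w in test_weights.items())
        for cls, train in train_weights_by_class.items()
    }


def classify(test_weights, train_weights_by_class):
    """Return the class with the maximum score."""
    scores = class_scores(test_weights, train_weights_by_class)
    return max(scores, key=scores.get)


# Hypothetical training weights for two classes and one tokenized test file.
train = {
    "business": {"nuclear": 0.2, "energy": 0.3, "market": 0.5},
    "health": {"nuclear": 0.05, "doctor": 0.95},
}
test = file_weights(["nuclear", "energy", "nuclear", "plant"])
print(classify(test, train))   # business
```

The same functions cover the whole-class variant if the test weights are computed over all files of a test class instead of a single file.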

Page 16: Classification of CNN.com Articles using a TF*IDF Metric

Algorithm

C. Classify each word_i in the entire test class by comparing it to the same word in all training classes. For the Business test class:

$C_{health} = Test_{word_i} \times HealthTrain_{word_i}$
$C_{justice} = Test_{word_i} \times JusticeTrain_{word_i}$
...
$C_{world} = Test_{word_i} \times WorldTrain_{word_i}$

$Class = \max\{C_{business}, C_{health}, \dots, C_{world}\}$

Page 17: Classification of CNN.com Articles using a TF*IDF Metric

CNN Data Set – Example Article – Business Class

After Fukushima: Could Germany's nuclear gamble backfire?

As Germany's switchover from nuclear power to renewable energy gathers pace, concerns are mounting over the cost to the country's prosperity and its already squeezed consumers.

Politicians in Europe's largest economy want renewable power to contribute 35% of the country's electricity consumption by 2020 and 80% by 2050 as part of its clean energy drive.

The country's 'energiewende' -- translated as energy transformation -- is part of the government's plan to move away from nuclear power and fossil fuels to renewable energy sources, following Japan's Fukushima disaster in 2011.

Michael Limburg, vice-president of the European Institute for Climate and Energy, told CNN that the government's energy targets are 'completely unfeasible.'

'Of course, it's possible to erect tens of thousands of windmills but only at an extreme cost and waste of natural space,' he said. 'And still it would not be able to deliver electricity when it is needed.'

The government is investing heavily in onshore and offshore wind farms and solar technology in an effort to reduce 40% of greenhouse gas emissions by 2020.

Last year Chancellor Angela Merkel, who this week won her third term as Germany's leader, proposed to construct offshore wind farms in the North Sea, a plan that would cost 200 billion euros ($270 billion), according to the DIW economic institute in Berlin.

As part of the energy drive, Merkel also pledged to permanently shut down the country's 17 nuclear reactors, which fuel 18% of the country's power needs.

Under Germany's Atomic Energy Act, the last nuclear power plant will be disconnected by 2022.

Page 18: Classification of CNN.com Articles using a TF*IDF Metric

CNN Data Set – Example Frequencies – Training Set

Single File
m_TermName    m_TermFreq (int)   % Occurrence in File
"fukushima"   6                  0.0102739726027397
"germany"     1                  0.00156739811912226
"nuclear"     12                 0.0205479452054795

All Files in Class
m_TermName    m_TermFreq (int)   % Occurrence in Class
"fukushima"   9                  0.000307188203972967
"germany"     26                 0.000887432589255239
"nuclear"     33                 0.00112635674790088

Page 19: Classification of CNN.com Articles using a TF*IDF Metric

CNN Data Set – Example Frequencies – Test Set

Single File
m_TermName    m_TermFreq (int)   % Occurrence in File
"fukushima"   2                  0.0056022408963585
"germany"     9                  0.0252100840336134
"nuclear"     5                  0.0140056022408964

All Files in Class
m_TermName    m_TermFreq (int)   % Occurrence in Class
"fukushima"   2                  0.000073773515308004
"germany"     21                 0.000774621910734047
"nuclear"     6                  0.000221320545924013

Page 20: Classification of CNN.com Articles using a TF*IDF Metric

Classify All Words in Single Business Test File

Class      Value
Business   0.00069364   (MAX Class Value)
Health     0.00030063
Justice    0.00025000
Living     0.00026707
Opinion    0.00033446
Politics   0.00034694
Showbiz    0.00025372
Sport      0.00029984
Tech       0.00033337
Travel     0.00023201
US         0.00031539
World      0.00040208

Page 21: Classification of CNN.com Articles using a TF*IDF Metric

Classify All Words in All Business Test Files

Class      Value
Business   0.00059513   (MAX Class Value)
Health     0.00035854
Justice    0.00027830
Living     0.00038269
Opinion    0.00039295
Politics   0.00036828
Showbiz    0.00029162
Sport      0.00036698
Tech       0.00040147
Travel     0.00032406
US         0.00032592
World      0.00037747

Page 22: Classification of CNN.com Articles using a TF*IDF Metric

Confusion Matrix

• Each column contains samples of classifier output
• Each row contains samples in the true class
• Each row sums to 1.0
• The diagonal shows the fraction classified correctly
  • Mean of diagonal = 89%
• Off-diagonal entries show the types of errors that occur
  • A is misclassified as B – 3%
  • A is misclassified as C – 3%

Normalized Confusion Matrix

                                    Classifier Output (Computed Classification), Prediction
True Class of the Samples (Input)   A      B      C
A                                   0.94   0.03   0.03
B                                   0.08   0.85   0.07
C                                   0.08   0.04   0.88
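A short sketch of how such a matrix can be produced and summarized: raw counts are divided by their row totals so each row sums to 1.0, and the mean of the diagonal gives the average fraction classified correctly. The raw counts below are hypothetical (100 samples per class, chosen so they normalize to the matrix above); the function names are illustrative.

```python
def normalize_rows(counts):
    """Divide every row by its total so each true-class row sums to 1.0."""
    normalized = []
    for row in counts:
        total = sum(row)
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized


def mean_diagonal(matrix):
    """Average of the diagonal entries (fraction classified correctly per class)."""
    return sum(matrix[i][i] for i in range(len(matrix))) / len(matrix)


# Hypothetical raw counts: rows are true classes A, B, C; columns are predictions.
raw_counts = [[94, 3, 3],
              [8, 85, 7],
              [8, 4, 88]]

matrix = normalize_rows(raw_counts)
print(round(mean_diagonal(matrix), 2))   # 0.89
```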

Page 23: Classification of CNN.com Articles using a TF*IDF Metric

Results - Classification

Rows: true class. Columns: classifier output.

          business  health    justice   living    opinion   politics  showbiz   sport     tech      travel    us        world
business  0.75      0         0         0.0875    0         0         0         0.025     0.1125    0.025     0         0
health    0         0.7724    0.0207    0.1793    0         0.0138    0.0069    0         0.0069    0         0         0
justice   0         0         0.9018    0.0179    0         0.0446    0.0089    0         0         0         0.0268    0
living    0.0204    0.0408    0         0.8163    0         0.0204    0         0.0612    0.0408    0         0         0
opinion   0.2083    0.0729    0.0208    0.2708    0.0313    0.1667    0         0.0417    0.0417    0         0.0104    0.1354
politics  0.0103    0.0103    0.0515    0.0412    0         0.8557    0         0         0         0         0         0.0309
showbiz   0.0083    0.0083    0.1583    0.1417    0         0.0083    0.6417    0.025     0         0         0.0083    0
sport     0.027     0.0135    0.0405    0.0541    0         0.027     0         0.8108    0         0.027     0         0
tech      0.0303    0.0303    0.0152    0.2121    0         0         0.0152    0         0.6818    0.0152    0         0
travel    0.1412    0.0118    0.0235    0.1412    0         0.0824    0         0.0353    0.0588    0.4353    0.0471    0.0235
us        0.025     0.05      0.3125    0.175     0         0.1125    0.025     0.0625    0.0375    0.0125    0.1875    0
world     0.0769    0.0142    0.1316    0.0789    0.002     0.1255    0.0061    0.0142    0.0202    0.002     0.0142    0.5142

• The diagonal entries are the correct classifications.
• The rows sum to 1.0, since the left column represents the actual class from which the document is taken.
• The column sums have a mean of 1.0, with some variance depending on whether the class in the column is an attractor class (> 1.0) or a repulsor class (< 1.0).
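The attractor/repulsor distinction in the note above can be checked by summing each column of the row-normalized matrix. A minimal sketch; the matrix values are transcribed from the slide's results table, the 1.0 threshold comes from the note, and the function names are illustrative.

```python
CLASSES = ["business", "health", "justice", "living", "opinion", "politics",
           "showbiz", "sport", "tech", "travel", "us", "world"]

# Rows are true classes, columns are classifier output, both in CLASSES order.
MATRIX = [
    [0.75, 0, 0, 0.0875, 0, 0, 0, 0.025, 0.1125, 0.025, 0, 0],
    [0, 0.7724, 0.0207, 0.1793, 0, 0.0138, 0.0069, 0, 0.0069, 0, 0, 0],
    [0, 0, 0.9018, 0.0179, 0, 0.0446, 0.0089, 0, 0, 0, 0.0268, 0],
    [0.0204, 0.0408, 0, 0.8163, 0, 0.0204, 0, 0.0612, 0.0408, 0, 0, 0],
    [0.2083, 0.0729, 0.0208, 0.2708, 0.0313, 0.1667, 0, 0.0417, 0.0417, 0, 0.0104, 0.1354],
    [0.0103, 0.0103, 0.0515, 0.0412, 0, 0.8557, 0, 0, 0, 0, 0, 0.0309],
    [0.0083, 0.0083, 0.1583, 0.1417, 0, 0.0083, 0.6417, 0.025, 0, 0, 0.0083, 0],
    [0.027, 0.0135, 0.0405, 0.0541, 0, 0.027, 0, 0.8108, 0, 0.027, 0, 0],
    [0.0303, 0.0303, 0.0152, 0.2121, 0, 0, 0.0152, 0, 0.6818, 0.0152, 0, 0],
    [0.1412, 0.0118, 0.0235, 0.1412, 0, 0.0824, 0, 0.0353, 0.0588, 0.4353, 0.0471, 0.0235],
    [0.025, 0.05, 0.3125, 0.175, 0, 0.1125, 0.025, 0.0625, 0.0375, 0.0125, 0.1875, 0],
    [0.0769, 0.0142, 0.1316, 0.0789, 0.002, 0.1255, 0.0061, 0.0142, 0.0202, 0.002, 0.0142, 0.5142],
]


def column_sums(matrix, classes):
    """Sum each column: how much classifier output a class attracts overall."""
    return {cls: sum(row[j] for row in matrix) for j, cls in enumerate(classes)}


def split_classes(sums, threshold=1.0):
    """Attractors have a column sum above the threshold, repulsors below it."""
    attractors = [c for c, s in sums.items() if s > threshold]
    repulsors = [c for c, s in sums.items() if s < threshold]
    return attractors, repulsors


sums = column_sums(MATRIX, CLASSES)
attractors, repulsors = split_classes(sums)
print(attractors)
print(repulsors)
```

With these values "living" has the largest column sum and "opinion" the smallest, which matches the very low diagonal entry in the opinion row.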

Page 24: Classification of CNN.com Articles using a TF*IDF Metric

Example of Incorrectly Classified File

Results of File from Opinion Test Class:

Class      Value
Business   0.00033924
Health     0.00025056
Justice    0.00027728
Living     0.00027807
Opinion    0.00041936   (3rd MAX Class Value)
Politics   0.00046704   (MAX Class Value)
Showbiz    0.00023136
Sport      0.00028422
Tech       0.00025991
Travel     0.00021793
US         0.00032973
World      0.00043251   (2nd MAX Class Value)

It takes 3 tries to get it right
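The number of tries is the rank of the true class when the class scores are sorted from highest to lowest. A minimal sketch using the scores from this slide; the function name is illustrative.

```python
# Class scores for the example file from the Opinion test class (values from the slide).
SCORES = {
    "business": 0.00033924, "health": 0.00025056, "justice": 0.00027728,
    "living": 0.00027807, "opinion": 0.00041936, "politics": 0.00046704,
    "showbiz": 0.00023136, "sport": 0.00028422, "tech": 0.00025991,
    "travel": 0.00021793, "us": 0.00032973, "world": 0.00043251,
}


def attempts_needed(scores, true_class):
    """1-based rank of the true class among the scores, highest score first."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked.index(true_class) + 1


print(attempts_needed(SCORES, "opinion"))   # 3
```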

Page 25: Classification of CNN.com Articles using a TF*IDF Metric

Classification Attempts to Success

Measures the average number of attempts until the correct class is chosen:

$\text{Attempts} = \frac{1}{n_{files}} \sum_{k=1}^{12} k \cdot P_k$

where
P1 = number correctly classified on the first try
P2 = number correctly classified after two tries
...
P12 = number correctly classified on the last try
nfiles = number of files in the testing class

Example: Worst Class – Opinion

P1 = 3 × 1
P2 = 35 × 2
P3 = 31 × 3
P4 = 14 × 4
P5 = 7 × 5
P6 = 3 × 6
P7 = 2 × 7
P8 = 1 × 8

Σ = 297, so 297/96 = 3.09
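The Opinion example can be reproduced directly. A minimal sketch, reading P_k as the number of files first classified correctly on try k (an interpretation consistent with the counts summing to the 96 test files); the function name is illustrative.

```python
def mean_attempts(counts_by_try):
    """Average number of tries: sum(k * P_k) / number of files.

    counts_by_try[k-1] is the number of files classified correctly on try k.
    """
    n_files = sum(counts_by_try)
    if n_files == 0:
        raise ValueError("no files to average over")
    weighted = sum(k * p for k, p in enumerate(counts_by_try, start=1))
    return weighted / n_files


# Opinion test class: P1..P8 from the slide, P9..P12 are zero.
OPINION = [3, 35, 31, 14, 7, 3, 2, 1, 0, 0, 0, 0]

print(sum(OPINION))             # 96
print(mean_attempts(OPINION))   # 3.09375
```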

Page 26: Classification of CNN.com Articles using a TF*IDF Metric

Results - Classification Attempts to Success

• Measures the average number of attempts until the correct class is chosen
• Ideal is 1.0: we get it right on the first try
• Best Class – Justice
  • Correctly classified: 0.9018
  • Mean classification attempts: 1.19
  • Delta from ideal = 0.19
• Worst Class – Opinion
  • Correctly classified: 0.0313
  • Mean classification attempts: 3.09
  • Delta from ideal = 2.09
• The best class's delta from ideal (0.19) is 11 times smaller than the worst class's (2.09)
• All other classes fall between best and worst

Page 27: Classification of CNN.com Articles using a TF*IDF Metric

Results - General

• Confusion matrix shows good classification results
  • Average classification rate for all classes = 0.61655883
• Classification errors (column sums of the confusion matrix):

  business  health  justice  living  opinion  politics  showbiz  sport   tech    travel  us      world
  1.2978    1.0245  1.6765   2.216   0.0333   1.4569    0.7037   1.0757  1.0003  0.517   0.2943  0.704

  • Attractor classes: Business, Health, Justice, Living, Politics, Sport, Tech
  • Repulsor classes: Opinion, Showbiz, Travel, U.S., World
• Normalized by total occurrences of all words in file
  • For classification of single file
• Normalized by total occurrences of all words in the class
  • For classification of multiple files

Page 28: Classification of CNN.com Articles using a TF*IDF Metric

Discussion

Opinion, Justice, World, U.S.

1. Documents more varied?
2. Generic words?
3. Topics overlap?
4. Word clusters are broader?

Page 29: Classification of CNN.com Articles using a TF*IDF Metric

Future Directions

• Automatic summarization based on word frequencies in sentences
  • Data from Brazil also contained Gold Standard sentences for summarization
  • Each file contains sentences pulled out of the full article by at least 3 students
  • Gold Standard sentences for each file act as ground truth for automatic summarization
• New York Times Annotated Corpus (https://catalog.ldc.upenn.edu/LDC2008T19)
  • Written and published by the New York Times between January 1, 1987 and June 19, 2007
  • Metadata provided by the New York Times Newsroom, the New York Times Indexing Service and the online production staff at nytimes.com:
    • Over 1.8 million articles
    • Over 650,000 article summaries written by library scientists
    • Over 1,500,000 articles manually tagged by library scientists with tags drawn from a normalized indexing vocabulary of people, organizations, locations and topic descriptors
    • Over 275,000 algorithmically-tagged articles that have been hand verified by the online production staff at nytimes.com

Page 30: Classification of CNN.com Articles using a TF*IDF Metric

Conclusions

• A family of TF*IDF metrics for summarization and classification
• A simple TF*IDF metric
• Classification scheme that works well on a set of 3,000 CNN articles separated into 12 classes
• Classification attempts to success is a measure that tells us how hard it is to classify
• Attractor and repulsor classes may help in identifying imbalances in the data
• The simple TF*IDF metric can be used for benchmarking the rest of the 112 TF*IDF equations

Page 31: Classification of CNN.com Articles using a TF*IDF Metric


Thank You for Your Kind Attention