Textual and Quantitative Analysis: Towards a new, e-mediated Social Science

Khurshid Ahmad, Lee Gillam, and David Cheng
Department of Computing, University of Surrey

Outline

Think Tank
Rationality, Bounded Rationality and Sentiment
News Analysis and Sentiment Analysis
A method for identifying and extracting sentiment
Experiments and Evaluation
Conclusions and Future Work

What is the connection between these pairs of terms:

HAPPY & SAD
MORE & LESS
NORTH & SOUTH
AHEAD & BEHIND
HIGHER & LOWER
LOUDER & QUIETER
IN PROFIT & IN LOSS
OPERATIONAL & BROKEN
MORE EXPENSIVE & LESS EXPENSIVE
AT UNIVERSITY & AWAY FROM UNIVERSITY

METRO, Thursday, June 28, 2005, p. 5.

THINK TANK

We rely on reviews and opinion polls of various kinds:

Film & TV reviews; Book reviews; Resort reviews

Bank reviews; Automobile reviews; White goods reviews;

Consumer surveys; ‘write your own’ reviews;

Newspaper editorials; Editors’ choice.

METRO, Thursday, June 28, 2005, p. 5.

THINK TANK

We rely on the sentiment of the reviewers, editors, investment experts, and ...

We do know the cost of durables, shares, holidays.

A reasonable price is rejected if the reviews are poor; an exorbitant price is acceptable if the reviews are good;

Bad reviews stick in the mind for longer than good reviews.

METRO

THINK TANK

We sometimes rely on the sentiment of the more vociferous in society

The vociferous may call black white, and white black;

The vociferous may repudiate facts and purvey fiction.

METRO

THINK TANK

An internal war may be due to bounded rationality: given certain structural conditions – emergent anarchy, economic scarcity, weakening state structures due to globalization – elites and groups make rational decisions to pursue their aims by violent means. Within the bounded context of their decision-making parameters, going to war may be entirely rational.

THINK TANK

Jackson, Richard (2004). ‘The Social Construction of Internal War’. In Richard Jackson (Ed.), (Re)Constructing Cultures of Violence and Peace. Rodopi: Amsterdam/New York.

We rely on the sentiment of safety expressed by our near and dear, and by the media

The dears may have been mugged or burgled: the falling crime rate does not alleviate the fear of crime (the ‘reassurance gap’)

METRO

THINK TANK

THINK TANK

Turney, Peter D. (2002). “Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews”. In Proc of the 40th Ann. Meeting of the Ass. for Comp. Linguistics (ACL). Philadelphia, July 2002, pp. 417-424. (Available at http://acl.ldc.upenn.edu/P/P02/P02-1053.pdf).

online service | unethical practices
online experience | low funds
direct deposit | other problems
local branch | old man
low fees | lesser evil
well other | virtual monopoly
small part | probably wondering
printable version | little difference
true service | other bank
other bank | possible moment
inconveniently located | extra day

A new bank has just been launched: Punter Smith has passed his judgement on the bank. Which of the two columns tells us that he likes the new outfit?

How can a machine detect the positive/negative sentiment from texts? We look at the collocation of words like excellent & poor in a text corpus.

The pointwise mutual information is computed between word1 & word2:

$$\mathrm{PMI}(word_1, word_2) = \log_2 \frac{p(word_1 \,\&\, word_2)}{p(word_1)\, p(word_2)}$$

Semantic orientation of a phrase is given as:

$$\mathrm{SemOr}(phrase) = \mathrm{PMI}(\text{"excellent"}, phrase) - \mathrm{PMI}(\text{"poor"}, phrase)$$
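The two measures on this slide can be sketched in a few lines. This is a minimal illustration, assuming the base-2 logarithm and the difference form used by Turney (2002), and assuming probabilities are estimated as simple relative frequencies; the function names and all counts are hypothetical.

```python
import math

def pmi(p_joint, p_a, p_b):
    """Pointwise mutual information (base 2) of two events."""
    return math.log2(p_joint / (p_a * p_b))

def semantic_orientation(n_with_excellent, n_with_poor, n_phrase,
                         n_excellent, n_poor, n_total):
    """PMI("excellent", phrase) minus PMI("poor", phrase), from raw counts."""
    p_phrase = n_phrase / n_total
    pmi_excellent = pmi(n_with_excellent / n_total, n_excellent / n_total, p_phrase)
    pmi_poor = pmi(n_with_poor / n_total, n_poor / n_total, p_phrase)
    return pmi_excellent - pmi_poor

# Hypothetical counts: the phrase co-occurs 8 times with "excellent" and
# 2 times with "poor"; both pivot words are equally frequent in the corpus.
so = semantic_orientation(8, 2, 50, 1000, 1000, 1_000_000)
print(round(so, 3))
```

A positive value suggests the phrase keeps company with "excellent" more than with "poor"; a negative value suggests the opposite.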

THINK TANK

Phrase | Semantic Orientation | Phrase | Semantic Orientation
online service | 2.780 | unethical practices | -8.484
online experience | 2.253 | low funds | -6.843
direct deposit | 1.288 | other problems | -2.748
local branch | 0.421 | old man | -2.566
low fees | 0.333 | lesser evil | -2.288
well other | 0.237 | virtual monopoly | -2.050
small part | 0.053 | probably wondering | -1.830
printable version | -0.705 | little difference | -1.615
true service | -0.732 | other bank | -0.850
other bank | -0.850 | possible moment | -0.668
inconveniently located | -1.541 | extra day | -0.286

How can a machine detect the positive/negative sentiment from texts? We look at the collocation of words like excellent & poor in a number of texts.
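In Turney (2002) a review is classified by the average semantic orientation of its phrases. The sketch below applies that rule to the two columns of the table; the variable and function names are illustrative only.

```python
# Semantic orientation scores copied from the two columns of the table.
left_column = {
    "online service": 2.780, "online experience": 2.253, "direct deposit": 1.288,
    "local branch": 0.421, "low fees": 0.333, "well other": 0.237,
    "small part": 0.053, "printable version": -0.705, "true service": -0.732,
    "other bank": -0.850, "inconveniently located": -1.541,
}
right_column = {
    "unethical practices": -8.484, "low funds": -6.843, "other problems": -2.748,
    "old man": -2.566, "lesser evil": -2.288, "virtual monopoly": -2.050,
    "probably wondering": -1.830, "little difference": -1.615,
    "other bank": -0.850, "possible moment": -0.668, "extra day": -0.286,
}

def average_orientation(scores):
    """Mean semantic orientation over all phrases of one review."""
    return sum(scores.values()) / len(scores)

def classify(scores):
    """Positive average: the reviewer likes it; otherwise not."""
    return "likes" if average_orientation(scores) > 0 else "dislikes"

for name, column in (("left", left_column), ("right", right_column)):
    print(name, round(average_orientation(column), 3), classify(column))
```

On these figures the left column averages slightly above zero and the right column well below it, so the left column is the one that says Punter Smith likes the new outfit.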

Note subjectivity: the analyst has chosen the pivotal words poor & excellent.

How well can the method be adapted to other domains?

Adaptive Information Extraction? For choosing the pivots automatically!

Japanese yen/US dollar exchange rate (decreasing solid line); US consumer price index (increasing solid line); Japanese consumer price index (increasing dashed line); 1970:1 to 2003:5, monthly observations

THINK TANK

Why does the Japanese consumer price index follow the same trend as the US CPI?

The return series: the first differences of the US $/Japanese yen exchange rate (Price_t - Price_{t-1}), 1970-2003, monthly data

THINK TANK

The volatility series: the four-week moving average of the square of the changes in the value of the US $/Japanese yen exchange rate (Price_t - Price_{t-1}), 1970-2003.

THINK TANK

High Volatility Clusters
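The return and volatility series described on these two slides can be sketched as follows. The price list is hypothetical and the window of four observations mirrors the "four-week" moving average; the function names are illustrative.

```python
def first_differences(prices):
    """Return series: Price_t - Price_{t-1}."""
    return [curr - prev for prev, curr in zip(prices, prices[1:])]

def moving_average_volatility(changes, window=4):
    """Moving average of squared changes over `window` observations."""
    squared = [c * c for c in changes]
    return [sum(squared[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(squared))]

prices = [100.0, 102.0, 101.0, 105.0, 104.0, 110.0]  # hypothetical exchange rates
changes = first_differences(prices)
volatility = moving_average_volatility(changes)
print(changes)
print(volatility)
```

Runs of large values in the second list correspond to the high-volatility clusters marked on the plot.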

THINK TANK

Robert Engle’s contribution: volatility may vary considerably over time; large (small) changes in returns are followed by large (small) changes.

Engle, R. F. (1982). Autoregressive conditional heteroscedasticity with estimates of the variance of United Kingdom inflation. Econometrica, Vol. 50, pp. 987-1007.
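A small simulation may help to see why large changes tend to follow large changes. The sketch below assumes the simplest ARCH(1) form, in which the conditional variance is a constant plus a multiple of the previous squared shock; the parameter values and the seed are arbitrary choices for illustration.

```python
import random

def simulate_arch1(n, alpha0=0.2, alpha1=0.7, seed=0):
    """Simulate n shocks whose variance depends on the previous squared shock."""
    rng = random.Random(seed)
    previous_shock = 0.0
    shocks, variances = [], []
    for _ in range(n):
        h = alpha0 + alpha1 * previous_shock ** 2   # conditional variance
        shock = rng.gauss(0.0, 1.0) * h ** 0.5      # shock drawn with that variance
        variances.append(h)
        shocks.append(shock)
        previous_shock = shock
    return shocks, variances

shocks, variances = simulate_arch1(500)
print(max(variances), min(variances))
```

Because each variance is driven by the size of the preceding shock, a large shock raises the chance of another large one, which is the clustering visible in the volatility plot.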

THINK TANK

Engle and Ng have developed the concept of the news impact curve.

The idea is to condition at time t on the information available at t − 2, and thus to consider the effect of the shock $\varepsilon_{t-1}$ on the conditional variance $h_t$ in isolation.

The conditional variance is affected by the latest information, “the news” $\varepsilon_{t-1}$:

The symmetric case: positive and negative news have the same effect.

The asymmetric case: a positive and an equally large negative piece of “news” do not have the same effect on the conditional variance.

Engle, R. F. and Ng, V. K. (1993). Measuring and testing the impact of news on volatility. Journal of Finance, Vol. 48, pp. 1749-1777.

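ARCH(1):

$$h_t = \alpha_0 + \alpha_1 \varepsilon_{t-1}^2$$

GARCH(1,1):

$$h_t = \alpha_0 + \alpha_1 \varepsilon_{t-1}^2 + \beta_1 h_{t-1}$$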
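A news impact curve can be traced by holding the lagged variance fixed and varying only the shock. The sketch below does this for a symmetric GARCH(1,1) variance and, as one assumed example of asymmetry, a GJR-style variant that adds an extra term for negative shocks. All parameter values are hypothetical.

```python
def impact_symmetric(shock, alpha0=0.1, alpha1=0.2, beta1=0.7, h_prev=1.0):
    """GARCH(1,1) conditional variance as a function of the latest shock."""
    return alpha0 + alpha1 * shock ** 2 + beta1 * h_prev

def impact_asymmetric(shock, alpha0=0.1, alpha1=0.2, beta1=0.7,
                      gamma=0.15, h_prev=1.0):
    """Same curve with an extra penalty (gamma) applied to negative shocks only."""
    extra = gamma * shock ** 2 if shock < 0 else 0.0
    return alpha0 + alpha1 * shock ** 2 + extra + beta1 * h_prev

for shock in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(shock, impact_symmetric(shock), impact_asymmetric(shock))
```

In the symmetric curve a shock of −2 and a shock of +2 give the same variance; in the asymmetric one the negative shock gives the larger variance.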

THINK TANK

Engle, R. F. and Ng, V. K. (1993). Measuring and testing the impact of news on volatility. Journal of Finance, Vol. 48, pp. 1749-1777.

[Figure: news impact curves, two panels: symmetric case; asymmetric case]

Rationality, Bounded Rationality and Sentiment

News Effects

I: News Announcements Matter, and Quickly;
II: Announcement Timing Matters;
III: Volatility Adjusts to News Gradually;
IV: Pure Announcement Effects are Present in Volatility;
V: Announcement Effects are Asymmetric: Responses Vary with the Sign of the News;
VI: The effect on traded volume persists longer than on prices.

Andersen, T. G., Bollerslev, T., Diebold, F X., & Vega, C. (2002). Micro effects of macro announcements: Real time price discovery in foreign exchange. National Bureau of Economic Research Working Paper 8959, http://www.nber.org/papers/w8959

Rationality, Bounded Rationality and Sentiment

The following statements are based entirely on statistical analysis of quantitative data:

Bad news in “good times” should have an unusually large impact

In a purely ‘good times’ sample “bad news should have unusually large effects,”

Rationality, Bounded Rationality and Sentiment

On average, the effect of macroeconomic news often varies with its sign. In particular, negative surprises often have greater impact than positive surprises.

Andersen, T. G., Bollerslev, T., Diebold, F X., & Vega, C. (2002). Micro effects of macro announcements: Real time price discovery in foreign exchange. National Bureau of Economic Research Working Paper 8959, http://www.nber.org/papers/w8959

Rationality, Bounded Rationality and Sentiment

So, where is the news? What is analysed is not the news itself but the timing of the announcement: the timings are used as an information proxy.

Andersen, T. G., Bollerslev, T., Diebold, F X., & Vega, C. (2002). Micro effects of macro announcements: Real time price discovery in foreign exchange. National Bureau of Economic Research Working Paper 8959, http://www.nber.org/papers/w8959

Rationality, Bounded Rationality and Sentiment

Firm-level Information Proxies:

Closed-end fund discount (CEFD);
Turnover ratio, on the NYSE for example (TURN);
Number of initial public offerings (NIPO);
Average first-day returns on IPOs (RIPO);
Equity share (S);
Dividend premium (P);
Age of the firm, external finance, ‘size’ (log(equity)), ...

Each sentiment proxy is likely to include a sentiment component as well as idiosyncratic or non-sentiment-related components. Principal components analysis is typically used to isolate the common component.

A novel composite index built using Factor Analysis:

$$\mathrm{Sentiment}_t = -0.358\,\mathrm{CEFD}_t + 0.402\,\mathrm{TURN}_{t-1} + 0.414\,\mathrm{NIPO}_t + 0.464\,\mathrm{RIPO}_t + 0.371\,S_t - 0.431\,P_{t-1}$$
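The composite index is a plain weighted sum of the proxies, as the sketch below shows. The weights are those quoted on the slide; the dictionary keys and the input values are hypothetical, and the inputs are assumed to be standardised beforehand.

```python
# Weights as quoted on the slide; "_lag" marks a proxy taken at t-1.
WEIGHTS = {
    "CEFD": -0.358,
    "TURN_lag": 0.402,
    "NIPO": 0.414,
    "RIPO": 0.464,
    "S": 0.371,
    "P_lag": -0.431,
}

def sentiment_index(proxies):
    """Weighted sum of (assumed standardised) sentiment proxies."""
    return sum(weight * proxies[name] for name, weight in WEIGHTS.items())

# Hypothetical observation: every proxy one standard deviation above its mean.
one_sd_above = {name: 1.0 for name in WEIGHTS}
print(round(sentiment_index(one_sd_above), 3))
```

The negative weights on the closed-end fund discount and the dividend premium mean that a rise in either lowers the index.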

Baker, M., and Wurgler, J. (2004). "Investor Sentiment and the Cross-Section of Stock Returns," NBER Working Papers 10449, Cambridge, Mass National Bureau of Economic Research, Inc.

Rationality, Bounded Rationality and Sentiment

So, where is the news and financial data? There is plenty of it, but in a noisy state. Today’s news and figures may contradict yesterday’s or, worse still, reinforce false hopes and prejudices. The financial news and data are truly organic data, not manufactured in a laboratory.

Numerical data: time series of price/value movements of financial instruments; c. 5 MB/day, per instrument

Textual data: text streams in different genres (news items; financial reports; company brochures; government documents; market sentiment surveys; interviews); c. 20 MB/day

The Surrey Society Grids Project

A 24-node data and compute cluster (64 CPUs) interfaced to a ‘real world’ data stream (Reuters News and Financial Time Series Feed) for capturing, analysing and fusing quantitative and ‘qualitative’ data.

Reuters Feed: 2 dedicated data lines, with a PC and a Sun for feed management and associated networking

A small but well-formed grid – for creating a data nursery

Surrey Society Grids Architecture

[Figure: architecture diagram. Components: streaming textual data; streaming numeric data; text and time series service; main cluster; grid cluster of 24 slaves. Numbered steps 1 to 4, labelled: send service request; distribute tasks; receive results; notify user about results.]

Surrey Grid

• Given an allocated task, the corresponding data is retrieved from the data providers by the slave machines.
• The main cluster monitors the slave machines until they have completed their tasks, and subsequently combines the interim results.
• The final result is sent back to the client machine.
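The distribute, monitor and combine cycle described above can be sketched on a single machine. In this illustration a thread pool stands in for the slave machines and word counting stands in for the allocated task; both are assumptions made only to keep the example self-contained.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_words(texts):
    """Task run by one worker: count word frequencies in its share of texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts

def run_job(texts, n_workers=4):
    """Distribute the texts, wait for all workers, then combine interim results."""
    chunks = [texts[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        interim = list(pool.map(count_words, chunks))
    total = Counter()
    for partial in interim:
        total.update(partial)
    return total

result = run_job(["shares rose", "shares fell", "profit rose"], n_workers=2)
print(result.most_common(2))
```

The combining step is cheap here because word counts simply add up; tasks whose interim results do not merge this easily would need a different final step.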

Surrey Society Grids: Streaming Data

STREAMING ECONOMIC/POLITICAL NEWS: Reuters; Yahoo; Bloomberg; BBC; Al Jazeera

Surrey Society Grid: Performance

Increasing the throughput

We have created a 24-node grid infrastructure, which can provide access to up to 64 processors simultaneously.

Processing the (complete) RCV1 corpus, 181 million words in 806,791 texts:

No. of processors | Time (seconds)
1 (Dell PowerEdge 2650) | 53000
16 | 3572
64 | 1683
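The timings in the table can be turned into speedup and parallel efficiency figures. The sketch below uses the usual definitions (speedup is single-processor time divided by parallel time; efficiency is speedup divided by the number of processors); the function names are illustrative.

```python
# Processing times in seconds, taken from the table.
timings = {1: 53000, 16: 3572, 64: 1683}

def speedup(processors):
    """How many times faster than the single-processor run."""
    return timings[1] / timings[processors]

def efficiency(processors):
    """Speedup per processor (1.0 would be perfect scaling)."""
    return speedup(processors) / processors

for p in (16, 64):
    print(p, round(speedup(p), 1), round(efficiency(p), 2))
```

On these figures the 16-processor run scales almost linearly, while the 64-processor run is about 31 times faster than one processor, roughly half of ideal.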

Surrey Society Grid: Performance

Automatic extraction and annotation of sentiment-bearing words in a 1,000,000-word text corpus (four days’ output from the Reuters news feed), using automatically extracted key words and an automatically extracted local grammar for pattern identification.

[Figure: number of words (y-axis, 0 to 450) against hours from midnight, Nov. 15th, 2004 (x-axis, 0 to 42). Series: Filtered Positive; Filtered Negative.]

Surrey Society Grid: Algorithms and Methods

We have developed a means of visualising and correlating the sentiment and instrument time series, both as text (and numbers) and graphically.

Surrey Society Grid: Algorithms and Methods

Interface the grid to local news media (e.g. Bradford Argus & Burnley Express) and local data repositories – crime statistics (crime surveys and police data), ethnicity compliance data, housing queues, field data

Surrey Society Grid: Social Science Data?

The real world | Genre
News Reports; Regulatory Body Reports | Informative
Commentaries; Letters to the Editors; Rumour-laden e-mails | Appellative
Semi-structured interviews; Confidence Surveys | Expressive

Language and text are constitutive (and not merely representational): but ‘society is not reducible to language and linguistic analysis (Hodgson 2000:62). Discourses are broader than language, being constituted not just in texts, but also in definite institutional and organizational practices’ (Jackson 2004). But text is all we have after the event, the interview, the survey.

Surrey Society Grid: Social Science Data?

Financial Economics | Sociology of Crime; Crime Science | Social Anthropology

Macro-micro Economic Indicators; Census Statistics; Survey of Social Attitudes; Life-style and Well-being Statistics;

Market Movement | Crime Statistics | Ethnicity-related data

Political News (Reports, Editorials, Letters to the Editor); Political and Social Opinion Polls; Consumer Confidence Survey;

Investor/Trader Confidence Surveys; Regulatory Body Output; Financial News; | Citizen Confidence Surveys; Police Forces/Home Office Reports; Crime Reports; | Ethnic Minority Surveys; Police Forces/Home Office Reports; Crime Reports;

Surrey Society Grid: Social Science Data?

• There is no visible technique in social science research methodology that can improve the researcher’s productivity in collecting and analysing large volumes of speech and text.

• Social scientists survey, and occasionally interview, interesting individuals in various social groups, then analyse the survey forms and quantify the results.

• So what about the data collected in the field? The data is buried in tombs, never to be taken out again.

• Most text, if it is coded at all, is hand-coded by the social science researcher, and then the proxy of the interpretation of the codes is presented as objective analysis.

Surrey Society Grid: A Case Study

• We present a method for systematically identifying sentiment-bearing phrases in large volumes of streaming texts: a local grammar comprising templates to extract the phrases with a minimal number of false positives.

• The sentiments are aligned with quantitative (time-varying) information, and the results are co-integrated and tested for Granger causality.

• The grammar itself is constructed automatically from a corpus of domain-specific texts.
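The idea behind a Granger causality test can be sketched with a single lag: does adding yesterday's sentiment reduce the error of predicting today's price change from yesterday's price change alone? The code below is a simplified illustration in pure Python, not the procedure used in the project; it omits lag selection, stationarity checks and co-integration, and the data are simulated.

```python
import random

def ols_rss(rows, targets):
    """Residual sum of squares of a least-squares fit (normal equations)."""
    k = len(rows[0])
    # Augmented normal-equation matrix [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(rows, targets))] for i in range(k)]
    for col in range(k):                      # Gauss-Jordan elimination
        pivot = max(range(col, k), key=lambda i: abs(a[i][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(k):
            if row != col:
                factor = a[row][col] / a[col][col]
                a[row] = [u - factor * v for u, v in zip(a[row], a[col])]
    beta = [a[i][k] / a[i][i] for i in range(k)]
    return sum((t - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, t in zip(rows, targets))

def granger_f(x, y):
    """F statistic for 'lagged x helps to predict y', using one lag."""
    targets = y[1:]
    restricted = [[1.0, y[t - 1]] for t in range(1, len(y))]
    unrestricted = [[1.0, y[t - 1], x[t - 1]] for t in range(1, len(y))]
    rss_r = ols_rss(restricted, targets)
    rss_u = ols_rss(unrestricted, targets)
    n, k = len(targets), 3
    return (rss_r - rss_u) / (rss_u / (n - k))

# Simulated data in which price changes follow the previous sentiment value.
rng = random.Random(0)
sentiment = [rng.gauss(0, 1) for _ in range(200)]
price_changes = [0.0] + [0.9 * sentiment[t - 1] + 0.1 * rng.gauss(0, 1)
                         for t in range(1, 200)]
f_stat = granger_f(sentiment, price_changes)
print(round(f_stat, 1))
```

The statistic would be compared with the critical value of an F distribution with 1 and n − 3 degrees of freedom; a large value suggests that lagged sentiment carries predictive information.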

Surrey Society Grid: A Case Study

Of all the contested boundaries that define the discipline of sociology, none is more crucial than the divide between sociology and economics […] Talcott Parsons, for all [his] synthesizing ambitions, solidified the divide. “Basically,” […] “Parsons made a pact ... you, economists, study value; we, the sociologists, will study values.”

If the financial markets are the core of many high-modern economies, so at their core is arbitrage: the exploitation of discrepancies in the prices of identical or similar assets. Arbitrage is pivotal to the economic theory of financial markets. It allows markets to be posited as efficient without all individual investors having to be assumed to be economically rational.

MacKenzie, Donald. 2000b. “Long-Term Capital Management: a Sociological Essay.” In Herbert Kalthoff, Richard Rottenburg and Hans-Jürgen Wagener (Eds.), Ökonomie und Gesellschaft. Marburg: Metropolis. pp. 277-287.

Rationality, Bounded Rationality and Sentiment

A financial economist can analyse quantitative data using a large body of methods and techniques in statistical time series analysis on “fundamental data”, related, for example, to fixed assets of an enterprise, and on “technical data”, for example, share price movement;

The economist can study the behaviour of a financial instrument, for example individual shares or currencies, or aggregated indices associated with stock exchanges, by looking at the changes in the value of the instrument at different time scales, ranging from minutes to decades;

Financial investors/traders are trying to discover the market sentiment, looking for consensus in expectations, rising prices on falling volumes, and information/assistance from back-office analysts;

The efficient market hypothesis suggests that quirks caused by sentiments can be rectified by the supposed inherent rationality of the majority of the players in the market.

Rationality, Bounded Rationality and Sentiment

Recent developments in financial economics, signified by the emergence of derivatives and arbitrage, show the triumph of rational reasoning: such instruments/strategies were created on the basis of mathematical models (Black and Scholes 1972), and the trading can be monitored using the selfsame models (Miller 1990);

The assumption of overarching rational behaviour has been reviewed by Herbert Simon (1978/1992) and Daniel Kahneman (2003), and arguments have been presented in favour of a model of bounded rationality, where the actors in a given social situation prefer to ignore facts and trust their own version of reality, and the efficient market mechanisms fail to operate.

News Analysis and Sentiment Analysis

Qualitative research methods are being used in financial economics, and in sociological studies of financial markets, for systematically studying the hopes and fears of the traders, investors, and regulators in the analysis of the behaviour of the markets.

Since 2000, the analysis of news wire has become selective and targeted. Some researchers:

choose news related to economic and financial topics;
choose news about employment;
distinguish between scheduled and non-scheduled news announcements.

News Analysis and Sentiment Analysis

Some pre-select keywords that indicate change in the value of a financial instrument, including metaphorical terms like above, below, up and down, and use them to ‘represent’ positive/negative news stories.

Some use the frequency of collocational patterns for assigning a ‘feel-good/bad’ score to the story:

‘Good’ news stories appear to comprise collocates like revenues rose, share rose;

‘Bad’ news stories contain profit warning, poor expectation;

‘Neutral’ stories contain collocates such as announces product, alliance made;

The ‘sentiment’ of the story is then correlated with that of a financial instrument cited in the stories and inferences made.
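The collocation-counting approach described on this slide can be sketched as below. The collocate lists are just the examples quoted above, and the scoring rule (good count minus bad count) is an assumed simplification rather than the scheme of any of the cited studies.

```python
GOOD_COLLOCATES = ("revenues rose", "share rose")
BAD_COLLOCATES = ("profit warning", "poor expectation")

def story_score(text):
    """Count of 'good' collocates minus count of 'bad' collocates in a story."""
    lowered = text.lower()
    good = sum(lowered.count(c) for c in GOOD_COLLOCATES)
    bad = sum(lowered.count(c) for c in BAD_COLLOCATES)
    return good - bad

print(story_score("Revenues rose and the share rose too."))
print(story_score("A profit warning followed poor expectation."))
print(story_score("The firm announces product and an alliance made."))
```

A time series of such scores is what would then be correlated with the price series of the instrument cited in the stories.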

DeGennaro, R., and R. Shrieves (1997). ‘Public information releases, private information arrival and volatility in the foreign exchange market’. Journal of Empirical Finance, Vol. 4, pp. 295-315.

Koppel, M. and Shtrimberg, I. (2004). ‘Good News or Bad News? Let the Market Decide’. In AAAI Spring Symposium on Exploring Attitude and Affect in Text. Palo Alto: AAAI Press, pp. 86-88.

A method for identifying and extracting sentiment

No proxies, but the real data.

We adopt a text-driven and bottom-up method: starting from a collection of texts in a specialist domain, together with a representative general language corpus, a five-step algorithm identifies discourse patterns with more or less unique meanings, without any overt access to an external knowledge base.

An algorithm for identifying and extracting sentiment

I. Select training corpora: a randomly sampled special language corpus and a general language corpus;
II. Extract key words;
III. Extract key collocates;
IV. Extract local grammar using collocation analysis and relevance feedback;
V. Assert the grammar as a finite state automaton.

Experiments and Evaluation of sentiment analysis method

I. Select training corpora

Training corpora:

The British National Corpus, comprising 100 million tokens distributed over 4,124 texts (Aston and Burnard 1998);

Reuters Corpus Volume 1 (RCV1), comprising news texts produced in 1996-1997 and containing 181 million words distributed over 806,791 texts.

Experiments and Evaluation of sentiment analysis method

II. Extract key words

The frequencies of individual words in the RCV1 were computed using System Quirk;

To describe how our method works we will use a randomly selected component of the corpus, the output of February 1997, henceforth referred to as the RCV1-Feb97 corpus;

The RCV1-Feb97 corpus contains 14 million words distributed over 63,364 texts.

Experiments and Evaluation of sentiment analysis method

Ranks | RCV1-Feb97 (N_RCV1Feb97 = 14 million) | Cumulative number of tokens (%) | British National Corpus (N_BNC = 100 million) | Cumulative number of tokens (%)
1-10 | the, to, of, in, a, and, said, on, s, for | 0.87 M (21.3%) | the, of, and, a, in, to, for, is, as, that | 22.3 M (22.3%)
11-20 | at, that, was, is, it, by, with, from, percent, be | 0.28 M (6.8%) | was, I, on, with, as, be, he, you, at, by | 6.51 M (6.5%)
21-30 | as, he, million, year, its, will, but, has, would, were | 0.17 M (4.2%) | are, this, have, but, not, from, had, his, they, or | 4.23 M (4.2%)
31-40 | an, not, are, have, which, had, up, n, new, market | 0.13 M (3.3%) | which, an, she, where, here, we, one, there, all, been | 3.05 M (3.1%)
41-50 | this, we, after, one, last, company, u, they, bank, government | 0.10 M (2.6%) | their, if, has, will, so, would, no, what, can, when | 2.35 M (2.4%)

Experiments and Evaluation of sentiment analysis method

RCV1-Feb97: N_RCV1Feb97 = 14,244,349; BNC: N_BNC = 100,000,000

Token | Rank (RCV1-Feb97) | f_RCV1Feb97 | f_RCV1Feb97 / N_RCV1Feb97 (a) | Rank (BNC) | f_BNC | f_BNC / N_BNC (b) | Weirdness (a/b)
percent | 19 | 65763 | 0.462% | 3394 | 2928 | 0.003% | 157.84
market | 40 | 36349 | 0.255% | 301 | 30078 | 0.030% | 8.49
company | 46 | 29058 | 0.204% | 219 | 40118 | 0.040% | 5.09
bank | 49 | 28041 | 0.197% | 562 | 17932 | 0.018% | 10.99
shares | 56 | 23352 | 0.164% | 1285 | 8412 | 0.008% | 19.51
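The weirdness column is the ratio of a word's relative frequency in the specialist corpus to its relative frequency in the general corpus. The sketch below recomputes it from the frequencies and corpus sizes in the table; results agree with the table to within rounding of the inputs, so small differences in the last digits are to be expected.

```python
N_SPECIAL = 14_244_349    # tokens in RCV1-Feb97
N_GENERAL = 100_000_000   # tokens in the BNC

# (frequency in RCV1-Feb97, frequency in BNC), copied from the table.
frequencies = {
    "percent": (65763, 2928),
    "market": (36349, 30078),
    "company": (29058, 40118),
    "bank": (28041, 17932),
    "shares": (23352, 8412),
}

def weirdness(f_special, f_general):
    """Relative frequency in the special corpus over that in the general corpus."""
    return (f_special / N_SPECIAL) / (f_general / N_GENERAL)

scores = {word: weirdness(*pair) for word, pair in frequencies.items()}
for word, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(word, round(score, 2))
```

Words with a weirdness well above 1 are far more typical of the news corpus than of general English, which is what makes them candidate key words.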

Experiments and Evaluation of sentiment analysis method

III. Extract key collocates

Word | f | Left | Right | Total | z-score
percent | 65763 | | | |
up | 5315 | 4360 | 955 | 5315 | 15.91
rose | 4361 | 3988 | 373 | 4361 | 13.04
rise | 2391 | 980 | 1411 | 2391 | 7.12
down | 2291 | 1636 | 655 | 2291 | 6.82
fell | 2074 | 1844 | 230 | 2074 | 6.17
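One common way to score collocates is to standardise their co-occurrence frequencies, so that a collocate stands out when its frequency lies many standard deviations above the mean for all words seen near the node word. The sketch below assumes that definition; the slide does not state the exact formula used, and the counts are hypothetical.

```python
import statistics

def collocate_z_scores(cooccurrence_counts):
    """Standardise each collocate's frequency against all collocates of the node."""
    values = list(cooccurrence_counts.values())
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    return {word: (count - mean) / spread
            for word, count in cooccurrence_counts.items()}

# Hypothetical counts of words found within a few positions of "percent".
counts = {"up": 9, "rose": 7, "of": 2, "to": 2}
z_scores = collocate_z_scores(counts)
for word, z in sorted(z_scores.items(), key=lambda item: -item[1]):
    print(word, round(z, 2))
```

Collocates above a chosen threshold would be kept as key collocates; the threshold itself is a design choice not specified here.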

Experiments and Evaluation of sentiment analysis method

IV. Extract local grammar using collocation and relevance feedback

Pattern | f | Collocate | Left | Right | z-score
10 percent to | 108 | rose | 24 | 0 | 5.45
by 10 percent to | 18 | rose | 5 | 0 | 2.27
rose 10 percent to | 14 | billion | 0 | 7 | 4.24
rose 20 percent to | 11 | billion | 1 | 7 | 6.02


Experiments and Evaluation of sentiment analysis method

V. Assert the grammar as a finite state automaton

The (re-)collocation patterns can then be asserted as finite state automata for each of the movement verbs and spatial preposition metaphors
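A minimal sketch of one such automaton, accepting phrases of the form "movement verb, optional 'by', number, 'percent'". The token classes (UP, DOWN, NUM), the word lists, the state numbering and the positive/negative labels are all assumptions for illustration and are not taken from the slides.

```python
import re

NUM = re.compile(r"^\d+(\.\d+)?$")
UP = {"rose", "rise", "up"}      # assumed upward-movement words
DOWN = {"fell", "fall", "down"}  # assumed downward-movement words


def token_class(tok):
    """Map a token to the symbol the automaton reads."""
    t = tok.lower()
    if NUM.match(t):
        return "NUM"
    if t in UP:
        return "UP"
    if t in DOWN:
        return "DOWN"
    return t


# (state, symbol) -> next state; state 0 is the start, state 4 accepts.
TRANSITIONS = {
    (0, "UP"): 1, (0, "DOWN"): 1,
    (1, "by"): 2, (1, "NUM"): 3,
    (2, "NUM"): 3,
    (3, "percent"): 4,
}
ACCEPT = {4}


def find_matches(tokens):
    """Scan the tokens and return (polarity, phrase) for each accepted span."""
    matches, i = [], 0
    while i < len(tokens):
        state, j, polarity = 0, i, None
        while j < len(tokens):
            cls = token_class(tokens[j])
            nxt = TRANSITIONS.get((state, cls))
            if nxt is None:
                break
            if state == 0:
                polarity = "positive" if cls == "UP" else "negative"
            state, j = nxt, j + 1
            if state in ACCEPT:
                matches.append((polarity, " ".join(tokens[i:j])))
                break
        # Resume after an accepted phrase, otherwise move one token on.
        i = j if state in ACCEPT else i + 1
    return matches


text = "Shares rose by 10 percent to 5 billion while profits fell 3 percent"
print(find_matches(text.split()))
```

Each further pattern learned from the collocation tables would add transitions to the same table, so the grammar grows without changing the matching loop.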


Experiments and Evaluation of sentiment analysis method

• The local grammar is used to identify sentences that contain sentiment-bearing phrases and can automatically annotate the phrases.
• The graph shows the filtering power of the local grammar patterns: from between 1,000 and 10,000 sentiment words identified hourly, in a corpus of between 10,000 and 100,000 tokens per hour, the patterns find between 10 and 100 'true' sentiment-bearing sentences.

[Figure: number of words (log scale) against hours from midnight, Nov. 15th, 2004. Series: Raw Sentiment; Filtered Sentiment; Total number of Tokens.]
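The three series in the graph (total tokens, raw sentiment words, filtered sentiment) can be tallied per hour roughly as follows. This is a sketch under stated assumptions: the small lexicon `SENTIMENT_WORDS`, the single regular expression standing in for the local grammar, and the `(hour, sentence)` input format are hypothetical and much simpler than the patterns described above.

```python
import re
from collections import defaultdict

# Assumed lexicon: the five collocates of "percent" listed earlier.
SENTIMENT_WORDS = {"up", "rose", "rise", "down", "fell"}

# Assumed stand-in for one local grammar pattern.
PATTERN = re.compile(
    r"\b(rose|fell|up|down)\s+(?:by\s+)?\d+(?:\.\d+)?\s+percent\b",
    re.IGNORECASE,
)


def hourly_counts(items):
    """items: iterable of (hour, sentence). Returns {hour: counts}."""
    counts = defaultdict(lambda: {"tokens": 0, "raw": 0, "filtered": 0})
    for hour, sentence in items:
        tokens = sentence.lower().split()
        c = counts[hour]
        c["tokens"] += len(tokens)
        # Raw sentiment: every lexicon word, whatever its context.
        c["raw"] += sum(t in SENTIMENT_WORDS for t in tokens)
        # Filtered sentiment: only words used inside a grammar pattern.
        c["filtered"] += len(PATTERN.findall(sentence))
    return dict(counts)


news = [
    (0, "Shares rose 10 percent as the market opened up"),
    (0, "He fell down the stairs"),
    (1, "Profit fell by 3 percent"),
]
for hour, c in sorted(hourly_counts(news).items()):
    print(hour, c)
```

The second sentence shows the point of the filter: it contains two lexicon words but no sentiment-bearing phrase, so it adds to the raw count only.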


Experiments and Evaluation of sentiment analysis method

Changes in the total number of positive/negative words together with those that are used in the local grammars (filtered positive / negative words) and total number of words.

[Figure: number of words (log scale) against hours from midnight, Nov. 15th, 2004. Series: Raw Positive Words; Raw Negative Words; Filtered Positive Words; Filtered Negative Words; Total Number of Words.]


Experiments and Evaluation of sentiment analysis method

Changes in the total number of positive/negative words together with those that are used in the local grammars (filtered positive / negative words) and total number of words.

[Figure: number of words (linear scale, 0 to 450) against hours from midnight, Nov. 15th, 2004. Series: Filtered Positive; Filtered Negative.]


Experiments and Evaluation of sentiment analysis method

Increasing the throughput

We have created a 24-node grid infrastructure, which can provide access to up to 64 processors simultaneously.

Processing the (complete) RCV1 corpus (181 million words in 806,791 texts) on a single machine (a Dell PowerEdge 2650) takes 53,300 seconds.

Using 16 processors, we gain a throughput increase by a factor of 15 (3,572 seconds).

Using 64 processors, the time is roughly halved again (1,683 seconds).
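The speed-up figures follow directly from the three timings. The short calculation below also derives parallel efficiency (speed-up divided by processor count), which the slide does not report; treating the single-machine run as a one-processor baseline is an assumption.

```python
T_SINGLE = 53300  # seconds on a single machine, from the slide

# processors -> seconds, from the slide
timings = {16: 3572, 64: 1683}


def speedup_and_efficiency(t_single, t_parallel, processors):
    """Speed-up relative to the single run, and speed-up per processor."""
    speedup = t_single / t_parallel
    return speedup, speedup / processors


results = {
    p: speedup_and_efficiency(T_SINGLE, t, p) for p, t in timings.items()
}

for p, (s, e) in results.items():
    print(f"{p:3d} processors: speed-up {s:5.2f}, efficiency {e:4.2f}")
```

The speed-up is about 14.9 on 16 processors and about 31.7 on 64, so efficiency drops from roughly 0.93 to roughly 0.49: quadrupling the processors only about doubles the throughput.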


Conclusions and Future Work

Though we have devised programs that can learn unambiguous patterns of use of positive or negative sentiment, a sentence is always used in the context of other sentences, and an inference made on the basis of one sentence only may change once that context is taken into account;

One can argue that a new text is a response to some or all of the existing texts, and in that sense each text is contextualised within a network of other texts: even if all the existing texts unambiguously expressed a positive sentiment, a new text with strong negative sentiment may invalidate all of the positive sentiment.


Conclusions and Future Work

Range of quantitative analysis techniques includes wavelet analysis (Ahmad et al 2004), fuzzy-logic knowledge bases (Poopola et al 2004), and case-based reasoning;

These techniques may be used to create a confidence index – or sentiment index;

These techniques can be extended to new areas such as the reassurance gap in policing and the totalising war discourse that leads to ethnic/racial conflicts


Conclusions and Future Work

Quantitative analysis methods developed in the Surrey Society Grids project can be used in the analysis of on-line or accessible data such as crime statistics, for sociology of crime, and labour force surveys, based on race/ethnicity for anthropology;

The fusion of the results of the textual and quantitative analysis can, in turn, be used to automatically produce a crime confidence index, for measuring the fear of crime, and a conflict index, for measuring ethnic/racial tension in a community;


Conclusions and Future Work

Data Sources

Disciplines: Financial Economics | Sociology of Crime; Crime Science | Social Anthropology

Quantitative
All disciplines: Macro-micro Economic Indicators; Census Statistics; Survey of Social Attitudes; Life-style and Well-being Statistics
Financial Economics: Market Movement
Sociology of Crime; Crime Science: Crime Statistics
Social Anthropology: Ethnicity-related data

Qualitative
All disciplines: Political News – Reports, Editorials, Letters to the Editor; Political and Social Opinion Polls; Consumer Confidence Survey
Financial Economics: Investor/Trader Confidence Surveys; Regulatory Body Output; Financial News
Sociology of Crime; Crime Science: Citizen Confidence Surveys; Police Forces/Home Office Reports; Crime Reports
Social Anthropology: Ethnic Minority ; Police Forces/Home Office Reports; Crime Reports


Columns: Investor Psychology | Sociology of Crime | Anthropology of Ethnicity | Methods/Techniques

Qualitative data
Informative: Financial News and Reports; State-of-the-Economy Reports; Company Reports. | National News Reportage & Editorials; Police Authority & Other Reports; Policy Documents | National and International News Reportage & Editorials; Local Govt. Reports; Policy Documents | Corpus Ling. & IE: Terminological, Grammatical and Ontological Analysis for Identifying and Disambiguating sentiment and named entities
Appellative: News Commentaries on financial instruments. | 'Letters to the Editor'; Web Sites | 'Letters to the Editor'; Web Sites | Ditto
Expressive: Focus Group Encounters | Semi-structured interviews | Semi-structured interviews | Discourse Analysis
Executive movements; corporate entity identification | Anonymisation of field data | IE: Named Entity extractors

Quantitative data
High Frequency (Numerical): Technical Data (e.g. Stock Price Movement; Price/Earning Ratio) | Crime Statistics | Labour Force Surveys; Educational Achievement Surveys | Wavelet analysis; Monte-Carlo type bootstrapping
Low Frequency (Numerical): Company demographics – fixed assets | UK census data | UK census data | Data Analysis; Aggregation; Visualisation; Case-Based Reasoning (CBR)
Indeterminate: Questionnaires | Questionnaires | Questionnaires | Ditto

Fusion
Confidence Index | Crime Index | Conflict Index | Data Mining; Visualisation techniques
Investment decision (buy/sell) | Policy formation / evaluation | Policy formation / evaluation | Ontology learning for Rule-based / Case-Based Reasoning
