A Framework to Integrate Unstructured and Structured Data for Enterprise Analytics. Proceedings of Fusion 2013, fusion.isif.org/proceedings/fusion2013/html/pdf/, Friday, 12 July.

A Framework to Integrate Unstructured and Structured Data for Enterprise Analytics

Lipika Dey, Ishan Verma, Arpit Khurdiya, Sameera Bharadwaja H.

TCS Innovation Labs

Tata Consultancy Services Ltd.

New Delhi, India

(lipika.dey, ishan.verma, arpit.khurdiya, sameera.bharadwaja)@tcs.com

Abstract—It is well-accepted that when information from structured and unstructured data sources is analyzed together, the potential of gaining meaningful insights increases manifold. This paper provides a framework for integrating structured and unstructured data in the context of enterprise analytics. Structured data is assumed to be in the form of a time-series that encodes some aspect of enterprise performance over a specified period, like monthly or weekly sales figures or stock prices. Unstructured data may be gathered from news sources, internal repositories of consumer feedback, blogs and discussion forums, or from social media like Twitter and Facebook. This paper focuses on intelligent methods of linking time-series data points to unstructured content in an application-specific way, such that the linked unstructured text creates a context for interpreting the time-series behavior. The aim is to generate new forms of data that can later be employed to derive predictive models, perform causal analytics, or aid risk assessment for enterprises.

Keywords—Information Fusion, Data Association, Fusion Enabled Decision Support, Time-series Analysis

I. INTRODUCTION

Enterprise decision making and strategizing are largely influenced by figures emerging from large volumes of structured data like weekly sales figures, daily stock prices, monthly or weekly market share, and rises or falls in customer-satisfaction indices. However, successful decision making also depends on decision makers' capability to assess the surrounding environment, which is likely to influence business in a major way. The signals to be caught from the environment may be related to world politics, global or regional economic policies, the competition landscape, socio-political changes in different parts of the world, actions by major stakeholders, and so on. Most of these signals can usually be obtained from news. Unstructured data like news, blogs, market reports, and social media contain a wealth of information that can contribute significantly towards the interpretation of structured data, when the two are fused in a meaningful way. For example, deviations from expected figures in sales data have traditionally been interpreted in conjunction with weather reports and information about natural calamities. Similarly, deviations in the market share of a product can often be interpreted in conjunction with events reported in social media, news, or competitive intelligence, like competitor price changes, celebrity promotions, product launches, etc.

Getting all or most of the required data for decision making is not really difficult. It is, however, not easy to get the relevant data in a timely fashion without being subjected to information overload. The biggest challenge, of course, is to assess the impact of all the events.

We show that it is possible to design computational aids that can address some of the above challenges. Bringing in large volumes of data from a variety of sources as and when it is generated has been facilitated by emerging technologies like MPP databases and Hadoop, falling under the umbrella of Big Data tools. However, it is not enough merely to provide unified access to the data. There exists no systematic framework that facilitates the integration of the two types of information in a generic way. Our aim in this work is to provide that framework.

We did not come across much research in this area. Most of the earlier works on integration of structured and unstructured data have focused on entity-based indexing to correlate the two types of information. A considerable amount of literature exists on linking unstructured documents like complaint logs, call transcripts, emails, and reports to transaction records or similar structured elements via entity names [1]. It may be noted that the focus of the above-mentioned works has been on retrieval of related information based on entity matching and resolution, and not on causal analytics. Besides, the unstructured documents used in those scenarios have been restricted to enterprise content only. On the other hand, though there has been a lot of work on extracting information about a single entity from multiple web documents [2], these works have not focused on linking the extracted information to structured data for interpretation and inference.

In this paper, we present a framework that facilitates integrated analysis of structured and unstructured data in the context of an enterprise. The objective of the framework is to allow data acquisition from multiple heterogeneous sources and automate the process of knowledge discovery through correlation of information components extracted from the data. The framework exploits text processing and mining techniques for information extraction from unstructured sources. The framework is equipped with a knowledge representation scheme to store and correlate information components extracted from heterogeneous sources. It supports event-based contextual correlation of information extracted from both structured and unstructured data.

Fig. 1. Stock chart for Wipro, an Indian IT company, showing a massive fall on January 18th, 2013 (large red box on right of the chart).

The overall framework is envisaged as an evolutionary analytics platform that can learn to perform causal analytics through implicit learning from human interactions and feedback. Causal analytics on time-series data in conjunction with current events can lead to novel predictive and risk-assessment models. While it is too early to build such models in the absence of enough data, the framework provides an ideal platform to collect such data for model building and validation. The reasoning components of the framework that will perform causal analytics are yet to be built.

The key contribution of the present paper lies in proposing methodologies for extraction, classification, evaluation, and correlation of events from diverse data sources. We show how generic text-mining techniques can be employed to identify significant events from large collections of documents, which can then be contextually correlated to time-series events. Since contextual correlations do not by themselves indicate a causal relationship among events, we present some results on the quality of the correlations using human evaluations.

The rest of the paper is organized as follows. Section II presents a review of related work. Section III describes an example of how event-based knowledge discovery aids enterprise analytics. The proposed framework for event-based information fusion is described in Section IV. Section V describes the extraction technique and the representation and characterization model for world events. Section VI describes the representation of derived time-series events. Section VII presents a few results and observations. Section VIII concludes the paper with a brief touch upon future work.

II. REVIEW OF RELATED WORK

As mentioned earlier, this is a relatively new area, which has gained prominence due to the availability of large heterogeneous collections of data from multiple sources. Past efforts in linking text data and structured data had been mostly in the context of transactional systems. We present an overview of some of the related work that fuses data from multiple sources. Some systems, such as EROCS [3] and LIPTUS [4], link structured and unstructured data with the help of entity resolution. EROCS was designed to link a given text document with relevant structured data based on named entities that were also values in the database records. It considers structured data as a predefined set of entities and identifies the best-matched entity for a given document. Entities referred to customer names, items bought, stores visited, and so on. Similarly, LIPTUS associates customer interactions recorded in the form of emails and transcribed phone calls with a customer and account profiles database. This was also meant to associate details of communication embedded in text with customer identities recorded in databases.

Application of time-series analysis [6] on market data for pattern identification and prediction [7-9] is a well-studied problem. There exists a rich literature in this area which is not covered in this review, since those works do not deal with unstructured text at all. Attempts to link market news data to stock-market data were reported in [5]. This work employed topic extraction to identify major information components from news documents. It was observed that topics can be correlated to falls or rises in market indices. This work did not focus on analyzing the causes behind a specific stock value.

The closest to our work, however, is a retrieval system that was presented in [10]. In this work, novel indexing methods for structured and unstructured data have been reported as a way to retrieve data simultaneously from different repositories. The focus of this system, however, was mainly on data acquisition and extraction rather than on correlation.

III. EVENT-BASED KNOWLEDGE DISCOVERY FOR ENTERPRISE ANALYTICS – A MOTIVATING EXAMPLE

Before going into the technical details of the work, we start with some examples that show how stock-market data can be linked to news data to reveal causal relationships between different types of news events and stock-price values. In our experiments we have news data related to 5 major Indian IT companies, along with their stock prices for the last 6 months. We would like to clarify that it is not our intention to provide insights into stock-market prediction, which is a well-established field in which we have very little expertise.

Fig. 1 shows the stock chart for Wipro for the period. The red boxes mark the days when the closing price of the stock was lower than the opening price. A large red box is seen on the right-hand side for January 18th, 2013. This can clearly be thought of as a day on or close to which some significant events must have happened, after which the overall stock prices also seem to have fallen for the company.

Searching the news repository for related news of 17th and 18th January, 2013 yields news documents related to Wipro's stock values, some of which are shown in Fig. 2. It is interesting to observe from the titles that the shares initially surge on the 17th based on expectations of a good quarterly report, and start well on the 18th based on good profit reports, but finally go down. None of the titles, however, explain the causes for the fall. Table I shows some sentences extracted manually from one of the news documents (http://profit.ndtv.com/news/cheat-sheet/article-five-reasons-why-wipro-shares-fell-5-despite-strong-q3-316452) and the human interpretations that can be attached to them. We observe that, if there are sufficient numbers of similar instances observed across the IT industry over time, it can be inferred that reports of declining business volumes can cause a fall in stock prices, as happened for Wipro.

Fig. 2. Market news on Wipro, 17th and 18th January, 2013.

Fig. 3. Fusion Framework for Integrated Analysis of Data.

TABLE I. MANUALLY RETRIEVED SENTENCES FROM A NEWS DOCUMENT THAT EXPLAIN THE CAUSE FOR FALL IN STOCK PRICES OF WIPRO

Sentence | Interpretation | Type of Sentence
Traders said Wipro's performance in the core IT services segment was not as strong as expected. | Performance did not meet expectations | Reporting: someone said something
Wipro's IT business volumes, or the billable hours, declined 1 per cent sequentially (quarter-on-quarter) against estimates of 1.6 per cent QoQ growth. | Business volumes declined | Fact: decrease in a measurable value
Wipro was the worst performer in terms of volume growth as compared to other top-tier IT firms. | Performance bad in comparison to peers | Opinion: comparison to others

The manual exercise outlined here not only reaffirms the hypothesis that analysis can gain from looking at text and structured data jointly, but also underlines the challenges. The sentences identified and presented in Table I had to be retrieved from a large collection. While the interpretation of the sentences was based on their semantic content, the classification of the sentences was based on linguistic analysis.

The purpose of the proposed framework is to support methods that can automate a large part of the above-mentioned tasks to help humans draw the right conclusions. Thus we have worked on methods to automatically identify sentences that are relevant to an analysis task based on their semantic content, linguistic structures, and statistical significance. We have also worked on mechanisms to extract the information components from these sentences and store them in a more structured way to facilitate future inference tasks.

IV. FRAMEWORK FOR EVENT-BASED FUSION OF INFORMATION

Fig. 3 illustrates an overview of the proposed framework. Each unit in the framework is a place-holder that houses several tools appropriate for a task. While some of these tools are in place, more can be added, or existing ones replaced by alternate and more efficient ones. The acquisition and assimilation of content is followed by processing and extraction of information components that can be used for analysis. All unstructured content, along with the information components extracted from it, is indexed. The analytics unit is visualized as a store-house of domain- or application-specific analytical technologies rather than as a pre-specified collection of algorithms. Predictive and risk-assessment analytical components are two such examples, which we intend to build in future. These units will use the correlated information that is produced by the correlation engine. The query processor aids the analytical unit by retrieving information.

The functional details of each component are explained in this section. Detailed methodologies are presented in the subsequent sub-sections.

A. Data Acquisition and Content Pre-processing

The framework supports data gathering from a wide range of pre-defined or new sources. Unstructured documents like company reports or proposals are ingested as files written in one of several known formats. News articles are gathered from the Web. Social-media content collection can also be integrated. Presently, content from Twitter and a large set of blogs that discuss market news is collected. All unstructured content is uniquely identified with an ID number and assigned a date and time stamp. Additional meta-data elements like author name, category of document, sender, receiver, etc. that are provided by the content sources are also associated with the documents. Structured data is mostly gathered from pre-defined data sources. While stock-market data can be obtained from a number of open sources, other kinds of data, like those of sales or demand forecasting, are enterprise-sensitive. These can be fed in directly from the databases where they are stored.

Since unstructured content comes from heterogeneous sources, there is almost always a lot of duplication in the content collection. De-duplication helps in identifying duplicates or near-duplicates, and thereby in grouping content together. Open-source text indexing mechanisms are deployed to identify exact duplicates, and min-hash based techniques [11] to identify near-duplicates. De-duplication and grouping help in reducing the computational overhead of processing unstructured content. Attributes of a group can be defined in a problem-specific way. A document group retains information like the number of unique sources in which the document appeared, its first occurrence time and location, the total number of comments or replies that can be associated with it, etc., to assist in the analytics process. For social-media content, an important parameter for duplicated content is the total number of shares, also referred to as buzz. The total buzz of a document is a function of its buzz across all sources and all times.
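The min-hash step can be sketched as follows. This is an illustrative implementation, not the one from [11] or the authors' deployment; the shingle size, the number of hash functions, and the use of seeded MD5 in place of random permutations are all assumptions of this sketch.

```python
import hashlib


def shingles(text, k=3):
    """k-word shingles of a document, lowercased."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(shingle_set, num_hashes=64):
    """One minimum per seeded hash function approximates a random permutation."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set
        ))
    return sig


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Two news items that differ in a single word yield a high estimated similarity and can be grouped, while unrelated items score near zero.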

B. Content Processing Unit

Content processing involves a host of activities related to information extraction, organization, and characterization. One of the core tasks accomplished by this unit is to identify the events that are exploited for event-based reasoning. Event extraction from unstructured content is a text-mining activity that itself utilizes several other fundamental text components like words, phrases, and relations. This unit houses an assortment of text-mining tools, including a part-of-speech (POS) analyzer, phrase extractor, named entity recognizer (NER), sentiment and opinion extractor, relation extractor, and event extractor. Phrase extraction is a two-step process. We have deployed open-source tools for standard text-mining tasks like POS tagging, initial phrase extraction, named-entity extraction, and relation extraction. The initial phrases extracted are then analyzed for similarity and combined to yield a set of canonical phrases that can take care of structural variations of syntactically similar content. Details of the event extraction and characterization processes will be described in the next section.
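The canonical-phrase step can be illustrated with a minimal sketch. The normalization used here (lower-casing, dropping a small function-word list, and sorting tokens) is an assumption made for illustration; the paper does not specify the similarity measure actually used.

```python
import re
from collections import defaultdict

# Illustrative function-word list; a real system would use a fuller one.
STOP = {"the", "a", "an", "of"}


def canonical_key(phrase):
    """Reduce a phrase to an order- and case-insensitive token key,
    dropping function words, so structural variants collide."""
    tokens = re.findall(r"[a-z0-9]+", phrase.lower())
    return tuple(sorted(t for t in tokens if t not in STOP))


def group_phrases(phrases):
    """Group surface phrases under one canonical key each."""
    groups = defaultdict(list)
    for p in phrases:
        groups[canonical_key(p)].append(p)
    return groups
```

Under this scheme "the billable hours", "Billable Hours", and "hours, billable" all map to the same canonical phrase.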

C. Time-Series Data Analysis Unit

This unit hosts the time-series data and the various analytical components that operate on it. The analysis unit hosts event extractors to detect points of interest that exhibit one or more features from a set of pre-defined properties. The exact nature of the algorithms employed to detect frequent or significant patterns or anomalies can differ from application to application; however, generic or default implementations can also be made available. For example, significant patterns in stock-market data analysis include the M pattern, cup-and-handle, etc., which are detected over a range of data. Similarly, anomaly-based event detectors can identify deviation events in a domain-dependent fashion, as deviations from "normal" behavior or from predicted behavior.
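A default implementation of such a deviation-event detector might flag points that stray too far from recent behavior. The trailing-window z-score rule below is a generic sketch under assumed parameters, not the detector the framework ships with.

```python
def deviation_events(values, window=5, threshold=2.0):
    """Flag indices whose value deviates from the trailing-window mean
    by more than `threshold` trailing standard deviations."""
    events = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mean = sum(history) / window
        var = sum((x - mean) ** 2 for x in history) / window
        std = var ** 0.5
        # Skip flat windows (std == 0); otherwise test the z-score rule.
        if std > 0 and abs(values[i] - mean) > threshold * std:
            events.append(i)
    return events
```

On a series that hovers around 100 and then drops to 93, only the drop is flagged; a perfectly flat series yields no events.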

D. Indexing Unit

This unit is responsible for indexing the entire repository along with all associated meta-data and the events extracted by the earlier units. Indexing facilitates query processing, which is essential to retrieve correlated events for inferential analysis.

E. Correlation Engine

The correlation engine links world events and time-series events. Correlations are created primarily using time of occurrence or derivations of it. The derivations of time used for correlation are members of a discrete set of pre-defined time intervals that include "today", "yesterday", "last one week", "last fortnight", "last one month", "immediately before", "immediately after", "next week", "next fortnight", "next one month", etc., and more can be added. An initial set of correlations is formed using pre-defined rules. A correlation table entry is an n-ary relation defined over the domains of world and time-series events. These entries are further judged for significance using data-mining techniques like association analysis.
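The rule-based first pass can be sketched as a time-window join. The interval table below covers only three of the pre-defined intervals, and its offsets are assumptions for illustration; the paper does not publish the exact window definitions.

```python
from datetime import date, timedelta

# Illustrative interval definitions: name -> (offset from the time-series
# event date, window length). "yesterday" is the single day before it.
INTERVALS = {
    "today": (timedelta(days=0), timedelta(days=1)),
    "yesterday": (timedelta(days=-1), timedelta(days=1)),
    "last one week": (timedelta(days=-7), timedelta(days=7)),
}


def correlate(ts_event_date, world_events, interval):
    """Return world events whose date falls in the named interval,
    relative to a time-series event (half-open window [start, end))."""
    offset, length = INTERVALS[interval]
    start = ts_event_date + offset
    end = start + length
    return [e for e in world_events if start <= e["date"] < end]
```

Each returned pair (time-series event, world event, interval name) would then become a candidate correlation-table entry to be judged for significance.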

V. EXTRACTION, REPRESENTATION AND CHARACTERIZATION OF WORLD EVENTS

The Oxford Dictionary defines an event as "a thing that happens or takes place, especially one of importance". Philosophically, events are associated with state changes. This definition makes events a natural choice for spatio-temporal analysis, especially if the analysis is about the interpretation of deviant situations. The activity-oriented definition, on the other hand, can be exploited to identify textual elements that are associated with some action or activity as carriers of event-related information.

Based on these observations, we have defined two types of events:

1) World events: A world-event is extracted from text documents and contains information about a specific action or activity indicating an occurrence or a happening at a particular time instance. A world-event can also be associated with a possibly empty set of entities, a location, a significance value, and an impact. The entities associated with an event may be further classified as actors or perpetrators of the action, or as objects that are impacted by the event.
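As a sketch, a world-event of this kind could be represented as a record with exactly these attributes. The field names and types below are illustrative assumptions; the paper defines the information content of a world-event, not a concrete schema.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class WorldEvent:
    """Illustrative record for a world-event: an action at a time instance,
    with an optional entity set, location, significance, and impact."""
    action: str                                   # e.g. the relation's verb
    time: date
    entities: list = field(default_factory=list)  # (name, role) pairs: actor/object
    location: str = ""
    significance: float = 0.0
    impact: str = ""


# A hypothetical event built from the Wipro example in Section III.
e = WorldEvent(action="announced", time=date(2013, 1, 18),
               entities=[("Wipro", "actor"), ("poor results", "object")])
```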

2) Time-series events: A time-series event is characterized by a deviation in the observed behavior of a measurable variable from its expected behavior. Time-series events are observed while tracking specified measurable quantities like sales, stock-market value, etc. Simple events can be defined in terms of a rise or fall of the value, or as deviations from an expected value that has been predicted by a model. More complex events can be defined as functions of state changes over time, or as functions of multiple time-series. For example, while defining a stock-market deviation event for a company, we have made use of stock values of the sector as a whole, tracked over a defined time period, rather than looking at isolated values.
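A deviation event of this relative kind can be sketched by comparing a company's daily return with its sector's. The 3% threshold and the use of simple daily returns are assumptions for illustration; the paper does not specify the exact function of the two series.

```python
def sector_relative_events(company, sector, threshold=0.03):
    """Flag days where the company's daily return diverges from the
    sector's daily return by more than `threshold`.
    Both inputs are equal-length lists of closing values."""
    events = []
    for i in range(1, len(company)):
        company_ret = company[i] / company[i - 1] - 1.0
        sector_ret = sector[i] / sector[i - 1] - 1.0
        if abs(company_ret - sector_ret) > threshold:
            events.append(i)
    return events
```

A day when the company falls about 5% while the sector is roughly flat is flagged; days when both move together are not.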

World-events may be linked to time-series events based on associated information like time of occurrence, names of associated entities, location, etc. The basis for linking two or more events from the above categories is termed the 'context for correlation' and is a combination of all the above elements. The context for correlation plays a significant role in the analytics process. For example, on a given day, news articles for a particular company are linked to its stock prices. Some of these news articles contain references to the stock values and their behavior. Other sentences may contain reasons explaining the behavior. Yet others may simply report activities of the organization. These activities may be contextually linked to the stock behavior. We explain how event detection from the sentences or articles can play a significant role in establishing these contexts and associations.

Event detection has been quite popular in the multimedia research community, where events are associated with action descriptions. Event detection from text documents has proved to be a greater challenge due to the usual problems of unstructured text processing, like dealing with senses, ambiguities, polysemy, synonymy, etc. We now present some algorithms that deploy syntactic analysis of text followed by deep semantic analysis to identify and characterize events. Since world-events encode an activity along with associated entities, event detection from text documents is preceded by entity detection and relation detection. Named entities are extracted from each sentence using the Stanford NER and stored. Each sentence is processed for relation extraction using ReVerb1, a tool developed at the University of Washington. ReVerb is a pattern-based relation extractor that can capture relation phrases expressed by verb-noun combinations after eliminating incoherent and uninformative extractions. The relation extractor produces outputs of the form (SUBJECT - PREDICATE - OBJECT), where the predicate matches the following pattern:

V | VP | VW*P

where
V = verb particle? adv?
W = (noun | adj | adv | pron | det)
P = (prep | particle | inf. marker)
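The predicate pattern can be checked mechanically over a POS-tagged token sequence. The tag-to-class mapping below is a small illustrative subset of Penn Treebank tags, and folding adverbs into the verb unit follows the `adv?` in the grammar; both are assumptions of this sketch, not ReVerb's actual implementation.

```python
import re

# Illustrative mapping from Penn-style POS tags to the grammar's classes:
# v = verb/particle/adverb (the V unit), w = noun/adj/adv/pron/det,
# p = preposition/particle/infinitive marker.
CLASS_OF = {
    "VB": "v", "VBD": "v", "VBZ": "v", "VBG": "v", "RP": "v", "RB": "v",
    "NN": "w", "NNS": "w", "JJ": "w", "PRP": "w", "DT": "w",
    "IN": "p", "TO": "p",
}

# V | VP | VW*P, with V realized as one or more v-class tags.
PREDICATE_RE = re.compile(r"^v+$|^v+p$|^v+w*p$")


def is_reverb_predicate(pos_tags):
    """True if the tag sequence matches the relation-phrase pattern."""
    classes = "".join(CLASS_OF.get(t, "?") for t in pos_tags)
    return bool(PREDICATE_RE.match(classes))
```

For example, a bare verb ("fell"), verb plus preposition ("fell from"), and verb plus noun chunk plus preposition all match, while a lone noun does not.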

Table II presents the most important sentences and corresponding relations extracted from the news archive for January 18th, 2013 with respect to Wipro. It also shows how these sentences could be grouped together based on their semantic content. The semantic nature can be captured through the verbs and their subjects. We utilize the relations, or more specifically the verbs in the above predicates, to characterize the events that happened on that day. Since there are many events reported for a single day, we also propose to show only the most relevant events for the day.

We propose a two-step process to obtain the characterization of the events. In the first step, the verbs are classified into classes that encode the semantic character of the relations, and thereby of the actions or events represented by them, using VerbNet (VN) [12]. VN is a large on-line verb lexicon that organizes verbs into verb classes and sub-classes, such that members of each class exhibit more syntactic and semantic coherence among themselves than with members of other classes. Each verb class in VN is completely described by thematic roles, selectional restrictions on the arguments, and frames consisting of a syntactic description and semantic predicates with a temporal function. A single verb, however, may be classified under different classes or sub-classes in VN. Though each VN class contains a set of syntactic descriptions, or syntactic frames, depicting the possible surface realizations of the argument structure for constructions such as transitives, intransitives, prepositional phrases, resultatives, and a large set of diathesis alternations, it is not possible to use these descriptions for online disambiguation of verb classes for large collections of text data, which are also quite noisy.

TABLE II. RELATIONS EXTRACTED FROM THE SAMPLE SENTENCES

In the second step of our work, we employ a disambiguation algorithm that iteratively allocates every verb to a single class, taking into account the repository statistics. This ensures that statistically significant verb-classes are captured correctly even without employing computationally complex language processing techniques.

To begin with, every verb is assigned, through VN look-up, to all the classes it can possibly belong to. The significance of each class or sub-class C is denoted by Sig(C) and is calculated using the following function:

Sig(C) = 1 / Specificity(C)

where

Specificity(C) = |C|, the number of member verbs C has in VN.

Clearly the function favors classes that have fewer member verbs and are therefore semantically less ambiguous. The algorithm completes class assignment in multiple passes, until every verb is assigned to exactly one class. As and when a verb is disambiguated, i.e. assigned to one class, it is de-assigned from the other classes and the frequencies of those classes are decreased. In case of a tie, repository specificity is used to

Sentence: The company's consolidated revenue grew 10.27% to Rs 10,989 crore from Rs 9965 crore a year ago.
Subject : Predicate : Object: The company's consolidated revenue : grew : Rs 9965 crore
Semantic Category: Behavior of Revenue

Sentence: IT Products revenues grew 11 percent, and consumer care and lighting revenues increased 17 percent from a year ago.
Subject : Predicate : Object: Products revenues : grew : 11 percent
Semantic Category: Behavior of Revenue

Sentence: However, IT business volumes, or the billable hours, fell 1 percent from the previous quarter, according to company data.
Subject : Predicate : Object: business volumes, or the billable hours : fell 1 percent from : the previous quarter
Semantic Category: Behavior of Business Volume

Sentence: The stock of Bangalore based IT exporter Wipro crashed during the early morning trade by as much as 5% after the company announced poor results for the December 2012 quarter.
Subject : Predicate : Object: the company : announced : poor results
Semantic Category: Announcement

1 http://reverb.cs.washington.edu/

Page 6: A Framework to Integrate Unstructured and Structured Data ...fusion.isif.org/proceedings/fusion2013/html/pdf/Friday, 12 July... · A Framework to Integrate Unstructured and Structured

Fig. 4. Verb class disambiguation example

resolve in favor of the class that has fewer co-members present in the current repository, where co-members are verbs that belong to the same class in VN. The rationale is that a class with many co-members in the repository could have occurred due to those other verbs, whereas a class with few co-members is more likely to have occurred due to the verb under consideration. If disambiguation is still not complete, the most frequent remaining class is assigned to the verb.

The detailed algorithm is presented below:

Verb-Class Disambiguation Algorithm

1) For each document d in repository R, triples of the form <subject, object, verb> are extracted using Reverb.
2) Let V denote the complete set of verbs extracted from R. For each Vi ∈ V, let Ci ⊆ C denote the set of VN classes associated with Vi, where C is the set of classes present in VerbNet.
3) For each Cj ∈ C, compute the significance of the class from VerbNet: Sigv(Cj) = 1/Specificityv(Cj), where Specificityv(Cj) is the number of verb members associated with Cj in VN.
4) For each Vi ∈ V:
   a) Assign to Vi the class with the highest Sigv(Cj) among Cj ∈ Ci; let Cia be the class assigned to Vi.
   b) In case of a tie between verb classes, add Vi to a set Vd instead.
5) For each Vi ∈ V − Vd, remove all verb classes apart from Cia, i.e. Ci := {Cia}.
6) For each Cj ∈ C, compute the significance of the class from R: Sigr(Cj) = 1/Specificityr(Cj), where Specificityr(Cj) is the number of member verbs of Cj present in R.
7) For each Vi ∈ Vd, assign to Vi the class with the highest Sigr(Cj) among Cj ∈ Ci.
8) In case of a further tie, assign the class Ci with the highest freq(Ci) = Σ freq(Vj), taken over the verbs Vj ∈ V that are members of Ci, where freq(Vj) is the frequency of Vj in R.

End Algorithm
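The core of the procedure can be sketched in Python. The verb classes below are a toy stand-in for VerbNet (the class names and memberships are illustrative only), and the single-pass ranking is a simplification of the multi-pass de-assignment described above:

```python
from collections import Counter

# Toy stand-in for a VerbNet look-up (verb -> candidate classes).
# Class names and memberships are illustrative, not real VerbNet data.
VN = {
    "indicate-78-1": {"indicate", "say", "explain"},
    "admit-65":      {"admit", "allow", "include"},
    "confess-37.10": {"admit", "confess", "concede"},
}

def disambiguate(verb_occurrences, vn=VN):
    """Assign every extracted verb to exactly one VN class.

    Preference order, following the algorithm in the text:
    1) highest Sig(C) = 1/|C|, i.e. the class with fewer VN members;
    2) on a tie, the class with fewer co-member verbs in the repository;
    3) on a further tie, the class with the highest repository frequency.
    """
    freq = Counter(verb_occurrences)

    def rank(c, verb):
        sig = 1.0 / len(vn[c])                                        # step 1
        comembers = sum(1 for v in vn[c] if v in freq and v != verb)  # step 2
        class_freq = sum(freq[v] for v in vn[c] if v in freq)         # step 3
        return (-sig, comembers, -class_freq)

    return {
        verb: min((c for c in vn if verb in vn[c]), key=lambda c: rank(c, verb))
        for verb in freq
        if any(verb in members for members in vn.values())
    }

print(disambiguate(["say", "admit", "admit", "allow", "include"]))
```

Here "admit" ties on Sig between the two classes of size three, but "admit-65" has co-members "allow" and "include" present in the repository, so "confess-37.10" is favored, mirroring the behavior shown in Fig. 4.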

Fig. 4 explains the functioning through an example. It shows that though the verb admit is initially assigned to three classes, it finally gets assigned to the class "confess", which in this repository carries the sense of declaring or reporting, rather than to the class "admit", which carries the sense of allowing.

Verb-class assignment provides the first major step towards structuring the information components. Selecting a sub-set of the most important verb-classes presents an efficient way to identify the key actions or events reported in the repository. Only those sentences that contain verbs belonging to the selected set are judged relevant for the analysis task. In other words, verb-class assignment is used as a way to summarize the important information for human analysts.

For automated analysis, text information components extracted from these sentences are used to create World-events. Associated meta-data from the corresponding documents like company name for which the document was collected, date and source are used to generate the context for the event. A complete representation of a World-event can now be viewed as follows:

(eventID, sourceDocument, Date, VerbClass, Actors, EventType)
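This record can be sketched as a small Python data class. The field names follow the tuple above; the types and the sample values (file name, verb class, actors) are illustrative assumptions:

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class WorldEvent:
    event_id: str
    source_document: str   # document the relation was extracted from
    event_date: date       # meta-data: publication date
    verb_class: str        # disambiguated VN class
    actors: tuple          # subject/object entities of the relation
    event_type: str        # e.g. "Acquisition / Contract win"

e = WorldEvent("we-001", "news-2013-01-18-wipro.txt", date(2013, 1, 18),
               "Calibratable_cos-45.6", ("Wipro", "consolidated revenue"),
               "State change of measurable quantity")
print(e.event_type)
```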

VI. REPRESENTATION OF TIME-SERIES EVENTS

In this work, time-series events are detected as abnormalities, i.e. outliers or deviations observed in a time-series. Since we have worked primarily on stock-market data, where a single company's stock value depends both on the company's own track record and on the behavior of its sector, a time-series value at an instant is declared an outlier if it is significantly "far" from past values in the same series, or significantly far from the values of the neighboring series on the same day. The first aspect captures the past behavior of the tracked entity or organization; the second captures the sector behavior. In the given examples, the sector comprises all major IT companies in India. Since stock-market data is inherently noisy, we use smoothing and averaging to obtain reliable outliers. The following algorithm explains how deviations are identified in the time-series.


Fig. 5. List of Top-20 VN classes extracted from relations mined from market news data.

TABLE III. MEMBER VERBS IN THE TOP-20 VN CLASSES

Class | Verb Members | Event_Type
Indicate-78-1 | indicate, say, explain | Reporting / Announcement
Escape-51.1-2 | rise, advance, come, go, fall, enter, plunge, arrive, approach, exit, tumble, escape | State change of measurable quantity
Wish-62 | plan, expect, aim, mean, propose, hope, dream | Expectation
Get-13.5.1-1 | catch, win, buy, reach, gain, choose, procure, earn, chatter, hire, boost | Acquisition / Contract win
Calibratable_cos-45.6 | decline, increase, soar, decrease, vary, depreciate, diminish, plummet | State change of measurable quantity

Time-series Event Detection Algorithm

1) Input: market data (time-series) associated with the n entities of interest E1, ..., En, given as T = {TE1, TE2, ..., TEn}.
2) Define the deviation DEi(t) at a point TEi(t) as

   DEi(t) = TEi(t) − (1/(n−1)) Σj≠i TEj(t),  i = 1 to n,

   i.e. the distance of entity Ei on day t from the average of the other entities on that day.
3) Construct the deviation series for all the entities, represented by D = {DE1, DE2, ..., DEn}.
4) Obtain the smoothed deviation series using a moving-average filter of length 5, represented as D̄ = {D̄E1, D̄E2, ..., D̄En}, where

   D̄Ei(t) = (1/5) Σk=t−4..t DEi(k).

5) Obtain an envelope series using the maximum and minimum values of the smoothed deviation series.
6) If a point of the original (unsmoothed) deviation series lies outside the defined envelope, it is treated as an outlier and marked as a time-series event.

End Algorithm
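The steps above can be sketched as follows. The input format (a dict of per-entity value lists) and the exact envelope comparison are our own assumptions; the paper only states that points outside the envelope of the smoothed deviation series are events:

```python
def detect_events(series, window=5):
    """Sketch of the deviation-based time-series event detection above.

    `series` maps entity name -> list of values (one per trading day).
    The deviation of an entity on a day is its distance from the mean of
    the other entities; the deviation series is smoothed with a length-5
    moving average, and a day is flagged when the raw deviation falls
    outside the [min, max] envelope of the smoothed series.
    """
    entities = list(series)
    days = len(next(iter(series.values())))
    events = []
    for e in entities:
        others = [o for o in entities if o != e]
        # Step 2: deviation from the sector average on each day.
        dev = [series[e][t] - sum(series[o][t] for o in others) / len(others)
               for t in range(days)]
        # Step 4: moving-average smoothing over the last `window` points.
        smooth = [sum(dev[max(0, t - window + 1):t + 1]) /
                  (t - max(0, t - window + 1) + 1) for t in range(days)]
        # Steps 5-6: envelope check on the raw deviation series.
        lo, hi = min(smooth), max(smooth)
        events.extend((e, t, dev[t]) for t in range(days)
                      if not lo <= dev[t] <= hi)
    return events

prices = {"A": [10] * 5 + [30] + [10] * 4, "B": [10] * 10, "C": [10] * 10}
print(detect_events(prices))
```

Note that a spike in one entity also produces sector-relative events for its peers on the same day, since every deviation is measured against the remaining entities.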

A time-series event is finally represented as follows:

(eventID, Entity, Date, DeviationValue, DeviationType)

The correlation engine uses the Date and entity values to link World events and time-series events.
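This linking step amounts to an equi-join on entity and date, which can be sketched as below. Representing both event types as dicts with `entity` and `date` keys is our own assumption; for World-events the entity is the company for which the source document was collected, per the meta-data described above:

```python
from datetime import date

def correlate(world_events, ts_events):
    """Link World-events to time-series events on (entity, date)."""
    index = {}
    for w in world_events:
        index.setdefault((w["entity"], w["date"]), []).append(w)
    return [(t, index.get((t["entity"], t["date"]), [])) for t in ts_events]

world = [{"entity": "HCL", "date": date(2012, 8, 10),
          "event_type": "Acquisition / Contract win"}]
ts = [{"entity": "HCL", "date": date(2012, 8, 10), "deviation_type": "rise"}]
for ts_event, linked in correlate(world, ts):
    print(ts_event["entity"], [w["event_type"] for w in linked])
```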

VII. RESULTS AND OBSERVATIONS

In this section we present some results that illustrate the capability of the proposed methodologies to generate event-based correlations for integrated analytics. Fig. 5 presents the list of the top 20 VN classes extracted from relations mined from the market news data repository. The relative frequencies of the different verb-classes are found to be more or less identical across companies. Table III shows the member verbs of these classes, along with a manual assignment of event-type based on the member verbs.

Fig. 6 illustrates some correlated events obtained from the system. Fig. 6(a) shows a time-series event observed for HCL on 10 Aug 2012. Related news articles are retrieved from the repository based on the entity name and the deviation date. The event extraction process then extracts the significant events, the most prominent of which are of type "Acquisition/Contract win". Similar events were found through correlation for Tech Mahindra on 17 Sep 2012, as shown in Fig. 6(b). Though not yet substantiated by a large number of instances, it may be surmised that acquisition events, which encode information of type "contract win" or "acquisition of companies", correlate with a rise in a company's stock price. Fig. 7 shows another set of correlated events for Tech Mahindra, coinciding with its stock gains on 12 Dec 2012; these were found to be related to a stock sale by British Telecom, as also captured by the extracted events.


Fig. 6. Correlation instances for (a) HCL and (b) Tech Mahindra, obtained by the system for 10th Aug 2012 and 17th Sep 2012 respectively.

Fig. 7. Correlated Events for Tech Mahindra on Dec. 12, 2012

VIII. CONCLUSION

In this work, we have presented a framework to enable integration of diverse types of data from diverse sources. The proposed work describes extraction, classification, evaluation and correlation of events of different types from a variety of data-sources. We propose two basic types of events, world events and time-series events, as the basis for correlating information from multiple sources. We have used text-mining techniques to identify and characterize significant world-events from large collections of documents, and have shown how these events can be contextually correlated to time-series events. We have presented results from a data collection related to IT company stock prices, which show that the system is capable of identifying interesting and significant correlations between stock prices and company events. Future work concentrates on modeling the effect of different types of world events on time-series events through association analysis and statistical reasoning. Significant associations, once learnt, can be further utilized to build predictive systems in which world events serve as predictors of time-series events. Such systems can be used to generate warnings or alerts, or for risk assessment.

REFERENCES

[1] V. T. Chakaravarthy, H. Gupta, P. Roy, and M. Mohania, "Efficiently linking text documents with relevant structured information," in Proc. 32nd Int. Conf. on Very Large Data Bases (VLDB), Seoul, Korea, Sept. 12-15, 2006.
[2] E. Rahm, A. Thor, and D. Aumueller, "Dynamic fusion of web data," in Database and XML Technologies, Lecture Notes in Computer Science, vol. 4704, 2007, pp. 14-16.
[3] M. Bhide et al., "Enhanced business intelligence using EROCS," in Proc. IEEE 24th Int. Conf. on Data Engineering (ICDE), pp. 1616-1619, April 7-12, 2008.
[4] Manish A. et al., "LIPTUS: associating structured and unstructured information in a banking environment," in Proc. 2007 ACM SIGMOD Int. Conf. on Management of Data, Beijing, China, June 11-14, 2007.
[5] A. Mahajan, L. Dey, and S. M. Haque, "Mining financial news for major events and their impacts on the market," in Proc. IEEE/WIC/ACM Int. Conf. on Web Intelligence and Intelligent Agent Technology, 2008, pp. 423-426.
[6] J. Hamilton, Time Series Analysis. Princeton Univ. Press, 1994.
[7] D. Shalini, M. Shashi, and A. M. Sowjanya, "Mining frequent patterns of stock data using hybrid clustering," in Proc. Annual IEEE India Conference (INDICON), pp. 1-4, Dec. 16-18, 2011.
[8] J. N. K. Liu and R. W. M. Kwong, "Automatic extraction and identification of chart patterns towards financial forecast," Applied Soft Computing, pp. 1197-1208, Jan. 2007.
[9] K. V. Nesbitt and S. Barrass, "Finding trading patterns in stock market data," IEEE Computer Graphics and Applications, vol. 24, no. 5, pp. 45-55, Sept.-Oct. 2004.
[10] T. Wakefield and D. Bean, "Visualization of integrated structured and unstructured data," US Patent US20050108256.
[11] A. Z. Broder, "On the resemblance and containment of documents," in Compression and Complexity of Sequences, IEEE Computer Society Press, Salerno, Italy, pp. 21-29.
[12] K. Kipper, A. Korhonen, N. Ryant, and M. Palmer, "Extensive classifications of English verbs," in Proc. 12th EURALEX International Congress, Turin, Italy, Sept. 2006.