Executive Briefing: Text Mining for Business Intelligence
Steven O. Kimbrough
Acknowledgements: Ulku Oktem, John Ranieri
2006-10-20
File: textmining-businessintelligence-foils.*
INSEAD – UNILEVER workshop “Scenario Planning and Scanning on Biofuels and Related Commodities” 19-20 Oct 2006.
Goals for the Briefing
• Provide a background briefing on text mining for such purposes as business intelligence, environmental scanning, and product application discovery
• Emphasizing: concepts over products
• Concepts more advanced than products
• Aims: Secure intuitions, prepare for design
• Assuming: Familiarity with use of data for these purposes (business intelligence, etc.)
Briefing in Three Parts
A. Background on obtaining information from text.
B. New ideas in text mining.
• Vaim concept.
C. Design exercise.
• High-level specification of requirements for business intelligence and scanning systems
Points to Be Made, Part A
• Importance of text as a source of information for scanning and intelligence
• Limitations of Internet search engines and, in general, Information Retrieval (IR) systems.
Remember: this is a briefing. We shall be brief.
Why Bother with Text?
• Text is noisy and difficult to handle. Why not stick with data? We know about data.
• Answer: There is much valuable information in textual sources that is not in fact available as data. It’s huge. Examples:
• News stories
• Regulatory filings
• Patents
• The World Wide Web
• et cetera
Hasn’t Google Solved the Information Retrieval Problem?
1. No. Search engines (on the Internet or not) have important shortcomings, especially in our context.
2. No. Information Retrieval is only one technique for getting information from text. There are other techniques and they are valuable (and Internet search engines don’t tend to use them).
The problem of getting information from text is called Information Retrieval (IR). Aren’t Internet search engines the last word on IR?
1. State-of-the-Art in Search Engines
• Impressive strengths and surprising successes
• But, all are record-oriented
• Inherent limitations?
Information Retrieval Limitations
• Classical IR
• Record-oriented. “Find me the documents relevant to such-and-such.”
• If we wish to find a few relevant documents, then performance may be OK.
• If we wish to be comprehensive (not to say exhaustive), then performance is known to be poor.
• ***Not well-suited for exploration, when you don’t know exactly what you are looking for***
• Recall/precision tradeoff
• ***Futility point and defeat by scale***
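The recall/precision tradeoff can be made concrete with a small worked example. The sketch below is illustrative only: the document ids and both queries are hypothetical, and it uses the standard definitions (precision = relevant retrieved / retrieved; recall = relevant retrieved / relevant).

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical collection in which four documents are relevant.
relevant_docs = {"d1", "d2", "d3", "d4"}
narrow_query = ["d1", "d2"]                                      # few hits, all relevant
broad_query = ["d1", "d2", "d3", "d5", "d6", "d7", "d8", "d9"]  # many hits, mostly noise

narrow = precision_recall(narrow_query, relevant_docs)  # (1.0, 0.5)
broad = precision_recall(broad_query, relevant_docs)    # (0.375, 0.75)
```

Broadening the query raises recall from 0.5 to 0.75 but drops precision from 1.0 to 0.375, which is the difficulty with trying to be comprehensive.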
Google et alia?
• Bring something new to the table
• PageRank algorithm and exploitation of the link structure on the Web
• Seeks: “authoritative pages”
• Roughly, high rank on query Q = pointed to by many pages that match to Q
• Again, good for finding a few documents about Q, if there are “authoritative pages” on Q. And if not?
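For intuition about link-based ranking, here is a minimal sketch of the textbook power-iteration form of PageRank. It is not a description of how any production search engine works: the toy link graph, the damping factor of 0.85, and the handling of pages without outgoing links are all assumptions made for illustration, and the sketch ignores the query entirely.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it points to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank, in equal shares, to the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Page with no outgoing links: spread its rank over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical four-page web: a, b and d all point to c; c points to a.
toy_web = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
top_page = max(ranks, key=ranks.get)
```

Page c, pointed to by the most pages, comes out on top. A topic with no heavily linked pages gives this kind of ranking little to work with.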
Illustration
• Example query, Google or otherwise
• “light weight” and “heat resistant”
• Note: hits and responses
• Limited to documents on the web, i.e., publicly available documents, not corporate archives
• Terminology: Hits, responses: records
Limitation 1: Powerful retrieval yields unmanageable numbers of hits. Ranking is helpful, but of limited value.
Limitation 2: Minimal support for telling you about something you don’t know and may find interesting
Limitation 3: Many undifferentiated hits (except by rank). A listing of records.
2. Other Techniques
• Think: repository + access methods
• Data repositories:
• Relational database
• Data warehouse
• OLAP “cubes”
• Document repositories (much more complex):
• (Special) collections of documents
• Indexes of documents (in collections)
• Classifications, taxonomies, ontologies
• Cross-indexes (associations among documents)
• Extracted information
2. Other Techniques
• Think: repository + access methods
• Data access methods
• SQL
• OLAP, ROLAP, MOLAP, HOLAP
• etc.
• Document access methods
• Search, via Information Retrieval methods
• Data access methods, after deriving data from text
a. Based largely on counts, of words, documents, etc.
b. From data obtained by IE methods
Summary Framework
| | Data | Text |
|---|---|---|
| Record-oriented | OLTP (relational database) | IR (inverted file) |
| Pattern-oriented | OLAP & Data Mining (multidimensional data “cube”) | Text Mining (? To be determined... next) |
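The “inverted file” in the IR cell is the index structure behind record-oriented text search: a map from each word to the set of documents containing it. A minimal sketch follows, with a hypothetical three-document collection and naive whitespace tokenization (real systems do considerably more).

```python
from collections import defaultdict

def build_inverted_file(docs):
    """Map each word to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query word (boolean AND)."""
    postings = [index.get(word, set()) for word in query.lower().split()]
    return set.intersection(*postings) if postings else set()

# Hypothetical documents, for illustration only.
docs = {
    1: "light weight heat resistant polymer",
    2: "heat resistant ceramic coating",
    3: "light weight aluminium frame",
}
index = build_inverted_file(docs)
result = search(index, "heat resistant")  # {1, 2}
```

The answer is a set of records (documents). Nothing in the structure itself reports patterns across the collection, which is the gap the pattern-oriented row of the table addresses.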
Text Mining, Concept
• A good quote:
http://www.ischool.berkeley.edu/~hearst/text-mining.html
Text Mining is the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources. A key element is the linking together of the extracted information together to form new facts or new hypotheses to be explored further by more conventional means of experimentation.
Text mining is different from what we're familiar with in web search. In search, the user is typically looking for something that is already known and has been written by someone else. The problem is pushing aside all the material that currently isn't relevant to your needs in order to find the relevant information.
In text mining, the goal is to discover heretofore unknown information, something that no one yet knows and so could not have yet written down.
So, How Do We Do It?
• How can we make discoveries in text?
• “Read it” doesn’t count as an answer.
• But, what’s nice about text is that you can read it (unlike pure data).
• Challenge: Automated support for making discoveries from text.
Briefing in Three Parts
A. Background on obtaining information from text.
B. New ideas in text mining.
• Vaim concept.
C. Design exercise.
• High-level specification of requirements for business intelligence and scanning systems
In Brief on the Vaim Concept
Three distinctive aspects:
• P Pattern-oriented (versus record-oriented) system. Why? Focus on making discoveries, rather than retrieving a particular document. Requires categorization of documents (and data).
• D Data, which is derived from text. Measures of indicators in categories.
• M Mashing together (cross-indexing) of information from different sources.
Points to Be Made, Part B
Vaim concept: value-added information mash
Focal interest: deriving indicators from texts
Examples
Concept of indicators
Comments on uses; vision
But first....
Data Are Key
To decision making and its support
In addition to standard issues:
Obtaining data from new sources (texts)
Integrating (mashing) information from multiple sources (vaim, document sources)
Our vaim, and MV:Biofuels, work directly addresses these two issues
Data =
Measures of variables in categories
Example: average temperature at the earth’s surface, 1950-1999
http://carto.eu.org/article2480.html
Variables: Average temperature at the earth’s surface
Categories: Years 1950-99
Measures (or Values): As plotted, by category
Text
Data = Measures of Variables in Categories
Indicators & Indices
Indicator: A variable (e.g., surface temperature, stock price, number of homes sold) thought to indicate something, that is, to carry information reliably (enough).
Index: An aggregation of indicators; it is also an indicator.
Examples: Index of Leading Economic Indicators; Dow Jones Industrial Average
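As a minimal sketch of “an aggregation of indicators”, the code below forms a weighted average of three indicator values. The indicator names, values, and weights are hypothetical, and published indices each use their own, often more involved, aggregation rules.

```python
def composite_index(indicators, weights):
    """Weighted average of indicator values; weights need not sum to 1."""
    total_weight = sum(weights[name] for name in indicators)
    weighted_sum = sum(indicators[name] * weights[name] for name in indicators)
    return weighted_sum / total_weight

# Hypothetical indicator values, each assumed already scaled to a base of 100.
indicators = {"patent_mentions": 120.0, "news_mentions": 90.0, "web_mentions": 105.0}
weights = {"patent_mentions": 2.0, "news_mentions": 1.0, "web_mentions": 1.0}

index_value = composite_index(indicators, weights)  # 108.75
```

The result is a single number that can itself be tracked over categories (years, firms, ...), which is why an index is also an indicator.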
Indicators & Indices
Very widely accepted and used as inputs to managerial decision making
Further examples: Social indicators (http://unstats.un.org/unsd/demographic/products/socind/default.htm), economic indicators, ecological indicators
MV:Biofuels
• Begins with two simple ontologies: a list of n-grams of import in the biofuels space, and a list of companies in biofuels
• Obtain documents from multiple sources. Minimally:
• Patents (presently, just US patents)
• Web pages of biofuels companies
• News articles indexed on Factiva
Index & Cross Index
• n-grams in patents, Web pages, news articles
• companies in patents, Web pages, news articles
• patents and news articles by date
• patents by patent classification scheme
The cross indexing effects the mashing.
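A rough sketch of how indexing and cross indexing might look in code is given below. It is not the MV:Biofuels implementation: the documents and company names are hypothetical, and plain substring matching stands in for whatever matching a real system would use.

```python
from collections import defaultdict

# Hypothetical mini-collection: (doc_id, source, text).
documents = [
    ("p1", "patent", "Acme Fuels process for biodiesel from vegetable oil"),
    ("n1", "news", "Acme Fuels opens biodiesel plant"),
    ("w1", "web", "Beta Energy: sustainability and biofuel strategy"),
]
ngrams = ["biodiesel", "vegetable oil", "biofuel", "sustainability"]  # ontology 1
companies = ["Acme Fuels", "Beta Energy"]                             # ontology 2 (made up)

def build_index(terms, docs):
    """Index: term -> ids of documents mentioning it (case-insensitive substring match)."""
    index = defaultdict(set)
    for doc_id, _source, text in docs:
        lowered = text.lower()
        for term in terms:
            if term.lower() in lowered:
                index[term].add(doc_id)
    return index

ngram_index = build_index(ngrams, documents)
company_index = build_index(companies, documents)

# Cross index: for each (company, n-gram) pair, the number of documents mentioning both.
cross_index = {
    (company, ngram): len(company_index[company] & ngram_index[ngram])
    for company in companies
    for ngram in ngrams
}
```

Here `cross_index[("Acme Fuels", "biodiesel")]` is 2, drawing on one patent and one news article. Combining sources in a single count is the mashing.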
Examples for Illustration
• Data charted: Two themes in biofuels
• Data charted: Four firms, four indicators, pertaining to biofuels
• Discussion of energy efficiency
Comparing Trends: Two Themes in Biofuels
[Chart: years 1999-2003; series: gov’t requirements, sustainability]
Sustainability gains in importance; government regulations do not.
Associations: Comparison of 4 firms on 4 indicators
[Chart: indicators: biodiesel, vegetable oil, biofuel, sustainability; 4 biofuels firms: Neste Oil, US Bioenergy, Biodiesel International, Cargill]
Discussion of Energy Efficiency by Year
| Year | Indicator Value |
|---|---|
| 1999 | 955 |
| 2000 | 919 |
| 2001 | 1047 |
| 2002 | 1068 |
| 2003 | 987 |
Top Firms by Discussion of Energy Efficiency (1999-2003)
| Firm | Indicator Value |
|---|---|
| Toyota | 80 |
| Canon | 72 |
| General Electric | 48 |
| U. Cal. Regents | 40 |
| Matsushita | 34 |
| Fuji Xerox | 31 |
| ... | ... |
Top Firms by Discussion of Energy Efficiency and Biofuels (1999-2003)
| Firm | Indicator Value |
|---|---|
| Dakota Ag Energy, Inc | 2 |
| Squirrel Holdings, Ltd | 1 |
| Valmet Fibertech AB | 1 |
Classification of Patents Discussing Energy Efficiency and Biofuels
| Firm | No. | Class | Description |
|---|---|---|---|
| Dakota Ag Energy, Inc | 2 | 435 | Chemistry: molecular biology and microbiology |
| Valmet Fibertech AB | 1 | 34 | Drying and gas or vapor contact with solids |
| Squirrel Holdings, Ltd | 1 | 429 | Chemistry: electrical current producing apparatus, product and process |
Classification of High-Value Indicators
| Class | Indicator | Description |
|---|---|---|
| 62 | 309 | Refrigeration |
| 429 | 153 | Chemistry: electrical current producing apparatus |
| 347 | 148 | Incremental printing of symbolic information |
| 219 | 129 | Electric heating |
| ... | ... | ... |
Energy Efficiency Related Patents in Chemicals-Related Companies
[Bar chart: companies: Air Prod., Praxair, 3M, Du Pont, Kimberly; series: # of Patents, Relevancy; axis 0-30]
Energy Efficiency Related Patents in Chemicals-Related Topics
[Bar chart: topics: chem of inorganic, chem:electronic, chem app & proc, drying, gas sep, heat exch; axis: # of patents, 0-100]
These Are All Examples of Data Derived from Text
• Many other innovative types of queries are possible. See the paper for more examples.
• Key elements: expertise embodied in ontologies, which are used to index and cross index the document collections; Data = (i) Measures of (ii) Variables (Indicators) in (iii) Categories
Summary
• The (natural) magic of data
• Leads to: Can we get (useful) data from texts?
• Not “IR” or “IE”, but cheerfully accepting contributions from both fields
Vision
• Texts, widely (& internationally) collected, categorized, indexed
• Patents, regulatory filings, regulations, technical reports, news articles, ...
• A “new world” of readily-available indicators, derived from these texts, evaluated, and used for decision making along with traditional data sources.
Briefing in Three Parts
A. Background on obtaining information from text.
B. New ideas in text mining.
• Vaim concept.
C. Design exercise.
• High-level specification of requirements for business intelligence and scanning systems
But first... Paul Kleindorfer
So Now a Design Methodology
• Phase 1: Requirements. Two key questions:
• What are the indicators to track?
• In what categories are the indicators to be matched?
• Phase 2: Requirements validation.
• How can these requirements be met?
• NB Possibility of surrogate variables
• NB IID (iterative & incremental design) is required
• How to monitor and revise?
• Are the indicators indicating?
• What’s new and different?
Phase 1: Build a Specification Table (Example)
• Rows: Variables. Columns: Categories.
Years, Intellectual Properties, Products, Firms
OntologyTerms 1999-2005
“sustainability”, “biofuels”
OntologyTerms
All products
Terms: a, b, c, and d
energy efficiency
1996-2006 All firms
Phase 2: How Can It Be Done?
• Energy efficiency for Years x Firms
• Annual reports and news articles are dated. Obtain a list of firms. Obtain annual reports and/or news articles about the firms. Measure and report “energy efficiency” and related terms in these documents.
Row 3 of the table.
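The “measure and report” step for this row can be sketched as a count of term occurrences per (year, firm) cell. Everything concrete below is assumed for illustration: the documents, the firm names, the choice of “fuel efficiency” as a related term, and raw occurrence counts as the measure.

```python
from collections import Counter

# Hypothetical dated documents: (year, firm, text).
documents = [
    (2001, "Firm A", "Annual report: energy efficiency improved; energy efficiency targets set."),
    (2001, "Firm B", "News: new plant opened."),
    (2002, "Firm A", "News: investment in energy efficiency."),
    (2002, "Firm B", "Annual report: energy efficiency and fuel efficiency programs."),
]
# The indicator term plus one assumed related term.
terms = ["energy efficiency", "fuel efficiency"]

def indicator_table(docs, terms):
    """Measure: occurrences of the terms. Categories: (year, firm) cells."""
    table = Counter()
    for year, firm, text in docs:
        lowered = text.lower()
        table[(year, firm)] += sum(lowered.count(term) for term in terms)
    return table

table = indicator_table(documents, terms)
# table[(2001, "Firm A")] is 2; table[(2001, "Firm B")] is 0.
```

Each cell is a measure of a variable in a category, so the output is ordinary data and can be charted or loaded into the usual analysis tools.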
Phase 2: How Can It Be Done?
• “sustainability”, “biofuels” for Years by IP
• Patents and patent applications are publicly available. They are dated (and dates can be extracted). Measure and report on “sustainability”, “biofuels” and related terms in these documents.
Row 1 of the table.
Phase 2: How Can It Be Done?
• Terms a, b, c and d for Products.
• Obtain a product classification scheme. For each taxon in the classification scheme, obtain descriptive texts. Measure and report on a, b, c and d, and related terms, in these documents.
Row 2 of the table.
In Summary, Part C
• The fundamentals are not new. To scan seriously, to undertake business intelligence seriously, you first identify what variables (indicators) and what categories you are interested in. Then you get the data.
• The news here is that in getting the data, collections of text may well be key sources.
• Should we try it?
Backup Slides
• Only if needed
Information Mashing
Information Mashing. Don't you just love that term? It's one of the major goals of Sunlight and while we've been working on it for the past couple of months we have a ways to go before it happens in any substantial way. Our goal is simple: integrate in a user-friendly way individual data sets (like campaign contributions, lobbyists and government contracts) that makes the whole larger than the sum of its parts. We'd like to create something we've dubbed an “Accountability Matrix.” A website where, with one click you can look up a major donor and see not just their campaign contributions, but also their lobbying expenditures, the names of members who've flown on their private jet, the names of former congressional staffers they've hired, and so on. In a nutshell, we want to make information more liquid and more accessible to the public.
Great Idea...but...
• Needs to be generalized
• Vaim: value-added information mash
• Add linking, special knowledge processes, etc.
• Needs to be operationalized
• Main subject of this talk now
• Focus: Deriving indicators from texts
In the interests of time...
See accompanying draft paper, “On Deriving Indicators from Texts”
Will now summarize our points/position
Under construction: MV:Biofuels (minimal vaim)
But first...
Requirements
• (For deriving data from texts)
• “Mashing and mining”
• Mashing: integrate information from multiple sources
• Mining: as in “data mining”; find patterns in the data; pattern-oriented v. record-oriented
Just what are data?
• Answer: Measures of indicators in categories
• Measures: induce numbers
• Indicators: numbers of what? Weight, height, count
• Categories: slice and dice the universe; patterns are constituted by measures of indicators across categories
How can we categorize texts?
• By classification schemes: firms, products, materials; broadly: ontologies (biofuels n-grams in MV:Biofuels)
• By source: Where did the docs come from?
• By conditions; WHERE, HAVING
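Once counts derived from text sit in a table, categorizing “by conditions” is ordinary SQL. The sketch below uses Python's built-in `sqlite3` module; the `mentions` table, its columns, and its rows are hypothetical and only illustrate the difference between WHERE and HAVING.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table of counts derived from documents.
conn.execute(
    "CREATE TABLE mentions (firm TEXT, year INTEGER, source TEXT, term TEXT, n INTEGER)"
)
conn.executemany(
    "INSERT INTO mentions VALUES (?, ?, ?, ?, ?)",
    [
        ("Firm A", 2001, "patent", "biofuels", 3),
        ("Firm A", 2002, "news", "biofuels", 4),
        ("Firm B", 2002, "patent", "biofuels", 1),
        ("Firm B", 2002, "news", "sustainability", 5),
    ],
)
rows = conn.execute(
    """
    SELECT firm, SUM(n) AS total
    FROM mentions
    WHERE term = 'biofuels'   -- condition on individual rows
    GROUP BY firm             -- category: firm
    HAVING SUM(n) >= 2        -- condition on each category's aggregate
    ORDER BY firm
    """
).fetchall()
# rows is [('Firm A', 7)]
conn.close()
```

WHERE filters the underlying measurements and HAVING filters the categories after aggregation. Grouping by `source` instead of `firm` would give the by-source categorization from the previous bullet.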
News
• We did it (as a prototype) and it works, with MV:Biofuels
• Future:
• Demonstrate & explore practical usage and value
• Extend: Broader and deeper; more, more, faster, faster