similarity and distance measures for hierarchical taxonomies


Upload: rutgers-university

Post on 14-Jun-2015


DESCRIPTION

Overview of Robert C. McNamee’s Paper "Can’t See the Forest for the Leaves: Similarity and Distance Measures for Hierarchical Taxonomies with a Patent Classification Example"

TRANSCRIPT

Page 1: Similarity and Distance Measures for Hierarchical Taxonomies
Page 2: Similarity and Distance Measures for Hierarchical Taxonomies

Introduction

• World around us is filled with complex phenomena which we describe with hierarchical categorization systems (taxonomies)
• Researchers often conceptualize phenomena in their full complexity but underutilize the taxonomies that describe them.
• Why?
– Under-appreciation of taxonomical data
– De-facto acceptance of methods that under-utilize taxonomies
• So What?
– Risk drawing false conclusions
– Avoid questions that cannot be analyzed with current methods
– Under-develop new taxonomies
– Eventually, conceive of underlying phenomena in less complex ways

Page 3: Similarity and Distance Measures for Hierarchical Taxonomies

Linnaean Taxonomy Example

Page 4: Similarity and Distance Measures for Hierarchical Taxonomies

Phenomenon: Technology Space

• All Inclusive:
– “Technology Space is the universal set of all possible technological ideas” (Olsson & Frey, 2002, p. 71)
• Fully-Connected & Continuous:
– The “technology set is always coherent and is not made up of islands in idea space.” (Olsson & Frey, 2002, p. 72)
• Distance
– Any two technologies within the totality of technology space can be considered to be somewhat similar or distant from one another.
– “For instance the two ideas ‘steel’ and ‘the Bessemer process’ are more closely related than the ideas ‘the Bessemer process’ and ‘the spinning wheel’” (Olsson & Frey, 2002, p. 71)

Page 5: Similarity and Distance Measures for Hierarchical Taxonomies

Patent Classification Taxonomy

http://uspto.gov/go/classification/selectnumwithtitle.htm
http://www.uspto.gov/web/offices/opc/documents/classescombined.pdf

Page 6: Similarity and Distance Measures for Hierarchical Taxonomies

USCL Hierarchy (Class and above)

Page 7: Similarity and Distance Measures for Hierarchical Taxonomies

USCL Class 704 (Subclass Level)

Page 8: Similarity and Distance Measures for Hierarchical Taxonomies

Traditional Distance Measures

• Jaffe’s Distance Measure (1986)
– Used at various levels to combine the patents associated with a country, firm, subsidiary, or inventor
– A specific level of technological aggregation such as Jaffe’s category, Jaffe’s subcategory, USPTO class, USPTO subclass, IPC 1 digit, IPC 3 digit, IPC 4 digit, etc. is chosen
– The uncentered correlation (cosine similarity) is calculated and this is subtracted from 1 to create a distance measure.
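A minimal sketch of the calculation described above, assuming each entity is represented as a mapping from classification to patent count at one chosen level of aggregation. Class 704 appears elsewhere in these slides; the other class labels and all counts are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Uncentered correlation between two classification-count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an entity with no classifications shares nothing
    return dot / (norm_a * norm_b)


def jaffe_distance(a, b):
    """Distance = 1 - cosine similarity, as stated on the slide."""
    return 1.0 - cosine_similarity(a, b)


# Hypothetical class-level patent counts for three entities.
firm_a = Counter({"704": 3, "382": 1})
firm_b = Counter({"704": 1, "382": 3})
firm_c = Counter({"123": 4})

print(jaffe_distance(firm_a, firm_b))  # overlapping classes
print(jaffe_distance(firm_a, firm_c))  # no shared class
```

Entities with no class in common always come out at the maximum distance of 1, whatever the relationship between their classes in the hierarchy.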

Page 9: Similarity and Distance Measures for Hierarchical Taxonomies

Limitations of Traditional Measures

• Independence
– Assume that groupings at the level of technological aggregation chosen are unrelated / independent
• Trade-off: Go higher in the hierarchy
– Over-emphasize the similarities while ignoring differences at lower levels

[Figure: IPC codes A21B, A21C, B60F, and A01M shown with equals signs; illustration not recoverable from the transcript]

• Within Field / Industry
– Within field / industry measures are largely homogeneous at higher levels of technological aggregation.
• Patent Level
– Patent level measures reduce down to a 0-1 dummy.
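The limitations above can be illustrated with a small sketch, assuming single-classification patents and the IPC subclass codes that appear on this slide. The helper `aggregate` is a hypothetical stand-in for "going higher in the hierarchy" by truncating a code to its first characters.

```python
import math


def cosine_similarity(a, b):
    """Uncentered correlation between two classification-count dicts."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def aggregate(vec, prefix_len):
    """Collapse codes to a higher level by keeping only a code prefix."""
    out = {}
    for code, count in vec.items():
        key = code[:prefix_len]
        out[key] = out.get(key, 0) + count
    return out


# One patent each, one classification each.
a21b = {"A21B": 1}
a21c = {"A21C": 1}
b60f = {"B60F": 1}

# Subclass level: the near neighbour and the distant code look equally
# dissimilar, and every patent pair scores exactly 0 or 1.
print(cosine_similarity(a21b, a21c), cosine_similarity(a21b, b60f))

# Class level (first three characters): A21B and A21C become identical,
# so the difference between them disappears entirely.
print(cosine_similarity(aggregate(a21b, 3), aggregate(a21c, 3)))
```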

Page 10: Similarity and Distance Measures for Hierarchical Taxonomies

Taxonomically Appropriate Measure

Just two modifications/extensions necessary:

1. Use “Class and Subclass Array Expansion” (to include all super-ordinate classifications implicitly included with each classification)
2. Use IDF weighting of each classification (to take into account the actual distribution of invention across technological space)

[Equation not recovered from the slide; its terms were defined as follows]
– Number of times classification i is assigned to entity A
– Number of times classification i is assigned to entity B
– Frequency of patents classified within the subtree subsumed by the parent of classification i
– Frequency of patents classified within the subtree subsumed by classification i
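The slide's equation did not survive in this transcript, so the following is only one plausible reading built from the four term definitions above, not the paper's confirmed formula. All symbols are assumptions: $A_i$ and $B_i$ are the counts for entities A and B, $F_i$ is the frequency within the subtree of classification $i$, and $F_{p(i)}$ is the frequency within the subtree of its parent.

```latex
% Assumed weight: information content of classification i given its parent
w_i = -\log\frac{F_i}{F_{p(i)}}

% Assumed similarity: cosine over the expanded, weighted arrays
\mathrm{sim}(A,B) =
  \frac{\sum_i (w_i A_i)(w_i B_i)}
       {\sqrt{\sum_i (w_i A_i)^2}\,\sqrt{\sum_i (w_i B_i)^2}}
```

Under this reading the sums run over every node of the expanded arrays, so super-ordinate classifications contribute to the similarity as well as leaf subclasses.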

Page 11: Similarity and Distance Measures for Hierarchical Taxonomies

Class & Subclass Array Expansion

“Subclasses inherit all the properties of their parent Subclass. This means that every Subclass title is interpreted to include the title of its parent Subclass; its definition is interpreted to include the definition of its parent Subclass; etc” (USPTO Overview of the Classification System, 2007, p. 9)

Dimension / Level | Classification (Dimension Name) | Description
1 | G1-02 | COMMUNICATIONS, RADIANT ENERGY, WEAPONS, ELECTRICAL, AND COMPUTER ARTS
2 | G1-02/G2-05 | …/CALCULATORS, COMPUTERS, OR DATA PROCESSING SYSTEMS
3 | G1-02/G2-05/704 | …/DATA PROCESSING: SPEECH SIGNAL PROCESSING, LINGUISTICS, LANGUAGE TRANSLATION, AND AUDIO COMPRESSION-DECOMPRESSION
4 | G1-02/G2-05/704/200 | …/SPEECH SIGNAL PROCESSING
5 | G1-02/G2-05/704/200/231 | …/Recognition
6 | G1-02/G2-05/704/200/231/232 | …/Neural network
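The expansion in the table above can be sketched as follows, assuming a classification is written as a slash-separated path from the top of the hierarchy, as in the table. The function names are hypothetical.

```python
from collections import Counter


def expand_classification(path, sep="/"):
    """Return the classification plus all of its super-ordinate levels."""
    parts = path.split(sep)
    return [sep.join(parts[:depth]) for depth in range(1, len(parts) + 1)]


def expanded_counts(paths):
    """Build an entity's expanded array: one count per node at every level."""
    counts = Counter()
    for path in paths:
        counts.update(expand_classification(path))
    return counts


levels = expand_classification("G1-02/G2-05/704/200/231/232")
for depth, name in enumerate(levels, start=1):
    print(depth, name)
```

Two classifications that differ only at the lowest level then share every higher dimension, so they are no longer treated as unrelated.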

Page 12: Similarity and Distance Measures for Hierarchical Taxonomies

IDF Weighting

• Distribution of Technology Space
– “In information theory (Cover & Thomas, 1991) the information contained in a statement is measured by the negative logarithm of the probability of that statement” (Lin 1998, p. 297)
• Intuitive logic:
– Frequent concepts or dimensions are less informative than rare ones (Elkan 2005; Robertson 2004; Aizawa 2003)
– Also known as Inverse Document Frequency (IDF) weighting
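A rough sketch of how such weighting could be combined with array expansion. It assumes each node's weight is the negative log of its subtree frequency relative to its parent's subtree frequency, which is one reading of the slides and not necessarily the paper's exact formula. The IPC codes come from an earlier slide; the corpus counts and all function names are hypothetical, and every classification passed in is assumed to occur in the corpus.

```python
import math
from collections import Counter


def expand(path, sep="/"):
    """Classification plus all super-ordinate levels."""
    parts = path.split(sep)
    return [sep.join(parts[:d]) for d in range(1, len(parts) + 1)]


def subtree_frequencies(corpus):
    """For every node, count the classifications that fall in its subtree."""
    freq = Counter()
    for path in corpus:
        freq.update(expand(path))
    return freq


def idf_weight(node, freq, total, sep="/"):
    """Assumed weight: -log(frequency of node / frequency of its parent)."""
    parent = node.rpartition(sep)[0]
    parent_freq = freq[parent] if parent else total
    return -math.log(freq[node] / parent_freq)


def taxonomic_similarity(paths_a, paths_b, freq, total):
    """Cosine similarity over expanded, IDF-weighted arrays."""
    a = Counter(n for p in paths_a for n in expand(p))
    b = Counter(n for p in paths_b for n in expand(p))
    wa = {n: c * idf_weight(n, freq, total) for n, c in a.items()}
    wb = {n: c * idf_weight(n, freq, total) for n, c in b.items()}
    dot = sum(wa[n] * wb[n] for n in wa.keys() & wb.keys())
    norm_a = math.sqrt(sum(v * v for v in wa.values()))
    norm_b = math.sqrt(sum(v * v for v in wb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical corpus written as IPC section/class/subclass paths.
corpus = (["A/A21/A21B"] * 4 + ["A/A21/A21C"] * 2
          + ["A/A01/A01M"] * 2 + ["B/B60/B60F"] * 2)
freq = subtree_frequencies(corpus)
total = len(corpus)

sim_same_class = taxonomic_similarity(["A/A21/A21B"], ["A/A21/A21C"], freq, total)
sim_same_section = taxonomic_similarity(["A/A21/A21B"], ["A/A01/A01M"], freq, total)
sim_unrelated = taxonomic_similarity(["A/A21/A21B"], ["B/B60/B60F"], freq, total)
print(sim_same_class, sim_same_section, sim_unrelated)
```

With these toy counts, similarity falls as the shared ancestry gets shorter and reaches zero only when two classifications share no ancestor. A node that accounts for all of its parent's patents receives zero weight, since it adds no information beyond the parent.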

Page 13: Similarity and Distance Measures for Hierarchical Taxonomies

Patent Example

[Figure: diagram with nodes labeled 1 through 7; details not recoverable from the transcript]

Page 14: Similarity and Distance Measures for Hierarchical Taxonomies

Dataset #1: Traditional Methods

• Class 704: 8,713 patents, 1976 – 2008, up to 17 classes per patent
• 48,150 patent-patent citation dyads (within technology field flows)

[Figure: grid of histograms; columns: Primary Only, All Classifications; rows: Class Level, Class-Subclass Level]

Graphs show frequency of similarity calculations within samples
Left most is similarity = 0
Right most is similarity = 1

Page 15: Similarity and Distance Measures for Hierarchical Taxonomies

Dataset #1: Taxonomical Method

[Figure panels: Primary Only; All Classifications]

Page 16: Similarity and Distance Measures for Hierarchical Taxonomies

Dataset #1: Traditional vs. Taxonomical

[Figure panels: Subclass Level; Class Level]

Page 17: Similarity and Distance Measures for Hierarchical Taxonomies

Dataset #2: Traditional Methods

• NBER Patent Dataset (Patents 1980–2000)
• “Organizations” created based on Assignee ID
• Random 1% sample selected
• 718 Organizations with 12,993 patents (primary classifications only)
• All pairwise comparisons (257,403 unique dyads)

[Figure panels: Class Level; Jaffe Subcategory Level; Jaffe Category Level]
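The dyad count quoted for this dataset follows directly from the number of organizations, since every unordered pair is compared once. A quick check, assuming Python 3.8 or later for `math.comb`:

```python
import math

organizations = 718

# Unordered pairs of distinct organizations: n * (n - 1) / 2
unique_dyads = math.comb(organizations, 2)
print(unique_dyads)
```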

Page 18: Similarity and Distance Measures for Hierarchical Taxonomies

Dataset #2: Taxonomical Method

[Figure: three panels comparing the taxonomical measure vs. Class, vs. Subcategory, and vs. Category]

• 135,000+ Class Level
• 100,000+ Subcategory Level
• 40,000+ Category Level

Significant Effects:
• Traditional = zero similarity
• Taxonomical = non-zero similarity

Page 19: Similarity and Distance Measures for Hierarchical Taxonomies

Conclusions

• Actual utility when applied to specific research questions.
• Some evidence of the worth of these methods

Necessary
– Tests questions where traditional methods are meaningless
Flexible
– Applies to all levels of analysis
Reasonable
– Evidence of the relationship to traditional methods
Valuable
– Greater variation and continuity
– More meaningfulness to underlying theoretical phenomenon