Similarity and Distance Measures for Hierarchical Taxonomies
Overview of Robert C. McNamee’s paper "Can’t See the Forest for the Leaves: Similarity and Distance Measures for Hierarchical Taxonomies with a Patent Classification Example"
Introduction
• The world around us is filled with complex phenomena, which we describe with hierarchical categorization systems (taxonomies).
• Researchers often conceptualize phenomena in their full complexity but underutilize the taxonomies that describe them.
• Why?
– Under-appreciation of taxonomical data
– De facto acceptance of methods that underutilize taxonomies
• So what?
– Risk drawing false conclusions
– Avoid questions that cannot be analyzed with current methods
– Under-develop new taxonomies
– Eventually, conceive of underlying phenomena in less complex ways
Linnaean Taxonomy Example
Phenomenon: Technology Space
• All-inclusive:
– “Technology Space is the universal set of all possible technological ideas” (Olsson & Frey, 2002, p. 71)
• Fully connected and continuous:
– The “technology set is always coherent and is not made up of islands in idea space.” (Olsson & Frey, 2002, p. 72)
• Distance:
– Any two technologies within the totality of technology space can be considered more or less similar to, or distant from, one another.
– “For instance the two ideas ‘steel’ and ‘the Bessemer process’ are more closely related than the ideas ‘the Bessemer process’ and ‘the spinning wheel’” (Olsson & Frey, 2002, p. 71)
Patent Classification Taxonomy
http://uspto.gov/go/classification/selectnumwithtitle.htm
http://www.uspto.gov/web/offices/opc/documents/classescombined.pdf
USCL Hierarchy (Class and above)
USCL Class 704 (Subclass Level)
Traditional Distance Measures
• Jaffe’s Distance Measure (1986)
– Used at various levels to combine the patents associated with a country, firm, subsidiary, or inventor
– A specific level of technological aggregation is chosen, such as Jaffe’s category, Jaffe’s subcategory, USPTO class, USPTO subclass, IPC 1 digit, IPC 3 digit, IPC 4 digit, etc.
– The uncentered correlation (cosine similarity) is calculated and subtracted from 1 to create a distance measure.
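The calculation in the last bullet can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: the function name `jaffe_distance` and the three example firms with their class assignments are hypothetical, and each entity is represented simply as a list of the classes assigned to its patents at one chosen level of aggregation.

```python
import math
from collections import Counter


def jaffe_distance(classes_a, classes_b):
    """Return 1 - cosine similarity of two entities' classification counts.

    Each argument is a list of classification labels (one per assignment)
    at a single, fixed level of aggregation. Hypothetical helper.
    """
    a, b = Counter(classes_a), Counter(classes_b)
    # Dot product over the classifications the two entities share.
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("each entity needs at least one classification")
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical class-level portfolios (assumed for illustration only).
firm_x = ["704", "704", "381"]
firm_y = ["704", "382"]
firm_z = ["423"]

print(jaffe_distance(firm_x, firm_y))  # partial overlap: between 0 and 1
print(jaffe_distance(firm_x, firm_z))  # no shared class: exactly 1.0
```

Note that the distance is 1.0 whenever two entities share no classification at the chosen level, regardless of how close those classifications sit in the hierarchy.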
Limitations of Traditional Measures
• Independence
– Assume that groupings at the chosen level of technological aggregation are unrelated / independent
• Trade-off: Go higher in the hierarchy
– Over-emphasize the similarities while ignoring differences at lower levels
[Figure: pairwise comparisons among IPC codes A21B, A21C, B60F, and A01M, each marked = or ≠]
• Within Field / Industry
– Within-field / within-industry measures are largely homogeneous at higher levels of technological aggregation.
• Patent Level
– Patent-level measures reduce to a 0-1 dummy.
Taxonomically Appropriate Measure
Just two modifications/extensions are necessary:
1. Use “Class and Subclass Array Expansion” (to include all super-ordinate classifications implicitly included with each classification)
2. Use IDF weighting of each classification (to take into account the actual distribution of invention across technological space)
The measure is built from four quantities:
– Number of times classification i is assigned to entity A
– Number of times classification i is assigned to entity B
– Frequency of patents classified within the subtree subsumed by the parent of classification i
– Frequency of patents classified within the subtree subsumed by classification i
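One plausible way to combine the assignment counts with a per-classification weight is a weighted cosine similarity. The sketch below is an assumption, not the paper's formula, which does not survive in this transcript: the function name `weighted_cosine`, the dimension labels (`root`, `root/x`, ...), and the weight values are all hypothetical. The counts are taken over the expanded dimensions, so each entity also counts the ancestors of its classifications.

```python
import math


def weighted_cosine(counts_a, counts_b, weights):
    """Cosine similarity with a non-negative weight per dimension.

    counts_a / counts_b map dimension name -> assignment count;
    weights maps dimension name -> weight (assumed form, see lead-in).
    """
    def wdot(x, y):
        return sum(weights[k] * x[k] * y[k] for k in x.keys() & y.keys())

    denom = math.sqrt(wdot(counts_a, counts_a) * wdot(counts_b, counts_b))
    if denom == 0:
        raise ValueError("each entity needs at least one weighted dimension")
    return wdot(counts_a, counts_b) / denom


# Hypothetical expanded counts: both entities share the two upper levels
# and differ only at the lowest one.
entity_a = {"root": 1, "root/x": 1, "root/x/1": 1}
entity_b = {"root": 1, "root/x": 1, "root/x/2": 1}

# Hypothetical weights: common upper levels carry little information.
weights = {"root": 0.1, "root/x": 0.5, "root/x/1": 2.0, "root/x/2": 2.0}
flat = {k: 1.0 for k in weights}

similarity = weighted_cosine(entity_a, entity_b, weights)
unweighted = weighted_cosine(entity_a, entity_b, flat)
print(similarity, unweighted)
```

With these assumed weights, the shared upper levels contribute less than they would unweighted, so the similarity is non-zero but smaller than the flat cosine.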
Class & Subclass Array Expansion
“Subclasses inherit all the properties of their parent Subclass. This means that every Subclass title is interpreted to include the title of its parent Subclass; its definition is interpreted to include the definition of its parent Subclass; etc” (USPTO Overview of the Classification System, 2007, p. 9)
Dimension / Level | Classification (Dimension Name) | Description
1 | G1-02 | COMMUNICATIONS, RADIANT ENERGY, WEAPONS, ELECTRICAL, AND COMPUTER ARTS
2 | G1-02/G2-05 | …/CALCULATORS, COMPUTERS, OR DATA PROCESSING SYSTEMS
3 | G1-02/G2-05/704 | …/DATA PROCESSING: SPEECH SIGNAL PROCESSING, LINGUISTICS, LANGUAGE TRANSLATION, AND AUDIO COMPRESSION-DECOMPRESSION
4 | G1-02/G2-05/704/200 | …/SPEECH SIGNAL PROCESSING
5 | G1-02/G2-05/704/200/231 | …/Recognition
6 | G1-02/G2-05/704/200/231/232 | …/Neural network
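The expansion in the table can be sketched as a prefix enumeration. This is a minimal illustration that assumes classifications are written as slash-separated paths, as in the table; the function name `expand` is hypothetical.

```python
def expand(classification, sep="/"):
    """Return the classification plus all of its super-ordinate levels.

    Assumes the path notation of the table above, with one component
    per level of the hierarchy.
    """
    parts = classification.split(sep)
    return [sep.join(parts[:i]) for i in range(1, len(parts) + 1)]


neural_network = expand("G1-02/G2-05/704/200/231/232")
speech_processing = expand("G1-02/G2-05/704/200")

for level, name in enumerate(neural_network, start=1):
    print(level, name)

# Dimensions shared by the two classifications after expansion.
shared = set(neural_network) & set(speech_processing)
print(len(shared))  # 4
```

After expansion, two classifications that would be treated as unrelated at the subclass level share every dimension down to their lowest common ancestor.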
IDF Weighting
• Distribution of Technology Space
– “In information theory (Cover & Thomas, 1991) the information contained in a statement is measured by the negative logarithm of the probability of that statement” (Lin 1998, p. 297)
• Intuitive logic:
– Frequent concepts or dimensions are less informative than rare ones (Elkan 2005; Robertson 2004; Aizawa 2003)
– Also known as Inverse Document Frequency (IDF) weighting
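The negative-logarithm idea can be sketched as follows. The function name `information` and the patent counts are hypothetical, and taking the probability as the frequency within a classification's subtree relative to its parent's subtree is an assumption about how the two frequencies are combined.

```python
import math


def information(freq, parent_freq):
    """Negative log of the (assumed) probability freq / parent_freq.

    freq: patents classified within the subtree of a classification.
    parent_freq: patents classified within the subtree of its parent.
    """
    if not 0 < freq <= parent_freq:
        raise ValueError("need 0 < freq <= parent_freq")
    return -math.log(freq / parent_freq)


# Hypothetical counts: a common and a rare child of the same parent.
common = information(50, 100)
rare = information(5, 100)
print(common, rare)  # the rare classification gets the larger weight
```

A classification that contains every patent of its parent gets weight zero under this assumption, since it adds no information beyond the parent.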
Patent Example
[Figure: patent example diagram]
Dataset #1: Traditional Methods
• Class 704: 8,713 patents, 1976–2008, up to 17 classes per patent
• 48,150 patent-patent citation dyads (within technology field flows)
[Figure: grid of graphs. Columns: Primary Only, All Classifications. Rows: Class Level, Class-Subclass Level]
Graphs show the frequency of similarity values within the samples.
Leftmost is similarity = 0.
Rightmost is similarity = 1.
Dataset #1: Taxonomical Method
[Figure panels: Primary Only, All Classifications]
Dataset #1: Traditional vs. Taxonomical
[Figure panels: Subclass Level, Class Level]
Dataset #2: Traditional Methods
• NBER Patent Dataset (patents 1980–2000)
• “Organizations” created based on Assignee ID
• Random 1% sample selected
• 718 organizations with 12,993 patents (primary classifications only)
• All pairwise comparisons (257,403 unique dyads)
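The dyad count follows from the number of organizations: n organizations give n(n-1)/2 unique unordered pairs. A quick check in Python (variable names are illustrative):

```python
import math

n_organizations = 718
# Unique unordered pairs of organizations: n choose 2.
n_dyads = math.comb(n_organizations, 2)
print(n_dyads)  # 257403
```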
[Figure panels: Class Level, Jaffe Subcategory Level, Jaffe Category Level]
Dataset #2: Taxonomical Method
[Figure panels: vs. Class, vs. Subcategory, vs. Category]
• 135,000+ Class Level
• 100,000+ Subcategory Level
• 40,000+ Category Level
Significant Effects:
• Traditional = zero similarity
• Taxonomical = non-zero similarity
Conclusions
• Actual utility when applied to specific research questions.
• Some evidence of the worth of these methods
Necessary
– Tests questions where traditional methods are meaningless
Flexible
– Applies to all levels of analysis
Reasonable
– Evidence of the relationship to traditional methods
Valuable
– Greater variation and continuity
– More meaningful with respect to the underlying theoretical phenomenon