case study: building the asce thesaurus

LESSONS LEARNED FROM PREPARING ASCE TAXONOMY FOR MACHINE AIDED INDEXING (MAI) Xi Van Fleet Senior Manager of Information Services Publishing Technology Department Publication Division American Society of Civil Engineers

Upload: accessinnovations

Post on 11-May-2015


DESCRIPTION

Xi Van Fleet of the American Society of Civil Engineers (ASCE) shares her experience on rule-building, utilizing Access Innovations' Data Harmony machine-aided indexing software, as well as free online resources.

TRANSCRIPT

Page 1: Case Study:  Building the ASCE Thesaurus

LESSONS LEARNED FROM PREPARING ASCE TAXONOMY FOR MACHINE AIDED INDEXING (MAI)

Xi Van Fleet

Senior Manager of Information Services

Publishing Technology Department

Publication Division

American Society of Civil Engineers

Page 2: Case Study:  Building the ASCE Thesaurus

Publications of the American Society of Civil Engineers

A Brief History

The American Society of Civil Engineers (ASCE) was founded in 1852. We are the oldest engineering society in the United States.

Our first publication, Transactions of the American Society of Civil Engineers, was published in 1872. It is the predecessor of our journals.

The first monograph was published in 1892.

Page 3: Case Study:  Building the ASCE Thesaurus

Publications of the American Society of Civil Engineers

Today

Leading publisher in civil engineering

34 Peer-reviewed journals

Books and standards

Conference proceedings

Magazines

Page 4: Case Study:  Building the ASCE Thesaurus

Online Civil Engineering Knowledge Environment

250+ ASCE e-book titles

65 ASCE Standards

Proceedings volumes with 42,000 papers from 2000 to present

Peer-reviewed journals with 60,000 papers from 1983 to present

More than 220,000 records with complete coverage of ASCE publications

Full-text database

Bibliographic database

Page 5: Case Study:  Building the ASCE Thesaurus

Content driven

Overlapping with other engineering disciplines, e.g. chemical engineering, mechanical engineering, material engineering

Strong on core disciplines: e.g. structural engineering, geotechnical engineering

Weaker on peripheral disciplines: Aerospace engineering, energy engineering

ASCE Taxonomy

Page 6: Case Study:  Building the ASCE Thesaurus

The taxonomy project started in 2009

Access Innovations created the first version based on the existing CEDB subject headings and data mined from the content

The draft contained over 30,000 terms. We divided it into three individual taxonomies:

Technical topics

Geographic terms

ASCE corporate

In-house subject experts of different disciplines were invited to validate the technical topics.

Project History

Page 7: Case Study:  Building the ASCE Thesaurus

“Final” Version of Taxonomy of Technical Topics

Preferred terms: 2440

Equivalent terms: 3167

Top terms: 22

Terms with "Related Terms": 488

Terms with "Non-Preferred Terms": 1320

Page 8: Case Study:  Building the ASCE Thesaurus

Prepare ASCE Taxonomy for Machine Aided Indexing (MAI)

• Taxonomy enrichment

• Rule building

Page 9: Case Study:  Building the ASCE Thesaurus

Taxonomy Enrichment

Add Equivalent /Non-preferred Terms

• Alternative spelling: Analysis – Analyses; Modeling vs. modelling

• Irregular word forms: Curricula vs. Curriculums

• Synonyms: Flood – inundation; Health care facilities – Hospitals, Nursing homes…

• Acronyms: Automated people movers – APM

• Term variation: Bedforms, Bed-forms, Bed forms

Page 10: Case Study:  Building the ASCE Thesaurus

Rule Building

Rules teach MAIStro to think like humans by providing it with context, logic, and instructions.

Simple rules

Simple conditional rules

Complex conditional rules

Page 11: Case Study:  Building the ASCE Thesaurus

Resources Used

Page 12: Case Study:  Building the ASCE Thesaurus

Some Synonyms are obvious and easy.

e.g. Preferred term: Driver behavior

Equivalent/Non-Preferred Terms

Page 13: Case Study:  Building the ASCE Thesaurus

How to find synonyms

Some synonyms are “hidden”, e.g. Agricultural wastes

Equivalent/Non-Preferred Terms

Page 14: Case Study:  Building the ASCE Thesaurus

Preferred term: Public health and safety

How to find synonyms

Equivalent/Non-Preferred Terms

Page 15: Case Study:  Building the ASCE Thesaurus

How to find synonyms

Equivalent/Non-Preferred Terms

Page 16: Case Study:  Building the ASCE Thesaurus

How to find synonyms

Preferred term: Public health and safety:

Note: in our content, “health” can also refer to a structure, a river, or the environment.

Equivalent/Non-Preferred Terms

Page 17: Case Study:  Building the ASCE Thesaurus

Preferred term: Intelligent transportation systems

How to find synonyms

Equivalent/Non-Preferred Terms

Page 18: Case Study:  Building the ASCE Thesaurus

Preferred term: High-rise buildings

e.g. Spring Temple Buddha

Tokyo Spring Tree

Preferred term: Developing countries

ASCE taxonomy term: Civil engineering landmarks

ASCE Civil engineering landmarks Award list

How to find synonyms

Equivalent/Non-Preferred Terms

Page 19: Case Study:  Building the ASCE Thesaurus

Irregular words

Preferred term: Labor

Non-preferred term: Labour

Preferred term: Structural behavior

Non-preferred term: Structural behaviour

Preferred term: Multi-story buildings

Non-preferred term: Multi-storey buildings

Preferred term: Fiber reinforced polymer

Non-preferred term: Fibre reinforced polymer

Equivalent/Non-Preferred Terms

Think about variation

Page 20: Case Study:  Building the ASCE Thesaurus

Terms made of phrase with variations

Preferred term: Lightweight concrete

Non-Preferred terms: Light-weight concrete, Light weight concrete

Preferred term: Design/Bid/Build

Non-Preferred terms: Design-bid-build, Design bid build, D/B/B, DBB, D-B-B

Equivalent/Non-Preferred Terms

Think about variation
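The variant lists on this slide follow a pattern that can be generated rather than typed by hand. The sketch below is only an illustration in Python, not part of Data Harmony; the function name `compound_variants` and its parameters are assumptions made for this example.

```python
def compound_variants(parts, tail="", separators=("", "-", " ")):
    """Join the pieces of a compound with each separator and append the tail word(s).

    `parts` are the pieces that may be written together or apart
    (e.g. ["light", "weight"]); `tail` is the unchanged remainder of the term.
    """
    variants = set()
    for sep in separators:
        head = sep.join(parts)
        variants.add(f"{head} {tail}".strip())
    return variants


# Candidate non-preferred terms for "Lightweight concrete"
print(sorted(compound_variants(["light", "weight"], "concrete")))

# Candidate variants for "Design/Bid/Build" with its own separator set
print(sorted(compound_variants(["design", "bid", "build"], separators=("/", "-", " "))))
```

Generated variants are only candidates; each one still needs a human check before it goes into the taxonomy as a non-preferred term.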

Page 21: Case Study:  Building the ASCE Thesaurus

Equivalent/Non-Preferred Terms

Terms with prefix

Bio + Preferred terms: Biobinders; Biofuels; Biocement; Biokinetics; Biofilters; Biofouling; Biogrouting; Bioleaching…

Post + Preferred terms: Postearthquakes; Postcombustion; Postcracking

Other prefixes: Pre, Micro, Macro, Super, Multi, Non, Off...

Think about variation

Page 22: Case Study:  Building the ASCE Thesaurus

Acronyms

Preferred term: Magnetic levitation trains

Non-preferred term: Maglev

Preferred term: Automated people movers

Non-preferred term: APM

Preferred term: Air traffic control

Acronym: ATC

ATC = apparent tardiness cost; Applied Technology Council … Need disambiguation

Preferred term: Intelligent transportation systems

Acronym: ITS

Be careful with acronyms

Equivalent/Non-Preferred Terms

Page 23: Case Study:  Building the ASCE Thesaurus

Create Rulebase

MAIStro automatically creates a text-to-match (TTM) rule for every term, both preferred and non-preferred.

TTM works for many terms:

Flash floods – Flash floods

Continuing education – Continuing education

Ridership – Ridership

Hydraulic engineering – Hydraulic engineering

Text that matches
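The idea of one text-to-match rule per term can be pictured with a small Python sketch. It is not MAIStro code: the `THESAURUS` sample, `build_ttm_rules`, and `index_text` are hypothetical names, and real TTM matching may treat case and word boundaries differently.

```python
import re

# Hypothetical sample of entries: preferred term -> non-preferred terms.
THESAURUS = {
    "Flash floods": [],
    "Magnetic levitation trains": ["Maglev"],
    "Lightweight concrete": ["Light-weight concrete", "Light weight concrete"],
}


def build_ttm_rules(thesaurus):
    """Map the lowercased text of every term to the preferred term it indexes."""
    rules = {}
    for preferred, non_preferred in thesaurus.items():
        for text in [preferred, *non_preferred]:
            rules[text.lower()] = preferred
    return rules


def index_text(text, rules):
    """Return the sorted preferred terms whose text-to-match occurs in `text`."""
    lowered = text.lower()
    found = set()
    for ttm, preferred in rules.items():
        # Word boundaries keep a term from matching inside a longer word.
        if re.search(r"\b" + re.escape(ttm) + r"\b", lowered):
            found.add(preferred)
    return sorted(found)


rules = build_ttm_rules(THESAURUS)
print(index_text("Maglev lines and light weight concrete decks survived flash floods.", rules))
```

Exact matching like this covers terms whose wording appears verbatim in the content; it says nothing about inflected or reworded forms.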

Page 24: Case Study:  Building the ASCE Thesaurus

Create Rulebase

Noun vs. verb vs. adjective vs. adverb

Preferred term: Corrosion

Corrosive, Corrosiveness, Corrosivity, Corroding, Corroded, Corrodible, Corrodibility…

Simple rule:

Corros* USE Corrosion

Corrod* USE Corrosion

Text that doesn't quite match (variations)
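A truncated pattern such as Corros* behaves much like a prefix match. The Python sketch below illustrates that reading with regular expressions; treating `*` as "any run of word characters" is an assumption for this example, and `wildcard_to_regex` and `apply_simple_rules` are hypothetical helpers rather than MAIStro functions.

```python
import re


def wildcard_to_regex(pattern):
    """Compile a pattern such as 'Corros*' into a case-insensitive regex.

    A trailing '*' is read here as 'any run of word characters'.
    """
    if pattern.endswith("*"):
        body = re.escape(pattern[:-1]) + r"\w*"
    else:
        body = re.escape(pattern)
    return re.compile(r"\b" + body + r"\b", re.IGNORECASE)


# The two simple rules from the slide, as (pattern, preferred term) pairs.
SIMPLE_RULES = [("Corros*", "Corrosion"), ("Corrod*", "Corrosion")]


def apply_simple_rules(text, rules=SIMPLE_RULES):
    """Return the preferred terms of all rules whose pattern occurs in `text`."""
    return {term for pattern, term in rules if wildcard_to_regex(pattern).search(text)}


print(apply_simple_rules("The corroded rebar was replaced."))
print(apply_simple_rules("Soil corrosivity was measured at each site."))
```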

Page 25: Case Study:  Building the ASCE Thesaurus

Create Rulebase

Preferred term: Lateral loads

Variations: Lateral loading; Laterally loaded…

Need simple conditional rule:

load*
IF (WITH "lateral*")
USE Lateral loads
ENDIF

Text that doesn't quite match (variations)
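A conditional rule fires only when the trigger text and the condition text occur together. The Python sketch below assumes that WITH means "in the same sentence"; the slides do not define the operator, so that reading, the naive sentence splitter, and the names `with_rule` and `prefix_regex` are all assumptions for illustration.

```python
import re


def sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def prefix_regex(pattern):
    """Compile 'load*' style patterns; a trailing '*' allows any word ending."""
    stem = pattern.rstrip("*")
    tail = r"\w*" if pattern.endswith("*") else ""
    return re.compile(r"\b" + re.escape(stem) + tail + r"\b", re.IGNORECASE)


def with_rule(text, trigger, conditions, term):
    """Return `term` if one sentence holds the trigger and any condition, else None."""
    trig = prefix_regex(trigger)
    conds = [prefix_regex(c) for c in conditions]
    for sentence in sentences(text):
        if trig.search(sentence) and any(c.search(sentence) for c in conds):
            return term
    return None


print(with_rule("The piles were laterally loaded in the test. Results follow.",
                "load*", ["lateral*"], "Lateral loads"))
```

With this reading, "Lateral movement was small. The load increased." assigns nothing, because the two words fall in different sentences.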

Page 26: Case Study:  Building the ASCE Thesaurus

Create Rulebase

Variations of “Span bridges”

Bridge*
IF (NEAR "span" OR NEAR "short-span" OR NEAR "long-span" OR NEAR "single-span" OR NEAR "multi-span" OR NEAR "multiple-span" OR NEAR "four-span" OR NEAR "three-span" OR NEAR "one-span" OR NEAR "continuous-span" OR NEAR "simple-span" OR NEAR "large-span")
USE Span bridges
ENDIF

Text that doesn't quite match (variations)
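Proximity conditions such as NEAR can be pictured as a word-window check around the trigger. The Python sketch below assumes a window of three words on either side; that window size, the tokenizer, and the function names are assumptions for illustration, not the documented behavior of the NEAR operator.

```python
import re


def tokens(text):
    """Lowercase word tokens; hyphenated words such as 'long-span' stay whole."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def near(text, trigger_prefix, neighbours, window=3):
    """True if a token starting with `trigger_prefix` has a neighbour within `window` tokens."""
    words = tokens(text)
    for i, word in enumerate(words):
        if word.startswith(trigger_prefix):
            lo, hi = max(0, i - window), i + window + 1
            context = words[lo:i] + words[i + 1:hi]
            if any(w in neighbours for w in context):
                return True
    return False


SPAN_WORDS = {"span", "short-span", "long-span", "three-span", "multi-span"}

print(near("A three-span continuous bridge was retrofitted.", "bridge", SPAN_WORDS))
print(near("The bridge deck was inspected after the long-span roof collapsed nearby.",
           "bridge", SPAN_WORDS))
```

The second sentence mentions "long-span", but too far from "bridge" to count, which is the kind of false match a proximity condition is meant to prevent.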

Page 27: Case Study:  Building the ASCE Thesaurus

Create Rulebase

Find hyphenated terms in our content

Page 28: Case Study:  Building the ASCE Thesaurus

Preferred term: Structural analysis

Analy*
IF (WITH "structur*" OR WITH "load" OR WITH "loads")
IF (NEAR "arch*" OR WITH "column*" OR NEAR "bar" OR NEAR "bars" OR NEAR "bar's" OR NEAR "beam" OR NEAR "beams" OR NEAR "strut" OR NEAR "struts" OR NEAR "compression member*" OR NEAR "tie" OR NEAR "ties" OR NEAR "tie rod" OR NEAR "tie-rod" OR NEAR "tie rods" OR NEAR "tie-rods" OR NEAR "eyebar*" OR NEAR "guy-wire*" OR NEAR "guy wire*" OR NEAR "suspension cable*" OR NEAR "wire rope*" OR NEAR "angle section*" OR NEAR "connect*" OR NEAR "coupl*" OR NEAR "diaphragm*" OR NEAR "flange*" OR NEAR "frame*" OR NEAR "bent" OR NEAR "bents" OR NEAR "girder*" OR NEAR "hollow section*" OR NEAR "hollow structural section*" OR NEAR "joint*" OR NEAR "joist*" OR NEAR "membrane*" OR NEAR "panel" OR NEAR "plate" OR NEAR "slab*" OR NEAR "stud" OR NEAR "studs" OR NEAR "tendon*" OR NEAR "tensile member*" OR NEAR "truss*" OR NEAR "tube*" OR NEAR "wall*" OR NEAR "gable*" OR NEAR "wall section*" OR MENTIONS "structural failure*" OR MENTIONS "building failure*")
USE Structural analysis
ENDIF

Create Rulebase

Text that doesn’t quite match (whole vs parts)

Page 29: Case Study:  Building the ASCE Thesaurus

Bridge the gap

Raising the bar

Foundation: a solid foundation, a firm foundation, research foundation…

Toll: Toll Brothers, human toll, take a toll…

Using NULL rules

right match that is wrong

Create Rulebase - To Disambiguate
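One way to picture a NULL rule is as a longer phrase that captures the text and deliberately assigns nothing. The Python sketch below illustrates that idea with longest-phrase-first matching; it is not MAIStro syntax, and the rule table, the preferred term "Tolls", and the function name are hypothetical.

```python
import re

# Hypothetical rule table: None stands for "assign no term" (a NULL rule).
RULES = {
    "toll": "Tolls",
    "take a toll": None,
    "human toll": None,
    "toll brothers": None,
}


def index_with_null_rules(text, rules=RULES):
    """Longest phrase wins; a NULL match consumes its text and yields no term."""
    # Longer phrases come first so 'take a toll' is tried before plain 'toll'.
    phrases = sorted(rules, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(p) for p in phrases) + r")\b", re.IGNORECASE
    )
    terms = set()
    for match in pattern.finditer(text):
        term = rules[match.group(1).lower()]
        if term is not None:
            terms.add(term)
    return terms


print(index_with_null_rules("Storms take a toll on levees."))
print(index_with_null_rules("Toll Brothers built homes near the toll road."))
```

In the second sentence the company name is swallowed by its NULL rule, while the later "toll road" still yields the term.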

Page 30: Case Study:  Building the ASCE Thesaurus

Create Rulebase

Phrases that contain more than one term

Text: Continuous Multispan Concrete Girder Highway Bridges

Preferred terms:

Continuous bridges

Span bridges

Concrete bridges

Girder bridges

Highway bridges

Page 31: Case Study:  Building the ASCE Thesaurus

Create Rulebase - To Disambiguate

Preferred term: Wells (noun vs adverb)

Well*

IF (WITH "hydraul*" OR WITH "Hydro*" OR WITH "Aquifer*" OR WITH "Multiaquifer*" OR WITH "discharg*" OR WITH "pump*" OR WITH "stilling" OR WITH "flow*" OR WITH "water*" OR WITH "groundwater" OR WITH "Recirculation" OR WITH "Artesian")
USE Wells

Page 32: Case Study:  Building the ASCE Thesaurus

Foundation*

IF (NOT (NEAR "success*" OR NEAR "research" OR NEAR "national science" OR NEAR "grant*" OR NEAR "president*" OR NEAR "ASCE foundation*" OR AROUND "engineering foundation" OR NEAR "economic" OR NEAR "prize*" OR NEAR "award*" OR NEAR "education*" OR NEAR "campaign*" OR AROUND "reason foundation" OR AROUND "national science foundation" OR AROUND "nsf" OR NEAR "job*" OR NEAR "partner*" OR NEAR "organization*" OR NEAR "scholar*"))

IF (WITH "bridge*" OR AROUND "bridge foundation*")
USE Bridge foundations
ENDIF
IF (WITH "dam" OR WITH "dams" OR AROUND "dam foundation*")
USE Dam foundations
ENDIF
IF (NEAR "deep" OR AROUND "deep foundation*")
USE Deep foundations
…

Create Rulebase - To Disambiguate

Page 33: Case Study:  Building the ASCE Thesaurus

If it is impossible to write a rule for a term, it may not be a good term.

Bubbles: Water bubbles, air bubbles, gas bubbles, financial bubbles…

fluid dynamics, waste treatment, material science, soil mechanics…

Clue: if you have trouble placing a term in the taxonomy, you are likely to have trouble creating rules for it.

Disambiguation

Page 34: Case Study:  Building the ASCE Thesaurus

Create Rulebase

Test*: Test, tests, testing, testings, testify, testimony, testosterone

Wave*: Waves, wavelength, wave length, wavelet, wavefront, waverider, waveguide…

Truncate text with care
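The risk of over-truncation is easy to check against a word list before a rule goes live. The Python sketch below compares an open-ended stem with an explicit list of allowed endings; the helper names and the sample vocabulary are assumptions for illustration.

```python
import re

# Sample vocabulary taken from the Test* example on the slide.
VOCABULARY = ["test", "tests", "testing", "testify", "testimony", "testosterone"]


def stem_matches(stem, vocabulary):
    """Words an open-ended truncation such as 'Test*' would capture."""
    pattern = re.compile(re.escape(stem) + r"\w*$", re.IGNORECASE)
    return [w for w in vocabulary if pattern.match(w)]


def safe_matches(stem, allowed_suffixes, vocabulary):
    """Words captured when only the listed endings are accepted after the stem."""
    alternatives = "|".join(re.escape(s) for s in allowed_suffixes)
    pattern = re.compile(re.escape(stem) + "(?:" + alternatives + ")$", re.IGNORECASE)
    return [w for w in vocabulary if pattern.match(w)]


print(stem_matches("test", VOCABULARY))
print(safe_matches("test", ["", "s", "ing", "ings"], VOCABULARY))
```

Listing the endings costs a little more typing but keeps "testify" and "testosterone" out of the results.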

Page 35: Case Study:  Building the ASCE Thesaurus

Preferred Term: Workplace discrimination

Discriminat*
IF (WITH "age" or WITH "minority" or WITH "racial" or WITH "race" or WITH "disabilit*" or WITH "senior" or WITH "older" or WITH "old" or WITH "women" or WITH "woman" or WITH "diversity" or WITH "dispute" or WITH "equal*" or WITH "female" or WITH "male" or WITH "workplace" or WITH "African*" or WITH "Hispanic")
USE Workplace discrimination
ENDIF

Text that hardly matches (need specifics)

Create Rulebase

Page 36: Case Study:  Building the ASCE Thesaurus

Taxonomy Enrichment and Rule Building is a Process.

Another opportunity to fine tune the taxonomy

Diffus*
IF (MENTIONS "transport" OR MENTIONS "concentration" OR MENTIONS "gradient" OR MENTIONS "advective" OR MENTIONS "equilibr*" OR MENTIONS "voc" OR MENTIONS "vocs" OR MENTIONS "volatile organic compound*" OR MENTIONS "water*" OR MENTIONS "moisture" OR MENTIONS "wave*" OR MENTIONS "flow" OR MENTIONS "chemical*" OR MENTIONS "molecul*" OR MENTIONS "soil*" OR MENTIONS "waste*" OR MENTIONS "filter*" OR MENTIONS "runoff" OR MENTIONS "run-off" OR MENTIONS "jet" OR MENTIONS "turbulen*" OR MENTIONS "gas" OR MENTIONS "emission*" OR MENTIONS "emit*" OR MENTIONS "air" OR MENTIONS "oxygen" OR MENTIONS "thermal" OR MENTIONS "solute*" OR MENTIONS "chloride*" OR MENTIONS "contamin*" OR MENTIONS "pollut*" OR MENTIONS "organic" OR MENTIONS "compound*" OR MENTIONS "nitri*" OR MENTIONS "ion" OR MENTIONS "ions" OR MENTIONS "dye" OR MENTIONS "dyes" OR MENTIONS "fluid*" OR MENTIONS "channel*" OR MENTIONS "river*" OR MENTIONS "stream*" OR MENTIONS "tidal" OR MENTIONS "hydro*" OR MENTIONS "hydrau*" OR MENTIONS "lake*" OR MENTIONS "bay" OR MENTIONS "bays" OR MENTIONS "ocean*" OR MENTIONS "coast*" OR MENTIONS "sediment*" OR MENTIONS "sea" OR MENTIONS "seas" OR MENTIONS "catchment*" OR MENTIONS "reservoir*" OR MENTIONS "estuar*" OR MENTIONS "sewage*" OR MENTIONS "flood*" OR MENTIONS "porous medi*" OR MENTIONS "concrete*" OR MENTIONS "bentonite" OR MENTIONS "cement*" OR MENTIONS "clay*" OR MENTIONS "advection*" OR MENTIONS "convection*" OR MENTIONS "eddy" OR MENTIONS "eddies" OR MENTIONS "flux")

IF (AROUND "voc" OR AROUND "vocs" OR AROUND "volatile organic compound*" OR AROUND "chemical*" OR AROUND "molecul*" OR AROUND "chlorid*" OR AROUND "nitri*" OR AROUND "ion" OR AROUND "ions" OR AROUND "polymer*" OR AROUND "species" OR AROUND "polyaromatic*" OR AROUND "hydrocarbon*" OR AROUND "aromatic*" OR AROUND "pah" OR AROUND "pahs" OR AROUND "dichloromethane*" OR AROUND "chloromethane*" OR AROUND "chemox")

USE Diffusion (chemical)
ENDIF
IF (AROUND "thermo*" OR AROUND "thermal" OR AROUND "thermodiffusion")
USE Diffusion (thermal)
ENDIF
IF (AROUND "porous" OR AROUND "porosity" OR AROUND "soil*" OR AROUND "clay*" OR AROUND "pore" OR AROUND "pores" OR AROUND "cement*" OR AROUND "concrete*" OR AROUND "bentonite")
USE Diffusion (porous media)
ENDIF
IF (AROUND "fluid*")
IF (WITH "turbulen*" OR WITH "eddy" OR WITH "eddies")
USE Turbulent diffusion
ELSE
ENDIF
IF (NOT (AROUND "voc" OR AROUND "vocs" OR AROUND "volatile organic compound*" OR AROUND "chemical*" OR AROUND "molecul*" OR AROUND "chlorid*" OR AROUND "nitri*" OR AROUND "ion" OR AROUND "ions" OR AROUND "polymer*" OR AROUND "species" OR AROUND "polyaromatic*" OR AROUND "hydrocarbon*" OR AROUND "aromatic*" OR AROUND "pah" OR AROUND "pahs" OR AROUND "dichloromethane*" OR AROUND "chloromethane*" OR AROUND "chemox" OR AROUND "thermo*" OR AROUND "thermal" OR AROUND "thermodiffusion" OR AROUND "porous" OR AROUND "porosity" OR AROUND "soil*" OR AROUND "clay*" OR AROUND "pore" OR AROUND "pores" OR AROUND "cement*" OR AROUND "concrete*" OR AROUND "bentonite" OR AROUND "fluid*" OR WITH "wave" OR WITH "waves"))
USE Diffusion
ENDIF
ENDIF

Page 37: Case Study:  Building the ASCE Thesaurus

• It is impossible to build perfect rules.

• Rules that are too general produce noise; rules that are too granular produce misses. Try to strike a balance.

• Be ready for the unexpected. Keep note of possible equivalent terms when you are not working on the taxonomy, e.g. “ring of fire”=Earthquakes, “la nina”, “el nino”, “polar vortex” =Climate change

Taxonomy Enrichment and Rule Building is a Process
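The balance between noise and misses can be tracked with the usual precision and recall figures, comparing the terms a rulebase assigns with the terms a human indexer would assign. The Python sketch below is a generic illustration; the sample term sets are made up for the example and are not ASCE data.

```python
def precision_recall(assigned, expected):
    """Precision falls with noise (wrong terms); recall falls with misses."""
    assigned, expected = set(assigned), set(expected)
    hits = assigned & expected
    precision = len(hits) / len(assigned) if assigned else 0.0
    recall = len(hits) / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical example: terms suggested by the rules vs. terms chosen by an indexer.
machine_terms = {"Wells", "Diffusion", "Bubbles"}
human_terms = {"Wells", "Diffusion", "Groundwater"}

precision, recall = precision_recall(machine_terms, human_terms)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here "Bubbles" is noise and "Groundwater" is a miss, so both figures come out at two thirds.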

Page 38: Case Study:  Building the ASCE Thesaurus

Questions?