NLM Summer Lectures · June 8, 2016

June 8, 2016 NLM NLM Summer Lectures



June 8, 2016

NLM

NLM Summer Lectures


November 28, 2011

The SPECIALIST lexicon and lexical tools

June 8, 2016

The SPECIALIST Lexicon and Lexical Tools

Allen Browne
Chris Lu


Lexical tools

Text processing

SPECIALIST LEXICON


The SPECIALIST Lexicon

A syntactic lexicon
Biomedical and general English
Over 490,000 records


Lexicon Growth


Miller -- 1991


George A. Miller

The Science of Words

1991

Frequency Spectrum of Medline 2006

[Figure: V(m,N) plotted against m, with m on a logarithmic axis]


Frequency Spectrum: Alice in Wonderland

Baayen, 2001
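
The two frequency-spectrum slides plot V(m,N), which in the usual notation is the number of distinct word types that occur exactly m times in a sample of N tokens. A minimal sketch of how such a spectrum could be computed is given here; the function name `frequency_spectrum` and the whitespace tokenization are assumptions made for illustration, not part of the lecture.

```python
from collections import Counter


def frequency_spectrum(tokens):
    """Return {m: V(m, N)}: the number of types occurring exactly m times."""
    type_counts = Counter(tokens)               # type -> frequency in the sample
    return dict(Counter(type_counts.values()))  # frequency m -> number of types


# Toy sample (hypothetical text, whitespace-tokenized for simplicity).
tokens = "the cat saw the dog and the dog saw a bird".split()
spectrum = frequency_spectrum(tokens)

for m in sorted(spectrum):
    print(m, spectrum[m])
```

Summing m * V(m,N) over all m recovers N, the number of tokens, which is a quick sanity check on any spectrum.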


The SPECIALIST LEXICON

Morphology
    Inflection
    Derivation

Orthography
    Spelling variants

Syntax
    Complementation for verbs, nouns, and adjectives


Morphology

Inflectional
    nucleus, nuclei
    cauterize, cauterizes, cauterized, cauterizing
    red, redder, reddest

Derivational
    laryngeal -- larynx
    transport -- transportation


Derivational Morphology

Dictionary+ology+ist


Derivational Morphology


Orthography

align -- aline
Grave’s disease -- Graves’s disease -- Graves’ disease
anesthetize -- anesthetise
Esophagus -- oesophagus
foetus -- fetus
centre -- center

Spelling Variation


Orthography


Syntax -- Verb Complements

intran
    I’ll treat.

tran=np
    He treated the patient.

ditran=np,pphr(with,np)
    She treated the patient with the drug.


Syntax -- Verb Complements

{base=treat
entry=E0061964
    cat=verb
    variants=reg
    intran
    tran=np
    tran=pphr(with,np)
    tran=pphr(of,np)
    ditran=np,pphr(to,np)
    ditran=np,pphr(with,np)
    ditran=np,pphr(for,np)
    cplxtran=np,advbl
    nominalization=treatment|noun|E0061968
}
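
A unit record like the one for treat is a brace-delimited list of slots, some with values (tran=np) and some without (intran), and a slot may repeat. The sketch below shows one way such a record could be read into a dictionary. The function name `parse_record` and the choice of a dict of lists are assumptions for illustration; this is not the Lexical Tools' own reader.

```python
def parse_record(text):
    """Parse one unit record into {slot: [values]}; valueless slots map to [""]."""
    record = {}
    for line in text.strip().splitlines():
        line = line.strip().lstrip("{")
        if line in ("", "}"):
            continue                       # skip blank lines and the closing brace
        slot, _, value = line.partition("=")
        record.setdefault(slot, []).append(value)
    return record


TREAT = """
{base=treat
entry=E0061964
    cat=verb
    variants=reg
    intran
    tran=np
    tran=pphr(with,np)
    tran=pphr(of,np)
    ditran=np,pphr(to,np)
    ditran=np,pphr(with,np)
    ditran=np,pphr(for,np)
    cplxtran=np,advbl
    nominalization=treatment|noun|E0061968
}
"""

treat = parse_record(TREAT)
print(treat["tran"])
```

Keeping every slot as a list preserves repeated slots such as tran and ditran, which carry the alternative complement patterns of the verb.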


Syntax -- Verb Particle Constructions

clean up
scrub down
look up


{base=clean
entry=E0017272
    cat=verb
    variants=reg
    intran
    intran;part(up)
    tran=np
    tran=np;part(up)
    nominalization=clean|noun|E0017273
    nominalization=cleanup|noun|E0319808
}


Categories – Parts of Speech


Lexicon Unit Records

{base=Kaposi's sarcoma
spelling_variant=Kaposi sarcoma
entry=E0003576
    cat=noun
    variants=uncount
    variants=reg
    variants=glreg
}

{base=chronic
entry=E0016869
    cat=adj
    variants=inv
    position=attrib(1)
    position=pred
    stative
}

{base=aspirate
entry=E0010803
    cat=verb
    variants=reg
    tran=np
    nominalization=aspiration|noun|E0010804
}

{base=in
entry=E0033870
    cat=prep
}


Orthographic vs. Lexicographic Word:

Why, for instance, if a two-word boy scout feels chilly on his one-word campground, does he pull up a two-word camp chair in front of his one-word campfire? Anyone who seeks a strictly logical answer to such questions is chasing will-o'-the-wisps (chargeable in telegrams as a single word, because of the hyphens) in a semantic bog.

Louis Salomon, Semantics and Common Sense, Holt Rinehart and Winston, 1966.


UTF-8

{base=resume
spelling_variant=résumé
spelling_variant=resumé
entry=E0053099
    cat=noun
    variants=reg
}

{base=role
spelling_variant=rôle
entry=E0053757
    cat=noun
    variants=reg
}

{base=deja vu
spelling_variant=deja-vu
spelling_variant=déjà vu
entry=E0021340
    cat=noun
    variants=uncount
}

{base=cafe
spelling_variant=café
entry=E0420690
    cat=noun
    variants=reg
}


Noun Variants

Kaposi’s sarcoma
Kaposi’s sarcomas
Kaposi’s sarcomata
Kaposi sarcoma
Kaposi sarcomas
Kaposi sarcomata

{base=Kaposi's sarcoma
spelling_variant=Kaposi sarcoma
entry=E0003576
    cat=noun
    variants=uncount
    variants=reg
    variants=glreg
}


Regular Nouns

The plural suffix is s.
y becomes ie following a consonant before s.
e is inserted before s if the base ends in s, z, x, ch, or sh.

Leach -- Leaches
Stomach -- Stomachs (irregular)
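
The regular noun rule can be sketched as a small function. The name `regular_plural` is hypothetical, and the final alternative in the slide's list of endings is read as sh (as in the matching verb rule); that reading is an assumption.

```python
VOWELS = "aeiou"


def regular_plural(base):
    """Apply the regular (variants=reg) noun plural rule to a base form."""
    # y becomes ie following a consonant before s: biopsy -> biopsies
    if base.endswith("y") and len(base) > 1 and base[-2] not in VOWELS:
        return base[:-1] + "ies"
    # e is inserted before s after s, z, x, ch, or sh: leach -> leaches
    if base.endswith(("s", "z", "x", "ch", "sh")):
        return base + "es"
    return base + "s"


for noun in ["leach", "sarcoma", "biopsy", "stomach"]:
    print(noun, "->", regular_plural(noun))
```

Applied to stomach, the rule yields stomaches rather than stomachs, which is why that noun has to be recorded as irregular instead of being covered by the rule.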


Greco-Latin Regular Nouns


Octopuses

{base=octopus
entry=E0204527
    cat=noun
    variants=reg
    variants=glreg
}


Uncount Nouns (abstract or mass)

* a smallpox
* two smallpoxes
much smallpox

* a potassium
* two potassiums
much potassium

{base=smallpox
entry=E0056359
    cat=noun
    variants=uncount
}

{base=potassium
entry=E0049387
    cat=noun
    variants=uncount
}

* This form does not occur


Countability

Mail
* a mail
much mail
* many mails

E-Mail
an e-mail
much e-mail
many e-mails

* This form does not occur


Uncount Nouns


Fixed Plural Nouns

{base=police
entry=E0048616
    cat=noun
    variants=plur
}

{base=scissors
entry=E0054633
    cat=noun
    variants=plur
}


Irregular Nouns

{base=corpus
entry=E0019113
    cat=noun
    variants=irreg|corpora|
    variants=reg
}

{base=larynx
entry=E0036919
    cat=noun
    variants=irreg|larynges|
    variants=reg
}


Regular Verbs

• The third person present tense suffix is s.
- y becomes ie following a consonant before s.
- e is inserted between s, z, x, ch, or sh and s.

• The past tense suffix is ed.
- y becomes i following a consonant before ed.
- Final e is deleted before ed.

• The present participle suffix is ing.
- y is unchanged before ing.
- Final e is deleted before ing unless preceded by e, y or o.

• The past participle is the same as the past tense.


Regular Verbs

dismiss: dismisses, dismissed, dismissing
agree: agrees, agreed, agreeing
dry: dries, dried, drying
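
A minimal sketch of the regular verb paradigm, written to reproduce the three worked examples (dismiss, agree, dry). The function name `inflect_regular` is hypothetical. Where the wording of the rules and the examples could be read differently, the sketch follows the examples: y is kept before ing (dry, drying) and y becomes i before ed (dry, dried).

```python
VOWELS = "aeiou"


def inflect_regular(base):
    """Return (third person singular, past, present participle) for a regular verb."""
    consonant_y = base.endswith("y") and len(base) > 1 and base[-2] not in VOWELS

    # Third person present: s, with y -> ie and e-insertion after sibilants.
    if consonant_y:
        third = base[:-1] + "ies"
    elif base.endswith(("s", "z", "x", "ch", "sh")):
        third = base + "es"
    else:
        third = base + "s"

    # Past tense: ed, with y -> i and no second e after a final e.
    if consonant_y:
        past = base[:-1] + "ied"
    elif base.endswith("e"):
        past = base + "d"
    else:
        past = base + "ed"

    # Present participle: ing, dropping final e unless preceded by e, y or o.
    if base.endswith("e") and len(base) > 1 and base[-2] not in "eyo":
        participle = base[:-1] + "ing"
    else:
        participle = base + "ing"

    return third, past, participle


for verb in ["dismiss", "agree", "dry", "cauterize"]:
    print(verb, inflect_regular(verb))
```

The past participle is not returned separately because, for regular verbs, it is the same as the past tense.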


Regular Doubling Verbs

End in a CVC pattern
Double the final consonant before ed and ing.
Are otherwise regular
variants=regd

control: controls, controlled, controlling
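
The doubling pattern can be sketched in a few lines. Both function names are hypothetical, and the CVC test below treats only a, e, i, o, u as vowels, which is a simplifying assumption. The sketch assumes the caller already knows the verb carries variants=regd.

```python
VOWELS = "aeiou"


def ends_in_cvc(base):
    """True if the last three letters are consonant, vowel, consonant."""
    if len(base) < 3:
        return False
    c1, v, c2 = base[-3], base[-2], base[-1]
    return c1 not in VOWELS and v in VOWELS and c2 not in VOWELS


def inflect_doubling(base):
    """Return (third person, past, present participle) for a variants=regd verb."""
    doubled = base + base[-1]            # double the final consonant
    return base + "s", doubled + "ed", doubled + "ing"


print(ends_in_cvc("control"), inflect_doubling("control"))
```

The CVC shape alone does not decide doubling (visit ends in CVC but gives visited), which is consistent with doubling being recorded per entry as variants=regd rather than inferred from spelling.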


Irregular Verbs

Bite: bite, bites, bit, bitten


Irregular Verbs

{base=bite
entry=E0013219
    cat=verb
    variants=irreg|bite|bites|bit|bitten|biting|
    intran
    tran=np
    cplxtran=np,advbl
}


Ancillary Data Bases

Synonymy sm.db

Derivation dm.db, dm.rules

Inflection im.rules

Neoclassical compounds nc.db


Derivational Facts and Rules

dm.facts

treatment|noun|treat|verb
prohibition|noun|prohibitive|adj
cell lineage|noun|cell line|noun
photochemotherapeutic|adj|photochemotherapy|noun
pharmacotherapeutic|adj|pharmacotherapy|noun


Derivational Facts and Rules

dm.rules

# e.g. alienation|alienate
ation$|noun|ate|verb
    ration|rate; station|state;
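
A sketch of how a suffix rule of this kind might be applied. It assumes that the pairs listed under the rule (ration|rate; station|state;) are exceptions the rule must not fire on; that reading, the function name `apply_suffix_rule`, and its parameters are assumptions for illustration, not the LVG implementation.

```python
def apply_suffix_rule(word, in_suffix, out_suffix, exceptions=()):
    """Rewrite in_suffix to out_suffix at the end of word, unless word is an exception.

    Returns the derived form, or None if the rule does not apply.
    """
    if word in exceptions or not word.endswith(in_suffix):
        return None
    stem = word[: len(word) - len(in_suffix)]
    return stem + out_suffix


# Rule from the slide: ation$|noun -> ate|verb, with its listed exception words.
ATION_EXCEPTIONS = {"ration", "station"}

for noun in ["alienation", "ration", "station", "treatment"]:
    print(noun, "->", apply_suffix_rule(noun, "ation", "ate", ATION_EXCEPTIONS))
```

Listing exceptions next to the rule keeps the rule general while blocking spurious pairs such as ration and rate, which share the spelling pattern but are not derivationally related.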


Inflectional Facts and Rules

im.rules

# Noun rules (glreg)
us$|noun|singular|i$|noun|plural
    antus|anti;
ma$|noun|singular|mata$|noun|plural
a$|noun|singular|ae$|noun|plural
um$|noun|singular|a$|noun|plural
on$|noun|singular|a$|noun|plural
sis$|noun|singular|ses$|noun|plural
is$|noun|singular|ides$|noun|plural
men$|noun|singular|mina$|noun|plural
ex$|noun|singular|ices$|noun|plural
x$|noun|singular|ces$|noun|plural
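
The glreg noun rules can be read as an ordered table of suffix rewrites. The sketch below applies the first rule whose singular suffix matches, in the order the slide lists them; that first-match policy, the treatment of antus as an exception word, and the name `glreg_plural` are assumptions for illustration.

```python
# (singular suffix, plural suffix), in the order listed under "Noun rules (glreg)".
GLREG_RULES = [
    ("us", "i"),
    ("ma", "mata"),
    ("a", "ae"),
    ("um", "a"),
    ("on", "a"),
    ("sis", "ses"),
    ("is", "ides"),
    ("men", "mina"),
    ("ex", "ices"),
    ("x", "ces"),
]
GLREG_EXCEPTIONS = {"antus"}  # assumed reading of the "antus|anti;" line


def glreg_plural(singular):
    """Return the Greco-Latin regular plural, or None if no rule applies."""
    if singular in GLREG_EXCEPTIONS:
        return None
    for sing_suffix, plur_suffix in GLREG_RULES:
        if singular.endswith(sing_suffix):
            return singular[: len(singular) - len(sing_suffix)] + plur_suffix
    return None


for noun in ["octopus", "sarcoma", "diagnosis", "appendix", "larynx"]:
    print(noun, "->", glreg_plural(noun))
```

Applied to larynx, the x rule produces larynces, not larynges, which fits with the lexicon recording larynges explicitly as an irregular variant.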


Neoclassical compounds

nc.db

abdomin(o)|abdomen|root
ab|away from|prefix
acanth(o)|prickle|root
acar(o)|mite|root
acetabul(o)|acetabulum|root
ad|towards|prefix
agogue|inducing|terminal
albumin(o)|albumin|root
sis|condition|terminal
stomy|surgical opening|terminal


pneu.mo.no.ul.tra.mi.cro.scop.ic.sil.i.co.vol.ca.no.co.ni.o.sis \'n(y)u:-m*-(.)no--.*l-tr*-.mi--kr*-'ska:p-ik-'sil-i-(.)ko--(.)va:l-'ka--no--.ko--ne--'o--s*s\ n [NL, fr. Gk pneumo-n + ISV ultramicroscopic + NL silicon + ISV volcano + Gk konis dust] : a pneumoconiosis caused by the inhalation of very fine silicate or quartz dust

-- Merriam Webster's 3rd International Dictionary, page 1747.

PNEUMONOULTRAMICROSCOPICSILICOVOLCANOCONIOSIS


The Protein of a tobacco mosaic virus, Dahlemense strain

acetylseryltyrosylserylisoleucylthreonylserylprolylserylglutaminylphenylalanylvalylphenylalanylleucylserylserylvalyltryptophylalanylaspartylprolylisoleucylglutamylleucylleucyllasparaginylvalylcysteinylthreonylserylserylleucylglycllasparaginylglutaminylphenylalanylglutaminylthreonylglutaminylglutaminylalanylarginylthreonylthreonylglutaminylvalylglutaminylglutaminylphenylalanylserylglutaminylvalyltryptophyllysylprolylphenylalanylprolylglutaminylserylthreonylvalylarginylphenylalanylprolylglycylaspartylvalyltyrosyllsyslvalyltyrosylarginyltyrosylasparaginylalanylvalylleucylaspartylprolylleucylisoleucylthreonylalanylleucylleucylglycylthryonylphenylalanylaspartylthreonylarginylasparaginylarginylisoleucylisoleucylglutamylvalylglutamylasparaginylglutaminylglutaminylserylprolylthreonylthreonylalanylglutamylthreonylleucylaspartylalanylthreonylarginylarginylvalylaspartylaspartylalanylthreonylvalylalanylisoleucylarginylserylalanylasparaginylisoleucylasparaginylleucylvallasparaginylglutamylleucylvalylarginylglycylthreonylglycylleucultyrosylasparaginylglutaminylasparaginylthreonylphenylalanylglutamylserylmethionylserylglycylleucylvalyltryptophylthreonylserylalanylprolylalanylserine


Synonyms

sm.db

alar|adj|wing|noun
amygdaline|adj|tonsil|noun
articular|adj|joint|noun
bulbar|adj|medulla oblongata|noun
furuncular|adj|boil|noun
genicular|adj|knee|noun
hepatocellular|adj|liver cells|noun
lazar|adj|leprosy|noun
lenticular|adj|crystalline lens|noun
ypsiliform|adj|upsiloid|adj
wolfram|noun|tungsten|noun
double vision|noun|diplopia|noun


Lexical tools

Text processing

SPECIALIST LEXICON


Lexical Tools

Wordind -- breaks strings into words
    Produces the Metathesaurus word indexes (MRXW)

LVG -- performs various lexical transformations

NORM -- a selection of LVG transformations
    Used for Metathesaurus indexing
    Produces the Metathesaurus normalized word and string indexes (MRXNW & MRXNS)
    Used to access those indexes


Normalization

Hodgkin Disease
HODGKINS DISEASE
Hodgkin's Disease
Disease, Hodgkin's
HODGKIN'S DISEASE
Hodgkin's disease
Hodgkins Disease
Hodgkin's disease NOS
Hodgkin's disease, NOS
Disease, Hodgkins
Diseases, Hodgkins
Hodgkins Diseases
Hodgkins disease
hodgkin's disease
Disease;Hodgkins
Disease, Hodgkin

disease hodgkin
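
The way these surface variants collapse to the single string disease hodgkin can be illustrated with a simplified sketch. It is not the NORM program: the steps (lowercasing, removing the genitive 's, replacing punctuation with spaces, dropping NOS, reducing words to a base form, sorting the words) are a plausible reading of the slide, and the tiny `BASE_FORMS` table is a hypothetical stand-in for the lexicon-driven uninflection the real tool would perform.

```python
import re

# Hypothetical stand-in for lexicon-based uninflection.
BASE_FORMS = {"hodgkins": "hodgkin", "diseases": "disease"}
DROPPED_WORDS = {"nos"}  # "not otherwise specified" carries no lexical content


def normalize(term):
    """Map a term to a case-, order-, punctuation- and inflection-insensitive key."""
    text = term.lower()
    text = re.sub(r"'s\b", "", text)          # remove the genitive marker
    text = re.sub(r"[^a-z0-9]+", " ", text)   # punctuation becomes a space
    words = [BASE_FORMS.get(w, w) for w in text.split() if w not in DROPPED_WORDS]
    return " ".join(sorted(words))            # word order is neutralized by sorting


VARIANTS = [
    "Hodgkin Disease", "HODGKINS DISEASE", "Hodgkin's Disease",
    "Disease, Hodgkin's", "HODGKIN'S DISEASE", "Hodgkin's disease",
    "Hodgkins Disease", "Hodgkin's disease NOS", "Hodgkin's disease, NOS",
    "Disease, Hodgkins", "Diseases, Hodgkins", "Hodgkins Diseases",
    "Hodgkins disease", "hodgkin's disease", "Disease;Hodgkins",
    "Disease, Hodgkin",
]

print({normalize(v) for v in VARIANTS})
```

Because every variant maps to the same key, an index built on normalized strings lets a query in any of these forms retrieve the same entries.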


The Lexical Systems Group

Allen Browne: [email protected]
Chris Lu: [email protected]