lis 7450, searching electronic databases

Post on 23-Feb-2016

LIS 7450, Searching Electronic Databases

Basic: Database Structure & Database Construction

Dialog: Database Construction for Dialog (FYI)

Deborah A. Torres

Database Structure

Organization of Data Elements and records

Database Record

Record – basic unit of information in a database (file). Example: a bibliographic record contains descriptive information, i.e. author, title, publisher, etc.

Fields

Field – a distinct part or section of a record (a unit of information within the record). Example of personnel record fields: employee’s name, special identifier number, address, date of hire, etc.

Field Design Decisions

For each field:
Decide what information is placed within that field and the format for that information (text, numeric).
Should there be subfields within a field?
What to call the fields?
Field codes (abbreviations, numbering)
Order of the fields

Example: MARC Record (a type of record you should be familiar with)

Record Fields & Codes

The 100 field contains author information. The 245 field contains main title information.
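A record with coded fields can be pictured as a mapping from field code to field content. The sketch below is illustrative only: the record values are invented, the `describe` helper is hypothetical, and real MARC records also carry indicators and subfield codes that are left out here.

```python
# A toy bibliographic record keyed by MARC-style field codes.
# The values are invented; indicators and subfields are omitted.
record = {
    "100": "Doe, Jane",          # 100 field: author information
    "245": "An Example Title",   # 245 field: main title information
}

# Human-readable labels for the two field codes used above.
FIELD_LABELS = {"100": "Author", "245": "Title"}


def describe(rec):
    """Return 'Label (code): value' lines in field-code order."""
    return [f"{FIELD_LABELS[code]} ({code}): {rec[code]}" for code in sorted(rec)]


for line in describe(record):
    print(line)
```

Keeping the code separate from the label shows why field codes are a design decision: the stored key can stay short while the display name changes.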

Other Design Decisions

Hyphenated words: home-school
Stop words: high-frequency words not useful for searching
Single words and phrases: library, library science, color of money
Alternative spellings of words: color, colour
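Two of these decisions, hyphenated words and alternative spellings, can be sketched as options in the code that produces index terms. This is one possible design, not how any particular database works; the `SPELLING_VARIANTS` table and the `index_terms` function are hypothetical names introduced for illustration.

```python
import re

# Hypothetical table that maps a variant spelling to the form that is indexed.
SPELLING_VARIANTS = {"colour": "color"}


def index_terms(text, split_hyphens=True):
    """Return the terms one design might index for a piece of text."""
    terms = []
    # A token is a run of letters, optionally joined by internal hyphens.
    for token in re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower()):
        if "-" in token and split_hyphens:
            parts = token.split("-")   # home-school -> home, school
        else:
            parts = [token]            # home-school kept as one term
        terms.extend(SPELLING_VARIANTS.get(part, part) for part in parts)
    return terms


print(index_terms("Home-school colour"))
print(index_terms("Home-school colour", split_hyphens=False))
```

A searcher who types home-school finds different records depending on which choice the database designer made, which is why these rules are worth checking in the database documentation.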

Types of Databases

Bibliographic – references and abstracts of published documents

Fulltext – complete text of articles, dictionary entry, code of law, or other such document.

Directory – factual information about organizations, companies, products, people, or materials.


Numeric – data in a tabular or statistically manipulated form, often with some added text.

Hybrid – a mix of record types. For example, a database may have full-text records for some publications and citations and abstracts for other source documents.

Database Construction

Basic Steps for automatic indexing of text documents

Six Basic Steps
Step 1: Parse text into words
Step 2: Compare to stoplist and eliminate stopwords
Step 3: Stem content words (reduce to root words) (skip this step if you decide not to stem)
Step 4: Count stemmed word occurrences
Step 5: Create union list of terms
Step 6: Create data structure for specific retrieval techniques (e.g. an inverted file)

Example: Simple Set of 5 One-Sentence Documents

D1: It is a dog eat dog world!
D2: While the world sleeps.
D3: Let sleeping dogs lie.
D4: I will eat my hat.
D5: My dog wears a hat.

“D” stands for document

Step 1: Parse Text into Words

D1: it is a dog eat dog world
D2: while the world sleeps
D3: let sleeping dogs lie
D4: I will eat my hat
D5: my dog wears a hat

Note: Some databases remove punctuation from words, as in possessives; others preserve it. What difference would this make?
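One way to see the difference punctuation handling makes is to parse the same text both ways. The `parse` function below is a minimal sketch, not a real database parser; it assumes plain English letters and lowercases everything (including the "I" in D4).

```python
import re


def parse(text, keep_apostrophes=False):
    """Split text into lowercase words, optionally keeping internal apostrophes."""
    if keep_apostrophes:
        pattern = r"[a-z]+(?:'[a-z]+)?"   # dog's stays one word
    else:
        pattern = r"[a-z]+"               # dog's becomes dog and s
    return re.findall(pattern, text.lower())


print(parse("It is a dog eat dog world!"))
print(parse("The dog's hat"))
print(parse("The dog's hat", keep_apostrophes=True))
```

When the apostrophe is dropped, a search for dog also matches the possessive; when it is preserved, dog's is a separate index term and a search for dog alone would miss it.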

Step 2: Eliminate Stop Words

D1: dog eat dog world
D2: world sleeps
D3: let sleeping dogs lie
D4: eat hat
D5: dog wears hat

Stop words are content-free words: those not useful in determining the content of the document. Examples: pronouns (I, my), prepositions (of, by, on), articles (a, the, this)
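Stop word elimination is a simple membership test against the stoplist. In the sketch below the stoplist is an assumption: it contains only the words needed to reproduce the five-document example, and real stoplists are longer and differ between databases.

```python
# Assumed stoplist: just enough words to reproduce the five-document example.
STOPLIST = {"a", "i", "is", "it", "my", "the", "while", "will"}

# Output of Step 1 (parsed, lowercased words).
parsed = {
    "D1": ["it", "is", "a", "dog", "eat", "dog", "world"],
    "D2": ["while", "the", "world", "sleeps"],
    "D3": ["let", "sleeping", "dogs", "lie"],
    "D4": ["i", "will", "eat", "my", "hat"],
    "D5": ["my", "dog", "wears", "a", "hat"],
}


def remove_stopwords(words):
    """Keep only the words that are not on the stoplist."""
    return [word for word in words if word not in STOPLIST]


content_words = {doc_id: remove_stopwords(words) for doc_id, words in parsed.items()}
for doc_id, words in content_words.items():
    print(doc_id, " ".join(words))
```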

Step 3: Stemming (remember not all databases stem words)

Before stemming:
D1: dog eat dog world
D2: world sleeps
D3: let sleeping dogs lie
D4: eat hat
D5: dog wears hat

After stemming:
D1: dog eat dog world
D2: world sleep
D3: let sleep dog lie
D4: eat hat
D5: dog wear hat

Types of Stemming Decisions

No Stemming: contract, contracts, contracted, contracting, contractor, contraction, contractual, contracture
Weak Stemming: inflections (-s, -es, -ed, -ing, -’s)
Strong Stemming: derivations (-tion, -ly, -ally)

Stemming reduces words to a root variant; there are different stemming algorithms.

A bit more about stemming for searching…

Some databases automatically search for all of the words that come from the same stem/root word unless you indicate that you only want the word you entered.

Example: if you entered computer, the database would also search for computing, computers, computation, etc.
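Weak stemming can be approximated by stripping inflectional endings. The `weak_stem` function below is a deliberately naive sketch written for this example; it is not a published stemming algorithm, and the minimum stem length of three letters is an arbitrary assumption to keep short words intact.

```python
# Inflectional endings to strip, longest first so "-es" is tried before "-s".
SUFFIXES = ("'s", "ing", "ed", "es", "s")


def weak_stem(word):
    """Strip one inflectional ending, keeping at least three letters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Words from the five-document example.
print([weak_stem(w) for w in ["sleeps", "sleeping", "dogs", "wears", "world"]])

# The contract family: inflections collapse, derivations are left alone.
print([weak_stem(w) for w in ["contracts", "contracted", "contracting",
                              "contractor", "contraction"]])
```

Because only inflections are removed, contractor and contraction stay distinct from contract; a strong stemmer would go further and merge derived forms as well.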

Step 4: Sort Words, Count Duplicates

Sort into alphabetical order:
D1: dog dog eat world
D2: sleep world
D3: dog let lie sleep
D4: eat hat
D5: dog hat wear

Count any duplicates:
D1: dog(2) eat world
D2: sleep world
D3: dog let lie sleep
D4: eat hat
D5: dog hat wear
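Sorting and counting can be done in one pass with a counter per document. The sketch below starts from the stemmed words of Step 3; the `format_counts` helper is a hypothetical name used only to print the result in the slide's notation.

```python
from collections import Counter

# Output of Step 3 (stemmed content words).
stemmed = {
    "D1": ["dog", "eat", "dog", "world"],
    "D2": ["world", "sleep"],
    "D3": ["let", "sleep", "dog", "lie"],
    "D4": ["eat", "hat"],
    "D5": ["dog", "wear", "hat"],
}

# One Counter per document: word -> number of occurrences.
counts = {doc_id: Counter(words) for doc_id, words in stemmed.items()}


def format_counts(counter):
    """Alphabetical word list, with (n) appended when a word occurs more than once."""
    return " ".join(
        f"{word}({n})" if n > 1 else word for word, n in sorted(counter.items())
    )


for doc_id, counter in counts.items():
    print(f"{doc_id}: {format_counts(counter)}")
```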

Step 5: Create Union List of Unique Terms

Unsorted List: dog eat world sleep world dog let lie sleep eat hat dog hat wear

Sorted List: dog dog dog eat eat hat hat let lie sleep sleep wear world world

Sorted, Unique List: dog eat hat let lie sleep wear world
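The three lists in this step correspond to concatenating, sorting, and de-duplicating the per-document terms. A minimal sketch, assuming the per-document term lists from Step 4:

```python
# Per-document terms after Step 4 (duplicates within a document already merged).
terms_by_doc = {
    "D1": ["dog", "eat", "world"],
    "D2": ["sleep", "world"],
    "D3": ["dog", "let", "lie", "sleep"],
    "D4": ["eat", "hat"],
    "D5": ["dog", "hat", "wear"],
}

# Unsorted list: all per-document lists joined end to end.
unsorted_list = [term for terms in terms_by_doc.values() for term in terms]

# Sorted list: the same terms in alphabetical order, duplicates still present.
sorted_list = sorted(unsorted_list)

# Sorted, unique list: a set removes duplicates, sorted() restores the order.
union_list = sorted(set(unsorted_list))

print(" ".join(union_list))
```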

Step 6: Create Inverted Index (inverted file)

Union List (unique terms): dog, eat, hat, let, lie, sleep, wear, world

Inverted Index (has pointers to the documents in which each word occurs):
dog: D1 D3 D5
eat: D1 D4
hat: D4 D5
let: D3
lie: D3
sleep: D2 D3
wear: D5
world: D1 D2
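An inverted index maps each term to the list of documents that contain it, so a search only has to look up terms instead of scanning every document. The sketch below builds the index for the five example documents; `build_inverted_index` and `search_all` are hypothetical helper names, and the AND-style search is an added illustration of how the index gets used.

```python
from collections import defaultdict

# Per-document terms after stemming and stop word removal.
terms_by_doc = {
    "D1": ["dog", "eat", "world"],
    "D2": ["sleep", "world"],
    "D3": ["dog", "let", "lie", "sleep"],
    "D4": ["eat", "hat"],
    "D5": ["dog", "hat", "wear"],
}


def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs in which it occurs."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in sorted(set(docs[doc_id])):
            index[term].append(doc_id)
    return dict(index)


def search_all(index, *terms):
    """Return the documents that contain every one of the given terms."""
    postings = [set(index.get(term, ())) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []


index = build_inverted_index(terms_by_doc)
for term in sorted(index):
    print(f"{term}: {' '.join(index[term])}")

print(search_all(index, "dog", "hat"))
```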

Dialog Database Construction

FYI: For those interested in Dialog


Step 1: Create a linear file of records received from the Information Provider. Assign sequential accession numbers to the records.

Step 2: Label the fields within the records: AU for Author, TI for Title, etc. If a field is word-indexed, also label the words within each field. Exclude stop words: AN, FOR, THE, AND, FROM, TO, BY, WITH.


Step 3: Create the Basic Index: all words and phrases from fields containing subject-related terms.

Step 4: Create the Additional Indexes: all terms from all remaining fields.
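The four steps can be sketched with two dictionaries, one per index type. This is a simplified illustration under stated assumptions, not Dialog's actual implementation: the records are invented, only TI is treated as a subject-related field, subject fields are indexed word by word, remaining fields are indexed as whole phrases, and the `FIELD=value` key format is chosen for this sketch.

```python
from collections import defaultdict

# Stop words as listed in Step 2.
STOP_WORDS = {"an", "for", "the", "and", "from", "to", "by", "with"}

# Assumption: which field labels count as subject-related.
SUBJECT_FIELDS = {"TI"}

# Invented records standing in for the linear file from the information provider.
records = [
    {"AU": "Doe, Jane", "TI": "Databases for the Library"},
    {"AU": "Roe, Richard", "TI": "Searching Databases"},
]


def build_indexes(linear_file):
    """Return (basic_index, additional_index), each mapping a term to accession numbers."""
    basic = defaultdict(set)
    additional = defaultdict(set)
    # Step 1: assign sequential accession numbers, starting at 1.
    for accession, record in enumerate(linear_file, start=1):
        # Step 2: fields arrive already labeled (AU, TI, ...).
        for field, value in record.items():
            if field in SUBJECT_FIELDS:
                # Step 3: Basic Index, word-indexed, stop words excluded.
                for word in value.lower().split():
                    if word not in STOP_WORDS:
                        basic[word].add(accession)
            else:
                # Step 4: Additional Indexes, one entry per remaining field value.
                additional[f"{field}={value.lower()}"].add(accession)
    return dict(basic), dict(additional)


basic_index, additional_index = build_indexes(records)
print(sorted(basic_index))
print(sorted(additional_index))
```

Keeping subject terms and other fields in separate indexes means a plain keyword search consults only subject-related fields, while a field-qualified search (for example, on author) goes to the additional index.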
