Information Retrieval · ir.cis.udel.edu/~carteret/CISC689/slides/lecture1.pdf

3/17/09

Information Retrieval

CISC489/689-010, Lecture #1
Monday, Feb. 9

Ben Carterette

Information Retrieval

Information Retrieval

Domains, Applications, and Tasks

•  Web search
•  Vertical search
•  Enterprise search
•  Media search
•  Question answering
•  Recommender systems
•  Advertising
•  Personal item search
•  Passage retrieval
•  Filtering
•  Summarization
•  Clustering
•  Topic detection
•  Cross-language
•  Federated search
•  Metasearch
•  Social search
•  Novel-item retrieval

What is IR?

•  Gerard Salton, 1968:
   –  Information retrieval is a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information.

•  This class is about computational methods for the structure, analysis, organization, storage, searching, and retrieval of information.
   –  And primarily about text documents.

What is a Document?

•  Examples:
   –  web pages, email, books, news stories, scholarly papers, text messages, Word™, Powerpoint™, PDF, forum postings, patents, IM sessions, etc.

•  Common properties:
   –  Significant text content.
   –  Some structure (e.g., title, author, date for papers; subject, sender, destination for email).

Examples of Documents

<DOC> <DOCNO>WSJ890824-0049</DOCNO> <DD> = 890824 </DD> <AN> 890824-0049. </AN> <HL> Politics & Policy: @ FDA Focuses @ On Eli Lilly @ In Drug Inquiry @ --- @ Probe of the Industry Shifts @ To Possible Brand-Name @ Manufacturing Problems @ --- @ By Bill Richards and Bruce Ingersoll @ Staff Reporters of The Wall Street Journal </HL> <DD> 08/24/89 </DD> <SO> WALL STREET JOURNAL (J) </SO> <CO> LLY PRX </CO> <IN> DRUG MANUFACTURERS (DRG) </IN> <GV> FOOD AND DRUG ADMINISTRATION (FDA) </GV> <TEXT> Food and Drug Administration investigators are looking into possible brand-name drug manufacturing problems at an Indianapolis plant owned by Eli Lilly & Co.

… </TEXT> </DOC>

<DOC> <DOCNO>AP891117-0141</DOCNO> <FILEID>AP-NR-11-17-89 1612EST</FILEID> <FIRST>u w AM-GenericDrugs 11-17 0740</FIRST> <SECOND>AM-Generic Drugs,740</SECOND> <HEAD>FDA Chief Says Agency Needs More Power to Punish Cheating on Drug Tests</HEAD> <BYLINE>By DEBORAH MESCE</BYLINE> <BYLINE>Associated Press Writer</BYLINE> <DATELINE>WASHINGTON (AP) </DATELINE> <TEXT> The Food and Drug Administration chief told Congress on Friday the agency needs more authority to punish generic drug companies that cheat on safety tests and misrepresent data to win product approvals. … </TEXT> </DOC>

<DOC> <DOCNO>DOE1-01-0215</DOCNO> <TEXT> Interpretation of the relative GI ‘toxicities’ of cytotoxic drugs depends on the endpoint chosen. Histological assays of the dynamics of mitotic and necrotic cells in murine crypts revealed few apparently radical differences between individual drugs and between drugs and radiation. The microcolony assay of clonogenic cells reveals major differences between drugs in the ability of cells to maintain crypt integrity or to regenerate crypt-like structures… </TEXT> </DOC>

<DOC> <DOCNO>FR891016-0068</DOCNO> <DOCID>fr.10-16-89.f2.A1067</DOCID> <TEXT> <ITAG tagnum=69> <ITAG tagnum=41>[Docket No. 89N-0432]</ITAG>

<ITAG tagnum=56>Par Pharmaceutical, Inc.; Proposal to Withdraw Approval of Three Abbreviated New Drug Applications; Opportunity for a Hearing</ITAG>

<T2>SUMMARY: </T2>The Food and Drug Administration (FDA) proposes to withdraw approval of abbreviated new drug applications (ANDA's) 71-642, 71-643, and 72-337 held by Par Pharmaceutical, Inc., One Ram Ridge Rd., Spring Valley, NY 10977 (Par). The grounds for the proposed withdrawal are (1) that the applications contain untrue statements of material fact, and (2) that, based on new information evaluated together with the evidence available when the applications were approved, there is a lack of substantial evidence that the drugs will have the effects they purport or are represented to have under the conditions of use prescribed, recommended, or suggested in their labeling. </ITAG>

</TEXT> </DOC>

<DOC> <DOCNO>ZF109-649-919</DOCNO> <DOCID>09 649 919 OV: 09 649 805.&M; </DOCID>

<JOURNAL>PC Magazine Dec 11 1990 v9 n21 p428(2) * Full Text COPYRIGHT Ziff-Davis Publishing Co. 1990.&M; </JOURNAL> <TITLE>Generic 3D Drafting. (Software Review) (one of three evaluations of low-cost 3D CAD programs in ‘Low-cost CAD: modeling for the masses.’) (evaluation) </TITLE> <AUTHOR>Haase, Bruce.&M; </AUTHOR> <SUMMARY>Generic Software Inc’s $349 Generic 3D Drafting is a low-cost… </SUMMARY> <DESCRIPT> Company: Generic Software Inc. (Products).&O; Product: Generic 3-D Drafting 1.1 (CAD Software).&O; Topic: Computer-Aided Design … </DESCRIPT> <TEXT> … </TEXT> </DOC>

Query: Generic Drugs – Illegal Activities by Manufacturers

Description: To be relevant, a document must identify a specific generic drug company being investigated by the FDA or Congress. It also must identify the drug, i.e., the generic drug for Zantac.
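The example documents above share an SGML-like layout: each record is wrapped in <DOC>, carries an identifier in <DOCNO>, and holds its body in <TEXT>. The following is a minimal sketch of how such records might be read before indexing. The function name `parse_docs`, the regular expressions, and the abbreviated sample strings are assumptions made for illustration, not part of the course materials.

```python
import re

# Hypothetical patterns for the tags seen in the example documents.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
DOCNO_RE = re.compile(r"<DOCNO>\s*(.*?)\s*</DOCNO>", re.DOTALL)
TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def parse_docs(raw):
    """Return a list of (docno, text) pairs from a string of <DOC> blocks."""
    docs = []
    for body in DOC_RE.findall(raw):
        docno = DOCNO_RE.search(body)
        if docno is None:
            continue  # skip blocks that carry no identifier
        text = TEXT_RE.search(body)
        # Drop any nested tags inside <TEXT> and normalize whitespace.
        plain = TAG_RE.sub(" ", text.group(1)) if text else ""
        docs.append((docno.group(1), " ".join(plain.split())))
    return docs


# Abbreviated stand-ins for two of the documents shown above.
sample = """
<DOC> <DOCNO>AP891117-0141</DOCNO> <HEAD>FDA Chief Says Agency Needs More Power</HEAD>
<TEXT> The Food and Drug Administration chief told Congress on Friday </TEXT> </DOC>
<DOC> <DOCNO>DOE1-01-0215</DOCNO> <TEXT> Interpretation of the relative GI
toxicities of cytotoxic drugs depends on the endpoint chosen. </TEXT> </DOC>
"""

docs = parse_docs(sample)
for docno, text in docs:
    print(docno, "|", text[:40])
```

Regular expressions are enough for a sketch like this; a real collection with malformed or deeply nested markup would likely need a more forgiving parser.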

Documents vs. Database Records

•  Database records are typically made up of well-defined fields (or attributes).
   –  e.g. company names, addresses, account numbers, drug names, patent numbers, investigation file numbers.

•  Easy to compare fields with well-defined semantics to queries in order to find matches.

•  Our query has no fields and our documents have little structure.

IR vs. Databases

Information Retrieval

•  Data:
   –  Semi-structured.
   –  Heterogeneous.
   –  Noisy.
•  Unstructured or semi-structured queries.
•  Natural language semantics.
•  Infrequent off-line index changes.

Databases

•  Data:
   –  Structured.
   –  Homogeneous.
   –  Clean.
•  Structured queries.
•  Well-defined field semantics.
•  Frequent on-line index changes.

Generic Drugs – Illegal Activities by Manufacturers

Comparing Text

•  Determining whether a document matches a query is a fundamental problem of IR.

•  Exact match is not enough:
   –  Many different ways to state the same information.
   –  Documents may be relevant even when lacking some of the query terms.
   –  Documents may be nonrelevant even if they contain all the query terms.
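The two failure modes of exact matching can be seen in a small sketch. The tokenizer, the function `contains_all_terms`, and both example sentences are invented for illustration. The test simply asks whether every query term appears in the document.

```python
def tokens(text):
    """Lowercase and split on non-letters; a deliberately crude tokenizer."""
    cleaned = "".join(c.lower() if c.isalpha() else " " for c in text)
    return set(cleaned.split())


def contains_all_terms(query, document):
    """Exact-match test: every query term must appear in the document."""
    return tokens(query) <= tokens(document)


query = "generic drug investigation"

# On topic, but phrased with different words, so the test rejects it.
relevant = "FDA probes a maker of copycat medicines over falsified safety tests"

# Contains every query term, yet it is about software rather than pharmaceuticals.
nonrelevant = "Our investigation of a generic CAD package found a drug-store inventory demo"

print(contains_all_terms(query, relevant))     # False
print(contains_all_terms(query, nonrelevant))  # True
```

Both answers are wrong from the user's point of view, which is why ranking by a graded notion of relevance is preferred over a yes/no term test.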

Relevance

•  What does it mean for a document to be relevant?
   –  Simple definition: A relevant document contains information that a person was looking for when they submitted a query to the search engine.
   –  Many factors influence a person's decision about what is relevant: e.g., task, context, novelty, style.
   –  Topical relevance (same topic) vs. user relevance (everything else).

•  How can we build an engine that retrieves relevant documents?

Retrieval

•  Retrieval models define a view of relevance.
•  Ranking algorithms used in search engines are based on retrieval models.
•  Most models describe statistical properties of text rather than linguistic properties.
   –  i.e. counting simple text features such as words.
   –  Statistical approach started with Luhn in the '50s.
   –  Linguistic features can be part of a statistical model.
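As a rough illustration of "counting simple text features such as words", the sketch below computes two common counts: how often a term occurs in a document (term frequency) and how many documents contain it (document frequency). The three short documents and the function name `term_counts` are made up for this example.

```python
from collections import Counter


def term_counts(text):
    """Count lowercase word occurrences; punctuation is treated as whitespace."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return Counter(cleaned.split())


# Invented mini-collection, loosely echoing the example documents.
docs = {
    "d1": "The FDA is investigating generic drug makers.",
    "d2": "Generic 3D drafting software is a low-cost CAD program.",
    "d3": "Generic drug companies cheat on drug safety tests.",
}

# Term frequency: per-document word counts.
tf = {doc_id: term_counts(text) for doc_id, text in docs.items()}

# Document frequency: the number of documents containing each term.
df = Counter()
for counts in tf.values():
    df.update(counts.keys())

print(tf["d3"]["drug"])  # 2
print(df["generic"])     # 3
print(df["drug"])        # 2
```

Here "generic" appears in every document and so says little about which one is on topic, while "drug" separates d1 and d3 from d2. Statistical retrieval models build on exactly this kind of evidence.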

Evaluation

•  How do we know whether the engine is doing a good job of finding relevant documents?
   –  Evaluation comprises experimental procedures and measures for comparing system output with user expectations.
   –  IR evaluation methods are now used in many fields.
   –  Recall and precision are examples of effectiveness measures.
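Precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. A minimal sketch follows, with made-up document ids and a hypothetical helper name `precision_recall`:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Guard against empty sets so the ratios are always defined.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Invented judgments: the engine returned four documents, three are relevant overall.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d7"]

p, r = precision_recall(retrieved, relevant)
print(f"precision = {p:.2f}, recall = {r:.2f}")
```

In this example two of the four retrieved documents are relevant (precision 2/4) and two of the three relevant documents were found (recall 2/3).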

Not Just Documents

•  New applications increasingly involve new media.
   –  e.g. video, photos, music, speech.

•  Like text, content is difficult to describe and compare.
   –  Text may be used to represent them (e.g. tags).

•  IR approaches to search and evaluation are appropriate.

Dimensions of IR

Content         Applications         Tasks
Text            Web search           Ad hoc search
Images          Vertical search      Filtering
Video           Enterprise search    Classification
Scanned docs    Desktop search       Question answering
Audio           Forum search
Music           P2P search
                Literature search

IR Tasks

•  Ad-hoc search:
   –  Find relevant documents for an arbitrary text query.

•  Filtering:
   –  Identify relevant user profiles for a new document.

•  Classification:
   –  Identify relevant labels for documents.

•  Question answering:
   –  Give a specific answer to a question.

IR and Search Engines

•  A search engine is the practical application of information retrieval techniques to large scale text collections.

•  Relevance, retrieval, evaluation are issues.
•  So are users and information needs, performance, coverage, updating, scalability, adaptability, and ability to handle specific problems (like spam).

Components of a Search Engine

[Diagram; component labels: Universe of things to organize and search; Filter/crawler/domain/…; Corpus; Parser/tokenizer; Indexer; Server(s); User Interface; query; Query parser; Retrieval function f(Q,D); Retrieved results; Displayed results]
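The retrieval function f(Q,D) in the diagram assigns a score to a query-document pair, and results are displayed in score order. The sketch below uses a deliberately naive stand-in for f (the total count of query terms in the document). This scoring rule, the function names, and the tiny corpus are assumptions for illustration only, not the retrieval model taught in the course.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return cleaned.split()


def f(query, document):
    """Hypothetical retrieval function: total count of query terms in the document."""
    counts = Counter(tokenize(document))
    return sum(counts[t] for t in set(tokenize(query)))


def retrieve(query, corpus, k=10):
    """Score every document, drop zero scores, and return the top-k ids."""
    scored = [(f(query, text), doc_id) for doc_id, text in corpus.items()]
    scored = [(s, d) for s, d in scored if s > 0]
    # Highest score first; ties broken by document id for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:k]]


corpus = {
    "d1": "FDA investigators are looking into drug manufacturing problems",
    "d2": "Generic drug companies cheat on drug safety tests",
    "d3": "Generic 3D drafting is a low-cost CAD program",
}

print(retrieve("generic drug", corpus))  # ['d2', 'd1', 'd3']
```

Note that d3, a software review, still scores as high as d1 because it contains "generic". Better retrieval models weight terms rather than merely counting them.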

Building a Search Engine

•  Text processing and indexing.
   –  Parsing; tokenizing; stopping and stemming; inverted indexes; scalability; index updates.

•  Query processing and ranking.
   –  Query languages; index look-up; retrieval models; features; relevance feedback; user interaction.

•  Evaluation.
   –  Effectiveness at performing task; querying speed; user satisfaction.
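The text-processing steps listed above (tokenizing, stopping, stemming, building an inverted index, index look-up) can be sketched in a few lines. The stopword list and the suffix-stripping `stem` function are toy stand-ins, assumed here for illustration. A real engine would use a proper stemmer and a far more compact index layout.

```python
from collections import defaultdict

# Illustrative stopword list, not a standard one.
STOPWORDS = {"the", "a", "an", "of", "on", "to", "is", "and", "in"}


def stem(word):
    """Toy suffix stripper standing in for a real stemming algorithm."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def analyze(text):
    """Tokenize, remove stopwords, and stem the remaining words."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return [stem(w) for w in cleaned.split() if w not in STOPWORDS]


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every analyzed query term."""
    postings = [index.get(t, set()) for t in analyze(query)]
    return set.intersection(*postings) if postings else set()


docs = {
    1: "FDA investigating generic drugs",
    2: "Generic 3D drafting software",
    3: "Agency needs power to punish cheating on drug tests",
}
index = build_index(docs)

print(sorted(search(index, "generic drug")))  # [1]
print(sorted(search(index, "drugs")))         # [1, 3]
```

Because queries pass through the same `analyze` step as documents, "drugs" and "drug" map to the same index entry.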

Course Overview

•  This course is about information retrieval in practice: the application of IR to search engine design and implementation.

•  Course project:
   –  Design and implement a small search engine capable of indexing and searching Wikipedia pages.
   –  Evaluate its performance over provided queries.
   –  Add something interesting to it.

Course Structure

•  First half:
   –  Fundamentals of indexing, retrieval, and evaluation.
   –  By the midterm we will have covered all aspects of designing a basic search engine.

•  Second half:
   –  Additional topics in search engine functionality.
   –  Fielded search, user interaction, clustering, link-graph features, crawling, etc.

Textbook

•  Search Engines: Information Retrieval in Practice by W. Bruce Croft, Donald Metzler, and Trevor Strohman.

•  Unfortunately not yet published.
   –  I have PDFs of chapters.
   –  Also check supplemental texts on the course web page.

Course Project

•  Design and implement a small search engine to index and search Wikipedia pages.

•  Semester-long project in three phases:
   I.  Indexing.
   II.  Searching and evaluating.
   III.  Additional features.

•  By midterm we will have covered everything needed to complete the first two phases.

Course Project: Phases

•  For phases I and II, you will produce:
   –  A written report of your design decisions and implementation details, including problems you encountered and how you resolved them.
   –  Code.
   –  Milestone worksheet responses.

•  Timeline:
   –  Phase I: about 1.5 months.
   –  Phase II: about 1 month.
   –  Phase III: about 1 month.

Course Project: Milestones

•  Each phase has milestones to make sure you are not running into trouble.
   –  Worksheets with questions you can answer using your code.
   –  Milestones and worksheets will be available in advance so you may work ahead if desired (we recommend it).
   –  If you are having trouble, we will be able to help you early.

Course Project: Phase III

•  Phase III involves adding extra features to your engine.

•  Anything we cover in the second half of the course, or anything in the book but not covered, or something else.

•  You will write a 2-4 page proposal explaining how you would add the feature to your current code base.

•  At the end of the semester you will give a short presentation on your engine.

Course Project: Implementation

•  This is a programming project!
•  You may use any programming language the professor and/or TA understand.
   –  We highly recommend C, C++, or Java.

•  You will have accounts on my lab cluster.
   –  No disk quotas; 16 GB RAM per node; 8 cores per node.
   –  Do not use for file sharing or other illicit activity!

Course Project: Data

•  I have obtained all English-language Wikipedia pages.

•  The top 10% with highest PageRank are provided for the project.
   –  489 students must index and search 20% of those (2% of English Wikipedia).
   –  689 students must index and search 100% of those (10% of English Wikipedia).
   –  Extra credit: index and search even more.

Project Grading

•  Project is 60% of total grade.
•  Each phase is 20%.
   –  Phases I and II break down as follows:
      •  Written report (2-4 pages): 5%
      •  Code: 10%
      •  Turning in worksheets: 5%
   –  Phase III:
      •  Proposal (2-4 pages): 10% for 689, 15% for 489.
      •  Code: 5% for 689, 0% for 489.
      •  Final presentation: 5%.

Homeworks and Exams

•  In addition to the project, there will be 5 homeworks and 2 exams (midterm and final).

•  Each homework is 4% of total grade.
•  Each exam is 10% of total grade.
•  Exams will cover implementation details of project.

Books and Resources

•  Information Retrieval, Keith van Rijsbergen.
   –  http://www.dcs.gla.ac.uk/Keith/Preface.html

•  Introduction to Information Retrieval, Manning et al.
   –  http://www-csli.stanford.edu/~hinrich/information-retrieval-book.html

•  Check the course web page often!
   –  http://www.cis.udel.edu/~carteret/CISC689