CS 5604 Information Storage and Retrieval
Solr Team Final Presentation
Presenters: Liuqing Li, Ye Wang, Anusha Pillai, Ke Tian
{liuqing, yewang16, anusha89, ketian}@vt.edu
Instructor: Dr. Edward A. Fox
Virginia Polytechnic Institute and State University
Blacksburg, VA, 24061
December 6, 2016
Solr Team Final Presentation
Outline
• Background
• Implementation
• Problems Faced
• Lessons Learned
• Future Work
• Acknowledgement
Background: Overview
Background: Updates

                       Spring 2016                            Fall 2016
schema.xml             Coarse-grained                         Fine-grained
                       No copy fields                         Copy fields for all-fields search
                       Create stopwords.txt & profanity.txt   Update the two files
morphlines.conf        Two field types: string and text       Multiple field types
                       Field “time” => string                 Field “time” => datetime
                       No multiple-valued fields              Multiple-valued field parser
Basic Indexing         Small collection                       1.2 billion tweets dataset
Incremental Indexing   Virtual Cloudera (VC)                  VC & Hadoop Cluster (HC)
Recommendation         Brief description                      Implemented in VC & HC
Custom Ranking         Brief description                      Implemented in VC & HC
Solr Admin UI          Brief description                      Detailed description
                       Limited faceted search                 Detailed faceted search
Implementation: Basic Indexing
• Live Mode
  • Continuous stream of HBase cell updates into live search indexes
  • Simple and efficient
  • Cannot handle big data
• Batch Mode
  • Batch index tables in HBase by using MapReduce jobs
  • Write index files into HDFS (/user/cs5604f16_solr/…)
  • Can handle big data
Implementation: Basic Indexing
• schema.xml: fields configuration
  • field (e.g., ideal-cs5604f16-fake)
    • # of fields: 30
    • Types: string (22), text_general (2), int (2), float (2), long (1), date (1)
    • Stored: True (17), False (13)
  • dynamicField: matching multiple fields, using wildcard
  • copyField
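The three schema.xml element kinds listed above (field, dynamicField, copyField) can be illustrated with a minimal sketch. The schema fragment and all field names in it are hypothetical and much smaller than the 30-field schema described on the slide; the helper names `summarize` and `matches_dynamic` are also assumptions made for illustration.

```python
import fnmatch
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical schema fragment; field names are illustrative only.
SCHEMA = """
<schema name="example" version="1.5">
  <field name="id" type="string" indexed="true" stored="true"/>
  <field name="text" type="text_general" indexed="true" stored="true"/>
  <field name="retweet_count" type="int" indexed="true" stored="false"/>
  <field name="created_time" type="date" indexed="true" stored="true"/>
  <field name="all_text" type="text_general" indexed="true" stored="false" multiValued="true"/>
  <dynamicField name="*_s" type="string" indexed="true" stored="true"/>
  <copyField source="text" dest="all_text"/>
</schema>
"""

def summarize(schema_xml):
    """Count explicit fields by type and stored flag, and list dynamic and copy fields."""
    root = ET.fromstring(schema_xml.strip())
    fields = root.findall("field")
    return {
        "count": len(fields),
        "types": dict(Counter(f.get("type") for f in fields)),
        "stored": dict(Counter(f.get("stored") for f in fields)),
        "dynamic": [d.get("name") for d in root.findall("dynamicField")],
        "copies": [(c.get("source"), c.get("dest")) for c in root.findall("copyField")],
    }

def matches_dynamic(field_name, patterns):
    """Return True if field_name matches any dynamicField wildcard pattern."""
    return any(fnmatch.fnmatchcase(field_name, p) for p in patterns)

summary = summarize(SCHEMA)
print(summary)
print(matches_dynamic("topic_label_s", summary["dynamic"]))
```

In this sketch the copyField rule sends the content of `text` into `all_text`, which is one way an all-fields search target could be set up.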
Implementation: Basic Indexing
• stopword.txt and profanity.txt
  • stopword.txt: tf-idf values are not calculated for the listed words
  • profanity.txt: quick response for such search queries
  • Solr loads the two files while reading schema.xml
Source: https://pypi.python.org/pypi/many-stop-words
http://www.freewebheaders.com/full-list-of-bad-words-banned-by-google/
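The effect of the two word lists can be sketched outside Solr. This is a toy model under stated assumptions: the word sets are placeholders, the function name `analyze` is hypothetical, and the early return for profane queries is only one possible reading of "quick response", not a description of how the team configured Solr.

```python
# Placeholder word lists; the real stopword.txt and profanity.txt are much longer.
STOPWORDS = {"the", "a", "of", "and", "is"}
PROFANITY = {"badword"}

def analyze(query):
    """Return the tokens worth scoring, or None if the query hits the profanity list."""
    tokens = query.lower().split()
    if any(token in PROFANITY for token in tokens):
        return None  # assumed behavior: answer immediately without searching
    # Stop words are dropped, so no tf-idf weight is ever computed for them.
    return [token for token in tokens if token not in STOPWORDS]

print(analyze("The history of Solr"))
print(analyze("a badword query"))
```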
Implementation: Basic Indexing
• morphlines.conf: mapping and parsing
  • Mapping data from HBase to Solr
  • Split multiple values into list
    "topic_label_s": "twitter;social;media;text"
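The multi-value split can be shown with a short Python sketch. It only mirrors the idea of turning a semicolon-separated string into a list; it is not the morphline command the team used, and the function name `split_multi_valued` is an assumption.

```python
def split_multi_valued(record, field, sep=";"):
    """Return a copy of record where a separator-joined string field becomes a list."""
    out = dict(record)
    raw = out.get(field)
    if isinstance(raw, str):
        out[field] = [value.strip() for value in raw.split(sep) if value.strip()]
    return out

doc = split_multi_valued({"topic_label_s": "twitter;social;media;text"}, "topic_label_s")
print(doc)
```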
Implementation: Basic Indexing
• Index the big dataset

                    ideal-cs5604f16                ideal-cs5604f16-1204
Dataset             All collections (raw tweets)   All collections (raw tweets + processed data)
Indexing
  # of DataNodes    18                             17
  Space cost        392.33 GB                      399.21 GB
Time cost
  Mapping           1h21m                          1h45m
  Reducing          5h11m                          5h13m
  Merging           3h18m                          3h10m
  Total             9h50m                          10h8m
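The totals in the table follow from adding the three phase times. A small sketch of that arithmetic, using only the durations reported above (the helper names are assumptions):

```python
import re

def to_minutes(text):
    """Convert a duration such as '1h21m' into minutes."""
    match = re.fullmatch(r"(\d+)h(\d+)m", text)
    if match is None:
        raise ValueError(f"unexpected duration format: {text!r}")
    return int(match.group(1)) * 60 + int(match.group(2))

def to_text(minutes):
    """Convert minutes back into the 'XhYm' form used in the table."""
    return f"{minutes // 60}h{minutes % 60}m"

# Mapping, reducing, and merging times for each collection, as reported above.
phases = {
    "ideal-cs5604f16": ["1h21m", "5h11m", "3h18m"],
    "ideal-cs5604f16-1204": ["1h45m", "5h13m", "3h10m"],
}
totals = {name: to_text(sum(to_minutes(p) for p in parts)) for name, parts in phases.items()}
print(totals)
```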
Implementation: Incremental Indexing
• Purpose
  • Process a continuous stream of HBase cell updates into live search indexes (Near Real-Time, NRT indexing)
  • Solve the problem of frequent inserts, deletes and updates
• How does it work?
  • Enabling HBase replication (column family)
  • Pointing an NRT Indexer Service at an HBase table
  • Starting an NRT Indexer Service
• Our work
Source: http://www.cloudera.com/documentation/enterprise/5-6-x/topics/search_config_hbase_indexer_for_search.html
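The idea behind incremental indexing, applying each insert, update, or delete to the index as it arrives instead of rebuilding everything, can be sketched with a toy in-memory model. This is not the HBase indexer service: the event format, row keys, and the function name `apply_updates` are all hypothetical.

```python
def apply_updates(index, events):
    """Apply a stream of (op, row_key, fields) events to an in-memory index."""
    for op, row_key, fields in events:
        if op == "put":
            # Insert a new document or merge changed fields into an existing one.
            index.setdefault(row_key, {}).update(fields)
        elif op == "delete":
            index.pop(row_key, None)
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return index

index = {}
events = [
    ("put", "tweet-1", {"text": "hello"}),
    ("put", "tweet-2", {"text": "solr"}),
    ("put", "tweet-1", {"text": "hello world"}),  # update of an existing row
    ("delete", "tweet-2", {}),
]
apply_updates(index, events)
print(index)
```

Each event touches only one document, which is why this style suits frequent small changes, while batch indexing suits very large initial loads.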
Implementation: Incremental Indexing
• Create and check the NRT indexer
Implementation: Incremental Indexing
Restart the HBase Solr Indexer service
• Restart the service in VC
• Restart the service in HC
Implementation: Incremental Indexing
• Create and check the NRT indexer
• Check the results in HBase and Solr Admin UI
Implementation: Recommendation
• Types
  • Textual similarity based
  • Collaborative filtering
• MoreLikeThis component
  • Identifies similar documents to search result documents
  • Can be configured as a request handler or search component
  • Uses term vectors to compute similarity
  • Term vectors can be calculated during query runtime or precomputed during indexing
  • Extracts highest matching terms based on tf-idf similarity
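Textual-similarity recommendation with tf-idf term vectors can be sketched in a few lines. This is a simplified illustration, not Solr's MoreLikeThis implementation: the idf formula, the cosine ranking, the sample documents, and all function names are assumptions chosen to keep the example small.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build a tf-idf term vector for every document (whitespace tokenization)."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n = len(tokenized)
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    return {
        doc_id: {t: c * math.log(1 + n / df[t]) for t, c in Counter(tokens).items()}
        for doc_id, tokens in tokenized.items()
    }

def top_terms(doc_id, vectors, k=3):
    """Return the k highest-weighted terms of one document (ties broken alphabetically)."""
    ranked = sorted(vectors[doc_id].items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]

def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0

def more_like_this(doc_id, vectors, top_n=1):
    """Rank the other documents by similarity to doc_id and return the best ids."""
    scores = [(cosine(vectors[doc_id], v), other) for other, v in vectors.items() if other != doc_id]
    return [other for _, other in sorted(scores, reverse=True)[:top_n]]

# Hypothetical mini-collection.
docs = {
    "d1": "solr indexing tweets hbase",
    "d2": "solr indexing hbase cluster",
    "d3": "football match tonight",
}
vectors = tfidf_vectors(docs)
print(top_terms("d1", vectors))
print(more_like_this("d1", vectors))
```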
Implementation: Recommendation
• schema.xml
  • Set stored = true
  • Set termVectors = true (for calculating tf-idf)
  • After making changes, reindexing is mandatory
• solrconfig.xml
  • Enable mlt
  • Define other configuration parameters
    • e.g., mlt.fl, mlt.mintf, mlt.mindf, mlt.maxdf, mlt.qf
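The mlt parameters named above can also be passed on the query string. The sketch below only builds such a URL; the host, the field name `text`, and the parameter values are assumptions, and the helper `mlt_url` is hypothetical rather than part of any Solr client library.

```python
from urllib.parse import urlencode

def mlt_url(base, collection, query, fields, mintf=1, mindf=1, count=5):
    """Build a /select URL with the MoreLikeThis search component switched on."""
    params = {
        "q": query,
        "mlt": "true",                # enable MoreLikeThis for this request
        "mlt.fl": ",".join(fields),   # fields used to compute similarity
        "mlt.mintf": mintf,           # minimum term frequency in the source document
        "mlt.mindf": mindf,           # minimum document frequency in the index
        "mlt.count": count,           # similar documents returned per result
        "wt": "json",
    }
    return f"{base}/{collection}/select?{urlencode(params)}"

# Hypothetical host and field name; the collection name is taken from the slides.
url = mlt_url("http://localhost:8983/solr", "ideal-cs5604f16-fake", "id:123", ["text"])
print(url)
```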
Implementation: Recommendation
• RequestHandler
Link: https://drive.google.com/open?id=0B2iasHDgHqGyYUk0R3RkVktkM2M
Implementation: Recommendation
• SearchComponent
Link: https://drive.google.com/open?id=0B2iasHDgHqGyU0doVEpidlh3c2c
Implementation: Custom Ranking
• Purpose
  • Customize and optimize the ranked results
• How does it work?
  • SearchComponent
    • prepare(): pre-processing, invoked before the query is executed
    • process(): post-processing, invoked after all the results are fetched
  • Custom Scoring
  • Re-ranking
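\[
\mathrm{Score} = \mathrm{Doc}_{\mathrm{score,Solr}} + \mathrm{Doc}_{\mathrm{importance}} + W_{\mathrm{topic}} \times \mathrm{Doc}_{\mathrm{score,topic}} + W_{\mathrm{cluster}} \times \mathrm{Doc}_{\mathrm{score,cluster}}
\]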
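Reading the score formula above as the Solr score plus a document importance term plus weighted topic and cluster scores, the re-ranking step can be sketched as follows. The weights, the dictionary keys, and the sample values are hypothetical; the team's actual component is a Java SearchComponent, which this Python sketch does not reproduce.

```python
def combined_score(doc, w_topic=0.5, w_cluster=0.5):
    """Solr score + importance + weighted topic score + weighted cluster score."""
    return (
        doc["solr_score"]
        + doc["importance"]
        + w_topic * doc["topic_score"]
        + w_cluster * doc["cluster_score"]
    )

def rerank(docs, w_topic=0.5, w_cluster=0.5):
    """Return the documents sorted by combined score, highest first."""
    return sorted(docs, key=lambda d: combined_score(d, w_topic, w_cluster), reverse=True)

# Hypothetical result list as it might come back from the query phase.
results = [
    {"id": "a", "solr_score": 1.0, "importance": 0.1, "topic_score": 0.2, "cluster_score": 0.2},
    {"id": "b", "solr_score": 0.8, "importance": 0.5, "topic_score": 0.8, "cluster_score": 0.6},
]
reranked = rerank(results)
print([d["id"] for d in reranked])
```

With these made-up values, document b overtakes document a even though its raw Solr score is lower, which is the kind of effect a custom ranking component is meant to allow.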
Implementation: Custom Ranking
• Build and copy jar file into Hadoop Cluster
• Modify the solrconfig.xml
Implementation: Custom Ranking
• Update the instanceDir
• Reload the collection
• Check the results in Solr Admin UI
Implementation: Solr Admin UI
Choose ideal-cs5604f16-fake for querying
• Dashboard: provides basic functions for users to choose (Logging to check Solr logs for debugging)
• Core Selector: select the core (dataset) for queries
• Solr instance information: current versions, JVM information
Implementation: Solr Admin UI
• The request-handler: /select
• The query event: q
• Parameters for query: fq (filter queries), sort (descending or ascending)
• Execute query
• Results outputs: json format
• Field name
• Result statistics
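The query form in the Admin UI corresponds to a plain HTTP request against the /select handler. A minimal sketch of building that request follows; the host is an assumption, the filter value is made up, and `select_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

def select_url(base, collection, q, fq=None, sort=None, rows=10):
    """Build a /select URL with optional filter queries and a sort clause."""
    params = [("q", q)]
    for filter_query in fq or []:
        params.append(("fq", filter_query))  # each filter query is its own fq parameter
    if sort:
        params.append(("sort", sort))        # e.g. "field desc" or "field asc"
    params += [("rows", rows), ("wt", "json")]
    return f"{base}/{collection}/select?{urlencode(params)}"

# Hypothetical host and filter value; t_month_i is the field named on the faceting slide.
url = select_url(
    "http://localhost:8983/solr",
    "ideal-cs5604f16-fake",
    "*:*",
    fq=["t_month_i:12"],
    sort="t_month_i desc",
)
print(url)
```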
Implementation: Solr Admin UI
• The faceted search query: range
• Faceted search field: t_month_i
• Parameters, true when enabled
• Search results: counts
• Search results: details
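A range facet on t_month_i can be requested with Solr's facet.range parameters. The sketch below builds the parameters and reads counts back from a response; the range bounds, the sample response, and its counts are invented for illustration, and the parsing assumes the flat alternating list layout that Solr's JSON writer uses by default.

```python
def range_facet_params(field, start, end, gap):
    """Parameters for a range facet that returns counts only (rows=0)."""
    return {
        "q": "*:*",
        "rows": 0,
        "facet": "true",
        "facet.range": field,
        "facet.range.start": start,
        "facet.range.end": end,   # the upper bound is exclusive by default
        "facet.range.gap": gap,
        "wt": "json",
    }

def parse_range_counts(response, field):
    """Turn the flat [value, count, value, count, ...] list into a dict."""
    flat = response["facet_counts"]["facet_ranges"][field]["counts"]
    return dict(zip(flat[0::2], flat[1::2]))

params = range_facet_params("t_month_i", 1, 13, 1)

# Made-up response fragment; the counts are not real results.
sample_response = {
    "facet_counts": {
        "facet_ranges": {
            "t_month_i": {"counts": ["1", 10, "2", 20], "gap": 1, "start": 1, "end": 13}
        }
    }
}
counts = parse_range_counts(sample_response, "t_month_i")
print(params)
print(counts)
```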
Problems Faced
• Cloudera and OS
  • Virtual Cloudera seems slow and often crashes due to memory constraints
  • Not familiar with the whole architecture at the beginning
  • Versions of Cloudera and Solr
• Data
  • Consistency check
  • Not enough real data available to perform tests
  • Not much information available regarding logs to perform collaborative filtering
• Collaboration
  • Communication and modification
Lessons Learned
• Solr
• HBase
• HDFS
• Patience
• Carefulness
• Team Collaboration
Future Work
• Search
  • Customize more request handlers
  • Deal with the profanity issue
• Custom Ranking
  • Customize more search components
• Recommendation
  • Create a custom recommendation component (Probabilities – CTA team)
  • Implement the collaborative filtering (Log files – FE team)
• Solr
  • Figure out SolrCloud, multiple Solr nodes in Cloudera Search
Acknowledgement
• Projects
  • NSF IIS-1319578 III: Small: Integrated Digital Event Archiving and Library (IDEAL)
  • NSF IIS-1619028 III: Small: Collaborative Research: Global Event and Trend Archive Research (GETAR)
• Teams
  • CMT, CMW, CLA, CTA, FE teams
• Persons
  • Instructor: Dr. Edward A. Fox
  • GRA: Sunshin Lee
Thank you!
Questions?