
Letting In The Light

Using Solr as an External Search Component

Jay Luker

Benoit Thiell

SAO/NASA Astrophysics Data System

http://adsabs.harvard.edu/

The SAO/NASA Astrophysics Data System (ADS) is a Digital Library portal for researchers in Astronomy and Physics, operated by the Smithsonian Astrophysical Observatory (SAO) under a NASA grant.

Here's what to expect...

Overview of ADS

Overview of Invenio

Our Solr-Invenio Integration Project

A few tips on Solr hacking along the way

The ADS Project

Established in 1989 (before the web!) as a portal for accessing astronomical data and bibliographic metadata

Was restructured in 1994 to become an A&I service for astronomers and astrophysicists, with a fulltext archive

Has 100% penetration in the astronomical community, with take-up in other areas of space sciences, engineering and physics

1994 was the move to the web

ADS Holdings

Almost 9M bibliographic metadata records

625K fulltext articles

Painstakingly curated collection of citations and links to fulltext and data products

ADS Services

Free!

Search, Browse, Notifications, Personalization

API access to all content (TWITA)

Network of 12 mirror sites

ADS Labs: http://labs.adsabs.harvard.edu

Astronomy: 1.8M. Physics: 5.8M. arXiv e-prints: 650K. Citations: 40M (over 3.4M papers with citations). Curated links: 23M (fulltext, data products, citations). 4M scanned pages, 625K articles, 650K pages of historical material. Advanced search allows for searching by astronomical object (via SIMBAD) and attributes like "has dataset". TWITA = The Website Is The API: via the data_type= param, also structured metadata within the pages.

Never heard of Invenio?

1993: Started its life at CERN as a preprint server

2000: Extension of the server to allow storing multimedia content (photos, posters, brochures, videos) and creation of the open-source CDSware project

Renamed CDS Invenio and then Invenio

Both an institutional repository and a digital library

Check it out! http://invenio-software.org/

Why choose Invenio?

ADS and Invenio share the same objectives: store and disseminate information to scientific communities

Growing penetration in the field of physics

Metadata curation tools (record editor, merger)

Support for citation graphs and citation-based searches

Support for second-order searches

INSPIRE: Invenio for SPIRES, the Physics database at Stanford.

Under the hood

Written in Python (run under mod_wsgi), with some C and Lisp

Coupled with MySQL only (for now)

Scales to sets of 2M+ records

MARC storage of records

Modular architecture with:

OAI harvesting, OAI server

Format conversion (MARCXML, DC, NLM, etc)

References and citations handler

Plot and figure extraction

invenio.intbitset

Sets of Invenio record IDs (MARC controlfield 001)

In-house C implementation of Python sets

Fast marshalling functions for dumping and loading

Stored marshalled in the database and used as such in the search engine
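A minimal sketch of intbitset in use (the record ids here are invented; intbitset is the C extension Invenio ships, also available as a standalone Python package):

from intbitset import intbitset  # Invenio's C-backed integer set

recids = intbitset([1, 5, 16, 84])  # a set of Invenio record ids
blob = recids.fastdump()            # compact marshalled form, stored as a blob
restored = intbitset(blob)          # the constructor accepts a fastdump() string
assert restored == recids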

Invenio sounds great! Why use Solr then?

Invenio's search engine has trouble with 9M+ records (work in progress)

Invenio's indexing is slow by design (it trades indexing speed for search speed), but it is too slow for such a large repository

Solr has a wide community of users/developers and lots of extensions.

Issues with the integration

Keeping the metadata on both systems in sync

Invenio's search engine requires full sets of results

Communicating over HTTP with very large payloads

Invenio + Solr

Objectives

Take advantage of Solr fulltext indexing & searching

Take advantage of Solr faceting

Not duplicate existing Invenio functionality

Write as little code as possible

Keep things loosely coupled

Obviously, performance was also an objective.

The Invenio team had been skeptical of the necessity of incorporating an external tool/service to do fulltext indexing and/or faceting, but once introduced to Solr they quickly came around.

In spite of the fact that at least some of the fancypants sorting, ranking, and filtering functionality could most likely be reproduced using Solr, there was a strong reluctance to rewrite that code.

Writing as little Java as possible doesn't just come from a Java-phobic frame of mind; it's also about limiting how much we rely on custom Solr components. Rely as much as possible on what Solr affords.

Loose integration in this case means the ability to swap in alternate services for retrieving fulltext search results and facets. More on how we succeeded in that towards the end.

Problem #1: Retrieving a very large result set of ids. Like, millions.

The WTH Approach

http://myhost:8983/solr/select?q={foo}&fl=id&rows={n}

Query for foo

Only return the id field

Return n rows of the result
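In Python 2, the naive client looks something like this (the host, query, and wt=json parameter are illustrative, not from the slides):

import urllib2
import simplejson

# naive approach: pull the id field of every hit through the standard handler
url = 'http://myhost:8983/solr/select?q=foo&fl=id&rows=9000000&wt=json'
response = simplejson.load(urllib2.urlopen(url))
ids = [doc['id'] for doc in response['response']['docs']]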

(A bit about ids)

Schema ids

Defined in your schema.xml

Can be integers, strings, etc

Typically set as the <uniqueKey>

Lucene ids

Internal to Lucene

Always integers

Unique within an index segment

When we talk about the ids being sent back and forth between Invenio & Solr we are talking about the schema ids.
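For example, a typical schema.xml declaration for an integer schema id (a sketch, not ADS's actual schema):

<!-- schema.xml sketch: the schema id that Invenio and Solr exchange -->
<field name="id" type="int" indexed="true" stored="true" required="true"/>
<uniqueKey>id</uniqueKey>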

The WTH Approach

[Chart: response times, in seconds, for retrieving large id sets the naive way]

* warmed cache, different servers, same LAN

So what's going on here? Our first thought was maybe it was the time needed to serialize/de-serialize the response, but that turned out not to be it.

So what's going on here?

[Diagram: to build a normal query response, Solr takes each internal Lucene doc id in the QueryResult (e.g. [1,5,16,84,...]) and loads the full Lucene Doc (id: 1234, bibcode: ..., title: ...) through the documentCache]

Three solrconfig.xml settings govern this stage:

queryResultMaxDocsCached

queryResultWindowSize

enableLazyFieldLoading
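Those three settings live in solrconfig.xml; the values below are illustrative defaults, not ADS's actual configuration:

<!-- solrconfig.xml sketch (illustrative values) -->
<queryResultMaxDocsCached>200</queryResultMaxDocsCached>
<queryResultWindowSize>20</queryResultWindowSize>
<enableLazyFieldLoading>true</enableLazyFieldLoading>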

Solution: Custom Collector

[Diagram: with the custom Collector, the QueryResult ids (e.g. [1,5,16,84,...]) go straight into the Query Response, skipping the document-loading step entirely]

...
InvenioIdCollector collector = new InvenioIdCollector();
searcher.search(query, collector);
ArrayList<Integer> ids = collector.getIds();
rsp.add("ids", ids);  // schema ids go straight into the response
...

MyQueryComponent.java

...
ArrayList<Integer> ids = new ArrayList<Integer>();
...
public void collect(int doc) {
    // idMap translates internal Lucene ids to schema ids
    this.ids.add(this.idMap[doc]);
}
...

MyCollector.java


OK, Let's Try This Again

http://myhost:8983/solr/select?q={foo}&qt=my_querytype

Query for foo

Use our custom query handler

No need to specify number of rows or which fields to return

Better. But ...

Problem #2: Facets.

[Diagram: Invenio owns the pipeline Query -> Processing -> Post-processing -> Return/Render; Solr handles the fulltext search and returns record ids to Invenio]

What's Missing?

Post-processing = 2nd-order searching, filtering. We can't retrieve facets with the initial query because the final list of search results will depend on Invenio post-processing. So how do you send a very large set of ids to get a set of facet results?

Again, WTH?

[Diagram: after post-processing, Invenio sends its surviving record ids back to Solr's fulltext index and gets facets in return. But record ids? How do you POST millions of them?]

Current Solution

[Diagram: Solr returns fulltext search results to Invenio as an InvenioBitSet, and Invenio posts an InvenioBitSet of record ids back to Solr to retrieve facets]

Satisfies almost all objectives.

We get searching & faceting.

We don't have to write a lot of Python or Java: Invenio needs the indexing piece.

We're not duplicating anything that Invenio already does very well.

It's loosely coupled: because communication is in a form that is native to Invenio, we could easily swap in/out different services for either piece.

Parts Required

Custom QueryComponent for accepting a fulltext search query and returning an Integer BitSet

Custom Collector to collect doc ids

Custom BitSet class (maybe)

Custom BinaryResponseWriter

Custom QueryComponent for accepting an Integer BitSet query and returning facets

Seems like a lot, but in total lines of code it's not that much, especially considering it's in Java. Plus, I suck at Java and I was able to do it all in 2-3 weeks of trial-and-error hacking.

Plus, it all very closely conforms to the affordances of the Solr API. Only one small thing might be considered a hack.

Invenio Query Component Config

<!-- reconstructed: the XML tags were stripped in transcription, leaving only
     the names bitset_stream, invenio_query and stats; the package name and
     handler wiring below are assumed -->
<queryResponseWriter name="bitset_stream"
    class="org.ads.solr.InvenioBitsetStreamResponseWriter"/>

<searchComponent name="invenio_query"
    class="org.ads.solr.InvenioQueryComponent"/>

<requestHandler name="/select" class="solr.SearchHandler">
  <arr name="components">
    <str>invenio_query</str>
    <str>stats</str>
  </arr>
</requestHandler>

...

solrconfig.xml

Defining our custom query component and telling the default Solr search handler to use it. Also defining our custom response writer.

Invenio Query Component

public void process(ResponseBuilder rb) throws IOException {
    SolrQueryResponse rsp = rb.rsp;
    SolrIndexSearcher searcher = rb.req.getSearcher();

    InvenioIdCollector collector = new InvenioIdCollector();

    SolrIndexSearcher.QueryCommand cmd = rb.getQueryCommand();
    Query query = cmd.getQuery();

    // run the query with our collector and put the
    // resulting id bitset on the response
    searcher.search(query, collector);
    InvenioBitSet bitset = collector.getBitSet();
    rsp.add("bitset", bitset);
}

InvenioQueryComponent.java

A query component class has two opportunities to interact with the incoming request: prepare & process. We only need process.

Invenio Id Collector

public void setNextReader(IndexReader reader, int docBase)
        throws IOException {
    this.reader = reader;
    this.docBase = docBase;

    // cache the Lucene-id -> schema-id mapping for this segment
    try {
        this.idMap = FieldCache.DEFAULT.getInts(this.reader, "id");
    } catch (IOException e) {
        SolrException.logOnce(SolrCore.log,
            "Exception during idMap init", e);
    }
}

InvenioIdCollector.java

Response Writer

public void write(OutputStream out, SolrQueryRequest req,
        SolrQueryResponse rsp) {
    InvenioBitSet bitset =
        (InvenioBitSet) rsp.getValues().get("bitset");
    // zlib-compress the marshalled bitset onto the wire
    ZOutputStream zOut = new ZOutputStream(out, JZlib.Z_BEST_SPEED);

    try {
        zOut.write(bitset.toByteArray());
        zOut.flush();
    } catch (IOException e) {
        SolrException.logOnce(SolrCore.log,
            "Exception during compression/output of bitset", e);
    }
}

InvenioBitsetStreamResponseWriter.java

[Chart: response times for the bitset response]

These times include decompressing and unmarshalling the bitset into an Invenio intbitset object in Python.
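On the Invenio side, fetching and reconstituting the bitset takes only a few lines. A sketch, assuming the handler and writer names defined in the config above; exactly how the decompressed bytes map onto intbitset's marshalled format depends on the InvenioBitSet implementation:

import urllib2
import zlib

from intbitset import intbitset

# fetch the zlib-compressed bitset from the custom handler
url = ('http://myhost:8983/solr/select'
       '?q=foo&qt=invenio_query&wt=bitset_stream')
payload = urllib2.urlopen(url).read()

# decompress, then unmarshal into an intbitset (assumes the byte layout
# produced by InvenioBitSet.toByteArray() matches what intbitset expects)
bitset = intbitset(zlib.decompress(payload))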

Invenio Facet Component Config

<!-- reconstructed: the XML tags were stripped in transcription; the surviving
     values were json, OR, 0, true, author_facet, invenio_facets and facet,
     so the parameter names and wiring below are assumed -->
<searchComponent name="invenio_facets"
    class="org.ads.solr.InvenioFacetComponent"/>

<requestHandler name="/invenio_facets" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="wt">json</str>
    <str name="q.op">OR</str>
    <int name="facet.mincount">0</int>
    <bool name="facet">true</bool>
    <str name="facet.field">author_facet</str>
    ...
  </lst>
  <arr name="components">
    <str>invenio_facets</str>
    <str>facet</str>
  </arr>
</requestHandler>

solrconfig.xml

Defining our custom facet component and adding it, along with Solr's standard facet component, to the handler's component list. Also setting default facet parameters.

A bit of python

import mimetools
import urllib2

import simplejson

r = urllib2.Request(facet_query_url)
data = bitset.fastdump()
boundary = mimetools.choose_boundary()

# hand-build a multipart/form-data body carrying the marshalled
# bitset as a file-style part named "bitset"
contents = '--%s\r\n' % boundary
contents += ('Content-Disposition: form-data; '
             'name="bitset"; filename="bitset"\r\n')
contents += 'Content-Type: application/octet-stream\r\n'
contents += '\r\n' + data + '\r\n'
contents += '--%s--\r\n\r\n' % boundary
r.add_data(contents)

r.add_unredirected_header('Content-Type',
    'multipart/form-data; boundary=%s' % boundary)

u = urllib2.urlopen(r)
facet_data = simplejson.load(u)

Facet Query Component

...
Iterable<ContentStream> streams = req.getContentStreams();
...
// decompress the posted bitset payload
InputStream is = stream.getStream();
ByteArrayOutputStream bOut = new ByteArrayOutputStream();
ZInputStream zIn = new ZInputStream(is);

IOUtils.copy(zIn, bOut);
InvenioBitSet bitset = new InvenioBitSet(bOut.toByteArray());
...

InvenioFacetComponent.java

Facet Query Component (cont.)

...
// translate each set bit (a schema id) into its internal
// Lucene id and collect them in a DocSet filter
BitDocSet docSetFilter = new BitDocSet();
int i = 0;
while (bitset.nextSetBit(i) != -1) {
    int nextBit = bitset.nextSetBit(i);
    int lucene_id = idMap.get(nextBit);
    docSetFilter.add(lucene_id);
    i = nextBit + 1;
}
...
// run the facet query with the bitset as a filter
SolrIndexSearcher.QueryCommand cmd = rb.getQueryCommand();
cmd.setFilter(docSetFilter);
SolrIndexSearcher.QueryResult result =
    new SolrIndexSearcher.QueryResult();
searcher.search(result, cmd);
rb.setResult(result);
...

InvenioFacetComponent.java

Alternative Approaches

PyLucene

Embedded Solr

CPython within Java

...

PyLucene is a Python wrapper around Java Lucene. It embeds a Java VM with Lucene into a Python process. The extension is machine-generated with JCC, a C++ code generator that makes it possible to call into Java classes from Python via the Java Native Interface (JNI).

Further Study

Can we make use of Solr's OpenBitSet?

Is there a way to bypass the Collector stage completely?

How can we return document scores?


Thanks!

Thanks also to: The ADS Team, @adsabs

The Invenio Team, especially...

Roman Chyla

Jan Iwaszkiewicz

https://github.com/lbjay/solr-invenio
