lucid imagination, inc. – 1 big, bigger biggest large scale issues: phrase queries and common...

Lucid Imagination, Inc. – http://www.lucidimagination.com 1

Big, Bigger BiggestLarge scale issues: Phrase queries and common words OCR

Tom Burton WestHathi Trust Project

Lucid Imagination, Inc. – http://www.lucidimagination.com

Hathi Trust Large Scale Search ChallengesGoal: Design a system for full-text search that

will scale to 5 million to 20 million volumes (at a reasonable cost.)

Challenges:

Must scale to 20 million full-text volumes

Very long documents compared to most large-scale search applications

Multilingual collection

OCR quality varies

2


Index Size, Caching, and Memory

Our documents average about 300 pages which is about 700KB of OCR.

Our 5 million document index is between 2 and 3 terabytes. About 300 GB per million documents

Large index means disk I/O is bottleneck

Tradeoff JVM vs OS memory

Solr uses OS memory (disk I/O caching) for caching of postings

Memory available for disk I/O caching has most impact on response time (assuming adequate cache warming)

Fitting entire index in memory not feasible with terabyte size index

3


Response time varies with query

4

Average: 673

Median: 91

90th: 328

99th: 7,504


Slowest 5 % of queries

Response Time 95th percentile (seconds)

0

1

10

100

1,000

940 950 960 970 980 990 1,000

Query number

Resp

onse

Tim

e (s

econ

ds)

The slowest 5% of queries took about 1 second or longer.

The slowest 1% of queries took between 10 seconds and 2 minutes.

Slowest 0.5% of queries took between 30 seconds and 2 minutes

These queries affect response time of other queries

Cache pollution

Contention for resources

Slowest queries are phrase queries containing common words


Query processingPhrase queries use position index (Boolean queries do not).

Position index accounts for 85% of index size

Position list for common words such as “the” can be many GB in size

This causes lots of disk I/O .

Solr depends on the operating systems disk cache to reduce disk I/O requirements for words that occur in more than one query

I/O from Phrase queries containing common words pollutes the cache

6


Slow QueriesSlowest test query: “the lives and literature of the beat generation” took 2 minutes.4MB data read for Boolean query. 9,000+ MB read for Phrase query.

WORD NUMBER OF DOCUMENTS

POSTINGS LIST (SIZE MB)

TOTAL TERM OCCURRENCES (MILLIONS)

POSITION LIST (SIZE MB)

the 800,000 0.8 4,351 4,351

of 892,000 0.89 2,795 2,795

and 769,000 0.77 1,870 1,870

literature 435,000 0.44 9 9

generation 414,000 0.41 5 5

lives 432,000 0.43 5 5

beat 278,000 0.28 1 1

TOTALTOTAL 4.024.02 9,0369,036

7


Why not use Stop Words?The word “the” occurs more than 4 billion times in our 1 million document index.

Removing “stop” words (“the”, “of” etc.) not desirable for our use cases.

Couldn’t search for many phrases

“to be or not to be”

“the who”

“man in the moon” vs. “man on the moon”

Stop words in one language are content words in another language

German stop words “war” and “die” are content words in English

English stop words “is” and “by” are content words (“ice” and “village”) in Swedish

8


“CommonGrams”

Ported Nutch “CommonGrams” algorithm to Solr

Create Bi-Grams selectively for any two word sequence containing common terms

Slowest query: “The lives and literature of the beat generation”

“the-lives” “lives-and”

“and-literature” “literature-of”

“of-the” “the-beat” “generation”

9


Standard index vs. CommonGrams

Standard Index Common Grams

WORD

TOTAL OCCURRENCES

IN CORPUS (MILLIONS)

NUMBER OF DOCS

(THOUSANDS)

the 2,013 386

of 1,299 440

and 855 376

literature 4 210

lives 2 194

generation 2 199

beat 0.6 130

TOTALTOTAL 4,1764,176

TERM

TOTAL OCCURRENCES

IN CORPUS (MILLIONS)

NUMBER OF DOCS

(THOUSANDS)

of-the 446 396

generation 2.42 262

the-lives 0.36 128

literature-of 0.35 103

lives-and 0.25 115

and-literature 0.24 77

the-beat 0.06 26

TOTALTOTAL 450450


Comparison of Response time (ms)

AVERAGE MEDIAN 90th 99th

SLOWEST QUERY

Standard Index 459 32 146 6,784 120,595

Common Grams 68 3 71 2,226 7,800

11


Other issues

Analyze your slowest queries

We analyzed the slowest queries from our query logs and discovered additional “common words” to be added to our list.

We used Solr Admin panel to run our slowest queries from our logs with the “debug” flag checked.

We discovered that words such as “l’art” were being split into two token phrase queries.

We used the Solr Admin Analysis tool and determined that the analyzer we were using was the culprit.

12


Other issues

We broke Solr … temporarily

Dirty OCR in combination with over 200 languages creates indexes with over 2.4 billion unique terms

Solr/Lucene index size was limited to 2.1 Billion unique terms

Patched: Now it’s 274 Billion

Dirty OCR is difficult to remove without removing “good” words.

Because Solr/Lucene tii/tis index uses pointers into the frequency and position files we suspect that the performance impact is minimal compared to disk I/O demands, but we will be testing soon.

13

lucid imagination, inc. – 1 big, bigger biggest large scale issues: phrase queries and common...

Documents

lucid imagination

resources slowest queries

slowest query standard

position index boolean

good words

additional common words

token phrase queries

query io