Faceting optimizations for Solr
![Page 1: Faceting optimizations for Solr](https://reader036.vdocuments.net/reader036/viewer/2022062412/58aaa71e1a28abfa0e8b5761/html5/thumbnails/1.jpg)
OCTOBER 13-16, 2015 • AUSTIN, TX
![Page 2: Faceting optimizations for Solr](https://reader036.vdocuments.net/reader036/viewer/2022062412/58aaa71e1a28abfa0e8b5761/html5/thumbnails/2.jpg)
Faceting optimizations for Solr
Toke Eskildsen
Search Engineer / Solr Hacker
State and University Library, Denmark
@TokeEskildsen / [email protected]
![Page 3: Faceting optimizations for Solr](https://reader036.vdocuments.net/reader036/viewer/2022062412/58aaa71e1a28abfa0e8b5761/html5/thumbnails/3.jpg)
3/55
Overview
- Web scale at the State and University Library, Denmark
- Field faceting 101
- Optimizations
  - Reuse
  - Tracking
  - Caching
  - Alternative counters
![Page 4: Faceting optimizations for Solr](https://reader036.vdocuments.net/reader036/viewer/2022062412/58aaa71e1a28abfa0e8b5761/html5/thumbnails/4.jpg)
4/55
Web scale for a small web
- Denmark
  - Consolidation circa 10th century
  - 5.6 million people
- Danish Net Archive (http://netarkivet.dk)
  - Constitution 2005
  - 20 billion items / 590TB+ raw data
![Page 5: Faceting optimizations for Solr](https://reader036.vdocuments.net/reader036/viewer/2022062412/58aaa71e1a28abfa0e8b5761/html5/thumbnails/5.jpg)
5/55
Indexing 20 billion web items / 590TB into Solr
- Solr index size is 1/9th of real data = 70TB
- Each shard holds 200M documents / 900GB
- Shards are built chronologically by a dedicated machine
- Projected 80 shards
- Current build time per shard: 4 days
- Total build time is 20 CPU-core years
So far only 7.4 billion documents / 27TB in index
![Page 6: Faceting optimizations for Solr](https://reader036.vdocuments.net/reader036/viewer/2022062412/58aaa71e1a28abfa0e8b5761/html5/thumbnails/6.jpg)
6/55
Searching a 7.4 billion documents / 27TB Solr index
- SolrCloud with 2 machines, each having
  - 16 HT-cores, 256GB RAM, 25 * 930GB SSD
  - 25 shards @ 900GB
  - 1 Solr/shard/SSD, Xmx=8g, Solr 4.10
  - Disk cache 100GB or < 1% of index size
![Page 7: Faceting optimizations for Solr](https://reader036.vdocuments.net/reader036/viewer/2022062412/58aaa71e1a28abfa0e8b5761/html5/thumbnails/7.jpg)
7/55
![Page 8: Faceting optimizations for Solr](https://reader036.vdocuments.net/reader036/viewer/2022062412/58aaa71e1a28abfa0e8b5761/html5/thumbnails/8.jpg)
8/55
String faceting 101 (single shard)
    counter = new int[ordinals]
    for docID: result.getDocIDs()
      for ordinal: getOrdinals(docID)
        counter[ordinal]++

    for ordinal = 0 ; ordinal < counter.length ; ordinal++
      priorityQueue.add(ordinal, counter[ordinal])

    for entry: priorityQueue
      result.add(resolveTerm(entry.ordinal), entry.count)

| ord | term | counter |
|-----|------|---------|
| 0   | A    | 0       |
| 1   | B    | 3       |
| 2   | C    | 0       |
| 3   | D    | 1006    |
| 4   | E    | 1       |
| 5   | F    | 1       |
| 6   | G    | 0       |
| 7   | H    | 0       |
| 8   | I    | 3       |
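The three loops above can be tried out on toy data. The following is a minimal Python sketch, not Solr code: `facet_top_terms`, `ordinals_of` and `terms` are hypothetical names, and a heap-based top-X selection stands in for the priority queue.

```python
import heapq

def facet_top_terms(doc_ids, ordinals_of, terms, top_x):
    """Count term ordinals for the matching documents and return the top-X terms."""
    counter = [0] * len(terms)                  # one counter per unique term
    for doc_id in doc_ids:
        for ordinal in ordinals_of[doc_id]:
            counter[ordinal] += 1
    # Visit every counter and keep the top-X by count (ties: lowest ordinal first)
    top = heapq.nlargest(top_x, range(len(counter)),
                         key=lambda o: (counter[o], -o))
    # Resolve ordinals to terms, skipping terms that were never counted
    return [(terms[o], counter[o]) for o in top if counter[o] > 0]

terms = ["A", "B", "C", "D"]                       # ordinal -> term
ordinals_of = {0: [1], 1: [3], 2: [3], 3: [1, 3]}  # docID -> ordinals
print(facet_top_terms([0, 1, 2, 3], ordinals_of, terms, 2))
```

Note that the selection step touches every counter regardless of how few documents matched, which is the cost the later slides address.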
![Page 9: Faceting optimizations for Solr](https://reader036.vdocuments.net/reader036/viewer/2022062412/58aaa71e1a28abfa0e8b5761/html5/thumbnails/9.jpg)
9/55
Test setup 1 (easy start)
- Solr setup
  - 16 HT-cores, 256GB RAM, SSD
  - Single shard 250M documents / 900GB
- URL field
  - Single String value
  - 200M unique terms
- 3 concurrent “users”
- Random search terms
![Page 10: Faceting optimizations for Solr](https://reader036.vdocuments.net/reader036/viewer/2022062412/58aaa71e1a28abfa0e8b5761/html5/thumbnails/10.jpg)
10/55
Vanilla Solr, single shard, 250M documents, 200M values, 3 users
![Page 11: Faceting optimizations for Solr](https://reader036.vdocuments.net/reader036/viewer/2022062412/58aaa71e1a28abfa0e8b5761/html5/thumbnails/11.jpg)
11/55
Allocating and dereferencing 800MB arrays
![Page 12: Faceting optimizations for Solr](https://reader036.vdocuments.net/reader036/viewer/2022062412/58aaa71e1a28abfa0e8b5761/html5/thumbnails/12.jpg)
12/55
Reuse the counter
    counter = new int[ordinals]
    for docID: result.getDocIDs()
      for ordinal: getOrdinals(docID)
        counter[ordinal]++

    for ordinal = 0 ; ordinal < counter.length ; ordinal++
      priorityQueue.add(ordinal, counter[ordinal])

    <counter is no longer referenced and will be garbage collected at some point>
![Page 13: Faceting optimizations for Solr](https://reader036.vdocuments.net/reader036/viewer/2022062412/58aaa71e1a28abfa0e8b5761/html5/thumbnails/13.jpg)
13/55
Reuse the counter
    counter = pool.getCounter()
    for docID: result.getDocIDs()
      for ordinal: getOrdinals(docID)
        counter[ordinal]++

    for ordinal = 0 ; ordinal < counter.length ; ordinal++
      priorityQueue.add(ordinal, counter[ordinal])

    pool.release(counter)

Note: The JSON Facet API in Solr 5 already supports reuse of counters
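A counter pool along these lines could look like the sketch below. It is an illustration only: `CounterPool` and its methods are hypothetical names, and clearing on release is one possible design, not necessarily what the sparse faceting code does.

```python
class CounterPool:
    """Hypothetical pool handing out zeroed counters of a fixed size."""

    def __init__(self, size):
        self.size = size
        self.free = []          # cleared counters ready for reuse
        self.allocations = 0    # how many counters were actually allocated

    def get_counter(self):
        if self.free:
            return self.free.pop()
        self.allocations += 1
        return [0] * self.size

    def release(self, counter):
        # Clear before pooling so the next request starts from zero
        for i in range(len(counter)):
            counter[i] = 0
        self.free.append(counter)

pool = CounterPool(8)
for _ in range(3):              # three sequential facet requests
    counter = pool.get_counter()
    counter[3] += 1
    pool.release(counter)
print(pool.allocations)         # a single allocation serves all three requests
```

The allocation cost is paid once, but every release still clears the full array.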
![Page 14: Faceting optimizations for Solr](https://reader036.vdocuments.net/reader036/viewer/2022062412/58aaa71e1a28abfa0e8b5761/html5/thumbnails/14.jpg)
14/55
Using and clearing 800MB arrays
![Page 15: Faceting optimizations for Solr](https://reader036.vdocuments.net/reader036/viewer/2022062412/58aaa71e1a28abfa0e8b5761/html5/thumbnails/15.jpg)
15/55
Reusing counters vs. not doing so
![Page 16: Faceting optimizations for Solr](https://reader036.vdocuments.net/reader036/viewer/2022062412/58aaa71e1a28abfa0e8b5761/html5/thumbnails/16.jpg)
16/55
Reusing counters, now with readable visualization
![Page 17: Faceting optimizations for Solr](https://reader036.vdocuments.net/reader036/viewer/2022062412/58aaa71e1a28abfa0e8b5761/html5/thumbnails/17.jpg)
17/55
Reusing counters, now with readable visualization
Why does it always take more than 500ms?
![Page 18: Faceting optimizations for Solr](https://reader036.vdocuments.net/reader036/viewer/2022062412/58aaa71e1a28abfa0e8b5761/html5/thumbnails/18.jpg)
18/55
Iteration is not free
    counter = pool.getCounter()
    for docID: result.getDocIDs()
      for ordinal: getOrdinals(docID)
        counter[ordinal]++

    for ordinal = 0 ; ordinal < counter.length ; ordinal++
      priorityQueue.add(ordinal, counter[ordinal])

    pool.release(counter)

200M unique terms = 800MB
![Page 19: Faceting optimizations for Solr](https://reader036.vdocuments.net/reader036/viewer/2022062412/58aaa71e1a28abfa0e8b5761/html5/thumbnails/19.jpg)
19/55
    ord:      0  1  2  3  4  5  6  7  8
    counter:  0  0  0  0  0  0  0  0  0
    tracker:  N/A N/A N/A N/A N/A N/A N/A N/A N/A

Tracking updated counters
![Page 20: Faceting optimizations for Solr](https://reader036.vdocuments.net/reader036/viewer/2022062412/58aaa71e1a28abfa0e8b5761/html5/thumbnails/20.jpg)
20/55
    ord:      0  1  2  3  4  5  6  7  8
    counter:  0  0  0  1  0  0  0  0  0
    tracker:  3  N/A N/A N/A N/A N/A N/A N/A N/A

    counter[3]++

Tracking updated counters
![Page 21: Faceting optimizations for Solr](https://reader036.vdocuments.net/reader036/viewer/2022062412/58aaa71e1a28abfa0e8b5761/html5/thumbnails/21.jpg)
21/55
    ord:      0  1  2  3  4  5  6  7  8
    counter:  0  1  0  1  0  0  0  0  0
    tracker:  3  1  N/A N/A N/A N/A N/A N/A N/A

    counter[3]++
    counter[1]++

Tracking updated counters
![Page 22: Faceting optimizations for Solr](https://reader036.vdocuments.net/reader036/viewer/2022062412/58aaa71e1a28abfa0e8b5761/html5/thumbnails/22.jpg)
22/55
    ord:      0  1  2  3  4  5  6  7  8
    counter:  0  3  0  1  0  0  0  0  0
    tracker:  3  1  N/A N/A N/A N/A N/A N/A N/A

    counter[3]++
    counter[1]++
    counter[1]++
    counter[1]++

Tracking updated counters
![Page 23: Faceting optimizations for Solr](https://reader036.vdocuments.net/reader036/viewer/2022062412/58aaa71e1a28abfa0e8b5761/html5/thumbnails/23.jpg)
23/55
    ord:      0  1  2  3     4  5  6  7  8
    counter:  0  3  0  1006  1  1  0  0  3
    tracker:  3  1  8  4  5  N/A N/A N/A N/A

    counter[3]++
    counter[1]++
    counter[1]++
    counter[1]++
    counter[8]++
    counter[8]++
    counter[4]++
    counter[8]++
    counter[5]++
    counter[1]++
    counter[1]++
    …
    counter[1]++

Tracking updated counters
![Page 24: Faceting optimizations for Solr](https://reader036.vdocuments.net/reader036/viewer/2022062412/58aaa71e1a28abfa0e8b5761/html5/thumbnails/24.jpg)
24/55
Tracking updated counters
    counter = pool.getCounter()
    for docID: result.getDocIDs()
      for ordinal: getOrdinals(docID)
        if counter[ordinal]++ == 0 && tracked < maxTracked
          tracker[tracked++] = ordinal
    if tracked < maxTracked
      for i = 0 ; i < tracked ; i++
        priorityQueue.add(tracker[i], counter[tracker[i]])
    else
      for ordinal = 0 ; ordinal < counter.length ; ordinal++
        priorityQueue.add(ordinal, counter[ordinal])

    ord:      0  1  2  3     4  5  6  7  8
    counter:  0  3  0  1006  1  1  0  0  3
    tracker:  3  1  8  4  5  N/A N/A N/A N/A

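The tracking idea can be sketched in Python as follows. This is a variation for illustration, not the slide's exact code: it uses an explicit `overflowed` flag rather than comparing `tracked` with `maxTracked`, and all names are hypothetical.

```python
def count_with_tracker(doc_ids, ordinals_of, counter, max_tracked):
    """Count ordinals and remember which counters went from 0 to 1.

    Returns (pairs, overflowed): pairs holds (ordinal, count) for all
    non-zero counters; overflowed tells whether the tracker ran full.
    """
    tracker = []
    overflowed = False
    for doc_id in doc_ids:
        for ordinal in ordinals_of[doc_id]:
            if counter[ordinal] == 0:           # first time this ordinal is seen
                if len(tracker) < max_tracked:
                    tracker.append(ordinal)
                else:
                    overflowed = True           # tracker is full, must fall back
            counter[ordinal] += 1
    # Visit only the touched counters when the tracker held them all;
    # otherwise fall back to iterating the full counter array
    candidates = range(len(counter)) if overflowed else tracker
    return [(o, counter[o]) for o in candidates if counter[o] > 0], overflowed

ordinals_of = {0: [3], 1: [1], 2: [1], 3: [1], 4: [8]}
pairs, overflowed = count_with_tracker(range(5), ordinals_of, [0] * 9, 4)
print(pairs, overflowed)
```

With a small result set, extraction cost follows the number of touched counters instead of the number of unique terms in the field.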
![Page 25: Faceting optimizations for Solr](https://reader036.vdocuments.net/reader036/viewer/2022062412/58aaa71e1a28abfa0e8b5761/html5/thumbnails/25.jpg)
25/55
Tracking updated counters
![Page 26: Faceting optimizations for Solr](https://reader036.vdocuments.net/reader036/viewer/2022062412/58aaa71e1a28abfa0e8b5761/html5/thumbnails/26.jpg)
26/55
Distributed faceting
Phase 1) All shards perform faceting. The Merger calculates the top-X terms.
Phase 2) The term counts are requested from the shards that did not return them in phase 1. The Merger calculates the final counts for the top-X terms.

    for term: fineCountRequest.getTerms()
      result.add(term, searcher.numDocs(query(field:term), base.getDocIDs()))
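The two phases can be illustrated with a small simulation. This is a simplified sketch, assuming each shard is represented by a plain dict of term counts for the current query; `distributed_facet` is a hypothetical name and the dict lookup in phase 2 stands in for the per-term fine count.

```python
from collections import Counter

def distributed_facet(shards, top_x):
    """shards: one dict per shard, mapping term -> count for the current query."""
    # Phase 1: every shard returns only its local top-X terms
    phase1 = [dict(Counter(shard).most_common(top_x)) for shard in shards]
    merged = Counter()
    for partial in phase1:
        merged.update(partial)
    candidates = [term for term, _ in merged.most_common(top_x)]

    # Phase 2: request counts from the shards that did not report a candidate
    final = {}
    for term in candidates:
        total = 0
        for shard, partial in zip(shards, phase1):
            if term in partial:
                total += partial[term]          # already known from phase 1
            else:
                total += shard.get(term, 0)     # fine count on that shard
        final[term] = total
    return final

shards = [{"a": 10, "b": 9, "c": 1}, {"c": 8, "b": 7, "a": 6}]
print(distributed_facet(shards, 2))
```

In this toy data the second shard does not return `a` in phase 1, so its count for `a` has to be fetched in phase 2 before the merged total is correct.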
![Page 27: Faceting optimizations for Solr](https://reader036.vdocuments.net/reader036/viewer/2022062412/58aaa71e1a28abfa0e8b5761/html5/thumbnails/27.jpg)
27/55
Test setup 2 (more shards, smaller field)
- Solr setup
  - 16 HT-cores, 256GB RAM, SSD
  - 9 shards @ 250M documents / 900GB
- domain field
  - Single String value
  - 1.1M unique terms per shard
- 1 concurrent “user”
- Random search terms
![Page 28: Faceting optimizations for Solr](https://reader036.vdocuments.net/reader036/viewer/2022062412/58aaa71e1a28abfa0e8b5761/html5/thumbnails/28.jpg)
28/55
Pit of Pain™ (or maybe “Horrible Hill”?)
![Page 29: Faceting optimizations for Solr](https://reader036.vdocuments.net/reader036/viewer/2022062412/58aaa71e1a28abfa0e8b5761/html5/thumbnails/29.jpg)
29/55
Fine counting can be slow
Phase 1: Standard faceting
Phase 2:

    for term: fineCountRequest.getTerms()
      result.add(term, searcher.numDocs(query(field:term), base.getDocIDs()))
![Page 30: Faceting optimizations for Solr](https://reader036.vdocuments.net/reader036/viewer/2022062412/58aaa71e1a28abfa0e8b5761/html5/thumbnails/30.jpg)
30/55
Alternative fine counting
    counter = pool.getCounter()
    for docID: result.getDocIDs()
      for ordinal: getOrdinals(docID)
        counter.increment(ordinal)

    for term: fineCountRequest.getTerms()
      result.add(term, counter.get(getOrdinal(term)))

The counting loop is the same as in phase 1, which yields:

    ord:      0  1  2  3     4  5  6  7  8
    counter:  0  3  0  1006  1  1  0  0  3
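As a rough Python illustration of this alternative, the function below repeats the phase 1 counting loop and then answers each requested term with a single array lookup. The names (`fine_count_by_counter`, `ordinal_of_term`) are hypothetical.

```python
def fine_count_by_counter(doc_ids, ordinals_of, ordinal_of_term,
                          num_ordinals, requested_terms):
    """Fine-count the requested terms with one counting pass plus lookups."""
    counter = [0] * num_ordinals
    for doc_id in doc_ids:                      # same loop as in phase 1
        for ordinal in ordinals_of[doc_id]:
            counter[ordinal] += 1
    # One lookup per requested term instead of one search per term
    return {term: counter[ordinal_of_term[term]] for term in requested_terms}

ordinals_of = {0: [1], 1: [3], 2: [3], 3: [1, 3]}   # docID -> ordinals
ordinal_of_term = {"A": 0, "B": 1, "C": 2, "D": 3}
print(fine_count_by_counter([0, 1, 2, 3], ordinals_of, ordinal_of_term,
                            4, ["D", "A"]))
```

The cost of the counting pass is independent of the number of requested terms, which is what makes it attractive when many terms need fine counts.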
![Page 31: Faceting optimizations for Solr](https://reader036.vdocuments.net/reader036/viewer/2022062412/58aaa71e1a28abfa0e8b5761/html5/thumbnails/31.jpg)
31/55
Using cached counters from phase 1 in phase 2
    counter = pool.getCounter(key)

    for term: query.getTerms()
      result.add(term, counter.get(getOrdinal(term)))

    pool.release(counter)
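One way to picture a keyed pool is the sketch below. It is an assumption-laden illustration: `CachingCounterPool`, the cache key layout and the oldest-first eviction are all hypothetical and not taken from the sparse faceting implementation.

```python
class CachingCounterPool:
    """Hypothetical pool that keeps filled counters, keyed by query and field."""

    def __init__(self, max_cached):
        self.max_cached = max_cached
        self.cached = {}                # key -> filled counter, insertion-ordered

    def get_counter(self, key):
        """Return the filled counter for key, or None if it is not cached."""
        return self.cached.pop(key, None)

    def release(self, key, counter):
        self.cached[key] = counter
        while len(self.cached) > self.max_cached:
            oldest = next(iter(self.cached))    # evict the oldest entry
            del self.cached[oldest]

pool = CachingCounterPool(max_cached=2)
key = ("query: solr", "domain")         # hypothetical cache key

# Phase 1: nothing cached yet, so count from scratch and release the result
counter = pool.get_counter(key)
if counter is None:
    counter = [0, 3, 0, 1006, 1, 1, 0, 0, 3]    # stands in for the counting loop
pool.release(key, counter)

# Phase 2: the filled counter is found, so fine counts are plain lookups
cached = pool.get_counter(key)
ordinal_of_term = {"B": 1, "D": 3}
fine_counts = {term: cached[o] for term, o in ordinal_of_term.items()}
pool.release(key, cached)
print(fine_counts)
```

If the counter has been evicted between the phases, the caller falls back to counting again, so the cache only affects speed and not results.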
![Page 32: Faceting optimizations for Solr](https://reader036.vdocuments.net/reader036/viewer/2022062412/58aaa71e1a28abfa0e8b5761/html5/thumbnails/32.jpg)
32/55
Pit of Pain™ practically eliminated
![Page 33: Faceting optimizations for Solr](https://reader036.vdocuments.net/reader036/viewer/2022062412/58aaa71e1a28abfa0e8b5761/html5/thumbnails/33.jpg)
33/55
Pit of Pain™ practically eliminated
Stick figure CC BY-NC 2.5 Randall Munroe xkcd.com
![Page 34: Faceting optimizations for Solr](https://reader036.vdocuments.net/reader036/viewer/2022062412/58aaa71e1a28abfa0e8b5761/html5/thumbnails/34.jpg)
34/55
Test setup 3 (more shards, more fields)
- Solr setup
  - 16 HT-cores, 256GB RAM, SSD
  - 23 shards @ 250M documents / 900GB
- Faceting on 6 fields
  - url: ~200M unique terms / shard
  - domain & host: ~1M unique terms each / shard
  - type, suffix, year: < 1000 unique terms / shard
![Page 35: Faceting optimizations for Solr](https://reader036.vdocuments.net/reader036/viewer/2022062412/58aaa71e1a28abfa0e8b5761/html5/thumbnails/35.jpg)
35/55
1 machine, 7 billion documents / 23TB total index, 6 facet fields
![Page 36: Faceting optimizations for Solr](https://reader036.vdocuments.net/reader036/viewer/2022062412/58aaa71e1a28abfa0e8b5761/html5/thumbnails/36.jpg)
36/55
High-cardinality can mean different things
Single shard / 250,000,000 docs / 900GB
| Field  | References    | Max docs/term | Unique terms |
|--------|---------------|---------------|--------------|
| domain | 250,000,000   | 3,000,000     | 1,100,000    |
| url    | 250,000,000   | 56,000        | 200,000,000  |
| links  | 5,800,000,000 | 5,000,000     | 610,000,000  |

2440 MB / counter
![Page 37: Faceting optimizations for Solr](https://reader036.vdocuments.net/reader036/viewer/2022062412/58aaa71e1a28abfa0e8b5761/html5/thumbnails/37.jpg)
37/55
Remember: 1 machine = 25 shards
25 shards / 7 billion / 23TB
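| Field  | References      | Max docs/term | Unique terms    |
|--------|-----------------|---------------|-----------------|
| domain | 7,000,000,000   | 3,000,000     | ~25,000,000     |
| url    | 7,000,000,000   | 56,000        | ~5,000,000,000  |
| links  | 125,000,000,000 | 5,000,000     | ~15,000,000,000 |
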
60 GB / facet call
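The memory figures follow from simple arithmetic, sketched below under the assumption of one 4-byte int per unique term and decimal megabytes. The per-shard term counts are taken from the single-shard table; the function name is hypothetical.

```python
BYTES_PER_INT = 4

def counter_megabytes(unique_terms):
    """Size of an int[] counter with one entry per unique term, in MB (10^6 bytes)."""
    return unique_terms * BYTES_PER_INT / 1_000_000

unique_terms_per_shard = {"domain": 1_100_000, "url": 200_000_000, "links": 610_000_000}
shards_per_machine = 25

for field, unique_terms in unique_terms_per_shard.items():
    per_counter_mb = counter_megabytes(unique_terms)
    per_call_gb = per_counter_mb * shards_per_machine / 1000
    print(f"{field}: {per_counter_mb:.1f} MB per counter, "
          f"{per_call_gb:.2f} GB across {shards_per_machine} shards")
```

For `links` this gives 2440 MB per counter and roughly 61 GB across 25 shards, which appears to be where the 60 GB per facet call comes from.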
![Page 38: Faceting optimizations for Solr](https://reader036.vdocuments.net/reader036/viewer/2022062412/58aaa71e1a28abfa0e8b5761/html5/thumbnails/38.jpg)
38/55
Different distributions
[Figure: term count distributions for domain (1.1M unique terms), url (200M) and links (600M). Plot annotations: high max, low max, very long tail, short tail.]
![Page 39: Faceting optimizations for Solr](https://reader036.vdocuments.net/reader036/viewer/2022062412/58aaa71e1a28abfa0e8b5761/html5/thumbnails/39.jpg)
39/55
Theoretical lower limit per counter: log2(max_count)
[Figure: example counters with max=1, max=7, max=2047, max=3 and max=63]
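In whole bits, the limit works out to the bit length of the maximum count, that is ceil(log2(max_count + 1)). A small Python check for the maxima shown (the function name is hypothetical):

```python
def bits_required(max_count):
    """Smallest number of bits that can hold every value from 0 to max_count."""
    return max(1, max_count.bit_length())

for max_count in (1, 3, 7, 63, 2047):
    print(f"max={max_count}: {bits_required(max_count)} bits")
```

A counter that never exceeds 7 thus needs 3 bits, compared with the 32 bits an int spends on it.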
![Page 40: Faceting optimizations for Solr](https://reader036.vdocuments.net/reader036/viewer/2022062412/58aaa71e1a28abfa0e8b5761/html5/thumbnails/40.jpg)
40/55
int vs. PackedInts

| Field  | int[ordinals] | PackedInts(ordinals, maxBPV) |
|--------|---------------|------------------------------|
| domain | 4 MB          | 3 MB (72%)                   |
| url    | 780 MB        | 420 MB (53%)                 |
| links  | 2350 MB       | 1760 MB (75%)                |
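The idea behind packed counters is that every counter gets the same fixed number of bits, chosen from the largest count in the field, and the values are stored back to back in 64-bit words. The sketch below illustrates that layout in Python; it is not Lucene's PackedInts API, and `PackedCounters` is a hypothetical class (a Python list of ints is used for clarity, so it does not actually save memory).

```python
class PackedCounters:
    """Fixed-width counters stored back to back in 64-bit words (layout sketch)."""

    WORD_MASK = (1 << 64) - 1

    def __init__(self, size, bits_per_value):
        assert 1 <= bits_per_value <= 64
        self.bpv = bits_per_value
        self.mask = (1 << bits_per_value) - 1
        # One spare word so a value may straddle the last word boundary
        self.words = [0] * ((size * bits_per_value + 63) // 64 + 1)

    def _locate(self, index):
        return divmod(index * self.bpv, 64)     # (word index, bit offset)

    def get(self, index):
        word, shift = self._locate(index)
        combined = self.words[word] | (self.words[word + 1] << 64)
        return (combined >> shift) & self.mask

    def set(self, index, value):
        word, shift = self._locate(index)
        combined = self.words[word] | (self.words[word + 1] << 64)
        combined = (combined & ~(self.mask << shift)) | ((value & self.mask) << shift)
        self.words[word] = combined & self.WORD_MASK
        self.words[word + 1] = combined >> 64

    def increment(self, index):
        self.set(index, self.get(index) + 1)

counters = PackedCounters(size=10, bits_per_value=23)
counters.increment(3)
counters.increment(3)
print(counters.get(3))
```

With 23 bits per value, each counter takes 23/32 of the space of an int, at the price of shifting and masking on every access.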
![Page 41: Faceting optimizations for Solr](https://reader036.vdocuments.net/reader036/viewer/2022062412/58aaa71e1a28abfa0e8b5761/html5/thumbnails/41.jpg)
41/55
n-plane-z counters
Platonic ideal Harsh reality
Plane d
Plane c
Plane b
Plane a
![Page 42: Faceting optimizations for Solr](https://reader036.vdocuments.net/reader036/viewer/2022062412/58aaa71e1a28abfa0e8b5761/html5/thumbnails/42.jpg)
42/55
Plane d
Plane c
Plane b
Plane a
L: 0 ≣ 000000
![Page 43: Faceting optimizations for Solr](https://reader036.vdocuments.net/reader036/viewer/2022062412/58aaa71e1a28abfa0e8b5761/html5/thumbnails/43.jpg)
43/55
Plane d
Plane c
Plane b
Plane a
    L: 0 ≣ 000000
    L: 1 ≣ 000001
![Page 44: Faceting optimizations for Solr](https://reader036.vdocuments.net/reader036/viewer/2022062412/58aaa71e1a28abfa0e8b5761/html5/thumbnails/44.jpg)
44/55
Plane d
Plane c
Plane b
Plane a
    L: 0 ≣ 000000
    L: 1 ≣ 000001
    L: 2 ≣ 000011
![Page 45: Faceting optimizations for Solr](https://reader036.vdocuments.net/reader036/viewer/2022062412/58aaa71e1a28abfa0e8b5761/html5/thumbnails/45.jpg)
45/55
Plane d
Plane c
Plane b
Plane a
    L: 0 ≣ 000000
    L: 1 ≣ 000001
    L: 2 ≣ 000011
    L: 3 ≣ 000101
![Page 46: Faceting optimizations for Solr](https://reader036.vdocuments.net/reader036/viewer/2022062412/58aaa71e1a28abfa0e8b5761/html5/thumbnails/46.jpg)
46/55
Plane d
Plane c
Plane b
Plane a
    L: 0 ≣ 000000
    L: 1 ≣ 000001
    L: 2 ≣ 000011
    L: 3 ≣ 000101
    L: 4 ≣ 000111
    L: 5 ≣ 001001
    L: 6 ≣ 001011
    L: 7 ≣ 001101
    ...
    L: 12 ≣ 010111
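The listed bit patterns are consistent with a simple rule: 0 maps to all zeroes, and any other value v maps to v - 1 shifted left by one bit with the lowest bit set. This is only an observation about the values shown, written down as an assumption; it is not stated on the slides and says nothing about how the bits are distributed over planes a to d. The function name is hypothetical.

```python
def slide_pattern(value, width=6):
    """Bit pattern listed on the slide for a counter value (assumed rule)."""
    encoded = 0 if value == 0 else ((value - 1) << 1) | 1
    return format(encoded, f"0{width}b")

for value in (0, 1, 2, 3, 4, 5, 6, 7, 12):
    print(f"L: {value} ≣ {slide_pattern(value)}")
```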
![Page 47: Faceting optimizations for Solr](https://reader036.vdocuments.net/reader036/viewer/2022062412/58aaa71e1a28abfa0e8b5761/html5/thumbnails/47.jpg)
47/55
Comparison of counter structures

| Field  | int[ordinals] | PackedInts(ordinals, maxBPV) | n-plane-z    |
|--------|---------------|------------------------------|--------------|
| domain | 4 MB          | 3 MB (72%)                   | 1 MB (30%)   |
| url    | 780 MB        | 420 MB (53%)                 | 66 MB (8%)   |
| links  | 2350 MB       | 1760 MB (75%)                | 311 MB (13%) |
![Page 48: Faceting optimizations for Solr](https://reader036.vdocuments.net/reader036/viewer/2022062412/58aaa71e1a28abfa0e8b5761/html5/thumbnails/48.jpg)
48/55
Speed comparison
![Page 49: Faceting optimizations for Solr](https://reader036.vdocuments.net/reader036/viewer/2022062412/58aaa71e1a28abfa0e8b5761/html5/thumbnails/49.jpg)
49/55
I could go on about
- Threaded counting
- Heuristic faceting
- Fine count skipping
- Counter capping
- Monotonically increasing tracker for n-plane-z
- Regexp filtering
![Page 50: Faceting optimizations for Solr](https://reader036.vdocuments.net/reader036/viewer/2022062412/58aaa71e1a28abfa0e8b5761/html5/thumbnails/50.jpg)
50/55
What about huge result sets?
- Rare for explorative term-based searches
- Common for batch extractions
- Threading works poorly as #shards > #CPUs
- But how bad is it really?
![Page 51: Faceting optimizations for Solr](https://reader036.vdocuments.net/reader036/viewer/2022062412/58aaa71e1a28abfa0e8b5761/html5/thumbnails/51.jpg)
51/55
Really bad! 8 minutes
![Page 52: Faceting optimizations for Solr](https://reader036.vdocuments.net/reader036/viewer/2022062412/58aaa71e1a28abfa0e8b5761/html5/thumbnails/52.jpg)
52/55
Heuristic faceting
- Use sampling to guess top-X terms
- Re-use the existing tracked counters
- 1:1000 sampling seems usable for the field links, which has 5 billion references per shard
- Fine-count the guessed terms
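The steps above can be sketched as follows. This is a toy illustration with hypothetical names: `sample_step` plays the role of the 1:1000 sampling ratio, `over_provision` makes the guess larger than top-X, and the fine count is done with a full pass here for simplicity, where a real implementation would use a cheaper per-term count.

```python
import heapq

def heuristic_top_terms(doc_ids, ordinals_of, num_ordinals,
                        top_x, sample_step, over_provision):
    # 1) Count only every sample_step'th document of the result set
    sampled = [0] * num_ordinals
    for doc_id in doc_ids[::sample_step]:
        for ordinal in ordinals_of[doc_id]:
            sampled[ordinal] += 1

    # 2) Guess more candidates than needed (over-provisioning)
    seen = (o for o in range(num_ordinals) if sampled[o] > 0)
    guessed = heapq.nlargest(top_x * over_provision, seen,
                             key=lambda o: sampled[o])

    # 3) Fine-count only the guessed ordinals to get exact counts
    exact = dict.fromkeys(guessed, 0)
    for doc_id in doc_ids:
        for ordinal in ordinals_of[doc_id]:
            if ordinal in exact:
                exact[ordinal] += 1
    return heapq.nlargest(top_x, exact.items(), key=lambda item: item[1])

doc_ids = list(range(100))
# Even documents reference ordinal 0, odd documents one of ordinals 1 to 3
ordinals_of = {d: [0] if d % 2 == 0 else [1 + d % 3] for d in doc_ids}
print(heuristic_top_terms(doc_ids, ordinals_of, 4,
                          top_x=1, sample_step=10, over_provision=2))
```

The returned counts are exact, but the choice of terms is a guess: a term missing from the sample can never be returned, which is why over-provisioning the guess helps validity.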
![Page 53: Faceting optimizations for Solr](https://reader036.vdocuments.net/reader036/viewer/2022062412/58aaa71e1a28abfa0e8b5761/html5/thumbnails/53.jpg)
53/55
Over provisioning helps validity
![Page 54: Faceting optimizations for Solr](https://reader036.vdocuments.net/reader036/viewer/2022062412/58aaa71e1a28abfa0e8b5761/html5/thumbnails/54.jpg)
54/55
10 seconds < 8 minutes
![Page 55: Faceting optimizations for Solr](https://reader036.vdocuments.net/reader036/viewer/2022062412/58aaa71e1a28abfa0e8b5761/html5/thumbnails/55.jpg)
55/55
Never enough time, but talk to me about
- Threaded counting
- Monotonically increasing tracker for n-plane-z
- Regexp filtering
- Fine count skipping
- Counter capping
![Page 56: Faceting optimizations for Solr](https://reader036.vdocuments.net/reader036/viewer/2022062412/58aaa71e1a28abfa0e8b5761/html5/thumbnails/56.jpg)
56/55
Extra info
The techniques presented can be tested with sparse faceting, available as a plug-in replacement WAR for Solr 4.10 at https://tokee.github.io/lucene-solr/. A version for Solr 5 will eventually be implemented, but the timeframe is unknown.
There are currently no plans for incorporating the full feature set in the official Solr distribution. The suggested approach for incorporation is to split it into multiple independent or semi-independent features, starting with those applicable to most people, such as the distributed faceting fine count optimization.
In-depth descriptions and performance tests of the different features can be found at https://sbdevel.wordpress.com.
![Page 57: Faceting optimizations for Solr](https://reader036.vdocuments.net/reader036/viewer/2022062412/58aaa71e1a28abfa0e8b5761/html5/thumbnails/57.jpg)
57/55
18M documents / 50GB, facet on 5 fields (2*10M values, 3*smaller)
![Page 58: Faceting optimizations for Solr](https://reader036.vdocuments.net/reader036/viewer/2022062412/58aaa71e1a28abfa0e8b5761/html5/thumbnails/58.jpg)
58/55
6 billion docs / 20TB, 25 shards, single machine
facet on 6 fields (1*4000M, 2*20M, 3*smaller)
![Page 59: Faceting optimizations for Solr](https://reader036.vdocuments.net/reader036/viewer/2022062412/58aaa71e1a28abfa0e8b5761/html5/thumbnails/59.jpg)
59/55
7 billion docs / 23TB, 25 shards, single machine
facet on 5 fields (2*20M, 3*smaller)