Populating your Search Index NEST Meetup, 2016-01

Upload: david-smiley

Post on 07-Jan-2017

TRANSCRIPT

Page 1: Populate your Search index, NEST 2016-01

Populating your Search Index
NEST Meetup, 2016-01

Page 2: Populate your Search index, NEST 2016-01

5 Presentations
  Indexing Considerations, Pipelines, and Apache NiFi
  A Proposal for a Document Pipeline
  How we do it at TIAA-CREF with Solr
  How we do it at DRG with Solr
  Logstash and Beats with ElasticSearch

Page 3: Populate your Search index, NEST 2016-01

Indexing Considerations
Indexing considerations to think about when building out a search platform

Page 4: Populate your Search index, NEST 2016-01

What do I mean?
  How do you plan to get data into the index (Solr/ES/…)?
  Backups?
  Schedule & Monitor?
  Realtime search requirements?
  What software? (pipelines, crawlers, …)

Page 5: Populate your Search index, NEST 2016-01

Crawling?
  Common in the “enterprise search” space
  What crawler will you use?
    Nutch is well-known but too complex for smaller scale jobs
    Many more exist.
  Security access control metadata to federate?
    Try ManifoldCF which excels at this.

Page 6: Populate your Search index, NEST 2016-01

Bulk indexing
  Plan for a “bulk reindex” use-case
    When changing schemas / ingestion extraction rules
    Or recovering when there’s no backup
      Not having a backup is typical; esp. if re-indexing is fast
  Optimize settings for this to be fast
    May need to toggle after ingestion into “normal” settings
  Use multiple machines during indexing (e.g. via Hadoop)?
  “Optimize” (merge) Lucene segments at the end?
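The "toggle settings for bulk, then back to normal" idea can be sketched as follows. This is a minimal illustration that assumes Elasticsearch, where `index.refresh_interval` and `index.number_of_replicas` are commonly relaxed during a bulk load; Solr has different knobs. The index name and the "normal" values are hypothetical, and the function only builds the request rather than sending it.

```python
import json

# Settings often relaxed while bulk loading (Elasticsearch index settings).
BULK_SETTINGS = {"index": {"refresh_interval": "-1", "number_of_replicas": 0}}

# Hypothetical "normal" values; choose what your deployment actually needs.
NORMAL_SETTINGS = {"index": {"refresh_interval": "1s", "number_of_replicas": 1}}


def settings_request(index_name, bulk_mode):
    """Return (method, path, body) for an index settings update request."""
    settings = BULK_SETTINGS if bulk_mode else NORMAL_SETTINGS
    return "PUT", "/" + index_name + "/_settings", json.dumps(settings)
```

A job would issue the bulk-mode request before loading and the normal-mode request afterwards, ideally in a `finally` block so a failed load does not leave the index in bulk mode.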

Page 7: Populate your Search index, NEST 2016-01

Incremental indexing (adding new/updated content)
  Detect deletes how?
    A: Flag for removal upstream before eventually removing
    B: Track all IDs somewhere; find the ones that went missing
    Maybe don’t need to synchronize deletes until off-hours?
  Realtime Indexing, separate?
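Option B for delete detection amounts to a set difference between the IDs that were indexed and the IDs that still exist upstream. A minimal sketch, assuming both ID lists fit in memory; the function name and the shape of the returned plan are illustrative, not taken from the talk.

```python
def plan_incremental_sync(previously_indexed_ids, current_source_ids):
    """Compare tracked IDs with the current source to find adds and deletes."""
    previous = set(previously_indexed_ids)
    current = set(current_source_ids)
    return {
        # Indexed earlier but gone from the source: remove from the index.
        "delete": sorted(previous - current),
        # Present in the source but never indexed: add to the index.
        "add": sorted(current - previous),
    }
```

For example, `plan_incremental_sync(["a", "b", "c"], ["b", "c", "d"])` marks `a` for deletion and `d` for addition. Detecting updated documents needs something extra, such as a modification timestamp or content hash.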

Page 8: Populate your Search index, NEST 2016-01

Backups (DR: Disaster Recovery)
  Scenario: Admin accidentally deleted 30k random docs; oh %#?!
    Not solved by replication/redundancy
  Useful in other scenarios, like testing
  Might not need it; especially if bulk re-indexing is fast
  Take Snapshots (e.g. AWS, or via the search system, or…)
  Recovery: Deploy snapshot then sync it back up to date.
  Solr: see BloomReach’s “HAFT” project
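The "sync it back up to date" step after restoring a snapshot could look roughly like this. It is a sketch under the assumption that every source record carries a `modified` timestamp (a hypothetical field name); the talk does not prescribe a mechanism.

```python
from datetime import datetime


def records_to_replay(source_records, snapshot_time):
    """Return records changed after the snapshot was taken, oldest first."""
    changed = [r for r in source_records if r["modified"] > snapshot_time]
    return sorted(changed, key=lambda r: r["modified"])
```

Records deleted upstream after the snapshot will not show up here, so delete detection still has to run separately.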

Page 9: Populate your Search index, NEST 2016-01

Document Transformations
Mapping source data (e.g. HTML doc or database record) to a search document
  Examples:
    Text from PDF extraction
    Enrichment (e.g. Named Entity Recognition)
    Text pre-processing before search platform gets it
    Merging multiple data sources; joining
  Home-grown or use an existing ETL / “pipeline”?
  Do some of this directly on the search platform?
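A home-grown transformation pipeline can be as small as a chain of functions, each taking a record and returning an enriched copy. The sketch below is illustrative only: the field names (`body_html`, `tags`, `text`) are hypothetical, and the regex-based tag stripping stands in for a real HTML parser or extraction library.

```python
import re


def strip_html(record):
    """Crude tag removal; a real pipeline would use a proper HTML parser."""
    text = re.sub(r"<[^>]+>", " ", record["body_html"])
    return dict(record, body_text=" ".join(text.split()))


def split_tags(record):
    """Turn a comma-separated column into a multi-valued field."""
    tags = [t.strip() for t in record.get("tags", "").split(",") if t.strip()]
    return dict(record, tags=tags)


def to_search_document(record, steps=(strip_html, split_tags)):
    """Run each transformation step in order, then pick the indexed fields."""
    for step in steps:
        record = step(record)
    return {
        "id": str(record["id"]),
        "text": record["body_text"],
        "tags": record["tags"],
    }
```

Keeping each step a plain function makes it easy to add enrichment (for instance entity recognition) as one more entry in `steps`.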

Page 10: Populate your Search index, NEST 2016-01

Schedule, Monitor
  How will a bulk index be triggered? Incremental index?
    Unix Cron? Basic but crude.
    A Web UI to control this is great.
    A CI server (e.g. Jenkins) can work! (web, logs, alerting)
  Monitor/alert for problems?
    Perhaps via general log monitoring (e.g. ELK)

Page 11: Populate your Search index, NEST 2016-01

Open-Source ETL Software
A summary of an investigation I did on open-source options in 2013.

Page 12: Populate your Search index, NEST 2016-01

ETL Software
  Extract Transform Load – a general idea
  Software that calls itself ETL tends to be very similar.
    Clover ETL
    Pentaho Data Integration, AKA Kettle
    Talend Open Studio, Data Integration

Page 13: Populate your Search index, NEST 2016-01

Common features
  Two are GPL/LGPL, Talend is Apache
  Freemium model – pay for “enterprise” features
  The Good: (in a word, mature)
    GUI wire diagram builder
    Books / resources
  The Bad:
    Text-editing the pipeline not recommended: thus need GUI
    Poor community
    Data model is table-like; no native multi-valued fields

Page 14: Populate your Search index, NEST 2016-01

Talend screenshot

Page 15: Populate your Search index, NEST 2016-01

Apache NiFi
“is an easy to use, powerful, and reliable system to process and distribute data.”

Page 16: Populate your Search index, NEST 2016-01
Page 17: Populate your Search index, NEST 2016-01

Apache NiFi overview
  Web-based UI
  Runtime modification of flow control
  Data provenance features
  Extensible (of course)
  Security, role based access control