TRANSCRIPT
Populating your Search Index
NEST Meetup, 2016-01
5 Presentations
- Indexing Considerations, Pipelines, and Apache NiFi
- A Proposal for a Document Pipeline
- How we do it at TIAA-CREF with Solr
- How we do it at DRG with Solr
- Logstash and Beats with ElasticSearch
Indexing Considerations
Considerations to think about when building out a search platform
What do I mean?
- How do you plan to get data into the index (Solr/ES/…)?
- Backups?
- Schedule & monitor?
- Realtime search requirements?
- What software? (pipelines, crawlers, …)
Crawling?
- Common in the “enterprise search” space
- What crawler will you use?
  - Nutch is well-known but too complex for smaller-scale jobs
  - Many more exist
- Security access control metadata to federate? Try ManifoldCF, which excels at this.
Bulk indexing
- Plan for a “bulk reindex” use case
  - When changing schemas or ingestion/extraction rules
  - Or recovering when there’s no backup
    - Not having a backup is typical, especially if re-indexing is fast
- Optimize settings so bulk indexing is fast
  - May need to toggle back to “normal” settings after ingestion
- Use multiple machines during indexing (e.g. via Hadoop)?
- “Optimize” (merge) Lucene segments at the end?
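For Elasticsearch, for example, the settings toggle often comes down to disabling refresh and replica copies during the bulk load and restoring them afterwards. A minimal sketch of building those requests (the index name and exact values are illustrative; Solr has analogous knobs):

```python
import json

# Settings applied before a bulk load: disable refresh and replicas
# so the cluster spends its effort on raw indexing throughput.
BULK_SETTINGS = {"index": {"refresh_interval": "-1",
                           "number_of_replicas": 0}}

# "Normal" settings restored once ingestion finishes.
NORMAL_SETTINGS = {"index": {"refresh_interval": "1s",
                             "number_of_replicas": 1}}

def settings_request(index_name, settings):
    """Build the (method, path, body) for an index-settings update call."""
    return ("PUT", "/%s/_settings" % index_name, json.dumps(settings))

method, path, body = settings_request("docs", BULK_SETTINGS)
```

The same pair of calls brackets the bulk job: apply `BULK_SETTINGS`, ingest, then apply `NORMAL_SETTINGS` (and optionally force-merge segments).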
Incremental indexing (adding new/updated content)
- How do you detect deletes?
  - A: Flag for removal upstream before eventually removing
  - B: Track all IDs somewhere; find the ones that went missing
- Maybe deletes don’t need to be synchronized until off-hours?
- Realtime indexing: handle it separately?
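Option B above amounts to a set difference between the IDs currently in the source system and the IDs the index knows about (function and variable names are illustrative):

```python
def find_deletes(source_ids, indexed_ids):
    """IDs present in the index but gone from the source: candidates
    for deletion on the next (possibly off-hours) sync."""
    return set(indexed_ids) - set(source_ids)

# Example: doc "b" disappeared from the source since the last sync.
stale = find_deletes(source_ids={"a", "c"}, indexed_ids={"a", "b", "c"})
# stale == {"b"}
```

At scale the ID lists would come from a key/value store or a fields-only query against the index, but the core logic is this comparison.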
Backups (DR: Disaster Recovery)
- Scenario: an admin accidentally deleted 30k random docs; oh %#?!
  - Not solved by replication/redundancy
- Backups are useful in other scenarios too, like testing
- Might not need them, especially if bulk re-indexing is fast
- Take snapshots (e.g. via AWS, or via the search system, or …)
- Recovery: deploy the snapshot, then sync it back up to date
  - Solr: see BloomReach’s “HAFT” project
Document Transformations
Mapping source data (e.g. an HTML doc or database record) to a search document
Examples:
- Text extraction from PDFs
- Enrichment (e.g. Named Entity Recognition)
- Text pre-processing before the search platform gets it
- Merging multiple data sources; joining
Home-grown, or use an existing ETL / “pipeline” tool? Do some of this directly on the search platform?
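A trivial transformation of the HTML-doc-to-search-document kind can be sketched with the standard library alone (the field names `id`, `title`, `body` are illustrative, not a required schema):

```python
from html.parser import HTMLParser

class DocExtractor(HTMLParser):
    """Pull the <title> and visible body text out of an HTML source doc."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data
        elif data.strip():
            self.text_parts.append(data.strip())

def to_search_doc(doc_id, html):
    """Map raw HTML to a flat search document."""
    parser = DocExtractor()
    parser.feed(html)
    return {"id": doc_id, "title": parser.title,
            "body": " ".join(parser.text_parts)}

doc = to_search_doc(
    "42", "<html><title>Hi</title><body><p>Hello world</p></body></html>")
# doc == {"id": "42", "title": "Hi", "body": "Hello world"}
```

Real pipelines layer the other steps (PDF extraction, NER enrichment, joins) around this same source-record-in, search-document-out shape.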
Schedule, Monitor
- How will a bulk index be triggered? An incremental index?
- Unix cron? Basic but crude.
- A web UI to control this is great
- A CI server (e.g. Jenkins) can work! (web UI, logs, alerting)
- Monitor/alert for problems?
  - Perhaps via general log monitoring (e.g. ELK)
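A cron- or CI-triggered job needs at least an exit-code check and a log line a log monitor can alert on. A minimal wrapper sketch (the command and log format are illustrative):

```python
import logging
import subprocess
import sys

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")

def run_indexing_job(cmd):
    """Run an indexing command; log success or failure so a log monitor
    (e.g. ELK) can alert on the failure line."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode == 0:
        logging.info("indexing job succeeded: %s", cmd)
        return True
    logging.error("indexing job FAILED (rc=%d): %s\n%s",
                  result.returncode, cmd, result.stderr)
    return False

# Example: a trivially-successful "job".
ok = run_indexing_job([sys.executable, "-c", "print('indexed')"])
```

The same wrapper works whether cron, a web UI, or Jenkins does the triggering; only the alerting path differs.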
Open-Source ETL Software
A summary of an investigation I did on open-source options in 2013.
ETL Software
- Extract, Transform, Load – a general idea
- Software that calls itself ETL tends to be very similar:
  - Clover ETL
  - Pentaho Data Integration, AKA Kettle
  - Talend Open Studio for Data Integration
- Common features:
  - Two are GPL/LGPL; Talend is Apache-licensed
  - Freemium model – pay for “enterprise” features
- The Good (in a word: mature):
  - GUI wire-diagram builder
  - Books / resources
- The Bad:
  - Text-editing the pipeline is not recommended, so you need the GUI
  - Poor community
  - Data model is table-like; no native multi-valued fields
Talend screenshot
Apache NiFi
“is an easy to use, powerful, and reliable system to process and distribute data.”
Apache NiFi overview
- Web-based UI
- Runtime modification of the flow
- Data provenance features
- Extensible (of course)
- Security; role-based access control