TRANSCRIPT
Populating your Search Index
NEST Meetup, 2016-01
5 Presentations
- Indexing Considerations, Pipelines, and Apache NiFi
- A Proposal for a Document Pipeline
- How we do it at TIAA-CREF with Solr
- How we do it at DRG with Solr
- Logstash and Beats with ElasticSearch
Indexing Considerations
Considerations to think about when building out a search platform
What do I mean?
- How do you plan to get data into the index (Solr/ES/…)?
- Backups?
- Schedule & monitor?
- Realtime search requirements?
- What software? (pipelines, crawlers, …)
Crawling?
- Common in the “enterprise search” space
- What crawler will you use?
  - Nutch is well-known but too complex for smaller-scale jobs
  - Many more exist
- Security access control metadata to federate? Try ManifoldCF, which excels at this.
Bulk indexing
- Plan for a “bulk reindex” use case
  - When changing schemas or ingestion/extraction rules
  - Or recovering when there’s no backup
    - Not having a backup is typical, especially if re-indexing is fast
- Optimize settings so bulk indexing is fast
  - May need to toggle back to “normal” settings after ingestion
- Use multiple machines during indexing (e.g. via Hadoop)?
- “Optimize” (merge) Lucene segments at the end?
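For Elasticsearch, for example, the settings toggle often comes down to disabling refresh and replica copies during the bulk load and restoring them afterwards. A minimal sketch of building those requests (the index name and exact values are illustrative; Solr has analogous knobs):

```python
import json

# Settings applied before a bulk load: disable refresh and replicas
# so the cluster spends its effort on raw indexing throughput.
BULK_SETTINGS = {"index": {"refresh_interval": "-1",
                           "number_of_replicas": 0}}

# "Normal" settings restored once ingestion finishes.
NORMAL_SETTINGS = {"index": {"refresh_interval": "1s",
                             "number_of_replicas": 1}}

def settings_request(index_name, settings):
    """Build the (method, path, body) for an index-settings update call."""
    return ("PUT", "/%s/_settings" % index_name, json.dumps(settings))

method, path, body = settings_request("docs", BULK_SETTINGS)
```

The same pair of calls brackets the bulk job: apply `BULK_SETTINGS`, ingest, then apply `NORMAL_SETTINGS` (and optionally force-merge segments).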
Incremental indexing (adding new/updated content)
- How do you detect deletes?
  - A: Flag for removal upstream before eventually removing
  - B: Track all IDs somewhere; find the ones that went missing
- Maybe deletes don’t need to be synchronized until off-hours?
- Realtime indexing: handle it separately?
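Option B above amounts to a set difference between the IDs currently in the source system and the IDs the index knows about (function and variable names are illustrative):

```python
def find_deletes(source_ids, indexed_ids):
    """IDs present in the index but gone from the source: candidates
    for deletion on the next (possibly off-hours) sync."""
    return set(indexed_ids) - set(source_ids)

# Example: doc "b" disappeared from the source since the last sync.
stale = find_deletes(source_ids={"a", "c"}, indexed_ids={"a", "b", "c"})
# stale == {"b"}
```

At scale the ID lists would come from a key/value store or a fields-only query against the index, but the core logic is this comparison.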
Backups (DR: Disaster Recovery)
- Scenario: an admin accidentally deleted 30k random docs; oh %#?!
  - Not solved by replication/redundancy
- Backups are useful in other scenarios too, like testing
- Might not need them, especially if bulk re-indexing is fast
- Take snapshots (e.g. via AWS, or via the search system, or …)
- Recovery: deploy the snapshot, then sync it back up to date
  - Solr: see BloomReach’s “HAFT” project
Document Transformations
Mapping source data (e.g. an HTML doc or database record) to a search document
Examples:
- Text extraction from PDFs
- Enrichment (e.g. Named Entity Recognition)
- Text pre-processing before the search platform gets it
- Merging multiple data sources; joining
Home-grown, or use an existing ETL / “pipeline” tool? Do some of this directly on the search platform?
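A trivial transformation of the HTML-doc-to-search-document kind can be sketched with the standard library alone (the field names `id`, `title`, `body` are illustrative, not a required schema):

```python
from html.parser import HTMLParser

class DocExtractor(HTMLParser):
    """Pull the <title> and visible body text out of an HTML source doc."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data
        elif data.strip():
            self.text_parts.append(data.strip())

def to_search_doc(doc_id, html):
    """Map raw HTML to a flat search document."""
    parser = DocExtractor()
    parser.feed(html)
    return {"id": doc_id, "title": parser.title,
            "body": " ".join(parser.text_parts)}

doc = to_search_doc(
    "42", "<html><title>Hi</title><body><p>Hello world</p></body></html>")
# doc == {"id": "42", "title": "Hi", "body": "Hello world"}
```

Real pipelines layer the other steps (PDF extraction, NER enrichment, joins) around this same source-record-in, search-document-out shape.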
Schedule, Monitor
- How will a bulk index be triggered? An incremental index?
- Unix cron? Basic but crude.
- A web UI to control this is great
- A CI server (e.g. Jenkins) can work! (web UI, logs, alerting)
- Monitor/alert for problems?
  - Perhaps via general log monitoring (e.g. ELK)
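A cron- or CI-triggered job needs at least an exit-code check and a log line a log monitor can alert on. A minimal wrapper sketch (the command and log format are illustrative):

```python
import logging
import subprocess
import sys

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")

def run_indexing_job(cmd):
    """Run an indexing command; log success or failure so a log monitor
    (e.g. ELK) can alert on the failure line."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode == 0:
        logging.info("indexing job succeeded: %s", cmd)
        return True
    logging.error("indexing job FAILED (rc=%d): %s\n%s",
                  result.returncode, cmd, result.stderr)
    return False

# Example: a trivially-successful "job".
ok = run_indexing_job([sys.executable, "-c", "print('indexed')"])
```

The same wrapper works whether cron, a web UI, or Jenkins does the triggering; only the alerting path differs.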
Open-Source ETL Software
A summary of an investigation I did on open-source options in 2013.
ETL Software
- Extract, Transform, Load – a general idea
- Software that calls itself ETL tends to be very similar:
  - Clover ETL
  - Pentaho Data Integration, AKA Kettle
  - Talend Open Studio for Data Integration
- Common features:
  - Two are GPL/LGPL; Talend is Apache-licensed
  - Freemium model – pay for “enterprise” features
- The Good (in a word: mature):
  - GUI wire-diagram builder
  - Books / resources
- The Bad:
  - Text-editing the pipeline is not recommended, so you need the GUI
  - Poor community
  - Data model is table-like; no native multi-valued fields
Talend screenshot
Apache NiFi
“is an easy to use, powerful, and reliable system to process and distribute data.”
Apache NiFi overview
- Web-based UI
- Runtime modification of the flow
- Data provenance features
- Extensible (of course)
- Security; role-based access control