adaptto() 2014 - integrating open source search with cq/aem

26
APACHE SLING & FRIENDS TECH MEETUP BERLIN, 22-24 SEPTEMBER 2014 Integrating Open Source Search with AEM Gaston Gonzalez

Upload: therealgaston

Post on 05-Dec-2014

396 views

Category:

Technology


1 download

DESCRIPTION

A presentation by Gaston Gonzalez at adaptTo() 2014 describing several approaches for integrating Apache Solr with AEM. It starts with an introduction to various pull and push indexing strategies (e.g., Sling Eventing, content publishing and web crawling). The topic of content ingestion is followed by an approach for delivering rapid search front-end experiences using AEM Solr Search.

TRANSCRIPT

Page 1: adaptTo() 2014 - Integrating Open Source Search with CQ/AEM

APACHE SLING & FRIENDS TECH MEETUPBERLIN, 22-24 SEPTEMBER 2014

Integrating Open Source Search with AEMGaston Gonzalez

Page 2: adaptTo() 2014 - Integrating Open Source Search with CQ/AEM

Agenda

adaptTo() 2014 2

Apache Solr Data Ingestion

Drivers Implementation Patterns

Searching Data

Page 3: adaptTo() 2014 - Integrating Open Source Search with CQ/AEM

adaptTo() 2014 3

Data Ingestion: Drivers

Page 4: adaptTo() 2014 - Integrating Open Source Search with CQ/AEM

Choosing a Data Ingestion Approach

adaptTo() 2014 4

Where is your content stored? AEM/CQ External databases and data stores External feeds (RSS)

How many documents? Small – Tens of Thousands Medium – Hundreds of Thousands Large – Millions+

Page 5: adaptTo() 2014 - Integrating Open Source Search with CQ/AEM

Choosing a Data Ingestion Approach (2/3)

adaptTo() 2014 5

Index latency Near real-time indexing (low latency) Hourly, daily, weekly (high latency)

Do you need document enrichment? Add additional metadata Sanitize and transform data Merge/join documents/records

Page 6: adaptTo() 2014 - Integrating Open Source Search with CQ/AEM

Choosing a Data Ingestion Approach (3/3)

adaptTo() 2014 6

Implementation Complexity Implementation effort Time to market

Supporting infrastructure Crawlers Document Processing Platform Messaging Queues

Page 7: adaptTo() 2014 - Integrating Open Source Search with CQ/AEM

7adaptTo() 2014

Data Ingestion: Implementation Patterns

Page 8: adaptTo() 2014 - Integrating Open Source Search with CQ/AEM

Solr Update Request Handler Pattern

adaptTo() 2014 8

Method: Pull Document Enrichment: Simple

Content Location: CQ, CQ + External Implementation Complexity: Low

Document Corpus: Small Infrastructure: None

Index Latency: High

Recipe1. Implement servlet to generate same output as a Solr’s update handler (i.e., JSON).2. Create script to invoke servlet (curl) and save response to file system.3. Update index by posting (curl) documents to Solr request handler (i.e.,

JsonUpdateRequestHandler).4. Call script from scheduler (cron).

Page 9: adaptTo() 2014 - Integrating Open Source Search with CQ/AEM

Solr Update Request Handler Pattern (2/3)

adaptTo() 2014 9

Page 10: adaptTo() 2014 - Integrating Open Source Search with CQ/AEM

Solr Update Request Handler Pattern (3/3)

adaptTo() 2014 10

# Request from CQ a dump of the content in the Solr JSON update handler formatcurl -s -u ${CQ_USER}:${CQ_PASS} -o ${SAVE_FILE} http://localhost:4502/apps/geometrixx-media/solr/updatehandler?type=${SLING_RESOURCE_TYPE}

# Perform delete by querycurl http://${SOLR_HOST}:${SOLR_PORT}/solr/${SOLR_CORE}/update?commit=true -H "Content-Type: application/json" --data-binary '{"delete": { "query":"*:*" }}'

# Post the local JSON file that was saved in step 1 to Solr and commitcurl http://${SOLR_HOST}:${SOLR_PORT}/solr/${SOLR_CORE}/update?commit=true -H "Content-Type: application/json" --data-binary @${SAVE_FILE}

Page 11: adaptTo() 2014 - Integrating Open Source Search with CQ/AEM

HTML Selector + Crawler Pattern

adaptTo() 2014 11

Method: Pull Document Enrichment: Simple

Content Location: CQ, CQ + External Implementation Complexity: Low+

Document Corpus: Small, Medium Infrastructure: Crawler

Index Latency: Medium, High

Recipe1. Implement selector to generate simplified HTML response.2. Implement dynamic index page or selector for crawler use.3. Configure crawler (Nutch) seed URL to use the dynamic index pages and crawl to

depth = 1.4. Schedule crawler.

Page 12: adaptTo() 2014 - Integrating Open Source Search with CQ/AEM

HTML Selector + Crawler Pattern (2/2)

adaptTo() 2014 12

/content/geometrixx-media/en/events/big-sur-beach-party.crawlerpage.html

<html> <head> <title>Big Sur Beach Party</title> </head> <body> <h1>Big Sur Beach Party</h1> <p>Some content here</p> </body></html>

/content/geometrixx-media/en/events.crawlerindex.html (seed URL)

…<ul> <li><a href=“…”>Big Sur Beach Party</a></li> <li><a href=“…”>The only Festival You Need</a></li>…</ul>…

$ nutch crawl ${NUTCH_HOME}/urls -dir ${NUTCH_HOME}/crawl -threads 4 –depth 1 -topN 100-solr http://${HOST}:${PORT}/solr/${CORE}

Page 13: adaptTo() 2014 - Integrating Open Source Search with CQ/AEM

Event Listener Pattern

adaptTo() 2014 13

Method: Push Document Enrichment: Simple

Content Location: CQ, CQ + External* Implementation Complexity: Low

Document Corpus: Small, Medium, Large Infrastructure: None

Index Latency: Low

Recipe1. Wrap SolrJ as an OSGi bundle.2. Implement listener (JCR Observer, Sling eventing or workflow*) to listen for changes

to desired content.3. On event, trigger appropriate SolrJ index call (add/update/delete).

Page 14: adaptTo() 2014 - Integrating Open Source Search with CQ/AEM

Event Listener Pattern (2/2)

adaptTo() 2014 14

Page 15: adaptTo() 2014 - Integrating Open Source Search with CQ/AEM

Event Listener + Messaging Pattern

adaptTo() 2014 15

Method: Push Document Enrichment: Simple

Content Location: CQ, CQ + External* Implementation Complexity: Med+High

Document Corpus: Medium, Large Infrastructure: Messaging Middleware

Index Latency: Low

Recipe1. Wrap messaging middleware client as an OSGi bundle.2. Implement listener (JCR Observer, Sling eventing or workflow) to listen for changes to

desired content.3. On event, write to message* to queue.4. External process polls queue and performs index update.

* If messaging platform restricts message size, consider passing URL pointer to an ingestion friendly page.

Page 16: adaptTo() 2014 - Integrating Open Source Search with CQ/AEM

Event Listener + Messaging Pattern (2/2)

adaptTo() 2014 16

Document Manifest URL

{ "id" : ”<uid>", "title" : ”Big Sur Beach Party”, …}

Message Format

{ "operation":”<ADD|UPDATE|DELETE>", "timeStamp":"<UTC timestamp>", "path":"/content/geometrixx-media/en/events/big-sur-beach-party.search.xml", "uid”:"<uid>", "docUrl":http://<host>/content/geometrixx-media/en/events/big-sur-beach-party.search.json"}

Page 17: adaptTo() 2014 - Integrating Open Source Search with CQ/AEM

Event Listener + Messaging + DP Pattern

adaptTo() 2014 17

Method: Push Document Enrichment: Complex

Content Location: CQ, CQ + External Implementation Complexity: High

Document Corpus: Medium, Large Infrastructure: Messaging Middleware, Document Processing

Index Latency: Low

Recipe1. Same as Event Listener + Messaging Pattern (steps 1-3)2. Read message from document processing platform.3. Perform document transformation and enrichment.4. Update index from document processing platform using SolrJ.

Page 18: adaptTo() 2014 - Integrating Open Source Search with CQ/AEM

Enterprise Deployment Architecture

adaptTo() 2014 18

Page 19: adaptTo() 2014 - Integrating Open Source Search with CQ/AEM

19adaptTo() 2014

Searching Data

Page 20: adaptTo() 2014 - Integrating Open Source Search with CQ/AEM

AEM Solr Search

adaptTo() 2014 20

Open source Apache Solr / AEM integration Allows content authors and developers to

build rapid search interfaces Provides rich set of search UI components

Search results, facets, search input, statistics, pagination, etc.

Server-side and client-side implementation Implementations based on Bootstrap and AJAX Solr Geometrixx Media Sample (JSON update handler,

eventing, …)

Page 21: adaptTo() 2014 - Integrating Open Source Search with CQ/AEM

AEM Solr Search (2/3)

adaptTo() 2014 21

Clone it from Git.$ git clone https://github.com/headwirecom/aem-solr-search.git$ cd aem-solr-search

Deploy AEM Solr Search to AEM$ mvn clean install -Pauto-deploy-all

Deploy Geometrixx Media Sample$ mvn install -Pauto-deploy-sample

Start local Solr instance $ cd aemsolrsearch-quickstart $ mvn clean resources:resources jetty:run

Index Geometrixx Media Articles (launch new terminal)$ cd ../aemsolrsearch-geometrixx-media-sample $ ./index-geometrixx-media-articles.sh

Page 22: adaptTo() 2014 - Integrating Open Source Search with CQ/AEM

AEM Solr Search (3/3)

adaptTo() 2014 22

Page 23: adaptTo() 2014 - Integrating Open Source Search with CQ/AEM

AEM Solr Search (3/3)

adaptTo() 2014 23

Demohttp://www.aemsolrsearch.com/#/demo

Orhttps://www.youtube.com/watch?v=TUhMbD4DUSU

Page 24: adaptTo() 2014 - Integrating Open Source Search with CQ/AEM

adaptTo() 2014 24

Appendix

Page 25: adaptTo() 2014 - Integrating Open Source Search with CQ/AEM

Resources

adaptTo() 2014 25

Sites http://www.aemsolrsearch.com/ http://www.gastongonzalez.com/

Code & Samples https://github.com/headwirecom/aem-solr-search