Wikipedia Cloud Search Webinar

Posted on 25-Jun-2015

Category: Technology

DESCRIPTION

View this webinar, presented by Search Technologies' Chief Architect Paul Nelson, on cloud search and a Wikipedia use case. The webinar was given in conjunction with Amazon CloudSearch. Search Technologies provides implementation and consulting services for Amazon CloudSearch. For further information, see http://www.searchtechnologies.com/amazon-cloudsearch-services.html and http://www.searchtechnologies.com/

TRANSCRIPT

1. Searching Wikipedia with Amazon CloudSearch

2. Agenda

• Project Background
• High-level Architecture
• Summary & Observations

3. Project Background

• Amazon contracted with Search Technologies to help with beta-testing, prior to the launch of Amazon CloudSearch

• Decision to use Wikipedia as a convenient data set for testing purposes

4. High-level Architecture

5. Indexing

• Wikipedia provides content in a series of large XML files
• Amazon CloudSearch ingests XML in a specified form
• Various content processing tasks to perform
    • Splitting into individual documents
    • Date normalization
    • Metadata extraction & mapping
    • Cleanup, etc.
• We used Aspire for these tasks
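
The content processing tasks listed above can be illustrated with a small sketch. It is plain Python, not Aspire, and it is not the pipeline used in the project. It streams a tiny stand-in dump, splits it into one document per page, normalizes the revision timestamp to epoch seconds, and maps a few fields into a batch of XML documents. The field names (`title`, `updated`, `content`), the `wiki_` id prefix, and the `batch`/`add`/`field` layout are assumptions made for illustration; the exact form CloudSearch requires (including any extra attributes) should be taken from the CloudSearch documentation.

```python
import io
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Tiny stand-in for a Wikipedia dump file (real dumps are far larger).
SAMPLE_DUMP = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Alan Turing</title>
    <id>1208</id>
    <revision>
      <id>555</id>
      <timestamp>2012-04-01T12:34:56Z</timestamp>
      <text>Alan Turing was a mathematician.</text>
    </revision>
  </page>
  <page>
    <title>Search engine</title>
    <id>4059023</id>
    <revision>
      <id>777</id>
      <timestamp>2012-03-15T08:00:00Z</timestamp>
      <text>A search engine retrieves documents.</text>
    </revision>
  </page>
</mediawiki>"""


def local(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def child(elem, name):
    """Return the first direct child with the given local name, or None."""
    for node in elem:
        if local(node.tag) == name:
            return node
    return None


def normalize_date(stamp):
    """Convert an ISO 8601 UTC timestamp to integer epoch seconds."""
    parsed = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())


def iter_pages(stream):
    """Split the dump into one dict per <page> without loading it all."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local(elem.tag) != "page":
            continue
        revision = child(elem, "revision")
        yield {
            "id": "wiki_" + child(elem, "id").text,  # hypothetical id scheme
            "title": child(elem, "title").text.strip(),
            "updated": normalize_date(child(revision, "timestamp").text),
            "content": (child(revision, "text").text or "").strip(),
        }
        elem.clear()  # release the finished page to keep memory use flat


def to_batch(docs):
    """Map documents into an assumed <batch><add><field> XML layout."""
    batch = ET.Element("batch")
    for doc in docs:
        add = ET.SubElement(batch, "add", {"id": doc["id"]})
        for name in ("title", "updated", "content"):
            field = ET.SubElement(add, "field", {"name": name})
            field.text = str(doc[name])
    return ET.tostring(batch, encoding="unicode")


pages = list(iter_pages(io.BytesIO(SAMPLE_DUMP)))
print(to_batch(pages))
```

Reading with `iterparse` and clearing each page after use is what lets a very large dump file be processed as a stream rather than parsed in one piece.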

6. Aspire in Brief

• Based on Apache Felix / OSGi
• Thread-safe, multi-threaded, distributable
• Any number of pipelines, conditional branching
• Plug-in components individually testable & upgradable
• In use with FAST ESP, FS4SP, Solr, Amazon CloudSearch, GSA
• Tested with Elasticsearch and SharePoint 2013

7. XML Input

8. Indexing

• Streaming Wikipedia Dump Files directly into CloudSearch
• 500 docs/second achieved without much effort
    • Using 4 x XL instances of CloudSearch
    • 1 x XL EC2 instance for Aspire
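
To put the 500 docs/second figure in context, the arithmetic below estimates how long a full indexing run would take at that rate. The document count of 4,000,000 is a round illustrative number, not a figure from the slides, and the estimate assumes the rate is sustained for the whole run.

```python
def estimated_indexing_hours(doc_count, docs_per_second=500.0):
    """Hours needed to index doc_count documents at a steady rate."""
    if docs_per_second <= 0:
        raise ValueError("docs_per_second must be positive")
    return doc_count / docs_per_second / 3600.0


# Illustrative corpus size (an assumption, not a number from the webinar).
hours = estimated_indexing_hours(4_000_000)
print(f"{hours:.1f} hours")  # prints "2.2 hours"
```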

9. Searching

• Amazon CloudSearch provides a RESTful/XML interface for search purposes
• For the Wikipedia project, we needed a UI
    • Chose to use Twigkit
    • Wrote a Java API for CloudSearch
• The Java API is freely downloadable (with source) at http://www.searchtechnologies.com/java-api-amazon-cloudsearch.html
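
As a rough illustration of what a RESTful search interface means in practice, the sketch below builds a search request URL. It is written in Python and is not the Java API mentioned above. The endpoint host name is hypothetical, and the API version string and parameter names (`q`, `start`, `size`, `return-fields`, `results-type`) are assumptions about the interface that should be checked against the CloudSearch documentation before use. No request is sent.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; a real search domain has its own generated host name.
SEARCH_ENDPOINT = "http://search-wikipedia-example.us-east-1.cloudsearch.amazonaws.com"
# Assumed API version path segment.
API_VERSION = "2011-02-01"


def build_search_url(query, start=0, size=10, return_fields=("title",)):
    """Build a GET URL for a search request (parameter names are assumed)."""
    params = {
        "q": query,
        "start": start,
        "size": size,
        "return-fields": ",".join(return_fields),
        "results-type": "xml",
    }
    return f"{SEARCH_ENDPOINT}/{API_VERSION}/search?{urlencode(params)}"


print(build_search_url("alan turing", size=25))
```

A client library such as a Java API would typically wrap this kind of URL construction and parse the XML response into result objects for the UI.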

10. Searching

• Supports navigators and relevancy customization
    • E.g. a “PageRank” style link analysis was performed
• Limits set high: E.g. retrieve 500,000 results in a single list, delivered in just a few seconds
    • Very useful for analysis applications

• So, what does it look like?
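
The slides do not say how the “PageRank” style link analysis was computed. As a generic illustration only, the sketch below runs a textbook PageRank power iteration over a toy link graph. The page names, the damping factor of 0.85, and the iteration count are all assumptions. In a setup like the one described, a score of this kind could be stored as a numeric field on each document and used in relevancy ranking.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Textbook PageRank by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread evenly.
        dangling = sum(rank[page] for page in pages if not links.get(page))
        new_rank = {
            page: (1.0 - damping) / n + damping * dangling / n for page in pages
        }
        for source, targets in links.items():
            if not targets:
                continue
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Toy link graph with made-up page names.
toy_links = {
    "Alan_Turing": ["Computer", "Mathematics"],
    "Computer": ["Mathematics"],
    "Mathematics": ["Computer"],
    "Enigma": ["Alan_Turing", "Computer"],
}

ranks = pagerank(toy_links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```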

12. wikipedia.searchtechnologies.com

13. Summary & Observations

• A capable and scalable “raw” engine
    • XML in, RESTful/XML out
• Easy to set up – much the same as an EC2 instance
• Elastic scalability

14. Summary & Observations

• Cost effective
    • From $75 per month, including management / maintenance
• Extremely convenient
    • Switch on / off at leisure
    • Promotes experimentation & agility
