AWS Webcast - Build a Scalable Search Engine with the New Amazon CloudSearch
DESCRIPTION
Amazon CloudSearch is a fully managed service that makes it easy to set up, operate, and scale a search solution for your website or application. Traditional search solutions require significant time and resources to maintain and operate. In addition to the complexity involved, administration of a search system is also expensive. Amazon CloudSearch not only significantly lowers the cost of a search solution, but also makes it easy to set up a search system that can change with the needs of the business. During this session we will provide an overview of Amazon CloudSearch, including recently launched search and admin features, discuss popular use cases for CloudSearch, and share best practices that will help you use CloudSearch to build scalable search solutions for your websites and applications.
TRANSCRIPT
Build a Scalable Search Engine With the New Amazon CloudSearch
Agenda
• What Search Engines Do
• Amazon CloudSearch Introduction
• Building With CloudSearch
What Search Engines Do
Search Engines Connect Us To Data
Documents
Representation of a Document
Field         Value
id            tt0371746
title         Iron Man
description   When wealthy industrialist Tony Stark is forced to build
              an armored suit after a life-threatening incident, he
              ultimately decides to use its technology to fight against
              evil.
director      Jon Favreau
actors        Robert Downey Jr., Gwyneth Paltrow, Terrence Howard
...
rating        7.9
release_date  2008-05-02T00:00:00Z
Data Types
Doubles
Dates
Signed Integers
Text
Literal
Geo
• Latlon data type
• Region search
• Distance sort
• Supports mobile
Text Processing (Normalization)
• Tokenization (parsing)
• Downcasing
• Stemming
• Stopword removal
• Synonym Addition
When wealthy industrialist Tony Stark is forced to
build an armored suit after a life-threatening
incident, he ultimately decides to use its
technology to fight against evil.
when wealth industrial tony stark force build
armor suit after life threaten incident ultimate
decide use technology fight against evil
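The normalization steps listed above can be sketched in a few lines of Python. This is a toy illustration, not CloudSearch's analyzer: the stopword list and the suffix-stripping `stem` function are assumptions made up for this example, and a real engine would use a proper stemming algorithm and language-specific rules, so the stems will not always match the slide.

```python
import re

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"is", "to", "an", "he", "its", "the", "a"}

# Hypothetical suffix rules standing in for a real stemming algorithm.
SUFFIXES = ("ing", "ed", "ly", "es", "s")


def stem(token: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def normalize(text: str) -> list[str]:
    """Tokenize, downcase, drop stopwords, then stem each remaining token."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())   # tokenization + downcasing
    kept = [t for t in tokens if t not in STOPWORDS]   # stopword removal
    return [stem(t) for t in kept]                     # stemming


print(normalize("Tony Stark is forced to build an armored suit"))
```

Synonym addition is left out here; it would typically be one more pass that appends alternate terms for selected tokens.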
Indexing
Term   Documents (Posting List)
Iron   The Man in the Iron Mask
       Iron Man 2
       Iron Man
       The Iron Giant
       The Iron Lady
       ...
Man    Rain Man
       The Man in the Moon
       Iron Man 2
       The Lawnmower Man
       The Third Man
       Iron Man
       ...
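As a rough illustration of how such a term-to-documents table could be built, the sketch below maps each term to the ids of the titles that contain it. The numeric document ids and the whitespace tokenizer are assumptions for the example; a real index would apply the full text-processing pipeline first.

```python
from collections import defaultdict

# Hypothetical document ids mapped to a few of the titles from the slide.
titles = {
    1: "The Man in the Iron Mask",
    2: "Iron Man 2",
    3: "Iron Man",
    4: "The Iron Giant",
    5: "Rain Man",
}


def build_index(docs: dict[int, str]) -> dict[str, list[int]]:
    """Map each lowercased term to the sorted ids of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


index = build_index(titles)
print(index["iron"])
print(index["man"])
```

Keeping each posting list sorted by document id is a common design choice because it lets lists be merged efficiently at query time.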
Matching
The Man in the Iron Mask
Iron Man 2
Iron Man
The Iron Giant
The Iron Lady
Rain Man
The Man in the Moon
Iron Man 2
The Lawnmower Man
The Third Man
Iron Man
Iron Man 2
Iron Man
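Matching a two-term query amounts to intersecting the two posting lists. The sketch below uses the list excerpts exactly as they appear on the slide, with hypothetical numeric ids assigned to each title, and a merge-style walk over the two sorted lists.

```python
# Hypothetical ids for the titles shown on the slide.
titles = {
    1: "The Man in the Iron Mask", 2: "Iron Man 2", 3: "Iron Man",
    4: "The Iron Giant", 5: "The Iron Lady", 6: "Rain Man",
    7: "The Man in the Moon", 8: "The Lawnmower Man", 9: "The Third Man",
}

# Posting-list excerpts as shown on the slide, sorted by id.
iron = [1, 2, 3, 4, 5]
man = [2, 3, 6, 7, 8, 9]


def intersect(a: list[int], b: list[int]) -> list[int]:
    """Return ids present in both sorted lists using a linear merge."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result


matches = [titles[doc_id] for doc_id in intersect(iron, man)]
print(matches)
```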
Ranking and Relevance
• The meat of the search engine
• TF-IDF – uniqueness and presence
• Additional Criteria
– Measures of document value (e.g. rating)
– Observed user behavior
– Freshness
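To make the TF-IDF idea concrete, here is a minimal sketch of one common variant: term frequency normalized by document length, multiplied by the log of inverse document frequency. The source does not give CloudSearch's exact formula, so treat this as an illustration of the principle only; the four-title corpus is sample data.

```python
import math

titles = ["Iron Man", "Rain Man", "The Iron Giant", "The Third Man"]
tokenized = {title: title.lower().split() for title in titles}
corpus = list(tokenized.values())


def tf_idf(term: str, tokens: list[str], corpus: list[list[str]]) -> float:
    """Presence in this document (tf) weighted by rarity across the corpus (idf)."""
    tf = tokens.count(term) / len(tokens)
    df = sum(1 for doc in corpus if term in doc)
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf


def score(query: str, tokens: list[str]) -> float:
    """Sum the TF-IDF weight of every query term for one document."""
    return sum(tf_idf(term, tokens, corpus) for term in query.lower().split())


ranked = sorted(titles, key=lambda t: score("iron man", tokenized[t]), reverse=True)
print(ranked)
```

"Iron" appears in fewer titles than "man", so it contributes more to the score; that is the "uniqueness" half of the bullet above.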
Summary
• Search makes data accessible
• Search documents gather information about one search target
• Reverse (inverted) indices provide the basis for matching query text against document text
• Relevance brings the best matches
Amazon CloudSearch
Building a Search service
• Build your own
– Extend datastores and build custom relevance engine
• Open Source
– Apache Solr, ElasticSearch
• Legacy Enterprise Search
– FAST, Autonomy, Endeca
Challenges with building a Search service
• COMPLEX: Requires extensive search expertise
• COSTLY: High upfront expenditure
• SLOW: Long time to market. Slows innovation
• UNDIFFERENTIATED: Operational overhead that doesn’t add value to
core product
Where CloudSearch fits in the picture
Amazon CloudSearch is a fully managed search service in the cloud that
makes it easy to set up, operate, and scale a search solution for your
website or application
Similar benefits as other AWS Managed Services
• Easy to set up and operate (Console, SDK, CLT)
• Pay as you go
• No need to guess capacity
• Experiment fast with low risk
• Go Global in minutes
Building With CloudSearch
Create a Domain
Upload Data
Document Upload
http(s)://< document service endpoint >/2013-01-01/documents/batch
Accept: application/json
Content-Length: 1176
Content-Type: application/json
Host: doc.imdb-movies-rr2f34ofg56xneuemujamut52i.us-east-1.cloudsearch.amazonaws.com
[ { "type" : "add", "id" : "tt0371746", "fields" : {
"directors" : [ "Jon Favreau" ],
"release_date" : "2008-04-14T00:00:00Z",
"rating" : 7.9,
"genres" : [ "Action", "Adventure", "Sci-Fi" ],
"image_url" : "http://ia.media-imdb.com/images/M/MV5BMTczNTI2ODUwOF5BMl5BanBnXkFtZTcwMTU0NTIzMw@@._V1_SX400_.jpg",
"plot" : "When wealthy industrialist Tony Stark is forced to build an armored suit after a life-threatening incident, he ultimately decides to use its technology to fight against evil.",
"title" : "Iron Man",
"rank" : 171,
"running_time_secs" : 7560,
"actors" : [ "Robert Downey Jr.", "Gwyneth Paltrow", "Terrence Howard" ],
"year" : 2008 }},
{ "type" : "delete", "id" : "tt0434409" } ]
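A document batch like the one above could be assembled and posted with the Python standard library along these lines. This is a sketch under assumptions: `doc-endpoint.example.com` is a placeholder rather than a real endpoint, the helper names are invented for the example, and depending on the domain's access policies the request may also need to be signed. The request is only built here, not sent.

```python
import json
import urllib.request


def add_op(doc_id: str, fields: dict) -> dict:
    """Batch operation that adds (or replaces) one document."""
    return {"type": "add", "id": doc_id, "fields": fields}


def delete_op(doc_id: str) -> dict:
    """Batch operation that removes one document."""
    return {"type": "delete", "id": doc_id}


def build_batch_request(endpoint: str, operations: list[dict]) -> urllib.request.Request:
    """Build (but do not send) a POST to the 2013-01-01 documents/batch resource."""
    body = json.dumps(operations).encode("utf-8")
    return urllib.request.Request(
        url=f"https://{endpoint}/2013-01-01/documents/batch",
        data=body,
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


batch = [
    add_op("tt0371746", {"title": "Iron Man", "year": 2008, "rating": 7.9}),
    delete_op("tt0434409"),
]
# Placeholder host; substitute your domain's document service endpoint.
request = build_batch_request("doc-endpoint.example.com", batch)
# urllib.request.urlopen(request) would send it; omitted so the sketch runs offline.
print(request.get_method(), request.full_url)
```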
Simple Queries
Movies > Sci-Fi/Fantasy > 2008 to 2010 > Downey > "Iron"
http(s)://<search endpoint>/2013-01-01/search?q=iron+man
{"id": "tt0371746",
"highlights": {
"plot": "When wealthy industrialist Tony Stark is
forced to build an armored suit after a life-threatening
incident, he ultimately decides to use its technology to
fight against evil.",
"title": "Iron Man"} },
{"id": "tt1866249",
"highlights": {
"plot": "A man in an iron lung who wishes to lose his
virginity contacts a professional sex surrogate with the
help of his therapist and priest.",
"title": "The Sessions" } },
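A simple query is just a URL with the search text in the `q` parameter, so client code mostly needs correct URL encoding. A minimal sketch, assuming a placeholder host name (`search-endpoint.example.com`) and a hypothetical helper `search_url`:

```python
from urllib.parse import urlencode


def search_url(endpoint: str, query: str) -> str:
    """Build a simple-query URL; urlencode turns spaces into '+'."""
    return f"https://{endpoint}/2013-01-01/search?{urlencode({'q': query})}"


# Placeholder host; substitute your domain's search endpoint.
url = search_url("search-endpoint.example.com", "iron man")
print(url)
```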
Complex Queries
/search?q=(and 'iron' genres:'Sci-Fi/Fantasy' actors:'downey'
year:[2008,2010] category:'Movies')&q.parser=structured&
q.options={fields:['title^2','plot^0.5']}
{"id": "tt0371746",
"fields": {
"title": "Iron Man",
"year": "2008"
}},
{"id": "tt1228705",
"fields": {
"title": "Iron Man 2",
"year": "2010"
}}
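The structured query above has enough punctuation that assembling it by hand is error-prone. The sketch below builds the same request from its parts and lets `urlencode` handle escaping. The helper name and the placeholder host are assumptions for the example; the parameter names (`q`, `q.parser`, `q.options`) are taken from the slide.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def structured_search_url(endpoint: str, clauses: list[str], field_weights: dict) -> str:
    """Combine clauses with 'and' and attach per-field weights via q.options."""
    query = "(and " + " ".join(clauses) + ")"
    fields = ",".join(f"'{name}^{weight}'" for name, weight in field_weights.items())
    params = {
        "q": query,
        "q.parser": "structured",
        "q.options": "{fields:[" + fields + "]}",
    }
    return f"https://{endpoint}/2013-01-01/search?{urlencode(params)}"


url = structured_search_url(
    "search-endpoint.example.com",  # placeholder host
    ["'iron'", "genres:'Sci-Fi/Fantasy'", "actors:'downey'",
     "year:[2008,2010]", "category:'Movies'"],
    {"title": 2, "plot": 0.5},
)
# Decode the query string again to check what the service would receive.
decoded = parse_qs(urlsplit(url).query)
print(decoded["q"][0])
print(decoded["q.options"][0])
```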
Faceting
Feature Detail: Faceting
/search?q=iron+man&facet.genres={}
{"status": {...},"hits": {...},
"facets": {"genres": {
"buckets": [
{"value": "Action", "count": 62},
{"value": "Sci-Fi/Fantasy", "count": 25},
{"value": "Comedy", "count": 2},
{"value": "History", "count": 1},...
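Conceptually, a facet is a count of matching documents per field value. The sketch below computes buckets in the same shape as the response above from a small list of hits; the three sample hits and the `facet_counts` helper are assumptions for illustration, and the service computes this server-side over the full result set.

```python
from collections import Counter

# Sample hits for illustration only.
hits = [
    {"title": "Iron Man", "genres": ["Action", "Adventure", "Sci-Fi"]},
    {"title": "Iron Man 2", "genres": ["Action", "Adventure", "Sci-Fi"]},
    {"title": "The Iron Lady", "genres": ["Biography", "Drama"]},
]


def facet_counts(docs: list[dict], field: str) -> list[dict]:
    """Count documents per value of a multi-valued field, largest bucket first."""
    counts = Counter(value for doc in docs for value in doc.get(field, []))
    return [{"value": value, "count": count} for value, count in counts.most_common()]


buckets = facet_counts(hits, "genres")
print(buckets)
```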
Adjustable Ranking
Expressions
• Baseline TF-IDF function provides textual relevance
• Expressions use field sources or other expressions
• Allows customization per-user or per-query
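The effect of a ranking expression can be sketched locally as a function applied to each hit before sorting. In this illustration `text_score` stands in for the baseline TF-IDF relevance and every number is made up; the point is only that swapping the expression per query changes the order.

```python
# Hypothetical hits; text_score and rating values are invented for the example.
hits = [
    {"title": "Iron Man 2", "text_score": 2.1, "rating": 7.0},
    {"title": "Iron Man", "text_score": 2.0, "rating": 7.9},
    {"title": "The Iron Giant", "text_score": 1.0, "rating": 8.0},
]


def rank(hits: list[dict], expression) -> list[str]:
    """Order titles by the value of a ranking expression, highest first."""
    return [hit["title"] for hit in sorted(hits, key=expression, reverse=True)]


# Baseline: textual relevance only.
by_text = rank(hits, lambda h: h["text_score"])
# Custom expression: textual relevance boosted by a document field.
by_text_and_rating = rank(hits, lambda h: h["text_score"] * h["rating"])
print(by_text)
print(by_text_and_rating)
```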
Highlighting
Feature Detail: Highlighting
/search?q=iron+man&highlight.plot={"format":"text"}
{"status": {"rid": "8Pq/88woCwrstGQ=","time-ms": 48},
"hits": {"found": 9,"start": 0,
"hit": [{
"id": "tt1228705",
"fields": {
"title": "Iron Man 2"
},
"highlights": {
"plot": "With the world now aware of his identity as
*Iron* *Man*, Tony Stark must contend..."
} }, . . .
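The text-format highlight shown above wraps each matched term in asterisks. A rough local equivalent, assuming exact whole-word matching (the real service also accounts for text processing such as stemming, which this sketch does not):

```python
import re


def highlight(text: str, terms: list[str], pre: str = "*", post: str = "*") -> str:
    """Wrap whole-word, case-insensitive occurrences of the terms in markers."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in terms) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda match: f"{pre}{match.group(0)}{post}", text)


plot = "With the world now aware of his identity as Iron Man, Tony Stark must contend"
print(highlight(plot, ["iron", "man"]))
```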
Feature Detail: Suggestions
http://<endpoint>/2013-01-01/suggest?q=ir&suggester=title_sug
{"status": {"rid": "t7mti80oAQrstGQ=","time-ms": 3},
"suggest": {"query": "ir", "found": 5,
"suggestions": [
{"suggestion":"Iron Man Three","score": 0,
"id": "tt0371746"},
{ "suggestion": "Iron Man", "score": 0,
"id": "tt1228705"},
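A suggester returns field values that start with what the user has typed so far. The sketch below shows the idea with a plain prefix filter over a list of sample titles; the `suggest` helper and the title list are assumptions for the example and say nothing about how the service implements or scores suggestions.

```python
# Sample titles for illustration only.
titles = ["Iron Man", "Iron Man 2", "Iron Man Three", "The Iron Giant", "Rain Man"]


def suggest(prefix: str, candidates: list[str], size: int = 5) -> list[str]:
    """Return up to `size` candidates starting with the prefix, case-insensitively."""
    lowered = prefix.lower()
    matches = sorted(c for c in candidates if c.lower().startswith(lowered))
    return matches[:size]


print(suggest("ir", titles))
```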
Feature Detail: Availability Options
Feature Detail: Scaling Options
Feature Detail: IAM Integration
Configuration API Only
{
  "Version": "2012-10-17",
  "Statement": [
    { "Effect": "Allow",
      "Action": ["cloudsearch:*"],
      "Resource": "arn:aws:cloudsearch:us-east-1:111122223333:domain/imdb-movies" },
    { "Effect": "Deny",
      "Action": ["cloudsearch:DeleteDomain"],
      "Resource": "arn:aws:cloudsearch:us-east-1:111122223333:domain/imdb-movies" }
  ]
}
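The policy above allows every CloudSearch action on the domain except deleting it, because in IAM an explicit Deny overrides any Allow and anything not allowed is denied by default. The sketch below is a simplified model of that evaluation, not the real IAM engine: it assumes a single string resource per statement, exact resource matching, and case-sensitive wildcard matching on action names.

```python
from fnmatch import fnmatchcase

DOMAIN_ARN = "arn:aws:cloudsearch:us-east-1:111122223333:domain/imdb-movies"

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": ["cloudsearch:*"], "Resource": DOMAIN_ARN},
        {"Effect": "Deny", "Action": ["cloudsearch:DeleteDomain"], "Resource": DOMAIN_ARN},
    ],
}


def is_allowed(policy: dict, action: str, resource: str) -> bool:
    """Simplified evaluation: explicit Deny wins, otherwise need at least one Allow."""
    allowed = False
    for statement in policy["Statement"]:
        if statement["Resource"] != resource:
            continue  # simplification: exact ARN match only
        if any(fnmatchcase(action, pattern) for pattern in statement["Action"]):
            if statement["Effect"] == "Deny":
                return False
            allowed = True
    return allowed


print(is_allowed(policy, "cloudsearch:DescribeDomains", DOMAIN_ARN))
print(is_allowed(policy, "cloudsearch:DeleteDomain", DOMAIN_ARN))
```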
Closing Thoughts
• Content Discovery goes hand in hand with Content. Search is
everywhere!
• CloudSearch is a fully managed, easy-to-use, cost-effective search
service
• Get the powerful search features found in open source engines
(Apache Solr) combined with value-added AWS features (easy setup,
on-demand pricing, auto scaling, Multi-AZ, global availability)