Build a Scalable Search Engine With Amazon CloudSearch

Upload: eiji-shinohara

Post on 21-Jun-2015


DESCRIPTION

Presented at AWSプロダクトシリーズ|よくわかるAmazon CloudSearch (AWS Product Series: Understanding Amazon CloudSearch), http://kokucheese.com/event/index/168838/

TRANSCRIPT

Page 1: Build a Scalable Search Engine With Amazon CloudSearch by Jon Handler

Build a Scalable Search Engine With Amazon CloudSearch

Page 2: Build a Scalable Search Engine With Amazon CloudSearch by Jon Handler

Agenda

•  Introduction to Search
•  Amazon CloudSearch
•  Building with CloudSearch

Page 3: Build a Scalable Search Engine With Amazon CloudSearch by Jon Handler

Introduction to Search

Page 4: Build a Scalable Search Engine With Amazon CloudSearch by Jon Handler

Search Engines Connect Us To Data

Page 5: Build a Scalable Search Engine With Amazon CloudSearch by Jon Handler

Documents

Page 6: Build a Scalable Search Engine With Amazon CloudSearch by Jon Handler

Representation of a Document

Field          Value
id             tt0371746
title          Iron Man
description    When wealthy industrialist Tony Stark is forced to build an armored suit after a life-threatening incident, he ultimately decides to use its technology to fight against evil.
director       Jon Favreau
actors         Robert Downey Jr., Gwyneth Paltrow, Terrence Howard ...
rating         7.9
release_date   2008-05-02T00:00:00Z

Page 7: Build a Scalable Search Engine With Amazon CloudSearch by Jon Handler

Data Types

Doubles

Dates

Signed Integers

Text

Literal
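
As a rough illustration of how a document's fields line up with these data types, here is a minimal Python sketch. The field-to-type mapping (`FIELD_TYPES`) and the helper names are assumptions made for this example; they are not taken from the slides or from the CloudSearch API.

```python
from datetime import datetime

# The Iron Man document from the earlier slide, as a plain field -> value mapping.
document = {
    "id": "tt0371746",
    "title": "Iron Man",
    "director": "Jon Favreau",
    "actors": ["Robert Downey Jr.", "Gwyneth Paltrow", "Terrence Howard"],
    "rating": 7.9,
    "release_date": "2008-05-02T00:00:00Z",
}

# Assumed mapping of fields to data types (hypothetical, for illustration only).
FIELD_TYPES = {
    "title": "text",
    "director": "text",
    "actors": "text-array",
    "rating": "double",
    "release_date": "date",
}


def check_field(value, field_type):
    """Return True if the value is plausible for the given (assumed) type."""
    if field_type == "text":
        return isinstance(value, str)
    if field_type == "text-array":
        return isinstance(value, list) and all(isinstance(v, str) for v in value)
    if field_type == "double":
        return isinstance(value, float)
    if field_type == "date":
        try:
            datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
            return True
        except (TypeError, ValueError):
            return False
    raise ValueError(f"unknown field type: {field_type}")


def check_document(doc):
    """Return the names of fields whose values do not fit their assumed type."""
    return [name for name, ftype in FIELD_TYPES.items()
            if not check_field(doc.get(name), ftype)]


print(check_document(document))
```

A document whose `rating` is a string, for example, would be reported as `["rating"]`.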

Page 8: Build a Scalable Search Engine With Amazon CloudSearch by Jon Handler

Geo

•  Latlon data type
•  Region search
•  Distance sort
•  Supports mobile
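
To make "distance sort" concrete, the following sketch orders documents by great-circle distance from a user's position. It uses the haversine formula, which is one common choice and not necessarily what CloudSearch computes internally; the store documents and coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def sort_by_distance(docs, lat, lon):
    """Return docs ordered nearest-first relative to (lat, lon)."""
    return sorted(docs, key=lambda d: haversine_km(lat, lon, *d["location"]))


# Hypothetical documents, each with a (latitude, longitude) pair.
stores = [
    {"id": "sapporo", "location": (43.0618, 141.3545)},
    {"id": "osaka", "location": (34.6937, 135.5023)},
    {"id": "tokyo", "location": (35.6762, 139.6503)},
]

# A user standing in Tokyo sees the Tokyo store first.
for store in sort_by_distance(stores, 35.6762, 139.6503):
    print(store["id"])
```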

Page 9: Build a Scalable Search Engine With Amazon CloudSearch by Jon Handler

Text Processing (Normalization)

•  Tokenization (parsing)
•  Downcasing
•  Stemming
•  Stopword removal
•  Synonym Addition

Before: When wealthy industrialist Tony Stark is forced to build an armored suit after a life-threatening incident, he ultimately decides to use its technology to fight against evil.

After: when wealth industrial tony stark force build armor suit after life threaten incident ultimate decide use technology fight against evil
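
The normalization steps can be sketched as a small pipeline. This is a toy version under stated assumptions: the stopword set, the suffix-stripping rules, and the synonym table are all hypothetical, and a real stemmer is considerably more careful, so this will not reproduce the slide's output exactly.

```python
import re

# Assumed stopword list, suffix rules, and synonyms (all hypothetical).
STOPWORDS = {"a", "an", "he", "is", "its", "to"}
SUFFIXES = ("ing", "ed", "ly", "s")
SYNONYMS = {"evil": ["villainy"]}


def tokenize(text):
    """Split text into alphanumeric tokens; hyphens and punctuation separate tokens."""
    return re.findall(r"[A-Za-z0-9]+", text)


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def normalize(text):
    """Tokenize, downcase, drop stopwords, stem, then add synonyms."""
    terms = []
    for token in tokenize(text):
        token = token.lower()
        if token in STOPWORDS:
            continue
        token = stem(token)
        terms.append(token)
        terms.extend(SYNONYMS.get(token, []))
    return terms


print(normalize("He decides to build an armored suit"))
# ['decide', 'build', 'armor', 'suit']
```

The same function is applied to documents at indexing time and to queries at search time, so both sides agree on the terms.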

Page 10: Build a Scalable Search Engine With Amazon CloudSearch by Jon Handler

Indexing

Term    Documents (Posting List)
Iron    The Man in the Iron Mask; Iron Man 2; Iron Man; The Iron Giant; The Iron Lady; ...
Man     Rain Man; The Man in the Moon; Iron Man 2; The Lawnmower Man; The Third Man; Iron Man; ...
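
A posting list like this can be built with a few lines of Python. The sketch below is a minimal inverted index over the titles from the slide, using list positions as document IDs (an assumption for the example). Note that a complete index also lists "The Man in the Iron Mask" under "man"; the slide's lists are abbreviated.

```python
from collections import defaultdict

TITLES = [
    "The Man in the Iron Mask",
    "Iron Man 2",
    "Iron Man",
    "The Iron Giant",
    "The Iron Lady",
    "Rain Man",
    "The Man in the Moon",
    "The Lawnmower Man",
    "The Third Man",
]


def build_index(titles):
    """Map each lowercased term to a sorted list of document IDs containing it."""
    index = defaultdict(list)
    for doc_id, title in enumerate(titles):
        # set() ensures each document appears once per term.
        for term in set(title.lower().split()):
            index[term].append(doc_id)
    return index


index = build_index(TITLES)
for term in ("iron", "man"):
    print(term, [TITLES[doc_id] for doc_id in index[term]])
```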

Page 11: Build a Scalable Search Engine With Amazon CloudSearch by Jon Handler

Matching

Iron: The Man in the Iron Mask; Iron Man 2; Iron Man; The Iron Giant; The Iron Lady

Man: Rain Man; The Man in the Moon; Iron Man 2; The Lawnmower Man; The Third Man; Iron Man

Iron AND Man: Iron Man 2; Iron Man
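
Matching a two-term query amounts to intersecting two posting lists. Here is a sketch using a two-pointer merge over sorted document IDs. The numeric IDs are hypothetical, and the "man" list is abbreviated exactly as on the slide, which is why "The Man in the Iron Mask" does not appear in the result.

```python
# Hypothetical document IDs for the titles on the slide.
TITLES = {
    1: "The Man in the Iron Mask",
    2: "Iron Man 2",
    3: "Iron Man",
    4: "The Iron Giant",
    5: "The Iron Lady",
    6: "Rain Man",
    7: "The Man in the Moon",
    8: "The Lawnmower Man",
    9: "The Third Man",
}

# Posting lists as shown on the slide, sorted by document ID.
IRON = [1, 2, 3, 4, 5]
MAN = [2, 3, 6, 7, 8, 9]


def intersect(a, b):
    """Intersect two sorted ID lists in O(len(a) + len(b)) time."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


matches = intersect(IRON, MAN)
print([TITLES[doc_id] for doc_id in matches])
# ['Iron Man 2', 'Iron Man']
```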

Page 12: Build a Scalable Search Engine With Amazon CloudSearch by Jon Handler

Ranking and Relevance

•  The meat of the search engine
•  TF-IDF – uniqueness and presence
•  Additional Criteria
–  Measures of document value (e.g. rating)
–  Observed user behavior
–  Freshness
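
To show what "uniqueness and presence" means in practice, here is a minimal TF-IDF scorer. It uses one common variant (term frequency divided by document length, and `log(N / df) + 1` for inverse document frequency); this is an illustrative choice, not a statement of how CloudSearch computes relevance. The document list is a small sample chosen for the example.

```python
import math
from collections import Counter


def tf_idf_scores(query, docs):
    """Score each document against the query with a simple TF-IDF sum."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            # Document frequency: how many documents contain the term.
            df = sum(1 for t in tokenized if term in t)
            if df == 0:
                continue
            tf = counts[term] / len(tokens)   # presence within this document
            idf = math.log(n / df) + 1.0      # uniqueness across the collection
            score += tf * idf
        scores.append(score)
    return scores


docs = ["Iron Man", "The Iron Giant", "Rain Man", "The Third Man"]
scores = tf_idf_scores("iron man", docs)
for title, score in sorted(zip(docs, scores), key=lambda p: -p[1]):
    print(f"{score:.3f}  {title}")
```

Additional criteria such as rating or freshness would typically be blended into this score as extra weighted terms.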

Page 13: Build a Scalable Search Engine With Amazon CloudSearch by Jon Handler

Summary

•  Search makes data accessible
•  Search documents gather information about one search target
•  Inverted (reverse) indices provide the basis for text matching
•  Relevance ranking brings the best matches to the top

Page 14: Build a Scalable Search Engine With Amazon CloudSearch by Jon Handler

Amazon CloudSearch

Page 15: Build a Scalable Search Engine With Amazon CloudSearch by Jon Handler

Building a Search service

•  Build your own
–  Extend datastores and build a custom relevance engine

•  Open Source
–  Apache Solr, Elasticsearch

•  Enterprise Search
–  FAST, Autonomy, Endeca

Page 16: Build a Scalable Search Engine With Amazon CloudSearch by Jon Handler

Challenges with building a Search service

•  COMPLEX: Requires extensive search expertise
•  COSTLY: High upfront expenditure
•  SLOW: Long time to market. Slows innovation
•  UNDIFFERENTIATED: Operational overhead that doesn’t add value to core product

Page 17: Build a Scalable Search Engine With Amazon CloudSearch by Jon Handler

Where CloudSearch fits in the picture

Amazon CloudSearch is a fully managed search service in the cloud that makes it easy to set up, operate, and scale a search solution for your website or application.

Similar benefits to other AWS managed services:

•  Easy to set up and operate (Console, SDK, CLT)
•  Pay as you go
•  No need to guess capacity
•  Experiment fast with low risk
•  Go global in minutes

Page 18: Build a Scalable Search Engine With Amazon CloudSearch by Jon Handler

Reference Architecture

Page 19: Build a Scalable Search Engine With Amazon CloudSearch by Jon Handler

Automatic Scaling

[Diagram: a grid of search instances, each holding one index partition (1 to n) and one copy of it (1 to n). DATA (document quantity and size) drives the number of index partitions; TRAFFIC (search request volume and complexity) drives the number of copies.]
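
The two scaling dimensions can be sketched as simple arithmetic: data size determines how many partitions are needed, traffic determines how many copies of each partition are needed, and the instance count is their product. CloudSearch makes these decisions automatically; the function below and its capacity figures are purely hypothetical and only illustrate the relationship.

```python
import math


def plan_instances(index_gb, queries_per_sec, gb_per_partition, qps_per_copy):
    """Return (partitions, copies, total instances) for a given load.

    gb_per_partition and qps_per_copy are made-up capacity figures, not
    published CloudSearch limits.
    """
    partitions = max(1, math.ceil(index_gb / gb_per_partition))
    copies = max(1, math.ceil(queries_per_sec / qps_per_copy))
    return partitions, copies, partitions * copies


# 50 GB of index at 300 queries/sec, assuming 20 GB per partition and
# 100 queries/sec per copy: 3 partitions x 3 copies = 9 instances.
print(plan_instances(50, 300, gb_per_partition=20, qps_per_copy=100))
```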

Page 20: Build a Scalable Search Engine With Amazon CloudSearch by Jon Handler

Building With CloudSearch

Page 21: Build a Scalable Search Engine With Amazon CloudSearch by Jon Handler

Movies > Sci-Fi/Fantasy > 2008 to 2010 > Downey > "Iron"

Page 22: Build a Scalable Search Engine With Amazon CloudSearch by Jon Handler

Create a Domain

Page 23: Build a Scalable Search Engine With Amazon CloudSearch by Jon Handler

Upload Data

Page 24: Build a Scalable Search Engine With Amazon CloudSearch by Jon Handler

March 2014 CloudSearch Launch

•  Support for 33 languages

Arabic, Armenian, Basque, Bulgarian, Catalan, Simplified Chinese, Traditional Chinese, Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Japanese, Korean, Latvian, Norwegian, Persian, Portuguese, Romanian, Russian, Spanish, Swedish, Thai, Turkish

Page 25: Build a Scalable Search Engine With Amazon CloudSearch by Jon Handler

Loading Data into CloudSearch (Console, CSV)

You can also download the generated SDF-format file.

Page 26: Build a Scalable Search Engine With Amazon CloudSearch by Jon Handler

Japanese Text Processing

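•  Morphological analysis (形態素解析)
–  The task of splitting a sentence written in natural language into a sequence of morphemes and identifying the part of speech of each (http://ja.wikipedia.org/wiki/形態素解析)

•  Unlike languages such as English, where words are separated by spaces, Japanese requires its own language-specific parsing
–  Example: 彼はエンジニアだ ("He is an engineer")
   •  彼 (noun: pronoun) / は (particle: binding particle) / エンジニア (noun: common) / だ (auxiliary verb)
   •  Extracting "エンジニア" and indexing it makes fast responses possible when a user searches for "エンジニア"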

Page 27: Build a Scalable Search Engine With Amazon CloudSearch by Jon Handler

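Japanese Text Processing

•  Normalization (正規化)

–  A search should hit whether the query is typed in half-width katakana (ｴﾝｼﾞﾆｱ) or full-width katakana (エンジニア)
–  CloudSearch supports this
–  For more thorough normalization, depending on your requirements, it may be preferable to implement one of the following yourself
   •  NFD (Canonical Decomposition): Normalization Form D
   •  NFC (Canonical Composition): Normalization Form C
   •  NFKD (Compatibility Decomposition): Normalization Form KD
   •  NFKC (Compatibility Composition): Normalization Form KC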
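
If you do implement normalization yourself, Python's standard `unicodedata` module covers all four forms. The sketch below shows NFKC folding half-width katakana into full-width katakana, so both spellings of "engineer" produce the same query term. The helper name `normalize_query` is an assumption for this example.

```python
import unicodedata

# "engineer" in half-width katakana (6 code points, including a separate
# voicing mark) and in full-width katakana (5 code points).
HALF_WIDTH = "\uff74\uff9d\uff7c\uff9e\uff86\uff71"
FULL_WIDTH = "\u30a8\u30f3\u30b8\u30cb\u30a2"


def normalize_query(text, form="NFKC"):
    """Apply a Unicode normalization form before indexing or searching."""
    return unicodedata.normalize(form, text)


# NFKC maps compatibility characters to their standard forms and composes them.
print(normalize_query(HALF_WIDTH) == FULL_WIDTH)          # True
# NFC only composes canonical sequences, so half-width kana are left alone.
print(normalize_query(HALF_WIDTH, "NFC") == FULL_WIDTH)   # False
# NFD splits the voiced character into base character plus combining mark.
print(len(normalize_query(FULL_WIDTH, "NFD")))            # 6
```

Applying the same form to both documents and queries is what makes the two spellings match.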

Page 28: Build a Scalable Search Engine With Amazon CloudSearch by Jon Handler

Japanese Text Processing

•  Stemming

–  飲んだ → 飲ん (verb: independent, baseForm: 飲む) / だ (auxiliary verb) → 飲む
–  Entries can be added to the stemming dictionary (also possible via the API/SDK)

Page 29: Build a Scalable Search Engine With Amazon CloudSearch by Jon Handler

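Japanese Text Processing

•  Stopword Removal

–  Removes words that carry no meaning on their own, such as 「の」, 「は」, and 「か」
–  As with stemming, entries can be added to the stopword dictionary (also possible via the API/SDK)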

Page 30: Build a Scalable Search Engine With Amazon CloudSearch by Jon Handler

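Japanese Text Processing

•  Synonym Addition

–  Synonyms (同義語) are words with the same meaning
   •  「ベニス」「ベネチア」「ヴェネチア」 (all spellings of "Venice")
   •  「昨年」「去年」 (both mean "last year")
–  Because they mean the same thing, a search for any one of them should hit the others
–  Synonyms can be added in the same way as stopwords and stemming entries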

Page 31: Build a Scalable Search Engine With Amazon CloudSearch by Jon Handler

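Japanese Text Processing

•  Synonym Addition

–  Entries can be added to the synonym dictionary (also possible via the API/SDK)

•  Alias

–  Searching for "pupil" hits documents containing "student"
–  Searching for "student" does not hit documents containing "pupil"

•  Group

–  Searching for any of "1st", "first", or "one" hits all documents containing "1st", "first", or "one"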
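
The difference between an alias (one-directional) and a group (symmetric) can be sketched as query-time term expansion. The dictionary layout with `aliases` and `groups` keys is modeled on the synonym options of a CloudSearch analysis scheme as best I recall it, so treat the exact shape as an assumption; the `expand` helper is hypothetical, and CloudSearch may apply synonyms at a different stage than this sketch does.

```python
# Assumed synonym dictionary layout: one alias and one group from the slide.
SYNONYMS = {
    "aliases": {"pupil": ["student"]},
    "groups": [["1st", "first", "one"]],
}


def expand(term, synonyms):
    """Return the term plus every term it should also match."""
    expanded = [term]
    # Aliases are one-directional: only the key expands to its values.
    for alias in synonyms["aliases"].get(term, []):
        if alias not in expanded:
            expanded.append(alias)
    # Groups are symmetric: any member expands to all members.
    for group in synonyms["groups"]:
        if term in group:
            for other in group:
                if other not in expanded:
                    expanded.append(other)
    return expanded


print(expand("pupil", SYNONYMS))    # ['pupil', 'student']
print(expand("student", SYNONYMS))  # ['student']
print(expand("first", SYNONYMS))    # ['first', '1st', 'one']
```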

Page 32: Build a Scalable Search Engine With Amazon CloudSearch by Jon Handler

Document Upload

http(s)://<document service endpoint>/2013-01-01/documents/batch

Accept: application/json
Content-Length: 1176
Content-Type: application/json
Host: doc.imdb-movies-rr2f34ofg56xneuemujamut52i.us-east-1.cloudsearch.amazonaws.com

[
  {
    "type" : "add",
    "id" : "tt0371746",
    "fields" : {
      "directors" : [ "Jon Favreau" ],
      "release_date" : "2008-04-14T00:00:00Z",
      "rating" : 7.9,
      "genres" : [ "Action", "Adventure", "Sci-Fi" ],
      "image_url" : "http://ia.media-imdb.com/images/M/MV5BMTczNTI2ODUwOF5BMl5BanBnXkFtZTcwMTU0NTIzMw@@._V1_SX400_.jpg",
      "plot" : "When wealthy industrialist Tony Stark is forced to build an armored suit after a life-threatening incident, he ultimately decides to use its technology to fight against evil.",
      "title" : "Iron Man",
      "rank" : 171,
      "running_time_secs" : 7560,
      "actors" : [ "Robert Downey Jr.", "Gwyneth Paltrow", "Terrence Howard" ],
      "year" : 2008
    }
  },
  {
    "type" : "delete",
    "id" : "tt0434409"
  }
]
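
A batch like this can be assembled programmatically. The following sketch builds the JSON body and a POST request object with the Python standard library, without sending anything. It assumes the 2013-01-01 batch format with `type`, `id`, and `fields` keys; the helper names and the endpoint are hypothetical placeholders.

```python
import json
import urllib.request


def build_batch(adds, delete_ids):
    """Build a batch body: one "add" per document, one "delete" per ID."""
    batch = [{"type": "add", "id": doc_id, "fields": fields}
             for doc_id, fields in adds.items()]
    batch += [{"type": "delete", "id": doc_id} for doc_id in delete_ids]
    return json.dumps(batch).encode("utf-8")


def build_upload_request(doc_endpoint, body):
    """Create (but do not send) a POST request to the document endpoint."""
    url = f"https://{doc_endpoint}/2013-01-01/documents/batch"
    headers = {"Content-Type": "application/json", "Accept": "application/json"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


body = build_batch(
    adds={"tt0371746": {"title": "Iron Man", "year": 2008, "rating": 7.9}},
    delete_ids=["tt0434409"],
)
# Placeholder endpoint; substitute your domain's document service endpoint.
request = build_upload_request("doc.example-domain.example.com", body)
print(request.get_method(), request.full_url)
print(len(body), "bytes")
```

Sending the request would be a matter of passing it to `urllib.request.urlopen`, subject to whatever access policy the domain enforces.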

Page 33: Build a Scalable Search Engine With Amazon CloudSearch by Jon Handler

Simple Queries

Movies > Sci-Fi/Fantasy > 2008 to 2010 > Downey > "Iron"

Page 34: Build a Scalable Search Engine With Amazon CloudSearch by Jon Handler

Simple Queries

http(s)://<search endpoint>/2013-01-01/search?q=iron+man

{
  "status": { "rid": "oei6zt8oAgq5QOc=", "time-ms": 4 },
  "hits": {
    "found": 9,
    "start": 0,
    "hit": [
      {"id": "tt1228705"},
      {"id": "tt0120744"},
      {"id": "tt0371746"},
      {"id": "tt1866249"},
      {"id": "tt0119558"},
      {"id": "tt0402894"},
      {"id": "tt1258972"},
      {"id": "tt1300854"},
      {"id": "tt0462465"}
    ]
  }
}
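
On the client side, a simple query comes down to URL-encoding the search terms and reading the hit IDs out of the JSON response. Here is a minimal sketch using only the standard library; the helper names and the endpoint are hypothetical, and the sample response is a trimmed copy of the one on the slide.

```python
import json
from urllib.parse import urlencode


def build_search_url(search_endpoint, query, size=10):
    """Build a search URL; urlencode turns spaces into '+'."""
    params = urlencode({"q": query, "size": size})
    return f"https://{search_endpoint}/2013-01-01/search?{params}"


def hit_ids(response_text):
    """Extract the document IDs from a search response body."""
    body = json.loads(response_text)
    return [hit["id"] for hit in body["hits"]["hit"]]


# Trimmed version of the response shown on the slide.
SAMPLE_RESPONSE = """
{"status": {"rid": "oei6zt8oAgq5QOc=", "time-ms": 4},
 "hits": {"found": 9, "start": 0,
          "hit": [{"id": "tt1228705"}, {"id": "tt0120744"}, {"id": "tt0371746"}]}}
"""

print(build_search_url("search.example-domain.example.com", "iron man"))
print(hit_ids(SAMPLE_RESPONSE))
```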

Page 35: Build a Scalable Search Engine With Amazon CloudSearch by Jon Handler

Complex Queries

Movies > Sci-Fi/Fantasy > 2008 to 2010 > Downey > "Iron"

Page 36: Build a Scalable Search Engine With Amazon CloudSearch by Jon Handler

Faceting

Movies > Sci-Fi/Fantasy > 2008 to 2010 > Downey > "Iron"

Page 37: Build a Scalable Search Engine With Amazon CloudSearch by Jon Handler

Drilldown

Movies > Sci-Fi/Fantasy > 2008 to 2010 > Downey > "Iron"

Page 38: Build a Scalable Search Engine With Amazon CloudSearch by Jon Handler

Adjustable Ranking

Movies > Sci-Fi/Fantasy > 2008 to 2010 > Downey > "Iron"

Page 39: Build a Scalable Search Engine With Amazon CloudSearch by Jon Handler

Movies > Sci-Fi/Fantasy > 2008 to 2010 > Downey > "Iron"

Highlighting

Page 40: Build a Scalable Search Engine With Amazon CloudSearch by Jon Handler

Movies > Sci-Fi/Fantasy > 2008 to 2010 > Downey > "Iron"

Page 41: Build a Scalable Search Engine With Amazon CloudSearch by Jon Handler

Availability Options

Page 42: Build a Scalable Search Engine With Amazon CloudSearch by Jon Handler

Scaling Options

Page 43: Build a Scalable Search Engine With Amazon CloudSearch by Jon Handler

IAM Integration

Configuration API Only

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["cloudsearch:*"],
      "Resource": "arn:aws:cloudsearch:us-east-1:111122223333:domain/imdb-movies"
    },
    {
      "Effect": "Deny",
      "Action": ["cloudsearch:DeleteDomain"],
      "Resource": "arn:aws:cloudsearch:us-east-1:111122223333:domain/imdb-movies"
    }
  ]
}
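
The policy above allows every CloudSearch configuration action on the domain except deleting it, because an explicit Deny takes precedence over an Allow. The sketch below is a deliberately simplified model of that evaluation rule, not the real IAM engine: it matches resources by exact string comparison and actions with case-sensitive wildcards, and the `is_allowed` helper is hypothetical.

```python
from fnmatch import fnmatchcase

DOMAIN_ARN = "arn:aws:cloudsearch:us-east-1:111122223333:domain/imdb-movies"

POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": ["cloudsearch:*"], "Resource": DOMAIN_ARN},
        {"Effect": "Deny", "Action": ["cloudsearch:DeleteDomain"], "Resource": DOMAIN_ARN},
    ],
}


def is_allowed(policy, action, resource):
    """Simplified evaluation: explicit Deny wins, otherwise any Allow, else deny."""
    allowed = False
    for statement in policy["Statement"]:
        if statement["Resource"] != resource:
            continue
        if not any(fnmatchcase(action, pattern) for pattern in statement["Action"]):
            continue
        if statement["Effect"] == "Deny":
            return False  # an explicit Deny overrides any Allow
        allowed = True
    return allowed  # no matching statement means implicit deny


print(is_allowed(POLICY, "cloudsearch:DescribeDomains", DOMAIN_ARN))  # True
print(is_allowed(POLICY, "cloudsearch:DeleteDomain", DOMAIN_ARN))     # False
```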

Page 44: Build a Scalable Search Engine With Amazon CloudSearch by Jon Handler

Closing Thoughts

•  Content discovery goes hand in hand with content. Search is everywhere!

•  Amazon CloudSearch is a fully managed, easy-to-use, cost-effective search service: easy to build, easy to scale

•  Get the powerful search features found in open source engines (Apache Solr) combined with value-added AWS features (easy setup, on-demand pricing, auto scaling, Multi-AZ, global availability)

Page 45: Build a Scalable Search Engine With Amazon CloudSearch by Jon Handler

Questions?

Jon Handler

Pravin Muthukumar