Build a Scalable Search Engine With Amazon CloudSearch, by Jon Handler
Presented at AWS Product Series: Understanding Amazon CloudSearch (AWSプロダクトシリーズ|よくわかるAmazon CloudSearch) http://kokucheese.com/event/index/168838/
Agenda
• Introduction to Search • Amazon CloudSearch • Building with CloudSearch
Introduction to Search
Search Engines Connect Us To Data
Documents
Representation of a Document
Field Value
id tt0371746
title Iron Man
description When wealthy industrialist Tony Stark is forced to build an armored suit after a life-threatening incident, he ultimately decides to use its technology to fight against evil.
director Jon Favreau
actors Robert Downey Jr., Gwyneth Paltrow, Terrence Howard ...
rating 7.9
release_date 2008-05-02T00:00:00Z
Data Types
• Doubles • Dates • Signed Integers • Text • Literal
Geo
• Latlon data type • Region search • Distance sort • Supports mobile
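Region search and distance sort can be illustrated with a small sketch. This is plain Python rather than CloudSearch code: it assumes each document carries a (latitude, longitude) pair, that a region is a bounding box given by its top-left and bottom-right corners, and that distance is measured with the haversine formula on a sphere of radius 6371 km. The function names and the approximate city coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def in_region(point, top_left, bottom_right):
    """True if point lies inside the bounding box (region search)."""
    lat, lon = point
    return bottom_right[0] <= lat <= top_left[0] and top_left[1] <= lon <= bottom_right[1]


# Approximate coordinates, for illustration only.
places = {"tokyo": (35.68, 139.77), "osaka": (34.69, 135.50), "sapporo": (43.06, 141.35)}
user = (35.68, 139.77)

# Distance sort: nearest place first.
nearest_first = sorted(places, key=lambda name: haversine_km(user, places[name]))

# Region search: keep only places inside a box around Tokyo.
in_box = [n for n, p in places.items() if in_region(p, (36.0, 139.0), (35.0, 140.0))]
```

A mobile client would typically send its current position as `user` and ask for results sorted this way.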
Text Processing (Normalization)
• Tokenization (parsing)
• Downcasing • Stemming • Stopword removal • Synonym Addition
Before: When wealthy industrialist Tony Stark is forced to build an armored suit after a life-threatening incident, he ultimately decides to use its technology to fight against evil.
After: when wealth industrial tony stark force build armor suit after life threaten incident ultimate decide use technology fight against evil
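The normalization steps listed above can be sketched in a few lines. This is a toy pipeline, not the analyzer CloudSearch uses: the stopword set and the suffix-stripping "stemmer" are assumptions made for illustration, so it will not reproduce the slide's output word for word, and synonym addition is left out.

```python
import re

# Hypothetical stopword list and suffix rules, chosen only for this example.
STOPWORDS = {"a", "an", "the", "is", "to", "he", "its"}
SUFFIXES = ("ing", "ed", "ly", "s")


def tokenize(text):
    """Split text into alphanumeric tokens."""
    return re.findall(r"[A-Za-z0-9]+", text)


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def normalize(text):
    """Tokenize, downcase, drop stopwords, then stem."""
    tokens = [t.lower() for t in tokenize(text)]
    return [stem(t) for t in tokens if t not in STOPWORDS]


terms = normalize("He decides to build an armored suit")
```

The same `normalize` function has to run on both documents and queries, otherwise the query terms would not line up with the indexed terms.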
Indexing
Term: Documents (Posting List)
Iron: The Man in the Iron Mask, Iron Man 2, Iron Man, The Iron Giant, The Iron Lady, ...
Man: Rain Man, The Man in the Moon, Iron Man 2, The Lawnmower Man, The Third Man, Iron Man, ...
Matching
Iron: The Man in the Iron Mask, Iron Man 2, Iron Man, The Iron Giant, The Iron Lady
Man: Rain Man, The Man in the Moon, Iron Man 2, The Lawnmower Man, The Third Man, Iron Man
Iron AND Man: Iron Man 2, Iron Man
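Indexing and matching can be sketched as building posting lists and intersecting them. The sketch below assumes whitespace tokenization with downcasing and uses hypothetical integer document ids; it is a minimal illustration, not how CloudSearch stores its index.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to the set of document ids containing it (posting list)."""
    index = defaultdict(set)
    for doc_id, title in docs.items():
        for term in title.lower().split():
            index[term].add(doc_id)
    return index


def match_all(index, query):
    """Return ids of documents containing every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


docs = {
    1: "The Man in the Iron Mask",
    2: "Iron Man 2",
    3: "Iron Man",
    4: "The Iron Giant",
    5: "Rain Man",
}
index = build_index(docs)
matches = match_all(index, "iron man")
```

With a complete index, "The Man in the Iron Mask" also matches because it contains both terms; the slide shows abbreviated posting lists, so it appears only under "Iron" there.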
Ranking and Relevance
• The meat of the search engine • TF-IDF – uniqueness and presence • Additional Criteria
– Measures of document value (e.g. rating) – Observed user behavior – Freshness
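As a rough illustration of TF-IDF, the sketch below scores each document by term frequency (presence) multiplied by inverse document frequency (uniqueness). It uses one common textbook variant, tf × log(N / df), with whitespace tokenization; this is an assumption for illustration and is not presented as the scoring formula CloudSearch uses. The additional criteria (rating, user behavior, freshness) are not modeled.

```python
import math
from collections import Counter


def tf_idf_scores(docs, query):
    """Score each document for the query with tf * log(N / df)."""
    n_docs = len(docs)
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    scores = {}
    for doc_id, tokens in tokenized.items():
        score = 0.0
        if tokens:
            tf = Counter(tokens)
            for term in query.lower().split():
                if df[term]:
                    score += (tf[term] / len(tokens)) * math.log(n_docs / df[term])
        scores[doc_id] = score
    return scores


docs = {
    "a": "iron man",
    "b": "the man in the iron mask",
    "c": "rain man",
    "d": "the iron giant",
}
scores = tf_idf_scores(docs, "iron")
ranked = sorted(scores, key=scores.get, reverse=True)
```

Shorter titles rank higher here because the matching term makes up a larger share of the document.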
Summary
• Search makes data accessible • A search document gathers the information about a single search target • Inverted indexes provide the basis for text matching • Relevance ranking brings the best matches to the top
Amazon CloudSearch
Building a Search service
• Build your own – Extend datastores and build custom relevance engine
• Open Source
– Apache Solr, ElasticSearch
• Enterprise Search
– FAST, Autonomy, Endeca
Challenges with building a Search service
• COMPLEX: Requires extensive search expertise • COSTLY: High upfront expenditure • SLOW: Long time to market. Slows innovation
• UNDIFFERENTIATED: Operational overhead that doesn’t add value to core product
Where CloudSearch fits in the picture
Amazon CloudSearch is a fully managed search service in the cloud that makes it easy to set up, operate, and scale a search solution for your website or application.
Similar benefits to other AWS managed services:
• Easy to set up and operate (Console, SDK, CLI) • Pay as you go • No need to guess capacity • Experiment fast with low risk • Go global in minutes
Reference Architecture
Automatic Scaling
[Diagram: a grid of search instances, each holding one copy (Copy 1 to Copy n) of one index partition (Index Partition 1 to Index Partition n). DATA (document quantity and size) drives the number of index partitions; TRAFFIC (search request volume and complexity) drives the number of copies.]
Building With CloudSearch
Movies > Sci-Fi/Fantasy > 2008 to 2010 > Downey > "Iron"
Create a Domain
Upload Data
March 2014 CloudSearch Launch
• Support for 33 languages:
Arabic, Armenian, Basque, Bulgarian, Catalan, Simplified Chinese, Traditional Chinese, Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Japanese, Korean, Latvian, Norwegian, Persian, Portuguese, Romanian, Russian, Spanish, Swedish, Thai, Turkish
Loading Data into CloudSearch (Console, CSV)
The generated SDF-format file can also be downloaded
Japanese Text Processing
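• Morphological analysis (形態素解析): the task of splitting a sentence written in natural language into a sequence of morphemes and identifying the part of speech of each (http://ja.wikipedia.org/wiki/形態素解析)
• Unlike languages such as English, where words are separated by spaces, Japanese needs parsing built specifically for Japanese
– Example: 彼はエンジニアだ ("He is an engineer")
• 彼 (noun, pronoun) / は (particle, binding particle) / エンジニア (noun, general) / だ (auxiliary verb)
• Extracting "エンジニア" as an index term makes a fast response possible when someone searches for "エンジニア"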
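Japanese Text Processing • Normalization
– A search for ｴﾝｼﾞﾆｱ (half-width kana) and a search for エンジニア (full-width kana) should both hit
– This is supported by CloudSearch
– For deeper normalization, depending on your requirements, it may be preferable to implement one of the following yourself:
• NFD (Canonical Decomposition): Normalization Form D • NFC (Canonical Composition): Normalization Form C • NFKD (Compatibility Decomposition): Normalization Form KD • NFKC (Compatibility Composition): Normalization Form KC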
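For the do-it-yourself case, the four Unicode normalization forms are available in Python's standard `unicodedata` module. The sketch below shows that NFKC folds half-width katakana into the full-width form, while NFC leaves it unchanged. Applying NFKC before upload and before querying is one possible approach, stated here as an assumption rather than a recommendation from the slides.

```python
import unicodedata

# "engineer" in half-width katakana (U+FF74 U+FF9D U+FF7C U+FF9E U+FF86 U+FF71)
half_width = "\uff74\uff9d\uff7c\uff9e\uff86\uff71"
# "engineer" in full-width katakana (U+30A8 U+30F3 U+30B8 U+30CB U+30A2)
full_width = "\u30a8\u30f3\u30b8\u30cb\u30a2"


def fold(text):
    """Compatibility-compose text so width variants compare equal."""
    return unicodedata.normalize("NFKC", text)


same_after_nfkc = fold(half_width) == fold(full_width)

# NFC only applies canonical mappings, so the half-width form is untouched.
unchanged_by_nfc = unicodedata.normalize("NFC", half_width) == half_width

# NFD splits the voiced character U+30B8 into a base character plus U+3099.
decomposed = unicodedata.normalize("NFD", "\u30b8")
```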
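Japanese Text Processing • Stemming
– 飲んだ ("drank") → 飲ん (verb, independent; baseForm: 飲む) / だ (auxiliary verb) → 飲む ("to drink")
– Entries can be added to the stemming dictionary (also possible via the API/SDK)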
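Japanese Text Processing • Stopword Removal
– Removes words that carry no meaning on their own, such as 「の」, 「は」, and 「か」 – As with stemming, entries can be added to the stopword dictionary (also possible via the API/SDK)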
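Japanese Text Processing • Synonym Addition
– Synonyms are words with the same meaning • 「ベニス」, 「ベネチア」, 「ヴェネチア」 (all "Venice") • 「昨年」, 「去年」 (both "last year")
– Because they mean the same thing, a search for one should also hit the others
– Synonyms can be added in the same way as stopwords and stemming entries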
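Japanese Text Processing • Synonym Addition
– Entries can be added to the synonym dictionary (also possible via the API/SDK)
• Alias
– A search for pupil hits documents containing student
– A search for student does not hit documents containing pupil
• Group
– A search for any of 1st, first, or one
– hits all documents containing 1st, first, or one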
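The difference between an alias and a group can be made concrete with a toy model that expands synonyms at index time. The data layout below (a list of sets for groups, a dict for aliases) and the function names are hypothetical and are not the CloudSearch synonym dictionary format; the sketch only reproduces the behavior described above.

```python
GROUPS = [{"1st", "first", "one"}]  # every member matches every other member
ALIASES = {"student": ["pupil"]}    # one-way: student documents also match "pupil"


def index_terms(tokens):
    """Return the set of terms stored in the index for a document."""
    terms = set(tokens)
    for token in tokens:
        for group in GROUPS:
            if token in group:
                terms |= group
        terms.update(ALIASES.get(token, []))
    return terms


def search(docs, term):
    """Return ids of documents whose indexed terms include the query term."""
    return {doc_id for doc_id, text in docs.items() if term in index_terms(text.split())}


docs = {"d1": "student handbook", "d2": "pupil handbook", "d3": "first edition"}
hits_pupil = search(docs, "pupil")
hits_student = search(docs, "student")
hits_one = search(docs, "one")
```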
Document Upload
http(s)://<document service endpoint>/2013-01-01/documents/batch

Accept: application/json
Content-Length: 1176
Content-Type: application/json
Host: doc.imdb-movies-rr2f34ofg56xneuemujamut52i.us-east-1.cloudsearch.amazonaws.com

[ { "type" : "add",
    "id" : "tt0371746",
    "fields" : {
      "directors" : [ "Jon Favreau" ],
      "release_date" : "2008-04-14T00:00:00Z",
      "rating" : 7.9,
      "genres" : [ "Action", "Adventure", "Sci-Fi" ],
      "image_url" : "http://ia.media-imdb.com/images/M/MV5BMTczNTI2ODUwOF5BMl5BanBnXkFtZTcwMTU0NTIzMw@@._V1_SX400_.jpg",
      "plot" : "When wealthy industrialist Tony Stark is forced to build an armored suit after a life-threatening incident, he ultimately decides to use its technology to fight against evil.",
      "title" : "Iron Man",
      "rank" : 171,
      "running_time_secs" : 7560,
      "actors" : [ "Robert Downey Jr.", "Gwyneth Paltrow", "Terrence Howard" ],
      "year" : 2008 } },
  { "type" : "delete",
    "id" : "tt0434409" } ]
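A batch like the one above can be assembled with the standard library. The sketch below assumes the batch is a JSON array of "add" and "delete" operations and that it is sent with an HTTP POST; the helper names and the endpoint used in the usage lines are hypothetical, and the request is only constructed, not sent.

```python
import json
import urllib.request


def build_batch(adds, delete_ids):
    """Serialize add and delete operations into one JSON batch body."""
    batch = [{"type": "add", "id": doc_id, "fields": fields} for doc_id, fields in adds.items()]
    batch += [{"type": "delete", "id": doc_id} for doc_id in delete_ids]
    return json.dumps(batch).encode("utf-8")


def build_upload_request(doc_endpoint, body):
    """Create (but do not send) the POST request for the batch endpoint."""
    return urllib.request.Request(
        url=f"https://{doc_endpoint}/2013-01-01/documents/batch",
        data=body,
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


body = build_batch(
    adds={"tt0371746": {"title": "Iron Man", "year": 2008, "rating": 7.9}},
    delete_ids=["tt0434409"],
)
# Hypothetical endpoint; substitute your domain's document service endpoint.
request = build_upload_request("doc.example-domain.us-east-1.cloudsearch.amazonaws.com", body)
```

Passing `request` to `urllib.request.urlopen` would perform the upload.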
Simple Queries
http(s)://<search endpoint>/2013-01-01/search?q=iron+man

{"status": {"rid": "oei6zt8oAgq5QOc=", "time-ms": 4},
 "hits": {"found": 9, "start": 0,
  "hit": [
   {"id": "tt1228705"},
   {"id": "tt0120744"},
   {"id": "tt0371746"},
   {"id": "tt1866249"},
   {"id": "tt0119558"},
   {"id": "tt0402894"},
   {"id": "tt1258972"},
   {"id": "tt1300854"},
   {"id": "tt0462465"} ] } }
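A client for the simple query can be sketched with the standard library: build the search URL, then pull the document ids out of the JSON response. The function names and the endpoint are hypothetical, and the response text is an abbreviated copy of the slide's response, so no network call is made.

```python
import json
from urllib.parse import urlencode


def build_search_url(search_endpoint, query):
    """Build a simple-query URL; urlencode turns spaces into '+'."""
    return f"https://{search_endpoint}/2013-01-01/search?{urlencode({'q': query})}"


def hit_ids(response_text):
    """Extract the matching document ids from a search response."""
    response = json.loads(response_text)
    return [hit["id"] for hit in response["hits"]["hit"]]


# Hypothetical endpoint; substitute your domain's search endpoint.
url = build_search_url("search-example-domain.us-east-1.cloudsearch.amazonaws.com", "iron man")

sample_response = (
    '{"status": {"rid": "oei6zt8oAgq5QOc=", "time-ms": 4},'
    ' "hits": {"found": 9, "start": 0,'
    ' "hit": [{"id": "tt1228705"}, {"id": "tt0120744"}, {"id": "tt0371746"}]}}'
)
ids = hit_ids(sample_response)
```

Note that `found` reports the total number of matches, while `hit` contains only the page of results returned.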
Complex Queries
Faceting
Drilldown
Movies > Sci-Fi/Fantasy > 2008 to 2010 > Downey > "Iron"
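The drilldown path above can be approximated as a single structured query with a facet and a highlight request. This sketch assumes the 2013-01-01 search API's structured parser and assumes that `genres`, `actors`, `year`, and `title` are configured in the domain as searchable, facet-enabled, or highlight-enabled as needed; the field names come from the upload example, but the exact query is an illustration, not the one used in the demo.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_drilldown_url(search_endpoint):
    """Build a structured query: text match plus genre, actor, and year filters."""
    params = {
        "q": "(and 'iron' genres:'Sci-Fi' actors:'Downey' year:[2008,2010])",
        "q.parser": "structured",
        "facet.genres": "{}",      # request facet counts for the genres field
        "highlight.title": "{}",   # request highlighted matches in the title
        "return": "title,year",
    }
    return f"https://{search_endpoint}/2013-01-01/search?{urlencode(params)}"


# Hypothetical endpoint; substitute your domain's search endpoint.
url = build_drilldown_url("search-example-domain.us-east-1.cloudsearch.amazonaws.com")
decoded = parse_qs(urlsplit(url).query)
```

Each further drilldown step adds one more clause inside the `and` expression.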
Adjustable Ranking
Highlighting
Availability Options
Scaling Options
IAM Integration
Configuration API Only
{
  "Version": "2012-10-17",
  "Statement": [
    { "Effect": "Allow",
      "Action": ["cloudsearch:*"],
      "Resource": "arn:aws:cloudsearch:us-east-1:111122223333:domain/imdb-movies" },
    { "Effect": "Deny",
      "Action": ["cloudsearch:DeleteDomain"],
      "Resource": "arn:aws:cloudsearch:us-east-1:111122223333:domain/imdb-movies" }
  ]
}
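The effect of this policy is that every CloudSearch configuration action on the domain is allowed except deleting it, because an explicit Deny overrides an Allow. The sketch below is a deliberately simplified model of that rule (exact resource match, case-sensitive wildcard matching on actions); it is not the IAM evaluation engine, and `is_allowed` is a hypothetical helper.

```python
import json
from fnmatch import fnmatchcase

DOMAIN_ARN = "arn:aws:cloudsearch:us-east-1:111122223333:domain/imdb-movies"

POLICY = json.loads("""
{
  "Version": "2012-10-17",
  "Statement": [
    {"Effect": "Allow", "Action": ["cloudsearch:*"],
     "Resource": "arn:aws:cloudsearch:us-east-1:111122223333:domain/imdb-movies"},
    {"Effect": "Deny", "Action": ["cloudsearch:DeleteDomain"],
     "Resource": "arn:aws:cloudsearch:us-east-1:111122223333:domain/imdb-movies"}
  ]
}
""")


def is_allowed(policy, action, resource):
    """Simplified check: any matching Deny wins; otherwise a matching Allow is required."""
    allowed = False
    for statement in policy["Statement"]:
        if statement["Resource"] != resource:
            continue
        if any(fnmatchcase(action, pattern) for pattern in statement["Action"]):
            if statement["Effect"] == "Deny":
                return False
            allowed = True
    return allowed


can_describe = is_allowed(POLICY, "cloudsearch:DescribeDomains", DOMAIN_ARN)
can_delete = is_allowed(POLICY, "cloudsearch:DeleteDomain", DOMAIN_ARN)
```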
Closing Thoughts
• Content Discovery goes hand in hand with Content. Search is everywhere!
• Amazon CloudSearch is a fully managed, easy-to-use, cost-effective search service: easy to build, easy to scale
• Get the powerful search features found in open source engines (Apache Solr) combined with value-added AWS features (easy setup, on-demand pricing, auto scaling, Multi-AZ, global availability)