Using ElasticSearch as a fast, flexible, and scalable solution to search occurrence records and checklists
Christian Gendreau, Canadensys
Marie-Elise Lecoq, GBIF France

Upload: kristgen

Post on 26-Jan-2015

DESCRIPTION

TDWG 2013 talk on ElasticSearch by Canadensys and GBIF France.

TRANSCRIPT

Page 1: Using ElasticSearch as a fast, flexible, and scalable solution to search occurrence records and checklists

Using ElasticSearch as a fast, flexible, and scalable solution to search occurrence records and checklists

Christian Gendreau, Canadensys
Marie-Elise Lecoq, GBIF France

Page 2

Introduction

ElasticSearch is an open source, document oriented, distributed search engine, built on top of Apache Lucene.

From ElasticSearch GitHub page

Page 3

Setup

•  Java 6 or higher
•  Download: # wget …elasticsearch-0.90.5.zip
•  Unzip

Page 4

Configuration

•  Name your cluster
•  Replication and sharding are enabled by default
•  Start: # bin/elasticsearch

Page 5

Add data

Using the REST API

$ curl -XPUT 'http://localhost:9200/twitter/tweet/1' -d '{
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elastic Search"
}'

Page 6

Import data

Rivers
•  Document-based database (MongoDB)
•  JDBC (relational database)
•  Data source (Wikipedia, Twitter)

Page 7

Mapping

•  Schema-less
•  Customize indexing
•  Customize querying

Page 8

ElasticSearch at Canadensys

Database of Vascular Plants of Canada (VASCAN)

data.canadensys.net/vascan

Page 9

Our ElasticSearch index

Index structure for scientific names
•  autocompletion: edge_ngram filter
   o  "carex" -> "ca", "car", "care", "carex"
•  genus first letter: pattern_replace filter
   o  "carex feta" -> "c. feta"
•  epithet: path_hierarchy tokenizer
   o  "carex feta" -> "feta"
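
The three transformations listed on this slide can be imitated in plain Java to make their output concrete. This is only a sketch of the resulting tokens, not the Lucene or ElasticSearch implementation. The class and method names are hypothetical, the minimum gram length of 2 is inferred from the "carex" example, and the regular expression and the epithet rule are assumptions, since the slide does not show the actual filter settings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that mimics the token output shown on the slide.
final class NameAnalysisSketch {

    // Prefixes of the token from minGram up to maxGram characters,
    // similar in spirit to an edge n-gram filter used for autocompletion.
    static List<String> edgeNgrams(String token, int minGram, int maxGram) {
        List<String> grams = new ArrayList<String>();
        int longest = Math.min(maxGram, token.length());
        for (int n = minGram; n <= longest; n++) {
            grams.add(token.substring(0, n));
        }
        return grams;
    }

    // Replaces the first word (the genus) with its initial and a period,
    // similar in spirit to a pattern_replace filter. The regex is an assumption.
    static String abbreviateGenus(String name) {
        return name.replaceFirst("^(\\p{L})\\p{L}*\\s+", "$1. ");
    }

    // Keeps everything after the first space, matching the slide's
    // "carex feta" -> "feta" example. Names without a space are returned as-is.
    static String epithet(String name) {
        int space = name.indexOf(' ');
        return space < 0 ? name : name.substring(space + 1);
    }
}
```

For example, NameAnalysisSketch.edgeNgrams("carex", 2, 10) yields the four prefixes from the slide, and abbreviateGenus("carex feta") yields "c. feta".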

Page 10

ElasticSearch at GBIF France

Data stored in ElasticSearch are updated upon MongoDB changes.

The search engine queries ElasticSearch using filters such as taxon, date, place, dataset, and geolocation. Statistics are calculated using facets.

Page 11

ElasticSearch at GBIF France

Page 12

ElasticSearch - Solr

•  Solr and ElasticSearch both try to solve the same problem, with few differences between them

•  Development setup and production deployment (replication / sharding) are easier with ElasticSearch

•  By default, ElasticSearch is well configured for Lucene, and customization remains easy

Page 13

Facets

•  Comparable to "GROUP BY" in SQL
•  Mostly used to calculate statistics
•  Example:

curl -XGET [...]
"facets" : {
    "dataset" : {
        "terms" : {
            "field" : "dataset",
            "order" : "term"
            …

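The "group by" comparison can be illustrated with a small in-memory count in Java. This sketch only imitates the idea of a terms facet ordered by term. It does not call ElasticSearch, the class and method names are hypothetical, and the dataset values in the usage note are made up for illustration. A real terms facet works on the indexed terms of the field and normally returns only the top entries.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory imitation of a terms facet.
final class TermsFacetSketch {

    // Counts how often each value occurs. A TreeMap keeps the keys in
    // alphabetical order, which corresponds to "order" : "term".
    static Map<String, Integer> termsFacet(List<String> values) {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (String value : values) {
            Integer previous = counts.get(value);
            counts.put(value, previous == null ? 1 : previous + 1);
        }
        return counts;
    }
}
```

Given the values "herbarium-b", "herbarium-a", "herbarium-b", the result maps "herbarium-a" to 1 and "herbarium-b" to 2, in that order.
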
Page 14

API and libraries

REST API
o  interoperability between different programming languages
o  HTTP requests

Java API
o  more efficient than the REST API because it uses the binary protocol
o  built-in marshalling (data formatting on the network)

Page 15

Query - RESTful API

Example:
$ curl 'localhost:9200/vascan/_search?pretty=1' -d '{
    "query" : {
        "match" : {
            "name" : { "query" : "carex" }
        }
    }
}'
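
Because the REST API is plain HTTP, the same match query can also be sent from Java without the ElasticSearch client library. The following is a minimal sketch using only the JDK (Java 7 or later is assumed for try-with-resources). The class and method names are hypothetical, the JSON escaping handles only quotes and backslashes, and the base URL and index name must be supplied by the caller.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

// Hypothetical helper that posts a match query to the _search endpoint.
final class RestSearchSketch {

    // Wraps a string in JSON quotes. Only quotes and backslashes are escaped;
    // control characters are not handled in this sketch.
    static String jsonString(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '"' || c == '\\') {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.append('"').toString();
    }

    // Builds the same request body as the curl example.
    static String matchQueryBody(String field, String text) {
        return "{\"query\":{\"match\":{" + jsonString(field)
                + ":{\"query\":" + jsonString(text) + "}}}}";
    }

    // Sends the body with POST (what curl does when -d is given)
    // and returns the raw JSON response as a string.
    static String search(String baseUrl, String index, String body) throws IOException {
        URL url = new URL(baseUrl + "/" + index + "/_search?pretty=1");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/json");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        try (InputStream in = conn.getInputStream();
             Scanner scanner = new Scanner(in, "UTF-8").useDelimiter("\\A")) {
            return scanner.hasNext() ? scanner.next() : "";
        }
    }
}
```

With a node running locally, RestSearchSketch.search("http://localhost:9200", "vascan", RestSearchSketch.matchQueryBody("name", "carex")) would issue the request from the slide.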

Page 16

Query - Java API

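Code example:
...
// Needs org.elasticsearch.action.search.SearchRequestBuilder
// and org.elasticsearch.index.query.QueryBuilders
SearchRequestBuilder srb = client.prepareSearch(INDEX_NAME)
    // bool query with one optional ("should") match clause
    .setQuery(QueryBuilders.boolQuery()
        .should(QueryBuilders.matchQuery("vernacular_name", text)))
    // restrict the search to one document type
    .setTypes(VERNACULAR_TYPE);
...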

Page 17

Pitfalls

•  Error reporting (index creation, river creation)
•  Results may be hard to predict when using complex queries
•  Documentation
•  Every mapping modification requires reindexing the data

Page 18

Future

•  Scientific name analyzer
•  Geospatial component

Page 19

Thank you!