elasticsearch meetup 30 - 10 - 2014
ElasticSearch: lessons learned
Alberto Paro, October 30, 2014
Agenda
  Introduction
  ElasticSearch
  Common Pitfalls
  Questions
About me
  Alberto Paro, @aparo77
  My motto: “always learning”
  CTO at Big Data Technologies
  Freelance Consulting
    International companies (Italy, Switzerland, Austria, USA)
    Web Development on Big Data Solutions
    NLP/Spark/Lucene/SOLR/ElasticSearch implementations & training
    Reactive and Functional Programming (Scala, Akka, Spray.io, Play)
About me
  Packt Publishing Book Author and reviewer
    ElasticSearch Cookbook (Author, Dec 2013)
    ElasticSearch Server (Review, Apr 2014)
    ElasticSearch Cookbook – Second Edition (Author, Dec 2014)
  Using ElasticSearch from 2010 ~ version 1.10
  PyES – ElasticSearch Python driver used by CERN, IBM, …
  ElasticSearch MongoDB river
  Django ElasticSearch Engine
  For companies I developed up to 4 ORMs for ElasticSearch (.Net, Python, Scala) and several plugins
ElasticSearch
  Apache Lucene
  Started in 2010 by Shay Banon
  Open Source – Apache License
  A company was formed in 2012: ElasticSearch
    Training, support and development
  @kimchy
ElasticSearch
  Scalable
    Distributed, Node Discovery
    Automatic sharding
    Query distribution
  RESTful, HTTP API
    With API wrappers for .Net, Ruby, Java, Scala, …
    JSON in, JSON out -> JSON Coast-to-Coast
  Document Model
    Maps JSON to Object
    “schemaless” -> field type recognition
    Keeps source, keeps ‘version’ number, keeps timestamp, …
ElasticSearch
  Field types and analyzers
    String, numeric, geo, …
    Custom types: attachments, IP, IBAN, …
    Arrays, subdocuments, nested documents
  Integrated Aggregations
    Your big data insights
    Terms
    Min/Max/Avg/Sum
    Top hit
    Geo Distance
    And more
DBMS -> ElasticSearch
  DBMS        ElasticSearch   MongoDB
  Database    Index           Database
  Table       Type            Collection
  Field       Field           Field
  Record      Document        Document
  Users must rethink their models.
DBMS -> ElasticSearch
  Data modelling is the same as Entity-Relationship, plus:
    Multi values
    Embedding
    Mutable/Immutable data
  Three alternatives to foreign keys:
    Term query
    Parent/Child
    Nested
{
  "book" : {
    "isbn" : "9781782166627",
    "name" : "ElasticSearch Cookbook",
    "author" : {
      "first_name" : "Alberto",
      "last_name" : "Paro"
    },
    "pages" : 430,
    "tag" : ["elasticsearch", "java", "python", "Rest"]
  }
}
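To make two of the foreign-key alternatives concrete, here is a minimal Python sketch that builds an explicit mapping for the book document above, with the author stored as a nested object, plus a term query on the ISBN. The field types and the `not_analyzed` settings are assumptions chosen for illustration (ES 1.x style), not the mapping used in the talk.

```python
import json

# Hypothetical explicit mapping for the "book" type (ES 1.x style).
# Every field type below is an assumption made for this example.
book_mapping = {
    "book": {
        "properties": {
            "isbn": {"type": "string", "index": "not_analyzed"},
            "name": {"type": "string"},
            "author": {
                # "nested" keeps each author object as its own hidden document
                "type": "nested",
                "properties": {
                    "first_name": {"type": "string"},
                    "last_name": {"type": "string"},
                },
            },
            "pages": {"type": "integer"},
            "tag": {"type": "string", "index": "not_analyzed"},
        }
    }
}

# Term query on the ISBN: an exact-match lookup, used like a foreign key.
term_query = {"query": {"term": {"isbn": "9781782166627"}}}

mapping_json = json.dumps(book_mapping, indent=2)
query_json = json.dumps(term_query)
print(mapping_json)
print(query_json)
```

The term query only matches reliably when the field is not analyzed, which is why the sketch marks `isbn` as `not_analyzed`.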
Common Pitfalls
  Schema(less)?
  Automatic field type recognition
    Can miss types
    Strict about types: only some types can be upgraded
  Check the datetime:
    UNIX (epoch from …) (the standard world)
    ISO 8601 -> “yyyy-MM-ddTHH:mm:ssZ”
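The two datetime representations above can be produced from the same value. The following is a small sketch using only the Python standard library; the helper names and the sample time of day are made up for illustration. It assumes epoch values are sent as milliseconds, which is how ES 1.x reads a number in a date field.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def to_iso8601(dt):
    """Format an aware datetime as yyyy-MM-ddTHH:mm:ssZ, normalised to UTC."""
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

def to_epoch_millis(dt):
    """Whole milliseconds since the UNIX epoch, using exact integer arithmetic."""
    return (dt - EPOCH) // timedelta(milliseconds=1)

# Sample value: the meetup date; the time of day is invented for the example.
meetup = datetime(2014, 10, 30, 18, 30, 0, tzinfo=timezone.utc)
iso_value = to_iso8601(meetup)
millis_value = to_epoch_millis(meetup)
print(iso_value)
print(millis_value)
```

Whichever form is chosen, sending it consistently for a given field avoids the type-recognition surprises listed above.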
Common Pitfalls
  What’s the best transport protocol?
  In the JVM, prefer the native protocol
    Faster
    Extra bonus
  HTTP is best behind a balancer
  Thrift is best for performance
    Faster than HTTP
    Charset “safe”
Common Pitfalls
  Never, never publish your ElasticSearch server outside the DMZ
    Security problems with scripting
    A simple HTTP request can destroy your server
      Or simply drain your money on Amazon Cloud
    ElasticSearch has a lot of problems with URL security
    Vulnerabilities
Common Pitfalls
  Very fast indexing
  Bulk indexing:
    Set up without replicas (replicas = 0, not 1)
    Play with bulk size (300-500-1000-5000-10000)
    Performance depends on data complexity
  Before indexing (disable refresh):
curl -XPUT localhost:9200/test/_settings -d '{ "index" : { "refresh_interval" : "-1" } }'
  After indexing (restore refresh):
curl -XPUT localhost:9200/test/_settings -d '{ "index" : { "refresh_interval" : "1s" } }'
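Trying different bulk sizes is easier when batching is separate from the request itself. The sketch below builds the newline-delimited body that the `_bulk` endpoint expects, one action line followed by one source line per document. The helper names, the `test`/`book` index and type, and the sample documents are assumptions for illustration. Nothing is sent to a server.

```python
import json

def chunks(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def bulk_body(index, doc_type, docs):
    """Build a _bulk request body: action line, then source line, per document."""
    lines = []
    for doc_id, source in docs:
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"

# Hypothetical sample data: 1200 tiny documents, batched 500 at a time.
docs = [(str(i), {"name": "doc %d" % i}) for i in range(1200)]
bodies = [bulk_body("test", "book", batch) for batch in chunks(docs, 500)]
print(len(bodies))
```

Each element of `bodies` could then be POSTed to `localhost:9200/_bulk`, timing the run for each bulk size to find what suits the data.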
Common Pitfalls
  ElasticSearch uses a lot of memory and file descriptors!
  Optimize them in /etc/security/limits.conf
elasticsearch soft nofile 32000
elasticsearch hard nofile 32000
elasticsearch - memlock unlimited
  Set the ES_HEAP_SIZE
  ElasticSearch config file config/elasticsearch.yml
bootstrap.mlockall: true
Common Pitfalls
  Wait for the yellow status
  Are you using ElasticSearch as a primary datastore?
    It can replace either a DBMS or MongoDB, but it depends on your data
    Cron snapshots
    Don’t abuse flush
    (Be reactive)
  Prefer “update” to re-posting the same object
  Use the “version”, Luke!
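The yellow-status wait, the partial update, and the version check each map to a plain HTTP request. Here is a minimal sketch that only builds those requests, without sending them. The base URL, the helper names, and the 30s timeout are assumptions for illustration. The endpoints follow the ES 1.x cluster health, index, and update APIs.

```python
from urllib.parse import urlencode

BASE = "http://localhost:9200"  # assumed local node

def health_wait_url(status="yellow", timeout="30s"):
    """Cluster health call that blocks until `status` is reached or `timeout` expires."""
    query = urlencode({"wait_for_status": status, "timeout": timeout})
    return "%s/_cluster/health?%s" % (BASE, query)

def versioned_index_url(index, doc_type, doc_id, version):
    """Index call that is rejected if the stored version differs from `version`."""
    query = urlencode({"version": version})
    return "%s/%s/%s/%s?%s" % (BASE, index, doc_type, doc_id, query)

def partial_update(index, doc_type, doc_id, changes):
    """URL and body for the update API: only the changed fields are sent."""
    url = "%s/%s/%s/%s/_update" % (BASE, index, doc_type, doc_id)
    return url, {"doc": changes}

health_url = health_wait_url()
index_url = versioned_index_url("test", "book", "1", 3)
update_url, update_body = partial_update("test", "book", "1", {"pages": 431})
print(health_url)
print(index_url)
print(update_url, update_body)
```

A write sent with a stale version fails with a conflict instead of silently overwriting a newer document, which is the point of using the version.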
Common Pitfalls
  If possible don’t use rivers
    Hard to debug
    Reduce your server’s responsiveness
    Can crash your server
    They will be removed (2.0?)
    (Prefer Spark SchemaRDD)
  Use scripts
    The easy way to extend ElasticSearch for trivial functionalities
    Prefer Groovy (or native Java for performance)
    Don’t use inline scripts, if possible
      Prefer indexed or file scripts with parameters
Common Pitfalls
  Use plugins
    If it’s not available, write a new one
  Always back up before upgrading
    Snapshots can save your life!
    Bug in 1.3.x
    Check your plugins for compatibility
    Read the ElasticSearch changelog
  Sometimes you MUST upgrade your cluster
  Use at least 3 nodes (if possible)
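For the backup advice above, a dated snapshot name keeps scheduled and pre-upgrade snapshots easy to tell apart. The sketch below only builds the request URL for the snapshot API. The repository name `my_backup`, the naming scheme, and the base URL are hypothetical, and the repository is assumed to be registered already.

```python
from datetime import date

def snapshot_url(repository, day, base="http://localhost:9200"):
    """URL for creating a snapshot named after the given day (lowercase name)."""
    name = "snapshot-%s" % day.strftime("%Y.%m.%d")
    return "%s/_snapshot/%s/%s?wait_for_completion=true" % (base, repository, name)

# Hypothetical repository name; a cron job would pass date.today() instead.
url = snapshot_url("my_backup", date(2014, 10, 30))
print(url)
```

Sending a PUT to this URL from a cron job, or by hand before an upgrade, would create the snapshot and return once it has completed.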
Conclusions
  ElasticSearch benefits
    Easy to set up
    Very clever architecture
  Drawbacks
    Changing the sharding of a populated index is non-trivial
    Pay attention when upgrading
  ElasticSearch
    Clever architecture, fast, stable, extendable
    Does exactly what you need