GraphConnect Europe 2016 - Moving Graphs to Production at Scale - Ian Robinson


Post on 16-Apr-2017


Moving Graphs to Production
Ian Robinson

Overview

•  Solution Architectures
•  Hardware/Software Requirements
•  HA Architecture
•  Backups
•  Monitoring
•  Testing

Solution Architectures

•  Server
•  Server with Procedures
•  Embedded

Server
•  Server infrastructure wraps embedded Neo4j
•  Binary protocol (Bolt)
•  Uniform drivers (Java, .NET, Python, JavaScript)

[Diagram: Application, Driver, Load balancer; Cypher/Bolt connections]

Server with Procedures
•  Server-side jar, called from Cypher
•  Execute complex logic on server
•  Close to the data
•  Multiple operations per request
•  Integrate with backend systems
•  Graph global queries, schema introspection, etc.

[Diagram: Application, Driver, Load balancer, Procedures; Cypher/Bolt and REST API connections]

https://github.com/neo4j-contrib/neo4j-apoc-procedures

Embedded
•  Host Neo4j in application's Java process
•  Access to Neo4j's Java APIs

[Diagram: Application, Java APIs]

Hardware

CPU
•  Intel Core i3 (minimum)
•  Intel Core i7 (recommended)
•  Neo4j scales with the number of cores
•  Requires Enterprise to scale beyond 4 cores

Disk
•  SLC (single-level cell) SSD w/ SATA
•  ext4 (recommended), ZFS
•  Increase permitted number of open files to 40,000+

Memory
•  Lots of RAM (for heap + page cache)
•  8-12 GB heap (up to 24 GB)
•  Explicitly set page cache to (store size + 10% + headroom)
–  Otherwise defaults to 50% of RAM minus heap size (75% pre 2.3)

dbms.memory.pagecache.size=10g

neo4j.conf
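The page cache rule of thumb above (store size + 10% + headroom) can be sketched as a small calculation. This is only an illustration: the function names and the 1 GB default headroom are assumptions, not values from the talk.

```python
import math


def pagecache_size_gb(store_size_gb: float, headroom_gb: float = 1.0) -> float:
    """Return store size + 10% + headroom, in GB (headroom default is an assumption)."""
    return store_size_gb * 1.10 + headroom_gb


def pagecache_setting(store_size_gb: float, headroom_gb: float = 1.0) -> str:
    """Format the result as a neo4j.conf line, rounded up to whole gigabytes."""
    size = math.ceil(pagecache_size_gb(store_size_gb, headroom_gb))
    return f"dbms.memory.pagecache.size={size}g"


if __name__ == "__main__":
    # An 8 GB store with 1 GB of headroom.
    print(pagecache_setting(8.0, 1.0))
```

Rounding up errs on the side of keeping the whole store in memory; the heap still has to fit alongside it in RAM.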

Software

Java
•  OpenJDK 8 or Oracle Java 8
•  IBM JDK 8 on POWER8
•  G1 garbage collector
•  Default from 2.3
•  JDK 1.7.0_71 or later

Operating System
•  Linux
•  HP-UX
•  Windows 2012

wrapper.java.additional=-XX:+UseG1GC

neo4j-wrapper.conf (pre 2.3)

EC2 Instances
•  HVM (hardware virtual machine) over PV (paravirtual)
•  C3 or C4 (compute-optimized)
•  E.g. c4.2xlarge (15 GiB RAM, 8 vCPU, 1000 Mbps EBS throughput)
•  R3 (memory-optimized)
•  E.g. r3.xlarge (30.5 GiB RAM, 4 vCPU)
•  Not EBS-optimized by default
•  Use HA clustering and online backups for increased durability
•  Distribute cluster across Availability Zones in a Region

Local Storage
•  SSD or HDD
•  Highest I/O performance
•  Included in virtual server
•  Up to 8 x 800 GB SSD (i2.8xlarge) or 24 x 2000 GB HDD (d2.8xlarge)
•  Lost when EC2 instance is terminated

Elastic Block Store (EBS)
•  Attached to EC2 instance via network connection
•  Up to 16 TB SSD
•  Persists even if EC2 instance is terminated
•  Use EBS-optimized EC2 instances for dedicated throughput to EBS
•  Provisioned IOPS (io1) for predictable performance
•  Up to 30 IOPS per GiB
–  E.g. 300 GiB volume, 9000 IOPS

HA Architecture

[Diagram: three Neo4j HA instances (Instance 1, Instance 2, Instance 3), each comprising Database, Transaction Propagation and Cluster Management; one instance is the Master]

Cluster Configuration

Joining Cluster
•  ha.initial_hosts (neo4j.conf)
•  List of servers to contact when joining cluster
•  All hosts must be available when starting instance
•  For large clusters, supply only a small number of hosts, e.g. 3

Pull and Push Transactions
•  ha.pull_interval=10s (off by default)
•  ha.tx_push_factor=1 (default, but best efforts only)

Tuning
•  ha.heartbeat_timeout=11s (default)
•  Heartbeats sent, by default, every 5s
•  Increase timeouts if pauses cause heartbeats to be delayed
•  Warning: it will take longer to discover an instance has failed
•  ha.role_switch_timeout=120s (default)
•  Increase if new instances time out while catching up with master on startup

HA Role Endpoints – Useful for Load Balancing

Endpoint                          State     Status Code     Body
/db/manage/server/ha/master       Master    200 OK          true
/db/manage/server/ha/master       Slave     404 Not Found   false
/db/manage/server/ha/master       Unknown   404 Not Found   UNKNOWN
/db/manage/server/ha/slave        Master    404 Not Found   false
/db/manage/server/ha/slave        Slave     200 OK          true
/db/manage/server/ha/slave        Unknown   404 Not Found   UNKNOWN
/db/manage/server/ha/available    Master    200 OK          master
/db/manage/server/ha/available    Slave     200 OK          slave
/db/manage/server/ha/available    Unknown   404 Not Found   UNKNOWN

From 2.3 onwards:

dbms.security.ha_status_auth_enabled=false

neo4j.conf
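As a rough sketch of how a monitoring script might consume the /db/manage/server/ha/available endpoint, the Python below maps the status code and body listed above to a role name. The function names and the 2-second timeout are illustrative assumptions, and the sketch assumes the endpoint can be reached without authentication (for example with dbms.security.ha_status_auth_enabled=false).

```python
import urllib.error
import urllib.request


def interpret_available(status_code: int, body: str) -> str:
    """Map a response from /db/manage/server/ha/available to a role name."""
    role = body.strip().lower()
    if status_code == 200 and role in ("master", "slave"):
        return role
    return "unknown"


def fetch_role(base_url: str, timeout: float = 2.0) -> str:
    """Query one instance; network failures are reported as 'unknown'."""
    url = base_url.rstrip("/") + "/db/manage/server/ha/available"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return interpret_available(resp.status, resp.read().decode("utf-8"))
    except urllib.error.HTTPError as err:
        # Non-2xx answers (such as 404 for an unknown state) arrive as HTTPError.
        return interpret_available(err.code, err.read().decode("utf-8"))
    except (urllib.error.URLError, OSError):
        return "unknown"
```

A call such as fetch_role("http://10.0.1.10:7474") would then return "master", "slave" or "unknown"; the address is only a placeholder.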

HA JMX Endpoint

JSON Response
•  Alive?
•  Role
•  Last committed transaction ID
•  Instances in cluster
•  Role
•  Instance ID
•  Available?
•  URI

Use it to:
•  Identify slaves falling behind
•  Check whether everyone agrees on the composition of the cluster

/db/manage/server/jmx/domain/org.neo4j/instance%3Dkernel%230%2Cname%3DHigh%20Availability
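One way to act on the "slaves falling behind" check is to compare each instance's last committed transaction ID with the master's. The sketch below assumes those IDs have already been extracted from each instance's JMX response into a plain dictionary; the JSON parsing is deliberately left out because the response layout is not shown in the slides, and the function name and lag threshold are hypothetical.

```python
from typing import Dict, List


def lagging_instances(last_tx_ids: Dict[str, int], master_id: str, max_lag: int) -> List[str]:
    """Return the instances whose last committed tx ID trails the master by more than max_lag."""
    master_tx = last_tx_ids[master_id]
    return sorted(
        instance
        for instance, tx_id in last_tx_ids.items()
        if instance != master_id and master_tx - tx_id > max_lag
    )


if __name__ == "__main__":
    # Hypothetical readings: instance "1" is the master.
    readings = {"1": 1000, "2": 998, "3": 700}
    print(lagging_instances(readings, master_id="1", max_lag=100))
```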

Cross-DC Clusters

•  Same subnet (consider using a VPN)
•  Bandwidth between DCs aligned with write throughput
•  Common practice: instances in secondary run as slave-only
•  Restricts master election to the primary
•  When failing over, reconfigure instances in secondary

ha.slave_only=true

neo4j.conf

ha.slave_only=false

neo4j.conf

Scale Horizontally for High Read Throughput

[Diagram: Application, Load Balancer (e.g. HAProxy, ELB, NGINX), Master, Slave, Slave]

[Diagram: Application, Read Load Balancer, Write Load Balancer, Master, Slave, Slave]

HAProxy Configuration

http://blog.armbruster-it.de/2015/08/neo4j-and-haproxy-some-best-practices-and-tricks/

Configure HAProxy as Read Load Balancer

global
    daemon
    maxconn 256

defaults
    mode http
    timeout connect 5000ms
    timeout client 50000ms
    timeout server 50000ms

frontend http-in
    bind *:80
    default_backend neo4j-slaves

backend neo4j-slaves
    option httpchk GET /db/manage/server/ha/slave
    server s1 10.0.1.10:7474 maxconn 32 check
    server s2 10.0.1.11:7474 maxconn 32 check
    server s3 10.0.1.12:7474 maxconn 32 check

listen admin
    bind *:8080
    stats enable


Responses to the /db/manage/server/ha/slave health check:

Master    404 Not Found   false
Slave     200 OK          true
Unknown   404 Not Found   UNKNOWN

Improve Read Performance with Cache Sharding

[Diagram: Application, Load Balancer, instances 1, 2 and 3]

MATCH (c:Country{name:'Australia'})...
MATCH (c:Country{name:'Zambia'})...
MATCH (c:Country{name:'Norway'})...

Cache Sharding Using Consistent Routing

[Diagram: Application, Load Balancer, instances 1, 2 and 3]

Routing ranges: A-I to instance 1, J-R to instance 2, S-Z to instance 3

MATCH (c:Country{name:'Australia'})...   (A-I, instance 1)
MATCH (c:Country{name:'Norway'})...      (J-R, instance 2)
MATCH (c:Country{name:'Zambia'})...      (S-Z, instance 3)
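The A-I / J-R / S-Z split on this slide can be sketched as a tiny routing function. This is only an illustration of the idea in application code: the names are hypothetical and a load balancer would normally do this for you, as the HAProxy configuration that follows does with a hash of a URL parameter rather than letter ranges.

```python
# Letter ranges and instance numbers as shown on the slide.
RANGES = [("A", "I", 1), ("J", "R", 2), ("S", "Z", 3)]


def route(country_name: str) -> int:
    """Return the instance number responsible for a country name."""
    first = country_name.strip()[:1].upper()
    for low, high, instance in RANGES:
        if low <= first <= high:
            return instance
    raise ValueError(f"no route for {country_name!r}")


if __name__ == "__main__":
    for name in ("Australia", "Zambia", "Norway"):
        print(name, "->", route(name))
```

Because the same name always lands on the same instance, each instance's cache only needs to hold its own slice of the graph.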

Configure HAProxy for Cache Sharding

global
    daemon
    maxconn 256

defaults
    mode http
    timeout connect 5000ms
    timeout client 50000ms
    timeout server 50000ms

frontend http-in
    bind *:80
    default_backend neo4j-slaves

backend neo4j-slaves
    balance url_param country_code
    server s1 10.0.1.10:7474 maxconn 32
    server s2 10.0.1.11:7474 maxconn 32
    server s3 10.0.1.12:7474 maxconn 32

listen admin
    bind *:8080
    stats enable


Backups

Modes
•  Full
•  Incremental
•  On top of a previous backup
•  Uses logical logs to apply changes, so logs must be kept at least 2x backup interval

Consistency Check
•  Part of full backup and standalone tool
•  Evaluate store health
•  -verify false to disable in backup

dbms.tx_log.rotation.retention_policy=7 days (default)

neo4j.conf
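The "logs must be kept at least 2x backup interval" rule is simple arithmetic, sketched below as a sanity check a deployment script might run. The function name is hypothetical, and the check assumes the interval in question is the time between incremental backups.

```python
from datetime import timedelta


def retention_is_sufficient(retention: timedelta, backup_interval: timedelta) -> bool:
    """True if transaction logs are retained for at least twice the backup interval."""
    return retention >= 2 * backup_interval


if __name__ == "__main__":
    # Default retention of 7 days against hourly incremental backups.
    print(retention_is_sufficient(timedelta(days=7), timedelta(hours=1)))
```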

Backup Strategies

•  Local or remote backups
•  If backing up to remote machine, consistency check takes place offline with respect to the database
•  Backup from a dedicated slave or round robin
•  Choose a schedule:
•  Full once per day, incremental every hour
•  To restore from backup:
•  Stop instance
•  Replace graph.db with backup
•  Start instance

[Diagram: Backup Server taking backups from instances A, B and C]

A – full, consistency check
B – full, consistency check
C – full, consistency check
A – incremental
B – incremental
C – incremental
…
A – incremental
B – incremental
C – incremental
A – full, consistency check
B – full, consistency check
C – full, consistency check

bin/neo4j-backup \
    -from single://neo4j.example.org:20000 \
    -to /backups/201510151318263/graph.db \
    -verify true

Use -verify false to skip the consistency check.
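A scheduler for the rotation above could wrap the neo4j-backup command line shown on the slide. The sketch below only builds and runs that command; the function names are hypothetical, and it assumes, as I understand the tool, that pointing -to at a directory holding an earlier backup is what makes the run incremental.

```python
import subprocess
from typing import List


def backup_command(host: str, target_dir: str, port: int = 20000, verify: bool = True) -> List[str]:
    """Build the neo4j-backup argument list using the flags shown on the slide."""
    return [
        "bin/neo4j-backup",
        "-from", f"single://{host}:{port}",
        "-to", target_dir,
        "-verify", "true" if verify else "false",
    ]


def run_backup(host: str, target_dir: str, verify: bool = True) -> None:
    """Run the backup from the Neo4j home directory; raises CalledProcessError on failure."""
    subprocess.run(backup_command(host, target_dir, verify=verify), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch is safe to execute anywhere.
    print(" ".join(backup_command("neo4j.example.org", "/backups/201510151318263/graph.db")))
```

A daily job might call run_backup with verify=True for the full backup and an hourly job with verify=False for the incrementals, matching the schedule listed earlier.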

Monitoring

Pull
•  Metrics available via JMX and HTTP and in browser

Push
•  Metrics publishing from 2.3 onwards (Enterprise)
•  Node, relationship, property counts
•  Network/cluster
•  Transactions (active, started, committed, rolled back, etc.)
•  Neo4j page cache (page faults, evictions, flushes, exceptions)
•  JVM
•  Published to:
•  Graphite
•  Ganglia
•  CSV

metrics.graphite.enabled=true
metrics.graphite.server=52.29.63.174:2003
metrics.prefix=neo4j-1

neo4j.conf

Collate Internal and External Views of the System

System
•  collectd

Database
•  Metrics
•  Tail neo4j.log

HA Endpoints
•  /db/manage/server/ha/master
•  /db/manage/server/ha/slave

Server Latencies
•  http.log

Cypher Queries
•  dbms.logs.query.enabled=true
•  dbms.logs.query.threshold=2s

Application metrics
•  End-to-end latencies

Test at Scale

Soak Tests
•  Representative dataset and queries
•  Peak load and above

Verify
•  Correctness
•  Performance
•  Latency
•  Throughput
•  Stability

Operations
•  Backup
•  Disaster recovery
•  Replace instances

Performance Tip – Use the Cypher Query Planner

[Query plan comparison: 8,386,880 hits vs. 59,272 hits]

CREATE INDEX ON :Crime(description)

Performance Tip – Write Requests

•  Align the number of concurrent write requests with the number of Neo4j server threads on the master
•  By default, number of server threads = number of CPUs reported available by the JVM
•  Configure the number of threads in neo4j.conf using org.neo4j.server.webserver.maxthreads
•  Service requests from a thread pool in your application
•  Use the thread pool queue depth to apply back pressure
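The last two points, a thread pool whose queue depth provides back pressure, might look roughly like this on the application side. It is a minimal sketch: the class name, the semaphore-based bound and the timeout are assumptions for illustration, and the write callable stands in for whatever driver call the application makes.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable


class BoundedWriteExecutor:
    """Thread pool that rejects new writes once workers and queue are full."""

    def __init__(self, workers: int, queue_depth: int) -> None:
        self._pool = ThreadPoolExecutor(max_workers=workers)
        # One permit per running or queued task.
        self._permits = threading.BoundedSemaphore(workers + queue_depth)

    def submit(self, write: Callable[[], object], timeout: float = 1.0) -> Future:
        """Queue a write, or raise RuntimeError if no slot frees up within timeout."""
        if not self._permits.acquire(timeout=timeout):
            raise RuntimeError("write queue full; shed load or retry later")
        try:
            future = self._pool.submit(write)
        except BaseException:
            self._permits.release()
            raise
        # Return the permit when the task finishes, whether it succeeded or failed.
        future.add_done_callback(lambda _f: self._permits.release())
        return future

    def shutdown(self) -> None:
        self._pool.shutdown(wait=True)
```

Setting workers to the number of server threads on the master keeps the number of in-flight writes aligned with what the master can service, and the rejection gives callers a clear signal to slow down.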

Performance Tip – Batch Writes Using a Queue

[Diagram: multiple Write requests, Queue, Single Thread, Batch]

http://maxdemarzi.com/2013/09/05/scaling-writes/
http://maxdemarzi.com/2014/07/01/scaling-concurrent-writes-in-neo4j/
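The queue-plus-single-thread pattern can be sketched in a few lines. This is not the implementation from the linked posts: the class name, batch size and wait time are illustrative assumptions, and commit is a placeholder for a function that writes one batch in a single transaction.

```python
import queue
import threading
from typing import Any, Callable, List

_STOP = object()  # sentinel telling the writer thread to finish


class BatchWriter:
    """Collect writes from many threads and commit them in batches from one thread."""

    def __init__(self, commit: Callable[[List[Any]], None],
                 batch_size: int = 100, max_wait: float = 0.1) -> None:
        self._commit = commit
        self._batch_size = batch_size
        self._max_wait = max_wait
        self._queue: "queue.Queue[Any]" = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def write(self, item: Any) -> None:
        """Called by request threads; returns immediately."""
        self._queue.put(item)

    def close(self) -> None:
        """Flush everything queued so far and stop the writer thread."""
        self._queue.put(_STOP)
        self._thread.join()

    def _run(self) -> None:
        stopping = False
        while not stopping:
            item = self._queue.get()  # block until the first item of a batch
            if item is _STOP:
                break
            batch = [item]
            while len(batch) < self._batch_size:
                try:
                    item = self._queue.get(timeout=self._max_wait)
                except queue.Empty:
                    break  # no more work for now; commit what we have
                if item is _STOP:
                    stopping = True
                    break
                batch.append(item)
            self._commit(batch)
```

Larger batches amortise transaction overhead, while max_wait caps how long a lone write sits in the queue before it is committed.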

Performance Tip – JVM

•  Look for GC pauses in debug.log
•  grep blocked data/databases/graph.db/debug.log
•  Caused by
•  Heap too small
•  New/survivor space too small
•  Badly written Cypher query or stored procedure
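The grep above can be extended into a small script that pulls out the pause durations. The sketch assumes the relevant lines contain the phrase "blocked for <n>ms"; that wording is an assumption about the log format, so check it against your own debug.log and adjust the pattern if needed.

```python
import re
from typing import Iterable, List

# Assumed wording of the GC pause message; adapt to the actual log lines.
BLOCKED = re.compile(r"blocked for (\d+)\s*ms")


def gc_pauses_ms(lines: Iterable[str], threshold_ms: int = 0) -> List[int]:
    """Return the pause durations (in ms) at or above threshold_ms, in log order."""
    pauses = []
    for line in lines:
        match = BLOCKED.search(line)
        if match and int(match.group(1)) >= threshold_ms:
            pauses.append(int(match.group(1)))
    return pauses


if __name__ == "__main__":
    # Hypothetical sample lines for illustration only.
    sample = [
        "2016-04-26 10:00:01 WARN GC Monitor: Application threads blocked for 288ms.",
        "2016-04-26 10:00:02 INFO Checkpoint completed",
        "2016-04-26 10:00:09 WARN GC Monitor: Application threads blocked for 1520ms.",
    ]
    print(gc_pauses_ms(sample, threshold_ms=500))
```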

Enable GC Logging

Log will be written to logs/neo4j-gc.log

wrapper.java.additional=-Xloggc:logs/neo4j-gc.log
wrapper.java.additional=-XX:+PrintGCDetails
wrapper.java.additional=-XX:+PrintGCDateStamps
wrapper.java.additional=-XX:+PrintGCApplicationStoppedTime
wrapper.java.additional=-XX:+PrintTenuringDistribution
wrapper.java.additional=-XX:+PrintGCCause

neo4j-wrapper.conf

Thank You