CouchConf Israel 2013: Couchbase Server in Production


Couchbase Server 2.0 in Production

Perry Krug, Sr. Solutions Architect

Typical Couchbase production environment

Application users → Load Balancer → Application Servers → Couchbase Servers


We’ll focus on App-Couchbase interaction …


… at each step of the application lifecycle

Dev/Test → Size → Deploy → Monitor → Manage


KEY CONCEPTS

Couchbase Single Node Architecture

[Diagram] Each node runs two components side by side:
• Data Manager: object-managed cache, storage engine, and query engine; data access on ports 11210/11211, query API on port 8092 (HTTP)
• Cluster Manager (Erlang/OTP): replication, rebalance, and shard state manager; REST management API/Web UI (admin console) on port 8091

Couchbase Single Node: Data Flow

[Diagram] A write from the app server lands in the managed cache and is acknowledged there, then flows through per-purpose queues in the background:
• Disk queue → local disk
• Replication queue → other nodes
• XDCR queue → other clusters
• View engine (indexing)
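A minimal Python model of this flow (illustrative only, not Couchbase code): the write is acknowledged from the managed cache and then fanned out to the background queues.

```python
from collections import deque

class Node:
    def __init__(self):
        self.cache = {}                   # object-managed cache
        self.disk_queue = deque()         # -> local storage engine
        self.replication_queue = deque()  # -> other nodes in the cluster
        self.xdcr_queue = deque()         # -> other clusters

    def write(self, key, doc):
        self.cache[key] = doc             # client is acknowledged here
        # the queues are drained asynchronously in the real system
        for q in (self.disk_queue, self.replication_queue, self.xdcr_queue):
            q.append((key, doc))

node = Node()
node.write("doc1", {"name": "example"})
print(len(node.disk_queue))  # 1 item awaiting persistence
```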

Couchbase Deployment

[Diagram] Each web application embeds the Couchbase client library. Data flows directly between the client library and the Couchbase Server nodes; cluster management and replication flow between the server nodes themselves.

COUCHBASE SERVER CLUSTER

Couchbase in a Cluster

User-configured replica count = 1

[Diagram] Three servers, each holding active documents and replica documents. Every document has one active copy and one replica copy spread across the nodes (e.g., Doc 3 is active on Server 2 and a replica on Server 1). Each app server's Couchbase client library holds a cluster map and sends reads/writes/updates and queries directly to the node holding the active copy.
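To illustrate how that direct routing works, here is a hedged Python sketch. Couchbase does use 1,024 vBuckets and a CRC32-based key hash, but the exact hash handling and the three-server map below are simplifications; the real mapping is defined by the client library and cluster.

```python
import zlib

NUM_VBUCKETS = 1024  # fixed vBucket count in Couchbase

servers = ["server1:11210", "server2:11210", "server3:11210"]
# Simplified cluster map: vbucket -> [active server index, replica server index]
cluster_map = {vb: [vb % 3, (vb + 1) % 3] for vb in range(NUM_VBUCKETS)}

def route(key: str) -> str:
    """Hash the key to a vBucket (simplified CRC32 scheme) and
    return the server holding its active copy."""
    vb = zlib.crc32(key.encode("utf-8")) % NUM_VBUCKETS
    return servers[cluster_map[vb][0]]

print(route("user::1234"))
```

When nodes are added or removed, only the cluster map changes; keys keep hashing to the same vBuckets, which is what makes rebalancing transparent to the application.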

NODE AND CLUSTER SIZING

Dev/Test → Size → Deploy → Monitor → Manage

Size Couchbase Server

Sizing == performance:
• Serve reads out of RAM
• Enough I/O for writes and disk operations
• Mitigate inevitable failures

[Diagram] Reading data: the app server asks a node "Give me document A" and receives it from RAM. Writing data: the app server asks "Please store document A" and the node acknowledges "OK, I stored document A".

[Diagram] Application servers → network → Couchbase servers: scaling out permits matching of aggregate flow rates so queues do not grow.

How many nodes?

5 key factors determine the number of nodes needed:
1) RAM
2) Disk
3) CPU
4) Network
5) Data distribution/safety

RAM sizing

1) Total RAM:
• Managed document cache:
  • Working set
  • Metadata
  • Active + replicas
• Index caching (I/O buffer)

Keep the working set in RAM for the best read performance.

The working set ratio depends on your application:

• Late-stage social game: many users are no longer active; few are logged in at any given time → working/total set ≈ 0.01
• Ad network: any cookie can show up at any time → working/total set ≈ 1
• Business application: users are logged in during the day, and the day moves around the globe → working/total set ≈ 0.33
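For a back-of-the-envelope feel for how the working set ratio drives RAM needs, here is a minimal Python sketch. The constants (64 bytes of metadata per document, 30% headroom) are illustrative assumptions, not Couchbase-published figures.

```python
def ram_needed_gb(num_docs, avg_value_bytes, working_set_ratio,
                  replicas=1, metadata_bytes=64, headroom=0.30):
    """Rough RAM estimate: metadata for ALL docs stays in RAM,
    only the working set of values needs caching, replicas
    multiply both, and headroom covers overhead above quota."""
    copies = 1 + replicas
    meta = num_docs * metadata_bytes * copies
    values = num_docs * avg_value_bytes * working_set_ratio * copies
    return (meta + values) * (1 + headroom) / 1024**3

# e.g. 100M docs, 2KB values, ad-network-style working set of 100%
print(round(ram_needed_gb(100_000_000, 2048, 1.0), 1), "GB")
```

Rerunning with working_set_ratio=0.01 (the social-game case) shows why two applications with identical datasets can need very different amounts of RAM.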

RAM sizing – working set in the managed cache

As memory fills, some cached data is ejected from RAM to make space:
• Active and replica data share RAM
• Ejection is threshold-based (NRU, favoring active data)
• Only cleanly persisted data can be "ejected"
• Only data values can be "ejected", which means RAM can fill up with metadata

RAM sizing – view/index cache (disk I/O)

• File system cache availability for the index has a big impact on performance
• Test runs were based on 10 million items with a 16GB bucket quota, and 4GB vs. 8GB of system RAM available for indexes
• Results show that doubling system cache availability halves query latency and increases throughput by 50%
• Leave RAM free for the file system cache by keeping quotas below physical RAM

Disk sizing: space and I/O

2) Disk
I/O:
• Sustained write rate
• Rebalance capacity
• Backups
• XDCR
• Compaction
Space:
• Total dataset (active + replicas + indexes)
• Append-only file format

[Diagram] Writing data: the app server stores document A; the node acknowledges the write, then persists it to disk.

Disk sizing: I/O

Factors impacting the disk I/O needed:
• Peak write load
• Sustained write load
• Compaction
• XDCR
• Views/indexing

Configurable paths/partitions for data and indexes allow separation of space and I/O.

Disk sizing: space

Factors impacting the amount of disk space needed:
• Total data set
• Indexes
• Overhead for compaction (~3x): both data and indexes are append-only

Configurable paths/partitions for data and indexes allow separation of space and I/O.

Disk sizing: impact of views on I/O and space

• Number of design documents
  • Extra space for each DD
  • Extra I/O to process each DD
  • Segregate views by DD
• Complexity of views (I/O)
• Amount of view output (space)
  • Emit as little as possible
  • The doc ID is automatically included
• Use development views and extrapolate

Disk sizing: append-only

• The append-only file format puts all new/updated/deleted items at the end of the on-disk file
  – Better performance and reliability
  – No more fragmentation!
• This can leave invalidated data in the "back" of the file
• So the data needs to be compacted

Disk compaction

Initial file layout:  Doc A | Doc B | Doc C
Update some data:     Doc A | Doc B | Doc C | Doc A' | Doc D | Doc B' | Doc A''
After compaction:     Doc C | Doc B' | Doc A'' | Doc D
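A toy Python model of the example above, assuming a simple list of (key, version) records standing in for the on-disk file:

```python
def compact(append_only_file):
    """Keep only the latest version of each key; everything else
    in the 'back' of the file is invalidated and reclaimed."""
    latest = {}
    for key, version in append_only_file:  # later records win
        latest[key] = version
    return list(latest.items())

f = [("A", 1), ("B", 1), ("C", 1)]             # initial layout
f += [("A", 2), ("D", 1), ("B", 2), ("A", 3)]  # updates appended
print(compact(f))  # [('A', 3), ('B', 2), ('C', 1), ('D', 1)]
```

The real compactor rewrites the live records into a new file and swaps it in, which is why extra disk space is needed while compaction runs.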

Disk compaction

• Compaction happens automatically:
  – Settings for the "threshold" of stale data
  – Settings for time of day
  – Split by data and index files
  – Per-bucket or global
• Reduces the size of on-disk files – data files AND index files
• Temporarily increased disk I/O and CPU, but no downtime!

CPU sizing

3) CPU
• Disk writing
• Views/compaction/XDCR
• RAM read/write performance is not impacted

1.8 used VERY little CPU. Under the same workloads, 2.0 should not be much different.

New 2.0 features will require more CPU.

Network sizing

4) Network
• Client traffic (reads + writes)
• Replication (multiplies writes)
• Rebalancing
• XDCR

Consistent low latency with varying doc sizes

[Figure] Consistently low latencies, in microseconds, for varying document sizes with a mixed workload.

Data Distribution

5) Data distribution/safety (assuming one replica):
• 1 node = BAD
• 2 nodes = …better…
• 3+ nodes = BEST!

Note: Many applications will need more than 3 nodes.

Servers fail; be prepared. The more nodes, the less impact a failure will have.

How many nodes? (recap)

New 2.0 features will affect sizing requirements:
• Views/indexing/querying
• XDCR
• Append-only file format

5 key factors still determine the number of nodes needed (a rough sizing sketch follows this list):
1) RAM
2) Disk
3) CPU
4) Network
5) Data distribution
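To make the recap concrete, here is a minimal node-count sketch in Python combining only the RAM and disk factors. The per-node quota, disk size, and ~3x compaction overhead are illustrative assumptions; CPU, network, and data-safety requirements can each raise the answer further.

```python
import math

def nodes_needed(cluster_ram_gb, per_node_quota_gb,
                 data_gb, per_node_disk_gb,
                 compaction_overhead=3.0, min_nodes=3):
    """Take the max of the per-resource node counts, with a floor
    of 3 nodes for data distribution/safety."""
    by_ram = math.ceil(cluster_ram_gb / per_node_quota_gb)
    # append-only files need ~3x the dataset for compaction headroom
    by_disk = math.ceil(data_gb * compaction_overhead / per_node_disk_gb)
    return max(by_ram, by_disk, min_nodes)

print(nodes_needed(cluster_ram_gb=180, per_node_quota_gb=48,
                   data_gb=400, per_node_disk_gb=800))  # -> 4
```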

MONITORING

Dev/Test → Size → Deploy → Monitor → Manage

Key resources: RAM, Disk, Network, CPU

[Diagram] Each application server talks over the network to each Couchbase server; every server's RAM and disk, and the network between them, must keep up.

Monitoring

Once in production, the heart of operations is monitoring:
• RAM usage
• Disk space and I/O: write queues / read activity / indexing
• Network bandwidth, replication queues
• CPU usage
• Data distribution (balance, replicas)

Monitoring

An IMMENSE amount of information is available:
• Real-time traffic graphs
• Accessible via REST API
• Per-bucket, per-node, and aggregate statistics
• Application and inter-node traffic
• RAM <-> disk
• Inter-system timing

Key stats to monitor

• Working set doesn't fit in RAM
  – Cache miss rate / disk fetches
• Disk I/O not keeping up
  – Disk write queue size
• Internal replication lag
  – TAP queues
• Indexing not keeping up
• XDCR lag

(A sketch of polling these stats over the REST API follows this list.)
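These stats are reachable over the REST API mentioned above. A minimal polling sketch in Python — the host, bucket, credentials, and exact stat names (ep_bg_fetched, ep_queue_size) are assumptions to verify against your version's documentation:

```python
import requests  # third-party HTTP client

BASE = "http://cb-node1:8091"            # hypothetical node
AUTH = ("Administrator", "password")     # placeholder credentials

# Per-bucket stats come back as arrays of recent samples
samples = requests.get(
    f"{BASE}/pools/default/buckets/default/stats", auth=AUTH
).json()["op"]["samples"]

gets = sum(samples["cmd_get"]) or 1      # avoid division by zero
misses = sum(samples["ep_bg_fetched"])   # reads fetched from disk
print("cache miss rate: %.2f%%" % (100.0 * misses / gets))
print("disk write queue:", samples["ep_queue_size"][-1])
```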


MANAGEMENT AND MAINTENANCE

Dev/Test → Size → Deploy → Monitor → Manage

Management/Maintenance

• Scaling

• Upgrading/Scheduled maintenance

• Backup/Restore

• Dealing with Failures


Scaling

Couchbase Scales out Linearly:

Need more RAM? Add nodes…

Need more Disk IO or space? Add nodes…

Couchbase also makes it easy to scale up by swapping in larger nodes for smaller ones, without any disruption.

Couchbase + Cisco + Solarflare

[Figure] Operations per second vs. number of servers in the cluster: linear throughput scalability, with high throughput at a 1.4 GB/sec data transfer rate using 4 servers.

Additional benchmark details

• Cluster of 8 nodes running Couchbase Server 1.8.0
• One server used as the client to run the workload
• The workload was Couchbase's streaming load generator
• GET and SET operations were performed in a 70:30 ratio

Test system and parameters:
• Couchbase Server 1.8.0
• Cisco Nexus 5548UP switch
• Solarflare SFN5122F 10 Gigabit Ethernet Enhanced Small Form-Factor Pluggable (SFP+) server adapters
• Solarflare OpenOnload
• Servers: nine Cisco UCS C200 M2 High-Density Rack Servers with Intel Xeon X5670 six-core 2.93-GHz CPUs, running Red Hat Enterprise Linux (RHEL) 5.5 x86 64-bit, with 100 GB RAM and four 2-TB hard drives

Upgrade

Upgrade an existing Couchbase Server 1.8 cluster to Couchbase Server 2.0:

1. Add nodes of the new version, rebalance…
2. Remove nodes of the old version, rebalance…
3. Done!

No disruption. The same approach works generally for software upgrades, hardware refreshes, and planned maintenance.

Easy to maintain Couchbase

• Use remove+rebalance on a "malfunctioning" node:
  – Protects data distribution and "safety"
  – Replicas are recreated
  – Best to "swap" with a new node to maintain capacity and move the minimal amount of data

Backup

[Diagram] cbbackup reads from each server over the network and writes the cluster's data to backup data files.

Restore

[Diagram] cbrestore reads the backup data files and restores them into a live (and possibly different) cluster. A sketch of both tools follows.
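Representative invocations of both tools, wrapped in Python for consistency with the other sketches. The host, backup path, bucket name, and credentials are placeholders, and the flags reflect the Couchbase 2.0-era tools; check them against your version's docs.

```python
import subprocess

# Back up a cluster to a local directory
subprocess.run(["cbbackup", "http://cb-node1:8091", "/backups/cb",
                "-u", "Administrator", "-p", "password"], check=True)

# Restore the backup into a (possibly different) live cluster
subprocess.run(["cbrestore", "/backups/cb", "http://cb-node1:8091",
                "-b", "default",           # bucket to restore into
                "-u", "Administrator", "-p", "password"], check=True)
```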


Failures Happen!

Hardware. Network. Bugs.

Easy to manage failures with Couchbase

• Failover (automatic or manual; a manual sketch follows this list):
  – Replica data and indexes are promoted for immediate access
  – Replicas are NOT recreated
  – Do NOT fail over a healthy node
  – Perform a rebalance after returning the cluster to full or greater capacity
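For the manual case, failover and the follow-up rebalance can be driven over the REST API. A hedged Python sketch — the host, credentials, and otpNode names are assumptions, and the endpoints should be verified against your version's REST documentation:

```python
import requests

BASE = "http://cb-node1:8091"
AUTH = ("Administrator", "password")

# Fail over an unhealthy node (never a healthy one)
requests.post(f"{BASE}/controller/failOver", auth=AUTH,
              data={"otpNode": "ns_1@cb-node3"}).raise_for_status()

# After restoring capacity, rebalance the cluster
nodes = [n["otpNode"] for n in
         requests.get(f"{BASE}/pools/default", auth=AUTH).json()["nodes"]]
requests.post(f"{BASE}/controller/rebalance", auth=AUTH,
              data={"knownNodes": ",".join(nodes),
                    "ejectedNodes": ""}).raise_for_status()
```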

Fail Over

[Diagram] A five-server cluster: Server 3 fails, and the replicas of its active documents (e.g., Doc 7, Doc 1) held on the surviving servers are promoted to active. The app servers' client libraries pick up the updated cluster map and route around the failed node.

Conclusion

Dev/Test → Size → Deploy → Monitor → Manage


Want more?

Lots of details and best practices in our documentation:

http://www.couchbase.com/docs/


QUESTIONS?

perry@couchbase.com / @perrykrug
