Operationalizing the Value of MongoDB: The MetLife Experience

Upload: mongodb

Post on 05-Dec-2014

DESCRIPTION

It was a lot of fun bringing exciting emerging technology into the rigid enterprise infrastructure eco-system. And then the real work began. How do you make the new technology operational? Learn from MetLife’s journey of operationalizing MongoDB to the level compliant with large enterprise requirements in High Availability, Recoverability, Security, Monitoring, Alerting, Workload management and Automation.

TRANSCRIPT

Page 1: Operationalizing the Value of MongoDB: The MetLife Experience

Operationalizing value of MongoDB

(MetLife experience)

Thrills and challenges of building MongoDB operations in a large enterprise

Page 2

A Journey

When new technology meets enterprise standards:
- advantages and restrictions of large enterprises
- it is always a journey
- decisions we have to live with

Page 3

Highly Successful Adoption of New Technology for a Fortune 50 Enterprise Organization

• Unknown technology
  – Proves to be capable

• New platform
  – Quickly matures

• Untested for the Enterprise
  – Delivers success

• Many new things to learn
  – Become experts in time

Page 4

Disclaimer

• The content in this presentation represents MetLife's choices and MongoDB Inc.’s recommendations for MetLife’s specific use case. By no means is this a “universal blueprint for success” and it doesn’t necessarily represent MongoDB Inc.'s recommendations for all use cases.

• In particular, because some fixed decisions predated the MongoDB implementation, MetLife's deployment may require some “manual intervention” (specifically in case of DR), whereas other, differently organized deployments might not.

Page 5

Introducing “The Wall”

Page 6

Basic System Architecture Decisions

• Company data center vs. public cloud placement
  Control vs. ease of use
  MetLife: Compliance requirements dictate company data center(s) placement.

• Server type and sizes
  Enterprise class servers vs. pizza boxes
  MetLife: More cost effective to run on enterprise class servers (2 x 8-core CPU, 512 GB RAM).

• Virtualization
  VM vs. “bare metal”
  MetLife: Data nodes are physical servers; configuration servers and MongoS are VMs.

• SAN vs. local storage
  Flexibility of SAN vs. performance of local storage
  MetLife: Local storage enclosures. 600 GB SAS drives.

• Network
  Dedicated LAN for MongoDB replication
  MetLife: No dedicated LANs for the MongoDB installation.

Page 7

Business Requirements and System Topology

Business requirements:
- mission critical application
- loss of an entire data center for an indefinite time should not limit the application functionality in any way
- significant data growth is expected, as well as a significant increase in the number of users

Drive system topology:

a. Geographic placement
   MetLife: Geographically dispersed cluster, spanning two data centers

b. Sharded cluster vs. replica set
   MetLife: Sharded cluster for elastic horizontal scalability

c. Number of nodes in the replica set
   MetLife: Minimum of 6 to ensure full operability in case of one data center loss.

d. Writes and reads geography
   MetLife: Business-function-driven write-concern implementation; reads are mostly “secondary preferred”
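Item (d) can be sketched as a small connection-string builder. This is only an illustration under stated assumptions: the host names, the business-function names and the write-concern value assigned to each function are hypothetical, since the slides say only that the choice is driven by business function. `w` and `readPreference` are standard MongoDB connection string options.

```python
from urllib.parse import urlencode

# Hypothetical mapping from business function to write concern ("w").
# The slides do not list the actual functions or values.
WRITE_CONCERN_BY_FUNCTION = {
    "policy_update": "majority",  # assumed: must survive a failover
    "activity_log": "1",          # assumed: primary acknowledgement is enough
}


def build_uri(mongos_hosts, business_function):
    """Return a MongoDB URI with secondaryPreferred reads and a per-function w."""
    options = {
        "readPreference": "secondaryPreferred",
        "w": WRITE_CONCERN_BY_FUNCTION[business_function],
    }
    return "mongodb://" + ",".join(mongos_hosts) + "/?" + urlencode(options)


if __name__ == "__main__":
    hosts = ["mongos1.example.com:27017", "mongos2.example.com:27017"]
    print(build_uri(hosts, "policy_update"))
```

Keeping the mapping in one table makes the write-concern decision reviewable per business function instead of scattering it through application code.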

Page 8

System topology

[Diagram: sharded cluster spanning Data Center 1 and Data Center 2. Components shown: Configuration Servers 1, 2 and 3; four MongoS servers; two Backup Solution boxes; and six replica set members labeled Primary Prod, Local Prod Replica 1, Local Prod Hidden Replica for backups, Remote Prod Replica 1, Remote Prod Replica 2 and Remote Prod Hidden Replica for backups. Each of the six members carries the label "2 SHARDS comprise this".]

Page 9

System Setup for Availability and DR

System has to comply with MetLife’s enterprise standard for availability and DR (No single points of failure):

a. Replica sets
   MetLife: 6-member replica sets (3 in each data center), 2 hidden replicas for backup purposes, 5 voting members (the hidden replica in the DR data center has 0 votes), and 2 replicas in the primary data center that have higher priority.

b. Mongo configuration servers
   MetLife: 3 configuration servers (2 in the primary data center and 1 in the DR data center). Loss of an entire data center halts cluster balancing, but not the application functionality.

c. MongoS
   MetLife: 4 MongoS servers (2 in each data center). All active.

d. Application server connectivity
   MetLife: MongoDB drivers on application servers are configured to use all MongoS, but in a different order, for pseudo load balancing.

e. DR exercise
   MetLife: A DR exercise is conducted yearly and includes all database and application infrastructure to ensure complete operability from the DR data center.
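The replica set layout in (a) and the MongoS ordering in (d) can be sketched as follows. This is a minimal sketch, not MetLife's actual configuration: host names, the `dc` tag and the concrete priority values (2 and 1) are assumptions, and the rotation rule is one hypothetical way to give each application server a different MongoS order. The member fields (`_id`, `host`, `hidden`, `priority`, `votes`, `tags`) are standard replica set configuration fields.

```python
def build_replica_set_config(name, primary_dc_hosts, dr_dc_hosts):
    """Build a replica set config; the last host in each list is the hidden one."""
    members = []
    for dc, hosts in (("primary", primary_dc_hosts), ("dr", dr_dc_hosts)):
        for i, host in enumerate(hosts):
            member = {"_id": len(members), "host": host, "tags": {"dc": dc}}
            if i == len(hosts) - 1:
                member["hidden"] = True
                member["priority"] = 0  # hidden members must have priority 0
                member["votes"] = 1 if dc == "primary" else 0
            else:
                member["priority"] = 2 if dc == "primary" else 1  # assumed values
                member["votes"] = 1
            members.append(member)
    return {"_id": name, "members": members}


def has_majority(config, lost_dc):
    """True if the members outside lost_dc still hold a majority of votes."""
    total = sum(m["votes"] for m in config["members"])
    left = sum(m["votes"] for m in config["members"] if m["tags"]["dc"] != lost_dc)
    return left > total // 2


def rotated_seed_list(mongos_hosts, app_server_index):
    """Give each application server the same MongoS hosts in a shifted order."""
    k = app_server_index % len(mongos_hosts)
    return mongos_hosts[k:] + mongos_hosts[:k]


if __name__ == "__main__":
    cfg = build_replica_set_config(
        "shard1",
        ["dc1-a:27017", "dc1-b:27017", "dc1-hidden:27017"],
        ["dc2-a:27017", "dc2-b:27017", "dc2-hidden:27017"],
    )
    print(sum(m["votes"] for m in cfg["members"]), "voting members")
    print("majority without DR site:", has_majority(cfg, "dr"))
    print("majority without primary site:", has_majority(cfg, "primary"))
    print(rotated_seed_list(["m1", "m2", "m3", "m4"], 1))
```

Under these assumptions, losing the primary data center leaves 2 of 5 votes, so no primary can be elected automatically. That reading would be consistent with the disclaimer's mention of manual intervention in the DR case.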

Page 10

System Set up for Recoverability

System has to comply with MetLife’s enterprise standard for recoverability:

Backup and recovery strategy.
MetLife:
- Daily backups in both data centers (alternating).
- Backups of hidden replicas are performed with mongod brought down. Balancer is stopped.
- Due to the database size, backup is performed at the file system level.
- At the same time, backup of the configuration server is performed using mongodump.

Current challenges.
MetLife:
- No point-in-time recovery
- No easy way to restore one specific database

Using MMS Backup solution.
MetLife:
- MMS Backup is capable of solving some of our current challenges.
- Due to compliance reasons, cannot use the MMS cloud backup solution in AWS
- Currently looking into an option of running the MMS Backup solution on premises
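The backup strategy above can be sketched as an ordered plan. This is an illustration only: the step order, host names, paths, the use of `tar` for the file system copy, and the day-parity rule for alternating data centers are all assumptions, not details from the slides. The code only builds the command strings; it does not run them.

```python
from datetime import date


def backup_data_center(day, data_centers=("DC1", "DC2")):
    """Pick the backup site by calendar day so consecutive days alternate (assumed rule)."""
    return data_centers[day.toordinal() % len(data_centers)]


def backup_plan(mongos, hidden_replica, config_server, dbpath, out_dir):
    """Return the shell commands for one backup run, in order."""
    return [
        # Stop the balancer so no chunk migrations run during the backup.
        f'mongo {mongos}/admin --eval "sh.stopBalancer()"',
        # Dump the configuration server metadata with mongodump.
        f"mongodump --host {config_server} --out {out_dir}/config",
        # Bring the hidden replica's mongod down cleanly.
        f'mongo {hidden_replica}/admin --eval "db.shutdownServer()"',
        # File-system-level copy, run on the hidden replica host (tar is assumed).
        f"tar -czf {out_dir}/data.tar.gz -C {dbpath} .",
        # Restart mongod on the hidden replica host (config path is assumed).
        "mongod --config /etc/mongod.conf",
        # Re-enable the balancer.
        f'mongo {mongos}/admin --eval "sh.startBalancer()"',
    ]


if __name__ == "__main__":
    print("backup site today:", backup_data_center(date.today()))
    for step in backup_plan(
        "mongos1.example.com:27017",
        "dc1-hidden.example.com:27017",
        "cfg1.example.com:27019",
        "/data/db",
        "/backup/2014-12-05",
    ):
        print(step)
```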

Page 11

Security

System has to comply with MetLife’s enterprise standard for data security:

Authentication and authorization.
MetLife:
- Original build in MongoDB 2.2 had very limited options in database authentication and only read-only or read/write permission at the database level.
- Biggest concerns: authentication (no password policy enforcement) and authorization (excessive application permissions).
- MetLife’s MongoDB 2.6 goals are: authentication through Active Directory, and authorization through custom-built roles with the least set of permissions required by the application.

LDAP integration.
MetLife:
- Integration with Active Directory (AD) using Linux PAM
- Third-party product for secure server/AD communications
- Currently mixed mode (both AD and in-database) authentication

Data-at-rest encryption.
MetLife: Data-at-rest encryption is implemented using a third-party product (Linux file system / device encryption).

Audit.
MetLife:
- Tactical: MongoDB 2.6 audit capability can do the job.
- Strategic: Database activity audit is performed by a third-party product.
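The "custom roles with the least set of permissions" goal can be illustrated with the `createRole` command document introduced in MongoDB 2.6. This is a sketch under assumptions: the role name, database, collection names and the allow-list of actions are hypothetical, and the slides do not describe MetLife's actual roles.

```python
# Hypothetical allow-list: the only actions this application is assumed to need.
ALLOWED_ACTIONS = {"find", "insert", "update"}


def least_privilege_role(role_name, db, collections, actions):
    """Build a createRole command document limited to allow-listed actions."""
    extra = set(actions) - ALLOWED_ACTIONS
    if extra:
        raise ValueError(f"actions not allowed for application roles: {sorted(extra)}")
    return {
        "createRole": role_name,
        "privileges": [
            {"resource": {"db": db, "collection": c}, "actions": sorted(actions)}
            for c in collections
        ],
        "roles": [],  # no inherited roles, so the privileges above are the full set
    }


if __name__ == "__main__":
    print(least_privilege_role("appReadWrite", "appdb", ["customers"], ["find", "update"]))
```

Rejecting anything outside the allow-list at build time keeps broad actions such as `dropDatabase` from slipping into an application role.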

Page 12

Monitoring and Alerting

System has to comply with MetLife’s enterprise standard for monitoring and alerting:

Hardware monitoring.
MetLife: No munin-node monitoring. Using the standard enterprise Linux server monitoring toolset owned by MetLife.

MongoDB monitoring with MMS.
MetLife: Currently using MMS in the cloud for monitoring and alerting. Alerts are sent via SMS and e-mail to responsible individuals in operations as well as to monitored group mailboxes.

BMC MongoDB Patrol KM as an alternative monitoring solution.
MetLife: Third-party Knowledge Modules are standard monitoring/alerting tools for MetLife’s enterprise databases. Currently engaged in MongoDB KM beta-testing.

Integrating monitoring/alerting with the enterprise incident management system.
MetLife: Currently no integration. Two approaches in parallel:
- In-house written process to parse the JSON attachment from an MMS alert e-mail and create an incident ticket
- Third-party KM is natively integrated with the enterprise incident management system
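The first approach (parsing the JSON attachment of an alert e-mail) might look roughly like this. It is a sketch, not the in-house process: the alert field names (`event`, `host`, `status`), the ticket fields and the severity rule are hypothetical, because the slides do not describe the attachment's schema or the incident system's API.

```python
import json
from email import message_from_bytes, policy
from email.message import EmailMessage


def alert_email_to_ticket(raw_email):
    """Parse a raw alert e-mail and map its JSON attachment to a ticket dict."""
    msg = message_from_bytes(raw_email, policy=policy.default)
    for part in msg.iter_attachments():
        if part.get_content_type() == "application/json":
            alert = json.loads(part.get_payload(decode=True).decode("utf-8"))
            return {
                "summary": f"MongoDB alert: {alert.get('event', 'unknown')}",
                "host": alert.get("host", "unknown"),
                "severity": "high" if alert.get("status") == "OPEN" else "low",
                "source_subject": str(msg["Subject"]),
            }
    return None  # no JSON attachment found


def sample_alert_email():
    """Build a made-up alert e-mail for trying out the parser."""
    msg = EmailMessage()
    msg["Subject"] = "MMS alert"
    msg.set_content("See attached alert.")
    payload = json.dumps(
        {"event": "HOST_DOWN", "host": "shard1-a.example.com", "status": "OPEN"}
    )
    msg.add_attachment(
        payload.encode("utf-8"),
        maintype="application",
        subtype="json",
        filename="alert.json",
    )
    return msg.as_bytes()


if __name__ == "__main__":
    print(alert_email_to_ticket(sample_alert_email()))
```

The returned dict would then be handed to whatever interface the incident management system exposes, which is outside the scope of this sketch.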

Page 13

Workload Management and Automation

System has to reliably support business SLAs and be efficient to manage:

Workload management and resource sharing.
MetLife: Workload management and resource sharing is one of the bigger challenges. MongoDB 2.6 does not have an in-database mechanism for managing different workloads, which makes resource sharing problematic.
- Potential options: cgroups in RHEL 6

MMS automation (installation, upgrades).
MetLife: Engaged with MongoDB for MMS automation beta-testing.
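As a rough idea of what the cgroups option could involve, the sketch below renders a group definition in the `/etc/cgconfig.conf` syntax used by libcgroup on RHEL 6. The slides list cgroups only as a potential option, so everything here is an assumption: the group name, the choice of the `cpu` and `memory` controllers, and the limit values are hypothetical.

```python
def render_cgroup(name, cpu_shares, memory_limit_bytes):
    """Render one cgconfig.conf group with a CPU weight and a memory cap."""
    return (
        f"group {name} {{\n"
        f"    cpu {{\n"
        f"        cpu.shares = {cpu_shares};\n"
        f"    }}\n"
        f"    memory {{\n"
        f"        memory.limit_in_bytes = {memory_limit_bytes};\n"
        f"    }}\n"
        f"}}\n"
    )


if __name__ == "__main__":
    # Hypothetical example: one mongod instance capped at 64 GiB of memory.
    print(render_cgroup("mongod_app1", 512, 64 * 1024**3))
```

Each mongod process sharing a server would get its own group, so that one workload cannot starve the others of CPU or memory.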

Page 14

Next Steps in our Journey

• Automation (installation, upgrades, maintenance).

• MMS backup solution (on premises).

• Monitoring/alerting integration with an incident management system.

• Workload management / resource sharing solution.

• Introduction of arbiter to existing replica sets (3rd data center).

• Performance benchmarking toolset.