Cloud Computing Applications: Hazelcast, Spark and Ignite
Joseph S. Kuo a.k.a. CyberJos
About Me
.Played with a pile of languages and architectures while studying mathematics in university
.22 years of programming experience, 17 years with Java
.Has worked as an IT lecturer, and at a game cloud platform company, a global e-commerce company, a well-known information security company, and a social trend analytics company
.Hopes to keep writing code and playing with technology for life
Agenda
.Briefing of Hazelcast
.More about Hazelcast
.Spark Introduction
.Hazelcast and Spark
.About Apache Ignite
.Things between Ignite and Hazelcast
Briefing of Hazelcast
What is Hazelcast?
Hazelcast is an in-memory data grid which distributes data evenly among the nodes of a computing cluster and shares the available processing power and storage space to provide services. It is also tolerant of node failure and data loss.
Features
.Distributed Caching: Queue, Set, List, Map, MultiMap, Lock, Topic, AtomicReference, AtomicLong, IdGenerator, Ringbuffer, Semaphores
.Distributed Compute: Entry Processor, Executor Service, User Defined Services
.Distributed Query: Query, Aggregators, Listener with Predicate, MapReduce
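A minimal sketch of a few of these distributed structures on a single local member (the structure names are illustrative assumptions):

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IAtomicLong;
import com.hazelcast.core.IQueue;

public class DistributedStructuresDemo {
    public static void main(final String[] args) {
        HazelcastInstance instance = Hazelcast.newHazelcastInstance();
        // Cluster-wide queue: any member can offer and poll.
        IQueue<String> queue = instance.getQueue("tasks");
        queue.offer("task-1");
        System.out.println(queue.poll());
        // Cluster-wide atomic counter.
        IAtomicLong counter = instance.getAtomicLong("counter");
        System.out.println(counter.incrementAndGet());
        instance.shutdown();
    }
}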
Features (Cont.)
.Integrated Clustering: Hibernate 2nd Level Cache, Grails 3, JCS Resource Adapter
.Standards: JCache, Apache jclouds
.Cloud and Virtualization Support: Docker, AWS, Azure, Discovery Service Provider Interface, Kubernetes, ZooKeeper Discovery
.Client-Server Protocols: Memcache, Open Binary Client Protocol, REST
Use Cases
.In-Memory Data Grid
.Caching
.In-Memory NoSQL
.Messaging
.Application Scaling
.Clustering
In-Memory Data Grid
.Scale-out Computing: shared CPU power
.Resilience: survive node failure without data loss or performance impact
.Programming Model: easily code clusters
.Fast, Big Data: handle large sets in RAM
.Dynamic Scalability: join/leave a cluster
.Elastic Main Memory: memory pool
Caching
.Elastic Memcached: Hazelcast has been used as an alternative to Memcached.
.Hibernate 2nd Level Cache: It organizes caching into 1st and 2nd level caches.
.Spring Cache: It supports Spring Cache, which allows it to plug into any Spring application (see the sketch below).
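An illustrative Spring configuration sketch, assuming the hazelcast-spring module is on the classpath (the bean names here are made up):

import org.springframework.cache.CacheManager;
import org.springframework.cache.annotation.EnableCaching;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.spring.cache.HazelcastCacheManager;

@Configuration
@EnableCaching
public class CacheConfig {
    @Bean
    public HazelcastInstance hazelcastInstance() {
        return Hazelcast.newHazelcastInstance();
    }

    @Bean
    public CacheManager cacheManager(final HazelcastInstance instance) {
        // Backs @Cacheable / @CacheEvict methods with Hazelcast maps.
        return new HazelcastCacheManager(instance);
    }
}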
In-Memory NoSQL
.Scalability: size of RAM vs. disk
By joining nodes in a cluster, we can pool RAM to store maps, and the CPU and RAM resources become available to the network.
.Volatility: volatility of RAM vs. disk
It uses P2P data distribution so there is no single point of failure. By default, data is stored in two locations in the cluster.
In-Memory NoSQL (Cont.)
.Rebalancing
It automatically rebalances data if a node crashes. Shuffling data has a negative effect, as it consumes network, CPU and RAM.
.Going Native
The High-Density Memory Store can avoid GC pauses. It uses NIO DirectByteBuffers and does not require any defragmentation.
Messaging
Hazelcast provides Topic as a distribution mechanism for publishing messages to multiple subscribers. Publishing and subscribing are cluster-wide. Messages are ordered: listeners process the messages in the order in which they were actually published.
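A minimal pub/sub sketch (the topic name and message are illustrative assumptions):

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.ITopic;

public class TopicDemo {
    public static void main(final String[] args) {
        HazelcastInstance instance = Hazelcast.newHazelcastInstance();
        ITopic<String> topic = instance.getTopic("news");
        // Every registered listener in the cluster receives each published message.
        topic.addMessageListener(message -> System.out.println(message.getMessageObject()));
        topic.publish("Hello, subscribers!");
    }
}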
Application Scaling
.Elastic Scalability: new servers join a cluster automatically
.Super Speeds: memory transaction speed
.High Availability: can deploy in backup pairs or even WAN replicated
.Fault Tolerance: no single point of failure
.Cloud Readiness: deploy right into EC2
Clustering
Hazelcast easily handles session clustering with in-memory performance, linear scalability as you add new nodes, and reliability. This is a great way to ensure that session information is maintained when you cluster web servers. You can also use a similar pattern for managing user identities.
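One illustrative pattern, sketched by hand rather than taken from Hazelcast's web session integration, is to keep session attributes in a distributed map keyed by session ID (the map and attribute names are assumptions):

import java.util.HashMap;
import java.util.Map;

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

public class SessionStoreDemo {
    public static void main(final String[] args) {
        HazelcastInstance instance = Hazelcast.newHazelcastInstance();
        // Every web server in the cluster sees the same session data.
        IMap<String, Map<String, Object>> sessions = instance.getMap("sessions");
        Map<String, Object> attributes = new HashMap<>();
        attributes.put("userId", 42L);
        sessions.put("session-abc123", attributes);
        System.out.println(sessions.get("session-abc123").get("userId"));
    }
}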
Dependency
.Maven
<dependency>
    <groupId>com.hazelcast</groupId>
    <artifactId>hazelcast</artifactId>
    <version>3.7.2</version>
</dependency>
.Gradle
dependencies {
    compile 'com.hazelcast:hazelcast:3.7.2'
}
More about Hazelcast
What’s New in Hazelcast 3.4
.High-Density Memory Store
.Hazelcast Configuration Import
.Back Pressure
What’s New in Hazelcast 3.5
.Async Back Pressure
.Client Configuration Import
.Cluster Quorum
.Hazelcast Client Protocol
.Listener for Lost Partitions
.Increased Visibility of Slow Operations
.Sub-Listener Interfaces for Map Listener
What’s New in Hazelcast 3.6
.High-Density Memory Store for Map
.Discovery SPI
.Client Protocol & Version Compatibility
.Support for cloud providers by jclouds®
.Hot Restart Persistence
.Lite Members
.Lots of Features for Hazelcast JCache
.Hazelcast Docker image
What’s New in Hazelcast 3.7
.Custom Eviction Policies
.Discovery SPI for Azure
.Hazelcast CLI with Scripting
.OpenShift and CloudFoundry Plugin
.Apache Spark Connector
.Alignment of WAN Replication Clusters
.Fault Tolerant Executor Service
Sample Code
import java.util.Map;

import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class GetStartedMain {
    public static void main(final String[] args) {
        Config cfg = new Config();
        HazelcastInstance instance = Hazelcast.newHazelcastInstance(cfg);
        Map<Long, String> map = instance.getMap("test");
        map.put(1L, "Demo");
        System.out.println(map.get(1L));
    }
}
Sharding – 4 nodes
How Is Data Partitioned?
Data entries are distributed into partitions by applying a hashing algorithm to the key or name:
.The key or name is serialized (converted into a byte array).
.This byte array is hashed.
.The result of the hash is taken modulo the number of partitions.
Partition ID
The result of this modulo operation - MOD(hash result, partition count) - is the partition in which the data will be stored: the partition ID. For all members in your cluster, the partition ID for a given key is always the same.
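A minimal sketch of the computation (this is not Hazelcast's actual serialization or hash function, just the shape of the idea):

public class PartitionIdDemo {
    static final int PARTITION_COUNT = 271; // Hazelcast's default partition count

    static int partitionId(final byte[] serializedKey) {
        // Illustrative stand-in for Hazelcast's real hash function.
        int hash = java.util.Arrays.hashCode(serializedKey);
        // Mask the sign bit so the modulo result is never negative.
        return (hash & Integer.MAX_VALUE) % PARTITION_COUNT;
    }

    public static void main(final String[] args) {
        byte[] key = "test-key".getBytes();
        System.out.println(partitionId(key)); // same ID on every member
    }
}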
Partition Table
When we start a member, a partition table is created within it. This table stores the partition IDs and the cluster members to which they belong. The purpose of this table is to make all members (including lite members) in the cluster aware of this information, ensuring that each member knows where the data is.
Partition Table (Cont.)
The oldest member in the cluster (the one that started first) periodically sends the partition table to all members. In this way each member in the cluster is informed about any changes to partition ownership. The ownerships may be changed when a new member joins the cluster, or when a member leaves the cluster.
Repartitioning
Repartitioning is the process of redistribution of partition ownerships:
.When a member joins the cluster.
.When a member leaves the cluster.
In these cases, the partition table in the oldest member is updated with the new partition ownerships.
Topology - Embedded
Topology - Client/Server
Spark Introduction
What is Spark?
.Spark is a fast and general-purpose cluster computing system. It provides high-level APIs and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools.
.It provides an interface for programming entire clusters with implicit data parallelism and fault-tolerance.
Advantages
.Speed
Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
.Ease of Use
Write applications quickly. Spark offers over 80 high-level operators for building parallel applications.
Advantages (Cont.)
.Generality
Combine SQL, streaming and complex analytics libraries seamlessly in the same application.
.Run Everywhere
Supports multiple cluster managers and distributed storage systems.
Features
.Resilient distributed dataset (RDD)
.Fault Tolerant
.Map-reduce cluster computing
.Built-in libraries
.Languages: Java, Scala, Python and R
.Interactive shell (Python, Scala, R) and web-based UI
RDD
A resilient distributed dataset is a read-only collection of elements, partitioned across the nodes of the cluster, that can be operated on in parallel. It can stay in memory and fall back to disk gracefully. An RDD held in memory (cached) can be reused efficiently across parallel operations. Finally, RDDs automatically recover from node failures.
RDD Operations
Two types of operations can be performed on an RDD (see the sketch below):
.transformations, like map and filter, that result in another RDD
.actions, like count, that result in an output
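A minimal sketch in Java (the app name and local master URL are illustrative assumptions):

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddOperationsDemo {
    public static void main(final String[] args) {
        SparkConf conf = new SparkConf().setAppName("rdd-demo").setMaster("local[*]");
        JavaSparkContext jsc = new JavaSparkContext(conf);
        JavaRDD<Integer> numbers = jsc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
        // Transformation: lazily produces another RDD.
        JavaRDD<Integer> evens = numbers.filter(n -> n % 2 == 0);
        // Action: triggers the computation and returns a result.
        System.out.println(evens.count()); // 2
        jsc.close();
    }
}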
RDD Operations (Cont.)
RDD Fault Recovery
Directed Acyclic Graph
Cluster Topology
Dependency
.Maven
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.0.0</version>
</dependency>
.Gradle
dependencies {
    compile 'org.apache.spark:spark-core_2.11:2.0.0'
}
Spark Node with Docker
.Pull image (Spark 2.0)
docker pull maguowei/spark
.Launch a Spark node
docker run -it -p 4040:4040 maguowei/spark pyspark
docker run -it -p 4040:4040 maguowei/spark spark-shell
.Monitoring
http://localhost:4040/
Spark Cluster with Docker
.Launch master image (driver program)
docker run -it -h sandbox1 -p 7077:7077 -p 8080:8080 maguowei/spark bash
.Append text to /etc/hosts
172.17.0.2 sandbox1
172.17.0.3 sandbox2
.Launch the master node
/opt/spark-2.0.0-bin-hadoop2.7/sbin/start-master.sh
.Monitoring
http://localhost:8080/
Spark Cluster with Docker (Cont.)
.Launch worker images
docker run -it -h sandbox2 maguowei/spark bash
.Append text to /etc/hosts
172.17.0.2 sandbox1
172.17.0.3 sandbox2
.Launch a worker node
/opt/spark-2.0.0-bin-hadoop2.7/sbin/start-slave.sh spark://sandbox1:7077
.Run tasks
docker exec <CONTAINER_ID> run-example <class> <arg>
Same version for all places
Same version for all places
Same version for all places
(Very important, so say it 3 times.)
Hazelcast and Spark
What is this Connector?
A plug-in which allows Hazelcast maps and caches to be used as shared RDDs by Spark, via the Spark RDD API.
What is this Connector? (Cont.)
[Diagram: clients talk to both Hazelcast (MapReduce) and Spark (MapReduce); the Hazelcast Spark Connector bridges the two systems.]
Features
.Read/Write support for Hazelcast Maps
.Read/Write support for Hazelcast Caches
Requirements
.Hazelcast 3.7.x
.Apache Spark 1.6.1
Dependency
.Maven
<dependency>
    <groupId>com.hazelcast</groupId>
    <artifactId>hazelcast-spark</artifactId>
    <version>0.1</version>
</dependency>
.Gradle
dependencies {
    compile 'com.hazelcast:hazelcast-spark:0.1'
}
Properties
The options for the SparkConf object:
.hazelcast.server.addresses: 127.0.0.1:5701 (comma-separated list)
.hazelcast.server.groupName: dev
.hazelcast.server.groupPass: dev-pass
.hazelcast.spark.valueBatchingEnabled: true
.hazelcast.spark.readBatchSize: 1000
.hazelcast.spark.writeBatchSize: 1000
.hazelcast.spark.clientXmlPath
Creating the SparkContext
SparkConf conf = new SparkConf()
        .set("hazelcast.server.addresses", "127.0.0.1:5701")
        .set("hazelcast.server.groupName", "dev")
        .set("hazelcast.server.groupPass", "dev-pass")
        .set("hazelcast.spark.valueBatchingEnabled", "true")
        .set("hazelcast.spark.readBatchSize", "5000")
        .set("hazelcast.spark.writeBatchSize", "5000");

JavaSparkContext jsc = new JavaSparkContext("spark://127.0.0.1:7077", "appname", conf);
// Provide Hazelcast functions to the Spark context.
HazelcastSparkContext hsc = new HazelcastSparkContext(jsc);
Read Data from Hazelcast
// Read
HazelcastJavaRDD rddFromMap = hsc.fromHazelcastMap("map-name-to-be-loaded");
HazelcastJavaRDD rddFromCache = hsc.fromHazelcastCache("cache-name-to-be-loaded");
Write Data to Hazelcast
import static com.hazelcast.spark.connector.HazelcastJavaPairRDDFunctions.javaPairRddFunctions;

JavaPairRDD<Object, Long> rdd = hsc.parallelize(new ArrayList<Object>() {{
    add(1);
    add(2);
    add(3);
}}).zipWithIndex();

// Write
javaPairRddFunctions(rdd).saveToHazelcastMap(name);
javaPairRddFunctions(rdd).saveToHazelcastCache(name);
About Apache Ignite
What is Ignite?
Apache Ignite In-Memory Data Fabric is a high-performance, integrated and distributed in-memory platform for computing and transacting on large-scale data sets in real time, orders of magnitude faster than is possible with traditional disk-based or flash technologies.
Features
.Data Grid
.Compute Grid
.Streaming and CEP
.Data Structures
.Messaging and Events
.Service Grid
Data Grid
.Distributed Caching: Key-Value Store, Partitioning & Replication, Client-Side Cache
.Cluster Resiliency: Self-Healing Cluster
.Memory Formats: On-heap, Off-heap, Tiered Storage
.Marshalling: Binary Protocol
.Distributed Transactions and Locks: ACID, Deadlock-free, Cross-partition, Locks
Data Grid (Cont.)
.Distributed Query: SQL Queries, Joins, Continuous Queries, Indexing, Consistency, Fault-Tolerance
.Persistence: Write-Through, Read-Through, Write-Behind Caching, Automatic Persistence
.Standards: JCache, SQL, JDBC, OSGi
.Integrations: DB, Hibernate L2 Cache, Session Clustering, Spring Caching
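A minimal sketch of the SQL query support listed above (the cache name and data are illustrative assumptions):

import java.util.List;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.query.SqlFieldsQuery;
import org.apache.ignite.configuration.CacheConfiguration;

public class IgniteSqlDemo {
    public static void main(final String[] args) {
        try (Ignite ignite = Ignition.start()) {
            CacheConfiguration<Long, String> cfg = new CacheConfiguration<>("city");
            // Make the key and value types queryable via SQL.
            cfg.setIndexedTypes(Long.class, String.class);
            IgniteCache<Long, String> cache = ignite.getOrCreateCache(cfg);
            cache.put(1L, "Taipei");
            cache.put(2L, "Tokyo");
            List<List<?>> rows = cache.query(
                new SqlFieldsQuery("select _key, _val from String where _val like 'T%'")).getAll();
            rows.forEach(System.out::println);
        }
    }
}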
Computing Grid
.Distributed Closure Execution
.Clustered Executor Service
.MapReduce and ForkJoin
.Load Balancing
.Fault-Tolerance
.Job Scheduling
.Checkpointing
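A minimal distributed closure sketch for the items above (a single local node; the message is an illustrative assumption):

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;

public class ComputeGridDemo {
    public static void main(final String[] args) {
        try (Ignite ignite = Ignition.start()) {
            // Runs the closure on every node in the cluster.
            ignite.compute().broadcast(() -> System.out.println("Hello from a cluster node"));
        }
    }
}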
Streaming and CEP
Ignite streaming allows you to process continuous, never-ending streams of data in a scalable and fault-tolerant fashion. The rate at which data can be injected into Ignite is very high and can easily exceed millions of events per second on a moderately sized cluster.
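A minimal ingestion sketch using IgniteDataStreamer (the cache name and event values are illustrative assumptions):

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteDataStreamer;
import org.apache.ignite.Ignition;

public class StreamingDemo {
    public static void main(final String[] args) {
        try (Ignite ignite = Ignition.start()) {
            ignite.getOrCreateCache("events");
            // The streamer batches updates and distributes them across the cluster.
            try (IgniteDataStreamer<Integer, String> streamer = ignite.dataStreamer("events")) {
                for (int i = 0; i < 1_000; i++) {
                    streamer.addData(i, "event-" + i);
                }
            }
        }
    }
}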
Data Structures
.Queue and Set
.Atomic Types
.CountDownLatch
.IdGenerator
.Semaphore
Messaging and Events
.Topic Based Messaging
.Point-to-Point Messaging
.Event Notifications
.Automatic Batching
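A minimal topic-based messaging sketch (the topic name and message are illustrative assumptions):

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;

public class MessagingDemo {
    public static void main(final String[] args) {
        try (Ignite ignite = Ignition.start()) {
            // Listen for messages on a topic; returning true keeps the listener active.
            ignite.message().localListen("news", (nodeId, msg) -> {
                System.out.println("Received: " + msg);
                return true;
            });
            ignite.message().send("news", "Hello, topic subscribers!");
        }
    }
}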
Service Grid
Dependency
.Maven
<dependency>
    <groupId>org.apache.ignite</groupId>
    <artifactId>ignite-core</artifactId>
    <version>1.7.0</version>
</dependency>
.Gradle
dependencies {
    compile 'org.apache.ignite:ignite-core:1.7.0'
}
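For comparison with the earlier Hazelcast sample, a minimal Ignite get-started sketch (the cache name is an illustrative assumption):

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;

public class IgniteGetStartedMain {
    public static void main(final String[] args) {
        try (Ignite ignite = Ignition.start()) {
            IgniteCache<Long, String> cache = ignite.getOrCreateCache("test");
            cache.put(1L, "Demo");
            System.out.println(cache.get(1L));
        }
    }
}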
Things between Ignite & Hazelcast
Benchmark Fight
.GridGain posted: GridGain vs Hazelcast Benchmarks
.It was also posted to Hazelcast Forum
.Hazelcast CEO removed that post
.Hazelcast fought back and claimed that GridGain cheated
.GridGain re-tested and clarified
Difference
                    Ignite                 Hazelcast
Off-heap Memory     Configurable           Enterprise
Off-heap Indexing   Yes                    No
Continuous Query    Yes                    Enterprise
SSL Encryption      Yes                    Enterprise
SQL Query           Full ANSI 99           Limited
Join Query          Yes                    No
Data Consistency    Yes                    Partial
Difference (Cont.)
                    Ignite                 Hazelcast
Deadlock-free       Yes                    No
Computing Grid      MapReduce, ForkJoin,   MapReduce
                    LoadBalance, ...
Streaming/CEP       Yes                    No
Service Grid        Yes                    No
Language            .Net/C#/C++/Node.js    .Net/C#/C++
Data Structures     Less                   More
Plug-in             Less                   More
It doesn’t matter which you select
How you use it does matter
References
.Hazelcast: http://hazelcast.org/
.Hazelcast Doc: http://hazelcast.org/documentation/
.Spark: http://spark.apache.org/
.Hazelcast Spark Connector: https://github.com/hazelcast/hazelcast-spark
.Apache Ignite: https://ignite.apache.org/
.Sample Code: https://github.com/CyberJos/jcconf2016-hazelcast-spark
Thank You!!