cassandra summit 2014: meta — an efficient distributed data hub with batch and streaming query...
DESCRIPTION
Presenter: Alvaro Agea, Big Data Architect at Stratio

Big Data analysis is commonly associated with batch processing of data stored in distributed file systems. The advent of streaming data is exposing the shortcomings of traditional data analysis. Users aiming to combine both worlds, batch processing and streaming, have had to turn to unreliable in-house developments. We propose Stratio META to meet this new need. META is a technology based on a structured NoSQL datastore with advanced indexing capabilities. META includes an efficient query planner designed from scratch. The planner determines the optimal path to execute a query and which components should be involved.

TRANSCRIPT
![Page 1: Cassandra Summit 2014: META — An Efficient Distributed Data Hub with Batch and Streaming Query Capabilities](https://reader034.vdocuments.net/reader034/viewer/2022042813/54815524b4795937578b48b0/html5/thumbnails/1.jpg)
Stratio Meta
An efficient distributed datahub with batch and streaming query capabilities
Daniel Higuero
Alvaro Agea
#CassandraSummit-2014
![Page 2: Cassandra Summit 2014: META — An Efficient Distributed Data Hub with Batch and Streaming Query Capabilities](https://reader034.vdocuments.net/reader034/viewer/2022042813/54815524b4795937578b48b0/html5/thumbnails/2.jpg)
Stratio Crossdata
An efficient distributed datahub with batch and streaming query capabilities
Daniel Higuero
Alvaro Agea
![Page 3: Cassandra Summit 2014: META — An Efficient Distributed Data Hub with Batch and Streaming Query Capabilities](https://reader034.vdocuments.net/reader034/viewer/2022042813/54815524b4795937578b48b0/html5/thumbnails/3.jpg)
STRATIO: Who are we?
• Stratio is a Big Data company
• Founded in 2013
• Commercially launched in 2014
• 50+ employees in Madrid
• Office in San Francisco
• Certified Spark distribution
![Page 4: Cassandra Summit 2014: META — An Efficient Distributed Data Hub with Batch and Streaming Query Capabilities](https://reader034.vdocuments.net/reader034/viewer/2022042813/54815524b4795937578b48b0/html5/thumbnails/4.jpg)
Cassandra: We love…
• P2P architecture
• Read/write performance
• Fault tolerance
• Easy to deploy
• CQL
![Page 5: Cassandra Summit 2014: META — An Efficient Distributed Data Hub with Batch and Streaming Query Capabilities](https://reader034.vdocuments.net/reader034/viewer/2022042813/54815524b4795937578b48b0/html5/thumbnails/5.jpg)
Agenda
• Introduction
• Crossdata architecture
• Metadata management
• Streaming sources
• Full text search
• Spark and Crossdata
• ODBC
• The future
![Page 6: Cassandra Summit 2014: META — An Efficient Distributed Data Hub with Batch and Streaming Query Capabilities](https://reader034.vdocuments.net/reader034/viewer/2022042813/54815524b4795937578b48b0/html5/thumbnails/6.jpg)
Introduction
o Big Data analysis is commonly associated with batch processing
  • Users aiming to combine batch and stream processing have to rely on tailor-made architectures
o Users buy Big Data platforms, but
  • How do I start?
  • What is my entry point to the platform?
![Page 7: Cassandra Summit 2014: META — An Efficient Distributed Data Hub with Batch and Streaming Query Capabilities](https://reader034.vdocuments.net/reader034/viewer/2022042813/54815524b4795937578b48b0/html5/thumbnails/7.jpg)
What do our clients demand?
o Easy deployment
o Easy administration
o Read/write performance
o Easy-to-learn query language
o Integration with BI tools
o Join operations
o Support for streaming sources
o Integration with other data stores
o Ability to query data without thinking about the schema (non-indexed data)
![Page 8: Cassandra Summit 2014: META — An Efficient Distributed Data Hub with Batch and Streaming Query Capabilities](https://reader034.vdocuments.net/reader034/viewer/2022042813/54815524b4795937578b48b0/html5/thumbnails/8.jpg)
What do our clients demand?
✓ Easy deployment
✓ Easy administration
✓ Read/write performance
✓ Easy-to-learn query language
o Integration with BI tools
o Join operations
o Support for streaming sources
o Integration with other data stores
o Ability to query data without thinking about the schema (non-indexed data)
![Page 9: Cassandra Summit 2014: META — An Efficient Distributed Data Hub with Batch and Streaming Query Capabilities](https://reader034.vdocuments.net/reader034/viewer/2022042813/54815524b4795937578b48b0/html5/thumbnails/9.jpg)
What do our clients demand?
✓ Easy deployment
✓ Easy administration
✓ Read/write performance
✓ Easy-to-learn query language
✓ Integration with BI tools
✓ Join operations
✓ Support for streaming sources
✓ Integration with other data stores
✓ Ability to query data without thinking about the schema (non-indexed data)
![Page 10: Cassandra Summit 2014: META — An Efficient Distributed Data Hub with Batch and Streaming Query Capabilities](https://reader034.vdocuments.net/reader034/viewer/2022042813/54815524b4795937578b48b0/html5/thumbnails/10.jpg)
Crossdata
o A new technology that:
  • Is not limited by the underlying datastore capabilities
  • Leverages Spark to perform non-natively supported operations
  • Supports batch and streaming queries
  • Supports multiple clusters and technologies
![Page 11: Cassandra Summit 2014: META — An Efficient Distributed Data Hub with Batch and Streaming Query Capabilities](https://reader034.vdocuments.net/reader034/viewer/2022042813/54815524b4795937578b48b0/html5/thumbnails/11.jpg)
Our architecture
![Page 12: Cassandra Summit 2014: META — An Efficient Distributed Data Hub with Batch and Streaming Query Capabilities](https://reader034.vdocuments.net/reader034/viewer/2022042813/54815524b4795937578b48b0/html5/thumbnails/12.jpg)
Connecting to the outside world
o Crossdata defines an IConnector extension interface
o Users can easily add new connectors to support
  • Different datastores
  • Different processing engines
  • Different versions
o Each connector defines its capabilities
Our planner will choose the best connector for each query
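The capability-based selection described on this slide can be sketched roughly as follows. This is a Python illustration, not the real IConnector interface: the `Connector` class, the capability names, and the `cost` ranking are all hypothetical, since the slides only say that each connector declares its capabilities and that the planner picks the best one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Connector:
    """Hypothetical stand-in for a connector and the capabilities it declares."""
    name: str
    capabilities: frozenset
    cost: int  # assumed ranking: a lower value means cheaper to run


def choose_connector(required, connectors):
    """Return the cheapest connector that supports every required operation."""
    candidates = [c for c in connectors if set(required) <= c.capabilities]
    if not candidates:
        raise LookupError(f"no connector supports {sorted(required)}")
    return min(candidates, key=lambda c: c.cost)


# Illustrative registry: a native datastore connector and a Spark-backed one.
CONNECTORS = [
    Connector("cassandra-native", frozenset({"select", "filter_pk"}), cost=1),
    Connector("spark", frozenset({"select", "filter_pk", "join", "group_by"}), cost=10),
]

simple = choose_connector({"select", "filter_pk"}, CONNECTORS)
complex_query = choose_connector({"select", "join"}, CONNECTORS)
```

In this sketch a simple lookup goes to the native connector, while a query that needs a join falls through to the Spark-backed one because it is the only candidate declaring that capability.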
![Page 13: Cassandra Summit 2014: META — An Efficient Distributed Data Hub with Batch and Streaming Query Capabilities](https://reader034.vdocuments.net/reader034/viewer/2022042813/54815524b4795937578b48b0/html5/thumbnails/13.jpg)
Query execution
[Diagram: pipeline stages Parsing, Validation, Planning, Execution; execution targets shown are C*, Connector1, Connector2, Connector3]
Our planner will choose the best connector for each query
![Page 14: Cassandra Summit 2014: META — An Efficient Distributed Data Hub with Batch and Streaming Query Capabilities](https://reader034.vdocuments.net/reader034/viewer/2022042813/54815524b4795937578b48b0/html5/thumbnails/14.jpg)
Multi-cluster support
o Stratio Crossdata offers the possibility of accessing a single catalog across a set of datastores.
  • Multiple clusters can coexist to optimize platform performance
    - E.g., production cluster, test cluster, write-optimized cluster, read-optimized cluster, etc.
  • Each table is stored in exactly one datastore
![Page 15: Cassandra Summit 2014: META — An Efficient Distributed Data Hub with Batch and Streaming Query Capabilities](https://reader034.vdocuments.net/reader034/viewer/2022042813/54815524b4795937578b48b0/html5/thumbnails/15.jpg)
Logical and physical mapping
[Diagram: the App catalog holds the Users table, Test table, and old_users table, which are mapped onto C* production, C* development, and other datastores]
SELECT * FROM app.users;
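A minimal Python sketch of how a logical name such as `app.users` might resolve to a physical cluster. The dictionary layout, the cluster names, and the specific table-to-cluster pairing are assumptions for illustration; the slide only shows that one catalog spans several datastores and that each table lives in one of them.

```python
# Hypothetical catalog metadata: logical table name -> physical cluster.
# The pairing below is assumed; the slide does not spell it out.
CATALOG = {
    "app": {
        "users": "cassandra_production",
        "test": "cassandra_development",
        "old_users": "other_datastore",
    }
}


def resolve(qualified_name, catalog=CATALOG):
    """Map 'catalog.table' to the single cluster that stores the table."""
    catalog_name, _, table = qualified_name.partition(".")
    try:
        return catalog[catalog_name][table]
    except KeyError:
        raise LookupError(f"unknown table: {qualified_name}") from None


cluster = resolve("app.users")
```

The caller writes the same `SELECT * FROM app.users;` regardless of where the table is stored; only the lookup result changes.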
![Page 16: Cassandra Summit 2014: META — An Efficient Distributed Data Hub with Batch and Streaming Query Capabilities](https://reader034.vdocuments.net/reader034/viewer/2022042813/54815524b4795937578b48b0/html5/thumbnails/16.jpg)
Metadata Management
![Page 17: Cassandra Summit 2014: META — An Efficient Distributed Data Hub with Batch and Streaming Query Capabilities](https://reader034.vdocuments.net/reader034/viewer/2022042813/54815524b4795937578b48b0/html5/thumbnails/17.jpg)
Metadata in the era of Schemaless NoSQL datastores
o Some datastores are schemaless but our applications are not!
  • Flexible schemas vs. schemaless
  • Crossdata provides a Metadata manager that stores schemas for any datasource
    - Remember ODBC and those BI tools
![Page 18: Cassandra Summit 2014: META — An Efficient Distributed Data Hub with Batch and Streaming Query Capabilities](https://reader034.vdocuments.net/reader034/viewer/2022042813/54815524b4795937578b48b0/html5/thumbnails/18.jpg)
Metadata management
[Diagram: Metadata Manager, Metadata Store, Infinispan, Connector, C* production, with numbered callouts 1 and 2]
o If the connector does not support metadata operations, those are skipped
o Updated metadata information is maintained among Crossdata servers using Infinispan
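The behaviour on this slide could look roughly like the Python sketch below. Everything here is hypothetical: the `supports_metadata` flag, the class names, and the plain dictionary standing in for the Infinispan-backed metadata store are assumptions, not the Crossdata API.

```python
class MetadataManager:
    """Keeps table schemas and forwards changes to connectors that accept them."""

    def __init__(self):
        self.store = {}  # stand-in for the replicated metadata store

    def create_table(self, name, schema, connector):
        # Forward to the datastore only if the connector supports metadata
        # operations; otherwise the datastore call is skipped.
        forwarded = False
        if getattr(connector, "supports_metadata", False):
            connector.create_table(name, schema)
            forwarded = True
        # Record the schema in the shared store either way.
        self.store[name] = dict(schema)
        return forwarded


class SchemalessConnector:
    """Hypothetical connector for a datastore with no schema operations."""
    supports_metadata = False


class CassandraLikeConnector:
    """Hypothetical connector for a datastore that accepts table definitions."""
    supports_metadata = True

    def __init__(self):
        self.tables = {}

    def create_table(self, name, schema):
        self.tables[name] = dict(schema)


manager = MetadataManager()
cassandra = CassandraLikeConnector()
sent = manager.create_table("app.users", {"name": "text"}, cassandra)
skipped = manager.create_table("app.events", {"col1": "text"}, SchemalessConnector())
```

Both tables end up with a schema in the manager's store, which is what lets schema-dependent clients such as ODBC tools see them even when the underlying datastore is schemaless.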
![Page 19: Cassandra Summit 2014: META — An Efficient Distributed Data Hub with Batch and Streaming Query Capabilities](https://reader034.vdocuments.net/reader034/viewer/2022042813/54815524b4795937578b48b0/html5/thumbnails/19.jpg)
Streaming sources
![Page 20: Cassandra Summit 2014: META — An Efficient Distributed Data Hub with Batch and Streaming Query Capabilities](https://reader034.vdocuments.net/reader034/viewer/2022042813/54815524b4795937578b48b0/html5/thumbnails/20.jpg)
Managing streaming sources
o Today's use cases often involve some type of streaming datasource
  • Streaming data has an ephemeral nature
  • In Stratio Crossdata we defined the ephemeral table abstraction to work with streaming sources as classical RDBMS tables
[Diagram: a streaming source described by {schema:{col1:…},…} is exposed as a table with columns col1:text, col2:int, col3:int, col4:text and read by Streaming_query0 … Streaming_queryn]
![Page 21: Cassandra Summit 2014: META — An Efficient Distributed Data Hub with Batch and Streaming Query Capabilities](https://reader034.vdocuments.net/reader034/viewer/2022042813/54815524b4795937578b48b0/html5/thumbnails/21.jpg)
Streaming queries
o Streaming queries are infinite by definition
  • A time window is defined to create a batch-like view of the rows ingested by the system in that period
  • The user launches queries specifying a processing time window
    - Crossdata provides methods to list and stop running streaming queries
![Page 22: Cassandra Summit 2014: META — An Efficient Distributed Data Hub with Batch and Streaming Query Capabilities](https://reader034.vdocuments.net/reader034/viewer/2022042813/54815524b4795937578b48b0/html5/thumbnails/22.jpg)
Streaming queries: windows syntax
SELECT fieldGroup,avg(Field2) FROM eph_table WITH WINDOW 5 minutes WHERE field1=100 AND field2>100 GROUP BY fieldGroup;
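To make the window semantics concrete, here is a small Python sketch of what such a query computes. It assumes fixed, non-overlapping windows keyed on a per-row timestamp `ts` in seconds; the slides do not say whether Crossdata windows are tumbling or sliding, so that choice and the row layout are assumptions.

```python
from collections import defaultdict


def windowed_avg(rows, window_seconds):
    """Bucket rows into fixed time windows, then average field2 per fieldGroup.

    Each row is a dict with 'ts' (seconds), 'field1', 'field2', 'fieldGroup'.
    Returns {window_index: {fieldGroup: average}}.
    """
    sums = defaultdict(lambda: defaultdict(lambda: [0.0, 0]))
    for row in rows:
        # Mirrors: WHERE field1=100 AND field2>100
        if row["field1"] == 100 and row["field2"] > 100:
            window = int(row["ts"] // window_seconds)
            acc = sums[window][row["fieldGroup"]]
            acc[0] += row["field2"]
            acc[1] += 1
    return {
        w: {g: total / count for g, (total, count) in groups.items()}
        for w, groups in sums.items()
    }


rows = [
    {"ts": 10, "field1": 100, "field2": 200, "fieldGroup": "a"},
    {"ts": 20, "field1": 100, "field2": 400, "fieldGroup": "a"},
    {"ts": 30, "field1": 100, "field2": 50, "fieldGroup": "a"},    # fails the filter
    {"ts": 310, "field1": 100, "field2": 150, "fieldGroup": "b"},  # second window
]
result = windowed_avg(rows, window_seconds=300)
```

Each window yields its own batch-like result set, which is why a streaming query keeps producing output until it is explicitly stopped.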
![Page 23: Cassandra Summit 2014: META — An Efficient Distributed Data Hub with Batch and Streaming Query Capabilities](https://reader034.vdocuments.net/reader034/viewer/2022042813/54815524b4795937578b48b0/html5/thumbnails/23.jpg)
Joining batch and streaming
SELECT * FROM demo.temporal WITH WINDOW 10 secs INNER JOIN demo.users ON users.name = temporal.name;
[Diagram: the query is split into three parts]
SELECT * FROM demo.temporal WITH WINDOW 10 secs
SELECT * FROM demo.users
INNER JOIN ON users.name = temporal.name
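The decomposition above can be pictured as joining one window's worth of streaming rows against the stored table. The Python sketch below uses a simple hash join; the function name and the sample rows are hypothetical, and Crossdata's actual join strategy is not described in the slides.

```python
def join_window_with_table(window_rows, table_rows, key):
    """Inner-join one window of streaming rows with rows of a stored table."""
    # Build a lookup over the stored (batch) side once per window.
    index = {}
    for row in table_rows:
        index.setdefault(row[key], []).append(row)
    joined = []
    for stream_row in window_rows:
        for table_row in index.get(stream_row[key], []):
            # Merge the columns of both sides into one result row.
            joined.append({**table_row, **stream_row})
    return joined


users = [{"name": "ana", "age": 30}, {"name": "bob", "age": 41}]
window = [{"name": "ana", "event": "login"}, {"name": "eve", "event": "login"}]
matches = join_window_with_table(window, users, key="name")
```

Only streaming rows with a matching user survive the join, so the row for "eve" is dropped in this example.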
![Page 24: Cassandra Summit 2014: META — An Efficient Distributed Data Hub with Batch and Streaming Query Capabilities](https://reader034.vdocuments.net/reader034/viewer/2022042813/54815524b4795937578b48b0/html5/thumbnails/24.jpg)
Full text search
![Page 25: Cassandra Summit 2014: META — An Efficient Distributed Data Hub with Batch and Streaming Query Capabilities](https://reader034.vdocuments.net/reader034/viewer/2022042813/54815524b4795937578b48b0/html5/thumbnails/25.jpg)
Full text search with
o Clients request the ability to perform full-text searches
o We have developed an integration between Lucene and Cassandra
o C* users can now enjoy all Lucene features:
  • Full-text searches, range queries, fuzzy queries…
https://github.com/Stratio/stratio-cassandra
![Page 26: Cassandra Summit 2014: META — An Efficient Distributed Data Hub with Batch and Streaming Query Capabilities](https://reader034.vdocuments.net/reader034/viewer/2022042813/54815524b4795937578b48b0/html5/thumbnails/26.jpg)
Stratio Lucene 2i
[Diagram: five C* nodes, each paired with a Lucene index]
![Page 27: Cassandra Summit 2014: META — An Efficient Distributed Data Hub with Batch and Streaming Query Capabilities](https://reader034.vdocuments.net/reader034/viewer/2022042813/54815524b4795937578b48b0/html5/thumbnails/27.jpg)
Full text search queries
o With Crossdata, we simplify:
  • The creation syntax
  • The query syntax using the match operator
CREATE FULLTEXT INDEX ON app.users(name,email);
SELECT * FROM app.users WHERE email MATCH '*@stratio.com';
![Page 28: Cassandra Summit 2014: META — An Efficient Distributed Data Hub with Batch and Streaming Query Capabilities](https://reader034.vdocuments.net/reader034/viewer/2022042813/54815524b4795937578b48b0/html5/thumbnails/28.jpg)
& Stratio Crossdata
![Page 29: Cassandra Summit 2014: META — An Efficient Distributed Data Hub with Batch and Streaming Query Capabilities](https://reader034.vdocuments.net/reader034/viewer/2022042813/54815524b4795937578b48b0/html5/thumbnails/29.jpg)
Why Spark?
o Stratio Crossdata uses Spark to perform non-natively supported operations
o Spark brings several benefits over Hadoop
  • In-memory processing
  • RDD abstraction
  • Simpler API
  • Increased flexibility (e.g., no need for identity mapping)
![Page 30: Cassandra Summit 2014: META — An Efficient Distributed Data Hub with Batch and Streaming Query Capabilities](https://reader034.vdocuments.net/reader034/viewer/2022042813/54815524b4795937578b48b0/html5/thumbnails/30.jpg)
What about Spark SQL?
o Different approach to query execution
  • We only use Spark when it speeds up queries
    - Native drivers are faster for simple queries
    - Spark SQL has limited RDD sources
  • Avoid some Spark limitations
  • Several batch and streaming contexts in a single JVM (SPARK-2243)
![Page 31: Cassandra Summit 2014: META — An Efficient Distributed Data Hub with Batch and Streaming Query Capabilities](https://reader034.vdocuments.net/reader034/viewer/2022042813/54815524b4795937578b48b0/html5/thumbnails/31.jpg)
Query approach
[Diagram: two stacks side by side. SparkSQL approach: SparkSQL, Spark, Cassandra. Crossdata approach: Stratio Crossdata, Spark and native driver, Cassandra]
![Page 32: Cassandra Summit 2014: META — An Efficient Distributed Data Hub with Batch and Streaming Query Capabilities](https://reader034.vdocuments.net/reader034/viewer/2022042813/54815524b4795937578b48b0/html5/thumbnails/32.jpg)
Our Cassandra-Spark integration
o Project started in June 2013
  - With the objective of providing a method to interact with Cassandra from Spark
  - Initial approach based on the HadoopInputFormat interface
  - Current version uses the native DataStax Java driver
https://github.com/Stratio/stratio-deep
![Page 33: Cassandra Summit 2014: META — An Efficient Distributed Data Hub with Batch and Streaming Query Capabilities](https://reader034.vdocuments.net/reader034/viewer/2022042813/54815524b4795937578b48b0/html5/thumbnails/33.jpg)
Our Cassandra-Spark integration
o Benchmark in progress comparing our solution with the DataStax Spark driver
  • Results highly influenced by the split size
  • Initial results are promising for Stratio Spark Integration using DataStax default values
    - Group by: up to 40% faster
    - Join: up to 17% faster
  • Stay tuned for the benchmark publication!
![Page 34: Cassandra Summit 2014: META — An Efficient Distributed Data Hub with Batch and Streaming Query Capabilities](https://reader034.vdocuments.net/reader034/viewer/2022042813/54815524b4795937578b48b0/html5/thumbnails/34.jpg)
Spark vs Lucene 2i
[Chart: Time versus Records/node, with one series for Spark and one for Lucene 2i]
![Page 35: Cassandra Summit 2014: META — An Efficient Distributed Data Hub with Batch and Streaming Query Capabilities](https://reader034.vdocuments.net/reader034/viewer/2022042813/54815524b4795937578b48b0/html5/thumbnails/35.jpg)
ODBC
![Page 36: Cassandra Summit 2014: META — An Efficient Distributed Data Hub with Batch and Streaming Query Capabilities](https://reader034.vdocuments.net/reader034/viewer/2022042813/54815524b4795937578b48b0/html5/thumbnails/36.jpg)
Stratio Crossdata ODBC
o Well-known interface standard (for BI tools, external apps, …)
o We have implemented it for Crossdata using the Simba SDK
o ODBC opens the full potential of Stratio Crossdata to the external world
o Currently tested with Tableau, QlikView and MS Excel
One ODBC for all datastores!
![Page 37: Cassandra Summit 2014: META — An Efficient Distributed Data Hub with Batch and Streaming Query Capabilities](https://reader034.vdocuments.net/reader034/viewer/2022042813/54815524b4795937578b48b0/html5/thumbnails/37.jpg)
The future
![Page 38: Cassandra Summit 2014: META — An Efficient Distributed Data Hub with Batch and Streaming Query Capabilities](https://reader034.vdocuments.net/reader034/viewer/2022042813/54815524b4795937578b48b0/html5/thumbnails/38.jpg)
The future
o Security
o Query optimizer and smart query planner
o Leverage system statistics
o Support for UDFs
o Become an Apache project
https://github.com/Stratio/stratio-meta
![Page 39: Cassandra Summit 2014: META — An Efficient Distributed Data Hub with Batch and Streaming Query Capabilities](https://reader034.vdocuments.net/reader034/viewer/2022042813/54815524b4795937578b48b0/html5/thumbnails/39.jpg)
We are looking for an Apache Champion
Can you help us?
![Page 40: Cassandra Summit 2014: META — An Efficient Distributed Data Hub with Batch and Streaming Query Capabilities](https://reader034.vdocuments.net/reader034/viewer/2022042813/54815524b4795937578b48b0/html5/thumbnails/40.jpg)
A wish list for Cassandra
o Ability to stop running queries
o Interactive users are unpredictable
o Some exception paths are not clear or defined (e.g., secondary indexes)
o Distribute some of the operations currently performed on the coordinator
  • E.g., aggregations like count(*)
![Page 41: Cassandra Summit 2014: META — An Efficient Distributed Data Hub with Batch and Streaming Query Capabilities](https://reader034.vdocuments.net/reader034/viewer/2022042813/54815524b4795937578b48b0/html5/thumbnails/41.jpg)
Stratio Crossdata
An efficient distributed datahub with batch and streaming query capabilities
Daniel Higuero
Alvaro Agea