http://www.tibco.com
Global Headquarters
3303 Hillview Avenue
Palo Alto, CA 94304
Tel: +1 650-846-1000
Toll Free: 1 800-420-8450
Fax: +1 650-846-1005
© 2006, TIBCO Software Inc. All rights
reserved. TIBCO, the TIBCO logo, The
Power of Now, and TIBCO Software are
trademarks or registered trademarks of
TIBCO Software Inc. in the United States
and/or other countries. All other product and
company names and marks mentioned in
this document are the property of their
respective owners and are mentioned for
identification purposes only.
Accelerator for Apache Spark
Interface Specification
23 August 2016
Version 1.0.0
This document outlines the interface specification for inbound and outbound
messages for the Accelerator for Apache Spark.
Revision History
Version Date Author Comments
0.1 10/04/2016 Piotr Smolinski Initial version
0.2 18/04/2016 Piotr Smolinski
0.3 06/06/2016 Piotr Smolinski
1.0.0 23/08/2016 Piotr Smolinski Version for release
Copyright Notice
COPYRIGHT© 2016 TIBCO Software Inc. This document is unpublished and the foregoing notice is
affixed to protect TIBCO Software Inc. in the event of inadvertent publication. All rights reserved. No
part of this document may be reproduced in any form, including photocopying or transmission
electronically to any computer, without prior written consent of TIBCO Software Inc. The information
contained in this document is confidential and proprietary to TIBCO Software Inc. and may not be used
or disclosed except as expressly authorized in writing by TIBCO Software Inc. Copyright protection
includes material generated from our software programs displayed on the screen, such as icons, screen
displays, and the like.
Trademarks
Technologies described herein are either covered by existing patents or patent applications are in
progress. All brand and product names are trademarks or registered trademarks of their respective
holders and are hereby acknowledged.
Confidentiality
The information in this document is subject to change without notice. This document contains
information that is confidential and proprietary to TIBCO Software Inc. and may not be copied,
published, or disclosed to others, or used for any purposes other than review, without written
authorization of an officer of TIBCO Software Inc. Submission of this document does not represent a
commitment to implement any portion of this specification in the products of the submitters.
Content Warranty
The information in this document is subject to change without notice. THIS DOCUMENT IS PROVIDED
"AS IS" AND TIBCO MAKES NO WARRANTY, EXPRESS, IMPLIED, OR STATUTORY, INCLUDING
BUT NOT LIMITED TO ALL WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A
PARTICULAR PURPOSE. TIBCO Software Inc. shall not be liable for errors contained herein or for
incidental or consequential damages in connection with the furnishing, performance or use of this
material.
For more information, please contact:
TIBCO Software Inc.
3303 Hillview Avenue
Palo Alto, CA 94304
USA
Table of Contents
1 Preface
  1.1 Purpose of Document
  1.2 Scope
  1.3 Referenced Documents
2 Event Capture and Emission (Kafka Messages)
  2.1 Transaction
  2.2 Notification
3 Runtime State Maintenance (HBase)
4 Runtime Dashboard (LiveView DataMart)
  4.1 Transaction
  4.2 TransactionItems
  4.3 StoreSummary
  4.4 ModelSummary
  4.5 WallClock
5 Data Collection (HDFS Structures)
  5.1 StreamBase to Flume
  5.2 Avro to Parquet
6 Data Access (REST/HTTP)
  6.1 Service Layer
    6.1.1 Text format
    6.1.2 SBDF format
  6.2 /categories
  6.3 /ranges
  6.4 /transactions
  6.5 /pivot
  6.6 /sql
  6.7 /etl
    6.7.1 /etl (GET)
    6.7.2 /etl (POST)
    6.7.3 /etl/{jobId} (GET)
    6.7.4 /etl/{jobId}/messages (GET)
    6.7.5 /etl/{jobId}/events (GET)
    6.7.6 /etl/{jobId} (DELETE)
  6.8 /models (training)
    6.8.1 /models/train (GET)
    6.8.2 /models/train (POST)
    6.8.3 /models/train/{jobId} (GET)
    6.8.4 /models/train/{jobId}/messages (GET)
    6.8.5 /models/train/{jobId}/events (GET)
    6.8.6 /models/train/{jobId} (DELETE)
  6.9 /models (results and metadata)
    6.9.1 /models/results
    6.9.2 /models/roc
    6.9.3 /models/varimp
7 Data Access (SQL)
  7.1 titems
  7.2 stores
  7.3 customer_demographic_info
  7.4 customerid_segments
  7.5 categories
  7.6 ranges
8 POJO Layout and Configuration Deployment
  8.1 HDFS Layout
    8.1.1 hdfs://demo.sample/apps/demo/models/pojo
    8.1.2 hdfs://demo.sample/apps/demo/models/results
    8.1.3 hdfs://demo.sample/apps/demo/models/roc
    8.1.4 hdfs://demo.sample/apps/demo/models/varimp
    8.1.5 hdfs://demo.sample/apps/demo/models/sets
  8.2 Configuration Deployment
    8.2.1 Model deployment
    8.2.2 Product SKU to category id mapping
    8.2.3 Feature mapping
    8.2.4 Model deployment procedure
1 Preface
1.1 Purpose of Document
The document describes the data exchange interfaces used by the Accelerator for Apache Spark
project.
1.2 Scope
This document outlines the following:
- Inbound report message specifications
- Outbound notification message specifications
1.3 Referenced Documents
Document Reference
Accelerator for Apache Spark Quick Start Guide
Accelerator for Apache Spark Functional Specification
Document
Accelerator for Apache Spark – Interface Specification 9
2 Event capture and emission (Kafka messages)
The eventing interface forms the basic part of the Fast Data story in the accelerator. It is good old
reactive messaging: the Event Processing layer receives and publishes messages related to the
handled process.
In the retail processing scenario the messages represent the transaction content and the resulting offers.
The input messages contain minimal information about the executed transaction: a list of items with
nominal price, applied discount and effective revenue, plus the customer identity (loyalty card number).
The output messages contain information about offers relevant to the executed transaction.
The common characteristic of the input and output event streams in the accelerator is that the number
of events per second can be huge. On the other hand, each event is generally independent of the other
events delivered or sent within a relatively small time window. At the micro scale we deal with some
process tracking, but the number of concurrently tracked processes (like customer purchases) is large.
The target of the accelerator is a resilient, horizontally scalable event processing solution. Such a
solution should:
- process the events (both CEP and ESP)
- collect the events for analytics
- apply knowledge gained in the analytics to the event processing
The demo scenario in the accelerator uses Kafka as the messaging bus and XML as the payload format.
The decision was driven by the ability to scale out messaging beyond EMS limitations. The second reason
to pick Kafka was that the product is gaining popularity and is pretty common in Big Data eventing. XML
was used as a typical data format in integration projects, but any format carrying information with the
same semantics (like JSON) would be equally good.
The major advantages of Kafka in the accelerator are:
- the ability to repeat the traffic (up to the message log retention point)
- inherent traffic partitioning (with a delivery ordering guarantee within a partition)
The messages on the Kafka bus consist of an opaque header and body. The interpretation of both fields is
a contract between publisher and consumer. In the sample implementation the header is UTF-8
multiline text holding a list of key/value pairs, where each line defines one pair and the key is separated
from the value by ':' (colon). The payload is a UTF-8 encoded XML message.
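As an illustration, a minimal Python sketch of this header convention (the function names are
illustrative, not part of the specification):

def parse_header(header_bytes):
    # One "key:value" pair per line; the first colon separates key from value.
    fields = {}
    for line in header_bytes.decode("utf-8").splitlines():
        key, _, value = line.partition(":")
        if key:
            fields[key.strip()] = value.strip()
    return fields

def build_header(fields):
    return "\n".join("%s:%s" % (k, v) for k, v in fields.items()).encode("utf-8")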
2.1 Transaction
The transaction event carries XML content defined in transactions.xsd. The schema defines the XML
structure in the http://demos.tibco.com/retail/transactions namespace.
The event can be annotated with the following header fields:
correlationId (sample: ca9ef517-8509-40be-9754-ca923344ba67)
    Message correlation id. It is a unique id used to track the process. In the implemented scenario it is
    used to correlate offers with transaction events. The presence of the correlation id allows the client
    interface to be implemented as request-reply (but the user is not forced to do that). The field can
    also be used as a duplicate detection mechanism inside the event processing application. In a
    particular case the event initiator may resend the same transaction. If a non-persistent registry of
    previously seen events contains the given correlation id, most likely the client sent it twice (for
    example as an I/O partial-failure retry), but the message was indeed delivered to the bus.
sendResponseTo (sample: Notifications:0)
    Destination topic for notification messages. The header allows sending the response messages to a
    declared destination instead of one common topic. This is useful in combination with correlationId
    for request-reply. The value may include the desired partition number.
customerId (sample: 4bf34187-ce60-482d-b2e9-ed5584e72ada)
    Optionally replicated field from the message payload. It can be used to transparently send
    messages to a given partition. A consistent customer-to-partition assignment is needed to
    guarantee the ordering of messages without distributed locking.
transactionId (sample: d1193d9b-686e-4089-877e-3a05ac9d88a3)
    Optionally replicated field from the message payload.
The header in Kafka is intended to be small and quickly parsable. It is used to quickly filter out irrelevant
messages, route messages to the target processor without parsing the payload, or even pass a
cryptographic signature of the content. It is very similar to JMS headers; the major difference is that
Kafka does not impose any particular interpretation of the binary content.
A particular problem can be the processing of events in strict order. In Big Data and Fast Data
systems this cannot be done globally. Kafka guarantees ordered message delivery within a given
topic partition. With a large number of partitions it is possible to scale out the processing so that each
consumer processes its relevant messages in strict order. The challenge, though, is sending the
messages for a given key to the correct partition. This is the message sender's responsibility. Another
problem that may arise is that a given message may be related to two independent processes, each one
identified by a different key. In such a case the message should be sent to two topics with specific
partition assignments. This of course complicates the sender-side design. Placing a façade gateway
component hiding the routing logic or message duplication seems to be a good idea.
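A minimal sketch of such sender-side routing, assuming the kafka-python client; the topic name,
partition count and broker address are illustrative:

import hashlib
from kafka import KafkaProducer

NUM_PARTITIONS = 8  # assumed partition count of the transaction topic

def partition_for(customer_id):
    # Stable hash so that all messages for one customer land in the same
    # partition, preserving per-customer ordering without distributed locking.
    digest = hashlib.md5(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

producer = KafkaProducer(bootstrap_servers="demo.sample:9092")
customer_id = "4bf34187-ce60-482d-b2e9-ed5584e72ada"
xml_payload = b"<transaction>...</transaction>"  # full document per transactions.xsd
producer.send("Transactions", value=xml_payload,
              partition=partition_for(customer_id))
producer.flush()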
The message payload for a transaction is XML. The XML content contains the full message information,
and the header fields may replicate some of it to provide technical hints.
The namespace is: http://demos.tibco.com/retail/transactions
transaction
    Root element.
transaction/transactionId (sample: d1193d9b-686e-4089-877e-3a05ac9d88a3)
    Unique transaction identification. In the demo it is assumed that all messages with the same
    transaction id are identical, therefore it is safe to keep only one copy.
transaction/customerId (sample: 4bf34187-ce60-482d-b2e9-ed5584e72ada)
    Unique process identifier. The field groups related messages and it should be used to process
    messages in the order of production.
transaction/storeId
    Transaction originator identifier.
transaction/transactionTime (sample: 2015-03-20T20:18:32Z)
    ISO-8601 compatible time represented as xsd:dateTime.
transaction/transactionLines
    Container element for the list of items. A transaction consists of an arbitrary number of items and
    the items on the list can be repeated.
transaction/transactionLines/transactionLine
    Single transaction line container.
transaction/transactionLines/transactionLine/productSKU (sample: 101231)
    Stock keeping unit code for the purchased product. Identifies the item in the inventory.
transaction/transactionLines/transactionLine/quantity (sample: 1)
    Amount of items purchased.
transaction/transactionLines/transactionLine/nominalPrice (sample: 145.00)
    Nominal price as known to the selling point at the moment of transaction.
transaction/transactionLines/transactionLine/purchasePrice (sample: 130.00)
    Actual purchase price after discount(s).
transaction/transactionLines/transactionLine/discount
    Optional discount information.
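A representative transaction message assembled from the elements above (values are illustrative; the
authoritative structure, ordering and cardinalities are defined in transactions.xsd):

<transaction xmlns="http://demos.tibco.com/retail/transactions">
  <transactionId>d1193d9b-686e-4089-877e-3a05ac9d88a3</transactionId>
  <customerId>4bf34187-ce60-482d-b2e9-ed5584e72ada</customerId>
  <storeId>storeSF</storeId>
  <transactionTime>2015-03-20T20:18:32Z</transactionTime>
  <transactionLines>
    <transactionLine>
      <productSKU>101231</productSKU>
      <quantity>1</quantity>
      <nominalPrice>145.00</nominalPrice>
      <purchasePrice>130.00</purchasePrice>
    </transactionLine>
  </transactionLines>
</transaction>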
2.2 Notification
Notifications form the output stream of Fast Data events. The notification messages, similarly to
transactions, are XML messages sent via a Kafka topic. The message exchange can be organized on
the client side into a request-reply pattern, but on the XML level the messages are different.
correlationId (sample: ca9ef517-8509-40be-9754-ca923344ba67)
    Message correlation id propagated from the transaction message. It is used by the client to
    correlate the notification with the sent transaction message. Usually it should be combined with the
    sendResponseTo header.
customerId (sample: 4bf34187-ce60-482d-b2e9-ed5584e72ada)
    Replicated field from the message payload. The field can be used to route the message to the
    topic partition specific to the customer.
transactionId (sample: d1193d9b-686e-4089-877e-3a05ac9d88a3)
    Replicated field from the message payload. It can be used to track messages related to the given
    transaction without parsing the actual payload.
In the current implementation of the accelerator there are not many assumptions about the consumer
implementation. The applied design with the correlationId and sendResponseTo fields allows the
consumer to specify the expected notification delivery destination. This in turn enables the consumer to
be either asynchronous, with distinct channels for transactions and notifications, or synchronous, with a
low-latency answer for service orchestration.
The synchronous request-reply pattern requires further consideration. Kafka topic management is a
heavy operation, therefore it is impractical to create a notification topic for each transaction or even each
client. An API gateway like TIBCO APIX seems to be a perfect fit for such a service.
As with the transaction, the message payload for a notification is XML. The XML content contains the
offer information, and the header fields may replicate some of it to provide technical hints.
The namespace is: http://demos.tibco.com/retail/notifications
notification
    Root element.
notification/transactionId (sample: d1193d9b-686e-4089-877e-3a05ac9d88a3)
    Unique transaction identification. Note that due to technical problems and recovery there could be
    several notification messages for a given transaction. This is unlikely to happen under regular
    processing conditions.
notification/customerId (sample: 4bf34187-ce60-482d-b2e9-ed5584e72ada)
    Unique process identifier. The field groups related messages and it should be used to process
    messages in the order of production.
notification/propensities
    Container element for the list of propensities/offers. A single notification may contain multiple
    responses, letting the consumer of the message pick the best one. There could also be no offer at
    all. In the demo the event processing layer does the best-response selection, so there is at most
    one offer line. The design allows pipelining the processing of messages, sending the ultimate
    response to the sendResponseTo destination in the last stage.
notification/propensities/propensity
    Single offer container.
notification/propensities/propensity/category (sample: Hats)
    Actual category/offer name.
notification/propensities/propensity/propensity (sample: 0.3)
    Propensity score.
The notification message content is very simple. It should contain just the essential facts produced by
the event processing layer.
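A representative notification message assembled from the elements above (values are illustrative):

<notification xmlns="http://demos.tibco.com/retail/notifications">
  <transactionId>d1193d9b-686e-4089-877e-3a05ac9d88a3</transactionId>
  <customerId>4bf34187-ce60-482d-b2e9-ed5584e72ada</customerId>
  <propensities>
    <propensity>
      <category>Hats</category>
      <propensity>0.3</propensity>
    </propensity>
  </propensities>
</notification>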
3 Runtime state maintenance (HBase)
The core aspect of any event processing solution is processing incoming events in the context of
previously captured events. In ultra-low-latency solutions, like StreamBase algorithmic trading, the state
is kept locally in the process memory. In the retail scenario the customer's numerical description is
derived from the customer history. As the process may disappear at any time and there is no guarantee
that a given process handles all messages related to a particular customer, memory-only storage is
impractical. In a scalable solution the data should be kept shared. For Fast Data the shared storage
access must be immediate.
In the accelerator the event flow application needs to read the customer history and update it in the
shortest possible time. HBase looks like a perfect fit for such a use case. The main advantages of HBase
in the context of the retail accelerator are:
- scalable constant-cost customer data access
- flexible schema
- inherent data appending
In the accelerator the HBase interaction is limited to a single read and update by primary key. The
database keeps customer transaction records in the Customers table, in the form of one JSON message
per entry. The table is defined with a single column family, Transactions, with a history size of 200
versions. That means up to 200 transactions are kept at no additional cost. The query is always made by
primary key, the binary representation of the customer id.
In order to access the database shell, go to the HBase bin directory and execute the following
command:
[demo@demo bin]$ cd /opt/java/hbase-1.2.0/bin/
[demo@demo bin]$ ./hbase shell
The Customers table was created with the following statement:
create 'Customers', { NAME => "Transactions", VERSIONS => 200 }
To inspect the table run:
describe 'Customers'
To remove all stored data:
truncate 'Customers'
Customer data can be accessed with the following call:
hbase(main):004:0> get 'Customers', "1f969758-3a90-4fcc-ac18-ea0247823f48"
COLUMN CELL
Transactions:transactions timestamp=1460826706531,
value={"transactionId":"9132a1b3-c332-4a6e-926a-83696b6c1f9b","transactionDate":"2015-12-30
04:14:10.000+0000","items":[{"productSku":"100017","quantity":1,"price":38.15,"revenue":38.1
5},{"productSku":"100017","quantity":1,"price":38.15,"revenue":38.15},{"productSku":"100017"
,"quantity":1,"price":38.15,"revenue":38.15},{"productSku":"100991","quantity":1,"price":35.
7,"revenue":35.7},{"productSku":"105104","quantity":1,"price":70.19,"revenue":70.19}]}
1 row(s) in 0.0330 seconds
To get the 5 most recently recorded transactions use a parameterized call:
hbase(main):005:0> get 'Customers', "1f969758-3a90-4fcc-ac18-ea0247823f48", {VERSIONS=>5}
COLUMN CELL
Transactions:transactions timestamp=1460826706531,
value={"transactionId":"9132a1b3-c332-4a6e-926a-83696b6c1f9b","transactionDate":"2015-12-30
04:14:10.000+0000","items":[{"productSku":"100017","quantity":1,"price":38.15,"revenue":38.1
5},{"productSku":"100017","quantity":1,"price":38.15,"revenue":38.15},{"productSku":"100017"
,"quantity":1,"price":38.15,"revenue":38.15},{"productSku":"100991","quantity":1,"price":35.
7,"revenue":35.7},{"productSku":"105104","quantity":1,"price":70.19,"revenue":70.19}]}
Transactions:transactions timestamp=1460826115580,
value={"transactionId":"d2d29d60-8211-4cbf-b643-445b23eac317","transactionDate":"2015-10-18
20:26:12.000+0000","items":[{"productSku":"100017","quantity":1,"price":38.15,"revenue":38.1
5},{"productSku":"100017","quantity":1,"price":38.15,"revenue":38.15},{"productSku":"100017"
,"quantity":1,"price":38.15,"revenue":38.15},{"productSku":"100017","quantity":1,"price":38.
15,"revenue":38.15},{"productSku":"100289","quantity":1,"price":90.0,"revenue":90.0},{"produ
ctSku":"100442","quantity":1,"price":149.49,"revenue":149.49},{"productSku":"100605","quanti
ty":1,"price":128.27,"revenue":128.27},{"productSku":"100605","quantity":1,"price":128.27,"r
evenue":128.27},{"productSku":"100605","quantity":1,"price":128.27,"revenue":128.27},{"produ
ctSku":"100605","quantity":1,"price":128.27,"revenue":128.27},{"productSku":"100605","quanti
ty":1,"price":128.27,"revenue":128.27},{"productSku":"100870","quantity":1,"price":43.74,"re
venue":43.74},{"productSku":"101022","quantity":1,"price":139.12,"revenue":139.12},{"product
Sku":"101022","quantity":1,"price":139.12,"revenue":139.12},{"productSku":"102769","quantity
":1,"price":99.0,"revenue":99.0},{"productSku":"107191","quantity":1,"price":261.0,"revenue"
:261.0}]}
Transactions:transactions timestamp=1460825844625,
value={"transactionId":"b9ebb575-788b-42a3-8d4e-f8c9a7533606","transactionDate":"2015-09-15
17:23:53.000+0000","items":[{"productSku":"100017","quantity":1,"price":38.15,"revenue":38.1
5},{"productSku":"100024","quantity":1,"price":22.5,"revenue":22.5},{"productSku":"100605","
quantity":1,"price":128.27,"revenue":128.27},{"productSku":"100605","quantity":1,"price":128
.27,"revenue":128.27},{"productSku":"100991","quantity":1,"price":35.7,"revenue":35.7},{"pro
ductSku":"101022","quantity":1,"price":139.12,"revenue":139.12},{"productSku":"103318","quan
tity":1,"price":23.6,"revenue":23.6}]}
Transactions:transactions timestamp=1460825289248,
value={"transactionId":"f4b9ea05-46d8-45a4-9652-8e6bc8731b6f","transactionDate":"2015-07-10
18:02:08.000+0000","items":[{"productSku":"100017","quantity":1,"price":38.15,"revenue":38.1
5},{"productSku":"100017","quantity":1,"price":38.15,"revenue":38.15},{"productSku":"100017"
,"quantity":1,"price":38.15,"revenue":38.15},{"productSku":"100017","quantity":1,"price":38.
15,"revenue":38.15},{"productSku":"100017","quantity":1,"price":38.15,"revenue":38.15},{"pro
ductSku":"100017","quantity":1,"price":38.15,"revenue":38.15},{"productSku":"100024","quanti
ty":1,"price":22.5,"revenue":22.5},{"productSku":"100273","quantity":1,"price":10.99,"revenu
e":10.99},{"productSku":"100352","quantity":1,"price":75.0,"revenue":75.0},{"productSku":"10
2863","quantity":1,"price":9.99,"revenue":9.99},{"productSku":"106997","quantity":1,"price":
12.34,"revenue":12.34}]}
Transactions:transactions timestamp=1460825234330,
value={"transactionId":"84c03a7b-b373-4ee0-89f7-0a953a06cd11","transactionDate":"2015-07-03
21:07:40.000+0000","items":[{"productSku":"100017","quantity":1,"price":38.15,"revenue":38.1
5},{"productSku":"100017","quantity":1,"price":38.15,"revenue":38.15},{"productSku":"100017"
,"quantity":1,"price":38.15,"revenue":38.15},{"productSku":"100024","quantity":1,"price":22.
5,"revenue":22.5},{"productSku":"100024","quantity":1,"price":22.5,"revenue":22.5},{"product
Sku":"100605","quantity":1,"price":128.27,"revenue":128.27}]}
5 row(s) in 0.0270 seconds
The table structure:
transactionId (sample: 84c03a7b-b373-4ee0-89f7-0a953a06cd11)
    The transaction id. It is used to remove duplicates.
transactionDate (sample: 2015-07-03 21:07:40.000+0000)
    Transaction date and time.
items
    List of transaction items.
items/productSku (sample: 100017)
    Stock keeping unit code for the purchased product. Identifies the item in the inventory.
items/quantity (sample: 1)
    Amount of items purchased.
items/price (sample: 38.15)
    Nominal price as known to the selling point at the moment of transaction.
items/revenue (sample: 38.15)
    Actual purchase price after discount(s).
items/discount
    Optional discount information.
Notice that the transaction line category id is absent. The reason is that a given transaction may be used
for feature building with a product-to-category mapping different from the one active when the transaction
was captured.
The event processing application adds supplementary logic to handle duplicates and out-of-order
transactions. In the demo it is pretty common to resend the same transactions again and again. In such a
case the transaction is interpreted as a new one, and unique copies are extracted from the transactions
with a timestamp before the current one and no older than a predefined period (270 days).
Note: the entry channel assumes that if the transaction id is the same, the transaction content is also
identical.
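A minimal Python sketch of the read-by-primary-key access pattern, assuming the happybase client
library and an HBase Thrift gateway on demo.sample (both assumptions; the accelerator's own
integration may differ):

import json
import happybase

connection = happybase.Connection("demo.sample")
table = connection.table("Customers")

row_key = b"1f969758-3a90-4fcc-ac18-ea0247823f48"
# cells() returns up to `versions` historical values of a single cell, newest
# first, which maps directly onto the 200-version transaction history.
history = [json.loads(cell)
           for cell in table.cells(row_key, b"Transactions:transactions", versions=5)]
for transaction in history:
    print(transaction["transactionId"], transaction["transactionDate"])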
4 Runtime dashboard (LiveView DataMart)
For runtime operational monitoring the solution uses LiveView Web. The web dashboard uses the
real-time LiveView DataMart server hosting the data that describes the current state of the event
processing layer. Currently the LVDM keeps the recent transactions and provides aggregations on them.
4.1 Transaction
The Transaction table contains basic information about processed transactions with offer acceptance
tracking.
transactionId (sample: 84c03a7b-b373-4ee0-89f7-0a953a06cd11)
    The transaction id. It is used to remove duplicates.
transactionTimestamp (sample: 2015-07-03 21:07:40.000+0000)
    Transaction date and time.
customerId (sample: 1f969758-3a90-4fcc-ac18-ea0247823f48)
    Customer loyalty card number.
storeId (sample: storeSF)
    Transaction origin identification.
itemCount (sample: 5)
    Number of items included in the transaction.
latitude / longitude (sample: 30.4151482, -97.6721351)
    Geographical position of the transaction derived from the originating store.
state (sample: TX)
    State derived from the originator.
zipCode (sample: 78753)
    Zip code derived from the originator.
region (sample: SW)
    Region code derived from the originator.
recommendation (sample: Backpacks)
    The offer name created for this customer and transaction.
recommendingModel (sample: backpacks-2.5.3-20150701)
    Model reference for model effectiveness tracking.
upsellSuccess (sample: false)
    Recommendation result.
For each transaction, a record is published to the Transaction table together with the winning offer
and model (if any). The SB application tracks all created offers. If the same customer purchases an
item in the recommended category within the opportunity window (90 days), upsellSuccess is set to
true. Otherwise, when the opportunity window closes (time detected from incoming transactions),
upsellSuccess is set to false.
4.2 TransactionItems
For overview purposes the transaction content is stored in a child table. The table contains details of the
executed transaction for operational inspection.
productSku (sample: 100017)
    Stock keeping unit code for the purchased product. Identifies the item in the inventory.
quantity (sample: 1)
    Amount of items purchased.
price (sample: 38.15)
    Nominal price as known to the selling point at the moment of transaction.
revenue (sample: 38.15)
    Actual purchase price after discount(s).
discount
    Optional discount information.
time
    Transaction time.
latitude / longitude (sample: 30.4151482, -97.6721351)
    Geographical position of the transaction derived from the originating store.
state (sample: TX)
    State derived from the originator.
zipCode (sample: 78753)
    Zip code derived from the originator.
region (sample: SW)
    Region code derived from the originator.
category
    The category id recognized during transaction processing.
transactionId (sample: 84c03a7b-b373-4ee0-89f7-0a953a06cd11)
    The transaction id. Foreign key to the Transaction table.
itemId (sample: 2)
    Sequence number for the transaction line.
seqNum (sample: 1443)
    Sequential number defining the transaction item removal sequence.
The table is keyed by the transactionId/itemId pair.
The table implements a graceful cleanup mechanism. When high memory utilization is detected, data
removal is triggered. The rows are selected for removal in the sequence of insert events. Removal of a
transaction item also triggers removal of the parent transaction. The side effect is that after data removal
there may be some orphan transaction items left.
4.3 StoreSummary
The store summary table is derived from the transaction items. The table contains information used to
display store-related dashboard information. The actual summary is by store and by category.
latitude / longitude (sample: 30.4151482, -97.6721351)
    Geographical position of the transaction derived from the originating store.
state (sample: TX)
    State derived from the originator.
zipCode (sample: 78753)
    Zip code derived from the originator.
region (sample: SW)
    Region code derived from the originator.
numItems
    Number of items purchased.
itemSaleRank
    Rank for the number of items.
totalPrice
    Total value sold for the given category in the store.
priceRank
    Store rank for the category value.
numTransactions
    Number of transactions for the given category.
transactionRank
    Store rank for the number of transactions.
category (sample: Backpacks)
    Category of the entry.
storeId
    Store id of the entry.
4.4 ModelSummary
The model summary contains information about deployed models. The table is populated from the
model loading task result.
status (sample: success / failure)
    Model status.
message
    Descriptive message reported by the model component.
modelName
    Model name as provided in the metadata.
modelUrl
    URL where the model content can be found.
modelVersion
    Model version.
cutOff (sample: 0.42)
    The cut-off threshold used for binary discrimination.
categoryId (sample: Backpacks)
    Category id that the model produces offers for.
offerName (sample: Summer Time 2015)
    Descriptive offer name as presented to the customer.
description
    Model metadata descriptive information.
validFrom (sample: 2015-05-01)
    Model validity boundary.
validTo (sample: 2015-06-30)
    Model validity boundary.
filter
    Free-text filter defined as a StreamBase expression. This can be used to limit the model to a
    particular region or store.
time
seqNum
4.5 WallClock
The wall clock is intended to show the most recent transaction timestamp. This timestamp is considered
the observable time. The user interface can connect to this table in order to show the timestamp known
to the solution. An important assumption about the accelerator is that the system is distributed and there
is no monotonic global clock. The global time is instead derived from observation of the world,
transaction timestamps in particular. This design does not exclude autonomous schedulers that may
push the time forward, but focuses on the observable nature of global time.
The table structure is simple:
id (sample: WallClock)
    Primary key; there is only one row.
time
    Timestamp. The highest reported time so far.
5 Data collection (HDFS structures)
Processing a large stream of events is only the beginning of the Fast Data to Big Data story. The
observed events are needed both to draw conclusions about the current system behaviour and to predict
the evolution of the observed processes. The data has to be collected and prepared for massively
parallel processing. The main challenge is that the semantics of event processing and data processing
do not match. Event processing requires fast access to identified pieces of data, while data processing is
all about transforming datasets. The events have to be stored in storage that supports massively parallel
processing; in the typical case it is HDFS. The problem with HDFS is that while it is perfect for dataset
access, it is terribly bad at incremental data appending.
The major problems of storing the data in HDFS are:
- The expensive synchronization/buffer flush operation; it is required to acknowledge that the data is
  safely stored
- The large cost of maintaining and processing many small files; the data consumers should not
  attempt to read files that are still being written to
- Format impedance; some data formats are better for events, some are better for data
The event data collection should provide an efficient mechanism for data appending. The data
availability latency should be relatively low, but for typical data analytics a 1-hour latency is perfectly
acceptable.
The transition from events to data is done using a staging approach. It allows storing the data efficiently
while not blocking the event processing layer from doing its business. The idea behind this approach is
that each layer in the pipeline processes the data in larger chunks. In the accelerator the pipeline works
as follows:
- Fine-grained events are delivered to the event processing layer via Kafka (XML).
- The event processing layer builds facts about the operation (data enrichment, model processing,
  offer preparation).
- The facts are grouped in small batches and at the end of each batch sent to a Kafka topic. A
  reasonable batching could be 100 events or 5 seconds (whichever happens first); note that in the
  first release there is no batching. Once data is sent out, the last offset in the batch is
  acknowledged. The data used for transit is JSON. (A sketch of such a batching loop follows this
  list.)
- The JSON messages are collected by Flume and converted from JSON to Avro. Avro is a binary
  data representation with semantics similar to JSON. The advantage over JSON is that the data
  units are more compact and can be quickly rendered and parsed.
- Flume appends the Avro records in batches to files in HDFS using time-based partitioning. The
  files are completed and closed in 10-minute intervals. This way analytics processes that can
  accept multiple relatively small data files can already start processing.
- The data saved in Avro is regularly processed by ETL jobs (implemented with Spark) that
  normalize the data, remove duplicates, enrich it for the target tasks (by adding category ids to
  items) and store the data in Parquet. Parquet is another JSON-like binary serialization format, but
  optimized for large datasets. In particular it can optimize data access when only a subset of the
  fields is needed.
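A minimal sketch of the size-or-time batching mentioned above (illustrative only; as noted, the first
release does not batch, and the source/sink interfaces are hypothetical):

import time

BATCH_SIZE = 100
BATCH_SECONDS = 5.0

def run_batcher(source, sink):
    # Flush when 100 events are buffered or 5 seconds have elapsed, whichever
    # happens first; acknowledge the last offset only after a successful send.
    batch = []
    deadline = time.monotonic() + BATCH_SECONDS
    for offset, fact in source:          # source yields (offset, JSON fact) pairs
        batch.append((offset, fact))
        if len(batch) >= BATCH_SIZE or time.monotonic() >= deadline:
            sink.send([f for _, f in batch])
            sink.acknowledge(batch[-1][0])
            batch = []
            deadline = time.monotonic() + BATCH_SECONDS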
This implementation allows the event processing layer to operate on the data at full speed. The data is
then delivered to the subsequent layers. For safe data storage, the Flume jobs can be redundant. They
will generate duplicates of the data, but since the existence of duplicates is acknowledged and
addressed in the ETL step, the problem is immediately solved.
This staging approach has an additional advantage. The pipeline is functional and the previous stage's
data is retained for some time. This can be, for example, 24 hours for Kafka messages in the Flume topic
and 3 months for the Avro files. If any bug is discovered in the flow, the data can always be reprocessed.
Of course the messages sent out to third parties have to be acknowledged as results, but, for example,
some missing cross-referencing entries can be easily fixed by rerunning the job. A similar concept is
used in the Lambda Architecture.
Partitioning the Avro files by the event timestamp optimizes the ETL process. In the absence of special
circumstances, the ETL process can be executed on the two most recent partitions, significantly
reducing the processing.
5.1 StreamBase to Flume
The event processing application produces JSON messages. The messages contain concatenated
JSON payloads compliant with the target Avro schema.
The header fields contain the routing information:
month (sample: 2015-08-01)
    Month extracted from the event timestamp information common to all events in the payload.
It is important to note that for a given batch the event processing may produce more than one output
message. This may happen, for example, around midnight. Because the timestamp information is
generated by the event initiator, there is no guaranteed temporal ordering of the events. Therefore an
event with timestamp 2015-09-01T00:00:05Z may be followed by an event with timestamp
2015-08-30T23:59:40Z.
The data contained in the message payload is represented as JSON.
predictions
    List of model answers. Here it contains at most one element, the one that was sent to the
    customer.
predictions/modelName (sample: Hats;2015 Spring;)
    Arbitrary text describing the analytic model generating the result.
predictions/categoryId (sample: Hats)
    Supplementary information describing the offer.
predictions/prediction (sample: 0.3)
    Prediction score.
customerId (sample: 4bf34187-ce60-482d-b2e9-ed5584e72ada)
    Loyalty card number.
storeId (sample: store-123)
    Point of sale where the event happened.
transaction
    Transaction data container.
transaction/transactionId (sample: d1193d9b-686e-4089-877e-3a05ac9d88a3)
    The transaction id. It can later be used to remove duplicate reports.
transaction/transactionDate (sample: 1427746712000)
    Transaction date as a timestamp.
transaction/items
    List of transaction items.
transaction/items/productSKU (sample: 101231)
    Stock keeping unit code for the purchased product. Identifies the item in the inventory.
transaction/items/quantity (sample: 1)
    Amount of items purchased.
transaction/items/price (sample: 145.00)
    Nominal price as known to the selling point at the moment of transaction.
transaction/items/revenue (sample: 130.00)
    Actual purchase price after discount(s).
transaction/items/discount
    Optional discount information.
This list can be extended to support richer data structures. In particular one may want to capture more
metadata about models, all models' results (including those that did not pass the acceptance threshold),
the resolved category ids, metadata about customer featurization, and so on.
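A representative message assembled from the fields above (values are illustrative; field casing follows
the table):

{
  "predictions": [
    {"modelName": "Hats;2015 Spring;", "categoryId": "Hats", "prediction": 0.3}
  ],
  "customerId": "4bf34187-ce60-482d-b2e9-ed5584e72ada",
  "storeId": "store-123",
  "transaction": {
    "transactionId": "d1193d9b-686e-4089-877e-3a05ac9d88a3",
    "transactionDate": 1427746712000,
    "items": [
      {"productSKU": "101231", "quantity": 1, "price": 145.00, "revenue": 130.00}
    ]
  }
}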
The event JSON payload is then taken by Flume, which appends the data to the Avro binary log. Flume
opens the files using a naming scheme similar to the following:
hdfs://hdfsmaster:8020/apps/data/transactionLog/month=2015-03-01/tr.1427746712000.avro
The path for each event is built from the date range the batch belongs to (month=2015-03-01) and the
file creation timestamp (1427746712000). In cases where Flume is redundant, it is useful to add distinct
prefixes for each copy to avoid (very unlikely) conflicts.
There can be many such files, as a new file is opened in each partition every 10 minutes. While the file is
being written to, it has a slightly different name:
hdfs://hdfsmaster:8020/apps/data/transactionLog/month=2015-03-01/_tr.1427746712000.avro.tmp
This format uses an HDFS convention indicating that such a file is not yet ready for reading; it will be
skipped by ETL processes. Importantly, if a Flume agent is killed abruptly, the file won't be renamed
automatically, so in order not to lose the data, an unplanned shutdown of Flume agents should be
followed by residual file assessment and renaming.
5.2 Avro to Parquet
Once the data is securely written to HDFS, it can be further processed. In the accelerator scenario the
data is converted to a Parquet structure.
In this particular case, the data structure is flattened to a list of transaction items. The ETL job also
groups the data to minimize the number of files, so further processing can operate with a reduced
number of tasks.
The result file paths are similar to the following:
hdfs://hdfsmaster:8020/apps/demo/transactions/month=2015-03-01/part-r-00000-0dff8a46-18f4-4353-9b8c-2d0d7eea2129.gz.parquet
This structure allows executing a selection by month without reading all the files. In addition there would
be one aggregation task per month. In really large data clusters it would actually be convenient to have
several files per month, but it all depends on the characteristics of the analytic process.
The analytic process in the accelerator currently focuses on model training using past data. Therefore
the ETL process extracts the required fields and does local category enrichment. The Parquet files
provide the following structure:
customerId (sample: 4bf34187-ce60-482d-b2e9-ed5584e72ada)
    Customer identity (loyalty card).
transactionId (sample: d1193d9b-686e-4089-877e-3a05ac9d88a3)
    Transaction id.
transactionDate
    Transaction timestamp.
productSku (sample: 101231)
    Product id in the inventory.
categoryId (sample: Hats)
    Category id as resolved in the ETL. Note that this can be different from the category id resolved
    during event processing. A single item can potentially also be represented as several categories.
quantity (sample: 1)
    Number of items purchased.
price (sample: 130.00)
    Effective price.
In addition the files are partitioned by a month field holding the yyyy-mm-dd value of the first day of the
month.
Because all transaction items share the same transaction date, they are guaranteed to be stored in the
same partition. This way transaction-level aggregation can be done in place without distributed grouping
by key.
The customer featurization process in the data analytics relies on the customer history. Because the
history is accessed in preselected date ranges, the partitioned Parquet files prevent the transformation
process from reading files with irrelevant data.
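A minimal PySpark sketch of this Avro-to-Parquet step (a sketch only, assuming Spark with the
spark-avro package; the paths follow the examples above, while the schema handling and month
derivation are simplified):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("avro-to-parquet-etl").getOrCreate()

raw = spark.read.format("avro").load(
    "hdfs://hdfsmaster:8020/apps/data/transactionLog/")

# Deduplicate reports of the same transaction, then flatten to one row
# per transaction item.
transactions = (raw
    .select("customerId",
            F.col("transaction.transactionId").alias("transactionId"),
            F.col("transaction.transactionDate").alias("transactionDate"),
            F.col("transaction.items").alias("items"))
    .dropDuplicates(["transactionId"]))

items = (transactions
    .select("customerId", "transactionId", "transactionDate",
            F.explode("items").alias("item"))
    .select("customerId", "transactionId", "transactionDate",
            F.col("item.productSKU").alias("productSku"),
            F.col("item.quantity").alias("quantity"),
            F.col("item.price").alias("price"))
    # First day of the month of the transaction, used as the partition key.
    .withColumn("month", F.trunc(
        F.to_date((F.col("transactionDate") / 1000).cast("timestamp")), "month")))

# The category enrichment (product SKU to category id mapping) would be
# joined in here; omitted for brevity.

(items.repartition("month")  # group the data to minimize the number of files
      .write.mode("append").partitionBy("month")
      .parquet("hdfs://hdfsmaster:8020/apps/demo/transactions/"))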
6 Data access (REST/HTTP)
After the data is ETL-ed into Parquet files, it lives in the Big Data cluster. That means the data is already
in an accessible form, but the amount of data does not allow it to be simply loaded into memory. The
ultimate consumer of the data (like Spotfire) must use its reduced form. The reduced form in this sense
means usable information that can be obtained relatively quickly from the full data and has a size that
can be moved around.
The demo shows examples of:
- Data sampling
- Full dataset aggregation
- Customer data pivoting
An important note about data access: the data in an HDFS cluster is intended to be processed in a
holistic way. Looking up a particular piece of data in HDFS is difficult. It is feasible, but the cost is in most
cases similar to a full table scan in a relational database.
6.1 Service layer
The data service layer is implemented as a thin REST layer on top of a Spark application. Spark saves a
lot of the cost of processing Big Data by caching common queries and intermediate results. The
application uses a grid of worker nodes to process the data efficiently.
The service exposes two major data access channels:
- a Hive-compatible Spark SQL Thrift server (JDBC/ODBC)
- a lightweight REST/HTTP interface
The Thrift interface is the regular access point for JDBC and ODBC drivers. With the Spotfire connector
for Spark SQL it is possible to access any registered tables the regular Spotfire/SQL way. The Thrift
interface is accessed with the beeline console shell, allowing one to manage the metadata, register
tables, etc.
The JDBC/ODBC interface has a significant disadvantage, which is performance. The queries are
executed at the same speed independent of the access channel, but the data transfer interface imposed
by the JDBC/ODBC design is suboptimal. Also, the internal design of the Spotfire connector translates
the ODBC result sets into the in-memory SBDF representation. The performance impact is especially
visible for fast queries or relatively long round trips between Spotfire and Spark, like Spark in AWS and
Spotfire running locally.
The REST/HTTP interface tries to solve this issue by providing an access channel with minimal network
usage. The design assumes the query is triggered by an HTTP call and the result comes back as a
simple HTTP response optimized for the recipient. This interface is much lighter than ODBC. The
downside is large queries, where the query time is significant.
The REST/HTTP interface in the accelerator uses the same pattern for data access. It is assumed a
GET query is made to a known URL resource, sometimes with query parameters used to modify the
output. The interface implementation uses smart client format resolution. The data access URLs return
two types of data: text or SBDF (Spotfire Binary Data Format). The resolution is done using the Accept
header or a URL extension.
An important limitation of this kind of data service, no matter if it is JDBC/ODBC or HTTP, is that it is not
intended to execute queries from online services. While the Spark cluster is capable of operating on
huge datasets, it is not suited for fast serving of the data, especially selecting a small piece of data. The
reason for that is the lack of query scalability, and it is an architectural constraint. That means no future
product improvements will solve this problem.
The advantage of HTTP is that it abstracts the underlying implementation from the interface. For
example, in order to get the list of all identified categories, a (cached) query is executed. The same
query could instead be hosted as a simple file on a web server returning the same results without
touching the client interface. This is not possible with a SQL-based design, where the actual data
processing is driven from the client application.
The examples use the standard command line HTTP client curl, typically available on Linux with no
additional actions. The tool is available for other systems too.
6.1.1 Text format
The text format is headerless tab-separated text. This format assumes the consumer already knows the
data structure and can interpret it. The text format is the default.
Samples using curl:
curl http://demo.sample:9060/ranges
curl http://demo.sample:9060/ranges.txt
curl -H "Accept: text/plain" http://demo.sample:9060/ranges
curl --header "Accept: text/plain" http://demo.sample:9060/ranges
The sample response:
2015-03-01 248794 15458514.52
2015-02-01 224135 16416405.89
2015-01-01 247171 20111137.93
2014-12-01 271666 23221924.09
2014-11-01 239934 20510259.33
2014-09-01 201572 14582166.06
2014-10-01 230545 18691703.79
2014-08-01 187676 11360695.33
2014-07-01 167522 7915722.47
The text format can be easily consumed in shell scripts, R programs or any text-processing capable
language.
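For example, a few lines of Python (standard library only) are enough to pull the /ranges output into
typed tuples; the column order follows section 6.3:

import urllib.request

with urllib.request.urlopen("http://demo.sample:9060/ranges") as response:
    lines = response.read().decode("utf-8").splitlines()

# Columns per section 6.3: Date, Quantity, Revenue.
ranges = [(date, int(quantity), float(revenue))
          for date, quantity, revenue in (line.split("\t") for line in lines)]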
6.1.2 SBDF format
SBDF is the native data representation format of Spotfire. It contains both column names and types.
Because this is a proprietary binary format, it is not human readable.
Samples using curl (command line HTTP client):
curl http://demo.sample:9060/ranges.sbdf > ranges.sbdf
curl -H "Accept: application/vnd.tibco.spotfire.sbdf" http://demo.sample:9060/ranges >
ranges.sbdf
curl --header "Accept: application/vnd.tibco.spotfire.sbdf" http://demo.sample:9060/ranges >
ranges.sbdf
6.2 /categories
The first aggregation query exposed by the data service is the list of categories. This query selects all
unique categoryId values found in the ETL-ed dataset (the Parquet files).
The returned data table has the following fields:
Category (Text)
    Category name.
Sample service call:
curl http://demo.sample:9060/categories
6.3 /ranges
The ranges query aggregates the sales data by month.
Date (Text)
    First day of month in yyyy-mm-dd format.
Quantity (Integer)
    Total number of items sold in this range.
Revenue (Real)
    Sum of all purchase values in the range.
Sample service call:
curl http://demo.sample:9060/ranges
6.4 /transactions
There are three transactions resource implementations, testing various ways of data selection. The
'/transactions' resource provides a sampled view of the collected data. This operation is intended to
provide quick insight into the structure of the data. The goal is to retrieve randomly selected transactions
from the collected data. The call accepts the following parameters:
r (Text)
    First day of month in yyyy-mm-dd format. The parameter can be repeated multiple times. When it
    is absent, no filtering is executed.
p (Real)
    Bernoulli sampling rate as a number between 0 and 1. A missing parameter means no Bernoulli
    sampling.
n (Integer)
    Maximum number of transactions to be retrieved.
The three implementations use various grouping and filtering strategies. The '/transactions2' resource is
currently the most performant. It filters out the Parquet files using the 'r' parameter. This implementation
guarantees the complete transaction content is delivered, but ensuring that constraint is resource
consuming.
TransactionId (Text)
    Unique transaction identifier.
TransactionDate (Text)
    Transaction date and time.
CustomerId (Text)
    Customer identifier.
ProductCode (Text)
    Product SKU.
ProductCategory (Text)
    Category for the SKU as assigned in the ETL.
Quantity (Integer)
    Number of purchased items in this transaction line.
Revenue (Real)
    Transaction line revenue.
Sample service call (note the quotes, which suppress shell interpretation of &):
curl "http://demo.sample:9060/transactions2?r=2015-05-01&n=10&p=0.01"
6.5 /pivot
This is a control endpoint for executing the pivoting of the data, used for featurization of customers for
model training. The resource triggers an underlying transformation of the flat transaction items dataset
into a pivoted table with a single record for each customer, containing flags for the requested response
categories, total quantities for all historical categories, and total purchase quantities in each month.
h (Text)
    Historical ranges as first day of month in yyyy-mm-dd format. The parameter can be repeated
    multiple times.
r (Text)
    Response ranges as first day of month in yyyy-mm-dd format. The parameter can be repeated
    multiple times.
s (Text)
    Response category names.
The operation filters the ranges based on the partitioning scheme and executes the aggregation. As the
response, a table with dynamic columns is created:
resp_<category> (Integer)
    0 or 1, depending on whether the given customer made any purchase in the response ranges for
    the category (taken from the s parameter). Columns are sorted lexicographically within this block.
<category> (Integer)
    Total number of items in the given category purchased by the customer in the predictor ranges.
    Columns are sorted lexicographically within this block. Note that depending on the range selection
    some categories may be missing.
<month> (Integer)
    12 columns with names of the form 'mmm' (three-letter month code) that contain the total
    purchases by the customer in the given month in the predictor ranges. The columns are sorted in
    month order.
Sample service calls producing txt and sbdf are:
curl "http://demo.sample:9060/pivot?s=Hats&r=2016-11-01&r=2016-12-01&h=2015-11-01&r=2015-12-
01&h=2016-01-01&r=2016-02-01&h=2016-03-01&r=2016-04-01&h=2016-05-01&r=2016-06-01&h=2016-07-
01&r=2016-08-01&h=2016-09-01&r=2016-10-01" > pivot.txt
curl "http://demo.sample:9060/pivot.sbdf?s=Hats&r=2016-11-01&r=2016-12-01&h=2015-11-
01&r=2015-12-01&h=2016-01-01&r=2016-02-01&h=2016-03-01&r=2016-04-01&h=2016-05-01&r=2016-06-
01&h=2016-07-01&r=2016-08-01&h=2016-09-01&r=2016-10-01" > pivot.sbdf
Note that the text representation does not contain column names, so it is not fully self-describing. This
resource is provided as a validation endpoint before actual model training.
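A quick validation pass over the generated file might look like this (a minimal sketch, assuming the
pivot.txt produced above):
head -n 3 pivot.txt
wc -l pivot.txt
The first command shows a few pivoted customer records; the second counts the featurized customers.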
6.6 /sql
The SQL resource is a free-form query interface. It is similar to the /pivot call in that it returns arbitrary
columns. The call may query any registered table.
Parameter Field type Description
s Text Spark SQL compatible query.
Sample service call executing a query against the ranges table:
curl "http://demo.sample:9060/sql?s=select+*+from+ranges"
6.7 /etl
Apart from read-only queries, the Spark applications may also offer transformation tools. In the
accelerator demo, the conversion from Avro to Parquet is done by the same Spark application.
6.7.1 /etl (GET)
The basic resource executes the ETL within a single request-reply cycle. When the whole process
finishes, the call returns code 200 (HTTP OK) and the text OK.
6.7.2 /etl (POST)
The POST version of the call creates a background job. The call returns immediately with the URL of
the job information in the payload.
6.7.3 /etl/{jobId} (GET)
The call returns the status of the tracked job as the text 'running' or 'done'.
6.7.4 /etl/{jobId}/messages (GET)
Returns the list of messages reported by the ETL job as plain text.
6.7.5 /etl/{jobId}/events (GET)
Returns the list of events reported by Spark.
NOTE: This call is not yet functional. The reason for having it is that explicit messages can be
reported only between Spark cluster jobs, while Spark reports events for every stage start and stop
while a job is running. These events would let the consumer track the progress of a given task even
while the main thread waits for results. The remaining problem is matching the launching thread with
the events; this has not been solved yet.
6.7.6 /etl/{jobId} (DELETE)
Untracks the given job. It is usually executed once the job reaches the state 'done'.
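A sketch of the full background-job cycle follows; the job id 42 and the exact job URL are hypothetical,
as the actual values are carried in the POST response payload:
curl -X POST "http://demo.sample:9060/etl"
curl "http://demo.sample:9060/etl/42"
curl "http://demo.sample:9060/etl/42/messages"
curl -X DELETE "http://demo.sample:9060/etl/42"
The second call would be polled until it returns 'done'; the final call untracks the finished job.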
6.8 /models (training)
The models subresource is related to model training. It covers the actual model training job and access
to the related results.
6.8.1 /models/train (GET)
This is the original model training control resource. It executes the model training cycle (pivot
customers, move data to H2O, launch models, collect results) within the duration of the HTTP request.
As such it is subject to timeouts. If the whole process finishes, the call returns code 200 (HTTP OK)
and the text OK.
Unlike the ETL job, this operation takes parameters encoded as HTTP URL query parameters.
Parameter Field type Description
m Text Reference model name, later used as directory name.
h Text Historical ranges as first day of month in yyyy-mm-dd format. The
parameter can be repeated multiple times.
r Text Response ranges as first day of month in yyyy-mm-dd format. The
parameter can be repeated multiple times.
s Text Response category names.
6.8.2 /models/train (POST)
Like the GET version, this call triggers job execution. The model training job is executed by a
background thread, and its status can be tracked using the returned job id.
The job requires parameters:
Parameter Field type Description
m Text Reference model name, later used as a directory name. A repeated
model name overwrites the results of previous executions.
h Text Historical ranges as first day of month in yyyy-mm-dd format. The
parameter can be repeated multiple times.
r Text Response ranges as first day of month in yyyy-mm-dd format. The
parameter can be repeated multiple times.
s Text Response category names.
The parameters are passed using HTTP form encoding.
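A minimal sketch of a training submission, assuming standard form encoding (the parameter values
are illustrative):
curl -X POST "http://demo.sample:9060/models/train" -d "m=Demo" -d "s=Hats" -d "h=2016-01-01" -d "h=2016-02-01" -d "r=2016-03-01"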
6.8.3 /models/train/{jobId} (GET)
The call returns the status of the tracked job as the text 'running' or 'done'.
6.8.4 /models/train/{jobId}/messages (GET)
Returns the list of messages reported by the model training job as plain text.
6.8.5 /models/train/{jobId}/events (GET)
Returns the list of events reported by Spark. See the /etl/{jobId}/events discussion above.
6.8.6 /models/train/{jobId} (DELETE)
Untracks the given job. It is usually executed once the job reaches the state 'done'.
6.9 /models (results and metadata)
Once the model training job completes, the results are available for inspection and assessment. The
training process stores the POJO class in HDFS and collects the metadata reported by the H2O model
generation jobs: the AUC model metric, the threshold values maximizing various metric functions, and
the variable importance chart.
6.9.1 /models/results
The results provide a summary of the trained models. The response is a consumable data frame in a
format adapted to the caller.
Field name Field type Description
Model Text Model name given during training job submission as m parameter.
Category Text Requested category name given during training job submission as s
parameter.
AUC Real Area under the curve: a model quality metric representing the area
under the ROC points on the fpr/tpr chart.
f1 Real Cut-off value maximizing the F1 metric.
f2 Real Cut-off value maximizing the F2 metric.
f0point5 Real Cut-off value maximizing the F0.5 metric.
accuracy, precision,
recall, specificity,
absolute_MCC,
min_per_class_accuracy
Real Cut-off values maximizing other standard metrics as returned by the
H2O model job results.
Examples:
curl "http://demo.sample:9060/models/results"
curl "http://demo.sample:9060/models/results.sbdf" > results.sbdf
6.9.2 /models/roc
H2O model training jobs for binomial supervised classification collect the model answer rates for both
the training and validation sets. The set of operating points is referred to as the ROC (Receiver
Operating Characteristic). The ROC provides the tpr/fpr values for each cut-off value. In the
accelerator the validation-set results are collected for visualization. The data is available in a
consumer-adapted format (text or sbdf).
Field name Field type Description
Model Text Model name given during training job submission as m parameter.
Category Text Requested category name given during training job submission as s
parameter.
Threshold Real Cut-off value of the ROC operating point.
tpr Real True positive rate TP/(TP+FN) (sensitivity).
fpr Real False positive rate FP/(FP+TN) (1-specificity).
tp Integer Number of detected true cases.
fp Integer Number of false cases classified as true.
tn Integer Number of correctly detected false cases.
fn Integer Number of undetected true cases.
The ROC points can be used to maximize the expected outcome from the model. In particular, the
collected values can be combined with a cost function to maximize the gain or minimize the loss.
For more details please consult Wikipedia:
https://en.wikipedia.org/wiki/Receiver_operating_characteristic
https://en.wikipedia.org/wiki/F1_score
Examples:
curl "http://demo.sample:9060/models/roc"
curl "http://demo.sample:9060/models/roc.sbdf" > roc.sbdf
6.9.3 /models/varimp
Models trained using H2O report a variable importance list. Each predictor variable is scored for its
predictive power in the model training job. The list of model predictor variable importances is returned
in the form of a table.
Field name Field type Description
Model Text Model name given during training job submission as m parameter.
Category Text Requested category name given during training job submission as s
parameter.
Variable Text Predictor variable name
Relative Importance Real Variable importance score.
Scaled Importance Real Variable importance score scaled relative to the highest score.
Percentage Real Variable importance score normalized so that all scores sum to 1.
Examples:
curl "http://demo.sample:9060/models/varimp"
curl "http://demo.sample:9060/models/varimp.sbdf" > varimp.sbdf
7 Data access (SQL)
Although the Spark component's primary access channel is REST/HTTP, it also exposes its data
structures for ad-hoc data discovery in Spotfire using the Spark SQL Adapter.
In addition to the REST channel on port 9060, the data access service exposes a Hive/Thrift-compatible
interface on port 10001. The default port would be 10000, but it conflicts with the standard StreamBase
port.
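For example, the Thrift interface can be reached with the standard Hive beeline client (a sketch; the
exact JDBC URL and database name are assumptions based on the host and port used throughout
this document):
beeline -u "jdbc:hive2://demo.sample:10001/default"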
7.1 titems
The titems table is a Parquet relation containing the ETL-ed data from the Avro files. The table contains
a flat view of the transaction items.
Field name Field type Description
customerId Text Customer unique identifier
transactionId Text Transaction unique identifier
transactionDate Timestamp Transaction date and time; same value for entries with the same
transactionId
storeId Text Store identifier; same value for entries with the same transactionId
productSku Text Product SKU for the transaction line
categoryId Text Recognized category id
quantity Integer Product quantity
price Double Total price of the item
month Text First day of month used to partition the data.
7.2 stores
The stores table contains store reference data. The dataset is converted into Parquet and exposed as a
table. The table is used to execute analyses based on spatial and regional information.
Field name Field type Description
storeNum Text Unique store identifier
street Text Store street address
state Text State code
city Text City where the store is located
zip Text Zip code of the store
latitude / longitude Double Spatial position of the store
region Text Region code of the store
The stores table is read from Parquet files stored at:
hdfs://demo.sample/apps/demo/reference/stores
7.3 customer_demographic_info
The customer_demographic_info table stores customer demographic information. This is auxiliary
information used in the transaction generation process. In the accelerator concept it is a hidden
variable; in the visualization it is used to compare observed customer behaviour against the assumed
model.
Field name Field type Description
customerId Text Unique customer identifier
name Text Customer name
age Integer Customer age used for data generation. Note: in real data collection
an age value only makes sense together with the time when it was
captured.
gender Integer Customer gender.
married Logical Marital status.
hasKids Logical Flag indicating that the customer may have a higher tendency to buy
goods for children.
incomeQuartile Integer Income level indicator.
hobby Text Customer hobby.
zip Text Customer zip code (used to select the best store)
educationalLevel Integer Customer education indicator
employment Integer Customer employment status
ownHome Logical Flag indicating that the customer owns real estate.
The table is read from Parquet files stored at:
hdfs://demo.sample/apps/demo/reference/customer_demographic_info
7.4 customerid_segments
The customerid_segments table provides precomputed customer segmentation information and
preferred store id.
Field name Field type Description
customerId Text Unique customer identifier
marketSegment Integer Customer segment.
storeNum Text Preferred store id.
The table is read from Parquet files stored at:
hdfs://demo.sample/apps/demo/reference/customerid_segments
7.5 categories
The categories table contains the unique categories extracted from the titems table. It is a temporary
table defined as a Spark transformation in Scala and registered in the SQL context.
Field name Field type Description
categoryId Text Category id
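Conceptually the table corresponds to a distinct selection over titems, which can be checked through
the /sql resource (a sketch; the actual defining Scala transformation may differ in details):
curl "http://demo.sample:9060/sql?s=select+distinct+categoryId+from+titems"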
7.6 ranges
The ranges table contains aggregated monthly sales information. Like categories, it is a temporary
table defined as a Spark transformation in Scala and registered in the SQL context.
Field name Field type Description
month Text Month in format yyyy-mm-01
quantity Integer Total number of items sold
revenue Double Total revenue in the given month
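Conceptually the table corresponds to a monthly aggregation over titems, sketched here as a /sql call
(an assumption: revenue presumably derives from the titems price field, and the actual Scala
transformation may differ in details):
curl "http://demo.sample:9060/sql?s=select+month,+sum(quantity),+sum(price)+from+titems+group+by+month"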
8 POJO layout and configuration deployment
The accelerator uses the H2O POJO model representation to evaluate models in the event processing
layer. The POJOs (a.k.a. genmodel) are lightweight model implementations focused on real-time
execution. They are compiled directly into Java bytecode and use no runtime heap allocation during
scoring. This reduces the scoring latency to a few microseconds and makes H2O model evaluation
lightning-fast.
The POJOs are exported from the H2O cluster as Java source code that has to be compiled and loaded
into memory. This feature is directly supported by the new H2O operator in StreamBase 7.6.4. The
operator requires the POJO class content to be available as a referenceable resource, for example in
HDFS.
A single H2O model operator instance supports multiple models deployed at the same time. They
should share the same characteristics:
compatible predictor fields
compatible model type (regression, binomial classification or multinomial classification)
In the accelerator the models are of the binomial classification type and are featurized as a list of fields
containing the total item count in each category and in each month of the past period. The prediction
period covers the 270 days before the incoming transaction plus the new transaction itself. The
response classifies the customer history by the propensity to make a purchase in the target category in
the coming days.
In order to support this construction, the model training jobs save the result data in HDFS. The models
are then bundled together, described with metadata, and deployed to the cluster.
8.1 HDFS layout
The models are prepared for deployment in HDFS. In the distributed filesystem the solution stores
metadata about training results and target deployment bundles.
8.1.1 hdfs://demo.sample/apps/demo/models/pojo
The directory stores generated POJO files. It contains a subdirectory for each model training job. The
directory name is provided as m parameter.
[demo@demo ~]$ hadoop fs -ls /apps/demo/models/pojo/Demo
Found 2 items
-rw-r--r-- 3 demo supergroup 1758956 2016-04-07 20:08
/apps/demo/models/pojo/Demo/Backpacks.pojo
-rw-r--r-- 3 demo supergroup 21840350 2016-04-07 20:08
/apps/demo/models/pojo/Demo/Hats.pojo
Each file contains POJO source code for the target category provided to the model training job as s
parameter.
[demo@demo ~]$ hadoop fs -cat /apps/demo/models/pojo/Demo/Hats.pojo | head
/*
Licensed under the Apache License, Version 2.0
http://www.apache.org/licenses/LICENSE-2.0.html
AUTOGENERATED BY H2O at 2016-04-07T20:08:48.441Z
3.8.1.3
Standalone prediction code with sample test data for DRFModel named
DRF_model_1460059617025_1
How to download, compile and execute:
During model training with the same model name, the existing directory is removed.
8.1.2 hdfs://demo.sample/apps/demo/models/results
The results of the model training are stored in tab-separated files, one for each training job. As with the
POJOs, the file is overwritten by jobs with the same model name.
[demo@demo ~]$ hadoop fs -ls /apps/demo/models/results
Found 10 items
-rw-r--r-- 3 demo supergroup 353 2016-04-07 20:08
/apps/demo/models/results/Demo.txt
-rw-r--r-- 3 demo supergroup 511 2016-03-24 13:03 /apps/demo/models/results/Test 2
.txt
-rw-r--r-- 3 demo supergroup 630 2016-03-28 13:59 /apps/demo/models/results/Test
2.txt
-rw-r--r-- 3 demo supergroup 0 2016-04-05 10:46 /apps/demo/models/results/Test
3.txt
-rw-r--r-- 3 demo supergroup 630 2016-04-04 10:18 /apps/demo/models/results/Test
4.txt
-rw-r--r-- 3 demo supergroup 359 2016-04-07 22:11 /apps/demo/models/results/Test
5.txt
-rw-r--r-- 3 demo supergroup 493 2016-03-23 15:15
/apps/demo/models/results/Test.txt
-rw-r--r-- 3 demo supergroup 235 2016-04-07 22:32 /apps/demo/models/results/XYZ.txt
-rw-r--r-- 3 demo supergroup 244 2016-04-07 22:35 /apps/demo/models/results/qwerty
12345.txt
-rw-r--r-- 3 demo supergroup 373 2016-04-13 15:49 /apps/demo/models/results/test
model.txt
The results file content can be verified using the cat command:
[demo@demo ~]$ hadoop fs -cat /apps/demo/models/results/Demo.txt
Model Category AUC f1 f2 f0point5 accuracy precision
recall specificity absolute_MCC min_per_class_accuracy
Demo Hats 0.74462465 0.30282423 0.20221285 0.46907704 0.50914093
0.80936127 0.05373563 0.80936127 0.30282423 0.41259288
Demo Backpacks 0.47964015 0.02173136 0.00002418 0.02173136
0.24015803 0.02173136 0.00002418 0.24015803 0.00012852 0.00258690
The files available in the directory are aggregated by the Spark data access service and hosted as a
REST resource.
8.1.3 hdfs://demo.sample/apps/demo/models/roc
As with the training job results, the ROC points are also stored in HDFS, in one text file per job.
[demo@demo ~]$ hadoop fs -ls /apps/demo/models/roc
Found 10 items
-rw-r--r-- 3 demo supergroup 51228 2016-04-07 20:08 /apps/demo/models/roc/Demo.txt
-rw-r--r-- 3 demo supergroup 91074 2016-03-24 13:03 /apps/demo/models/roc/Test 2 .txt
-rw-r--r-- 3 demo supergroup 117505 2016-03-28 13:59 /apps/demo/models/roc/Test 2.txt
-rw-r--r-- 3 demo supergroup 0 2016-04-05 10:46 /apps/demo/models/roc/Test 3.txt
-rw-r--r-- 3 demo supergroup 117480 2016-04-04 10:18 /apps/demo/models/roc/Test 4.txt
-rw-r--r-- 3 demo supergroup 54803 2016-04-07 22:11 /apps/demo/models/roc/Test 5.txt
-rw-r--r-- 3 demo supergroup 85489 2016-03-23 15:15 /apps/demo/models/roc/Test.txt
-rw-r--r-- 3 demo supergroup 27239 2016-04-07 22:32 /apps/demo/models/roc/XYZ.txt
-rw-r--r-- 3 demo supergroup 30846 2016-04-07 22:35 /apps/demo/models/roc/qwerty
12345.txt
-rw-r--r-- 3 demo supergroup 57439 2016-04-13 15:49 /apps/demo/models/roc/test
model.txt
The file content can be verified using the cat command:
[demo@demo ~]$ hadoop fs -cat /apps/demo/models/roc/Demo.txt | head
Model Category Threshold tpr fpr tp fp tn fn
Demo Hats 0.80936127 0.00010692 0.00000000 1 0 15644 9352
Demo Hats 0.80061281 0.00010692 0.00025569 1 4 15640 9352
Demo Hats 0.79416358 0.00042767 0.00025569 4 4 15640 9349
Demo Hats 0.78396730 0.00117609 0.00031961 11 5 15639 9342
Demo Hats 0.77707700 0.00171068 0.00038353 16 6 15638 9337
Demo Hats 0.77294468 0.00192452 0.00044746 18 7 15637 9335
Demo Hats 0.76663585 0.00256602 0.00070314 24 11 15633 9329
Demo Hats 0.76177124 0.00310061 0.00089491 29 14 15630 9324
Demo Hats 0.75874359 0.00363520 0.00102276 34 16 15628 9319
The files available in the directory are aggregated by the Spark data access service and hosted as a
REST resource.
8.1.4 hdfs://demo.sample/apps/demo/models/varimp
Finally, the variable importance results are also stored as text files.
[demo@demo ~]$ hadoop fs -ls /apps/demo/models/varimp
Found 10 items
-rw-r--r-- 3 demo supergroup 6980 2016-04-07 20:08 /apps/demo/models/varimp/Demo.txt
-rw-r--r-- 3 demo supergroup 12385 2016-03-24 13:03 /apps/demo/models/varimp/Test 2
.txt
-rw-r--r-- 3 demo supergroup 15785 2016-03-28 13:59 /apps/demo/models/varimp/Test
2.txt
-rw-r--r-- 3 demo supergroup 0 2016-04-05 10:46 /apps/demo/models/varimp/Test
3.txt
-rw-r--r-- 3 demo supergroup 15785 2016-04-04 10:18 /apps/demo/models/varimp/Test
4.txt
-rw-r--r-- 3 demo supergroup 7383 2016-04-07 22:11 /apps/demo/models/varimp/Test
5.txt
-rw-r--r-- 3 demo supergroup 11499 2016-03-23 15:15 /apps/demo/models/varimp/Test.txt
-rw-r--r-- 3 demo supergroup 3766 2016-04-07 22:32 /apps/demo/models/varimp/XYZ.txt
-rw-r--r-- 3 demo supergroup 4261 2016-04-07 22:35 /apps/demo/models/varimp/qwerty
12345.txt
-rw-r--r-- 3 demo supergroup 8002 2016-04-13 15:49 /apps/demo/models/varimp/test
model.txt
The file content can be verified using the cat command:
[demo@demo ~]$ hadoop fs -cat /apps/demo/models/varimp/Demo.txt | head
Model Category Variable Relative Importance Scaled Importance
Percentage
Demo Hats Hats 42515.66015625 1.00000000 0.12774973
Demo Hats Other Youth Clothes 18572.19140625 0.43683178 0.05580514
Demo Hats Other Sportswear 18197.13476563 0.42801017 0.05467818
Demo Hats Running Clothes 13642.03417969 0.32087081 0.04099116
Demo Hats Other Activity Gear 12797.34765625 0.30100315 0.03845307
Demo Hats Womens Fleece 9781.95605469 0.23007889 0.02939252
Demo Hats Mens Fleece 9526.95605469 0.22408110 0.02862630
Demo Hats Endurance Training Clothes 8466.57910156 0.19914025 0.02544011
Demo Hats Socks and Belts 8291.94042969 0.19503262 0.02491536
The files available in the directory are aggregated by the Spark data access service and hosted as a
REST resource.
8.1.5 hdfs://demo.sample/apps/demo/models/sets
The model metadata is kept in HDFS for easy sharing within the cluster. The directory contains a set
of model definition entries describing the model execution behaviour.
[demo@demo ~]$ hadoop fs -ls /apps/demo/models/sets
Found 5 items
-rw-r--r-- 1 demo supergroup 4 2016-03-28 16:07 /apps/demo/models/sets/empty.txt
-rw-r--r-- 1 demo supergroup 312 2016-04-07 15:42 /apps/demo/models/sets/models-
a.txt
-rw-r--r-- 1 demo supergroup 235 2016-03-28 22:18 /apps/demo/models/sets/models-
b.txt
-rw-r--r-- 1 demo supergroup 77 2016-04-05 22:59 /apps/demo/models/sets/models-
c.txt
-rw-r--r-- 1 demo supergroup 311 2016-04-13 16:01 /apps/demo/models/sets/models-
q.txt
An example model descriptor can be accessed with the cat command:
[demo@demo ~]$ hadoop fs -cat /apps/demo/models/sets/models-a.txt
Hats hdfs://demo.sample/apps/demo/models/pojo/Test/Hats.pojo 0.19562227 Hats
Mens Fleece hdfs://demo.sample/apps/demo/models/pojo/Test/Mens%20Fleece.pojo
0.28281646 Mens Fleece
Mens Hardshell Jackets
hdfs://demo.sample/apps/demo/models/pojo/Test/Mens%20Hardshell%20Jackets.pojo 0.13727364
Mens Hardshell Jackets
The file is tab-separated with each line describing a model. The fields are:
Field name Field type Description
Model Name Text Model name reference. It is used to identify the model generating a
given score. It is recommended to keep it consistent across bundles.
POJO URL Text URL pointing to the POJO source. The H2O operator reads the file
and compiles it on the fly. The Java compiler restrictions are relaxed.
Cut-Off Real The actual cut-off value to be used for binomial classification. H2O
by default uses the threshold maximizing the F1 metric. This
parameter allows overriding the classification threshold to maximize a
use-case-specific metric. The field takes a value between 0.0 and 1.0.
Category Id Text The category id used to track customer responses to the offer.
Offer Name Text Descriptive offer name.
Model Version Text Model version label.
Description Text Details about the model.
Valid From Date Effective start date for the model validity in form yyyy-mm-dd.
Inclusive.
Valid To Date Effective end date for the model validity in form yyyy-mm-dd.
Exclusive.
8.2 Configuration deployment
The model deployment decouples model operations from execution. To achieve that, the deployment
is done via ZooKeeper. ZooKeeper is a cluster data-sharing component, similar in concept to
ActiveSpaces. It supports addressed data modifications with notifications to all listeners. ZooKeeper
was built first of all for stability and strong consistency, and it is used to deliver communication
primitives in large clusters.
In the accelerator the event processing components are clients for configuration change notifications.
ZooKeeper guarantees that a client eventually receives the last state change notification; that means
several fast updates may result in a single notification to the consumer. The entries in ZooKeeper form
a filesystem-like tree structure and are called z-nodes.
The following items are configurable and deployable on demand:
model metadata
product SKU to category mapping
feature list
ZooKeeper can be accessed with the command line client. An example ZK session:
[demo@demo bin]$ pwd
/opt/java/zookeeper-3.4.8/bin
[demo@demo bin]$ ./zkCli.sh -server localhost:2181/demo
Connecting to localhost:2181/demo
(... skipped for clarity ...)
[zk: localhost:2181/demo(CONNECTED) 0] ls /config
[h2oModel, features, products]
[zk: localhost:2181/demo(CONNECTED) 1] get /config/h2oModel
hdfs://demo.sample/apps/demo/models/sets/models-c.txt
cZxid = 0x19a
ctime = Sun Mar 20 21:48:52 UTC 2016
mZxid = 0x956635
mtime = Wed Apr 13 16:02:58 UTC 2016
pZxid = 0x19a
cversion = 0
dataVersion = 32
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 53
numChildren = 0
[zk: localhost:2181/demo(CONNECTED) 2]
8.2.1 Model deployment
The currently active model bundle is configured at the /config/h2oModel z-node. The z-node contains a
URL pointing to the current model set as described above. Whenever the z-node content changes, all
event processing applications (StreamBase) try loading the models. If any model on the list fails to
load, the change is rejected, but only on the node that failed. If the behaviour has to be consistent
across the cluster, special actions should be implemented. An example design is a fail-fast engine
shutdown on model loading failure, which would prevent further processing with the old configuration.
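Switching the active bundle is then a single z-node update, sketched here with the command line client
in the same style as the sessions below (the bundle file is one of those listed in section 8.1.5):
[demo@demo bin]$ echo "set /config/h2oModel hdfs://demo.sample/apps/demo/models/sets/models-a.txt" | ./zkCli.sh -server localhost:2181/demo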
The model deployment bundle fields (repeated from previous chapter):
Field name Field type Description
Model Name Text Model name reference. It is used to identify the model generating a
given score. It is recommended to keep it consistent across bundles.
POJO URL Text URL pointing to the POJO source. The H2O operator reads the file
and compiles it on the fly. The Java compiler restrictions are relaxed.
Cut-Off Real The actual cut-off value to be used for binomial classification. H2O
by default uses the threshold maximizing the F1 metric. This
parameter allows overriding the classification threshold to maximize a
use-case-specific metric. The field takes a value between 0.0 and 1.0.
Category Id Text The category id used to track customer responses to the offer.
Offer Name Text Descriptive offer name.
Model Version Text Model version label.
Description Text Details about the model.
Valid From Date Effective start date for the model validity in form yyyy-mm-dd.
Inclusive.
Valid To Date Effective end date for the model validity in form yyyy-mm-dd.
Exclusive.
8.2.2 Product SKU to category id mapping
The accelerator assumes the transaction originator has limited capabilities and is expected to deliver
only information guaranteed to be true. That means the transaction message has minimalistic content,
which avoids inconsistencies when the reference data differs between transaction terminals.
A consequence of this design is that the transaction line category id is unknown when the message
arrives at the event processing layer, while the model logic operates on category ids. What is more, the
category ids may change over time, and there may be zero or more categories valid for a product. An
example transaction line categorization could be: female, jacket, green, winter, collection-2016. This
approach enables multi-dimensional processing of incoming transactions.
The accelerator keeps the mapping locally for cross-referencing efficiency. The reference table is read
from an HDFS file. The current table is registered in ZooKeeper at /config/products. The file is
tab-separated text.
Field name Field type Description
Product SKU Text Product inventory key as sent in the transaction lines.
Category ID Text Category id assigned to the product.
Other Text Other reference fields; ignored at the moment.
Example access to ZooKeeper and the file content:
[demo@demo bin]$ echo "get /config/products" | ./zkCli.sh -server localhost:2181/demo
Connecting to localhost:2181/demo
(... skipped for clarity ...)
[zk: localhost:2181/demo(CONNECTED) 0] get /config/products
hdfs://demo.sample/apps/demo/config/products.txt
cZxid = 0x199
ctime = Sun Mar 20 21:48:52 UTC 2016
mZxid = 0x199
mtime = Sun Mar 20 21:48:52 UTC 2016
pZxid = 0x199
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 48
numChildren = 0
[zk: localhost:2181/demo(CONNECTED) 1] [demo@demo bin]$
[demo@demo bin]$
[demo@demo bin]$ hadoop fs -cat hdfs://demo.sample/apps/demo/config/products.txt | head
100005 Womens Insulated Jackets 2 94.99 172.00 94.99
100017 Hats 13 20.00 38.15 40.00
100024 Other Sportswear 1 22.50 22.50 22.50
100061 Running Clothes 2 50.00 50.00 50.00
100080 Mens Footwear 1 117.00 117.00 117.00
100138 Running Clothes 1 25.00 25.00 25.00
100145 Outdoor Gear 1 99.00 99.00 99.00
100191 Hats 13 36.00 39.38 40.00
100249 Other Activity Gear 3 10.00 13.33 20.00
100265 Other Sportswear 2 34.19 44.59 55.00
8.2.3 Feature mapping
The numerical models operate on a set of features; currently these are the total quantities for each
known category. If a category has never shown up in the customer's purchases, the field to be passed
to the model is unknown and the H2O operator passes a NaN value. In order to prefill empty categories
with zeros, the expected category id list has to be known in advance. The list of categories could also
be used to optimize the feature list calculation by directly updating positions in a fixed array of doubles;
currently, however, this is handled more efficiently in the operator itself.
The list of features is also kept as a tab-separated text file in HDFS. The currently active file is
registered in the z-node /config/features.
Field name Field type Description
Category ID Text Category ID as understood by the models.
Feature position Integer Position of the feature on the list. Currently unused, as the position is
provided by the model itself. In addition, each model may be trained
using different feature fields, which makes this field obsolete.
Example access to ZooKeeper and the file content:
[demo@demo bin]$ echo "get /config/features" | ./zkCli.sh -server localhost:2181/demo
Connecting to localhost:2181/demo
(... skipped for clarity ...)
[zk: localhost:2181/demo(CONNECTED) 0] get /config/features
hdfs://demo.sample/apps/demo/config/features.txt
cZxid = 0x198
ctime = Sun Mar 20 21:48:52 UTC 2016
mZxid = 0x198
mtime = Sun Mar 20 21:48:52 UTC 2016
pZxid = 0x198
cversion = 0
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 48
numChildren = 0
[zk: localhost:2181/demo(CONNECTED) 1]
[demo@demo bin]$
[demo@demo bin]$ hadoop fs -cat hdfs://demo.sample/apps/demo/config/features.txt | head
Active Sportswear 0
Backpacks 1
Boys Insulated Jackets 2
Endurance Training Clothes 3
Girls Fleece 4
Girls Insulated Jackets 5
Hats 6
Hydration Packs 7
Mens Alpine Jackets 8
Mens Climbing Gloves 9
8.2.4 Model deployment procedure
The model deployment lifecycle requires the prior existence of configuration items. For example, if a
new model expects a category field "Collection 2016", the category should appear on the feature list
and the cross-referencing should be defined. The correct sequence would then be (see the sketch after
this list):
create a copy of the current feature list and add the new features
deploy the features; zeros appear as model input and the field is ignored
create a copy of the product-to-category mapping and add the new entries
deploy the mapping; if there are already products assigned to the new feature, the values start
appearing in the model input
take the current model bundle and add entries describing the POJOs that use the new features, and
define the model validity range; the models are evaluated, but their results are discarded until the
incoming transactions have a timestamp within the validity range; the cost of H2O model evaluation is
minimal, so the unnecessary execution is not an issue
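A hedged sketch of such a deployment round using only the tools shown earlier (the file name
features-v2.txt is illustrative, and the edit step is manual):
[demo@demo bin]$ hadoop fs -get /apps/demo/config/features.txt features-v2.txt
(append the new category to features-v2.txt with any editor)
[demo@demo bin]$ hadoop fs -put -f features-v2.txt /apps/demo/config/features-v2.txt
[demo@demo bin]$ echo "set /config/features hdfs://demo.sample/apps/demo/config/features-v2.txt" | ./zkCli.sh -server localhost:2181/demo
The same pattern (upload a new file, repoint the z-node) applies to the product mapping and the model
bundle.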