
http://www.tibco.com

Global Headquarters

3303 Hillview Avenue

Palo Alto, CA 94304

Tel: +1 650-846-1000

Toll Free: 1 800-420-8450

Fax: +1 650-846-1005

© 2006, TIBCO Software Inc. All rights reserved. TIBCO, the TIBCO logo, The Power of Now, and TIBCO Software are trademarks or registered trademarks of TIBCO Software Inc. in the United States and/or other countries. All other product and company names and marks mentioned in this document are the property of their respective owners and are mentioned for identification purposes only.

Accelerator for Apache Spark

Interface Specification

23 August 2016

Version 1.0.0

This document outlines the interface specification for inbound and outbound messages for the Accelerator for Apache Spark.


Revision History

Version | Date | Author | Comments
0.1 | 10/04/2016 | Piotr Smolinski | Initial version
0.2 | 18/04/2016 | Piotr Smolinski |
0.3 | 06/06/2016 | Piotr Smolinski |
1.0.0 | 23/08/2016 | Piotr Smolinski | Version for release


Copyright Notice

COPYRIGHT© 2016 TIBCO Software Inc. This document is unpublished and the foregoing notice is affixed to protect TIBCO Software Inc. in the event of inadvertent publication. All rights reserved. No part of this document may be reproduced in any form, including photocopying or transmission electronically to any computer, without prior written consent of TIBCO Software Inc. The information contained in this document is confidential and proprietary to TIBCO Software Inc. and may not be used or disclosed except as expressly authorized in writing by TIBCO Software Inc. Copyright protection includes material generated from our software programs displayed on the screen, such as icons, screen displays, and the like.

Trademarks

Technologies described herein are either covered by existing patents or patent applications are in progress. All brand and product names are trademarks or registered trademarks of their respective holders and are hereby acknowledged.

Confidentiality

The information in this document is subject to change without notice. This document contains information that is confidential and proprietary to TIBCO Software Inc. and may not be copied, published, or disclosed to others, or used for any purposes other than review, without written authorization of an officer of TIBCO Software Inc. Submission of this document does not represent a commitment to implement any portion of this specification in the products of the submitters.

Content Warranty

The information in this document is subject to change without notice. THIS DOCUMENT IS PROVIDED "AS IS" AND TIBCO MAKES NO WARRANTY, EXPRESS, IMPLIED, OR STATUTORY, INCLUDING BUT NOT LIMITED TO ALL WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. TIBCO Software Inc. shall not be liable for errors contained herein or for incidental or consequential damages in connection with the furnishing, performance or use of this material.

For more information, please contact:

TIBCO Software Inc.

3303 Hillview Avenue

Palo Alto, CA 94304

USA


Table of Contents

Table of Contents
Table of Figures
Table of Tables
1 Preface
1.1 Purpose of Document
1.2 Scope
1.3 Referenced Documents
2 Event capture and emission (Kafka messages)
2.1 Transaction
2.2 Notification
3 Runtime state maintenance (HBase)
4 Runtime dashboard (LiveView DataMart)
4.1 Transaction
4.2 TransactionItems
4.3 StoreSummary
4.4 ModelSummary
4.5 WallClock
5 Data collection (HDFS structures)
5.1 StreamBase to Flume
5.2 Avro to Parquet
6 Data access (REST/HTTP)
6.1 Service layer
6.1.1 Text format
6.1.2 SBDF format
6.2 /categories
6.3 /ranges
6.4 /transactions
6.5 /pivot
6.6 /sql
6.7 /etl
6.7.1 /etl (GET)
6.7.2 /etl (POST)
6.7.3 /etl/{jobId} (GET)
6.7.4 /etl/{jobId}/messages (GET)
6.7.5 /etl/{jobId}/events (GET)
6.7.6 /etl/{jobId} (DELETE)
6.8 /models (training)
6.8.1 /models/train (GET)
6.8.2 /models/train (POST)
6.8.3 /models/train/{jobId} (GET)
6.8.4 /models/train/{jobId}/messages (GET)
6.8.5 /models/train/{jobId}/events (GET)
6.8.6 /models/train/{jobId} (DELETE)
6.9 /models (results and metadata)
6.9.1 /models/results
6.9.2 /models/roc
6.9.3 /models/varimp
7 Data access (SQL)
7.1 titems
7.2 stores
7.3 customer_demographic_info
7.4 customerid_segments
7.5 categories
7.6 ranges
8 POJO layout and configuration deployment
8.1 HDFS layout
8.1.1 hdfs://demo.sample/apps/demo/models/pojo
8.1.2 hdfs://demo.sample/apps/demo/models/results
8.1.3 hdfs://demo.sample/apps/demo/models/roc
8.1.4 hdfs://demo.sample/apps/demo/models/varimp
8.1.5 hdfs://demo.sample/apps/demo/models/sets
8.2 Configuration deployment
8.2.1 Model deployment
8.2.2 Product SKU to category id mapping
8.2.3 Feature mapping
8.2.4 Model deployment procedure


Table of Figures

No table of figures entries found.


Table of Tables

No table of tables entries found.


1 Preface

1.1 Purpose of Document

The document describes the data exchange interfaces used by the Accelerator for Apache Spark project.

1.2 Scope

This document outlines the following:

- Inbound report message specifications
- Outbound notification message specifications

1.3 Referenced Documents

Document | Reference
Accelerator for Apache Spark Quick Start Guide |
Accelerator for Apache Spark Functional Specification |


2 Event capture and emission (Kafka messages)

The eventing interface forms the basic part of the Fast Data story in the accelerator: good old reactive messaging. The Event Processing layer receives and publishes messages related to the handled process.

In the retail processing scenario the messages represent the transaction content and the resulting offers. The input messages contain minimal information about the executed transaction: a list of items with nominal price, applied discount and effective revenue, plus the customer identity (loyalty card number). The output messages contain information about offers relevant to the executed transaction.

The common characteristic of the input and output event streams in the accelerator is that the number of events per second can be huge. On the other hand, each event is generally independent of other events delivered or sent within a relatively small time window. At the micro scale we deal with some process tracking, but the number of concurrently tracked processes (such as customer purchases) is large.

The target of the accelerator is a resilient, horizontally scalable event processing solution. Such a solution should:

- process the events (both CEP and ESP)
- collect the events for analytics
- apply knowledge gained in the analytics to the event processing

The demo scenario in the accelerator uses Kafka as the messaging bus and XML as the payload format. The decision was driven by the ability to scale out messaging beyond EMS limitations. The second reason to pick Kafka was that the product is gaining popularity and is pretty common in Big Data eventing. XML was used as a typical data format for integration projects, but any format carrying information with the same semantics (such as JSON) would be equally good.

The major advantages of Kafka in the accelerator are:

- the ability to replay the traffic (up to the message log retention point)
- inherent traffic partitioning (with a delivery ordering guarantee within a partition)

The messages on the Kafka bus consist of an opaque header and body. The interpretation of both fields is a contract between publisher and consumer. In the sample implementation the header is UTF-8 multiline text holding a list of key/value pairs, where each line defines one pair and the key is separated from the value by ':' (colon). The payload is a UTF-8 encoded XML message.
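For illustration, a header assembled from the sample values used in this chapter could look as follows (one key/value pair per line; the exact set of fields depends on the publisher):

correlationId:ca9ef517-8509-40be-9754-ca923344ba67
sendResponseTo:Notifications:0
customerId:4bf34187-ce60-482d-b2e9-ed5584e72ada
transactionId:d1193d9b-686e-4089-877e-3a05ac9d88a3

Note that the sendResponseTo value itself contains a colon (the desired partition number), so presumably only the first colon on a line acts as the key/value separator.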

2.1 Transaction

The transaction event carries XML content defined in transactions.xsd. The schema defines the XML structure in the http://demos.tibco.com/retail/transactions namespace. The event can be annotated with the following header fields:


Field | Sample | Description
correlationId | ca9ef517-8509-40be-9754-ca923344ba67 | Message correlation id. It is a unique id used to track the process. In the implemented scenario it is used to correlate offers with transaction events. The presence of the correlation id allows the client interface to be implemented as request-reply (but the user is not forced into that). The field can also be used as a duplicate detection mechanism inside the event processing application: if the event initiator resends the same transaction and a non-persistent registry of previously seen events contains the given correlation id, most likely the client sent it twice (for example as an I/O partial failure retry) but the message was indeed delivered to the bus.
sendResponseTo | Notifications:0 | Destination topic for notification messages. The header allows the response messages to be sent to the declared destination instead of one common topic. This is useful in combination with correlationId for request-reply. The value may include the desired partition number.
customerId | 4bf34187-ce60-482d-b2e9-ed5584e72ada | Optionally replicated field from the message payload. It can be used to transparently send messages to a given partition. A consistent customer-to-partition assignment is needed to guarantee the ordering of messages without distributed locking.
transactionId | d1193d9b-686e-4089-877e-3a05ac9d88a3 | Optionally replicated field from the message payload.

The header in Kafka is intended to be small and quickly parsable. It is used to quickly filter out irrelevant messages, route messages to the target processor without parsing the payload, or even pass a cryptographic signature of the content. It is very similar to JMS headers; the major difference is that Kafka does not impose any particular interpretation of the binary content.

A particular problem can be the processing of events in strict order. In Big Data and Fast Data systems this cannot be done globally. Kafka guarantees ordered message delivery within a given topic partition. With a large number of topics it is possible to scale out the processing so that each consumer processes the relevant messages in strict order. The challenge, though, is sending the messages for a given key to the correct partition; this is the message sender's responsibility. Another problem that may arise is that a given message may be related to two independent processes, each identified by a different key. In such a case the message should be sent to two topics with specific partition assignments. This of course complicates the sender-side design. Placing a façade gateway component that hides the routing logic or message duplication seems to be a good idea.

The message payload for a transaction is XML. The XML content contains the full message information and the header fields may replicate some of it to provide technical hints.

The namespace is: http://demos.tibco.com/retail/transactions

The element structure is described below; a sample message follows the table.

Element | Sample | Description
transaction | | Root element.
transaction/transactionId | d1193d9b-686e-4089-877e-3a05ac9d88a3 | Unique transaction identification. In the demo it is assumed that all messages with the same transaction id are identical, therefore it is safe to keep only one copy.
transaction/customerId | 4bf34187-ce60-482d-b2e9-ed5584e72ada | Unique process identifier. The field groups related messages and it should be used to process messages in the order of production.
transaction/storeId | | Transaction originator identifier.
transaction/transactionTime | 2015-03-20T20:18:32Z | ISO-8601 compatible time represented as xsd:dateTime.
transaction/transactionLines | | Container element for the list of items. A transaction consists of an arbitrary number of items and the items on the list can be repeated.
transaction/transactionLines/transactionLine | | Single transaction line container.
transaction/transactionLines/transactionLine/productSKU | 101231 | Stock keeping unit code for the purchased product. Identifies the item in the inventory.
transaction/transactionLines/transactionLine/quantity | 1 | Amount of items purchased.
transaction/transactionLines/transactionLine/nominalPrice | 145.00 | Nominal price as known to the selling point at the moment of transaction.
transaction/transactionLines/transactionLine/purchasePrice | 130.00 | Actual purchase price after discount(s).
transaction/transactionLines/transactionLine/discount | | Optional discount information.
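An illustrative transaction message assembled from the sample values above (the authoritative structure is defined in transactions.xsd; the storeId value is borrowed from the dashboard samples in chapter 4 and the optional discount element is omitted):

<transaction xmlns="http://demos.tibco.com/retail/transactions">
  <transactionId>d1193d9b-686e-4089-877e-3a05ac9d88a3</transactionId>
  <customerId>4bf34187-ce60-482d-b2e9-ed5584e72ada</customerId>
  <storeId>storeSF</storeId>
  <transactionTime>2015-03-20T20:18:32Z</transactionTime>
  <transactionLines>
    <transactionLine>
      <productSKU>101231</productSKU>
      <quantity>1</quantity>
      <nominalPrice>145.00</nominalPrice>
      <purchasePrice>130.00</purchasePrice>
    </transactionLine>
  </transactionLines>
</transaction>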

2.2 Notification

Notifications form the output stream of Fast Data events. The notification messages, similarly to transactions, are XML messages sent via a Kafka topic. The message exchange can be organized on the client side into a request-reply pattern, but on the XML level the messages are different. The notification can be annotated with the following header fields:


Field | Sample | Description
correlationId | ca9ef517-8509-40be-9754-ca923344ba67 | Message correlation id propagated from the transaction message. It is used by the client to correlate the notification with the sent transaction message. Usually it should be combined with the sendResponseTo header.
customerId | 4bf34187-ce60-482d-b2e9-ed5584e72ada | Replicated field from the message payload. The field can be used to route the message to the topic partition specific to the customer.
transactionId | d1193d9b-686e-4089-877e-3a05ac9d88a3 | Replicated field from the message payload. It can be used to track messages related to the given transaction without parsing the actual payload.

In the current implementation of the accelerator there are not many assumptions about the consumer implementation. The applied design with the correlationId and sendResponseTo fields allows the consumer to specify the expected notification delivery destination. This in turn enables the consumer to be either asynchronous, with distinct channels for transactions and notifications, or synchronous, with a low-latency answer for service orchestration.

The synchronous request-reply requires further consideration. Kafka topic management is a heavy operation, therefore it is impractical to create a notification topic for each transaction or even each client. An API gateway like TIBCO APIX seems to be a perfect fit for such a service.

As for the transaction, the message payload for a notification is XML. The XML content contains the offer information and the header fields may replicate some of it to provide technical hints.

The namespace is: http://demos.tibco.com/retail/notifications

The element structure is described below; a sample message is shown at the end of this section.

Element | Sample | Description
notification | | Root element.
notification/transactionId | d1193d9b-686e-4089-877e-3a05ac9d88a3 | Unique transaction identification. Note that due to technical problems and recovery there could be several notification messages for a given transaction. This is unlikely to happen under regular processing conditions.
notification/customerId | 4bf34187-ce60-482d-b2e9-ed5584e72ada | Unique process identifier. The field groups related messages and it should be used to process messages in the order of production.
notification/propensities | | Container element for the list of propensities/offers. A single notification may contain multiple responses, letting the consumer of the message pick the best one. There could also be no offer at all. In the demo the event processing layer does the best-response selection, so there is at most one offer line. The design allows the processing of messages to be pipelined, with the ultimate response sent to the sendResponseTo destination in the last stage.
notification/propensities/propensity | | Single offer container.
notification/propensities/propensity/category | Hats | Actual category/offer name.
notification/propensities/propensity/propensity | 0.3 | Propensity score.

The notification message content is very simple. It should contain just the essential facts produced by the event processing layer.
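An illustrative notification message assembled from the sample values above (the exact element nesting follows the table and is shown here for orientation only):

<notification xmlns="http://demos.tibco.com/retail/notifications">
  <transactionId>d1193d9b-686e-4089-877e-3a05ac9d88a3</transactionId>
  <customerId>4bf34187-ce60-482d-b2e9-ed5584e72ada</customerId>
  <propensities>
    <propensity>
      <category>Hats</category>
      <propensity>0.3</propensity>
    </propensity>
  </propensities>
</notification>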


3 Runtime state maintenance (HBase)

The core aspect of any event processing solution is the processing of incoming events in the context of previously captured events. In ultra-low latency solutions, like StreamBase algorithmic trading, the state is kept locally in the process memory. In the retail scenario the customer's numerical description is derived from the customer history. As the process may disappear at any time and there is no guarantee that a given process handles all messages related to a particular customer, memory-only storage is impractical. In a scalable solution the data should be kept shared. For Fast Data the shared storage access must be immediate.

In the accelerator the event flow application needs to read the customer history and update it in the shortest possible time. HBase looks like a perfect fit for such a use-case. The main advantages of HBase in the context of the retail accelerator are:

- scalable constant-cost customer data access
- flexible schema
- inherent data appending

In the accelerator the HBase interaction is limited to a single read and update by primary key. The database keeps customer transaction records in the table Customers in the form of one JSON message per entry. The table is defined with a single column family, Transactions, with a history size of 200 entries. That means up to 200 transactions are kept with no additional cost. The query is always made by the primary key, which is the binary representation of the customer id.

In order to access the database shell, go to the HBase bin directory and execute the following command:

[demo@demo bin]$ cd /opt/java/hbase-1.2.0/bin/

[demo@demo bin]$ ./hbase shell

The table Customers was created with the following statement:

create 'Customers', { NAME => "Transactions", VERSIONS => 200 }

To inspect the table run:

describe 'Customers'

To remove all stored data:

truncate 'Customers'
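A new history entry can also be appended from the shell (a sketch; in the accelerator the event processing application performs this update through the HBase client API, and the value is the JSON document described below). Because the column family keeps up to 200 versions, each put adds a new entry instead of overwriting the previous one:

put 'Customers', "1f969758-3a90-4fcc-ac18-ea0247823f48", 'Transactions:transactions', '{"transactionId":"9132a1b3-c332-4a6e-926a-83696b6c1f9b","transactionDate":"2015-12-30 04:14:10.000+0000","items":[{"productSku":"100017","quantity":1,"price":38.15,"revenue":38.15}]}'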

Customer data can be accessed with the following call:

hbase(main):004:0> get 'Customers', "1f969758-3a90-4fcc-ac18-ea0247823f48"

COLUMN CELL

Transactions:transactions timestamp=1460826706531,

value={"transactionId":"9132a1b3-c332-4a6e-926a-83696b6c1f9b","transactionDate":"2015-12-30

04:14:10.000+0000","items":[{"productSku":"100017","quantity":1,"price":38.15,"revenue":38.1

5},{"productSku":"100017","quantity":1,"price":38.15,"revenue":38.15},{"productSku":"100017"

,"quantity":1,"price":38.15,"revenue":38.15},{"productSku":"100991","quantity":1,"price":35.

7,"revenue":35.7},{"productSku":"105104","quantity":1,"price":70.19,"revenue":70.19}]}

1 row(s) in 0.0330 seconds

To get the 5 most recently recorded transactions, use a parameterized call:


hbase(main):005:0> get 'Customers', "1f969758-3a90-4fcc-ac18-ea0247823f48", {VERSIONS=>5}

COLUMN CELL

Transactions:transactions timestamp=1460826706531,

value={"transactionId":"9132a1b3-c332-4a6e-926a-83696b6c1f9b","transactionDate":"2015-12-30

04:14:10.000+0000","items":[{"productSku":"100017","quantity":1,"price":38.15,"revenue":38.1

5},{"productSku":"100017","quantity":1,"price":38.15,"revenue":38.15},{"productSku":"100017"

,"quantity":1,"price":38.15,"revenue":38.15},{"productSku":"100991","quantity":1,"price":35.

7,"revenue":35.7},{"productSku":"105104","quantity":1,"price":70.19,"revenue":70.19}]}

Transactions:transactions timestamp=1460826115580,

value={"transactionId":"d2d29d60-8211-4cbf-b643-445b23eac317","transactionDate":"2015-10-18

20:26:12.000+0000","items":[{"productSku":"100017","quantity":1,"price":38.15,"revenue":38.1

5},{"productSku":"100017","quantity":1,"price":38.15,"revenue":38.15},{"productSku":"100017"

,"quantity":1,"price":38.15,"revenue":38.15},{"productSku":"100017","quantity":1,"price":38.

15,"revenue":38.15},{"productSku":"100289","quantity":1,"price":90.0,"revenue":90.0},{"produ

ctSku":"100442","quantity":1,"price":149.49,"revenue":149.49},{"productSku":"100605","quanti

ty":1,"price":128.27,"revenue":128.27},{"productSku":"100605","quantity":1,"price":128.27,"r

evenue":128.27},{"productSku":"100605","quantity":1,"price":128.27,"revenue":128.27},{"produ

ctSku":"100605","quantity":1,"price":128.27,"revenue":128.27},{"productSku":"100605","quanti

ty":1,"price":128.27,"revenue":128.27},{"productSku":"100870","quantity":1,"price":43.74,"re

venue":43.74},{"productSku":"101022","quantity":1,"price":139.12,"revenue":139.12},{"product

Sku":"101022","quantity":1,"price":139.12,"revenue":139.12},{"productSku":"102769","quantity

":1,"price":99.0,"revenue":99.0},{"productSku":"107191","quantity":1,"price":261.0,"revenue"

:261.0}]}

Transactions:transactions timestamp=1460825844625,

value={"transactionId":"b9ebb575-788b-42a3-8d4e-f8c9a7533606","transactionDate":"2015-09-15

17:23:53.000+0000","items":[{"productSku":"100017","quantity":1,"price":38.15,"revenue":38.1

5},{"productSku":"100024","quantity":1,"price":22.5,"revenue":22.5},{"productSku":"100605","

quantity":1,"price":128.27,"revenue":128.27},{"productSku":"100605","quantity":1,"price":128

.27,"revenue":128.27},{"productSku":"100991","quantity":1,"price":35.7,"revenue":35.7},{"pro

ductSku":"101022","quantity":1,"price":139.12,"revenue":139.12},{"productSku":"103318","quan

tity":1,"price":23.6,"revenue":23.6}]}

Transactions:transactions timestamp=1460825289248,

value={"transactionId":"f4b9ea05-46d8-45a4-9652-8e6bc8731b6f","transactionDate":"2015-07-10

18:02:08.000+0000","items":[{"productSku":"100017","quantity":1,"price":38.15,"revenue":38.1

5},{"productSku":"100017","quantity":1,"price":38.15,"revenue":38.15},{"productSku":"100017"

,"quantity":1,"price":38.15,"revenue":38.15},{"productSku":"100017","quantity":1,"price":38.

15,"revenue":38.15},{"productSku":"100017","quantity":1,"price":38.15,"revenue":38.15},{"pro

ductSku":"100017","quantity":1,"price":38.15,"revenue":38.15},{"productSku":"100024","quanti

ty":1,"price":22.5,"revenue":22.5},{"productSku":"100273","quantity":1,"price":10.99,"revenu

e":10.99},{"productSku":"100352","quantity":1,"price":75.0,"revenue":75.0},{"productSku":"10

2863","quantity":1,"price":9.99,"revenue":9.99},{"productSku":"106997","quantity":1,"price":

12.34,"revenue":12.34}]}

Transactions:transactions timestamp=1460825234330,

value={"transactionId":"84c03a7b-b373-4ee0-89f7-0a953a06cd11","transactionDate":"2015-07-03

21:07:40.000+0000","items":[{"productSku":"100017","quantity":1,"price":38.15,"revenue":38.1

5},{"productSku":"100017","quantity":1,"price":38.15,"revenue":38.15},{"productSku":"100017"

,"quantity":1,"price":38.15,"revenue":38.15},{"productSku":"100024","quantity":1,"price":22.

5,"revenue":22.5},{"productSku":"100024","quantity":1,"price":22.5,"revenue":22.5},{"product

Sku":"100605","quantity":1,"price":128.27,"revenue":128.27}]}

5 row(s) in 0.0270 seconds

The table structure:

Element | Sample | Description
transactionId | 84c03a7b-b373-4ee0-89f7-0a953a06cd11 | The transaction id. It is used to remove duplicates.
transactionDate | 2015-07-03 21:07:40.000+0000 | Transaction date and time.
items | | List of transaction items.
items/productSku | 100017 | Stock keeping unit code for the purchased product. Identifies the item in the inventory.
items/quantity | 1 | Amount of items purchased.
items/price | 38.15 | Nominal price as known to the selling point at the moment of transaction.
items/revenue | 38.15 | Actual purchase price after discount(s).
items/discount | | Optional discount information.

Note that the transaction line category id is absent. The reason is that a given transaction may be used for feature building with a product-to-category mapping different from the one that was active when the transaction was captured.

The event processing application adds supplementary logic to handle duplicates and out-of-order transactions. In the demo it is pretty common to resend the same transactions again and again. In such a case the transaction is interpreted as a new one, and unique copies are extracted from the transactions with a timestamp before the current one and no older than a predefined period (270 days).

Note: the entry channel assumes that if the transaction id is the same, the transaction content is also identical.


4 Runtime dashboard (LiveView DataMart)

For runtime operational monitoring the solution uses LiveView Web. The web dashboard uses the real-time LiveView DataMart server hosting the data that describes the current state of the event processing layer. Currently the LVDM keeps the recent transactions and provides aggregations on them.

4.1 Transaction

The Transaction table contains basic information about processed transactions, with offer acceptance tracking.

Element | Sample | Description
transactionId | 84c03a7b-b373-4ee0-89f7-0a953a06cd11 | The transaction id. It is used to remove duplicates.
transactionTimestamp | 2015-07-03 21:07:40.000+0000 | Transaction date and time.
customerId | 1f969758-3a90-4fcc-ac18-ea0247823f48 | Customer loyalty card number.
storeId | storeSF | Transaction origin identification.
itemCount | 5 | Number of items included in the transaction.
latitude / longitude | 30.4151482 / -97.6721351 | Geographical position of the transaction derived from the originator store.
state | TX | State derived from the originator.
zipCode | 78753 | Zip code derived from the originator.
region | SW | Region code derived from the originator.
recommendation | Backpacks | The offer name created for this customer and transaction.
recommendingModel | backpacks-2.5.3-20150701 | Model reference for model effectiveness tracking.
upsellSuccess | false | Recommendation result.

For each transaction, the record is published to the Transaction table together with the winning offer and model (if any). The SB application tracks all created offers. If the same customer purchases an item in the recommended category within the opportunity window (90 days), upsellSuccess is set to true. Otherwise, when the opportunity window closes (time is detected from incoming transactions), upsellSuccess is set to false.


4.2 TransactionItems

For overview purposes the transaction content is stored in a child table. The table contains details of the executed transaction for operational inspection.

Element | Sample | Description
productSku | 100017 | Stock keeping unit code for the purchased product. Identifies the item in the inventory.
quantity | 1 | Amount of items purchased.
price | 38.15 | Nominal price as known to the selling point at the moment of transaction.
revenue | 38.15 | Actual purchase price after discount(s).
discount | | Optional discount information.
time | | Transaction time.
latitude / longitude | 30.4151482 / -97.6721351 | Geographical position of the transaction derived from the originator store.
state | TX | State derived from the originator.
zipCode | 78753 | Zip code derived from the originator.
region | SW | Region code derived from the originator.
category | | The category id recognized during transaction processing.
transactionId | 84c03a7b-b373-4ee0-89f7-0a953a06cd11 | The transaction id. Foreign key to the Transaction table.
itemId | 2 | Sequence number for the transaction line.
seqNum | 1443 | Sequential number used to define the transaction item removal sequence.

The table is keyed by the transactionId/itemId pair.

The table implements a graceful cleanup mechanism. When high memory utilization is detected, data removal is triggered. The rows are selected for removal in the sequence of insert events. Removal of a transaction item also triggers removal of the parent transaction. The side effect is that after data removal there could be some orphan transaction items left.

4.3 StoreSummary

The StoreSummary table is derived from transaction items. The table contains information used to display store-related dashboard information. The actual summary is by store and by category.


Element | Sample | Description
latitude / longitude | 30.4151482 / -97.6721351 | Geographical position of the transaction derived from the originator store.
state | TX | State derived from the originator.
zipCode | 78753 | Zip code derived from the originator.
region | SW | Region code derived from the originator.
numItems | | Number of items purchased.
itemSaleRank | | Rank for the number of items.
totalPrice | | Total value sold for the given category in the store.
priceRank | | Store rank for the category value.
numTransactions | | Number of transactions for the given category.
transactionRank | | Store rank.
category | Backpacks | Category of the entry.
storeId | | Store id of the entry.

4.4 ModelSummary

The model summary contains information about the deployed models. The table is populated from the model loading task result.

Element | Sample | Description
status | success / failure | Model status.
message | | Descriptive message reported by the model component.
modelName | | Model name as provided in the metadata.
modelUrl | | URL where the model content can be found.
modelVersion | | Model version.
cutOff | 0.42 | The cut-off threshold used for binary discrimination.
categoryId | Backpacks | Category id that the model produces offers for.
offerName | Summer Time 2015 | Descriptive offer name as presented to the customer.
description | | Model metadata descriptive information.
validFrom | 2015-05-01 | Model validity boundary.
validTo | 2015-06-30 | Model validity boundary.
filter | | Free text filter defined as a StreamBase expression. This can be used to limit the model to a particular region or store.
time | |
seqNum | |

4.5 WallClock

The wall clock is intended to show the most recent transaction timestamp. This timestamp is considered the observable time. The user interface can connect to this table in order to show the timestamp known by the solution. The important assumption in the accelerator is that the system is distributed and there is no monotonic global clock. Global time is instead derived from observation of the world, transaction timestamps in particular. This design does not exclude autonomous schedulers that may push the time forward, but it focuses on the observable nature of global time.

The table structure is simple:

Element | Sample | Description
id | WallClock | Primary key; only one value.
time | | Timestamp. Highest reported time so far.


5 Data collection (HDFS structures)

Processing the large stream of events is only the beginning of the Fast Data to Big Data story. The observed events are needed both to draw conclusions about the current system behaviour and to predict the evolution of the observed processes. The data has to be collected and prepared for massively parallel processing. The main challenge is that the semantics of event processing and data processing do not match: event processing requires fast access to identified pieces of data, while data processing is all about transforming datasets. The events have to be stored in a storage that supports massively parallel processing, which in the typical case is HDFS. The problem with HDFS is that while it is perfect for dataset access, it is terribly bad for incremental data appending.

The major problems of storing the data in HDFS are:

- expensive synchronization/buffer flush operations, which are required to acknowledge that the data is safely stored
- the large cost of maintaining and processing many small files; the data consumers should not attempt to read files that are still being written to
- format impedance; some data formats are better for events, some are better for data

The event data collection should provide an efficient mechanism for data appending. The data availability latency should be relatively low, but for typical data analytics a 1 hour latency is perfectly acceptable.

The transition from events to data is done using a staging approach. It allows the data to be stored efficiently while not blocking the event processing layer from doing its business. The idea behind this approach is that each layer in the pipeline processes the data in larger chunks. In the accelerator the pipeline works as follows:

- Fine-grained events are delivered to the event processing layer via Kafka (XML).
- The event processing layer builds facts about the operation (data enrichment, model processing, offer preparation).
- The facts are grouped in small batches and sent to the Kafka topic at the end of each batch. A nice batching policy could be 100 events or 5 seconds (whichever happens first). Note: in the first release there is no batching. Once the data is sent out, the last offset in the batch is acknowledged. The data format used for transit is JSON.
- The JSON messages are collected by Flume and converted from JSON to Avro. Avro is a binary data representation with semantics similar to JSON. The advantage over JSON is that the data units are more compact and can be quickly rendered and parsed.
- Flume appends the Avro records in batches to files in HDFS using time-based partitioning. The files are completed and closed in 10-minute intervals. This way analytics processes that can accept multiple relatively small data files can already start processing.
- The data saved in Avro is regularly processed using ETL jobs (implemented with Spark) that normalize the data, remove duplicates, enrich it for the target tasks (by adding category ids to items) and store the data in Parquet. Parquet is another JSON-like binary serialization format, but optimized for large datasets. In particular it can optimize data access when only a subset of the fields is needed.

This implementation allows the event processing layer to operate on the data at full speed. The data is then delivered to the subsequent layers. For safe data storage, the Flume jobs can be made redundant. They will generate duplicates of the data, but since the existence of duplicates is acknowledged and addressed in the ETL step, the problem is immediately solved.

This staging approach has an additional advantage. The pipeline is functional and the previous stage data is retained for some time: for example 24 hours for the Kafka messages in the Flume topic and 3 months for the Avro files. If any bug is discovered in the flow, the data can always be reprocessed. Of course the messages sent out to third parties have to be acknowledged as results, but, for example, some missing cross-referencing entries can be easily fixed by rerunning the job. A similar concept is used in the Lambda Architecture.

Partitioning the Avro files by the event timestamp optimizes the ETL process. In the absence of special circumstances, the ETL process can be executed on the two most recent partitions, significantly reducing the processing.

5.1 StreamBase to Flume

The event processing application produces JSON messages. The messages contain concatenated JSON payloads compliant with the target Avro schema. The header fields contain the routing information.

Field | Sample | Description
month | 2015-08-01 | Month extracted from the event timestamp information, common to all events in the payload.

It is important to note that for a given batch the event processing may produce more than one event; this may happen, for example, around midnight. Because the timestamp information is generated by the event initiator, no temporal ordering of the events is guaranteed. Therefore an event with timestamp 2015-09-01T00:00:05Z may be followed by an event with timestamp 2015-08-30T23:59:40Z.

The data contained in the message payload is represented as JSON.

Element | Sample | Description
predictions | | List of model answers. Here it contains at most one element, the one that was sent to the customer.
predictions/modelName | Hats;2015 Spring; | Arbitrary text describing the analytic model generating the result.
predictions/categoryId | Hats | Supplementary information describing the offer.
predictions/prediction | 0.3 | Prediction score.
customerId | 4bf34187-ce60-482d-b2e9-ed5584e72ada | Loyalty card number.
storeId | store-123 | Point of sale where the event happened.
transaction | | Transaction data container.
transaction/transactionId | d1193d9b-686e-4089-877e-3a05ac9d88a3 | The transaction id. It can later be used to remove duplicate reports.
transaction/transactionDate | 1427746712000 | Transaction date as a timestamp.
transaction/items | | List of transaction items.
transaction/items/productSKU | 101231 | Stock keeping unit code for the purchased product. Identifies the item in the inventory.
transaction/items/quantity | 1 | Amount of items purchased.
transaction/items/price | 145.00 | Nominal price as known to the selling point at the moment of transaction.
transaction/items/revenue | 130.00 | Actual purchase price after discount(s).
transaction/items/discount | | Optional discount information.
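Putting the fields together, an illustrative payload for a single fact could look as follows (formatted here for readability; the actual messages are compact JSON documents concatenated into one batch, and the exact field layout is governed by the target Avro schema):

{
  "predictions": [
    {"modelName": "Hats;2015 Spring;", "categoryId": "Hats", "prediction": 0.3}
  ],
  "customerId": "4bf34187-ce60-482d-b2e9-ed5584e72ada",
  "storeId": "store-123",
  "transaction": {
    "transactionId": "d1193d9b-686e-4089-877e-3a05ac9d88a3",
    "transactionDate": 1427746712000,
    "items": [
      {"productSKU": "101231", "quantity": 1, "price": 145.00, "revenue": 130.00}
    ]
  }
}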

This list can be extended to support richer data structures. In particular one may want to capture more metadata about models, all models' results (including those that did not pass the acceptance threshold), the resolved category ids, metadata about customer featurization and so on.

The event JSON payload is then taken by Flume, which appends the data to the Avro binary log. Flume opens the files under a path pattern similar to the following:

hdfs://hdfsmaster:8020/apps/data/transactionLog/month=2015-03-01/tr.1427746712000.avro

The path for each event is built from the date range the batch belongs to (month=2015-03-01) and the file creation timestamp (1427746712000). In cases where Flume is redundant, it is useful to add a distinct prefix for each copy to avoid (very unlikely) conflicts.


There can be many such files, as a new file is opened in each partition every 10 minutes. While the file is being written to, it has a slightly different name:

hdfs://hdfsmaster:8020/apps/data/transactionLog/month=2015-03-01/_tr.1427746712000.avro.tmp

This format follows an HDFS convention indicating that the file is not yet ready for reading, and it will be skipped by the ETL processes. Importantly, if a Flume agent is killed abruptly, the file will not be renamed automatically, so in order not to lose the data, an unplanned shutdown of Flume agents should be followed by assessment and renaming of the residual files.

5.2 Avro to Parquet

Once the data is securely written to HDFS, it can be further processed. In the accelerator scenario the data is converted to a Parquet structure. In this particular case, the data structure is flattened to a list of transaction items. The ETL job also groups data to minimize the number of files, so further processing can operate with a reduced number of tasks.

The result file paths are similar to the following:

hdfs://hdfsmaster:8020/apps/demo/transactions/month=2015-03-01/part-r-00000-0dff8a46-18f4-4353-9b8c-2d0d7eea2129.gz.parquet

This structure allows selection by month to be executed without reading all the files. In addition there would be one aggregation task per month. In real, large data clusters it would actually be convenient to have several files per month, but it all depends on the characteristics of the analytic process.

The analytic process in the accelerator currently focuses on model training using past data. Therefore the ETL process extracts the required fields and does local category enrichment. The Parquet files provide the following structure:

Element | Sample | Description
customerId | 4bf34187-ce60-482d-b2e9-ed5584e72ada | Customer identity (loyalty card).
transactionId | d1193d9b-686e-4089-877e-3a05ac9d88a3 | Transaction id.
transactionDate | | Transaction timestamp.
productSku | 101231 | Product id in the inventory.
categoryId | Hats | Category id as resolved in the ETL. Note that this can be different from the category id resolved during event processing. A single item can also potentially be represented as several categories.
quantity | 1 | Number of items purchased.
price | 130.00 | Effective price.

In addition, the files are partitioned by a month field holding the yyyy-mm-dd value of the first day of the month.
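An illustrative record from the partition month=2015-03-01, shown in JSON notation for readability (the actual storage is columnar Parquet, the partition value is typically encoded in the directory name rather than in the file, and the transactionDate value is reused from the 5.1 sample purely for illustration):

{"customerId": "4bf34187-ce60-482d-b2e9-ed5584e72ada", "transactionId": "d1193d9b-686e-4089-877e-3a05ac9d88a3", "transactionDate": 1427746712000, "productSku": "101231", "categoryId": "Hats", "quantity": 1, "price": 130.00}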


Because all the transaction items share the same transaction date, they are guaranteed to be stored in the same partition. This way transaction-level aggregation can be made in place without distributed grouping by key.

The customer featurization process in the data analytics relies on the customer history. Because the history is accessed in the preselected date ranges, the partitioned Parquet files prevent the transformation process from reading the files with irrelevant data.


6 Data access (REST/HTTP)

After the data is ETL-ed into Parquet files, the data lives in the Big Data cluster. That means the data is already in an accessible form, but the amount of data does not allow it to be simply loaded into memory. The ultimate consumer of the data (like Spotfire) must use its reduced form. The reduced form in this sense means usable information that can be obtained relatively quickly from the full data and has a size that can be moved around.

In the demo the data access shows examples of:

- Data sampling
- Full dataset aggregation
- Customer data pivoting

Important note about data access: the data in the HDFS cluster is intended to be processed in a holistic way. Looking up a particular piece of data in HDFS is difficult. It is feasible, but in most cases the cost is similar to a full table scan in a relational database.

6.1 Service layer

The data service layer is implemented as a thin REST layer on top of a Spark application. Spark saves a lot of the cost of processing Big Data by caching common queries and intermediate results. The application uses a grid of worker nodes to process the data efficiently.

The service exposes two major data access channels:

- Hive-compatible Spark SQL Thrift server (JDBC/ODBC)
- Lightweight REST/HTTP interface

The Thrift interface is the regular access point for JDBC and ODBC drivers. With the Spotfire connector for Spark SQL it is possible to access any registered table in the regular Spotfire/SQL way. The Thrift interface is also accessed with the beeline console shell, which allows managing the metadata, registering tables, etc.
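A minimal connection sketch (the host name follows the REST examples later in this chapter; the port is an assumption, 10000 being the default for the Spark SQL Thrift server):

beeline -u jdbc:hive2://demo.sample:10000

Once connected, the registered tables described in chapter 7 can be queried with regular SQL.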

The JDBC/ODBC interface has a significant disadvantage, which is performance. The queries are executed at the same speed independent of the access channel, but the data transfer interface imposed by the JDBC/ODBC design is suboptimal. Also, the internal design of the Spotfire connector translates the ODBC resultsets into an in-memory SBDF representation. The performance impact is especially visible for fast queries or relatively long roundtrips between Spotfire and Spark, such as Spark in AWS and Spotfire running locally.

The REST/HTTP interface tries to solve this issue by providing an access channel with minimal network usage. The design assumes the query is triggered by an HTTP call and the result comes back as a simple HTTP response optimized for the recipient. This interface is much lighter than ODBC. The downside is large queries, where the query time is significant.

The REST/HTTP interface in the accelerator uses the same pattern for data access. It is assumed that a GET query is made to a known URL resource, sometimes with query parameters used to modify the output. The interface implementation uses smart client format resolution. The data access URLs return two types of data: text or SBDF (Spotfire Binary Data Format). The resolution is done using the Accept header or the URL extension.


The important limitation of this kind of data service, no matter whether it is JDBC/ODBC or HTTP, is that it is not intended to execute queries from online services. While the Spark cluster is capable of operating on huge datasets, it is not suited for fast serving of the data, especially selecting a small piece of data. The reason for that is the lack of query scalability, and it is an architectural constraint. That means no future product improvements will solve this problem.

The advantage of HTTP is that it abstracts the underlying implementation from the interface. For example, in order to get the list of all identified categories a (cached) query is executed. The same query result could be hosted as a simple file on a web server, returning the same results without touching the client interface. This is not possible with an SQL-based design, where the actual data processing is driven from the client application.

The examples use the standard command line HTTP client curl, typically available in Linux with no additional actions. The tool is available for other systems too.

6.1.1 Text format

The text format is headerless tab-separated text. This format assumes the consumer already knows the data structure and can interpret it. The text format is the default one.

Samples using curl:

curl http://demo.sample:9060/ranges

curl http://demo.sample:9060/ranges.txt

curl -H "Accept: text/plain" http://demo.sample:9060/ranges

curl --header "Accept: text/plain" http://demo.sample:9060/ranges

The sample response:

2015-03-01 248794 15458514.52

2015-02-01 224135 16416405.89

2015-01-01 247171 20111137.93

2014-12-01 271666 23221924.09

2014-11-01 239934 20510259.33

2014-09-01 201572 14582166.06

2014-10-01 230545 18691703.79

2014-08-01 187676 11360695.33

2014-07-01 167522 7915722.47

The text format can be easily consumed in shell scripts, R programs or any text-processing-capable language.

6.1.2 SBDF format

SBDF is the native format of data representation in Spotfire. It contains both column names and types. Because it is a proprietary binary format, it is not human readable.

Samples using curl (command line HTTP client):

curl http://demo.sample:9060/ranges.sbdf > ranges.sbdf

curl -H "Accept: application/vnd.tibco.spotfire.sbdf" http://demo.sample:9060/ranges > ranges.sbdf

curl --header "Accept: application/vnd.tibco.spotfire.sbdf" http://demo.sample:9060/ranges > ranges.sbdf


6.2 /categories

The first aggregation query exposed by the data service is the list of categories. This query selects all unique categoryId values found in the ETL-ed dataset (the Parquet files).

The returned data table has the following fields:

Field name | Field type | Description
Category | Text | Category name

Sample service call:

curl http://demo.sample:9060/categories
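An illustrative text response, one category name per line (the actual values depend on the collected data; Backpacks and Hats are the category names used elsewhere in this document):

Backpacks
Hats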

6.3 /ranges

The ranges query aggregates the sales data by month.

Field name | Field type | Description
Date | Text | First day of the month in yyyy-mm-dd format
Quantity | Integer | Total number of items sold in this range
Revenue | Real | Sum of all purchase values in the range

Sample service call:

curl http://demo.sample:9060/ranges

6.4 /transactions

There are three transactions resource implementations testing various ways of data selection. The ‘/transactions’ resource provides a sampled view of the collected data. This operation is intended to provide quick insight into the structure of the data. The goal is to retrieve randomly selected transactions from the collected data. The call accepts the following parameters:

Parameter | Field type | Description
r | Text | First day of the month in yyyy-mm-dd format. The parameter can be repeated multiple times. When it is absent there is no filtering executed.
p | Real | Bernoulli sampling rate as a number between 0 and 1. A missing parameter means there is no Bernoulli sampling.
n | Integer | Maximum number of transactions to be retrieved.

The three implementations use various grouping and filtering strategies. The ‘/transactions2’ resource is at the moment the most performant one. It filters the Parquet files using the ‘r’ parameter. This implementation guarantees that the complete transaction content is delivered, but ensuring that constraint is resource consuming.


Field name | Field type | Description
TransactionId | Text | Unique transaction identifier.
TransactionDate | Text | Transaction date and time.
CustomerId | Text | Customer identifier.
ProductCode | Text | Product SKU.
ProductCategory | Text | Category for the SKU as assigned in ETL.
Quantity | Integer | Number of purchased items in this transaction line.
Revenue | Real | Transaction line revenue.

Sample service call (note the quotes to suppress shell interpretation of &):

curl "http://demo.sample:9060/transactions2?r=2015-05-01&n=10&p=0.01"

6.5 /pivot

This is a control endpoint for executing the pivoting of the data, used for customer featurization for model training. The resource triggers the underlying transformation of the flat transaction items dataset into a pivoted table with a single record per customer, containing flags for the requested response categories, total quantities for all historical categories, and total purchase quantities in each month.

Parameter Field type Description
h Text Historical ranges as first day of month in yyyy-mm-dd format. The parameter can be repeated multiple times.
r Text Response ranges as first day of month in yyyy-mm-dd format. The parameter can be repeated multiple times.
s Text Response category names.

The operation filters the ranges based on the partitioning scheme and executes the aggregation. The response is a table with dynamic columns:

Field name Field type Description
resp_<category> Integer 0 or 1 depending on whether the given customer made any purchase in the response ranges for the category (taken from the s parameter). Columns are sorted lexicographically within this block.
<category> Integer Total number of items in the given category purchased by the customer in the predictor ranges. Columns are sorted lexicographically within this block. Note that depending on the range selection some categories may be missing.
<month> Integer 12 columns named ‘mmm’ (three letter month code) that contain the customer’s total purchases in the given month of the predictor ranges. The columns are sorted in month order.

Sample service calls producing txt and sbdf are:

curl "http://demo.sample:9060/pivot?s=Hats&r=2016-11-01&r=2016-12-01&h=2015-11-01&r=2015-12-

01&h=2016-01-01&r=2016-02-01&h=2016-03-01&r=2016-04-01&h=2016-05-01&r=2016-06-01&h=2016-07-

01&r=2016-08-01&h=2016-09-01&r=2016-10-01" > pivot.txt

curl "http://demo.sample:9060/pivot.sbdf?s=Hats&r=2016-11-01&r=2016-12-01&h=2015-11-

01&r=2015-12-01&h=2016-01-01&r=2016-02-01&h=2016-03-01&r=2016-04-01&h=2016-05-01&r=2016-06-

01&h=2016-07-01&r=2016-08-01&h=2016-09-01&r=2016-10-01" > pivot.sbdf

Note that the text representation does not contain column names, so it is not self-describing. This resource is provided as a validation endpoint before the actual model training.

6.6 /sql

The SQL resource is a free-form query interface. Like the /pivot call, it returns arbitrary columns. The call may query any registered table.

Parameter Field type Description

s Text Spark SQL compatible query.

Sample service call executing query against ranges table:

curl "http://demo.sample:9060/sql?s=select+*+from+ranges"

6.7 /etl

Apart from read-only queries, the Spark applications may also offer transformation tools. In the accelerator demo the conversion from Avro to Parquet is done by the same Spark application.

6.7.1 /etl (GET)

The basic resource executes the ETL within a single request-reply cycle. If the whole process finishes successfully, the call returns code 200 (HTTP OK) and the text OK.
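A minimal sample call (the whole ETL runs within this request, so the HTTP client timeout must be large enough):

curl http://demo.sample:9060/etl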

6.7.2 /etl (POST)

The POST version of the call creates a background job. The call returns immediately with a URL to the job information in the payload.

6.7.3 /etl/{jobId} (GET)

The call returns the status of the tracked job as the text ‘running’ or ‘done’.

6.7.4 /etl/{jobId}/messages (GET)

List of messages reported by the ETL job. It is returned as plain text.

6.7.5 /etl/{jobId}/events (GET)

List of events reported by Spark.


NOTE: This call is not functional yet. The reason for having it is that explicit messages can be reported only between Spark cluster jobs, whereas Spark reports events for every stage start and stop while a job is running. This would let the consumer track the progress of a given task even while the main thread waits for results. The open problem is matching the launching thread with the events; this has not been solved yet.

6.7.6 /etl/{jobId} (DELETE)

Untrack the given job. It is usually executed once the job reaches state ‘done’.
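A hedged sketch of the background job lifecycle; the job id 1 is purely illustrative, the real value has to be taken from the payload returned by the POST call:

# submit the ETL as a background job; the response payload points to the /etl/{jobId} resource
curl -X POST http://demo.sample:9060/etl
# poll the job status until it reports 'done'
curl http://demo.sample:9060/etl/1
# inspect the messages reported by the job
curl http://demo.sample:9060/etl/1/messages
# untrack the finished job
curl -X DELETE http://demo.sample:9060/etl/1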

6.8 /models (training)

The models subresource is related to model training. It covers the actual model training job and access to the related results.

6.8.1 /models/train (GET)

This is the original model training control resource. It executes the model training cycle (pivot customers, move data to H2O, launch models, collect results) within the duration of the HTTP request. As such it is subject to timeouts. If the whole process finishes successfully, the call returns code 200 (HTTP OK) and the text OK.

Unlike the ETL job, this operation takes parameters encoded as HTTP URL query parameters.

Parameter Field type Description
m Text Reference model name, later used as directory name.
h Text Historical ranges as first day of month in yyyy-mm-dd format. The parameter can be repeated multiple times.
r Text Response ranges as first day of month in yyyy-mm-dd format. The parameter can be repeated multiple times.
s Text Response category names.
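A hedged sample call, mirroring the range selection style of the /pivot examples (model name, ranges and category are illustrative):

curl "http://demo.sample:9060/models/train?m=Demo&s=Hats&h=2015-11-01&h=2015-12-01&r=2016-01-01"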

6.8.2 /models/train (POST)

Like the GET version, this call triggers job execution. The model training job is executed by a background thread and the status can be tracked using the returned job id.

The job requires the following parameters:

Parameter Field type Description
m Text Reference model name, later used as directory name. Submitting the same model name overrides the results of previous executions.
h Text Historical ranges as first day of month in yyyy-mm-dd format. The parameter can be repeated multiple times.
r Text Response ranges as first day of month in yyyy-mm-dd format. The parameter can be repeated multiple times.
s Text Response category names.

The parameters are passed using HTTP form encoding.
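A hedged sketch using curl form encoding (each -d option contributes one application/x-www-form-urlencoded field and implies a POST request); parameter values are illustrative:

curl -d "m=Demo" -d "s=Hats" -d "h=2015-11-01" -d "h=2015-12-01" -d "r=2016-01-01" http://demo.sample:9060/models/train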

6.8.3 /models/train/{jobId} (GET)

The call returns the status of the tracked job as the text ‘running’ or ‘done’.

6.8.4 /models/train/{jobId}/messages (GET)

List of messages reported by the model training job. It is returned as plain text.

6.8.5 /models/train/{jobId}/events (GET)

List of events reported by Spark. See the /etl/{jobId}/events discussion before.

6.8.6 /models/train/{jobId} (DELETE)

Untrack the given job. It is usually executed once the job reaches state ‘done’.

6.9 /models (results and metadata)

Once the model training job completes, the results are available for inspection and assessment. The training process stores the POJO class in HDFS and collects the metadata reported by the H2O model generation jobs: the AUC model metric, the threshold values maximizing various metric functions, and the variable importance chart.

6.9.1 /models/results

The results provide a summary of the trained models. The response is a consumable data frame with the format adapted to the caller.

Field name Field type Description
Model Text Model name given during training job submission as the m parameter.
Category Text Requested category name given during training job submission as the s parameter.
AUC Real Area Under Curve, one of the possible model efficiency metrics, representing the area under the ROC points on the fpr/tpr chart.
f1 Real Cut-off value maximizing the F1 metric.
f2 Real Cut-off value maximizing the F2 metric.
f0point5 Real Cut-off value maximizing the F0.5 metric.
accuracy, precision, recall, specificity, absolute_MCC, min_per_class_accuracy Real Cut-off values maximizing the other standard metrics as returned by the H2O model job results.

Examples:

curl "http://demo.sample:9060/models/results"

curl "http://demo.sample:9060/models/results.sbdf" > results.sbdf

6.9.2 /models/roc

H2O model training jobs for binomial supervised classification collect the model answer rates for both the training and validation sets. The set of operating points is referred to as the ROC (Receiver Operating Characteristic). The ROC provides the tpr/fpr values for a given cut-off value. In the accelerator the result for the validation set is collected for visualization. The data is available in a consumer-adapted format (text or sbdf).

Field name Field type Description
Model Text Model name given during training job submission as the m parameter.
Category Text Requested category name given during training job submission as the s parameter.
Threshold Real Cut-off value of the ROC point.
tpr Real True positive rate TP/(TP+FN) (sensitivity).
fpr Real False positive rate FP/(FP+TN) (1-specificity).
tp Integer Number of detected true cases.
fp Integer Number of false cases classified as true.
tn Integer Number of correctly detected false cases.
fn Integer Number of undetected true cases.

The ROC points can be used to maximize the expected outcome from the model. In particular, the collected values can be combined with a cost function to maximize the gain or minimize the loss.

For more details please consult Wikipedia:

https://en.wikipedia.org/wiki/Receiver_operating_characteristic

https://en.wikipedia.org/wiki/F1_score

Examples:

curl "http://demo.sample:9060/models/roc"

curl "http://demo.sample:9060/models/roc.sbdf" > roc.sbdf


6.9.3 /models/varimp

The models trained using H2O report a variable importance list. Each predictor variable is scored for its prediction power in the model training job. The list of model predictor variable importances is returned in the form of a table.

Field name Field type Description
Model Text Model name given during training job submission as m parameter.
Category Text Requested category name given during training job submission as s parameter.
Variable Text Predictor variable name
Relative Importance Real Variable importance score.
Scaled Importance Real Variable importance score scaled to the highest one.
Percentage Real Variable importance score normalized to 1.

Examples:

curl "http://demo.sample:9060/models/varimp"

curl "http://demo.sample:9060/models/varimp.sbdf" > varimp.sbdf


7 Data access (SQL)

Although the Spark component’s primary access channel is REST/HTTP, it also exposes its structures for arbitrary use in data discovery in Spotfire via the Spark SQL Adapter.

In addition to the REST channel on port 9060, the data access service exposes a Hive/Thrift compatible interface on port 10001. The default port would be 10000, but that conflicts with a standard StreamBase port.
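A hedged connection example using the beeline client shipped with Hive/Spark (assuming no authentication is configured on the demo host):

beeline -u jdbc:hive2://demo.sample:10001 -e "select * from ranges"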

7.1 titems

The titems table is a Parquet relation containing the ETL-ed data from the Avro files. The table contains a flat view of the transaction items.

Field name Field type Description
customerId Text Customer unique identifier
transactionId Text Transaction unique identifier
transactionDate Timestamp Transaction date and time; same value for entries with the same transactionId
storeId Text Store identifier; same value for entries with the same transactionId
productSku Text Product SKU for the transaction line
categoryId Text Recognized category id
quantity Integer Product quantity
price Double Total price of the item
month Text First day of month used to partition the data.
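For example, a hedged aggregation over titems joined to the stores table described in the next section (issued through the Thrift interface; the same statement also works via the /sql REST resource):

beeline -u jdbc:hive2://demo.sample:10001 -e "select s.region, sum(t.price) as revenue from titems t join stores s on t.storeId = s.storeNum group by s.region"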

7.2 stores

The stores table contains store reference data. The dataset is converted into Parquet and exposed as a table. The table is used to execute analysis based on spatial and regional information.

Field name Field type Description

storeNum Text Unique store identifier

street Text Store street address

state Text State code

city Text City where the store is located

zip Text Zip code of the store

latitude / longitude Double Spatial position of the store

region Text Region code of the store

The stores table is read from Parquet files stored at:

hdfs://demo.sample/apps/demo/reference/stores

7.3 customer_demographic_info

The customer_demographic_info table stores customer demographic information. This is auxiliary information used in the transaction generation process. In the accelerator concept this is a hidden variable; in the visualization it is used to compare the observed customer behaviour against the assumed model.

Field name Field type Description
customerId Text Unique customer identifier
name Text Customer name
age Integer Customer age used for data generation. Note: in real data collection the age value only makes sense together with the time when it was captured.
gender Integer Customer gender.
married Logical Marital status.
hasKids Logical Flag indicating that the customer may have a higher tendency to buy goods for children.
incomeQuartile Integer Income level indicator.
hobby Text Customer hobby.
zip Text Customer zip code (used to select the best store)
educationalLevel Integer Customer education indicator
employment Integer Customer employment status
ownHome Logical Flag indicating that the customer owns real estate.

The table is read from Parquet files stored at:

hdfs://demo.sample/apps/demo/reference/customer_demographic_info


7.4 customerid_segments

The customerid_segments table provides precomputed customer segmentation information and

preferred store id.

Field name Field type Description

customerId Text Unique customer identifier

marketSegment Integer Customer segment.

storeNum Text Preferred store id.

The table is read from Parquet files stored at:

hdfs://demo.sample/apps/demo/reference/customerid_segments

7.5 categories

The categories table contains the unique categories extracted from the titems table. It is a temporary table defined as a Spark transformation in Scala and registered in the SQL context.

Field name Field type Description

categoryId Text Category id

7.6 ranges

The ranges table contains aggregated monthly sales information. Like categories, it is a temporary table defined as a Spark transformation in Scala and registered in the SQL context.

Field name Field type Description

month Text Month in format yyyy-mm-01

quantity Integer Total number of items sold

revenue Double Total revenue in the given month


8 POJO layout and configuration deployment

The accelerator uses the H2O POJO model representation to evaluate models in the event processing layer. The POJOs (a.k.a. genmodel) are lightweight model implementations focused on real-time execution. They are compiled directly into Java bytecode and use no runtime object allocation on the heap during scoring. This way the scoring latency can be reduced to a few microseconds, which makes H2O model evaluation lightning-fast.

The POJOs are exported from the H2O cluster as Java source code that has to be compiled and loaded into memory. This feature is directly supported by the new H2O operator in StreamBase 7.6.4. The operator requires the POJO class content to be available as a referenceable resource, for example in HDFS.

A single H2O model operator instance supports multiple models deployed at the same time. They should share the same characteristics:

- compatible predictor fields

- compatible model type (regression, binomial classification or multinomial classification)

In the accelerator the models are of the binomial classification type and use a featurization consisting of fields holding the total item count in each category and in each month of the past period. The predictor period considered here is the 270 days before the incoming transaction, plus the new transaction itself. The response is a classification of whether the customer history indicates a propensity to purchase in the target category in the coming days.

In order to support this construction the model training jobs save the result data in HDFS. The models

are then bundled together, described with metadata and deployed to the cluster.

8.1 HDFS layout

The models are prepared for deployment in HDFS. In the distributed filesystem the solution stores

metadata about training results and target deployment bundles.

8.1.1 hdfs://demo.sample/apps/demo/models/pojo

The directory stores generated POJO files. It contains a subdirectory for each model training job. The

directory name is provided as m parameter.

[demo@demo ~]$ hadoop fs -ls /apps/demo/models/pojo/Demo

Found 2 items

-rw-r--r-- 3 demo supergroup 1758956 2016-04-07 20:08

/apps/demo/models/pojo/Demo/Backpacks.pojo

-rw-r--r-- 3 demo supergroup 21840350 2016-04-07 20:08

/apps/demo/models/pojo/Demo/Hats.pojo

Each file contains POJO source code for the target category provided to the model training job as s

parameter.

[demo@demo ~]$ hadoop fs -cat /apps/demo/models/pojo/Demo/Hats.pojo | head

/*

Licensed under the Apache License, Version 2.0

http://www.apache.org/licenses/LICENSE-2.0.html


AUTOGENERATED BY H2O at 2016-04-07T20:08:48.441Z

3.8.1.3

Standalone prediction code with sample test data for DRFModel named

DRF_model_1460059617025_1

How to download, compile and execute:

During model training with the same model name the existing directory is removed.

8.1.2 hdfs://demo.sample/apps/demo/models/results

The results of the model training are stored in tab-separated files, one for each training job. Similarly to the POJOs, the file is overwritten by jobs with the same model name.

[demo@demo ~]$ hadoop fs -ls /apps/demo/models/results

Found 10 items

-rw-r--r-- 3 demo supergroup 353 2016-04-07 20:08

/apps/demo/models/results/Demo.txt

-rw-r--r-- 3 demo supergroup 511 2016-03-24 13:03 /apps/demo/models/results/Test 2

.txt

-rw-r--r-- 3 demo supergroup 630 2016-03-28 13:59 /apps/demo/models/results/Test

2.txt

-rw-r--r-- 3 demo supergroup 0 2016-04-05 10:46 /apps/demo/models/results/Test

3.txt

-rw-r--r-- 3 demo supergroup 630 2016-04-04 10:18 /apps/demo/models/results/Test

4.txt

-rw-r--r-- 3 demo supergroup 359 2016-04-07 22:11 /apps/demo/models/results/Test

5.txt

-rw-r--r-- 3 demo supergroup 493 2016-03-23 15:15

/apps/demo/models/results/Test.txt

-rw-r--r-- 3 demo supergroup 235 2016-04-07 22:32 /apps/demo/models/results/XYZ.txt

-rw-r--r-- 3 demo supergroup 244 2016-04-07 22:35 /apps/demo/models/results/qwerty

12345.txt

-rw-r--r-- 3 demo supergroup 373 2016-04-13 15:49 /apps/demo/models/results/test

model.txt

The results file content can be verified using the cat command:

[demo@demo ~]$ hadoop fs -cat /apps/demo/models/results/Demo.txt

Model Category AUC f1 f2 f0point5 accuracy precision

recall specificity absolute_MCC min_per_class_accuracy

Demo Hats 0.74462465 0.30282423 0.20221285 0.46907704 0.50914093

0.80936127 0.05373563 0.80936127 0.30282423 0.41259288

Demo Backpacks 0.47964015 0.02173136 0.00002418 0.02173136

0.24015803 0.02173136 0.00002418 0.24015803 0.00012852 0.00258690

The files available in the directory are aggregated together by the Spark data access service and hosted as a REST resource.

8.1.3 hdfs://demo.sample/apps/demo/models/roc

Similarly as training job results, the ROC points are also stored in HDFS in a text file for each job.

[demo@demo ~]$ hadoop fs -ls /apps/demo/models/roc

Found 10 items

-rw-r--r-- 3 demo supergroup 51228 2016-04-07 20:08 /apps/demo/models/roc/Demo.txt

-rw-r--r-- 3 demo supergroup 91074 2016-03-24 13:03 /apps/demo/models/roc/Test 2 .txt

-rw-r--r-- 3 demo supergroup 117505 2016-03-28 13:59 /apps/demo/models/roc/Test 2.txt

-rw-r--r-- 3 demo supergroup 0 2016-04-05 10:46 /apps/demo/models/roc/Test 3.txt


-rw-r--r-- 3 demo supergroup 117480 2016-04-04 10:18 /apps/demo/models/roc/Test 4.txt

-rw-r--r-- 3 demo supergroup 54803 2016-04-07 22:11 /apps/demo/models/roc/Test 5.txt

-rw-r--r-- 3 demo supergroup 85489 2016-03-23 15:15 /apps/demo/models/roc/Test.txt

-rw-r--r-- 3 demo supergroup 27239 2016-04-07 22:32 /apps/demo/models/roc/XYZ.txt

-rw-r--r-- 3 demo supergroup 30846 2016-04-07 22:35 /apps/demo/models/roc/qwerty

12345.txt

-rw-r--r-- 3 demo supergroup 57439 2016-04-13 15:49 /apps/demo/models/roc/test

model.txt

The file content can be verified using the cat command:

[demo@demo ~]$ hadoop fs -cat /apps/demo/models/roc/Demo.txt | head

Model Category Threshold tpr fpr tp fp tn fn

Demo Hats 0.80936127 0.00010692 0.00000000 1 0 15644 9352

Demo Hats 0.80061281 0.00010692 0.00025569 1 4 15640 9352

Demo Hats 0.79416358 0.00042767 0.00025569 4 4 15640 9349

Demo Hats 0.78396730 0.00117609 0.00031961 11 5 15639 9342

Demo Hats 0.77707700 0.00171068 0.00038353 16 6 15638 9337

Demo Hats 0.77294468 0.00192452 0.00044746 18 7 15637 9335

Demo Hats 0.76663585 0.00256602 0.00070314 24 11 15633 9329

Demo Hats 0.76177124 0.00310061 0.00089491 29 14 15630 9324

Demo Hats 0.75874359 0.00363520 0.00102276 34 16 15628 9319

The files available in the directory are aggregated together by the Spark data access service and hosted as a REST resource.

8.1.4 hdfs://demo.sample/apps/demo/models/varimp

Finally, the variable importance results are also stored as text files.

[demo@demo ~]$ hadoop fs -ls /apps/demo/models/varimp

Found 10 items

-rw-r--r-- 3 demo supergroup 6980 2016-04-07 20:08 /apps/demo/models/varimp/Demo.txt

-rw-r--r-- 3 demo supergroup 12385 2016-03-24 13:03 /apps/demo/models/varimp/Test 2

.txt

-rw-r--r-- 3 demo supergroup 15785 2016-03-28 13:59 /apps/demo/models/varimp/Test

2.txt

-rw-r--r-- 3 demo supergroup 0 2016-04-05 10:46 /apps/demo/models/varimp/Test

3.txt

-rw-r--r-- 3 demo supergroup 15785 2016-04-04 10:18 /apps/demo/models/varimp/Test

4.txt

-rw-r--r-- 3 demo supergroup 7383 2016-04-07 22:11 /apps/demo/models/varimp/Test

5.txt

-rw-r--r-- 3 demo supergroup 11499 2016-03-23 15:15 /apps/demo/models/varimp/Test.txt

-rw-r--r-- 3 demo supergroup 3766 2016-04-07 22:32 /apps/demo/models/varimp/XYZ.txt

-rw-r--r-- 3 demo supergroup 4261 2016-04-07 22:35 /apps/demo/models/varimp/qwerty

12345.txt

-rw-r--r-- 3 demo supergroup 8002 2016-04-13 15:49 /apps/demo/models/varimp/test

model.txt

The file content can be verified using the cat command:

[demo@demo ~]$ hadoop fs -cat /apps/demo/models/varimp/Demo.txt | head

Model Category Variable Relative Importance Scaled Importance

Percentage

Demo Hats Hats 42515.66015625 1.00000000 0.12774973

Demo Hats Other Youth Clothes 18572.19140625 0.43683178 0.05580514

Demo Hats Other Sportswear 18197.13476563 0.42801017 0.05467818

Demo Hats Running Clothes 13642.03417969 0.32087081 0.04099116

Demo Hats Other Activity Gear 12797.34765625 0.30100315 0.03845307

Demo Hats Womens Fleece 9781.95605469 0.23007889 0.02939252

Demo Hats Mens Fleece 9526.95605469 0.22408110 0.02862630


Demo Hats Endurance Training Clothes 8466.57910156 0.19914025 0.02544011

Demo Hats Socks and Belts 8291.94042969 0.19503262 0.02491536

The files available in the directory are aggregated together by the Spark data access service and hosted as a REST resource.

8.1.5 hdfs://demo.sample/apps/demo/models/sets

The model metadata is kept in HDFS for easy sharing within the cluster. The directory contains a set of model definition entries describing the model execution behaviour.

[demo@demo ~]$ hadoop fs -ls /apps/demo/models/sets

Found 5 items

-rw-r--r-- 1 demo supergroup 4 2016-03-28 16:07 /apps/demo/models/sets/empty.txt

-rw-r--r-- 1 demo supergroup 312 2016-04-07 15:42 /apps/demo/models/sets/models-

a.txt

-rw-r--r-- 1 demo supergroup 235 2016-03-28 22:18 /apps/demo/models/sets/models-

b.txt

-rw-r--r-- 1 demo supergroup 77 2016-04-05 22:59 /apps/demo/models/sets/models-

c.txt

-rw-r--r-- 1 demo supergroup 311 2016-04-13 16:01 /apps/demo/models/sets/models-

q.txt

An example model descriptor can be accessed with the cat command:

[demo@demo ~]$ hadoop fs -cat /apps/demo/models/sets/models-a.txt

Hats hdfs://demo.sample/apps/demo/models/pojo/Test/Hats.pojo 0.19562227 Hats

Mens Fleece hdfs://demo.sample/apps/demo/models/pojo/Test/Mens%20Fleece.pojo

0.28281646 Mens Fleece

Mens Hardshell Jackets

hdfs://demo.sample/apps/demo/models/pojo/Test/Mens%20Hardshell%20Jackets.pojo 0.13727364

Mens Hardshell Jackets

The file is tab-separated with each line describing a model. The fields are:

Field name Field type Description
Model Name Text Model name reference. It is used to identify the model generating a given score. It is recommended to keep it consistent across bundles.
POJO URL Text URL pointing to the POJO source. The H2O operator reads the file and compiles it on the fly. The usual Java compiler restrictions are relaxed.
Cut-Off Real The actual cut-off value to be used for binomial classification. H2O by default uses the threshold maximizing the F1 metric. This parameter allows overriding the classification result to maximize a use-case specific metric. The field has a value between 0.0 and 1.0.
Category Id Text The category id used to track customer responses to the offer.
Offer Name Text Descriptive offer name.
Model Version Text Model version label.
Description Text Details about the model.
Valid From Date Effective start date for the model validity in form yyyy-mm-dd. Inclusive.
Valid To Date Effective end date for the model validity in form yyyy-mm-dd. Exclusive.

8.2 Configuration deployment

The model deployment assumes decoupling of the model operations from execution. To achieve that, the deployment is done via ZooKeeper. ZooKeeper is a cluster data sharing component, similar in concept to ActiveSpaces. It supports addressed data modifications with notifications to all listeners. ZooKeeper was built first and foremost for stability and strong consistency; the product is used to deliver communication primitives in large clusters.

In the accelerator the event processing components are clients for configuration change notifications. ZooKeeper guarantees that a client eventually receives the last state change notification. That means several fast updates may result in a single notification to the consumer. The entries in ZooKeeper form a filesystem-like tree structure and are called z-nodes.

The following items are configurable and deployable on demand:

- model metadata

- product SKU to category mapping

- feature list

ZooKeeper can be accessed with the command line client. An example ZK session:

[demo@demo bin]$ pwd

/opt/java/zookeeper-3.4.8/bin

[demo@demo bin]$ ./zkCli.sh -server localhost:2181/demo

Connecting to localhost:2181/demo

(... skipped for clarity ...)

[zk: localhost:2181/demo(CONNECTED) 0] ls /config

[h2oModel, features, products]

[zk: localhost:2181/demo(CONNECTED) 1] get /config/h2oModel

hdfs://demo.sample/apps/demo/models/sets/models-c.txt

cZxid = 0x19a

ctime = Sun Mar 20 21:48:52 UTC 2016

mZxid = 0x956635

mtime = Wed Apr 13 16:02:58 UTC 2016

pZxid = 0x19a

cversion = 0

dataVersion = 32

aclVersion = 0

ephemeralOwner = 0x0

dataLength = 53

numChildren = 0

[zk: localhost:2181/demo(CONNECTED) 2]

8.2.1 Model deployment

The currently active model bundle is configured at the /config/h2oModel z-node. The z-node contains a URL pointer to the current model set as described above. Whenever the z-node content changes, all event processing applications (StreamBase) try loading the models. If any model on the list fails to load, the change is rejected, but only on the node that failed. If the behaviour is required to be consistent across the cluster, special actions should be implemented. An example of such a design is a fail-fast engine shutdown on model loading failure, which would prevent further processing with the old configuration.
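Switching the active bundle is then a matter of updating the z-node content, for example with the command line client (a hedged sketch pointing at one of the bundles listed in section 8.1.5):

echo "set /config/h2oModel hdfs://demo.sample/apps/demo/models/sets/models-a.txt" | ./zkCli.sh -server localhost:2181/demo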

The model deployment bundle fields (repeated from previous chapter):

Field name Field type Description
Model Name Text Model name reference. It is used to identify the model generating a given score. It is recommended to keep it consistent across bundles.
POJO URL Text URL pointing to the POJO source. The H2O operator reads the file and compiles it on the fly. The usual Java compiler restrictions are relaxed.
Cut-Off Real The actual cut-off value to be used for binomial classification. H2O by default uses the threshold maximizing the F1 metric. This parameter allows overriding the classification result to maximize a use-case specific metric. The field has a value between 0.0 and 1.0.
Category Id Text The category id used to track customer responses to the offer.
Offer Name Text Descriptive offer name.
Model Version Text Model version label.
Description Text Details about the model.
Valid From Date Effective start date for the model validity in form yyyy-mm-dd. Inclusive.
Valid To Date Effective end date for the model validity in form yyyy-mm-dd. Exclusive.

8.2.2 Product SKU to category id mapping

The accelerator assumes the transaction originator has limited capabilities. It is expected to deliver only information guaranteed to be true, which means the transaction message has minimalistic content. This avoids inconsistencies when reference data differs between transaction terminals.

A consequence of this design is that the transaction line category id is unknown when the message arrives at the event processing layer, while the model logic operates on category ids. What's more, the category ids may change over time and there can be zero or more categories valid for a product. An example transaction line categorization could be: female, jacket, green, winter, collection-2016. This approach enables multi-dimensional processing of incoming transactions.

The accelerator keeps the mapping locally for cross-referencing efficiency. The reference table is read from an HDFS file. The current table is registered in ZooKeeper at /config/products. The file is tab-separated text.

Field name Field type Description

Product SKU Text Product inventory key as sent in the transaction lines.

Category ID Text Category id assigned to the product.

Other Text Other reference fields; ignored at the moment.

Example access to the ZooKeeper and file content:

[demo@demo bin]$ echo "get /config/products" | ./zkCli.sh -server localhost:2181/demo

Connecting to localhost:2181/demo

(... skipped for clarity ...)

[zk: localhost:2181/demo(CONNECTED) 0] get /config/products

hdfs://demo.sample/apps/demo/config/products.txt

cZxid = 0x199

ctime = Sun Mar 20 21:48:52 UTC 2016

mZxid = 0x199

mtime = Sun Mar 20 21:48:52 UTC 2016

pZxid = 0x199

cversion = 0

dataVersion = 0

aclVersion = 0

ephemeralOwner = 0x0

dataLength = 48

numChildren = 0

[zk: localhost:2181/demo(CONNECTED) 1] [demo@demo bin]$

[demo@demo bin]$

[demo@demo bin]$ hadoop fs -cat hdfs://demo.sample/apps/demo/config/products.txt | head

100005 Womens Insulated Jackets 2 94.99 172.00 94.99

100017 Hats 13 20.00 38.15 40.00

100024 Other Sportswear 1 22.50 22.50 22.50

100061 Running Clothes 2 50.00 50.00 50.00

100080 Mens Footwear 1 117.00 117.00 117.00

100138 Running Clothes 1 25.00 25.00 25.00

100145 Outdoor Gear 1 99.00 99.00 99.00

100191 Hats 13 36.00 39.38 40.00

100249 Other Activity Gear 3 10.00 13.33 20.00

100265 Other Sportswear 2 34.19 44.59 55.00

8.2.3 Feature mapping

The numerical models operate on set of features. Currently they are total quantities for each known

category. If the category has never shown up on customer purchases, the field to be passed to the

model is unknown and the H2O operator passes NaN value. In order to prefill empty categories with

zeros it is required to know the expected category id list before. The list of categories can be also used

to optimize the feature list calculation by directly updating positions in a fixed array of doubles. Currently

the last point is more efficient in the operator itself.

The list of features is also kept as a tab-separated text file in HDFS. The currently active file is registered in the z-node /config/features.

Field name Field type Description
Category ID Text Category ID as understood by the models.
Feature position Integer Position of the feature on the list. Currently unused, as the position is provided by the model itself. In addition, each model may be trained using different feature fields, which makes this field obsolete.

Example access to the ZooKeeper and file content:

[demo@demo bin]$ echo "get /config/features" | ./zkCli.sh -server localhost:2181/demo

Connecting to localhost:2181/demo

(... skipped for clarity ...)

[zk: localhost:2181/demo(CONNECTED) 0] get /config/features

hdfs://demo.sample/apps/demo/config/features.txt

cZxid = 0x198

ctime = Sun Mar 20 21:48:52 UTC 2016

mZxid = 0x198

mtime = Sun Mar 20 21:48:52 UTC 2016

pZxid = 0x198

cversion = 0

dataVersion = 0

aclVersion = 0

ephemeralOwner = 0x0

dataLength = 48

numChildren = 0

[zk: localhost:21

[demo@demo bin]$

[demo@demo bin]$ hadoop fs -cat hdfs://demo.sample/apps/demo/config/features.txt | head

Active Sportswear 0

Backpacks 1

Boys Insulated Jackets 2

Endurance Training Clothes 3

Girls Fleece 4

Girls Insulated Jackets 5

Hats 6

Hydration Packs 7

Mens Alpine Jackets 8

Mens Climbing Gloves 9

8.2.4 Model deployment procedure

The model deployment lifecycle requires the prior existence of configuration items. For example, if a new model expects a category field "Collection 2016", the category should appear on the feature list and the cross-referencing should be defined. The correct sequence would then be (a hedged command-line sketch follows the list):

- create a copy of the current feature list and add the new features

- deploy the features; zeros appear as model input and the field is ignored

- create a copy of the products category mapping and add the new entries

- deploy the products mapping; if there are already products assigned to the new feature, the values start appearing in the model input

- take the current model bundle, add entries describing the POJOs that use the new features, and define the model validity range; the models are evaluated, but their results are discarded until the incoming transactions have a timestamp within the validity range; the cost of H2O model evaluation is minimal, so unnecessary execution is not an issue.
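A hedged command-line sketch of the sequence above; the local file names (features-v2.txt, products-v2.txt, models-new.txt) are purely illustrative, only the z-node names and HDFS locations follow the previous sections:

# 1. publish the extended feature list and activate it
hadoop fs -put features-v2.txt /apps/demo/config/features-v2.txt
echo "set /config/features hdfs://demo.sample/apps/demo/config/features-v2.txt" | ./zkCli.sh -server localhost:2181/demo
# 2. publish the extended product SKU to category mapping and activate it
hadoop fs -put products-v2.txt /apps/demo/config/products-v2.txt
echo "set /config/products hdfs://demo.sample/apps/demo/config/products-v2.txt" | ./zkCli.sh -server localhost:2181/demo
# 3. publish the new model bundle referencing the new POJOs and activate it
hadoop fs -put models-new.txt /apps/demo/models/sets/models-new.txt
echo "set /config/h2oModel hdfs://demo.sample/apps/demo/models/sets/models-new.txt" | ./zkCli.sh -server localhost:2181/demo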

