arxiv:1504.05707v1 [cs.db] 22 apr 2015 · 2018-10-03 · convergent invoicing generate legally...

18
Towards More Data-Aware Application Integration (extended version) Daniel Ritter SAP SE, Technology Development, Dietmar-Hopp-Allee 16, 69190 Walldorf, Germany [email protected] Abstract. Although most business application data is stored in rela- tional databases, programming languages and wire formats in integration middleware systems are not table-centric. Due to costly format conver- sions, data-shipments and faster computation, the trend is to “push-down” the integration operations closer to the storage representation. We ad- dress the alternative case of defining declarative, table-centric integration semantics within standard integration systems. For that, we replace the current operator implementations for the well-known Enterprise Integra- tion Patterns by equivalent “in-memory” table processing, and show a practical realization in a conventional integration system for a non-reliable, “data-intensive” messaging example. The results of the runtime analy- sis show that table-centric processing is promising already in standard, “single-record” message routing and transformations, and can potentially excel the message throughput for “multi-record” table messages. Keywords: Datalog, Message-based / Data integration, Integration System. 1 Introduction Integration middleware systems in the sense of EAI brokers [5] (e. g., SAP HANA Cloud Integration 1 , Boomi AtomSphere 2 ) address the fundamental need for (business) application integration by acting as a messaging hub between applications. As such, they have become ubiquitous in service-oriented enterprise computing environments. Messages are mediated between applications mostly in wire formats based on XML (e.g., SOAP for Web Services). The advent of more “data-aware” integration scenarios (observation O1 ) put emphasis on (near) “real-time” or online processing (O2 ), which requires us to revisit the standard integration capabilities, system design and architectural decisions. For instance, in the financial / utilities industry, China Mobile generates 5-8 TB of call detail records per day, which have to be processed by integration systems (i. e., mostly message routing and transformation patterns), “convergent charging” 3 (CC) and “invoicing” applications (not further discussed). In addition, 1 SAP HCI, visited 04/2015: https://help.sap.com/cloudintegration. 2 Boomi AtomSphere, visited 04/2015: http://www.boomi.com/integration. 3 Solace Solutions, visited 02/2015; last update 2012: http://www.solacesystems. com/techblog/deconstructing-kafka. arXiv:1504.05707v1 [cs.DB] 22 Apr 2015

Upload: others

Post on 12-May-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: arXiv:1504.05707v1 [cs.DB] 22 Apr 2015 · 2018-10-03 · Convergent Invoicing generate legally binding payment documents. Alternatively, new application and data analytics stacks

Towards More Data-Aware ApplicationIntegration (extended version)

Daniel Ritter

SAP SE, Technology Development,Dietmar-Hopp-Allee 16, 69190 Walldorf, Germany

[email protected]

Abstract. Although most business application data is stored in rela-tional databases, programming languages and wire formats in integrationmiddleware systems are not table-centric. Due to costly format conver-sions, data-shipments and faster computation, the trend is to “push-down”the integration operations closer to the storage representation. We ad-dress the alternative case of defining declarative, table-centric integrationsemantics within standard integration systems. For that, we replace thecurrent operator implementations for the well-known Enterprise Integra-tion Patterns by equivalent “in-memory” table processing, and show apractical realization in a conventional integration system for a non-reliable,“data-intensive” messaging example. The results of the runtime analy-sis show that table-centric processing is promising already in standard,“single-record” message routing and transformations, and can potentiallyexcel the message throughput for “multi-record” table messages.

Keywords: Datalog, Message-based / Data integration, Integration System.

1 Introduction

Integration middleware systems in the sense of EAI brokers [5] (e. g., SAPHANA Cloud Integration1, Boomi AtomSphere2) address the fundamental needfor (business) application integration by acting as a messaging hub betweenapplications. As such, they have become ubiquitous in service-oriented enterprisecomputing environments. Messages are mediated between applications mostly inwire formats based on XML (e. g., SOAP for Web Services).

The advent of more “data-aware” integration scenarios (observation O1 ) putemphasis on (near) “real-time” or online processing (O2 ), which requires usto revisit the standard integration capabilities, system design and architecturaldecisions. For instance, in the financial / utilities industry, China Mobile generates5-8 TB of call detail records per day, which have to be processed by integrationsystems (i. e., mostly message routing and transformation patterns), “convergentcharging”3 (CC) and “invoicing” applications (not further discussed). In addition,

1 SAP HCI, visited 04/2015: https://help.sap.com/cloudintegration.2 Boomi AtomSphere, visited 04/2015: http://www.boomi.com/integration.3 Solace Solutions, visited 02/2015; last update 2012: http://www.solacesystems.

com/techblog/deconstructing-kafka.

arX

iv:1

504.

0570

7v1

[cs

.DB

] 2

2 A

pr 2

015

Page 2: arXiv:1504.05707v1 [cs.DB] 22 Apr 2015 · 2018-10-03 · Convergent Invoicing generate legally binding payment documents. Alternatively, new application and data analytics stacks

2 Daniel Ritter

the standard XML-processing has to give ground to other formats like JSONand CSV (O3 ). These observations (O1–3 ) are backed by similar scenarios fromsports management (e. g., online player tracking) and the rapidly growing amountof data from the Internet of Things and Cyber Physical System domains. Forthose scenarios, an architectural setup with systems like Message Queuing (MQ)are used as reliable “message buffers” (i. e., queues, topics) that handle “bursty”incoming messages and smoothen peak loads (cf. Fig. 1). Integration systemsare used as message consumers, which (transactionally) dequeue, transform(e. g., mapping, content enrichment) and route messages to applications. Forreliable transport and message provenance, integration systems require relationalDatabase Systems, in which most of the (business) application data is currentlystored (O4 ). When looking at the throughput capabilities of the named systems,software-/hardware-based MQ systems like Apache Kafka or Solace3 are able toprocess several millions of messages per second. RDBMS benchmarks like TPC-Hmeasure queries and inserts in PB sizes, while simple, header-based routingbenchmarks for integration systems show message throuphputs of few thousandsof messages per second [2] (O5 ). In other words, MQ and DBMS (e. g., RDBMS,NoSQL, NewSQL) systems are already addressing observations O1–5. Integrationsystems, however, seem to not be there yet.

Compared to MQs, integration systems work on message data, which seems tomake the difference in message throughput. We argue that integration operations,represented by Enterprise Integration Patterns (EIP) [9], can be mapped to an“in-memory” representation of the table-centric RDBMS operations to profitfrom their efficient and fast evaluation. Early ideas on this were brought upin our position papers [11,12]. In this work, we follow up to shed light on theobserved discrepancies. We revisit the EIP building blocks and operator modelof integration systems, for which we define RDBMS-like table operators (so farwithout changing their semantics) as a symbiosis of RDBMS and integrationprocessing by using Datalog [16]. We choose Datalog as example of an efficientlycomputable, table-like integration co-processing facility close to the actual storagerepresentation with expressive foundations (e. g., recursion), which we call Table-centric Integration Patterns (TIP). To show the applicability of our approachto integration scenarios along observations O1–5 we conduct an experimentalmessage throughput analysis for selected routing and transformation patterns,where we carefully embed the TIP definitions into the open-source integrationsystem Apache Camel [3] that implements most of the EIPs. Not changing theEIP semantics means that table operations are executed on “single-record” tablemessages. We give an outlook to “multi-record” table message processing.

The remainder of this paper is organized along its main contributions. After amore comprehensive explanation of the motivating CC example and a brief sketchof our approach in Sect. 2, we analyse common integration patterns with respectto their extensibility for alternative operator models and define a table-centricoperator / processing model that can be embedded into the patterns (still) alignedwith their semantics using for example Datalog in Sect. 3. In Sect. 4 we apply ourapproach to a conventional integration system and briefly describe and discuss

Page 3: arXiv:1504.05707v1 [cs.DB] 22 Apr 2015 · 2018-10-03 · Convergent Invoicing generate legally binding payment documents. Alternatively, new application and data analytics stacks

Towards More Data-Aware Application Integration (extended version) 3

our experimental performance analysis, and we draw an experimental sketch ofthe idea of “multi-record” table message processing. Section 5 examines relatedwork and Sect. 6 concludes the paper.

2 Motivating Example and General Approach

In this section, the motivating “Call Record Detail” example in the context of the“Convergent Charging” application is described more comprehensively along witha sketch of our approach. Figure 1 shows aspects of both as part of a commonintegration system architecture.

Fig. 1. High-level overview of the convergent charging application and architecture.

2.1 The Convergent Charging Scenario

Mobile service providers like China Mobile generate large amounts of “CallRecord Details” (CRDs) per day that have been processed by applications likeSAP Convergent Charging4 (CC). As shown in Fig. 1, these CRDs are usuallysent from mobile devices to integration systems (optionally buffered in a MQSystem), where they are translated to an intermediate (application) format andenriched with additional master data (e. g., business partner, product). Themaster data helps to calculate pricing information, with which the messageis split into several messages, denoting billable items (i. e., item for billing)that are routed to their receivers (e. g., DB). From there applications like SAPConvergent Invoicing generate legally binding payment documents. Alternatively,new application and data analytics stacks like LogicBlox [7], WebdamLog [1], andSAP S/4HANA5 (not shown) access the data for further processing. Some of these

4 SAP Convergent Charging, last visited 04/2015: https://help.sap.com/cc.5 SAP S/4HANA, last visited 04/2015: http://discover.sap.com/S4HANA

Page 4: arXiv:1504.05707v1 [cs.DB] 22 Apr 2015 · 2018-10-03 · Convergent Invoicing generate legally binding payment documents. Alternatively, new application and data analytics stacks

4 Daniel Ritter

“smart” stacks even provide declarative, Datalog-like language for applicationand user-interface programming, which complements our integration approach.As motivated before, standard integration systems have problems processing thehigh number and rate of incoming messages, which usually leads to an “offline”,multi-step processing using indirections like ETL systems and pushing integrationlogic to the applications, leading to long-running CC runs.

2.2 General Approach

The Enterprise Integration Patterns (EIPs) [9] define “de-facto” standard opera-tions on the header (i. e., payload’s meta-data) and body (i. e., message payload)of a message, which are normally implemented in the integration system’s hostlanguage (e. g., Java, C#). This way, the actual integration operation (i. e., thecontent developed by an integration expert like mapping programs and routingconditions) can be differentiated from the implementation of the runtime systemthat invokes the content operations and processes their results. For instance,Fig. 1 shows the separation of concerns within integration systems with respectto “system-related” and “content-related parts” and sketches which patternoperations to re-define using relational table operators, while leaving the runtimesystem (implementation) as is. The goal is to only change these operations andmake integration language additions for table-centric processing within the con-ventional integration system, while preserving the general integration semanticslike Quality of Service (e. g., best effort, exactly once) and the Message ExchangePattern (e. g., one-way, two-way). In other words, the content-related parts of thepattern definitions are evaluated by an “in-process” table operation processor(e. g., a Datalog system), which is embedded into the standard integration systemand invoked during the message processing.

3 Table-centric Integration Patterns

Before defining Table-centric Integration Patterns (short TIP) for message routingand transformation more formally, let us recall the encoding of some relevant, basicdatabase operations / operators into Datalog: join, projection, union, andselection. The join of two relations r(x, y) and s(y, z) on parameter y is encodedas j(x, y, z)← r(x, y), s(y, z), which projects all three parameters to the resultingpredicate j. More explicitly, a projection on parameter x of relation r(x, y) isencoded as p(x)← r(x, y). The union of r(x, y) and s(x, y) is u(x, y)← r(x, y).u(x, y)← s(x, y), which combines several relations to one. The selection r(x, y)according to a built-in predicate φ(x), where φ(x) can contain constants andfree variables, is encoded as s(x, y) ← r(x, y), φ(x). Built-in predicates can bebinary relations on numbers such as <,<=,=, binary relations on strings such asequals, contains, startswith or predicates applied to expressions based on binaryoperators like +,−, ∗, / (e. g., x = p(y) + 1), and operations on relations likez = max(p(x, y), x), z = min(p(x, y), x), which would assign the maximal or theminimal value x of a predicate p to a parameter z.

Page 5: arXiv:1504.05707v1 [cs.DB] 22 Apr 2015 · 2018-10-03 · Convergent Invoicing generate legally binding payment documents. Alternatively, new application and data analytics stacks

Towards More Data-Aware Application Integration (extended version) 5

Although our approach allows each single pattern definition to evaluatearbitrary, recursive Datalog operations and built-in predicates, the Datalog topattern mapping tries to identify and focus on the most relevant table-centricoperations for a specific pattern. An overview of the mapping of all discussedmessage routing and transformation operations to Datalog constructs is shownin Fig. 2 and is subsequently discussed. Subsequently, we enumerate common

bu

ilt-i

njo

inse

lect

ion

proj

ecti

onu

nio

n

Message RoutingRouter, Filter:Recipient List

Multicast, Join RouterSplitter

Correlation, CompletionAggregation

Message TransformationMessage translator

Content filterContent enricher

Fig. 2. Message routing and transformation patterns mapped to Datalog. Most commonDatalog operations for a single pattern are marked “dark blue”, less common ones“light blue”, and possible but uncommon ones “white”.

EIPs and separate system- from content-related parts more formally for the TIPdefinition by example of standard Datalog.

3.1 Canonical Data Model

When connecting applications, various operations are executed on the transferredmessages in a uniform way. The arriving message instances are converted into aninternal format understood by the pattern implementation, called the CanonicalData Model (CDM) [9], before the messages are transformed to the target format.Hence, if a new application is added to the integration solution, only conversionsbetween the CDM and the application format have to be created. Consequently,for a table-centric re-definition of integration patterns, we define a CDM similarto relational database tables as Datalog programs, which consists of a collectionof facts / a table, optional (supporting) rules as message body and an optionalset of meta-facts that describes the actual data as header. For instance, thedata-part of an incoming message in JSON format is transformed to a collectionof Open-Next-Close (ONC)-style table iterators, each representing a table rowor fact. These ONC-operators are part of the evaluated execution plan for moreefficient evaluation.

Page 6: arXiv:1504.05707v1 [cs.DB] 22 Apr 2015 · 2018-10-03 · Convergent Invoicing generate legally binding payment documents. Alternatively, new application and data analytics stacks

6 Daniel Ritter

3.2 Message Routing Patterns

In this section the message routing pattern implementations are re-defined, whichcan be seen as control and data flow definitions of an integration channel pipeline.For that, they access the message to route it within the integration system andeventually to its receiver(s). They influence the channel and message cardinalityas well as the content of the message.

Content-based Router / Message Filter The most common routing patterns thatdetermine the message’s route based on its body are the Content-based Routerand the Message Filter. The stateless router has a channel cardinality of 1:n,where n is the number of leaving channels, while one channel enters the router,and a message cardinality of 1:1. The entering message constitutes the leavingmessage according to the evaluation of a routing condition. This condition isa function rc, with {bool1, bool2, ..., booln} := rc(msgin, conds), where msgin isthe entering message. The function rc evaluates to a list of Boolean output{bool1, bool2, ..., booln} based on a list of conditions conds of the same arity (e. g.,Datalog rules in Suppl. Material, List. 1.1) for each of the n ∈ N leaving channels.In case several conditions evaluate to true, only the first matching channelreceives the message.

Through the separation of concerns, a system-level routing function providesthe entering message msgin to the content-level implementation (i. e., in CDMrepresentation), which is configured by conds. Since standard Datalog rules aretruth judgements, and hence do not directly produce Boolean values, we decided,for performance and generality considerations, to add an additional functionboolrc to the integration system. The function boolrc converts the output list factof the routing function from a truth judgement to a Boolean by emitting true iffact 6= ∅, and false otherwise. Accordingly we define the TIP routing conditionas fact := rctip(msgin, conds), while being evaluated for each channel condition(e. g., selection / built-in predicates). The integration system will then use thefunction boolrc to convert this into a Boolean value. For the message filter, whichis a special case of the router that differs only from its channel cardinality of 1:1and message cardinality of 1:[0|1], the filter condition is equal to rctip.

Multicast / Recipient List / Join Router The stateless Multicast and RecipientList patterns route multiple messages to several leaving channels, which gives thema message and channel cardinality of 1:n. While the multicast statically routesmessages to the leaving channels (i. e., no re-definition required), the recipient listdetermines the channels dynamically. The receiver determination function rd, with{out1, out2, ..., outn} := rd(msgin, [header.y|body.x])., computes n ∈ N receiverchannel configurations {recv1, recv2, ..., recvn} by extracting their key values ei-ther from an arbitrary message header field header.y or from a message body fieldbody.x. The integration system has to implement a receiver determination func-tion that takes the list of key-strings {recvId1, recvId2, ..., recvIdm} as input, forwhich it looks up receiver configurations recv0, recv1, ..., recvn, where m,n ∈ Nandm > n, and routes copies of the entering message {msg′out,msg′′out, ...,msgn

out}.

Page 7: arXiv:1504.05707v1 [cs.DB] 22 Apr 2015 · 2018-10-03 · Convergent Invoicing generate legally binding payment documents. Alternatively, new application and data analytics stacks

Towards More Data-Aware Application Integration (extended version) 7

In terms of TIP, rdtip is a projection of message body or header values toa unary, output relation. For instance, the receiver configuration keys recvId1 andrecvId2 have to be part of the message body like body(x,′ recvId′1).body(x,′ recvId′2)..Then the rdtip would evaluate a Datalog rule similar to config(y)← body(x, y),while the keys recvId1 and recvId2 correspond to receiver configurations {recv1, recv2}.

Splitter / Aggregator The antipodal Splitter and Aggregator patterns both have achannel cardinality of 1:1 and create new, leaving messages. Thereby the splitterbreaks the entering message into multiple (smaller) messages (i. e., messagecardinality of 1:n) and the aggregator combines multiple entering messagesto one leaving message (i. e., message cardinality of n:1). Hereby, the statelesssplitter uses a split condition sc on the content-level, with {out1, out2, ..., outn} :=sc(msgin, conds), which accesses the entering message’s body to determine a listof distinct body parts {out1, out2, ..., outn}, based on a list of conditions conds,that are each inserted to a list of individual, newly created, leaving messages{msgout1,msgout2, ...,msgoutn} with n ∈ N by a splitter function. The headerand attachments are copied from the entering to each leaving message.

The re-defined split condition sctip evaluates a set of Datalog rules as conds(i. e., mostly selection, and sometimes built-in and join constructs; the latter twoare marked “light blue”). Each part of the body outi with i ∈ N is a set of factsthat is passed to a split function, which wraps each set into a single message.

The stateful aggregator defines a correlation condition, completion condi-tion and an aggregation strategy. The correlation condition crc, with colli :=crc(msgin, conds), determines the aggregate collection colli, to which the mes-sage is stored, based on a list of conditions conds. The completion conditioncpc, with cpout := cpc(msgin, [header.y|body.x]), evaluates to a Boolean out-put cpout based on header or body field information (similar to the messagefilter). If cpout equals true, then the aggregation strategy as, with aggout :=as(msg1in,msg

2in, ...,msg

nin), is called by an implementation of the messaging

system and executed, else the current message is added to the collection colli.The as evaluates the correlated entering messages colli and emits a new messagemsgout. For that, the messaging system has to implement an aggregation functionthat takes aggout (i. e., the output of as) as input.

These functions are re-defined as crctip and cpctip such that the conds areDatalog rules mainly with selection and built-in constructs. The cpctip makes useof the defined boolrc function to map its evaluation result (i. e., list of facts orempty) to the Boolean value cpout. The aggregation strategy as is re-defined asastip, which mainly uses union to combine lists of facts from different messagesto one. The message format remains the same. To transform the aggregates’formats, a message translator is used to keep the patterns modular. However, thecombination of the aggregation strategy with translation capabilities could leadto runtime optimizations.

Page 8: arXiv:1504.05707v1 [cs.DB] 22 Apr 2015 · 2018-10-03 · Convergent Invoicing generate legally binding payment documents. Alternatively, new application and data analytics stacks

8 Daniel Ritter

3.3 Message Transformation Patterns

The transformation patterns exclusively target the content of the messages interms of format conversations and content modifications.

The stateless Message Translator changes the structure or format of theentering message without generating a new one (i. e., channel, message cardinality1:1). For that, the translator computes the transformed structure by evaluatinga mapping program mt (e. g., Datalog rules in Suppl. Material, List. 1.2), withmsgout.body := mt(msgin.body). Thereby the field content can be altered. Therelated Content Filter and Content Enricher patterns can be subsumed bythe general Content Modifier pattern and share the same characteristics as thetranslator pattern. The filter evaluates a filter function mt, which only filtersout parts of the message structure (e. g., fields or values) and the enricher addsnew fields or values as data to the existing content structure using an enricherprogram ep, with msgout.body := ep(msgin.body, data).

The re-definition of the transformation function mttip for the message transla-tor mainly uses join and projection (plus built-in for numerical calculationsand string operations, thus marked “light blue”) and selection, projectionand built-in (mainly numerical expressions and character operations) for thecontent filter. While projections allow for rather static, structural filtering, thebuilt-in and selection operators can be used to filter more dynamically basedon the content. The resulting Datalog programs are passed as msgout.body. Inaddition, the re-defined enricher program eptip mainly uses union operations toadd additional data to the message as Datalog programs.

3.4 Pattern Composition

Since the TIP definitions target the content-level, all patterns can still becomposed to more complex integration programs (i. e., integration scenariosor pipelines). From the many combinations of patterns, we briefly discuss twoimportant structural patterns that are frequently used in integration scenarios:(1) scatter/gather and (2) splitter/gather [9]. The scatter/gather pattern (witha 1:n:1 channel cardinality) is a multicast or recipient list that copies messagesto several, statically or dynamically determined pipeline configurations, whicheach evaluate a sequence of patterns on the messages in parallel. Through anaggregator pattern, the messages are structurally and content-wise joined. Thesplitter/gather pattern (with a 1:n:1 message cardinality) splits one messageinto multiple parts, which can be processed in parallel. In contrast to the scat-ter/gather the pattern sequence is the same for each instance. A subsequentlyconfigured aggregator combines the messages to one.

4 Experimental Evaluation

As System under Test (SuT) for an experimental evaluation we used the opensource, Java-based Apache Camel integration system [3] in version 2.14.0, which

Page 9: arXiv:1504.05707v1 [cs.DB] 22 Apr 2015 · 2018-10-03 · Convergent Invoicing generate legally binding payment documents. Alternatively, new application and data analytics stacks

Towards More Data-Aware Application Integration (extended version) 9

implements most of the EIPs. The Camel system allows content-level extensionsthrough several interfaces, with which the TIP definitions were implemented andembedded (e. g., own Camel Expression definitions for existing patterns, andCamel Processor definitions for custom or non-supported patterns). The Datalogsystem we used for the measurements is a Java-based, standard naıve-recursiveDatalog processor (i. e., without stratification) [16] in version 0.0.6 from [13].

Subsequently, the basic setup and execution of the measurements are in-troduced. However, due to brevity, a more detailed description of the setup isprovided in the Suppl. Material, Sect. A.1, the routing condition and mappingprograms are shown in Sect. A.2, the integration scenarios in Sect. A.3 and themore detailed results in Sect. A.4.

4.1 Setup

In the absence of an EIP benchmark, which we are currently developing on thebasis of this paper, we used Apache JMeter6 in version 2.12 as a load generatorclient that sends messages to the SuT. We implemented a JMeter Sampler, whichallows to inject messages directly to the integration pipeline via a Camel directendpoint / adapter. For the throughput measurements, we used the JMeter jp@gctransaction per second listener plugin from the standard package.

To measure the message throughput in a “data-intensive” (cf. O1 ), non-reliable integration scenario, we use the standard TPC-H order, customer andnation data sets. We added additional, unique message identifier and type fieldsand translate the single records to JSON objects (cf. O3 ), each representing thepayload of a single message (i. e., “single-record” table message). In this way wegenerated 1.5 million order-only messages (i. e., TPC-H scale level 1) and thesame amount of “multi-format” customer / nation messages, consisting of onecustomer and all 25 nation records per message (in the “single-record” tablemessage case). During the measurements these messages are streamed to theCamel endpoint, serialized to either Java Objects for the Camel-Java and tothe ONC representation for the Camel-Datalog case (cf. recall ONC-iterators ascanonical data model).

All measurements are conducted on a HP Z600 work station, equipped withtwo Intel processors clocked at 2.67GHz with a total of 12 cores, 24GB of mainmemory, running a 64-bit Windows 7 SP1 and a JDK version 1.7.0. The JMeterSampler and the integration system pipeline JVM process get 5GB heap space.

4.2 “Single-record” / “Multi-Format” Table Message Processing

Instead of testing all discussed patterns, we focus on the identified table-operations(e. g., selection / built-in, projection, join) and show the respective evalu-ation by example of a representative pattern (cf. Fig. 2). The measurements forselection and projection use the TPC-H Order-based, approximately 4kB mes-sages (i. e., 1.5 million order messages). The union operation (e. g., aggregationstrategy, content enricher) is not tested.

6 Apache JMeter, visited 02/2015: http://jmeter.apache.org/.

Page 10: arXiv:1504.05707v1 [cs.DB] 22 Apr 2015 · 2018-10-03 · Convergent Invoicing generate legally binding payment documents. Alternatively, new application and data analytics stacks

10 Daniel Ritter

We measured the selection / built-in operations in a content-based routerscenario with a routing condition tip rc (cf. List. 1.1), which routes the ordermessage to its receiver based on conds for {string equality, integer less than}on fields {objecttype, ototalprice}. The boolrc function is implemented in Javato pass the expected value to the runtime system on system-level. The corre-sponding “hand-coded” content-level Camel-Java implementation uses JSONpath statements for O(1) element access and conducts the type-specific conditionevaluation. The routing condition is defined to route 904, 500 of the 1.5millionmessages to the first and the rest to the second receiver. Similarly, the projectionoperation is measured using a message translator. The translator projects thefields of the incoming order message to a target format (cf. List. 1.2) using amttip implementation or a “hand-coded” projection on the Java Object represen-tation. Now, the “multi-format” customer messages (cf. O1 ) with nation recordsas processing context are used to measure a routing condition with selection /built-in and join operations (cf. List. 1.3). The customer message is routed, if andonly if, the customers balance (ACCTBAL) is bigger than 3, 000 and the customeris from the European region determined through join via nation key.

Listing 1.1. Routing condition: tip rc

1 cbr−order ( id ,− ,OTOTALPRICE,−):−2 order ( id , otype ,− ,3OTOTALPRICE,−OPRIORITY,−) ,4=(OPRIORITY, ”1−URGENT” )5>(OTOTALPRICE, 1 0 0 0 0 0 . 0 0 ) .

Listing 1.2. Message translation pro-gram: mttip

1 conv−order ( id , otype ,2ORDERKEY,CUSTKEY, SHIPPRIORITY):−3 order ( id , otype ,ORDERKEY,4CUSTKEY,− ,SHIPPRIORITY,−) .

Listing 1.3. Routing condition withjoin over “multi-format” message

1 cbr−cust (CUSTKEY,−):−2 customer ( cid , ctype ,CUSTKEY,− ,3CNATIONKEY,− ,ACCTBAL,−) ,4 nat ion ( nid , ntype ,NATIONKEY,− ,5NREGIONKEY,−) ,6>(ACCTBAL, 3 0 0 0 . 0 ) ,7=(CNATIONKEY,NATIONKEY)8=(NREGIONKEY, 3 ) .

The throughput test streams all 1.5 million order / customer messages tothe pipeline. The performance measurement results are depicted in Table 1 for asingle thread execution. Measurements with multiple threads show a scaling upto factor 10 of the results, with a saturation around 36 threads (i. e., factor ofnumber of cores; not shown). The stream conversion to JSON object aggregatedfor all messages is slightly faster than for ONC. However, in both order messagescases the TIP-based implementation reaches a slightly higher transaction persecond rate (tps), which lets the processing end 7 s and 4 s earlier respectively,due to the natural processing of ONC iterators in the Datalog engine. Althoughthe measured 99% confidence intervals do not overlap, the execution times aresimilar. The rather theoretical case of increasing the number of selection / built-in operations on the order messages (e. g., date before / after, string contains)showed a stronger impact for the Camel-Java case than the Camel-Datalog case

Page 11: arXiv:1504.05707v1 [cs.DB] 22 Apr 2015 · 2018-10-03 · Convergent Invoicing generate legally binding payment documents. Alternatively, new application and data analytics stacks

Towards More Data-Aware Application Integration (extended version) 11

Table 1. Throughput measurements for format conversion, message routing and tran-formation patterns based on 4kB messages generated from 1.5 million standard TPC-Horders records.

Format Content-based Routing Message Transformation

Conv.(msec)

Time(sec)

Mean(tps)

Time(sec)

Mean(tps)(Join)

Time(sec)

Mean(tps)

Camel-Datalog

7,239.60+/-152.69

108 13,761.47+/-340.08

126 11,904.76+/-261.20

103 14,423.08+/-228.74

Camel-Java

6,648.50+/-143.55

115 12,931.03+/-304,90

117 12,633.26+/-176.89

107 13,888.89+/-247.40

Datalog-Bulk(size=10)

7,239.60+/-152.69

12 122,467.58+/-2,532.42

13 116,780.00+/-1714,92

11 133,053.20+/-1,645.39

(not shown). In general, the Camel-Java implementation concludes with a routingdecision as soon as a logical conjunction is found, while the conjunctive Datalogimplementation currently evaluates all conditions before returning. In the contextof integration operations this is not necessary, thus could be improved by adaptingthe Datalog evaluation for that, which we could experimentally show (not shown;out of scope for this paper). The measured throughput of the content-basedrouter with join processing on “multi-format” the 1.5 million TPC-H customer/ nation messages again shows similar results. Only this time, the too simpleNestedLoopJoin implementation in the used Datalog engine causes a loss of 9seconds compared to the “hand-coded” JSON join implementation.

4.3 Outlook: “Multi-record” Table Message Processing

The discussed measurements assume that a message has a “single-record” payload,which results in 1.5 million messages with one record / message identifier each.So far, the JSON to ONC conversion creates ONC collections with only one tableiterator (to conform with EIP semantics). However, the nature of our approachallows us to send ONC collections with several entries (each representing a uniquemessage payload with message identifier). Knowing that this would change thesemantics of several patterns (e. g., the content-based router), we conducted thesame test as before with “multi-record” table messages of bulk size 10, whichreduces the required runtime to 12 s for the router and 11 s for the translator,which can still be used with its original definition (cf. Table 1). Increasing the bulksize to 100 or even 1, 000 reduces the required time to 1 s, which means that all1.5 million messages can be processed with one step in one single thread. Hereby,increasing the bulk size means reducing the number of message collections, whileincreasing the rows in the single collection. The impressive numbers are due to theefficient table-centric Datalog evaluation on fewer, multi-row message collections.The higher throughput comes with the cost of a higher latency. The noticed joinperformance issue can be seen in the Datalog-bulk case as well, which required13 steps / seconds to process the 1.5 million customer / nation messages.

Page 12: arXiv:1504.05707v1 [cs.DB] 22 Apr 2015 · 2018-10-03 · Convergent Invoicing generate legally binding payment documents. Alternatively, new application and data analytics stacks

12 Daniel Ritter

5 Related Work

The application of table-centric operators to current integration systems has notbeen considered before, up to our knowledge, and was only recently introducedby our position paper [12], which discusses the expressiveness of table-centric /logic programming for integration processing on the content level.

The work on Java systems like Telegraph Dataflow [15], and Jaguar [18]) canbe considered related work in the area of programming languages on applicationsystems for faster, data-aware processing. These approaches are mainly targetingto make Java more capable of data-processing, while mainly dealing with thread-ing, garbage collection and memory management. None of them considers thecombination of the host language with table-centric processing.

Declarative XML Processing Related work can be found in the area of declarativeXML message processing (e. g., [4]). Using an XQuery data store for definingpersistent message queues (i. e., conflicting with O3 ), the work targets a comple-mentary subset of our approach (i. e., persistent message queuing).

Data Integration The data integration domain uses integration systems forquerying remote data that is treated as local or “virtual” relations. Starting withSQL-based approaches (e. g., using Garlic [8]), the data integration researchreached relational logic programming, summarized by [6]. In contrast to suchremote queries, we define a table-centric, integration programming approach forapplication integration, while keeping the current semantics (for now).

Data-Intensive and Scientific Workflow Management Based on the data patternsin workflow systems described by Russel et al. [14], modeling and data access ap-proaches have been studied (e. g., by Reimann et al. [10]) in simulation workflows.The basic data management patterns in simulation workflows are ETL operations(e. g., format conversions, filters), a subset of the EIP and can be representedamong others by our approach. The (map/reduce-style) data iteration patterncan be represented by combined EIPs like scatter/gather or splitter/gather.

Similar to our approach, data and control flow have been considered inscientific workflow management systems [17], which run the integration systemoptimally synchronized with the database. However, the work exclusively focuseson the optimization of workflow execution, not integration systems, and does notconsider the usage of table-centric programming on the application server level.

6 Concluding Remarks

This paper motivates a look into a growing “processing discrepancy” (e. g.,message throughput) between current integration and complementary systems(e. g., MQ, RDBMS) based on known scenarios with new requirements and fastgrowing new domains (O1–O3 ). Towards a message throughput improvement,we extended the current integration processing on a content level by table-centric

Page 13: arXiv:1504.05707v1 [cs.DB] 22 Apr 2015 · 2018-10-03 · Convergent Invoicing generate legally binding payment documents. Alternatively, new application and data analytics stacks

Towards More Data-Aware Application Integration (extended version) 13

integration processing (TIP). To remain compliant to the current EIP definitionsthe TIP-operators work on “single-record” messages, which lets us compare withcurrent approaches using a brief experimental throughput evaluation. Althoughthe results slightly improve the standard processing, not to mention the declarativevs. “hand-coded” definition of integration content, the actual potential of ourapproach lies in “multi-record” table message processing. However, that requiresan adaption of some pattern EIP definitions, which is out of scope for this paper.

Open Research Challenges For a more comprehensive, experimental evaluation, anEIP micro-benchmark will be developed on an extension of the TPC-H and TPC-C benchmarks. EIP definitions do not discuss streaming patterns / operators,which could be evaluated (complementarily) based on Datalog streaming theory(e. g., [19,20]). Eventually, the existing EIP definitions have to be adapted tothat and probably new patterns will be established. Notably, the used Datalogengine has to be improved (e. g., join evaluation) and enhanced for integrationprocessing (e. g., for early-match / stop during routing).

Acknowledgements

We especially thank Dr. Fredrik Nordvall Forsberg and Prof. Dr. Erhard Rahmfor proof-reading and valuable discussions on the paper.

References

1. Abiteboul, S., Antoine, E., Miklau, G., Stoyanovich, J., Testard, J.: Rule-basedapplication development using webdamlog. In: ACM SIGMOD (2013)

2. AdroitLogic: Esb performance. http://esbperformance.org/ (2013)3. Anstey, J., Zbarcea, H.: Camel in Action. Manning (2011)4. Bohm, A., Kanne, C., Moerkotte, G.: Demaq: A foundation for declarative XML

message processing. In: CIDR 2007. pp. 33–43 (2007)5. Chappell, D.: Enterprise Service Bus. O’Reilly Media, Inc. (2004)6. Genesereth, M.R.: Data Integration: The Relational Logic Approach. Morgan &

Claypool Publishers (2010)7. Green, T.J., Aref, M., Karvounarakis, G.: Logicblox, platform and language: A

tutorial. In: Datalog. pp. 1–8 (2012)8. Haas, L.M., Kossmann, D., Wimmers, E.L., Yang, J.: Optimizing queries across

diverse data sources. In: VLDB. pp. 276–285 (1997)9. Hohpe, G., Woolf, B.: Enterprise Integration Patterns: Designing, Building, and

Deploying Messaging Solutions. Addison-Wesley Longman (2003)10. Reimann, P., Schwarz, H.: Datenmanagementpatterns in simulationsworkflows. In:

BTW. pp. 279–293 (2013)11. Ritter, D.: What about database-centric enterprise application integration? In:

ZEUS. pp. 73–76 (2014)12. Ritter, D., Bross, J.: Datalogblocks: Relational logic integration patterns. In: DEXA

2014. pp. 318–325 (2014)13. Ritter, D., Westmann, T.: Business network reconstruction using datalog. In:

Datalog in Academia and Industry, Datalog 2.0. pp. 148–152 (2012)

Page 14: arXiv:1504.05707v1 [cs.DB] 22 Apr 2015 · 2018-10-03 · Convergent Invoicing generate legally binding payment documents. Alternatively, new application and data analytics stacks

14 Daniel Ritter

14. Russell, N., ter Hofstede, A., Edmond, D., van der Aalst, W.: Workflow datapatterns: Identification, representation and tool support. In: ER (2005)

15. Shah, M.A., Madden, S., Franklin, M.J., Hellerstein, J.M.: Java support for data-intensive systems: Experiences building the telegraph dataflow system. SIGMODRecord 30(4), 103–114 (2001)

16. Ullman, J.: Principles of Database and Knowledge-Base Systems, Vol. I. ComputerScience Press (1988)

17. Vrhovnik, M., Schwarz, H., Suhre, O., Mitschang, B., Markl, V., Maier, A., Kraft,T.: An approach to optimize data processing in business processes. In: VLDB (2007)

18. Welsh, M., Culler, D.E.: Jaguar: enabling efficient communication and I/O in Java.Concurrency – Practice and Experience 12(7), 519–538 (2000)

19. Zaniolo, C.: A logic-based language for data streams. In: SEBD 2012 (2012)20. Zaniolo, C.: Logical foundations of continuous query languages for data streams.

In: Datalog in Academia and Industry, Datalog 2.0 (2012)

Page 15: arXiv:1504.05707v1 [cs.DB] 22 Apr 2015 · 2018-10-03 · Convergent Invoicing generate legally binding payment documents. Alternatively, new application and data analytics stacks

Towards More Data-Aware Application Integration (extended version) 15

A Supplementary Material

In this section supplementary material is listed and briefly explained. The mate-rial provides further details on the described table-operator content / Datalogrules used for the experimental evaluation and can be seen as examples for theTIP definitions. In addition, the Apache Camel routes and the results of theexperimental evaluation are shared for comparison and traceability.

The supplementary material is provided inline for better readability and canbe externalized to a separate document, if required.

A.1 Details on the Test Setup

During the execution of our micro-benchmark in Java, we used the JMH codetool from OpenJDK 7 to overcome the “profile-guided optimization”.

Data and Message Generation The message generation is a “two-step” approachconsisting of a data generation and a message generation step. The data generationstep uses the standard TPC-H generator on scale level one, from which we tookthe generated order records (i. e., CSV format), converted them to the JSONformat, added additional message identifier and type fields, and stored themin one big JSON array. Each JSON object represents the payload of a singlemessage (i. e., “single-record” message). During the test preparation, Camelmessage payloads are constructed as streams of these JSON objects and sentinto the pipeline. In each individual test, the processors in the pipeline have toconvert the JSON stream using the (Stax-like) Jackson Stream API8 to either aJava or ONC represenation (cf. canonical data model; required an extension ofJackson for ONC), before applying the specfic pattern to test. That means, theconversions will be part of the measurements, however, extracted to a separateperformance figure. We have decided to take the TPC-H data generator due to itsapplicability to our routing and transformation tests and since it is well-knownin our domain.

Message Routing and Transformation The selection / built-in operations aremainly used during message routing (e. g., content-based router, message filter).For the evaluation, we have defined a routing condition (cf. List. 1.1), which routesthe message to its receiver, if and only if, two conditions are fulfilled for routingcondition tip rc: conds = {string equality, integer less than} on fields body.x ={objecttype, ototalprice}. The help rc function is implemented in Java to passthe expected value to the runtime system on system-level. The correspondingpure-Java, content-level implementation uses JSON path statements for O(1)element access and conducts the type-specific condition evaluation “hand-coded”.

7 OpenJDK JMH, visited 02/2015: http://openjdk.java.net/projects/

code-tools/jmh/.8 Jackson Stream API, visited 02/2015: http://wiki.fasterxml.com/

JacksonStreamingApi.

Page 16: arXiv:1504.05707v1 [cs.DB] 22 Apr 2015 · 2018-10-03 · Convergent Invoicing generate legally binding payment documents. Alternatively, new application and data analytics stacks

16 Daniel Ritter

The routing condition is defined to route 904, 567 of the 1.5 million messages.The projection operation is mostly used in message tranformation patterns (e. g.,message translator, content modifier), from which we selected a message translator.The translator projects the fields of the incoming message to a target format (cf.List. 1.2) using our mttip implementation or a “hand-coded” projection on theJava Object representation. One requirement from the mentioned scenarios (O1 )are multi-format, multi-record messages. This type of messages has one “primary”message body and (several) dependent sub-entries usually in a different format.The sub-entries act as “temporary” processing context (e. g., added by a contentenricher), which can be pulled off, before forwarding the message to its receiver.For the evaluation, we use the join together with selection / built-in operations fora content-based router on a TPC-H customer message as JSON object embeddedin a JSON array with all entries of the nation table. The message is routed, ifand only if, the customers actual balance (ACCTBAL) is bigger than 3, 000 andthe customer is from the European region determined through join via nationkey (cf. Suppl. Material in Sect. A.2).

A.2 Datalog Rules

The Datalog rules, used for the table-centric integration processing on the TPC-Horder, customer and nation messages during the experimental analysis are listedsubsequently.

Listing 1.1 denotes the rule for the content-based routing pattern on a TPC-Horder message, which is used to route orders to the first recipient, if and onlyif, the order is urgent (OPRIORITY==’1-URGENT’) and costly (OTOTALPRICE>100, 000.00). In the message translation evaluation, order messages are translatedto a target format with only the primary and foreign key fields and the shipmentpriority according to the rule in List. 1.2. The extended routing condition inList. 1.3 is used to route TPC-H customer record messages with the completelist of TPC-H nation records, if and only if, the actual balance is above average(ACCTBAL> 3, 000) and the customer is from Europe region (NREGIONKEY==3).

A.3 Integration Scenarios: Apache Camel Routes

In this section, the integration scenarios, used for the experimental analysis, aredescribed in more detail.

Listing 1.4 and List. 1.5 denote the Camel routes / message channels for thecontent-based routing cases. The direct-endpoint is used to receive a streamof messages in JSON format, which are then serialized into a Java Object orONC-iterator form, the canonical data model (CDM). On the CDM subsequentoperations can be defined and executed. The Camel choice defines a conditionalrouting based on a Camel expression evaluating the message’s body / data. Theexpression is one of the extension points in Camel to which our TIP operationscan be applied (e. g., rctip). In the standard Camel-Java case a “hand-coded”Java expression, new CbrTpchExp(), is evaluated. If the expressions evaluate tofalse, the messages are processed in the otherwise channel. The same routes

Page 17: arXiv:1504.05707v1 [cs.DB] 22 Apr 2015 · 2018-10-03 · Convergent Invoicing generate legally binding payment documents. Alternatively, new application and data analytics stacks

Towards More Data-Aware Application Integration (extended version) 17

are used for the normal and the “multi-format” router cases, only the expressionsare exchanged.

Another Camel interface that allows the definition of own content-level ex-tensions is the processor. To evaluate the message translation capabilities, weapplied our TIP operators to the Camel processor to access and change the ONCmessages. In the Camel-Java case a “hand-coded” processor is implemented asnew MtTpchProc(). Both processors receive a message and translate its content,before handing it back to the system-level processing.

Listing 1.4. Camel-Datalog route forrouting according to rctip

1 from ( ” d i r e c t : datalog−cbr )2 . unmarshal ( )3 . cho i c e ( ) . when ( )4 . exp r e s s i on ( b o o l r c ( r c t i p (5<conds > ) ) ) . p roce s s (VOID)6 . o the rw i se ( ) . p roce s s (VOID) ;

Listing 1.5. Camel-Java route for rout-ing

1 from ( ” d i r e c t : java−cbr ” )2 . unmarshal ( )3 . cho i c e ( ) . when ( )4 . exp r e s s i on (new CbrTpchExp ( ) )5 . p roce s s (VOID)6 . o therw i se ( ) . p roce s s (VOID) ;

Listing 1.6. Camel-Datalog messagetranslation executing a mttip program

1 from ( ” d i r e c t : datalog−mt” )2 . unmarshal ( )3 . p roce s s ( mt t ip(<program>))4 . p roce s s (VOID) ;

Listing 1.7. Camel-Java messagetranslation

1 from ( ” d i r e c t : java−mt” )2 . unmarshal ( )3 . p roce s s (new MtTpchProc ( ) )4 . p roce s s (VOID) ;

A.4 Experimental Results

In this section the plotted results of the experimental analysis are shown. Since alltests are based on 1.5 million TPC-H order or customer / nation (“multi-format”)messages, the required processing time is represented by the x-axis in steps ofone second each. Hence, the earlier the samples stop, the faster the averagethroughput of the route, which is depicted as y-axis. All measurements wereconducted with a single Camel route thread. The first three plots show the twocontent router and one message translator cases. The fourth one plots the resultsof the three cases with “multi-record” table message processing of bulk sizeof ten. Hence each the depicted discrete throughput measures actually do notrepresent messages per second, but collections of ten messages each per second(i. e., multiplied by factor x10).

The results of the first three experiments show that the TIP approach iscompetitive to “state-of-the-art” message processing in integration systems. Therouting conditions with selection / built operations are better and the nested-loopjoin implementation is slightly worse than the “hand-coded” and optimized Javapendent. Hence, an enhanced join implementation will significantly improve theresults of the Camel-Datalog evaluation in Fig. 3(b).

Page 18: arXiv:1504.05707v1 [cs.DB] 22 Apr 2015 · 2018-10-03 · Convergent Invoicing generate legally binding payment documents. Alternatively, new application and data analytics stacks

18 Daniel Ritter

(a) CBR: Normal (b) CBR: Multi-Msg.

(c) Msg. Translator (d) Table-Processing; x10 msgs/s

Fig. 3. Plotted results of the experimental analysis.