SAP InfiniteInsight® 7.0 Data Manipulation Technical Specifications

Specifications Document Version: 1.0 – 2014-11

CUSTOMER

© 2014 SAP SE or an SAP affiliate company. All rights reserved.

Table of Contents

1 Data Manipulation
1.1 How to Use Data Manipulation
1.2 Architectural Elements of Data Manipulation
1.3 Who Can Use Data Manipulation
1.4 When to Use Data Manipulation
1.5 Why use Data Manipulation

2 Technical Specifications


1 Data Manipulation

SAP InfiniteInsight® offers a module to edit, save, and retrieve data manipulations as described in the document Data Manipulation: Use Case Scenarios. When data stores (directories or ODBC sources) are associated with a repository containing data manipulations, these connectors appear as regular files or tables and can be used directly (like other data) to train or apply models.

One of the useful features of Data Manipulation is the ability to declare arguments. Arguments are symbols with associated values that can be changed before executing the data manipulations. They can be used anywhere within Data Manipulation.
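The idea of arguments can be illustrated with a minimal sketch. It assumes a data manipulation is ultimately a SQL statement with named placeholders; the table, columns, argument names, and the `resolve` helper are all hypothetical and are not part of the product. Plain text substitution is used only to keep the illustration short.

```python
from string import Template

# Hypothetical data manipulation with two declared arguments.
# Table and column names are illustrative only.
MANIPULATION = Template(
    "SELECT customer_id, amount FROM sales "
    "WHERE sale_date >= '$start_date' AND region = '$region'"
)

# Default argument values that a user could change before execution.
DEFAULTS = {"start_date": "2014-01-01", "region": "EMEA"}


def resolve(template, defaults, overrides=None):
    """Return the statement with every argument replaced by its value."""
    values = dict(defaults)
    values.update(overrides or {})
    # substitute() raises KeyError if a declared argument has no value.
    return template.substitute(values)


default_sql = resolve(MANIPULATION, DEFAULTS)
apj_sql = resolve(MANIPULATION, DEFAULTS, {"region": "APJ"})
print(default_sql)
print(apj_sql)
```

The same definition yields a different statement each time an argument value is changed, which is the behavior the paragraph above describes.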

InfiniteInsight® does not provide a special engine to execute these data manipulations, since they can all be performed by the standard SQL engines embedded in all major relational databases. Instead, the Data Manipulation module can be seen as an object-oriented layer that generates data manipulation statements in SQL, which are in turn processed by the database server.
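As a rough illustration of an object-oriented layer that generates SQL, the sketch below describes a manipulation as plain objects and renders it to a SELECT statement. The class names `Column` and `Manipulation`, their fields, and the generated SQL shape are assumptions made for this example; the source does not describe the actual object model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Column:
    """An output column, optionally computed from a SQL expression."""
    name: str
    expression: str = ""

    def to_sql(self) -> str:
        if self.expression:
            return f"{self.expression} AS {self.name}"
        return self.name


@dataclass
class Manipulation:
    """Hypothetical in-memory description of a data manipulation."""
    table: str
    columns: List[Column] = field(default_factory=list)
    filters: List[str] = field(default_factory=list)

    def to_sql(self) -> str:
        # No execution happens here: the object only produces text
        # that a database server would run.
        select = ", ".join(c.to_sql() for c in self.columns) or "*"
        sql = f"SELECT {select} FROM {self.table}"
        if self.filters:
            sql += " WHERE " + " AND ".join(self.filters)
        return sql


manipulation = Manipulation(
    table="customers",
    columns=[Column("id"), Column("age_log", "LOG(age)")],
    filters=["age > 0"],
)
generated_sql = manipulation.to_sql()
print(generated_sql)
```

The point of the design is that the layer holds only a description; all computation is delegated to whichever database receives the generated statement.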


1.1 How to Use Data Manipulation

Data manipulations can be created either from shell scripts or from the InfiniteInsight® graphical interface. Once defined, they appear within the regular parameter trees. Through the graphical interface, users can create data manipulations from scratch, save them, and later restore them for editing.

The InfiniteInsight® graphical interface has also been adapted to prompt for the values of the arguments declared in a data manipulation when a model that uses it is trained or applied.

1.2 Architectural Elements of Data Manipulation

The Java graphical interface saves each data manipulation in the data store it belongs to. In other words, each store is its own repository and holds the data manipulations associated with it.


1.3 Who Can Use Data Manipulation

No prior knowledge of SQL is required to use Data Manipulation; users only need to know how to work with tables and columns accessed through ODBC sources. Users must also have “read” access to these ODBC sources.

To use the InfiniteInsight® modeling assistant, users need write access to the tables KXADMIN and CONNECTORSTABLE, which store the representations of data manipulations.

1.4 When to Use Data Manipulation

In a data mining project, after the definition of the business problem to solve, one of the first tasks is to decompose the business process into tasks that can be associated with data mining functions (such as classification, regression, clustering, segmentation, association rules, attribute importance, or time series forecasting). Then the business objects that are referenced by the data mining tasks have to be mapped to available data. This operation can be cumbersome when the data is spread among several tables in a database, or when the user wants to apply functions to the stored data so as to unveil more information for the modeling techniques to use. This is the time to use Data Manipulation. This phase is thus very early in data mining projects.

1.5 Why use Data Manipulation

Data preparation is a very important phase in data mining projects. We can distinguish between data manipulation, which is business oriented, and data encoding, which is technically oriented. Technically, data manipulation can be done in very different environments:

ETL (Extract Transform Load) Environments

These environments target heavy industrial processes where the amount of data to be manipulated can be fairly large. They usually require specific training on a graphical language used to define mappings between sources and targets. They also involve a wide spectrum of traces and logs to cope with incidents. ETL environments are usually used by DBAs (Database Administrators).

Stream-Based Data Mining Environments

Users with specific training can build streams, which are graphical representations of data flows between “nodes.” Each node represents an atomic data manipulation (such as the filtering of lines, the creation of a new column, or the pivoting of a table of transactions). The operations that can be performed on data encompass both business operations and technical operations. Good examples of such environments are SPSS™ Clementine and SAS™ Enterprise Miner. These environments are usually used by data mining specialists or analysts.

SQL Environments

These environments are usually graphical front-ends on top of SQL. Users with specific training in SQL can build select statements to extract and transform information from a database.


We have always recognized the need for data manipulation and created the InfiniteInsight® Modeler - Data Encoding component specifically to automate technically driven data encoding. (InfiniteInsight® Explorer - Event Logging and InfiniteInsight® Explorer - Sequence Coding followed later, specifically to support aggregation.) We also searched for technology partners that could supply embeddable data manipulation functionality, but found none that could meet our technical and pricing specifications. Consequently, we created Data Manipulation as a layer on top of SQL that does not require prior knowledge of the SQL language. This approach also permits full integration of data manipulation with the export of InfiniteInsight® models to SQL (InfiniteInsight® Scorer module). It is also the most cost-effective way to complement IT spending on database servers with full data mining functionality.

For the graphical interface, we chose to avoid the popular stream-based paradigm. In software packages that use this approach, each transformation appears as an atomic node in a graphical stream. We have seen that, in operational usage, these streams grow beyond readability. Our user interface is thus closer to a “Query Wizard” that displays data manipulation results after each edit. The user interface is also specifically designed to accept as many columns as possible, in order to complement InfiniteInsight®'s unique ability to make use of large numbers of variables without undue loss of robustness.


2 Technical Specifications

This section focuses on the operational constraints associated with the SAP InfiniteInsight® implementation.

As noted above, instead of a transformation engine, we chose to implement an object-oriented layer for data manipulation that generates SQL expressions. InfiniteInsight® relies on the optimization of database SQL interpreters for efficient execution. As a consequence, Data Manipulation only works on top of a database, which is accessed through an ODBC connection with the associated access rights.

We chose not to build a proprietary, autonomous data manipulation environment for two reasons:

Database vendors have devoted great efforts to optimizing their SQL interpreters, which already support the needed functionality.

Even when the data is in flat files or a “foreign” DBMS, most databases already allow viewing such data as if it were held in native tables from a single source. This is called a “proxy table” in Sybase, a “database link” in Oracle, and a “federated database” (provided through DB2 Connect) in IBM DB2. The Open Group also publishes a standard in this area, Distributed Relational Database Architecture (DRDA). Of course, using such functionality to access non-native data can impact overall performance, but we believe that relying on the engineering efforts of database vendors while simplifying the user experience with a data manipulation front-end is the prudent way to go.

The current version of Data Manipulation distinguishes between “decoration” (a left outer join) and the aggregations that can be performed by specific InfiniteInsight® components, namely InfiniteInsight® Explorer - Event Logging and InfiniteInsight® Explorer - Sequence Coding.
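A “decoration” in this sense can be pictured as a left outer join that adds columns to a reference table without dropping any of its rows. The sketch below uses Python's built-in sqlite3 module purely for illustration (the product itself works through ODBC sources); the tables, columns, and data are invented for the example.

```python
import sqlite3

# In-memory SQLite database standing in for an ODBC source (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE regions (customer_id INTEGER, region TEXT);
    INSERT INTO customers VALUES (1, 'Ada');
    INSERT INTO customers VALUES (2, 'Bob');
    INSERT INTO regions VALUES (1, 'EMEA');
""")

# Decorate the reference table (customers) with a column from regions.
# The left outer join keeps every customer, matched or not.
rows = conn.execute("""
    SELECT c.id, c.name, r.region
    FROM customers AS c
    LEFT OUTER JOIN regions AS r ON r.customer_id = c.id
    ORDER BY c.id
""").fetchall()

for row in rows:
    print(row)
conn.close()
```

Customer 2 has no matching region, so it still appears in the result with a missing value in the added column, whereas an inner join would have dropped it.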

The current version implements around 50 functions (log, exp, year, and so on). This number will grow in future releases.
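One way to picture such functions is as a table that maps each function name to the SQL expression it generates. The sketch below is an assumption for illustration only: it covers just the three functions named above, and the SQL spellings (`LN`, `EXP`, `EXTRACT`) are common forms that vary between databases and are not taken from the product.

```python
# Hypothetical mapping from function names to SQL expression templates.
SQL_TEMPLATES = {
    "log": "LN({0})",
    "exp": "EXP({0})",
    "year": "EXTRACT(YEAR FROM {0})",
}


def render(function, column):
    """Return the SQL expression applying `function` to `column`."""
    try:
        template = SQL_TEMPLATES[function]
    except KeyError:
        raise ValueError(f"unsupported function: {function}") from None
    return template.format(column)


print(render("year", "order_date"))
print(render("log", "amount"))
```

Adding a function in a later release would then amount to adding an entry to the mapping, with the database still doing the actual computation.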

In the current version, reloading a data manipulation depends on the ODBC source name. Data manipulations are saved in a repository that can hold data manipulations associated with many ODBC sources. The only link between a data manipulation and its ODBC source is the name of that source, so to reuse or reload a data manipulation, the ODBC source must be accessible from the InfiniteInsight® workstation or server under the same name (this behavior is by design).
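The name-based association can be sketched as a lookup keyed by the ODBC source name. The `Repository` class, its methods, and the exact, case-sensitive name comparison are assumptions for illustration; the source does not describe how the repository is actually implemented.

```python
class Repository:
    """Hypothetical repository of data manipulations keyed by source name."""

    def __init__(self):
        self._by_source = {}

    def save(self, source_name, manipulation_name, definition):
        # The ODBC source name is the only link to the source.
        self._by_source.setdefault(source_name, {})[manipulation_name] = definition

    def load(self, source_name, manipulation_name):
        try:
            return self._by_source[source_name][manipulation_name]
        except KeyError:
            raise LookupError(
                f"no manipulation {manipulation_name!r} for source {source_name!r}"
            ) from None


repo = Repository()
repo.save("SalesDW", "recent_orders", "SELECT * FROM orders")

# Same source name: the manipulation is found.
print(repo.load("SalesDW", "recent_orders"))

# Same database reached under a different source name: lookup fails.
try:
    repo.load("SalesDW_copy", "recent_orders")
except LookupError as error:
    print(error)
```

This is why a workstation or server that reaches the same database under a different ODBC source name cannot reload the manipulation.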

A service for transferring data manipulations from one source to another will be available in the near future.

In the first user interface for Data Manipulation, the InfiniteInsight® modeling assistant does not use a separate repository to store data manipulations but saves them in each store (in other words, each ODBC source contains its own data manipulations). Reloading these manipulations still requires the ODBC source name to match the name saved in the data manipulation representations.


www.sap.com/contactsap

© 2014 SAP SE or an SAP affiliate company. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or for any purpose without the express permission of SAP SE or an SAP affiliate company. The information contained herein may be changed without prior notice.

Some software products marketed by SAP SE and its distributors contain proprietary software components of other software vendors. National product specifications may vary.

These materials are provided by SAP SE or an SAP affiliate company for informational purposes only, without representation or warranty of any kind, and SAP or its affiliated companies shall not be liable for errors or omissions with respect to the materials. The only warranties for SAP or SAP affiliate company products and services are those that are set forth in the express warranty statements accompanying such products and services, if any. Nothing herein should be construed as constituting an additional warranty.

SAP and other SAP products and services mentioned herein as well as their respective logos are trademarks or registered trademarks of SAP SE (or an SAP affiliate company) in Germany and other countries. All other product and service names mentioned are the trademarks of their respective companies.

Please see

for additional trademark information and notices.