Data Ingest Guide


Version 5.3

Copyright Platfora 2016

Last Updated: 1:48 p.m. August 12, 2016

Contents

Document Conventions
Contact Platfora Support
Copyright Notices

Chapter 1: About the Platfora Data Pipeline
    FAQs - Platfora Data Pipeline
    About the Data Workflow

Chapter 2: Manage Data Sources
    Supported Data Sources
    Add a Data Source
        Connect to a Hive Data Source
        Connect to an HDFS Data Source
        Connect to a Google Storage Data Source
        Connect to an S3 Data Source
        Connect to a MapR Data Source
        Connect to Other Data Sources
        About the Uploads Data Source
    Configure Data Source Security
    Delete a Data Source
    Edit a Data Source

Chapter 3: Define Datasets to Describe Data
    FAQs - Dataset Basics
    Understand the Dataset Workspace
        Understand the Dataset Inspector Panel
        Filter Sample Data in a Dataset
    Understand the Dataset Creation Process
    Understand Dataset Permissions
    Select Source Data
        Supported Source File Formats
        Select a Hive or Impala Source Table
        Select DFS Source Files
        Edit the Dataset Source Location
    Parse the Data
        View Raw Source Data Rows
        Parse Delimited Data
        Parse Hive and Impala Tables
        Parse JSON Files
        Parse XML Files
        Parse Avro Files
        Parse Parquet Files
        Parse Web Access Logs
        Parse Other File Types
    Prepare Base Dataset Fields
        Change the Dataset Sample Rows
        Update the Dataset Sample Rows
        Update the Dataset Source Schema
        Confirm Data Types
        Change Field Names
        Sort Dataset Sample Rows
        Add Field Descriptions
        Hide Columns from Data Catalog View
        Default Values and NULL Processing
        Bulk Upload Field Header Information
    Transform Data with Computed Fields
        FAQs - Dataset Computed Fields
        Add a Dataset Computed Field
        Duplicate a Dataset Computed Field
        Add Binned Fields
    Add Measures for Quantitative Analysis
        FAQs - Dataset Measures
        The Default 'Total Records' Measure
        Add Quick Measures
        Add Computed Measures
    Prepare Date/Time Data for Analysis
        FAQs - Date and Timestamp Processing
        Cast DATETIME Data Types
        About Date and Time References
        About the Default 'Date' and 'Time' Datasets
    Prepare Drill Paths for Analysis
        FAQs - Drill Paths
        Add a Drill Path
    Define the Dataset Primary Key
    Model Relationships Between Datasets
        Understand Data Modeling in Platfora
        Add a Reference
        Add an Event Reference
        Add an Elastic Dataset
        Delete or Hide a Reference
        Update a Reference
    Prepare Location Data for Analysis
        FAQs - Location Data and Geographic Analysis
        Understand Geo Location Fields
        Add a Location Field to a Dataset
        Understand Geo References
        Prepare Geo Datasets to Reference
        Add a Geo Reference
    Pre-Process Data with Transformation Datasets
        Combine Data with Union Datasets
        Work with SQL Datasets

Chapter 4: Use the Data Catalog to Find What's Available
    FAQs - Data Catalog Basics
    Find Available Datasets
    Find Available Lenses
    Find Available Segments
    Organize Datasets, Lenses, Segments, and Vizboards with Labels
        FAQs - Labels
        Create a Label
        Apply Labels to an Object
        Remove Labels From an Object
        Delete or Rename a Label
        Search by Label Name

Chapter 5: Define Lenses to Load Data
    FAQs - Lens Basics
    Lens Best Practices
    About the Lens Builder Panel
    Understand the Lens Build Process
        Understand Lens MapReduce Jobs
        Understand Source Data Input to a Lens Build
        Understand How Datasets are Joined
    Create a Lens
        Name a Lens
        Choose the Lens Type
        Choose Lens Fields
        Define Lens Filters
        Allow Ad-Hoc Segments
    Estimate Lens Size
        About Dataset Profiles
        About Lens Size Estimates
    Manage Lenses
        Edit a Lens Definition
        Update Lens Data
        Delete or Unbuild a Lens
        Check the Status of a Lens Build
        Manage Lens Notifications
        Schedule Lens Builds
    Manage Segments - FAQs

Chapter 6: Export Lens Data
    FAQs - Lens Data Export Basics
    Understand Exported Lens Data
    Configure Export Settings for CSV
    Configure Export Settings for Tableau
    Export Lens Data to a Remote System
    Export Lens Data to your Desktop
    Export Segment Data to a Remote System
    Download Segment Data to your Desktop
    Export a Partial Lens as CSV
    Query a Lens Using the REST API

Chapter 7: Platfora Expressions
    Expression Building Blocks
        Functions in an Expression
        Operators in an Expression
        Fields in an Expression
        Literal Values in an Expression
    PARTITION Expressions and Event Series Processing (ESP)
        How Event Series Processing Works
        Best Practices for Event Series Processing (ESP)
    ROLLUP Measures and Window Expressions
        Understand ROLLUP Measures
        Understand ROLLUP Window Expressions
    Computed Field Examples
    Troubleshoot Computed Field Errors
    Write a Lens Query
    FAQs - Expression Basics
    Expression Language Reference
        Expression Quick Dictionary
        Comparison Operators
        Logical Operators
        Arithmetic Operators
        Conditional and NULL Processing
        Event Series Processing
        String Functions
        URL Functions
        IP Address Functions
        Date and Time Functions
        Math Functions
        Data Type Conversion Functions
        Aggregate Functions
        ROLLUP and Window Functions
        User Defined Functions (UDFs)
        Regular Expression Reference

Appendix A: Expression Language Reference
    Expression Quick Dictionary
    Comparison Operators
    Logical Operators
    Arithmetic Operators
    Conditional and NULL Processing
        CASE, COALESCE, IS_VALID
    Event Series Processing
        PARTITION, PACK_VALUES
    String Functions
        CONCAT, ARRAY_CONTAINS, FILE_NAME, FILE_PATH, EXTRACT_COOKIE,
        EXTRACT_VALUE, INSTR, JAVA_STRING, JOIN_STRINGS, JSON_ARRAY,
        JSON_ARRAY_CONTAINS, JSON_DOUBLE, JSON_FIXED, JSON_INTEGER,
        JSON_LONG, JSON_OBJECT, JSON_STRING, LENGTH, REGEX, REGEX_REPLACE,
        SPLIT, SUBSTRING, TO_LOWER, TO_UPPER, TRIM, XPATH_STRING,
        XPATH_STRINGS, XPATH_XML
    URL Functions
        URL_AUTHORITY, URL_FRAGMENT, URL_HOST, URL_PATH, URL_PORT,
        URL_PROTOCOL, URL_QUERY, URLDECODE
    IP Address Functions
        CIDR_MATCH, HEX_TO_IP
    Date and Time Functions
        DAYS_BETWEEN, DATE_ADD, HOURS_BETWEEN, EXTRACT, MILLISECONDS_BETWEEN,
        MINUTES_BETWEEN, NOW, SECONDS_BETWEEN, TRUNC, YEAR_DIFF
    Math Functions
        DIV, EXP, FLOOR, HASH, LN, MOD, POW, ROUND
    Data Type Conversion Functions
        EPOCH_MS_TO_DATE, TO_CURRENCY, TO_DATE, TO_DOUBLE, TO_FIXED, TO_INT,
        TO_LONG, TO_STRING
    Aggregate Functions
        AVG, COUNT, COUNT_VALID, DISTINCT, MAX, MIN, SUM, STDDEV, VARIANCE
    ROLLUP and Window Functions
        ROLLUP, DENSE_RANK, NTILE, RANK, ROW_NUMBER
    User Defined Functions (UDFs)
        Writing a Platfora UDF Java Program
        Adding a UDF to the Platfora Expression Builder
    Regular Expression Reference
        Regex Literal and Special Characters
        Regex Character Classes
        Regex Line and Word Boundaries
        Regex Quantifiers
        Regex Capturing Groups

Appendix B: Lens Query Language Reference
    SELECT Statement
        DEFINE Clause
        WHERE Clause
        GROUP BY Clause
        HAVING Clause
        Example of Lens Queries


Preface

This guide provides information and instructions for ingesting and loading data into a Platfora® cluster. This guide is intended for data administrators who are responsible for making Hadoop data accessible to business users and data analysts. Knowledge of Hadoop, data processing, and data storage is recommended.

Document Conventions

This documentation uses certain text conventions for language syntax and code examples.

Convention       Usage                                        Example

$                Command-line prompt; precedes a command      $ ls
                 to be entered in a command-line terminal
                 session.

$ sudo           Command-line prompt for a command that       $ sudo yum install open-jdk-1.7
                 requires root permissions (commands will
                 be prefixed with sudo).

UPPERCASE        Function names and keywords are shown in     SUM(page_views)
                 all uppercase for readability, but
                 keywords are case-insensitive (can be
                 written in upper or lower case).

italics          Italics indicate a user-supplied argument    SUM(field_name)
                 or variable.

[ ] (square      Square brackets denote optional syntax       CONCAT(string_expression[,...])
brackets)        items.

... (ellipsis)   An ellipsis denotes a syntax item that       CONCAT(string_expression[,...])
                 can be repeated any number of times.


Contact Platfora Support

For technical support, you can send an email to:

[email protected]

Or visit the Platfora support site for the most up-to-date product news, knowledge base articles, and product tips.

http://support.platfora.com

To access the support portal, you must have a valid support agreement with Platfora. Please contact your Platfora sales representative for details about obtaining a valid support agreement or with questions about your account.

Copyright Notices

Copyright © 2012-16 Platfora Corporation. All rights reserved.

Platfora believes the information in this publication is accurate as of its publication date. The information is subject to change without notice.

THE INFORMATION IN THIS PUBLICATION IS PROVIDED “AS IS.” PLATFORA CORPORATION MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Use, copying, and distribution of any Platfora software described in this publication requires an applicable software license. Platfora®, You Should Know™, Interest Driven Pipeline™, Fractal Cache™, and Adaptive Job Synthesis™ are trademarks of the Platfora Corporation. Apache Hadoop™ and Apache Hive™ are trademarks of the Apache Software Foundation. All other trademarks used herein are the property of their respective owners.

Embedded Software Copyrights and License Agreements

Platfora contains the following open source and third-party proprietary software subject to their respective copyrights and license agreements:

• Apache Hive PDK

• dom4j

• freemarker

• GeoNames

• Google Maps API

• Apache Jandex


Chapter 1: About the Platfora Data Pipeline

Got questions about how Platfora enables self-service access to raw data in Hadoop? Want to know what happens to the data on the way to those stunning, interactive visualizations? This section explains how data flows from Hadoop to Platfora, and what happens to the data at each step in the workflow.

Topics:

• FAQs - Platfora Data Pipeline

• About the Data Workflow

FAQs - Platfora Data Pipeline

This section answers the most frequently asked questions (FAQs) about Platfora's Interest Driven Pipeline™ and the data workflow.

What does 'Interest Driven Pipeline' mean?

The traditional data pipeline is mainly an 'operations driven pipeline' -- IT pushes the data to the consumers, rather than the consumers pulling the data that interests them. In a traditional data pipeline, data is pre-processed, modeled into a relational schema, and loaded into a data warehouse. Then it is optimized (or pre-aggregated) to make it possible for BI and reporting tools to access it. All of this work to move and prepare the data happens regardless of the immediate needs of the business users.

The idea behind an 'interest driven pipeline' is to not move or pre-process the data until somebody wants it. Platfora's approach is to catalog all of the data that's available, then allow business users to discover and request data of interest to them. Once a request is made, Platfora pulls the data from Hadoop, cleanses and processes it, and optimizes it for analysis. Having the entire data pipeline managed in a single application allows for more agile data projects.

How does Platfora access data in Hadoop?

When you first install Platfora, you provide connection information to your Hadoop cluster services. Then you define a Data Source in the Platfora application to point to a particular location in the Hadoop file system. Once a data source is registered in Platfora, the data files in that location are then visible to Platfora users.
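
For example, a data source might point to a directory of web log files in HDFS (the host name and path here are illustrative):

hdfs://namenode.example.com:8020/data/weblogs/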


Can I control who sees what data?

Yes. Platfora provides role-based security so you can control who can see data coming from a particular data source. You can also control access at a more granular per-dataset level if necessary. You can either control data access within the Platfora application, or configure Platfora to inherit the file system permissions from HDFS.

Does all the source data have to be in Hadoop?

Yes (for the most part). Platfora primarily works with data stored in a single distributed file system -- typically HDFS for on-premise Hadoop deployments, or Amazon S3 or Google Storage for cloud deployments.

However, it is also possible to develop custom data connectors to access smaller datasets outside of Hadoop. For example, you may have customer data in a relational database that you want to use in conjunction with log data stored in HDFS. Data connectors can be used to pull relatively small amounts of external data over to Hadoop on demand.

How does Platfora know the structure of the data and how to process it?

You have to tell Platfora how your data is structured by defining a Dataset. A dataset points to a set of files in a data source and describes the structure of the data, as well as any processing logic needed to prepare the data for consumption. A dataset is just a metadata description of the data -- it contains all of the data about the data -- plus a small sampling of raw rows to facilitate data discovery.

How does Platfora handle messy, complicated data?

Platfora's dataset workspace has a number of tools to help you cleanse and transform data into a structured format. There are a number of built-in data parsers for common file formats, such as Delimited Text, CSV, JSON, XML or Avro. For unstructured or semi-structured data, Platfora has an extensive library of built-in functions that you can use to define data processing tasks.
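
For example, a computed field expression can clean up a raw string field using the built-in string functions (the field name user_email is illustrative):

TRIM(TO_LOWER(user_email))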

When Platfora processes the data during a lens build, it also logs any problem rows that it could not process according to the logic defined in the dataset. These 'dirty rows' are shown as lens build warnings. Platfora administrators can investigate these warnings to determine the extent of the problem.

How does Platfora deal with multi-structured data?

Defining a dataset in Platfora overlays structure on the data as a light-weight metadata layer. The actual data remains in Hadoop in its raw form until it is requested by a Platfora user. This allows datasets with very different characteristics to exist together in Platfora, described in the unifying language of the Platfora dataset.

If two datasets have fields that can be used to join them together, then the logic of that join can also be described in the dataset as a Reference. Modeling references between datasets within Platfora allows you to quickly combine multi-structured data without having to move or pre-process the data up front.


How do I find available data?

Every dataset that is defined in Platfora is added to the Data Catalog. You can search or browse the data catalog to find datasets of interest. The Platfora data catalog is the one place where you capture all of your organizational knowledge about your data. It is where non-technical users can discover and request the data they need.

How do I request data?

Once you find the dataset you want in the data catalog, you create a Lens in Platfora to request data from that dataset. A lens is a selection of fields from the focal point of a single dataset. A dataset points to data in Hadoop.

How does data get from Hadoop into Platfora?

Users bring data into Platfora by kicking off a Lens Build. A lens build runs a series of processing jobs in Hadoop to pull, process, and optimize the requested data. The output of these jobs is the Lens. Once the lens build jobs have completed successfully in the Hadoop cluster, the prepared lens data is then copied over to the Platfora nodes. At this point the data is in Platfora and available for analysis.

Where does the prepared data (the lenses) reside?

Lens data is distributed across the Platfora worker nodes. This allows Platfora to use the resources of multiple servers to process lens queries in parallel, and scale up as your data grows. Lenses are stored on disk on the Platfora nodes, and are also loaded into memory whenever Platfora users interact with them. Having the lenses in memory makes the queries run faster.

A copy of each lens is also stored in the primary Hadoop file system as a backup.

How do I explore the data in a lens?

Once a lens is built, the data is in Platfora and ready to explore in a Vizboard. The main way to interact with the data in a lens is to create a Visualization (or Viz for short). A viz is just a lens query that is represented visually as a chart, graph, or table.

Is building a viz the only way to look at the data in a lens?

No, but we think it is the best way! Platfora also has a REST API that you can use to programmatically query a lens, or you can export lens data in CSV format for use in other applications or data workflows.
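
As a purely illustrative sketch, a REST query for lens data might look like the following (the host name, endpoint path, and parameters are assumptions, not Platfora's documented interface -- see Query a Lens Using the REST API for the actual syntax):

$ curl -u analyst:password \
    "https://platfora.example.com/api/v1/query?lens=weblogs&format=csv"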

How is Platfora different from Hadoop tools like Hive?

First of all, users do not need to have any special technical knowledge to build a lens. Platfora enables all levels of users to request data from Hadoop -- no programming or SQL skills required.

Secondly, with a query tool like Hive, each query is its own MapReduce job in Hadoop. You have to wait for each query to run in order to see the results. If you want to change the query, you have to rewrite it and run it again (and wait). It is not very responsive or interactive.


A lens is more like an on-demand data mart than a single query result. It contains optimized data that is loaded into memory so the query experience is fast and interactive. The data contained in a lens can support many combinations of queries, and the results are rendered visually so that insights are easier to find.

What if a lens doesn't have the data I need?

If a lens doesn't quite meet your data requirements, there are a few things you can do:

• You can edit an existing lens definition to add additional fields or expand the scope of the data requested.

• You can add computed fields directly in the vizboard to further manipulate the data you already have.

• You can go back to the data catalog and create an entirely new lens. You can even upload new data from your desktop and combine it with datasets already in Platfora.

How can I know if the data is correct?

One of the advantages to having the entire data pipeline in one application is complete visibility at each stage of the workflow. Platfora allows you to see the data lineage of every field in a lens, all the way back to the source file that the data originated from.

How do I share my insights with others?

Platfora's vizboards were purpose-built for sharing and collaboration. You can invite others to join you in a vizboard, and use comment threads to collaborate. You can prepare view-only dashboards and email them to your colleagues. You can also export data and images from the vizboard for use in other business applications, like R, Excel, or PowerPoint.


About the Data Workflow

What are the steps involved in going from raw data in Hadoop to visualizations in Platfora? What skills do you need to perform each step? This section explains each stage of the data workflow from data ingest, to analysis, to collaboration.

Step 1: Define Data Sources to Connect to Raw Data in Hadoop

The first step in the data pipeline is to make the raw data accessible to Platfora. This is done by defining a Data Source. A data source uses a data connector to point to some location in the Hadoop file system or other external data server. Platfora has out-of-the-box data connectors for:

• HDFS

• MapR FS

• Google Storage

• Amazon S3

• Hive Metastore

Platfora also provides APIs for defining your own custom data connectors.

Who does this step? System Administrators (someone who knows where the raw data resides and how to provide access to it). System administrators also define the security permissions for the data sources. Platfora users can only interact with data that they are authorized to see.

Step 2: Create Datasets to Describe the Structure of the Data

After you have connected to a data source, the next step is to describe and model the data by creating Datasets in Platfora. A dataset is a pointer to a collection of raw data files along with a metadata description of how those files are structured. Platfora provides a number of built-in file parsers for common file formats, such as:


• Delimited Text

• Comma-Separated Values (CSV)

• JSON

• XML

• Avro

• Web Access Logs

• Hive Table Definitions

In addition to describing the structure of the data, the datasets also contain information on how to process the data, plus how to join different datasets together. If you are familiar with ETL workflows (extract, transform, and load), the dataset encompasses the extract and transform logic.

Who does this step? Data Administrators (someone who understands the data and how to make the data ready for consumption).

Step 3: Build a Lens to Pull Data from Hadoop into Platfora

All datasets that have been defined in Platfora are available in Platfora's Data Catalog. The data catalog is where Platfora users can see what data is available, and make requests for the data they need. The way you request data is by choosing a dataset, then building a Lens from that dataset. A lens can be thought of as an on-demand data mart, a summary table, or a materialized view.

A Lens Build automates a number of Hadoop processing tasks -- it submits a series of MapReduce jobs to Hadoop, collects the results, and brings the results back into Platfora. The data populated to a lens is pre-aggregated, compressed, and columnar. From the perspective of an ETL workflow, the lens build is the load part of the process.

Who does this step? Data Analysts or Data Administrators (someone who understands the business need for the data or has an analysis use case they want to achieve). Lenses provide self-service access to the data in Hadoop -- users do not need any specialized technical skills to build a lens. Data administrators may want to set up a schedule of production lenses that are built on a regular basis. However, data analysts can also build their own lenses as needed.

Step 4: Create Vizboards to Analyze and Visualize the Data

Once a lens is built, the data is available in Platfora for analysis. Platfora users create Vizboards to manage their data analysis projects. The vizboard can be thought of as a project workspace where you can explore the data in a lens by creating visualizations.

A Visualization (or Viz for short) is the result of a lens query, but the data is represented in a visual way. Visualizations can take various forms such as charts, graphs, maps, or cross-tabs. As users build vizzes using the data in a lens, the data is loaded into memory so the experience is fast and interactive.

Within a vizboard, analysts can build dashboards (or pages) of visualizations that reveal particular business insights or tell a data story. For example, a vizboard may show two or three charts that support a future business direction or confirm the results of a past business campaign or decision.


Who does this step? Data Analysts (anyone who has access to the data and has a question or hunch they want to investigate).

Step 5: Share Your Insights with Others

The Platfora Vizboard is a place where you can collaborate with your fellow analysts or share prepared insights with business users. You can invite other Platfora users to view and comment on your vizboards, or you can export images from a vizboard to send to others via email or PDF. You can also export query results (the viz data) for use in other applications, such as Excel or R.

Who does this step? Data Analysts (anyone who has an insight they want to share).


Chapter 2: Manage Data Sources

The first step in making Hadoop data available in Platfora is identifying what source data you want to expose to your business users, and making sure the data is in a format that Platfora can work with. Although the source data may be coming into Hadoop from a variety of source systems, and in a variety of different file formats, Platfora needs to be able to parse it into rows and columns in order to create a dataset in Platfora. Platfora supports a number of data sources and source file formats.

Topics:

• Supported Data Sources

• Add a Data Source

• Configure Data Source Security

• Delete a Data Source

• Edit a Data Source

Only System Administrators can manage data sources.

Supported Data Sources

Hadoop supports many different distributed file systems, of which HDFS is the primary implementation. Platfora provides data adapters for a subset of the file systems that Hadoop supports. Hadoop also has various database and data warehouse implementations, some of which can be used as data sources for Platfora. This section describes the data sources supported by Platfora.

Source          Description

Hive            Platfora can use a Hive metastore server as a data source, and
                map a Hive table definition to a Platfora dataset definition.
                Platfora uses the Hive table definition to obtain metadata
                about the source data, such as which files to process, the
                parsing logic for rows and columns, and the field names and
                data types contained in the source data.
                It is important to note that Platfora does not execute queries
                through Hive; it only uses Hive tables to obtain the metadata
                needed for defining datasets. Platfora generates and runs its
                own MapReduce jobs directly in Hadoop.
                You can only create datasets based on Hive tables, not views.

HDFS            Hadoop Distributed File System (HDFS) is the primary storage
                system for Hadoop. Platfora can be configured to connect to
                the HDFS NameNode server and use the HDFS file system as its
                primary data source.

Google          Google Storage is a distributed file system in the Google
Storage         Cloud Platform where you pay a monthly fee for storage space
                and data transfer bandwidth. It can be used as a data source
                for users who run their Hadoop clusters on Google Compute
                Engine machines or who utilize the Google Dataproc service.

Amazon S3       Amazon Simple Storage Service (Amazon S3) is a distributed
                file system hosted by Amazon where you pay a monthly fee for
                storage space and data transfer bandwidth. It can be used as
                a data source for users who run their Hadoop clusters on
                Amazon EC2 or who utilize the Amazon EMR service.
                Hadoop supports two S3 file systems as an alternative to
                HDFS: S3 Native File System (s3n) and S3 Block File System
                (s3). Platfora supports the S3 Native File System (s3n) only.

MapR FS         MapR FS is the proprietary Hadoop distributed file system of
                MapR. Platfora can be configured to connect to a MapR
                Container Location Database (CLDB) server and use the MapR
                file system as its primary data source.

Uploaded Files  Platfora allows you to upload files from your local file
                system into Platfora. These files are added to a special
                Uploads data source, which resides in the distributed file
                system (DFS) that the Platfora server is configured to use
                when it first starts up.

Custom Data     Platfora provides Java APIs that allow developers to create
Connector       custom data connector plugins. For example, you can create a
Plugins         plugin that connects to a relational database such as MySQL
                or PostgreSQL. Datasets created from a custom data source
                should be relatively small (less than 100,000 rows). External
                data is pulled over to Hadoop at lens build time via the
                Platfora master (which is not a parallel operation).

Add a Data Source

A data source is a connection to a mount point or directory on an external data server, such as a file system or database server. Platfora currently provides data source adapters for Hive, HDFS, Google Storage, Amazon S3, and MapR FS.


1. Go to the Data Catalog > Datasets page.

2. Click Add Dataset to open the dataset workspace.

3. Click Data Source Based Dataset.

4. Click Add New Source.

5. Enter the connection information for the data source server. The required connection information depends on the Source Type you choose.

6. Click Save.

7. Click Cancel to exit the dataset workspace.

Connect to a Hive Data Source

When Platfora uses Hive as a data source, it connects to the Hive metastore to query information about the source data. There are multiple ways to configure the Hive metastore service in your Hadoop environment. If you are using the Hive Thrift Metastore (known as the remote metastore client configuration), you can add a Hive data source directly in the Platfora application. If you connect directly to the Hive metastore relational database management system (RDBMS) (known as a local metastore client configuration), this requires additional configuration on the Platfora master server. You cannot define this type of Hive data source in the Platfora application.

See the Hive wiki for more information about the different Hive metastore client configurations.

Connect to a Hive Thrift Metastore

By default, the Platfora application allows you to connect to the Hive Thrift Metastore service. To use the Thrift server as a data source for Platfora, you must start the Hive Thrift Metastore server in your Hadoop environment and know the URI to connect to this server.

In a remote Hive metastore setup, Hive clients (such as Platfora) make a connection to the Hive Thrift Metastore server, which then queries the metastore database (typically a MySQL database) for the Hive metadata. The client and metastore server communicate using the Thrift protocol.

You can add a Hive Thrift Metastore data source in the Platfora application. You will need to supply the URI to the Hive Thrift metastore service in the format of:

thrift://hive_host:thrift_port


Where hive_host is the DNS host name or IP address of the Hive server, and thrift_port is the port that the Hive Thrift metastore service is listening on. For Cloudera, Hortonworks, and MapR installations, the default Thrift port is 9083.
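
For example, a complete connection URI might look like this (the host name is illustrative):

thrift://hive-metastore.example.com:9083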

Optionally, you can specify the Hive database name when you add a Hive data source. When a database name is specified, Platfora only shows that database and the tables it contains in the Hive data source. When no database is specified, all Hive databases are available in this data source. You might want to create a Hive data source for one database to control which databases the Platfora users can see when they create datasets from a Hive data source.

If the connection to Hive is successful, you will see a list of available Hive databases in that data source. Click on a database name to show the Hive tables within that database. The default database in Hive is named default. If you have not created your own databases in Hive, this is where all of your tables will reside.


If you are using Hive views, they will also be listed. However, Hive views are disabled for use as the basis of a Platfora dataset. You can only create a dataset from Hive tables.

If you have trouble connecting to Hive, make sure that the Hive Thrift metastore server process is running, and that the Platfora server machine has access over the network to the designated Hive Thrift server port. Also, make sure that the system user that the Platfora server runs as has read permissions to the underlying data files in HDFS.

The Hive Thrift metastore is an optional service that is not usually started by default when you install Hive, so it is possible that the service is not running. To check whether Platfora can connect to the Hive Thrift Metastore, run the following command from the Platfora master server:

$ hive --hiveconf hive.metastore.uris="thrift://your_hive_server:9083" --hiveconf hive.metastore.local=false

Make sure the Hive server host name or IP address and Thrift port are correct for your Hadoop installation. For Cloudera, Hortonworks, and MapR installations, the default Thrift port is 9083.

If the Platfora server can connect, you should see the Hive console command prompt and be able to query the Hive metastore. For example:

hive> SHOW DATABASES;
hive> exit;

If you cannot connect, it is possible that your Hive Thrift Metastore service is not running. Depending on the Hadoop distribution you are using and the version of Hive server you are running, there are different ways to start the Hive Thrift metastore. For example, run the following command on the server where Hive is installed:

$ sudo hive --service metastore


or

$ sudo hive --service hiveserver2

Check that your Hive server is started, and view your Hive server logs for any issues with starting the metastore.

Connect to a Hive RDBMS Metastore

If you are not using the Hive Thrift Metastore server in your Hadoop environment, you can configure Platfora to connect directly to a Hive metastore relational database management system (RDBMS), such as MySQL. This requires additional configuration on the Platfora master server that must be done before you can create the data source in the Platfora application.

The Platfora master server needs a hive-site.xml file with the correct RDBMS connection information. You also need to install the appropriate JDBC driver on the Platfora master server, and make sure that Platfora can find the Java libraries and class files for the JDBC driver.

Here is an example hive-site.xml to connect to a MySQL metastore. A hive-site.xml containing these properties must reside in the local Hadoop configuration directory of the Platfora master server.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://hive_hostname:port/metastore</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>database_username</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>database_password</value>
  </property>
  <property>
    <name>hive.metastore.client.socket.timeout</name>
    <value>120</value>
  </property>
  <property>
    <name>hive.metastore.batch.retrieve.max</name>
    <value>100</value>
  </property>
</configuration>

The Platfora server would also need the MySQL JDBC driver installed in order to use this configuration. You can place the JDBC driver .jar files in $PLATFORA_DATA_DIR/extlib to install them (requires a Platfora restart).
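For example, assuming a hypothetical MySQL driver archive named mysql-connector-java-5.1.38-bin.jar in the current directory, the installation would be a simple copy, followed by a Platfora restart:

$ cp mysql-connector-java-5.1.38-bin.jar $PLATFORA_DATA_DIR/extlib/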

You can add a Hive RDBMS Metastore data source in the Platfora application after you have done the appropriate configuration on the Platfora master server. When you leave the Thrift Metastore URI blank, the Platfora application will look for the metastore connection information in the hive-site.xml file on the Platfora master server.

Optionally, you can specify the Hive database name when you add a Hive data source. When a database name is specified, Platfora only shows that database and the tables it contains in the Hive data source. When no database is specified, all Hive databases are available in this data source. You might want to create a Hive data source for one database to control which databases the Platfora users can see when they create datasets from a Hive data source.

If the connection to Hive is successful, you will see a list of available Hive databases in that data source. Click on a database name to show the Hive tables within that database. The default database in Hive is named default. If you have not created your own databases in Hive, this is where all of your tables will reside.


If you are using Hive views, they will also be listed. However, Hive views cannot be used as the basis of a Platfora dataset. You can only create a dataset from Hive tables.

If you have trouble connecting to the Hive RDBMS metastore, make sure that:

• The Hive RDBMS metastore server process is running (i.e. the MySQL database server is running).

• The Platfora server machine has access over the network to the designated database server host and port.

• The system user that the Platfora server runs as has database permissions granted on the appropriate database objects in the RDBMS. For example, if using a MySQL metastore you could run a command such as the following in MySQL:

GRANT ALL ON *.* TO 'platfora'@'%';

• The system user that the Platfora server runs as has read permissions to the underlying data files in HDFS.

Connect to an HDFS Data Source

Creating an HDFS data source involves specifying the connection information to the HDFS NameNode server. Once you have successfully connected, you will be able to browse the files and directories in HDFS, and choose the files that you want to add to Platfora as datasets.


When you add a new data source that connects to an HDFS NameNode server, you will need to supply the following connection information:

Source Type: HDFS

Name: A name for the data source location. This can be any name you choose, such as HDFS User Data or HDFS Root Directory.

Host: The external DNS hostname or IP address of the HDFS NameNode server.

Port: The port that the HDFS NameNode server listens on for connections. For Cloudera installations, the default port is 8020. For Apache installations, the default port is 9000.

Root Path: The HDFS directory that Platfora should access. For example, to access the entire HDFS file system, use / (root directory). To access a particular directory only, enter the qualified path (for example, /user/data or /data/weblogs).

If the connection to HDFS is successful, you will see a list of the files and directories that reside in the specified location of the HDFS file system when defining a dataset from the data source.

If you have trouble connecting to HDFS, make sure that the HDFS NameNode server process is running, and that the Platfora server machine has access over the network to the designated NameNode port. Also, make sure that the system user that the Platfora server runs as has read permissions to the HDFS directory location you specified.
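One way to sanity-check network access and read permissions is to list the data source's root path from the Platfora master server using the Hadoop client (the NameNode host and path below are hypothetical):

$ hadoop fs -ls hdfs://namenode.example.com:8020/data/weblogs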


Connect to a Google Storage Data Source

Google Storage is a distributed file system hosted on Google Cloud Platform (GCP). Google Storage is an attractive choice for users who run Hadoop clusters on Google Compute Engine machines or utilize the Google Dataproc service.

If you are not using Google Compute Engine or Dataproc as your primary Hadoop implementation for Platfora, you can still use Google Storage as a data source, but keep in mind that the source data will be copied to the Platfora primary Hadoop implementation during the lens build process. If you are transferring a lot of data between Google Storage and another network outside of GCP, it could be slow.

When you add a new data source that connects to a Google Storage data source, you will need to supply the following connection information:

Source Type: Google Storage

Name: A name for the data source location. This can be any name you choose, such as Google Storage Sample Data or Marketing Bucket on Google Storage.

Bucket Name: A bucket is a named container for objects stored in Google Storage.

Path: The directory in the specified bucket that Platfora should access. For example, to access the entire bucket, use / (root directory). To access a particular directory only, enter the qualified path (for example, /user/data or /data/weblogs).

If the connection to Google Storage is successful, you will see a list of the files and directories that reside in the specified location of the Google Storage file system when defining a dataset from the data source.

If you have trouble connecting to Google Storage, make sure that the Platfora server machine has access over the network to Google Cloud Platform, and that your Google Storage connection information is specified in the core-site.xml configuration file of the Platfora master server. If you are using Google Dataproc as the Hadoop implementation for Platfora, your Platfora administrator should have configured the connection information during installation.
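As a rough sketch only (the exact property names depend on the version of the GCS connector installed in your Hadoop environment), the Google Storage entries in core-site.xml typically name the connector's file system class and your GCP project:

<property>
  <name>fs.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
</property>
<property>
  <name>fs.gs.project.id</name>
  <value>your-gcp-project-id</value>
</property>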

Connect to an S3 Data Source

Amazon Simple Storage Service (Amazon S3) is a distributed file system hosted on Amazon Web Services (AWS). Data transfer is free between S3 and Amazon cloud servers, making S3 an attractive choice for users who run their Hadoop clusters on EC2 or utilize the Amazon EMR service.

If you are not using Amazon EC2 or EMR as your primary Hadoop implementation for Platfora, you can still use S3 as a data source, but keep in mind that the source data will be copied to the Platfora primary Hadoop implementation during the lens build process. If you are transferring a lot of data between S3 and another network outside of Amazon, it could be slow.


Hadoop supports two S3 file systems as an alternative to HDFS: the S3 Native File System (s3n) and the S3 Block File System (s3). Platfora supports the S3 Native File System (s3n) only.

When you add a new data source that connects to an S3 data source, you will need to supply the following connection information:

Source Type: Amazon S3

Name: A name for the data source location. This can be any name you choose, such as S3 Sample Data or Marketing Bucket on S3.

Bucket Name: A bucket is a named container for objects stored in Amazon S3. If you go to your AWS Management Console S3 Home Page, you can see the list of buckets you have created for your account.

Path: The directory in the specified bucket that Platfora should access. For example, to access the entire bucket, use / (root directory). To access a particular directory only, enter the qualified path (for example, /user/data or /data/weblogs).

If the connection to Amazon S3 is successful, you will see a list of the files and directories that reside in the specified location of the S3 file system when defining a dataset from the data source.

If you have trouble connecting to Amazon S3, make sure that the Platfora server machine has access over the network to Amazon Web Services, and that your S3 connection information and AWS security credentials are specified in the core-site.xml configuration file of the Platfora master server. If you are using Amazon EMR as the default Hadoop implementation for Platfora, your Platfora administrator should have configured the S3 connection information and AWS security credentials during installation.
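For the s3n file system, the AWS credentials in core-site.xml are typically supplied with the standard Hadoop property names shown below; the key values are placeholders for your own credentials:

<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR_AWS_ACCESS_KEY_ID</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR_AWS_SECRET_ACCESS_KEY</value>
</property>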

Connect to a MapR Data Source

Creating a MapR data source involves specifying the connection information to the MapR Container Location Database (CLDB) server. Once you have successfully connected, you will be able to browse the files and directories in the MapR file system, and choose the files that you want to add to Platfora as datasets.

When you add a new data source that connects to a MapR cluster, you will need to supply the following connection information:

Source Type: MapR

Name: A name for the data source location. This can be any name you choose, such as MapR File System or MapRFS Marketing Directory.

Host: The external DNS hostname or IP address of the MapR Container Location Database (CLDB) server.

Port: The port that the MapR CLDB server listens on for client connections. The default port is 7222.

Root Path: The MapR file system (MapRFS) directory that Platfora should access. For example, to access the entire file system, use / (root directory). To access a particular directory only, enter the qualified path (for example, /user/data or /data/weblogs).

If the connection to MapR is successful, you will see a list of the files and directories that reside in the specified location of the MapR file system when defining a dataset from the data source.

If you have trouble connecting to MapR, make sure that the CLDB server process is running, and that the Platfora server machine has access over the network to the designated CLDB port. Also, make sure that the system user that the Platfora server runs as has read permissions to the MapRFS directory location you specified.

Connect to Other Data Sources

The Other data source type allows you to specify a connection URL to an external data source server. You can use this to create a data source when you already know the protocol and URL to connect to a supported data source type.

When you add a data source using Other, you will need to supply the following connection information:

Source Type: Other

Name: A name for the data source location. This can be any name you choose, such as My File System or Marketing Data.

URL: A connection URL for the data source using one of the supported data source protocols (hdfs, maprfs, thrift, gs, or s3n). You can also use the file protocol to access a directory or file on the local Platfora master server file system. For example:

file://localhost:8001/file_path_on_platfora_master
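Connection URLs for the other supported protocols follow the same pattern. The host names, ports, and paths below are hypothetical examples only:

hdfs://namenode.example.com:8020/data
maprfs://cldb.example.com:7222/data
thrift://hive-metastore.example.com:9083
s3n://my-bucket/data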


If the connection to the data source is successful, you will see a list of the files and directories that reside in the specified location of the file system when defining a dataset from the data source.

If you have trouble connecting, make sure that the Platfora server machine has access over the network to the designated server. Also, make sure that the system user that the Platfora server runs as has read permissions to the directory location specified.

About the Uploads Data Source

When you first start the Platfora server, it connects to the configured distributed file system (DFS) for Hadoop, and creates a default data source named Uploads. This data source cannot be deleted. You can upload single files residing on your local file system, and they will be copied to the Uploads data source in Hadoop.

For large files, it may take a few minutes for the file to be uploaded. The largest file that you can upload through the Platfora web application is 50 MB. If you have large files, consider adding them directly in the Hadoop file system rather than uploading them through the browser.
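For example, to stage a large file directly in HDFS instead of uploading it through the browser, you could copy it from the Platfora master server using the Hadoop client (the file name, NameNode host, and target directory here are hypothetical):

$ hadoop fs -put ./clickstream_2016.csv hdfs://namenode.example.com:8020/data/incoming/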

Other than adding new files, you cannot manage the files in the Uploads data source through the Platfora application. You cannot remove files from the Uploads data source once they have been uploaded, or create sub-directories to organize uploaded files. If you want to remove a file, you must delete it in the DFS source system. Re-uploading a file with the same file name will overwrite the previously uploaded copy of the file.


Upload a Local File

You can upload a file through the Platfora application and it will be copied to the default Uploads data source in Hadoop. Once a file is uploaded, you can then select it as the basis for a dataset.

1. Go to the Data Catalog page.

2. Click Add Dataset to open the dataset workspace.

3. Click Data Source Based Dataset.

4. Click Upload File.

5. Browse your local file system and select the file you want to upload.

6. Click Upload.

After the file is uploaded, you can either Cancel to exit the dataset workspace, or Continue to define a dataset from the uploaded file.

About Security on Uploaded Files

By default, data access permissions on the Uploads data source are granted to the Everyone group. Object permissions allow the Everyone group to define datasets from the data source (and thereby upload files to this data source).


Keep in mind that only users with a system role of System Administrator or Data Administrator are allowed to create datasets, so only these roles can upload files.

Configure Data Source Security

Only system administrators can create data sources in Platfora. Access to the files in a data source location is controlled by granting data access permissions to the data source. The ability to manage or define datasets from a data source is controlled by its object permissions.

1. Go to the Data Catalog page.

2. Click Add Dataset to open the dataset workspace.

3. Click Data Source Based Dataset.

4. Select the data source in the Source List.

5. Click Source Options.

6. Click Permissions.

7. The Data Access section lists the users and groups allowed to see the data coming from this data source location. If a user does not have data access, they will not be able to see any data values in Platfora that originate from this data source. Data access permissions apply to any Platfora object created from this source (dataset, lens, or viz).

Data access defaults to the Everyone group (click the X to remove it).

Click Add Data Access to grant data access to other users and groups.

8. The Collaborators section lists the users and groups allowed to access the data source object.

Click Add Collaborators to grant object access to users or groups.

The following data source object permissions can be granted:

• Define Datasets on Data Source. The ability to define datasets from files and directories in the data source.

• Manage Permissions on Data Source. Includes the ability to define datasets plus the ability to grant data access and object access permissions to other Platfora users.


Delete a Data Source

Deleting a data source from Platfora removes the data source connection as well as any Platfora dataset definitions you have created from that data source. It does not remove source files or directories from the source file system, only the Platfora definitions. The default Uploads data source cannot be deleted.

1. Go to the Data Catalog page.

2. Click Add Dataset to open the dataset workspace.

3. Click Data Source Based Dataset.

4. Select the data source you want to delete from the Source List.

5. Click Source Options.

6. Click Delete.

7. Click Confirm to delete the data source and all of its dataset definitions.

8. Click Cancel to exit the dataset workspace.


Edit a Data Source

You typically do not need to edit a data source once you have successfully established a connection. If the connection information changes, however, you can edit an existing data source to update its connection information, such as the server name or port of the data source. You cannot, however, change the name of a data source after it has been created.

1. Go to the Data Catalog page.

2. Click Add Dataset to open the dataset workspace.

3. Click Data Source Based Dataset.

4. Select the data source you want to edit from the Source List.

5. Click Source Options.

6. Click Edit.

7. Change the connection information for the data source. You cannot change the name of a data source after it has been saved.

8. Click Save.

9. Click Cancel to exit the dataset workspace.


Chapter 3: Define Datasets to Describe Data

Data in Hadoop is added to Platfora by defining a Dataset. A dataset describes the characteristics of the source data, such as its file locations, the structure of individual rows or records, the fields and data types, and the processing logic to cleanse, transform, and aggregate the data when it is loaded into Platfora. The collection of modeled datasets makes up the Data Catalog (the data items available to Platfora users). This section explains how to create and manage datasets in Platfora. Datasets point to source data in Hadoop.

Topics:

• FAQs - Dataset Basics

• Understand the Dataset Workspace

• Understand the Dataset Creation Process

• Understand Dataset Permissions

• Select Source Data

• Parse the Data

• Prepare Base Dataset Fields

• Transform Data with Computed Fields

• Add Measures for Quantitative Analysis

• Prepare Date/Time Data for Analysis

• Prepare Drill Paths for Analysis

• Define the Dataset Primary Key

• Model Relationships Between Datasets

• Prepare Location Data for Analysis

• Pre-Process Data with Transformation Datasets

FAQs - Dataset Basics

This section answers the most frequently asked questions (FAQs) about creating and managing Platfora datasets.


What is a dataset?

A dataset points to a set of files in a data source and describes the structure of the data, as well as any processing logic needed to prepare the data for consumption. A dataset is just a metadata description of the data -- it contains all of the data about the data -- plus a small sampling of raw rows to facilitate data discovery.

What are the prerequisites for creating a dataset?

You need access to the source data. Before you add a new dataset to Platfora, the source data files on which the dataset is based must be in the Hadoop file system and accessible to Platfora via a Data Source. You can also upload files from your desktop to the default Uploads data source.

Who can create a dataset?

Only System Administrators or Data Administrators can create and edit datasets in Platfora. You must also have data access permissions to the source data in order to define a dataset from data files in Hadoop. The person who creates the dataset becomes the dataset owner. The dataset owner can grant permissions to other Platfora users.

How do I create a dataset?

Go to the Data Catalog and click Add Dataset.

The dataset workspace guides you through a series of steps to define the structure and processing rules for the data. See Understand the Dataset Creation Process.

How do I edit an existing dataset?

If you have permission to edit a dataset, you can edit it by opening the dataset workspace. On the Data Catalog page in either Card or List view, click the dataset name, or choose Edit from the dataset action menu. You can make changes on the dataset workspace and save the changes.


If the dataset action menu shows View instead of Edit, it means you don't have the appropriate permissions. Ask the dataset owner to grant you edit permission.

How do I rename a dataset?

You cannot rename a dataset after it has been saved for the first time.

You can, however, make a duplicate copy of a dataset and save it under a new name. Then you can delete the old dataset and keep the renamed one. Note that any references to the renamed dataset will be broken in other datasets, so you will have to manually update those.

Can I make a copy of a dataset?

Yes, you can make a copy of an existing dataset. Edit the dataset you want to copy, and choose Save As from the dataset workspace Save menu.

Platfora makes a copy of the current version of the dataset using the new name. Any dataset changes that were made since saving the previous dataset are applied to the new dataset only.

You might want to copy an existing dataset to:

• Experiment with changes to the dataset computed fields without affecting the original dataset.

• Create another dataset that accesses different source files for users that only have access to source files in a different path.

• Change the name of the dataset (then delete the original dataset).

Since duplicating a dataset changes its name, references to the previous dataset will not be automatically updated to point to the duplicated dataset. You must manually edit the other datasets and update their references to point to the new dataset name instead.


How do I delete a dataset?

Find the dataset in the Data Catalog in either Card or List view and choose Delete from the dataset action menu.

If the delete option is not available, it means you don't have the appropriate permissions. Only a dataset owner can delete a dataset.

Deleting a dataset does not remove files or directories in the source file system, nor does it remove lenses built from the dataset. Any lenses that have been built from the dataset will remain in Platfora; however, future lens builds that use a deleted dataset will fail. Also, any references to the deleted dataset will be broken in other datasets.

What kinds of data can be used to define a dataset?

You can define a dataset from data files that reside in Hadoop. Platfora supports a number of file formats out-of-the-box. See Supported Source File Formats.

How do I join datasets together?

The logic of a join is described within the dataset definition as a Reference. A reference joins two datasets together using fields they share in common. A reference creates a link in one dataset to the primary key of another dataset. The actual joining of the datasets happens during lens build time, not when the reference is created. See Model Relationships Between Datasets.


What are the different kinds of columns or fields that a dataset can have?

A field is an atomic unit of data that has a name, a value, and a data type. A column is a set of data values of a particular data type, with one value for each row in the dataset. Columns provide the structure for composing a dataset row. The terms column and field are often used interchangeably.

Within a dataset, there are three basic classes of fields or columns:

• Base fields are the raw fields parsed directly from the source data.

• Computed fields are fields that you add to the dataset to perform some kind of extraction, cleansing, or transformation on the base data fields.

• Measure fields are a special type of computed field that specifies how the data should be aggregated when it is analyzed. For example, suppose you had a Dollars Sold field in your dataset. At analysis time, you may want to know the Total Dollars Sold per day (a SUM aggregation). Measures serve as the quantitative data in an analysis, and every dataset, lens, and viz must have at least one measure.

Every dataset column or field also has a data type, which describes the kind of values allowed in that column. See About Platfora Data Types.

You can change the data types of base fields. Computed field data types are set by the output type of their computed expression.

How do I transform or manipulate the data?

To transform or manipulate the data, add computed fields to the dataset. Platfora's expression language has an extensive library of built-in functions and operators that you can use to define computed fields. Think of a computed field as a single step in an ETL (extract, transform, load) workflow. Sometimes several steps, or computed fields, are needed to achieve the result you want. You can hide the computed fields that do interim data processing steps.

How do I request data from a dataset?

You request data from a dataset by choosing one dataset in the Data Catalog, and creating a lens from that dataset. When you create a lens, you can choose any fields you want from the focus dataset, plus dimension fields from any dataset that it references. When you build the lens, Platfora will go fetch the data from Hadoop and prepare it for analysis.

See Define Lenses to Load Data.

Understand the Dataset Workspace

When you add, edit, or view a dataset, you are brought to the dataset workspace. This is where you describe the structure and characteristics of your source data in the form of a Platfora dataset. The dataset workspace is divided into three pages where you can view how the source data has been described in the form of a Platfora dataset. Additionally, if you have Edit permission on the dataset, you can also edit the existing dataset configuration.

Properties Page

1. Shows dataset properties, such as the source type, size, and owner.

2. Select a field and view some basic properties of the field.

3. Filter the list of fields to make it easier to find a particular field.

4. View different catalog objects that use this dataset, such as segments and lenses based on the dataset.

5. Save any change made to the dataset.


Data Page

1. The dataset is horizontally divided into fields (or columns). Fields are listed in the order that they occur in the source data (for original base fields), then in the order that they were added to the dataset (for computed fields).

2. The field headers can be expanded or collapsed to more easily view the data type of each field, and to more quickly edit some commonly configured field properties, such as their names.

3. When you select a column, the Inspector panel shows the field detail information, including detailed statistics based on the sample data currently displayed. Platfora uses the power of Spark to calculate statistics on each field in real time. This is where you can edit things like the field name, description, data type, or quick measure aggregations. You can hide (and show) the Inspector panel if you need more room to work in the dataset workspace.

4. By default, Platfora shows up to 100 rows to help with data discovery. These are records taken directly from the source data files, and shown with the parsing or expression logic applied. You can choose to increase the number of sample rows. Changing the number of sample rows displayed affects the statistics shown in the Inspector panel. The number of sample rows shown might be less than the value chosen in the Sample at Most drop-down menu. This might happen if the Spark server reaches a timeout value before fetching all sample rows from the source files, especially if there are a large number of source files with small amounts of data in them. Also, Platfora always shows a maximum of 20 sample rows for derived datasets, elastic datasets, and datasets using data connector plugin data sources.

Some computed fields do not show sample values because the values are computed at lens build time, such as computed fields that operate on fields not in the current dataset (via references).

5. You can edit and configure dataset properties that were originally set during the dataset creation wizard.


6. You can search for fields to locate them on the page quickly. This is useful if the dataset has hundreds of fields.

7. You can search for and highlight field values in the table.

Relationships Page

1. Choose the type of relationship to view or edit. Relationships between datasets must be defined after the datasets are initially created.

2. Select a reference to view its details.

Understand the Dataset Inspector Panel

The Platfora Inspector panel allows you to edit a dataset field and view detailed statistics on it in real time. The Inspector panel is visible by default, but you can hide it to increase the available space for displaying dataset fields. View the Inspector panel on the Data page of the dataset workspace, or the Manage step of the dataset creation wizard.

Use the Properties tab in the Inspector panel to edit field properties, such as the name and data type, and to view statistics on the field. Platfora uses an Apache Spark server to calculate statistics on a field as you select it in the dataset workspace. All statistics shown are based on the sample data currently shown in the dataset workspace. Therefore, to get more precise numbers, you can try increasing the number of sample rows. Platfora shows up to 100 rows of data by default.

The Inspector panel shows no statistics when the field is a computed field that contains a field in a referenced dataset, or the FILE_PATH or FILE_NAME functions.

The Properties tab in the Inspector panel shows the following statistics:

• (Numeric and datetime data only) The distribution of the data as a histogram.

• (Numeric and datetime data only) The distribution of the data as a box and whisker plot that shows the minimum value, maximum value, and interquartile range of the data.

• The most common values, including their percentages.

• Percentage of NULL values.

• Number of distinct values.


Use the Options tab to perform optional tasks on the field, such as creating a binned field (numeric data types only) or hiding the field from the lens builder.

Filter Sample Data in a Dataset

You can filter the sample data shown in a dataset to look for specific field values. Platfora highlights the field values that meet the filter criteria. Filtering the sample data only filters the rows shown in the dataset workspace or dataset creation wizard. It does not filter data in any lens.

You might want to filter sample data to quickly find rows with specific data so you can parse or cleanse other fields in those rows in a new computed field.

You can filter sample data using a literal string or a regular expression. Note that searching for a literal string is case insensitive.


You might want to use a regular expression to search for more complicated patterns. For example, you can use a negative lookahead regular expression, which only matches if the text is not found.
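For example, a negative lookahead pattern such as the following (where the string bot is just a placeholder) matches only values that do not contain bot:

^(?!.*bot).*$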

1. Go to the Manage step of the dataset creation wizard.

Or, for an existing dataset, open the dataset workspace, and click the Data page.

2. Click Filter Table.

3. Choose whether to filter using a literal string or regular expression.

4. Enter the filter expression.

Platfora highlights the field values and the matched text. It also lists how many rows matched.

Understand the Dataset Creation Process

There are several steps involved in creating a Platfora dataset. This section helps data administrators understand all of the tasks to consider when defining a dataset in Platfora. The goal of the dataset is to make the data consumable for data analysts and business users.

The dataset creation wizard is divided into multiple areas to guide you through the dataset definition process. You can go back and forth between the areas as you work on the dataset by clicking the Next and Previous buttons. The dataset creation wizard is the same as the Data page in the dataset workspace.

Choose Dataset Type

When you first add a dataset, you must choose the type of dataset to create. The type of dataset you create determines the steps in the dataset creation wizard:


• Data source based dataset. This is the most common and basic dataset type you can create in Platfora. You go through four main steps: Select, Parse, Manage, and Finish.

• Union dataset. A Union dataset is a transformation dataset that combines data in related fields from existing Platfora datasets. You go through four main steps: Select, Union, Manage, and Finish. For more information about Union datasets, see Combine Data with Union Datasets.

• SQL dataset. A SQL dataset is a dataset whose underlying data is produced from the results of a Hive query language (HiveQL) statement performed on existing Platfora datasets. You go through three main steps: SQL, Manage, and Finish. For more information about SQL datasets, see Work with SQL Datasets.

Step 1—Select Data (Data Source Based and Union Datasets Only)

In the first step in the dataset creation wizard for data source based and Union datasets, you Select the source data for the dataset.


For data source based datasets, you point Platfora to a specific location in a data source and select a table or specific files. You can only browse the data sources that have been added to Platfora by a system administrator.

For Union datasets, you Select one or more existing datasets that have data you want to combine. The datasets you choose here become input datasets to the Union dataset.


Step 1—Define SQL Statement (SQL Datasets Only)

The first step in the dataset creation wizard for creating a SQL dataset is to define the SQL statement to manipulate existing Platfora datasets. Do this by entering a Hive query language (HiveQL) statement that selects fields from specific datasets, and optionally transforms the data from the fields in some way. Platfora processes the expression language using an Apache Spark server.

Best practice is to enclose Platfora field names and dataset names in the HiveQL statement in the grave accent character ( ` ), also known as the backtick character. For example, SELECT `Field1`, `Field2`, `Field3` FROM `Dataset1`. Enclosing names in the ` character is required for field and dataset names containing numerals only (such as 12345) or special characters (such as spaces and periods).
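As a slightly fuller illustration (the dataset and field names here are hypothetical), the following HiveQL statement selects fields from an existing dataset and transforms one of them, using backticks around a field name that contains a space:

SELECT `user_id`, UPPER(`country`) AS `country_code`, `session length` FROM `Web Sessions`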

Step 2—Parse Data (Data Source Based Datasets Only)

The second step in the dataset creation wizard for data source based datasets, Parse, is where you specify the parsing logic used to extract rows and columns from the source data. Platfora comes with several built-in parsers for the most common file formats. After you have done the initial parsing, you usually don't need to revisit this step unless the underlying structure of the data changes.


The Parsing Options tab shows the data with the parsing logic applied. The View Raw tab shows the original raw data records.

Step 2—Edit Union (Union Datasets Only)

The second step in the dataset creation wizard for Union datasets, Union, is where you specify how to combine fields from the input datasets. Platfora makes a best effort based on field names and data types to match fields from the input datasets. You can change how these fields are mapped together.

To make sure that the data is combined as you desire, you may need to:

1. Define how to match the fields in the input datasets with each other to create a new field in the Union dataset. This field mapping determines how Platfora combines data from the input datasets.

2. Create an additional field in the Union dataset and then define which fields to use from each input dataset.

3. Revert all currently mapped fields and start over with Platfora's default field mappings based on fuzzy matching.

4. Remove all currently mapped fields.

5. View the fields in the input datasets, and optionally create new fields in each input dataset that you can use in a field mapping.


6. For each field mapping, view how the values from each input dataset compare to each other.

Step 3—Manage Fields (All Dataset Types)

The third step in the dataset creation wizard, Manage, is where you prepare the actual fields that users can see and request from the Platfora data catalog. This is where the majority of the dataset definition work is performed.

To make sure that the data is in consumable format for analysis, you may need to:


1. Give fields meaningful names

2. Verify the base field data types

3. Add field descriptions to help users understand the data

4. Specify how NULL values are handled

5. Add computed fields to further transform and process the data

6. Identify the dataset measures

7. Define the dataset primary key

8. Define drill path hierarchies

9. Hide fields you don't want users to see

10. Prepare datetime data for analysis

Step 4—Finish & Save (All Dataset Types)

The final step in the dataset creation wizard, Finish, is where you choose the dataset name, add a description of the dataset, configure labels (optional), and choose who has object permission on the dataset. Dataset names cannot be changed after the dataset has been created.

Dataset names cannot contain the grave accent character ( ` ), also known as the backtick character.

(Optional) Post Dataset Creation–Define Relationships (All Dataset Types)

After creating a dataset using the dataset creation wizard, you can open the dataset and edit it further.

On the Data page, you can manage the dataset fields and make edits to anything defined during the initial dataset creation on the Manage step.

On the Relationships page, you can:


1. Create joins with other datasets once all dependent datasets have been added to Platfora

2. Prepare geo-location data for analysis.

3. Define event references for event series analysis and segment analysis.

When adding the dependent datasets, you must make sure that a) they have a primary key, and b) the data types of the primary key and foreign key fields are the same in both datasets.

Understand Dataset Permissions

Only system and data administrators can create datasets in Platfora. The ability to edit or create a lens from a dataset is controlled by the dataset's object permissions. In addition to the dataset object permissions, users must also have access to the source data itself in order to see and work with the data in Platfora.

Platfora controls access to a dataset at two levels:

• Source Data Access Permission - Source data access permission determines who is authorized to view the raw source data. By default, data access permission is controlled at the data source level only, and is inherited by the datasets coming from that data source. Your Platfora system administrator may also configure Platfora to authorize data access using the permissions set in HDFS. In these two cases, data access permission is disabled at the dataset level. If Platfora is configured for more granular per-dataset access control, then data access can be set independently of the data source, but this is not the default behavior.

• Dataset Object Permissions - Dataset object permissions control who can edit, delete, or create a lens from a dataset within the Platfora application.

Users must have dataset permissions at both levels in order to work with a dataset.


To manage permissions for a dataset, find the dataset in the data catalog and select Permissions. Click Add Collaborators to choose new users or groups to add.

By default, the user who created the dataset is the owner, and the Everyone group is granted Define Lens from Dataset access.

The following dataset object permissions can be granted:

• Define Lens from Dataset. The ability to define a lens from the visible fields of a dataset. The fields of referenced datasets are not included in this permission by default. A user must have appropriate permissions on each individual dataset in order to choose dataset fields for a lens. By default, all datasets have this permission granted to Everyone.

• Edit. Define lens plus the ability to edit the dataset definition. Editing the dataset definition means a user can see the raw data, including hidden fields.

• Own. Edit plus the ability to delete a dataset or manage its permissions.

Select Source Data

After you have created a data source, the first step in creating a dataset is selecting some Hadoop source data to expose in Platfora. This is accomplished by choosing files from the source file system.

For Hive data sources, a single Hive or Cloudera Impala table definition maps to a single Platfora dataset.

For file system data sources, such as HDFS or S3, a dataset can map to either a single file or to multiple files residing in the same parent directory location.

For the default Uploads data source, a dataset usually maps to a single uploaded file, although you can select multiple uploaded files if they use a similar file naming convention.


Supported Source File Formats

To ingest source data, Platfora uses its parsing facilities to parse the data into records (rows) and fields (columns). Platfora supports the following source file formats and file compression formats.

Hive Tables: When creating a dataset from a Hive table, there is no need to define parsing controls in Platfora. Platfora uses the Hive table definition to obtain metadata about the source data, such as which files to process, the parsing logic for rows and columns, and the field names and data types contained in the source data. Since Platfora relies on Hive to do the file parsing, you must make sure that Hive is able to correctly handle the source file format of the underlying table data files. Platfora is able to parse Hive tables that refer to data in the following file formats:

• Delimited Text file format

• SequenceFile format

• Record Columnar File (RCFile) format

• Optimized Row Columnar (ORC) file format

• Custom Input Format (provided that the SERDE used to define the row format is also installed in Platfora)

Impala Tables: When creating a dataset from a Cloudera Impala table, there is no need to define parsing controls in Platfora. Platfora uses the Impala table definition to obtain metadata about the source data, such as which files to process, the parsing logic for rows and columns, and the field names and data types contained in the source data. Since Platfora relies on Impala to do the file parsing, you must make sure that Impala is able to correctly handle the source file format of the underlying table data files. Platfora is able to parse Impala tables that refer to data in the following file formats:

• Parquet file format

• Text file format

• Avro file format

• Record Columnar File (RCFile) format

• SequenceFile format

Delimited Text: A delimited file is a plain text file format for describing tabular data. It refers to any file that is plain text (typically ASCII or Unicode characters), has one record per line, has records divided into fields, and has the same sequence of fields for every record. Records (or rows) are separated by line breaks, and fields (or columns) within a line are separated by a special character called the delimiter (usually a comma or tab character). If the delimiter also appears in the field values, it must be escaped. The Platfora delimited parser supports single character escapes (such as a backslash), as well as enclosing field values in double quotes (as is common with CSV files).

CSV: Comma-separated value (CSV) files are a type of delimited text file. The Platfora delimited file parser also supports typical CSV formatting conventions, such as enclosing field values in double quotes, using double quotes to escape literal quotes, and the use of header rows.

JSON: JavaScript Object Notation (JSON) is a data-interchange format based on a subset of the JavaScript Programming Language. JSON is a text format comprised of two basic data structures: objects and arrays. The Platfora JSON parser supports the selection of a top-level JSON object to signify a record or row, and selection of name:value pairs within an object to signify columns or fields (including nested objects and arrays).

XML: Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. XML is a text format comprised of two basic data structures: elements and attributes. The Platfora XML parser supports the selection of a top-level XML element to signify a record or row, and selection of attribute:value or element:value pairs within a parent element to signify columns or fields (including nested elements).

Avro: Apache Avro is a remote procedure call (RPC) and data serialization framework. Its primary use is to provide a data serialization format for persistent data stored in Hadoop. It uses JSON for defining schema, and JSON or binary format for data encoding. When Avro data is stored in a persistent file (called a container file), its schema is stored along with it. This allows any program to be able to read the serialized data in the file.

Parquet: Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem. The file must have a parquet-format project module that defines how to read the Parquet files.

Hadoop SequenceFiles: Sequence files are a file format generated by Hadoop MapReduce tasks, and are a common format for storing data in Hadoop. It is a flat file format containing binary records. Platfora can import records contained within a sequence file as long as the format of the records is delimited text, CSV, JSON, XML, or Avro.

Web Access Logs: A web access log contains records about incoming requests made to a web server. Platfora has a built-in parser that automatically recognizes web access logs that adhere to the NCSA common or combined log formats used by many popular web servers (such as Apache HTTP Server).

Other File Types: For semi-structured file formats, you can still define parsing logic using regular expressions or Platfora's built-in expression language. Platfora provides a Regex or Line parser to allow you to define your own parsing logic to extract data columns from the records in your source files (as long as your source files have one record per line).

Custom Data Sources: For source data coming in from a custom data connector, the logic of the data connector dictates the format of the data. For example, if using a JDBC data connector to access data in a relational database, the data is returned in delimited format.

For Platfora to read a compressed source file, both your Platfora and your Hadoop configuration must support the compression format. By default, Platfora supports the following formats:

Deflate (zlib), Gzip, Bzip: Platfora and Hadoop support these formats out-of-the-box.

Snappy: Platfora includes support for Snappy in its distribution. Hadoop does not. Your administrator must configure Hadoop to support Snappy. Refer to your Hadoop distribution documentation for information on configuring Snappy.

LZO (Hadoop-LZO), LZ4: Due to licensing restrictions, Platfora does not bundle support for these with the product. Your administrator must configure these compression formats both in Platfora and Hadoop. Although neither compression format is explicitly qualified with each new release, Platfora will fix issues and release patches if a problem is discovered.

Select a Hive or Impala Source Table

For Hive data sources, Platfora points to a Hive metastore server. From that point, you can browse the available databases and tables, and select a single Hive table or Cloudera Impala table on which to base your dataset.


Since Hive and Impala tables are already in tabular format, the parsing step is skipped for Hive data sources.

Note that Platfora does not execute any queries through the Hive server or Impala server. It only uses the table definition in the Hive metastore to obtain the metadata needed to define a dataset.

1. On the Select step of the dataset creation wizard, select a Hive data source from the Source List.

2. Select the database that contains the table you want to use.

Note that the default Hive database is named default.

3. Select a single Hive or Impala table. Only tables can be used to define datasets, not views.

Platfora will use the table definition to determine the source files, columns, data types, partitioning, and so on.

4. Click Next.

Platfora skips the Parse step for Hive and Impala datasets, and goes directly to the Manage step.

Select DFS Source Files

For distributed file system data sources, such as HDFS and S3, a data source points to a particular directory in the file system. From that point, you can browse and select the source files to include in your dataset. You can enter a wildcard pattern to select multiple files, including files from multiple directory locations; however, all of the files selected must be of the same file format.

1. On the Select step of the dataset creation wizard, select an HDFS, Google Storage, or S3 data source from the Source List.

2. Browse the file system to choose a directory or file you want to use as the basis for your dataset.

3. To select multiple files within the selected directory, use a wildcard pattern in the Source Location path, where ? represents a single character and * represents any number of characters.

For example, suppose you wanted to base a dataset on log files that are partitioned into monthly directories. To select all log files for 2014, you could use a wildcard path such as:

hdfs://myhdfs.mycompany.com/data/*2014/*.log

4. In the selected source files list, confirm that the files you want are selected. If a large number of source files are selected, Platfora will only display the first 200 file names.

5. Click Next.


Edit the Dataset Source Location

You can edit an existing dataset to point to a different source location as long as it is in the same data source. You cannot switch data sources for a dataset after it has been saved. For example, you cannot change a dataset that points to the Uploads data source to use another HDFS data source instead.

1. Open the dataset workspace, and click the Data page.

2. Click Edit Data Source.

3. Edit the Source Location to point to the new directory path or file name within the same datasource.

4. Click Update.

5. Click Save.

Parse the Data

The Parse Data step of the dataset workspace is where you specify the parsing options for a dataset. This section describes how to use Platfora's built-in file parsers to describe your source data in tabular format (rows and columns). The built-in parsers assume that each record has a similar data structure.

View Raw Source Data Rows

On the Parse step of the dataset creation wizard, Platfora shows a sample of raw lines or records from a source data file. This allows you to compare the data in its original format (View Raw tab) to the data with the parsing logic applied (Parsing Options tab). Viewing the raw data is helpful in determining the parsing logic, and when writing computed field expressions that do transformations on base fields.

For delimited data, Platfora shows the first 100 lines, starting with the first source file.


For structured file formats, such as JSON and XML, Platfora shows the first 100 top-level objects, starting with the first source file.

The View Raw tab shows the rows from the source data file(s). The Parsing Options tab shows the data values after the parsing logic has been applied.

1. Go to the Parse step of the dataset creation wizard.

2. Select the View Raw tab.

You can also view the raw data for existing datasets.

1. Open the dataset workspace, and click the Data page.

2. Click Edit Parsing Options.

3. Click the View Raw tab.


4. Click Cancel when you're done viewing the raw data.

Parse Delimited Data

To use Platfora's delimited file parser, your data must be in plain text file format, have one record per line, and have the same sequence of fields for every record, separated by a common delimiter (such as a comma or tab).

Delimited records (or rows) are separated by line breaks, and fields (or columns) within a line are separated by a special character called the delimiter (usually a comma or tab character). If the delimiter also appears in the field values, it must be escaped. The Platfora delimited parser supports single-character escapes (such as a backslash), as well as enclosing field values in double quotes (as is common with CSV files).

You define the parsing controls on the Parse step of the dataset creation wizard.

The Parsing Controls for the Delimited parser are as follows:

File Type
Choose the Delimited parser for delimited text and CSV files. The Wrangled view shows the data with the parsing logic applied.

Upload Field Names
Allows you to upload a comma- or tab-delimited text file containing the field information you want to set. When a dataset has a lot of fields to manage, it may be easier to update several field names, descriptions, data types, and visibility settings all at once rather than editing each field one by one. For more information, see Bulk Upload Field Header Information.

Row Delimiter
Specifies a single character used to separate rows (or records) in your source data files. In most delimited files, rows are separated by a new line, such as the line feed character, carriage return character, or carriage return plus line feed. Line feed is the standard new line representation on UNIX-like operating systems. Other operating systems (such as Windows) may use carriage return individually, or carriage return plus line feed. Selecting Any New Line will recognize any of these representations of a new line as the row delimiter.

Ignore Top Rows
Specifies the number of lines at the beginning of the file to ignore when reading the source file during data ingest and lens builds. Enter the number of lines to ignore and click Update. To use this with the Raw File Contains Header option, ensure that the line containing the column names is visible and is the first remaining line.

Column Delimiter
Specifies the single character used to separate the columns (or fields) of a row in your source data files. Comma and tab are the most commonly used column delimiters.

Raw File Contains Header
A header is a special row containing column names at the beginning of a data source file. If your source data files have a header row as the first line in the file, select this check-box. This will treat the first line in each source file as a header row instead of as a row of data.

Escape Character
Specifies the single character used to escape the Quote Character or another instance of the Escape Character when a Quote Character is specified. Platfora only reads an escape character as data if it is escaped with another escape character. If your data values contain quote characters as data, those characters must be escaped and the entire field value must be enclosed within the Quote Character; otherwise the parser will assume the quote character denotes a new column. For comma-separated values (CSV) files, it is common practice to escape column delimiters by enclosing the entire field value within double quotes. If your source data uses this convention, then you should specify a Quote Character.

Quote Character
The quote character is used to enclose individual data values in CSV-formatted files. The quote character is typically the double quote character ("). If a field value contains a column delimiter as data, then the field value must be enclosed in the Quote Character; otherwise the parser will assume the column delimiter denotes a new column. If a field value contains the quote character as data, then the field value must be enclosed in the Quote Character and it must be escaped, either by the Escape Character or another quote character. If a field value contains a row delimiter (such as a new line character) as data, then the field value must be enclosed in the Quote Character and Field values contain new lines must be checked.
For example, suppose you have a row with these three data values:

weekly special
wine, beer, and soda
"2 for 1" or 9.99 each

If the column delimiter is a comma, the quote character is a double quote, and the escape character is a backslash, then a correctly formatted row in the source data would look like this:

"weekly special","wine, beer, and soda","\"2 for 1\" or 9.99 each"

Field values contain new lines
Check this option if your source data might contain new line characters as part of a field value. This option is available when a Quote Character is specified. When enabled, Platfora reads the new line characters inside quote characters as part of the field value instead of as a row delimiter. Platfora interprets any row delimiter character outside of quote characters as a new record. Enabling this option may impact lens build performance if Platfora reads very large source files. Note that you may get unexpected results if you enable this option and the source file has malformed data (for example, when a field value has either an opening or closing quote character but not both). Try to ensure your source data is well formed when using this option.

You can also access the parsing controls for an existing dataset by clicking Edit Parsing Controls on the Data page of the dataset workspace.
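
To make the Quote Character and Escape Character interplay concrete, here is a minimal sketch using Python's standard csv module (not Platfora's parser) to read the example row above under the same settings:

import csv, io

# Comma column delimiter, double-quote quote character, backslash escape.
line = '"weekly special","wine, beer, and soda","\\"2 for 1\\" or 9.99 each"\n'
reader = csv.reader(io.StringIO(line), delimiter=',', quotechar='"',
                    escapechar='\\', doublequote=False)
print(next(reader))
# ['weekly special', 'wine, beer, and soda', '"2 for 1" or 9.99 each']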

Specify a Single-Character Custom Delimiter

If your delimited data uses a special delimiter character that is not available in the default choices, you can define a custom delimiter as either a single-character string or a decimal-encoded ASCII value.

1. Go to the Parsing Options panel. Make sure you are using the Delimited parser.

2. Choose Add Custom from the Column Delimiter or Row Delimiter menu.

3. Choose the encoding to use: String or ASCII Decimal Code.

4. Enter the delimiter value.

For String, you can enter any single character that you can type on your keyboard.

For ASCII Decimal Code, enter the decimal-encoded representation of the ASCII character. For example, 29 is the ASCII code for the group separator, 30 is for the record separator, and 65 is for the letter A (see the sketch after these steps).

5. Click OK to add the custom delimiter to the selected parser delimiter menu.
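
The decimal codes are plain ASCII, so you can confirm what character a code stands for with a one-liner in any language; a quick Python check (illustrative only):

# ASCII decimal codes to characters: 29 = group separator,
# 30 = record separator, 65 = 'A'.
for code in (29, 30, 65):
    print(code, repr(chr(code)))
# 29 '\x1d'
# 30 '\x1e'
# 65 'A'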

Specify a Multi-Character Column Delimiter

In some cases, your source data may use a multi-character column delimiter. The delimited parser does not support multi-character delimiters, but you can work around this by using the Regex parser instead.

1. Go to the Parsing Options panel.

2. Choose the Regex parser.

3. Enter a Regular Expression that matches the structure of your data lines.

For example, if your multi-character column delimiter were two colons (::), and your data had 6 fields, then you could use a regular expression such as the following (a short sketch after these steps shows it in action):

(.*)::(.*)::(.*)::(.*)::(.*)::(.*)

4. Click Continue.
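
As a quick illustration of the expression above (outside Platfora, with a made-up line), each capturing group becomes one column:

import re

# Hypothetical line with six fields separated by the two-character '::' delimiter.
line = "2015-08-16::GET::/home/index.html::200::1043::fast"
pattern = re.compile(r"(.*)::(.*)::(.*)::(.*)::(.*)::(.*)")
print(pattern.match(line).groups())
# ('2015-08-16', 'GET', '/home/index.html', '200', '1043', 'fast')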

Parse Hive and Impala Tables

When creating a dataset from a Hive or Cloudera Impala table, there is no need to define parsing controls in Platfora. Platfora uses the table definition in the Hive metastore to obtain metadata about the source data, such as which files to process, the parsing logic for rows and columns, and the field names and data types contained in the source data. You can only create a dataset based on Hive and Impala tables, not views.

Because Platfora accesses the source data in HDFS when building lenses, you must make sure the system user that the Platfora server runs as has read permissions to the location in HDFS where the source data is located. For example, if a Hive table is created using the LOCATION clause LOCATION '/data/users/*', then the user the Platfora server runs as must have read permission on the /data/users directory in HDFS.

See the Apache Hive Wiki for more information about using Hive. See the Cloudera Impala documentation for more information about using Impala.

Hive and Impala to Platfora Data Type Mapping

When you create a dataset based on a Hive or Cloudera Impala table, Platfora maps the data types of the Hive or Impala columns to one of the Platfora internal data types.

Platfora has a number of built-in data types that can be used to classify the fields in a dataset. Hive and Impala also have a set of primitive and complex data types they support for table columns.

Platfora does not currently support the Hive or Impala BINARY primitive data type.

The DECIMAL data type is mapped to DOUBLE by default. This may result in a loss of precision due to round-off errors. You can choose to map DECIMAL columns to FIXED. This retains precision for numbers that have four or fewer digits after the decimal point, and loses precision for more precise numbers.

The Hive and Impala complex data types (MAP, ARRAY, STRUCT, and UNIONTYPE) are imported into Platfora as a single JSON-formatted STRING. You can then use the Platfora expression language to define new computed columns in the dataset that extract a particular key:value pair from the imported JSON structure.

If Platfora encounters a data type from Hive or Impala that isn't in the table below, such as VARCHAR, Platfora maps it to STRING.

Hive or Impala Data Type    Platfora Data Type
TINYINT                     INTEGER
SMALLINT                    INTEGER
INT                         INTEGER
BIGINT                      LONG
DECIMAL                     DOUBLE
FLOAT                       DOUBLE
DOUBLE                      DOUBLE
STRING                      STRING
MAP                         STRING (JSON-formatted)
ARRAY                       STRING (JSON-formatted)
STRUCT                      STRING (JSON-formatted)
UNIONTYPE                   STRING (JSON-formatted)
TIMESTAMP (must be in the Hive timestamp format yyyy-MM-dd HH:mm:ss[.SSS])    DATETIME
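
For illustration, this is the kind of key extraction a computed column performs on a MAP column that arrives as a JSON-formatted STRING. The sketch is plain Python with a hypothetical value, not Platfora's expression language:

import json

# Hypothetical MAP column value after import as a JSON-formatted STRING.
raw = '{"plan": "gold", "region": "us-east"}'
print(json.loads(raw)["plan"])
# gold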

Enable Hive and Impala SerDes in Platfora

If you are using Hive as a data source, Platfora must be able to parse the underlying source data files that a Hive or Cloudera Impala table definition refers to. For Hive to be able to support custom file formats, you implement Serialization/Deserialization (SerDe) libraries in Hive that describe how to read (or parse) the data. Any custom SerDe libraries that you implement in Hive must also be installed in Platfora.

Platfora is only compatible with SerDes of type FileInputFormat.

In order for Platfora to be able to read and process data files referenced by a Hive or Impala table, any custom SerDe library (.jar file) that you are using in your Hive or Impala table definitions must also be installed in Platfora.

To install a custom SerDe in Platfora, copy the SerDe .jar file to the following location on the Platfora master server (create the extlib directory in the Platfora data directory if it doesn't exist):

$PLATFORA_DATA_DIR/extlib

Restart Platfora after installing all of your SerDe jars:

platfora-services restart

How Platfora Uses Hive and Impala Partitions and Buckets

Hive and Cloudera Impala source tables can be partitioned, bucketed, neither, or both. In Platfora, datasets defined from Hive or Impala table sources take advantage of the partitioning defined in Hive or Impala. However, Platfora does not exploit the clustering or sorting of bucketed tables at this time.

Defining a partitioning field on a Hive or Impala table organizes the data into separate files in the source file system. The goal of partitioning is to improve query performance by keeping records together in the way that they are accessed. When a Hive or Impala query uses a WHERE clause to filter data on a partitioning field, the filter effectively describes which data files are relevant. If a Platfora lens includes a filter on any of the partitioning columns defined in Hive or Impala, Platfora will only read the partitions that match the filter.

A bucketed table is created using the CLUSTERED BY field [SORTED BY field] INTO n BUCKETS clause of the Hive or Impala table definition. Bucketing defines a hash partitioning of data based on values in the table. A bucketed table may also be sorted within each bucket. When the table is bucketed, each partition must be reorganized during the load phase to enforce the clustering and sorting. Platfora does not exploit the clustering or sorting of bucketed tables at this time.

Parse JSON Files

This section explains how to use the Platfora JSON parser to create datasets based on JSON files. JSON is a plain-text file format comprised of two basic data structures: objects and arrays. The Platfora JSON parser allows you to choose a top-level object to signify a record or row, and name:value pairs within an object to signify columns or fields (including nested objects and arrays).

What is JSON?

JavaScript Object Notation (JSON) is a data-interchange format based on a subset of the JavaScript Programming Language. JSON is a plain-text file format comprised of two basic data structures: objects and arrays.

An object is an unordered, comma-separated collection of name:value pairs enclosed in braces {}. A name is just a string identifier, also sometimes called a key. A value can be a string, a number, true, false, null, an object, or an array. An array is an ordered, comma-separated collection of values enclosed in brackets []. Objects and arrays can be nested in a tree-like structure within a JSON record or document.

For example, here is a user record in JSON format:

{ "userid" : "joro99" "firstname" : "Joelle", "lastname" : "Rose", "email" : "[email protected]", "phone" : [ { "type" : "home", "number": "415 123-4567" }, { "type" : "mobile", "number": "650 456-7890" }, { "type" : "work", "number": null } ] }

And the same user record in XML format:

<user>
  <userid>joro99</userid>
  <firstname>Joelle</firstname>
  <lastname>Rose</lastname>
  <email>[email protected]</email>
  <phone>
    <number type="home">415 123-4567</number>
    <number type="mobile">650 456-7890</number>
    <number type="work"></number>
  </phone>
</user>

Supported JSON File Formats

This section describes how the Platfora JSON parser expects JSON files to be formatted, and how to specify what makes a record or row in a JSON file. There are two general JSON file formats supported by Platfora: JSON Object per line and JSON Object.

The JSON Object per line format supports files containing top-level JSON objects, with one object per line. For example, here is a JSON file where each top-level object represents a user record, with one user object per line.

{"name": "John Smith","email": "[email protected]", "phone":[{"type":"mobile","number":"123-456-7890"}]}{"name": "Sally Jones", "email: "[email protected]", "phone":[{"type":"home","number":"456-789-1007"}]}{"name": "Jeff Hamm","email": "[email protected]", "phone":[{"type":"mobile","number":"789-123-3456"}]}

The JSON Object format supports files containing a top-level array of JSON objects:

[ {"name": "John Smith","email": "[email protected]"}, {"name": "Sally Jones", "email: "[email protected]"}, {"name": "Jeff Hamm","email": "[email protected]"}]

or one large JSON object with the records to import contained within a sub-array:

{ "registration-date": "Sept 24, 2014", "users": [ {"name": "John Smith","email": "[email protected]"}, {"name": "Sally Jones", "email: "[email protected]"}, {"name": "Jeff Hamm","email": "[email protected]"} ]}

In some cases, the structure of your JSON file might be more complicated. You must always specify one level from the JSON object tree to use as the basis for rows. You can, however, still extract columns from a top-level object as well.

As an example, suppose you had the following JSON file containing movie review records. You want a row to be created for each reviews record, but still want to retain the value of movie_title and year for each row:

[{"movie_title":"Friday the 13th", "year":1980, "reviews":[{"user":"Pam","stars":3,"comment":"a bit predictable"}, {"user":"Alice","stars":4,"comment":"classic slasher flick"}]}, {"movie_title":"The Exorcist", "year":1984, "reviews":[{"user":"Jo","stars":5,"comment":"best horror movie ever"}, {"user":"Bryn","stars":4,"comment":"I will never eat pea soup again"}, {"user":"Sam","stars":4,"comment":"loved it"}]},{"movie_title":"Rosemary's Baby", "year":1969, "reviews":[{"user":"Fred","stars":4,"comment":"Mia Farrow is great"}, {"user":"Lou","stars":5,"comment":"the one that started it all"}]}]

Using the JSON Object parser, you would choose the reviews array as the record filter. You could then add the movie_title and year columns by their path as follows:

$.movie_title
$.year

The $. notation starts the path from the base of the object tree hierarchy.
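
Conceptually, the parser emits one row per reviews entry while carrying the upstream $.movie_title and $.year values along. A minimal Python sketch of that flattening (illustrative only, with the sample trimmed to one movie):

# One movie from the sample above, trimmed for brevity.
doc = [{"movie_title": "Friday the 13th", "year": 1980,
        "reviews": [{"user": "Pam", "stars": 3},
                    {"user": "Alice", "stars": 4}]}]

# "reviews" is the record filter; movie_title and year come from one level up.
rows = [{"movie_title": m["movie_title"], "year": m["year"], **r}
        for m in doc for r in m["reviews"]]
print(rows)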

Use the JSON Parser

The Platfora JSON parser takes a sample of the source data to determine the format of your JSON files, and then shows the object hierarchy so you can choose the rows and columns to include in the dataset.

1. When you select data that is in valid JSON format, Platfora recognizes the file format and chooses a JSON parser.

2. The basis of a record or row depends on the format of your JSON files. You can either choose to use each line in the file as a record (JSON Object per line), or choose a sub-array in the file to use as the basis of a record (JSON Object).

3. If your object hierarchy is nested, you can add a Filter to a specific object in the hierarchy. This allows you to use objects nested within a sub-array as the basis for rows.

4. You can Expand the Parsing Options tab to make it wider. This allows you to more easily view and focus on the JSON hierarchy.

5. Use the Record object tree to select the columns to include in the dataset. You can browse up to 20 JSON records when choosing columns.

6. You can add additional columns based on objects above what was used as the row Filter. Use the Data Field Path to add a column by its path in the top-level object hierarchy. The $. notation is used to specify a path from the root of the file.

7. Sometimes you can't delete columns that are added by mistake. For example, the parser may incorrectly guess the row filter, or you might make a mistake adding columns using Data Field Path. If this happens, you can always hide these columns on the Manage step.

For the JSON Object per line format, each line in the file represents a row.

For the JSON Object format, the top-level object is used by default to signify rows. If the objects you want to use as rows are contained within a sub-array, you can specify a Filter to the array name containing the objects you want to use.

For example, in this JSON structure, the Filter value would be users (use the objects in the users array as the basis for rows):

{ "registration_date": "September 24, 2014", "users": [ {"name": "John Smith","email": "[email protected]"}, {"name": "Sally Jones", "email": "[email protected]"}, {"name": "Jeff Hamm","email": "[email protected]"} ]}

Or in the example below, you could use the filter users.address to select the contents of the address array as the basis for rows.

{ "registration_date": "September 24, 2014", "users": [ { "name": "John Smith", "email": "[email protected]", "address": { "street": "111 Main St.",

Page 77: Data Ingest Guide

Data Ingest Guide - Define Datasets to Describe Data

Page 77

"city": "Madison", "state": "IL", "zip": "35460" } }, { "name": "Sally Jones", "email": "[email protected]", "address": { "street": "32 Elm St.", "city": "Dallas", "state": "TX", "zip": "23456" } }, { "name": "Jeff Hamm", "email": "[email protected]", "address": [ { "street": "101 2nd St.", "city": "San Mateo", "state": "CA", "zip": "94403" }, { "street": "505 5th St.", "city": "San Mateo", "state": "CA", "zip": "94403" } ] } ]}

Once the parser knows the root object to use as rows, the JSON object tree is displayed in the Parsing Controls panel. You can add fields contained within nested objects and arrays by selecting the field name in the JSON tree. The field is then added as a dataset column. You can browse through a sample of 20 records to check for fields to add.

If you unselect a field containing a nested object or array (remove it from the dataset), and later decide to select it again (add it back to the dataset), make sure that Format as JSON string is selected. This will format the contents of the field as a JSON string rather than as a regular string. This is important if you plan to do additional processing on the values using the JSON string functions.

In some cases, you may want to extract columns from an object one or more levels above the record filter in the JSON structure.

For example, in the JSON structure above, the Filter value would be users (use the objects in the users array as the basis for rows), but you may also want to include the registration_date object as a column. To capture upstream objects as columns, you can add the field by its path in the object tree (a short sketch after the following rules illustrates the path semantics):

• The $. notation starts the path from the base of the object tree hierarchy.

• To access fields within a nested object, specify a dot-separated path of field names (for example, top_level_field_name.nested_field_name).

• To extract a value from an array, specify the dot-separated path of field names and the array position, starting at 0 for the first value in an array, 1 for the second value, and so on (for example, field_name.0).

• If the name identifier contains dot or period characters within the name itself, escape the name by enclosing it in brackets (for example, [field.name.with.dot].[another.dot.field.name]).

• If the field name is null (empty), use brackets with nothing in between as the identifier, for example [].
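
Here is the promised sketch of those path rules. It is plain Python over a hypothetical record, only to show what the dot and index notation resolves to:

# Hypothetical record; Platfora's dot and index notation maps to
# ordinary nested access.
rec = {"top": {"nested": "x"},
       "phone": [{"number": "123-456-7890"}, {"number": "456-789-1007"}]}
print(rec["top"]["nested"])       # top.nested
print(rec["phone"][0]["number"])  # phone.0.number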

Parse XML Files

This section explains how to use the Platfora XML parser to create datasets based on XML files.

XML is a plain-text file format that encodes data using different components, such as elements and attributes, in a document hierarchy. The Platfora XML parser allows you to choose a top-level element to signify the starting point of a record or row, and attributes or elements to signify columns or fields.

What is XML?

Extensible Markup Language (XML) is a markup language for encoding documents. XML is a textual file format that can contain different components, including elements and attributes, in a document hierarchy.

A valid XML document starts with a declaration that states the XML version and document encoding. For example:

<?xml version= "1.0" encoding= "UTF-8"?>

An element is a logical component in the document. Elements always begin with an opening tag and end with a matching closing tag. Element content can contain text, markup, attributes, or other nested elements, called child elements. For example, here is a parent users element that contains individual child elements for each user:

<users> <user name="John Smith" email="[email protected]"/> <user name="Sally Jones" email="[email protected]"/>

Page 80: Data Ingest Guide

Data Ingest Guide - Define Datasets to Describe Data

Page 80

<user name="Jeff Hamm" email="[email protected]"/></users>

Elements can be empty. For example, this image element has no content between its opening and closing tags:

<image href="mypicture.jpg"/>

Elements can also have attributes in their opening tag. Attributes are name=value pairs that contain useful data about the element. For example, here is how you might list attributes of an element called address:

<address street="45 Pine St." city="Atlanta" state="GA" zip="53291"/>

Elements can also have both attributes and content. For example, this address element has the actual address components as attributes, and the address type as its content:

<address street="45 Pine St." city="Atlanta" state="GA" zip="53291">home address</address>

For details on the XML standard, go to http://www.w3.org/XML/.

Supported XML File Formats

This section describes how the Platfora XML parser expects XML files to be formatted, and how to specify what makes a record or row in an XML file. There are two general XML file formats supported by Platfora: XML Element per line and XML Document.

The XML Element per line format supports files containing one XML record per line, each record having the same top-level element and structure. For example, here is an XML file where each top-level element represents a user record, with one record per line.

<user name="John Smith" email="[email protected]"><phone type="mobile" number="123-456-7890"/></user><user name="Sally Jones" email="[email protected]"><phone type="home" number="456-789-1007"/></user><user name="Jeff Hamm" email="[email protected]"><phone type="mobile" number="789-123-3456"/></user>

The XML Document format supports valid XML document files (one document per file).

In the following example, the top-level XML element contains nested XML element records:

<?xml version= "1.0" encoding= "UTF-8"?>

<registration date="Aug 21, 2012"> <users> <user name="John Smith" email="[email protected]"/> <user name="Sally Jones" email="[email protected]"/> <user name="Jeff Hamm" email="[email protected]"/> </users></registration>

In the following example, the top-level XML element contains a sub-tree of nested XML element records:

<?xml version= "1.0" encoding= "UTF-8"?>

<registration date="Sept 24, 2014"> <region name="us-east"> <user name="Georgia" age="42" gender="F"> <address street="111 Main St." city="Madison" state="IL" zip="35460"/> <statusupdate type="registered"/> </user> <user name="Bobby" age="30" gender="M"> <address street="45 Pine St." city="Atlanta" state="GA" zip="53291"/> <statusupdate type="unsubscribed"/> </user> </region></registration>

Use the XML Parser

The Platfora XML parser takes a sample of the source data to determine the format of your XML files, and then shows the element and attribute hierarchy so you can choose the rows and columns to include in the dataset.

1. When you select data that is in valid XML format, Platfora recognizes the file format and chooses an XML parser.

2. The basis of a record or row depends on the format of your XML files. You can either choose to use each line in the file as a record (XML Element per line), or choose a child element in the XML document to use as the basis of a record (XML Document).

3. (XML Document formats only) You can add a Filter that determines which rows to include in the dataset. For more details, see Parsing Rows from XML Documents.

4. You can Expand the Parsing Options tab to make it wider. This allows you to more easily view and focus on the XML hierarchy.

5. Use the Record element tree to select the elements and attributes to include in the dataset as columns. You can browse up to 20 sample records when choosing columns. For more details, see Extracting Columns from XML Using the Element Tree.

6. The Data Field Path field allows you to add a column represented by its path in the element hierarchy. You can use any XPath 1.0 expression that is relative to the result of the row Filter. For more details, see Extracting Columns from XML Using an XPath Expression.

7. Sometimes you can't delete columns that are added by mistake. For example, the parser may incorrectly guess the row filter, or you might make a mistake adding columns using Data Field Path. If this happens, you can always hide these columns on the Manage Fields step.

For the XML Element per line format, each line in the file represents a row.

For the XML Document format, by default, the top-level element below the root of the XML document is used as the basis of rows. If you want to use different elements as the basis for rows, you can enter a Filter to specify the element name to use.

The Platfora XML parser supports an XPath-like notation for specifying which XML element to use as rows. As an example of how to use the Platfora XML parser filter notation, suppose you had the following XML document containing movie review records:

<?xml version= "1.0" encoding= "UTF-8"?>

<records> <movie title="Friday the 13th" year="1980"> <reviews> <review user="Pam" stars="3">a bit predictable</review> <review user="Alice" stars="4">classic slasher flick</review> </reviews> </movie> <movie title="The Exorcist" year="1984"> <reviews> <review user="Jo" stars="5">best horror movie ever</review> <review user="Bryn" stars="4">I will never eat pea soup again</review> <review user="Sam" stars="4">loved it</review> </reviews> </movie></records>

The document hierarchy is assumed to start one level below the root element. The root element would be the records element in this example. From this point in the document, you can use the following XPath-like notation to specify row filters:

//
Specifies all elements with the given name located within the previous element, no matter where they exist within the previous element. When used at the beginning of the row filter, this specifies all elements in the document with the given name.
Example: use any review element as the basis for rows:
//review

/
Specifies an element with the given name one level in the document hierarchy within the element listed before it. When used as the first character in the row filter, it specifies one level below the root element of the document.
Example: use the review element as the basis for rows:
/movie/reviews/review

$
Specifies an element in the row filter as an extraction point. An extraction point is an element in an XML row filter that allows you to define a variable that can be used to define a column definition expression relative to that element in the filter. The last element in a row filter is always considered an extraction point, so it is unnecessary to use the $ notation for the last element. You can specify zero or more extraction points in a row filter.
Extraction points give you more flexibility when extracting columns. Use an extraction point element at the beginning of a column definition to signify an expression relative to the extraction point element. You might want to use an extraction point to extract a column or attribute from a parent element one or more levels above the last element defined in the row filter. For example, for the row filter /a/$b/c/d you could write the following column definition: $b/xpath_expression
Use caution when adding an extraction point to the row filter. Platfora buffers all XML source data in an extraction point element during data ingest and when it builds a lens in order to extract column data. Depending on the source data, this may impact performance during data ingest and may increase lens build times.
Example: use the review element as the basis for rows while allowing the ability to extract reviews data for that row as column data:
/movie/$reviews/review

Note that for the XML structure above, the following row filter expressions are equivalent:

• movie

• /movie

• $/records/movie

• $//movie

For example, in this XML structure, the Filter value would be $users (use the collection of child elements contained in the users element as the basis for rows):

<?xml version= "1.0" encoding= "UTF-8"?>

<registration date="Sept 24, 2014"> <users> <user name="John Smith" email="[email protected]"/> <user name="Sally Jones" email="[email protected]"/> <user name="Jeff Hamm" email="[email protected]"/> </users></registration>

Once the parser knows the element to use as rows, the XML element tree is displayed in the Record panel. You can add fields based on XML attributes or nested XML elements by selecting the element or attribute name in the XML element tree. The field is then added as a dataset column. You can browse through a sample of 20 records in a single file to check for fields to add.

If you unselect a field containing nested XML elements (remove it from the dataset), and later decide to select it again (add it back to the dataset), make sure that Format as XML string is selected. This will format the contents of the field as XML rather than a regular string. This is important if you plan to do additional processing on the values using the XPATH string functions. For more details, see Parsing of Nested Elements and Content.

Another way to add columns is to enter an XPath expression in Data Field Path that represents a path in the element hierarchy. You might want to do this to extract columns from a parent element one or more levels above the row filter in the XML document hierarchy.

Note the following rules and guidelines when using an XPath expression to extract columns:

• The Platfora XML parser only supports XPath 1.0.

• The expression must be relative to the last element or any extraction point element in the row Filter.

• Platfora recommends starting the expression with a variable using the $element/ syntax. The element must be the last element or an extraction point element in the row Filter.

• XML namespaces are not supported. The XML parser strips all XML namespaces from the XML file.

• Variables are only allowed at the beginning of the expression.

For example, assume you have the following row filter: /movie/$reviews/review

You could create a column definition expression for any element or attribute in the document hierarchy that comes after the review element. Additionally, because the row filter includes an extraction point for $reviews, you could also create a column definition relative to that node: $reviews/xpath_expression.

For more information about XPath, see http://www.w3.org/TR/xpath/.
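
To see what the row filter /movie/reviews/review does in more familiar terms, here is a rough equivalent using Python's standard xml.etree.ElementTree module. This is only an illustration of the traversal, not Platfora's parser:

import xml.etree.ElementTree as ET

doc = """<records>
  <movie title="Friday the 13th" year="1980">
    <reviews>
      <review user="Pam" stars="3">a bit predictable</review>
    </reviews>
  </movie>
</records>"""

rows = []
for movie in ET.fromstring(doc).findall("movie"):      # one level below the root
    for review in movie.findall("reviews/review"):     # row filter path
        rows.append({"title": movie.get("title"),      # column from a parent element
                     "user": review.get("user"),
                     "stars": review.get("stars"),
                     "comment": review.text})
print(rows)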

If the element you are parsing contains nested XML elements and content, and you want to preserve the XML structure and hierarchy, select Format as XML string. This will allow you to do further processing on this data with the XPATH_STRING, XPATH_STRINGS, and XPATH_XML functions.

If the column contains nested elements and Format as XML string is not enabled, Platfora returns NULL.

Repeated elements are wrapped inside a <list> ... </list> parent element to maintain valid XML structure.

Parse Avro Files

The Platfora Avro parser supports Avro container files where the top-level object is an Avro record data type. The file must have a JSON-formatted schema declared at the beginning of the file, and the serialized data must be in the Avro binary-encoded format.

1. On the Parse step of the dataset creation wizard, select Avro as the File Type in the Parsing Options panel.

2. The Avro parser uses the JSON schema of the source file to extract the name:value pairs from each record object in the Avro file.

What is Avro?

Apache Avro is a remote procedure call (RPC) and data serialization framework. Its primary use is to provide a data serialization format for persistent data stored in Hadoop.

Avro uses JSON for defining schema, and JSON or binary format for data encoding. When Avro data is stored in a persistent file (called a container file), its schema is stored along with it. This allows any program to be able to read the serialized data in the file. For more information about the Avro schema and encoding formats, see the Apache Avro Specification documentation.

Avro to Platfora Data Type Mapping

Avro has a set of primitive and complex data types it supports. These are mapped to Platfora's internal data types.

Complex data types are imported into Platfora as a single JSON-formatted STRING. You can then use the JSON String Functions in the Platfora expression language to define new computed columns in the dataset that extract a particular name:value pair from the imported JSON structure.

Avro Data Type    Platfora Data Type
BOOLEAN           INTEGER
INT               INTEGER
LONG              LONG
FLOAT             DOUBLE
DOUBLE            DOUBLE
STRING            STRING
BYTES             STRING (Hex-encoded)
RECORD            STRING (JSON-formatted)
ENUM              STRING (JSON-formatted)
ARRAY             STRING (JSON-formatted)
MAP               STRING (JSON-formatted)
UNION             STRING (JSON-formatted)
FIXED             FIXED

Parse Parquet Files

The Platfora Parquet parser supports reading Parquet files by accessing them from a file system. The files must follow the Apache parquet-format project specification, which defines how to read Parquet files. Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem. You might want to define a dataset from Parquet files in a file system, such as HDFS, if you don't want Platfora to access a Hive metastore.

When creating a dataset from Parquet files, there is no need to define parsing controls in Platfora. Platfora uses the parquet-format project to obtain metadata about the source data, such as which files to process, the parsing logic for rows and columns, and the field names and data types contained in the source data.

Platfora requires that all Parquet files used in a dataset have the same schema.

For more information about the Parquet schema and encoding formats, see the Apache Parquet documentation.

Parquet to Platfora Data Type Mapping

Platfora maps the Parquet logical data types to Platfora's internal data types. The Parquet logical data types are comprised of a primitive data type and an annotation (also known as the original type), which is stored in Parquet as ConvertedType.

The Parquet DECIMAL data type is mapped to DOUBLE by default. This may result in a loss of precision due to round-off errors. You can choose to map DECIMAL columns to FIXED. This retains precision for numbers that have four or fewer digits after the decimal point, and loses precision for more precise numbers.

Complex (nested) data types, such as MAP and LIST, are imported into Platfora as a single JSON-formatted STRING. You can then use the JSON String Functions in the Platfora expression language to define new computed columns in the dataset that extract the values you want from the imported JSON structure.

Parquet Primitive Data Type    Original Data Type    Platfora Data Type
BOOLEAN                        none                  INTEGER
FLOAT                          none                  DOUBLE
DOUBLE                         none                  DOUBLE
INT32                          none                  INTEGER
INT32                          DECIMAL               DOUBLE
INT32                          DATE                  DATETIME
INT32                          TIME_MILLIS           STRING
INT32                          UINT_8                INTEGER
INT32                          UINT_16               INTEGER
INT32                          UINT_32               INTEGER
INT64                          none                  LONG
INT64                          DECIMAL               STRING
INT64                          TIMESTAMP_MILLIS      DATETIME
INT64                          INT_64                LONG
INT96                          none                  DATETIME
BINARY                         UTF8                  STRING
BINARY                         ENUM                  STRING
BINARY                         DECIMAL               STRING
FIXED_LEN_BYTE_ARRAY           none                  STRING
FIXED_LEN_BYTE_ARRAY           DECIMAL               DOUBLE
FIXED_LEN_BYTE_ARRAY           INTERVAL              STRING

Parse Web Access Logs

A web access log contains records about incoming requests made to a web server. Platfora has a built-in Web Access Log parser that automatically recognizes web access logs that adhere to the NCSA common or combined log formats.

1. On the Parse step of the dataset creation wizard, select Web Access Log as the File Type in the Parsing Options panel.

2. The Web Access Log parser extracts fields according to the supported NCSA log formats.

Supported Web Access Log Formats

Platfora supports web access logs that comply with the NCSA common or combined log formats. This is the log format used by many popular web servers (such as Apache HTTP Server).

An example log line for the common format looks something like this:

123.1.1.456 - - [16/Aug/2015:15:01:52 -0700] "GET /home/index.html HTTP/1.1" 200 1043

The NCSA common log format contains the following fields for each HTTP access record (a parsing sketch follows the list):

• Host - The IP address or hostname of the HTTP client that made the request.

• Logname - Identifies the client making the HTTP request. If no value is present, a dash (-) is substituted.

• User - The user name used by the client for authentication. If no value is present, a dash (-) is substituted.

• Time - The timestamp of the request in the format of dd/MMM/yyyy:hh:mm:ss +-hhmm.

• Request - The HTTP request. The request field contains three pieces of information: the requested resource (/home/index.html), the HTTP method (GET), and the HTTP protocol version (HTTP/1.1).

• Status - The HTTP status code indicating the success or failure of the request.

• Response Size - The number of bytes of data transferred as part of the HTTP request, not including the HTTP header.
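
Here is the promised sketch: a regular expression (in Python, purely illustrative) that splits the common-format example line above into the fields just described:

import re

COMMON = re.compile(
    r'(?P<host>\S+) (?P<logname>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)')

line = ('123.1.1.456 - - [16/Aug/2015:15:01:52 -0700] '
        '"GET /home/index.html HTTP/1.1" 200 1043')
print(COMMON.match(line).groupdict())
# {'host': '123.1.1.456', 'logname': '-', 'user': '-',
#  'time': '16/Aug/2015:15:01:52 -0700',
#  'request': 'GET /home/index.html HTTP/1.1',
#  'status': '200', 'size': '1043'}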

The NCSA combined log format contains the same fields as the common log format, with the addition of the following optional fields:

• Referrer - The URL that linked the requestor to your site. For example, http://www.platfora.com.

• User-Agent - The web browser and platform used by the requestor. For example, Mozilla/4.05 [en] (WinNT; I).

• Cookie - Cookies are pieces of information that the HTTP server can send back to a client along with the requested resource. A client browser may store this information and send it back to the HTTP server upon making additional resource requests. The HTTP server can establish multiple cookies per HTTP request. Cookie values take the form KEY=VALUE. Multiple cookie key/value pairs are delineated by semicolons (;). For example, USERID=jsmith;IMPID=01234.
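
Since the cookie field is just semicolon-delimited KEY=VALUE pairs, splitting it apart is a one-liner; a small illustrative check in Python:

cookie = "USERID=jsmith;IMPID=01234"
print(dict(pair.split("=", 1) for pair in cookie.split(";")))
# {'USERID': 'jsmith', 'IMPID': '01234'}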

For web access logs that do not conform to the default expected ordering of fields and data types, Platfora will make a best guess at parsing the rows and columns found in the web log files, and use generic column headers (for example, column1, column2, etc.). You can then rename the columns to match your web log format.

Parse Other File Types

For other file types that cannot be parsed using the built-in parsing controls, Platfora provides two generic parsers: Regex and Line. As long as your source data has one record per line, you can use one of these generic parsers to extract columns from semi-structured source data.

Parse Raw Lines with a Regular Expression

The Regex parser allows you to search lines in the source data and extract columns using a regular expression. It evaluates each line in the source data against a regular expression to determine if there is a match, and returns each capturing group of the regular expression as a column. Regular expressions are a way to describe a set of rows based on characteristics they share in common.

1. On the Parse step of the dataset creation wizard, select Regex as the File Type in the Parsing Options panel.

2. Enter a regular expression that matches the entire line, with parentheses around each column matching pattern you want to return.

3. Confirm the regular expression is correct by comparing the raw data to the wrangled data.

Platfora uses capturing groups to determine what parts of the regular expression to return as columns. The Regex line parser applies the user-supplied regular expression against each line in the source file, and returns each capturing group in the regular expression as a column value.

For example, suppose you had user records in a file, and the lines were formatted like this (no common delimiter is used between fields):

Name: John Smith Address: 123 Main St. Age: 25 Comment: Active
Name: Sally R. Jones Address: 2 E. El Camino Real Age: 32
Name: Rod Rogers Address: 55 Elm Street Age: 47 Comment: Suspended

You could use the following regular expression to extract the Full Name, Last Name Only, Address, Age, and Comment column values:

Name: (.*\s(\p{Alpha}+)) Address:\s+(.*) Age:\s+([0-9]+)(?:\s+Comment:\s+(.*))?
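
A quick way to verify the capturing groups is to try the expression against one of the sample lines. The sketch below uses Python's re module; Python has no \p{Alpha}, so [A-Za-z] stands in for it (illustrative only):

import re

pattern = re.compile(
    r'Name: (.*\s([A-Za-z]+)) Address:\s+(.*) Age:\s+([0-9]+)'
    r'(?:\s+Comment:\s+(.*))?')

line = "Name: Sally R. Jones Address: 2 E. El Camino Real Age: 32"
print(pattern.match(line).groups())
# ('Sally R. Jones', 'Jones', '2 E. El Camino Real', '32', None)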

Parse Raw Lines with Platfora Expressions

The Line parser simply returns each line in the source file as one column value, essentially not parsing the source data at all. This allows you to bypass the parsing step and instead define a series of computed fields to extract the desired column values out of each line.

1. On the Parse step of the dataset creation wizard, select the Line parser in the Parsing Options panel.

This creates a single column where each row contains an entire record.

2. Go to the Manage step.

3. Define computed fields that extract columns from the raw line.

Prepare Base Dataset Fields

When you first add a dataset, it only has its Base fields. These are the fields parsed directly from the raw source data. This section describes the tasks involved in making sure the base data is correct and ready for Platfora's analyst users. In some cases, the data values contained in the base fields may be ready for consumption. Most likely, however, the raw data values will need some additional processing. It is best practice to confirm and edit all of the base fields in a dataset before you begin defining computed field expressions to do any additional processing on the dataset.

Change the Dataset Sample Rows

Platfora displays a sample of dataset rows to facilitate the data ingest process. By default, the sample consists of the first 10,000 records, starting with the first file in the source location. You can change the number of sample rows to read from the source data to display in the dataset workspace.

When you change the number of rows of sample data displayed, Platfora reads the data from the raw files again and updates the statistics and field information for each field in the Inspector panel.

You may want to increase the number of sample rows to get more precise statistics on each field in the Inspector panel. You might want to decrease the number of sample rows to improve performance, especially if the dataset has a lot of fields.

1. Go to the Manage step of the dataset creation wizard.

Or, for an existing dataset, open the dataset workspace, and click the Data page.

2. Click the menu that lists the number of rows to display, and choose a different number of rows to view as the sample data.

Platfora reads the raw data from the source files again, recalculates the statistics for each field, and displays the rows in the dataset workspace.

Update the Dataset Sample Rows

Platfora displays a sample of dataset rows when creating or editing a dataset. You can force Platfora to read the source files again to determine the sample rows to display.

You might want to refresh the dataset sample rows when the dataset uses a wildcard pattern to determine its source files, and files have been added, deleted, or updated at the source. When you refresh the dataset sample rows, Platfora purges any sample rows stored in its internal cache and then reads the source files to obtain new sample rows to display.

Refreshing the sample rows only affects the data displayed in the dataset. It does not change the columns that define the dataset (schema). If you need to update the dataset schema, use the Refresh Schema button when editing the data source used in the dataset.

You can refresh sample rows for all dataset types except for elastic datasets, derived datasets, and datasets created from a Platfora plugin. To refresh the sample rows for a transformation dataset, you must first refresh the sample rows in the input datasets and then refresh the sample rows in the transformation dataset.

To refresh the dataset sample rows, you must have Edit or Own object permissions on the dataset.

1. To refresh the sample rows, open a dataset and go to the Data page.

2. Select Refresh Sample from the menu that lists the number of rows to display.

Update the Dataset Source Schema

Over time, a dataset's source schema may evolve and change. You may need to periodically re-parse the source data to pick up schema changes, such as when new columns are added in the source data.

Updating the dataset source schema in this way only applies to Hive and Delimited source data types.

Update Schema for Hive Datasets

Datasets based on Hive tables have the Parse step disabled in Platfora when creating a dataset and in the dataset workspace. This is because the Hive table definition is used to determine the dataset columns and their respective column order, column names, and data type information.

If the source data schema changes for a Hive-based data source, you would first update the table definition in Hive. Then in Platfora you can refresh the dataset schema to get the latest dataset columns from the Hive table definition.

1. Update the table in the Hive source system.

2. Edit the dataset in Platfora, and go to the Data page.

3. Click Edit Source Data.

4. Click Refresh Hive.

Platfora re-reads the table definition from Hive and displays the updated column order, names, and data types.

5. Save your changes.

Update Schema for Delimited Datasets

For datasets based on delimited text or comma-separated value (CSV) files, the only schema change that is supported is appending new columns to the end of a row. If new columns are added in the source data files, you can refresh the schema to pick up the new columns. Changing the column order (adding new columns in the middle of the row) is not supported for delimited datasets. For delimited datasets that have a header row, the base column names in the Platfora dataset definition must match the header column names in the source data file in order to use this feature.

Older source data rows that do not have the new appended columns will just have NULL (empty) values for those columns.

1. Edit the dataset in Platfora, and go to the Data page.

2. Click Edit Source Data.

3. Click Refresh Schema.

Platfora re-reads the schema from all files and displays the new base columns (as long as the new columns are appended at the end of the rows).

4. Save your changes.

Confirm Data Types

The dataset parser will guess the data type of a field based on the sampled source data, but you may need to change the data type depending on the additional processing you plan to do.

The expression language functions require input values to be of a certain data type. It is best practice to confirm and change the data types of your base fields before defining computed fields. Changing them later may introduce errors to your computed field expressions.

Note that you can only change the data type of a Base field. Computed field data types are determined by the return type of the computed expression.

1. Go to the Manage step of the dataset creation wizard.

2. Click the icon to expand the field header.

3. Select a column and verify the data type that Platfora assigned to the base field.

4. Change the data type in the field header or in the Inspector panel.

Note that you cannot accurately convert the data type of a field to DATETIME from the drop-down data type menus. See Cast DATETIME Data Types.

About Platfora Data Types

Each dataset field, whether a base or a computed field, has a data type attribute. The data type defines what kind of values the field can hold. Platfora has a number of built-in data types you can assign to dataset fields.

The dataset parser attempts to guess a field's data type by sampling the data. A base field's data type restricts the expressions you can apply to that field. For example, you can only calculate a sum with numeric fields. For computed fields, the expression's result determines the field's data type.

You may want to change a base field's data type to accommodate the computed field processing you plan to do. For example, many value manipulation functions require input values to be strings.

Platfora supports the following data types:

Table 1: Platfora Data Types

STRING
Description: variable-length non-Unicode string data
Range of values: maximum string length of 2,147,483,647

DATETIME
Description: date combined with a time of day with fractional seconds, based on a 24-hour clock
Range of values: date range January 1, 1753, through December 31, 9999; time range 00:00:00 through 23:59:59.997

FIXED
Description: fixed decimal values with accuracy to a ten-thousandth of a numeric unit
Range of values: -922,337,203,685,477.5808 through 2^63 - 1 (+922,337,203,685,477.5807), with accuracy to a ten-thousandth of a numeric unit

INTEGER
Description: 32-bit integer (whole number)
Range of values: -2,147,483,648 to 2,147,483,647

LONG
Description: 64-bit long integer (whole number)
Range of values: -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807

DOUBLE
Description: double-precision 64-bit floating point number
Range of values: 4.94065645841246544e-324d to 1.79769313486231570e+308d (positive or negative)

Change Field Names

The field name defined in the dataset is what users see when they browse the Platfora data catalog. In some cases, the field names imported from the source data may be fine. In other cases, you may want to change the field names to something more understandable for your users.

It is important to decide on base field names before you begin defining computed fields and references (joins), as changing a field name later on will break computed field expressions and references that rely on that field name.

1. Go to the Manage step of the dataset creation wizard.

2. Click the icon to expand the field header.

3. Select the field you want to rename.

4. Enter the new name in the field header or the Inspector panel.

If a name change breaks other computed field expressions or reference links in the dataset, the error panel will show all of the affected computed fields and references. You can either change the dependent field back to the original name, or edit the affected fields to use the new name.

Field names cannot start with the left parenthesis character: (.

Sort Dataset Sample Rows

You can sort the sample rows displayed in a dataset to more easily find values in a dataset field.

To sort the data displayed in the dataset sample rows, hover over a column header, click the dropdown menu, and choose a sort option.

Note that sorting the data in this way only affects the sample data displayed in the dataset. It does not sort the data in the lens or vizboard.

Add Field Descriptions

Field descriptions are displayed in the data catalog view of a dataset or lens, and can help users decide if a field is relevant for their needs. Data administrators should add helpful field descriptions that explain the meaning and data value characteristics of a field.

1. Go to the Manage step of the dataset creation wizard.

2. Select the column you want to add a description for.

3. In the Inspector panel, click inside the Description text box and enter a description.

Hide Columns from Data Catalog View

Hiding a column or field in a dataset definition removes it from the data catalog view of the dataset. Users cannot see hidden columns when browsing datasets in the data catalog, or select them when they build a lens.

1. Go to the Manage step of the dataset creation wizard.

2. Click the icon to expand the field header.

3. Select the column you want to hide.

4. In the expanded column header, check Hide Field.

OR

5. In the Inspector panel, click the Options tab, then check Hide Field.

Why Hide Dataset Columns?

A data administrator can control which fields of a dataset are visible to Platfora users. Hidden fields are not visible in the data catalog view of the dataset and cannot be selected for a lens.

You might choose to hide a field for the following reasons:

• Protect Sensitive Data. In some cases, you may want to hide fields to protect sensitive information. In Platfora, you can hide detail fields, but still allow access to summary information. For example, in a dataset containing employee salary information, you may want to hide sensitive identifying information such as names, job titles, and individual salaries, but still allow analysts to view average salary by department or job level. In database applications, this is often referred to as column-level security or column access control.

• Hide Unpopulated or Sparse Data Columns. You may have columns in your raw data that did not have any data collected, or the data collected is too sparse to be valid for analysis. For example, a web application may have a placeholder column for comments, but it was never implemented on the website, so the comments column is empty. Hiding the column prevents analysts from choosing a field with mostly null values when they go to build a lens.

• Control Lens Size. High cardinality dimension fields can significantly increase the size of a lens. Hiding such fields prevents analysts from creating large lenses unintentionally. For example, you may have a User ID field with millions of unique values. If you do not want analysts to be able to create a lens at that level of granularity, you can hide User ID, but still keep other dimension fields about users available, such as age or gender.

• Use Computed Values Instead of Base Values. You may add a computed field to transform the values of the raw data. You want your users to choose the transformed values, not the raw values. For example, you may have a return reason code column where the reason codes are numbers (1, 2, 3, and so on). You want to transform the numbers to the actual reason information (Did not Fit, Changed Mind, Poor Quality, and so on) so the data is more usable during analysis (see the sketch after this list).

• Hide Computed Fields that do Interim Processing. As you work on your dataset to cleanse and transform the data, you may need to add interim computed fields to achieve a final result. These are fields that are necessary to do a processing step, but are not intended for final consumption. These working fields can be hidden so they do not clutter the data catalog view of the dataset.
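
As a sketch of the reason code example above, such a computed field could use a CASE expression. The reason_code field name and the labels here are illustrative assumptions, not fields from any particular dataset:

CASE
  WHEN reason_code==1 THEN "Did not Fit"
  WHEN reason_code==2 THEN "Changed Mind"
  WHEN reason_code==3 THEN "Poor Quality"
  ELSE "unknown"
END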

Default Values and NULL Processing

If a field or column value in a dataset is empty, it is considered a NULL value. During lens processing, Platfora replaces all NULL values with a default value instead. Platfora lenses and vizboards have no concept of NULL values. NULLs are always substituted with the default field values specified in the dataset definition.

How Platfora Processes NULL Values

A value can be NULL for the following reasons:

• The raw data is missing values for a particular field.

• A computed field expression returns an empty or invalid result.

• A record in the focus (or fact) dataset does not have a corresponding record in a referenced (or dimension) dataset. During lens processing, any rows that do not join will use the default values in place of the unjoined dimension fields. For lenses that include fields from referenced datasets, Platfora performs an outer join between the focus dataset and any referenced datasets included in the lens. This means that rows in the fact dataset are compared to related rows in the referenced datasets. Any row that does not have a corresponding row in the referenced dataset is considered an unjoined foreign key. The dimension columns for unjoined foreign keys are treated as NULL and replaced with the default values.

A Platfora aggregate lens is analogous to a summary or roll-up table in a data warehouse. During lens processing, the measure values are pre-aggregated and grouped by each dimension field value included in the lens. Platfora calculates all computed fields, including measure fields, using NULL values. For example, Average (AVG) calculations exclude NULL values from the row count. After all computations are complete, Platfora substitutes the default value for any NULL value in a dimension or measure field.

Default Values by Data Type

If you do not specify your own default values in the dataset, the following default values are used in place of any NULL value. The default value depends on the data type of the field or column.

Data Type                                            Default Value
LONG, INTEGER, DOUBLE, FIXED                         0
STRING                                               NULL (as a string)
DATETIME                                             January 1, 1970 12:00:00.000 GMT
LOCATION (latitude,longitude coordinate position)    0,0

Change the Default Value for a Column

You can specify different default values on a per-column basis. These values will replace any NULL values in that column during lens build processing. Analysts will see the default values instead of NULL (empty) values when they are working with the data in a vizboard.

To change the default value for a column:

1. Go to the Manage step of the dataset creation wizard.

2. Select the column you want to edit.

3. In the Inspector panel, click the Default Value text box and enter the value to substitute for NULLs.

Bulk Upload Field Header Information

When a dataset has a lot of fields to manage, it may be easier to update several field names, data types, descriptions, and visibility settings all at once rather than editing each field one-by-one in the Platfora application. To do this, you can upload a comma or tab delimited text file containing the field header information you want to set.

1. Create an update file on your local machine containing the field information you want to update.

This file must meet the following specifications:

• It must be a comma-delimited or tab-delimited text file.

• It can contain up to four lines (separated by a new line). Any additional lines in the file will be ignored.

• Field names are specified on the first line of the file.

• Field data types are specified on the second line of the file (DOUBLE, FIXED, INTEGER, LONG, STRING, or DATETIME).

• Field descriptions are specified on the third line of the file.

• Field visibility settings are specified on the fourth line of the file (Hidden or Not Hidden).

• On a line, values must be specified in the column order of the dataset.

2. On the Parse step of the dataset creation wizard, click Upload Field Names on the Parsing Options tab. Find and open your update file.

3. After uploading the file, advance to the Manage step to confirm the results.

Example Update Files

Here is an update file that updates the field names, descriptions, data types, and visibility settings for the first four columns of a dataset.

UserID,Name,Email,Address
INTEGER,STRING,STRING,STRING
The unique user ID,The user's name,The email linked to this user's account,The user's mailing address
Hidden,Not Hidden,Not Hidden,Not Hidden

The double-quote character can be used to quote field names or descriptions. This is useful if a field name or description contains the delimiter character (comma or tab). For example:

UserID,Name,Email

,"The user's name, both first and last.",,The user's mailing address
HIDDEN

Notice how lines can be left blank, and values can be skipped over by leaving them out or by specifying nothing between the two delimiters. Missing values will not be updated in the Platfora dataset definition (the dataset will use the previously set values).

Transform Data with Computed Fields

The way you transform data in Platfora is by adding computed fields to your dataset definition. A dataset computed field contains an expression that describes a single data processing step.

Sometimes several steps are needed to achieve the result that you want. The result of a dataset computed field can be used in the expressions of other dataset computed fields, allowing you to define a chain of processing steps.

FAQs—Dataset Computed Fields

This section answers the most frequently asked questions (FAQs) about creating and editing dataset computed fields.

What kinds of things can I do with dataset computed fields?

Computed fields are useful for deriving meaningful values from base fields (such as calculating someone's age based on their birthday), doing data cleansing and pre-processing (such as grouping similar values together or substituting one value for another), or for computing new data values based on a number of input variables (such as calculating a profit margin value based on revenue and costs).
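
For instance, the profit margin example could be written as a single computed field expression using arithmetic operators. This is an illustrative sketch only; the revenue and cost field names are assumptions:

100 * (revenue - cost) / revenue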

Platfora has an extensive library of built-in functions that you can use to define data processing tasks. These functions are organized by the type of data they operate on, or the kind of processing they do. See the Expression Quick Dictionary for a list of what's available.

Can I do further processing on the results of a computed field?

Yes. A computed field is treated just like any other field in the dataset. You can refer to it in other computed field expressions or aggregate the results to create a measure.

To analyst users, computed fields are just like any other dataset field. Users can include them in a lens and analyze their results in a vizboard.

One exception is a computed field that uses an aggregate function in its expression (a measure). You cannot combine row functions and aggregate functions in the same expression. A row function cannot take a measure field as input. Per-row processing on aggregated data is not allowed.
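
For example, an expression such as the following would not be allowed, because it applies a row function (CONCAT) to the result of an aggregate function (SUM); the sales field name here is an assumption for illustration:

CONCAT(SUM(sales), " units")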

How do I edit a computed field expression in a dataset?

Go to the Manage step of the dataset creation wizard or the Data page of the dataset workspace, and find the computed column you want to edit. With the column selected, click the expression at the bottom of the workspace. This will open the expression builder.

How do I remove a computed field from a dataset?

Go to the Manage step of the dataset creation wizard or the Data page of the dataset workspace, find the computed column you want to edit, and click the X in the field header. Note that this might cause errors if other computed fields refer to the deleted field.

If you need the computed field for an interim processing step, but want to remove it from the selection of fields that the users see, you can hide it. Hiding a field keeps it in the dataset definition and allows it to be referred to by other computed field expressions. However, users cannot see hidden fields in the data catalog, or select them in a lens or vizboard. See Hide Columns from Data Catalog View.

Where can I find examples of useful computed field expressions?

Platfora's expression reference documentation has lots of examples of useful expressions. See Expression Language Reference.

Why isn't my computed field showing any sample values?

Certain types of computed field expressions can only be computed during lens build processing. Because of the complicated processing involved, the dataset workspace can't show sample results for:

• Measures (computed fields containing aggregate functions)

• Event Series Processing (computed fields containing PARTITION expressions)

• Computed field expressions that reference fields in other datasets

Why can't I change the data type of a computed field?

A computed field's data type is set by the output type of its expression. For example, a CONCAT function always outputs a STRING. If you want the output data type to be something else, you can nest the expression inside the appropriate data type conversion function. For example:

TO_INT(CONCAT(field1, field2))

Can analyst users add computed fields if they want?

Analyst users can't add computed fields to a dataset. You must be a data administrator and have the appropriate dataset permissions to edit a dataset.

In a vizboard, analyst users can create computed fields to manipulate the data they already have in their lens. With some exceptions, analyst users can add a vizboard computed field that can do almost anything that a dataset computed field can do.

However, event series processing (ESP) computed fields cannot be used to create vizboard computed fields.

Add a Dataset Computed Field

You can add a new dataset computed field on the Manage step of the dataset creation wizard or on the Data page of the dataset workspace of an existing dataset. A computed field has a name, a description, and an expression. The computed field expression describes a processing task you want to perform on other fields in the dataset.

Computed fields contain expressions that can take other fields as input. These fields can be base fields or they can be other computed fields. When you add a computed field, it appears as a new column in the dataset definition.

1. Go to the Manage step of the dataset creation wizard, or the Data page of the dataset workspace of an existing dataset.

2. Click Add Computed Field.

This creates a new, empty computed field titled New Field, and opens the expression builder at the bottom of the page containing the controls for defining its expression.

3. Enter a name for your field and a description.

The description is optional but very useful for others who will use the field later.

4. Double-click a function in the Functions list to add it to the expression area.

You can use the search feature to find a function in the list more quickly.

The expression area updates with the function's syntax.

5. Double-click a field in the Fields list to add it into the expression area.

You can use the search feature to find a field in the list more quickly.

Platfora inserts the field name in the expression area where the cursor is currently located.

6. Continue adding functions and fields into your expression until it is complete.

You can use a keyboard shortcut to undo changes you made to the expression. Use Ctrl+Z or Command+Z, depending on your operating system.

7. Make sure your expression is correct.

The system checks your syntax as you build the expression. The red area above the field's expression displays any error messages. Although you can save expressions that contain errors and save a dataset with one of these computed fields, the next time you open the dataset you will see no sample data in any field until all expressions evaluate successfully.

8. Click Done to save the new computed field in the dataset.

This saves the field and closes the expression builder.

Clicking another field on the page or the Update link also saves the changes to the computed field. If you click Update instead of Done, the field is saved and the field statistics are updated, but the expression builder remains open.

9. Check the computed column values in the Inspector Panel to make sure the expression logic is working as expected.

The dataset workspace can't show sample results for computed fields that operate on fields of a referenced dataset.

You can optionally choose to show or hide the Functions and Fields information in the expression builder by clicking Expression Reference. And if you make any changes to the expression, click Update to save the changes and show new values on the page and in the Inspector Panel.

10. If you're editing an existing dataset, click Save to save the new computed field to the dataset.

Expressions are an advanced topic. For information on working with the Platfora expression syntax, see Expressions Guide.

Duplicate a Dataset Computed Field

Some computed fields you need to create are similar to computed fields that already exist in the dataset, but with a minor variation. You can use the Duplicate Computed Field button to more easily create a new computed field based on an existing computed field. This button appears when you select an existing computed field.

When you duplicate a computed field, Platfora creates a new field with the same description and expression and adds it to the end of the field list. The field name is the same as the original with a number appended at the end to make it unique. You can edit the new field to modify it as necessary.

1. Go to the Manage step of the dataset creation wizard, or the Data page of the dataset workspace of an existing dataset.

2. Select an existing computed field.

3. Click Duplicate Computed Field.

This creates a new field with the same description and expression as the selected computed field, and adds it to the end of the field list. The field name is the same as the original with a number appended at the end to make it unique.

4. Edit the expression and field name as necessary.

Add Binned Fields

A binned field is a special kind of computed field that groups ranges of values together to create new categories. The dataset workspace has tools for quickly creating bins on numeric type fields. Binned fields are a way to reduce the number of values in a high-cardinality column, or to group data in a way that makes it easier to analyze. For example, you might want to bin the values in an age field into categories such as under 18, 19 to 29, 30 to 39, and 40 and over.

Bin Numeric Values

You can bin numeric values by adding a binned quick field. As you define the bin intervals, Platfora displays a histogram of the data distribution for the source rows currently shown in the dataset workspace.

1. Open the dataset workspace for an existing dataset, and click the Data page.

Or, go to the Manage step of the dataset creation wizard.

2. Select a column that is a numeric data type.

3. In the Inspector panel, click the Options tab.

4. Click Create Bins.

5. Choose a Bin Type and enter your bin intervals.

• Even Intervals will group numeric values into even numbered bins. The bin value that is returned for a row is determined by rounding down the original value to the starting value of its containing bin. For example, if the interval is 10, then a value of 5 would return 0, and a value of 11 would return 10.

• Custom Intervals groups values into user-defined ranges and assigns a text label to each range. For example, suppose you had an Age field that was in years and you wanted to bin the values into different age intervals. The values you enter create a range between the starting value and the ending value. Note that Platfora creates additional bins for all values that are not covered by a bin you define. For example, it creates a bin for all values less than the lowest bin, and it creates a bin for all values greater than the highest bin. So if the lowest bin you define has a starting value of 16, the starting range would be less than 16 years. If the highest bin you define has an ending value of 55, the ending range would be over 55 years.

With custom intervals, the values you enter should correspond to the data type. For example, an integer would have values such as 60, 120, etc. A double would have values such as 60.00, 120.00, etc.

The value returned for a row in a field with custom interval bins is always text, so the binned field has the STRING data type.

6. Look at the bins Platfora will create based on the defined intervals and the histogram showing the distribution of values in the sample data. Adjust the bins based on the results if necessary.

7. Click Create Binned Field.

Platfora creates a new computed field and adds it to the dataset.

8. Click the newly created field in the dataset workspace, and verify that the returned values are calculated as expected.

9. (Optional) In the Inspector panel, edit the name and description of the new binned field.

10. Save the dataset.

Bin Text Values

If you want to bin text or STRING values, you can define a computed field that groups values together using a CASE expression.

For example, here is a CASE expression to bucket values of a name field together by their first letter:

CASE
  WHEN SUBSTRING(name,0,1)=="A" THEN "A"
  WHEN SUBSTRING(name,0,1)=="B" THEN "B"
  WHEN SUBSTRING(name,0,1)=="C" THEN "C"
  WHEN SUBSTRING(name,0,1)=="D" THEN "D"
  WHEN SUBSTRING(name,0,1)=="E" THEN "E"
  WHEN SUBSTRING(name,0,1)=="F" THEN "F"
  WHEN SUBSTRING(name,0,1)=="G" THEN "G"
  WHEN SUBSTRING(name,0,1)=="H" THEN "H"
  WHEN SUBSTRING(name,0,1)=="I" THEN "I"
  WHEN SUBSTRING(name,0,1)=="J" THEN "J"
  WHEN SUBSTRING(name,0,1)=="K" THEN "K"
  WHEN SUBSTRING(name,0,1)=="L" THEN "L"
  WHEN SUBSTRING(name,0,1)=="M" THEN "M"
  WHEN SUBSTRING(name,0,1)=="N" THEN "N"
  WHEN SUBSTRING(name,0,1)=="O" THEN "O"
  WHEN SUBSTRING(name,0,1)=="P" THEN "P"
  WHEN SUBSTRING(name,0,1)=="Q" THEN "Q"
  WHEN SUBSTRING(name,0,1)=="R" THEN "R"
  WHEN SUBSTRING(name,0,1)=="S" THEN "S"
  WHEN SUBSTRING(name,0,1)=="T" THEN "T"
  WHEN SUBSTRING(name,0,1)=="U" THEN "U"
  WHEN SUBSTRING(name,0,1)=="V" THEN "V"
  WHEN SUBSTRING(name,0,1)=="W" THEN "W"
  WHEN SUBSTRING(name,0,1)=="X" THEN "X"
  WHEN SUBSTRING(name,0,1)=="Y" THEN "Y"
  WHEN SUBSTRING(name,0,1)=="Z" THEN "Z"
  ELSE "unknown"
END

Expressions are an advanced topic. For information on working with Platfora expressions and their component parts, see Expressions Guide.

Add Measures for Quantitative Analysis

A measure is a special type of computed field that returns an aggregated value for a group of records. Measures provide the basis for quantitative analysis when you build a lens or visualization in Platfora. Every dataset, lens, or visualization must have at least one measure. There are a couple of ways to add measures to a dataset.

FAQs - Dataset Measures

This section describes the basic concept of measures, and why they are needed in a Platfora dataset. Measures are necessary if you plan to build aggregate lenses from a dataset, and use the data for quantitative analysis.

What is a measure?

Measures provide the basis for quantitative analysis in a visualization or lens query. A measure is a numeric value representing an aggregation of values from multiple rows. For example, measures contain data such as total dollar amounts, average number of users, count distinct of users, and so on.

Measure values always result from a computed field that uses an aggregate function in its expression. Examples of aggregate functions include COUNT, DISTINCT, AVG, SUM, MIN, MAX, VARIANCE, and so on.
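
For example, a measure that totals a numeric field could be defined with an aggregate expression such as the following (the sale_amount field name is an illustrative assumption):

SUM(sale_amount)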

Why do I need to add measures to a dataset?

In some data analysis tools, measures (or metrics as they are sometimes called) can be aggregated at the time of analysis because the amount of data to aggregate is relatively small. In Platfora, however, the data in a lens is pre-aggregated to optimize performance of big data queries. Therefore, you must decide how to aggregate the metrics of your dataset up front. You do this by defining measures either in the dataset or at lens build time. When you go to analyze the data in a vizboard, you can only do quantitative analysis on the measures you have available in the lens.

How do I add measures to a dataset?

You add measures from the Manage step of the dataset creation wizard, or from the Data page of the dataset workspace. From there, click the Configure Aggregations button. Platfora opens the Configure Aggregations page.

There are a couple of ways to add measures to a dataset:

1. Choose quick measure aggregations on certain columns of the dataset. This way, if a user chooses the field in the lens builder, they will automatically get the measure aggregations you have selected. Users can always override quick measure selections if they want.

2. Define an aggregate expression that gets added to the dataset as a (computed) field. Measures computed in this way allow data administrators more control over how the data is aggregated, and what level of detail is available to users. For example, you may want to prevent users from seeing the original values of a salary field, but allow users to see averages or percentiles of salary data (see the sketch after this list). Also, more complex aggregate calculations, such as standard deviation or ranking, can only be done with computed field expressions.

3. Use the default measure. Every dataset has one default measure called Total Records, which is a simple count of dataset records. You can't delete Total Records from the dataset.
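
As a sketch of the salary example mentioned above, aggregate expression measures might look like the following, assuming a salary base field (which you could then hide):

AVG(salary)
VARIANCE(salary)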

Can analyst users add their own measures if they want?

Analyst users can always choose quick measure aggregations when they go to build a lens, but they can't add computed measures to a dataset. You must be a data administrator and have the appropriate dataset permissions to add computed fields to a dataset.

In a vizboard, users can manipulate the measure data they already have in their lens. They can use ROLLUP and window functions to compute measure results over different time frames or categories.

Most aggregate calculations must be computed during lens build processing. However, a few aggregate expressions are allowed without having to rebuild the lens: DISTINCT, MIN, and MAX can be used to define new measures in the vizboard.
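
For example, a count of unique users could be added as a new measure in the vizboard without a lens rebuild (the user_id field name is an assumption for illustration):

DISTINCT(user_id)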

The Default 'Total Records' Measure

Platfora automatically adds a default measure to every dataset you create. This measure is called Total Records, and it counts the number of records (or rows) in the dataset. You can change the name, description, or visibility of this default measure, but you cannot delete it. When you build a lens from a dataset, this measure is always selected by default.

Add Quick Measures

If you have a field in your dataset that you want to use for quantitative analysis, you can select that field and quickly add measures to the dataset. A quick measure sets the default aggregation(s) to use when a user builds a lens.

Quick measures are an easy way to add measures to a dataset without having to define new computed fields or write complicated expressions. Quick measures are added to a field in a dataset, and they set the default measures to create if a user chooses that field for their lens. Users can always decide to override the default measure selections when they define a lens.

1. Go to the Manage step of the dataset creation wizard.

Or, for an existing dataset, open the dataset workspace, and click the Data page.

2. Click Configure Aggregations.

This opens the Configure Aggregations area where you can define both computed field aggregate expressions and quick measures.

3. On the Quick Measures tab, choose which fields should be aggregated by default.

DISTINCT (count of distinct values) is available for all field types. MIN (lowest value) and MAX (highest value) are available for numeric-type or datetime-type fields. SUM (total) and AVG (average) are available for numeric-type fields only.

You can select multiple fields and apply the same aggregate function to them simultaneously. Also, for datasets with a lot of fields, you can filter the list of fields and sort the field list by clicking any column in the Quick Measures table. For example, you can filter the list of fields, select all fields in the list, and quickly apply the same quick measure to all selected fields.

4. Click OK when you're done defining all measure fields.

Platfora returns to the Manage step of the dataset creation wizard, or the Data page of the dataset workspace.

5. Select one of the fields for which you defined a quick measure.

6. Click the Options tab of the Inspector panel and look at the Field Role section.

Platfora lists the possible ways to use this field when it is selected for a lens, either as a measure only, or as a measure and a dimension field.

In most cases, fields that are intended to be used as measures (aggregated data only) should not have Measure & dimension selected, as this can cause the lens to be larger than intended.

Add Computed Measures

In addition to quick measures, you can create more sophisticated measures using computed field expressions. A computed field expression containing an aggregate function is considered a measure.

1. Go to the Manage step of the dataset creation wizard.

Or, for an existing dataset, open the dataset workspace, and click the Data page.

Review the sample data values before writing your measure expression.

2. Click Configure Aggregations.

This opens the Configure Aggregations area where you can define both computed field aggregate expressions and quick measures.

3. Go to the Aggregate Expressions tab, and click Add Aggregate Expression.

4. Enter a name for your field and a description.

The description is optional but very useful for others who will use the field later.

5. Double-click an aggregate function from the list to add it to the Expression area.

The Expression panel updates with the function's template. Also, the Fields list refreshes with those fields you can use with the function. For example, MIN and MAX functions can only aggregate numeric or datetime data types.

6. Double-click a field to add it into the Expression area.

7. Continue adding functions and fields into your expression until it is complete.

Aggregate functions can only take fields or literal values as input.

8. Make sure your expression is correct.

The system checks your syntax as you build the expression. The area below the Expression area displays any error messages. You can save expressions that contain errors, but will not be able to save the dataset until all expressions evaluate successfully.

9. Click Add to add the new computed measure field to the dataset.

Your new field appears in the dataset. At this point, the field has no sample values. This is expected for measure fields. As an aggregate field, it depends on a defined group of input rows to calculate a value.

10. (Optional) Hide the field you used as input to the aggregate function.

Hiding the input field is useful when only the aggregated data is used for future analysis.

Expressions are an advanced topic. For information on working with Platfora's expression syntax, see Expressions Guide.

Prepare Date/Time Data for Analysis

Working with time series data is an important part of data analysis. To prepare time-based data for analysis, you must tell Platfora which fields of your dataset contain DATETIME type data, and how your timestamp fields are formatted. This allows users to analyze data chronologically and see trends in the data over time.

FAQs—Date and Timestamp Processing

This section answers the common questions about how Platfora handles date and time data in a dataset. Date and time data should be assigned to the DATETIME data type for Platfora to recognize it as a date or timestamp.

In what format does Platfora store timestamp data?

Internally, Platfora stores all DATETIME type data in UTC format (coordinated universal time). If your timestamp data does not have a time zone component to it, Platfora uses the local timezone of the Platfora server.

When time-based data is in DATETIME format it can be ordered chronologically. You can also use the DATETIME processing functions to calculate time intervals between two DATETIME fields. For example, you can calculate the time difference between an order date and a ship date field.
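
As an illustrative sketch, such a calculation might look like the following; the DAYS_BETWEEN-style function name and the order_date and ship_date fields are assumptions for this example:

DAYS_BETWEEN(ship_date, order_date)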

How does Platfora parse timestamp data?

There are a handful of timestamp formats that Platfora can recognize automatically. On the Parse Data step of the dataset workspace, pay attention to the data type assigned to your timestamp columns. If the data type is DATETIME, then Platfora was able to parse the timestamp correctly.

If the data type is STRING, then Platfora was not able to parse the timestamp correctly. You will have to create a computed field to tell Platfora how your date/time data is formatted. See Cast DATETIME Data Types.

Why are all my dates/times 1970-01-01T00:00:00.000Z (January 1, 1970 at 12:00 AM)?

This is the default value for the DATETIME data type in Platfora. If you see this value in your date or timestamp columns, it could mean:

• Platfora does not recognize the format of your timestamp string, and was not able to parse it correctly.

• The timestamp string represented an invalid date. For example, an invalid date could be February 31 (of any year with any time), or the day and hour when Daylight Saving Time (DST) begins, such as March 13, 2016 at 02:00 AM.

• Your data values are NULL (empty). Check the raw source data to confirm.

• Your data does not have a time component (or a date component) to it. Platfora only has one data type for dates and times: DATETIME. It does not have just DATE or just TIME. If one of these components is missing in your timestamp data, the defaults will be substituted for the missing information. For example, if you had a date value that looks like this: 04/30/2014, then Platfora will convert it to this: 2014-04-30T00:00:00.000Z (the time is set to midnight).

What are the Date and Time datasets for?

Slicing and dicing data by date and time is a very common reporting requirement. Platfora's built-inDate and Time datasets allow users to explore time-based data at different granularities in a vizboard.For example, you can explore date-based data by day, week, or month or time-based data by hour,minute, or second.

Slicing and dicing data by date and time is a very common reporting requirement. Platfora's built-in Date and Time datasets allow users to explore time-based data at different granularities in a vizboard. For example, you can explore date-based data by day, week, or month, and time-based data by hour, minute, or second.

Why does my dataset have all these date and time references when I didn't add them?

Every DATETIME type field in a dataset automatically generates two references: one to the built-in Date dataset and one to the built-in Time dataset. These datasets have a built-in hierarchy that allows users to explore dates at different granularities.

How do I remove the automatic references to Date and Time?

You cannot remove the automatic references to Date and Time, but you can rename them or hide them.

Cast DATETIME Data Types

If Platfora can recognize the format of a date field, it will automatically cast it to the DATETIME data type. However, some date formats are not automatically recognized by Platfora and need to be converted to DATETIME using a computed field expression.

1. Go to the Manage step of the dataset creation wizard.

Or, for an existing dataset, open the dataset workspace, and click the Data page.

2. Find your base date field.

If the data type is STRING and not DATETIME, that means that Platfora could not automatically parse the date format.

3. Choose Add Computed Field.

4. Enter a name for the new computed field.

5. Write an Expression using the TO_DATE function. This function converts values to DATETIME using the date format you specify (see the example after these steps).

6. Click Add to add the computed field to the dataset.

7. Verify that the new DATETIME field is formatting the date values correctly. Also check that the automatic references to the Date and Time datasets are created.
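
For example, a TO_DATE expression for date strings such as 04/30/2014 might look like the following; the order_date field name is an assumption, and the format pattern must match how your source dates are actually written:

TO_DATE(order_date,"MM/dd/yyyy")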

About Date and Time References

Every DATETIME type field in a dataset automatically generates two references: one to the built-in Date dataset and one to the built-in Time dataset. These datasets have a built-in hierarchy that allows users to explore dates at different granularities. You cannot remove these auto-generated references, but you can rename or hide them.

1. Go to the Relationships page of the dataset workspace of an existing dataset.

For each DATETIME type field in the dataset, you will see two references: one to Date and one to Time.

2. Click a reference to select it.

3. In the Inspector panel, you can edit the reference name or description. This is the reference name as it will appear to users in the data catalog.

4. If you don't want a reference to appear in the data catalog at all, you can hide it.

5. If you make changes, you must click Save for the changes to take effect.

About the Default 'Date' and 'Time' Datasets

Slicing and dicing data by date and time is a very common reporting requirement. Platfora allows you to analyze date and time-based data at different granularities by automatically linking DATETIME fields to Platfora's built-in Date and Time dimension datasets.

The source data for these datasets is added to the Hadoop file system when Platfora first starts up (in /platfora/system by default). If you have a different fiscal calendar for your business, you can either replace the built-in datasets or add additional ones and link your datasets to those instead. You cannot delete the default Date and Time references; however, you can hide them if you do not need them.

The Date dataset has Gregorian calendar dates ranging from January 1, 1800 to December 31, 2300. Each date is broken down into the following columns:

Date Dimension Columns:

Column            Data Type   Description
Date              DATETIME    A single date in the format yyyy-MM-dd, for example 2014-10-31. This is the key of the Date dataset.
Day_of_Month      INTEGER     The day of the month from 1-31
Day_of_Year       INTEGER     The day of the year from 1-366
Month             INTEGER     Calendar month, for example January 2014
Month_Name        STRING      The month name (January, February, etc.)
Month_Number      INTEGER     The month number where January=1 and December=12
Quarter           STRING      The quarter number with year (Q1 2014), where quarters start on January 1, April 1, July 1, or October 1
Quarter_Name      STRING      The quarter number without year (Q1), where quarters start on January 1, April 1, July 1, or October 1
Week              INTEGER     The week number within the year, where week 1 starts on the first Monday of the calendar year
Weekday           STRING      The weekday name (Monday, Tuesday, etc.)
Weekday_Number    INTEGER     The day of the week, where Sunday is 1 and Saturday is 7
Work_Day          STRING      One of two values: Weekend (Saturday, Sunday) or Weekday (Monday - Friday)
Year              INTEGER     Calendar year, for example 2014

The Time dataset has each time of day divided into different levels of granularity, from the most general (AM/PM) to the most detailed (Time in Seconds).

Prepare Drill Paths for Analysis

Adding a drill path to a dataset allows vizboard users to drill down to more granular levels of detail in a viz. A drill path is defined in a dataset by specifying a hierarchy of dimension fields.

For example, a Product drill path might have categories for Division, Type, and Model. Drill path levels depend on the granularity of the dimension fields available in the dataset.

FAQs - Drill Paths

This topic answers some frequently asked questions about defining and using drill paths in Platfora.

What is a drill path?

A drill path is a hierarchy of dimension fields, where each level in the hierarchy is a sub-division of the level above. For example, the default drill path on Date starts with a top-level category of Year, sub-divided by Quarter, then Month, then Date. Drill paths allow vizboard users to interact with data in a viz. Users can double-click on a mark in a viz (or a cell in a cross-tab) to navigate from summarized to detailed levels of categorization.

Who can define a drill path?

You must have the Data Administrator system role or above, have Edit permissions on the dataset, and have data access permissions to the datasets included in the drill path hierarchy in order to define a drill path.

Any user can navigate a drill path in a viz or cross-tab (provided they have sufficient data access permissions).

Where are drill paths defined?

Drill paths are defined in a dataset. You can define drill paths when adding a new dataset or when editing an existing one. Choose Add > Drill Path in the dataset workspace.

Can a field be included in more than one drill path?

Yes. A dataset can have multiple drill paths, and the same fields can be used in more than one drill path. However, there is currently no way for a user to choose which drill path they want in a vizboard if a field has multiple paths. The effective drill path will always be the path that comes first alphabetically (by drill path name).

For example, the Date dataset has two pre-defined drill paths: YQMD (Year > Quarter > Month > Date) and YWD (Year > Week > Date). If a user adds the Year field to a viz, they should be able to choose between Quarter or Week as the next drill level. However, since there is no way to choose between multiple drill paths, the current behavior is to pick the first drill path (YQMD in this case). The ability to choose between multiple drill paths will be added in a future release.

Can I define a drill path on a single column of a dataset?

No. A drill path is a hierarchy of more than one dimension field.

If you want to drill on different granularities of data contained in a single column, you can create computed fields to bin or bucket the values at different granularities. See Add Binned Fields.

For example, suppose you had an age field, and wanted to be able to drill from age in 10-year increments, to age in 5-year increments, to actual age. To accomplish this, you'd first need to define two additional computed fields: age-by-10 (10-year buckets) and age-by-5 (5-year buckets). Then you could create a drill path hierarchy of age-by-10 to age-by-5 to age, as sketched below.
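
A minimal sketch of the two bucketing expressions, assuming an INTEGER age field and a FLOOR-style rounding function (both assumptions here; a CASE expression that tests ranges would work as well):

10 * FLOOR(age/10)
5 * FLOOR(age/5)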

Can a drill path include fields from more than one dataset?

Yes. A drill path can include fields from the focus dataset, as well as from any datasets that it references. For example, you can define one drill path that includes fields from both the Date and Time datasets via their associated references.

Are there any default drill paths defined?

Yes. The built-in datasets for Date and Time have default drill paths defined. Any DATETIME type fields that reference these datasets will automatically include these default drill paths.

Platfora recommends leaving the default Date and Time drill paths as is. You can always override the default Date and Time drill paths by defining your own drill paths in the datasets that you create.

Why do the Date and Time datasets have multiple drill paths defined?

The built-in datasets for Date and Time are automatically referenced by any dataset that contains a DATETIME type field. These datasets include some built-in drill paths to facilitate navigation between different granularities of dates and times. You may notice that the Date dataset has two pre-defined drill paths, and the Time dataset has four.

The multiple drill paths accommodate different ways of dividing date and time. In each drill path hierarchy, each level is evenly divisible by the next level down. This ensures consistent drill behavior for whatever field is used in a viz.

What things should I consider when defining a drill path?

A couple of things to consider when defining drill paths:

• Consistent Drill Levels. Levels in the hierarchy should ideally be evenly divisible subsets of each other. For example, in the Time dataset, the drill increments go from AM/PM to Hour by 6 to Hour by 3 to Hour. Each level in the hierarchy is evenly divisible by levels below it. This ensures consistent drill-down navigation in a viz.

• Alphabetical Drill Path Names. When a field participates in multiple drill paths, the effective drill path is the one that comes first alphabetically. Plan your drill path names accordingly.

• The Lens Decides the Drill Behavior. Ultimately, the fields that are included in the lens will dictate the drill path levels available in a vizboard. If a level in the drill path hierarchy is not included in the lens, it is simply skipped by the drill-down navigation. Consider defining one large drill path hierarchy with all possible levels, and then use the lens field selections to control the levels of granularity applicable to your analysis.

• Aggregate Lenses Only. Viz users can only navigate through a drill path in a viz that uses an aggregate lens. Drill paths are not applicable to event series lenses.

How do I include a drill path in my lens or vizboard?

To include a drill path in a lens or vizboard, simply choose the fields that you are interested in analyzing. As long as there is more than one field from a given drill path in the lens, then drill-down capabilities are automatically included. The lens builder does not currently indicate if a field is a member of a drill path or not.

You do not have to include every level of the drill path hierarchy in a lens -- the vizboard drill-down behavior can skip levels that are not present. For example, if you have defined a drill path that goes from year to month to day, but you only have year and day in your lens, the effective drill path for that lens then becomes year to day (month is skipped).

Add a Drill Path

Drill paths are defined in a dataset. You can define drill paths when adding a new dataset or when editing an existing one.

1. Open the dataset workspace for an existing dataset, and click the Data page.

Or, go to the Manage step of the dataset creation wizard.

2. Click Configure Drill Paths.

3. Click Add a Drill Path.

4. Enter a name for the drill path.

Keep in mind that drill path precedence is determined alphabetically by name whenever a field is part of multiple drill paths.

5. Add the fields that you want to include in the drill path. You can include fields from a referenced dataset as well.

6. Use the up and down arrows to set the drill hierarchy order. The most general categorization should be on top, and the most detailed categorization should be on the bottom.

7. Click Add.

8. Click OK.

Define the Dataset Primary Key

A primary key is a single field (or combination of fields) that uniquely identifies a row in a dataset, similar to a primary key in a relational database. Dataset primary keys are required for defining relationships between datasets.

You need to define a primary key in a dataset if:

• You plan to join to it from another dataset (it is the target of a reference).

• You want to use it as the focus of an event series lens.

• You want to define segments on this dataset.

1. Open the dataset workspace, and click the Data page.

Or, in the dataset creation wizard, click Next until you reach the Manage step.

2. Click Edit Primary Key.

3. Select one or more fields in the column header to include in the primary key.

A dataset may have a compound key (a combination of fields that are the unique identifier). Select each field that comprises the key.

4. Click OK to apply the changes to the primary key.

5. Click Save to save the changes to the dataset.

Model Relationships Between Datasets

This section explains the relationships between datasets, and how to model dataset references, events, and elastic datasets in Platfora to support the type of analysis you want to do on the data.

Understand Data Modeling in Platfora

This section explains the different kinds of relationships you can model between datasets to support quantitative analysis, event series analysis, and/or behavioral segment analysis.

The Fact-Centric Data Model

A fact-centric data model is centered around a particular real-world event that has happened, such as web page views or sales transactions. Datasets are modeled so that a central fact dataset is the focus of an analysis, and dimension datasets are referenced to provide more information about the fact. In data warehousing and business intelligence (BI) applications, this type of data model is often referred to as a star schema.

For example, you may have web server logs that serve as the source of your central fact data about pages viewed on your web site. Additional dimension datasets can then be related (or joined) to the central fact to provide more in-depth analysis opportunities.

In Platfora, you would model dataset relationships in this way to support the building of aggregate lenses for quantitative data analysis. Fact-centric data modeling involves the following high-level steps in Platfora:

1. Define a key in the dimension dataset. A key is one or more dataset columns that uniquely identify a record.

2. Create a reference in your fact dataset that points to the key of the dimension dataset.

How References Work in Platfora

Creating a reference allows the datasets to be joined when building aggregate lenses and executing aggregate lens queries, similar to a foreign key to primary key relationship between tables in a relational database.

Once you have added your datasets, you can model the relationships between them by adding references in your dataset definitions. A reference is a special kind of field in Platfora that points to the key of another dataset. A reference is created in a fact dataset and points to the key of a dimension dataset.

Upstream datasets point to other datasets. Downstream datasets are the datasets being pointed to. For example, the Page Views dataset is upstream of the Visitors dataset, and the Visitors dataset is downstream of Page Views.

Once a reference is created, the fields of all downstream datasets are available through the dataset where the reference was created. Data administrators can define computed expressions using downstream dimension fields, and analyst users can choose downstream dimension fields when they build a lens. Measure fields, however, are not available through a reference.

The Entity-Centric Data Model

An entity-centric data model 'pivots' a fact-centric data model to focus an analysis around a particular dimension (or entity). Modeling the data in this way allows you to do event series analysis, behavioral analysis, or segment analysis in Platfora.

For example, suppose you had a common dimension that spanned multiple facts. In a relational database, this is sometimes referred to as a conforming dimension. In this example, our conforming dimension is customer.

Modeling the fact datasets around a central customer dataset allows you to analyze different aspects of a customer's behavior. For example, instead of asking "how many customers visited my web site?" (fact-centric), you could ask questions like "which customers visit my site more than once a day?" or "which customers are most likely to respond to a direct marketing campaign?" (entity-centric).

In Platfora, you would model dataset relationships in this way to support the building of event series lenses and/or segments for behavioral data analysis. Entity-centric data modeling involves the following high-level steps in Platfora:

1. Identify or create a dimension dataset to serve as the common entity you want to analyze. If your existing data is comprised only of fact datasets, you can create an elastic dataset (a virtual dimension used to model entity-centric relationships).

2. Define a key for the dimension dataset. A key is one or more dataset columns that uniquely identify a record.

3. Create references in your fact datasets that point to the key of the common entity dimension dataset.

4. Model events in your common entity dimension dataset.

How Events Work in Platfora

An event is similar to a reference, but the direction of the join is reversed. An event joins the primary key field(s) of a dimension dataset to the corresponding foreign key field(s) in a fact dataset, plus designates a timestamp field for ordering the event records.

Adding event references to a dataset allows you to define an event series lens from that dataset. An event series lens can contain records from multiple fact datasets, as long as the event references have been modeled in the dimension dataset.

For example, suppose you had a common dimension dataset (customer) that was referenced by multiple fact datasets (clicks, emails, calls). By creating different events within the customer dataset, you can build an event series lens from customer that allows you to analyze different aspects of a customer's behavior.

By looking at different customer events together in a single lens, you can discover additional insights about your customers. For example, you could analyze customers who were the target of an email or direct marketing campaign who then visited your website or made a call to your call center.

How Elastic Datasets Work in Platfora

Elastic datasets are a special kind of dataset used for entity-centric data modeling in Platfora. They are used to consolidate unique key values from other datasets into one place for the purpose of defining segments, event series lenses, references, or computed fields. They are elastic because the data they contain is dynamically generated at lens build time.

Elastic datasets can be created when you have a flat data model with the majority of your data in a single dataset. Platfora requires you to have separate dimension datasets in order to create segments and event series lenses. Elastic datasets allow you to create 'virtual dimensions' so you can do the entity-centric data modeling required to use these features of Platfora. Elastic datasets can be used to work with data that is not backed by a single data source, but instead is embedded in various other datasets.

For example, suppose you wanted to do an analysis of the IP addresses that accessed your network. You had various server logs that contained IP addresses, but did not have a separate IP Address dataset modeled out in Platfora. In order to consolidate the unique IP addresses that occurred in your other server log datasets, you could create an elastic dataset called IP Address. You could then model references and events that pointed to this elastic, virtual dataset of IP addresses.

There are two ways to create an elastic dataset:

1. From one or more columns in an existing dataset. This will generate the elastic dataset, a static example data file, and the corresponding reference at the same time.

2. From a file containing static example data. You can also use a dummy file of key examples to define an elastic dataset. The file is then used for example purposes only.

After the elastic dataset has been created, you then need to model references (if you want to create segments) and events (if you want to create event series lenses).

Elastic datasets are virtual - all of their data values are consolidated from the other datasets that reference them. They are not file-based like other datasets. The actual key values that comprise the elastic dataset are computed at lens build time. The example data shown in the Platfora data catalog is file-based, but it is only used for example purposes.

Elastic datasets inherit the data access permissions of the datasets that reference them. So for example, if a user has access to the Web Logs and Network Logs datasets, they will have access to the IP address values consolidated from those datasets via the IP Address elastic dataset.

Keep in mind that the sample file used to show values in the dataset workspace and the Platfora data catalog is viewable by all Platfora users by default. If you are concerned about this, don't use real data values to create the sample data file.

Since elastic datasets contain no actual data of their own, they cannot be used as the focus of an aggregate lens. They can be included by reference in an aggregate lens, or be used as the focus when building an event series lens.

Also, since they are used to consolidate key values from other datasets, every base field in the dataset must be included in the elastic dataset key. Additional base fields that are not part of the key are not allowed in an elastic dataset (additional computed fields are OK though).


Add a Reference

A reference creates a link from a field in the current dataset (the focus dataset) to the primary key field of another dataset (the target dataset). The target dataset must have a primary key defined. Also, the fields used to join two datasets must be of the same data type.

1. Open the dataset workspace for an existing dataset, and click the Relationships page.

2. Click Add a Reference.

The Select a Dataset to Reference dialog appears.

3. Select the dataset to link to. Only datasets that have keys defined will appear in the list.

If you do not see the dataset you want to reference in the target list, make sure that it has a key defined and that the data type of the key field(s) is the same as the foreign key field(s) in the focus dataset. For example, if the key of the target dataset is an INTEGER data type, but the focus dataset only has STRING fields, you will not see the dataset in the target list because the data types are not compatible.

4. Click Select.

The Define a Reference dialog appears.

5. Enter a Name for the reference.

6. (optional) Enter a Description for the reference.

7. Choose the Foreign Key field(s) in the current dataset to link to the Primary Key field(s) of the target dataset.

The foreign key field must be of the same data type as the target dataset primary key field.


If the target dataset has a compound key, you must choose a corresponding foreign key for each field in the key.

8. (Optional) Click Preview for one of the primary key-foreign key pairs to see sample data in each field.

9. Click OK.

The new reference is added to the dataset on the References tab on the Relationships page.

10. Save the changes to the dataset.

You may also want to hide the foreign key field(s) in the current dataset so that users only see the reference fields in the data catalog.

From this point on, refer to the referenced dataset by its reference name (not the original dataset name).

Add an Event Reference

An event is a special reverse-reference that is created in a dimension dataset. Before you can model event references, you must define regular references first. Event references allow you to define an event series lens from a dataset.

In order to create an event in a dataset, the current dataset and the event dataset you are linking to must meet the following requirements:

• The current dataset must have a key defined.

• The current dataset must be the target of a reference. See Add a Reference.

• The event dataset that you are linking to must have a timestamp field in it (a DATETIME type field).

If the dataset does not meet these requirements, you will not see the Add an Event button.


1. Edit the dimension dataset in which you want to model event references.

2. Select the Relationships page in the dataset workspace.

3. Select the Events tab.

4. Click Add an Event.

5. Provide the event information.

Event Name This is a logical name for the event. This is the name users will see in the data catalog or a lens.

Event Dataset This is the fact dataset that you are linking to. You will only see datasets that have references to the current dataset.

Event Dataset Reference This is the name of the reference in the event dataset. If the event dataset has multiple references to the current dataset, then choose the appropriate one for your event.

Ordering Field This is a timestamp field in the event dataset. When an event series lens is built, this is the field used to order event records. Only DATETIME type fields in the event dataset are shown.

6. Click OK.

7. The event is added to the Events tab on the Relationships page. Click on an event to view or edit the event details.

Add an Elastic Dataset

Elastic datasets are a special kind of dataset used for entity-centric data modeling in Platfora. There are two ways to create an elastic dataset - from a column in an existing dataset or from a file containing sample values.

Elastic datasets are used to consolidate unique key values from other datasets into one place for the purpose of defining segments or event series lenses. They are elastic because the data they contain is dynamically generated at lens build time.

The data values used when adding the elastic dataset are for example purposes only. They are visible to users as example values when they view or edit the elastic dataset in the data catalog. The actual data values of an elastic dataset come from the datasets that reference it.


Create an Elastic Dataset from an Existing Dataset

As a convenience, you can create an elastic dataset while working in another dataset. This creates the elastic dataset, a static example data file, and the corresponding fact-to-dimension reference at the same time.

1. Open the dataset workspace for an existing dataset that contains the key values you want to consolidate.

2. Click the Data page.

3. Click Derive Elastic Dataset.

4. Choose the column(s) in the current dataset that you want to base the elastic dataset on. If the key comprises multiple columns (a compound key), select each column.

5. Enter a name for the new elastic dataset that will be created.

6. Enter a description for the new elastic dataset that will be created.

7. Enter a name for the new reference that will be created in the current dataset.

8. Click Create.

9. You are notified that the new elastic dataset is about to be created using sample values from the column(s) you selected in the current dataset. Click Confirm.

The sample values are written to Platfora's system directory in the Hadoop file system. For example, in HDFS at:


/platfora/system/current+dataset+name/sample.csv

This file is only used as sample data when viewing or editing the elastic dataset.

This sample file is not removed from HDFS if you delete the elastic dataset in Platfora. You'll have to remove this file in HDFS directly.

10. Go to the Relationships page in the dataset workspace, and notice that the reference to the elastic dataset is created in the current dataset.

11. Save the current dataset.

Create an Elastic Dataset Using a Sample File

If you are creating an elastic dataset based on sensitive values, such as social security numbers or email addresses, you may want to use a sample file of fake data to create the elastic dataset. This way unauthorized users will not be able to see any real data via the sample values. This is especially important for Platfora systems using HDFS delegated authorization.

1. Upload a file to Platfora as a basis for the elastic dataset. This file should contain a newline-separated list of sample values.
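
For example, a fake-data sample file for an elastic dataset of IP addresses might look like the following (these values are invented for illustration):

10.0.0.1
10.0.0.2
10.0.0.3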

2. Go to the Manage step of the dataset creation wizard.

3. Click Edit Primary Key.

4. Select each field in the dataset to be included in the primary key.

5. Select Enable Elastic Dataset to change the dataset's type from file-based to elastic.

6. Click Confirm to confirm the dataset type change.

7. Click OK to return to the Manage step of the dataset creation wizard.

8. Click Next to progress to the Finish step of the dataset creation wizard. Change the Dataset Name for the elastic dataset. For example, you may want to use a special naming convention for elastic datasets to help you find them in the Platfora data catalog.

9. Click Finish to save the dataset.

After you create the elastic dataset, you have to add references in your fact datasets to point to it. This is how the elastic dataset gets populated with real data values at lens build time. It consolidates the foreign key values from the datasets that reference it.


Delete or Hide a Reference

Deleting a reference removes the link between two datasets. If you want to keep the reference link, but do not want the reference to appear in the data catalog, you can always hide it instead. The automatic references to Date and Time cannot be deleted, but they can be hidden.

Before deleting a reference, make sure that you do not have computed fields, lenses, or vizboards that are using referenced fields. A missing reference can cause errors the next time someone updates a lens or vizboard that is using fields downstream of the reference.

1. Edit an existing dataset, and go to the Relationships page in the dataset workspace.

2. Find the reference on the References tab, event references on the Events tab, or geo references on the Geo Locations tab.

3. Delete or hide the reference.

• To delete the reference, click the delete icon.

• To hide the reference, select Hide in Lens Builder. This will keep the reference in the dataset definition, but hide it in the lens builder and the data catalog view of the dataset.


Update a Reference

You can edit an existing reference, event, or geo reference to change its name or description.

Before changing the name of a reference, make sure that you do not have computed fields, lenses, or visualizations that are using it. Changing the name can cause errors the next time someone updates a lens or vizboard that is using the old reference name.

1. Edit an existing dataset, and go to the Relationships page in the dataset workspace.

2. Find and click the reference on the References tab, event references on the Events tab, or geo reference on the Geo Locations tab.

3. In the Inspector panel, update the name or description.

4. Click Save to save the dataset.

Prepare Location Data for Analysis

Adding geographic location information to a dataset allows vizboard users to use maps and geo-spatial analytics to discover new insights in the data. To prepare location data for analysis, you must tell Platfora which fields of your dataset contain geographic coordinates (latitude and longitude), and optionally a place name to associate with those coordinates (such as the name of a business).

FAQs—Location Data and Geographic Analysis

This section answers the common questions about how Platfora handles location data in a dataset. Location information can be added to a dataset by geo-encoding certain fields of the dataset, or by creating a geo location reference to another dataset that contains geo-encoded location data.


What is location data?

Location data represents a geographic point on a map of the Earth's surface. It consists of latitude/longitude coordinates, plus an optional label that associates a place name with the set of coordinates.

What is geographic analysis?

Geographic analysis is a type of data analysis that involves understanding the role that location plays in the occurrence of other factors. By looking at the geo-spatial distribution of data on a map, analysts can see how location impacts different variables.

In a vizboard, analysts can use the geo map viz type to do geographic analysis.

How does Platfora do geographic analysis?

Platfora enables geographic analysis by allowing data administrators to encode their datasets with location information. This geo-encoded data then appears as special location fields in the dataset, lens, and vizboard.

These special location fields can then be used to create map visualizations in a Platfora vizboard. Platfora uses Google Maps to render map visualizations.

What are the prerequisites to doing geographic analysis in Platfora?

In order to do geographic analysis in Platfora, you must have:

• Access to the Google Maps web service from your Platfora master server. Your Platfora System Administrator must configure this for you.

• Datasets with latitude/longitude coordinates in them.


Platfora provides some curated datasets for US states, counties, cities, and zip codes. You can import these datasets and use them to create geo references if needed (assuming your datasets have a column that can be used to link to these datasets).

What are the high-level steps to prepare data for geographic analysis?

1. Geo-encode the location data in your datasets by creating geo location fields or geo location references.

2. Make sure to include location fields in your lens when you build it.

3. In the vizboard, choose the map viz type.

What is a location field?

A location field is a new type of field you create in a dataset. It has a field name, latitude/longitude coordinates, plus an optional label that associates a place name with a set of coordinates. To create a geo location field in a dataset, you must tell Platfora which columns of the dataset contain this information.

What is a geo reference?

A geo location reference is a reference to another dataset that contains location fields. Geo references should be used when the dataset you are referencing is primarily used for location purposes.

How are geo references different from regular references?

References and geo references are basically the same—they both create a link to another dataset. A geo reference, however, can only point to datasets that have geo location fields in them. The purpose of a geo reference is to link to datasets that primarily contain location information.

Geo references and regular references are also displayed differently in the data catalog view of the dataset, the lens, and the vizboard. Notice that either type of reference can contain location fields. Geo references just use a different icon. This visual cue helps users find location data more easily.


When should I create geo references versus regular references?

The purpose of a geo reference is to link to datasets that primarily contain just location fields. This helps users identify location data more easily when they want to do geographic analysis. Geo location type fields are featured more prominently via a geo reference.

If you have a dataset that has lots of other fields besides just location fields, you may want to use a regular reference instead. Users will still be able to use the location fields in the referenced dataset to create map vizzes if they want. The location data is just featured less prominently in the data catalog and lens.

Can I change a regular reference into a geo reference (or vice versa)?

No. If you want to change the type of a reference, you will have to delete the regular reference and recreate it as a geo reference (or the other way around). You should use the same reference name so that lenses and vizboards that are using the old reference name do not break.

You cannot have two references with the same name, even though they are different types of references. You will either have to delete or rename the old reference before you create the new one.

Understand Geo Location Fields

A geo location field is a new type of field you create in a dataset. It has a field name, latitude/longitude coordinates, plus a label that associates a place name value with a set of coordinates. To create a geo location field in a dataset, you must tell Platfora which columns of the dataset contain this location information.

You can think of a geo location field as a complex data type comprised of multiple dataset columns. In order to create a geo location field, your dataset must have:

• A latitude column with numeric data type values

• A longitude column with numeric data type values

• A place name column containing STRING data type values. Place name values must be unique for each latitude, longitude coordinate pair.

The values of this column will be used to:

• label tooltips for points on a map viz

• provide values when creating a filter on a location field

• label marks and axes when using a location field in non-map visualizations


The reason for creating geo location fields is so analyst users can plot location data in a map visualization. Location fields are shown with a special pin icon in the data catalog, lens, and vizboard. This location icon lets users know that the field can be used on a map.


Add a Location Field to a Dataset

The process of adding a geo location field to a dataset involves mapping information from other dataset columns. To create a location field, a dataset needs a latitude column, a longitude column, and a label column containing place names.

1. Open the dataset workspace for an existing dataset, and click the Data page.

2. Make sure that your dataset has the necessary columns.

Latitude and longitude columns are required to create a geo location field. Each coordinate must be in its own column, and the columns must be a numeric data type.

A location name column is optional, but highly recommended. If you do use location names, the values must be unique for each latitude, longitude coordinate pair. For example, a column containing just city names may not be unique (there may be a city named Paris in multiple states and countries). You may need to create a unique place name column by combining the values of multiple fields in a computed field expression.
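
For instance, assuming your dataset has separate city, state, and country columns (hypothetical names), a computed field expression along the following lines could produce a unique place name; CONCAT is shown as an illustration of combining fields, so check the expression reference for your Platfora version:

CONCAT(city, ", ", state, ", ", country)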

3. Click the Relationships page.

4. Click Add a Location Field.

5. Give the location field a Name.


Consider using a standard naming convention for all location type fields. For example, always use Location or Geo in the field name. This will make it easier for users to find location fields using search.

6. (Optional) Enter a Description for the location field.

7. Under Geo Location Type, choose either Latitude, Longitude (if you only have the coordinates) or Latitude, Longitude with Name (if you also have a column to use as place name labels).

8. Choose the fields in the current dataset that map to Latitude, Longitude, and Location Name.

If you don't see the expected dataset columns as choices, make sure the dataset columns are the correct data type—DOUBLE, FIXED, LONG or INTEGER for Latitude and Longitude, STRING for Location Name.

9. Click OK.

10. Make sure the geo location field was added to the dataset as expected. Location fields and geo references are both added in the Geo Locations tab of the dataset on the Relationships page.

11. Save the changes to the dataset.

Understand Geo References

If your datasets do not have geographic coordinates in them, you can reference special geo datasets that do have coordinate information in them. For example, if your dataset has US zip code information in it, you can reference a special geo dataset that contains latitude/longitude coordinates for each US zip code.

A geo reference is similar to a regular reference in Platfora. The difference is that geo references are used specifically for the purpose of linking to datasets containing geo location fields.

A regular reference links to datasets that have other dimension information besides just location information. Although a regular referenced dataset may have location information in it as well, location information is not the primary reason the dataset exists.

Prepare Geo Datasets to Reference

A geo dataset is a dataset that contains mainly location information. The main purpose of a geo dataset is to be the target of a geo location reference from other datasets in the data catalog. Linking another dataset to a geo dataset allows users to do geographic analysis in a vizboard.


Platfora comes with some built-in geo datasets that you can install and use for US States, US Counties,US Cities, and US Zipcodes.

Optionally, you may have your own data that you want to use to create your own geo datasets. For example, you may have location information about sites that are relevant to your business, such as store locations or office locations.

Load the US Geo Datasets

Platfora installs with some curated geo-location datasets for US States, US Counties, US Cities, and US Zipcodes. You can load these datasets into the Platfora data catalog, and then use them to create geo references from datasets that do not have location information in them.

The geo location datasets contain United States location data only. If you have international location data, or custom locations you want to create (such as custom business locations), you can look at these datasets as examples for creating your own geo-location datasets.

In order to reference these geo datasets, your own datasets must have a column that can be used to join to the key of the appropriate geo dataset. For example, to join to the US States dataset, your dataset must have a column that has two-letter state codes (CA, NY, TX, etc.).

1. Log in to the Platfora master server in a terminal session as the platfora system user.

2. Run the geo dataset install script:

$ $PLATFORA_HOME/client/examples/geo/US/install_geo_us.sh --user username --password mypassword

You should see output such as:

Importing dataset: "US States"
Importing dataset: "US Counties"
Importing dataset: "US Cities"
Importing dataset: "US Zipcodes"
Importing permissions for SourceTable: 'US Cities'
Importing permissions for SourceTable: 'US Cities'
Importing permissions for SourceTable: 'US Cities'
Importing permissions for SourceTable: 'US Counties'
Importing permissions for SourceTable: 'US Counties'
Importing permissions for SourceTable: 'US Counties'
Importing permissions for SourceTable: 'US Zipcodes'
Importing permissions for SourceTable: 'US Zipcodes'
Importing permissions for SourceTable: 'US Zipcodes'
Importing permissions for SourceTable: 'US States'
Importing permissions for SourceTable: 'US States'
Importing permissions for SourceTable: 'US States'

3. Go to the Platfora web application in your browser.

4. Go to the Datasets tab in the Data Catalog. Look for the US States, US Counties, US Cities,and US Zipcodes datasets.


The default permissions on these datasets allow Everyone to view them and build lenses, but only the default System Administrator (admin) account to edit or delete them. You may want to grant edit permissions to other system or data administrator users.

Create Your Own Geo Datasets

You may have location information about sites that are relevant to your business, such as store or office locations. If you have location data that you want to reference from other datasets, you can create a special geo dataset. Geo datasets are datasets that are intended to be the target of a geo reference.

Creating a geo dataset is basically the same as creating any other dataset. However, you prepare the fields and references within a geo dataset so that only (or mostly) location fields are visible in the data catalog. Hide all other fields that are not location fields.

Prepare the dataset so that only (or mostly) location fields appear as top-level columns in the dataset. For example, in the Airports dataset, there are three possible locations for an airport (from most granular to most general): Airport Location, Airport City Location, and Airport State Location.

If the dataset references other datasets, hide the internal references so users don't see a complicated tree of references in the data catalog. The goal is to flatten and simplify the reference structure for users. For example, in the Airports dataset, there is an internal reference to US Cities. That reference is hidden so users don't see it in the data catalog.

Use interim computed fields to 'pull up' the required latitude and longitude columns from the referenced dataset into the current dataset. For example, in the Airports dataset, the latitude and longitude columns for Airport Location are already in the current dataset. The latitude and longitude columns for Airport City Location, however, are in the referenced US Cities dataset.


Then create geo location fields in the current dataset. The computed fields add the required columns needed to create a location field in the current dataset. The goal is to create all possible geo location fields in the current dataset, so users don't have to navigate through multiple references to find them.

Consider using a common naming convention for location fields, such as always having Location in the name. This will help users easily find location fields using search.


After all of the location fields have been added to your geo dataset, consider adding a drill path from the most general location field (for example, Airport State Location) to the most specific (for example, Airport Location). This will allow users to drill down on points in a map visualization.

Don't forget to designate a key for your geo dataset. A dataset must have a key to be the target of a geo reference from another dataset.


This approach takes a bit of work, but the end result makes it clear to users what fields they can use in map visualizations. Here is what a geo reference from the Flights dataset to a specially prepared Airport Locations geo dataset might look like.

Add a Geo Reference

A geo reference is basically the same as a regular reference. It creates a link from a field in the current dataset (the focus dataset) to the primary key field of another dataset (the target dataset). You should use a geo reference when the dataset you are linking to is mostly used for location purposes. The target dataset of a geo reference must have a primary key defined and also contain at least one geo location type field. Also, the fields used to join two datasets must be of the same data type.

1. Open the dataset workspace for an existing dataset, and click the Data page.

2. Make sure that your dataset has the necessary foreign key columns to join to the target geo dataset.

For example, to join to the US Cities dataset provided by Platfora, your dataset must have a state column containing two-letter, capitalized state values (CA, TX, NY, and so on), and a city column with city names that have initial capital letters, proper spacing, and no abbreviated names (for example, San Francisco, Los Angeles, Mountain View—not san francisco, LA, or Mt. View).

3. Click the Relationships page.

4. Click Add a Geo Reference.

5. In the Select a Geo Reference Dataset dialog, choose the dataset you want to link to.

Only datasets that have keys defined and geo location fields in them will appear in the list.

6. Click Select.

The Define a Geo Reference dialog appears.

7. Give the geo reference a Name.


Consider using a standard naming convention for all geo location references. For example, always use Location or Geo in the name. This will make it easier for users to find geo references and location fields using search.

8. (optional) Enter a Description for the geo reference.

9. Choose the Foreign Key field(s) in the current dataset to link to the primary key field(s) of the target dataset.

The foreign key field must be of the same data type as the target dataset primary key field.

If the target dataset has a compound key, you must choose a corresponding foreign key for each field in the primary key.

10. Click OK.

11. Make sure the geo location reference was added to the dataset as expected. Location fields and geo references are both added in the Geo Locations tab of the dataset on the Relationships page.

This is how geo references appear in the data catalog view of the dataset, the lens, and the vizboard. The location fields under a geo reference are listed before other dimension fields.

Pre-Process Data with Transformation Datasets

A transformation dataset is a category of dataset that allows data administrators to model and pre-process data before transforming fields with computed fields. The results of this pre-processing form the basis of the dataset's base fields, which you can then manipulate with computed fields, computed measures, and drill paths.

Transformation datasets always use existing datasets as their source. Transformation datasets can change the number of records and fields in the dataset, whereas data source based datasets can only change the number of fields in a dataset.

Platfora processes transformation datasets using an Apache Spark server. It uses Spark when working in the dataset workspace and when building a lens that uses a transformation dataset.

Combine Data with Union Datasets

Data administrators can create a Union dataset that combines data in related fields from existing Platfora datasets.


FAQs—Union Datasets

This topic answers some frequently asked questions about Union datasets.

A Union dataset is a transformation dataset that combines data in related fields from existing Platfora datasets.

You might want to create a Union dataset to combine related data that exists in different datasets into a single dataset for analysis. For example, you might want to combine datasets that have similar, but not identical schemas, or datasets with the same schema but with source files in different locations.

How do I create a Union dataset?

Go to the Data Catalog page and click Add Dataset. Choose Union Dataset as the type of dataset to create, and then choose two or more existing datasets that will provide input data to the Union dataset.

You can use most data source based datasets, any transformation dataset (for example, SQL or Union datasets), and datasets created using a Platfora Data Connector framework (for example, HBase or BigQuery datasets) as inputs to a Union dataset. Platfora does not currently support using elastic, dynamic derived, or static derived datasets, or any dataset that uses a Platfora plugin as a data source.

When you first create a Union dataset, Platfora suggests how to match fields from each input dataset into a field in the Union dataset. Platfora uses fuzzy matching logic based on field names and data types. Matched fields from the input datasets correspond to a field in the Union dataset. This set of associated fields is called a field mapping.


You might need to edit the default field mappings that Platfora creates. For example, some field mappings might not include a field from each input dataset if Platfora couldn't detect an obvious match.

You can define new field mappings, and delete or edit existing ones. In each field mapping, you enter a name for the field as it appears in the Union dataset. Each Union dataset field specified in a field mapping becomes a base field on the Manage Fields step of the dataset creation wizard and a base field on the Data page of the dataset workspace.

The following types of fields from input datasets are not available in a Union dataset:

• Hidden fields

• Fields from referenced datasets

• Computed fields that are based on fields from referenced datasets

• Computed fields that contain the PARTITION, FILE_PATH, or FILE_NAME functions

If you want the Union dataset to include fields from a referenced dataset, you must include the appropriate field from the input datasets and then define a reference in the Union dataset.

The datasets you want to combine in a Union dataset might not have the same schema. For example, one dataset might have first name and last name information in a single field, and the other has that information in two fields. When this is the case, you can create computed fields in each input dataset so they can be combined in a Union dataset.

You can create computed fields for input datasets in the following locations:

• In the original dataset. Any computed field you create in a dataset is available to both that dataset and any Union dataset that uses it as an input dataset.

• On the Input tab of the Union dataset. Any computed field you create for an input dataset of a Union dataset is only available to that Union dataset. It does not get pushed back to the original dataset. You might want to create a computed field on the Input tab of a Union dataset if you need a field to use a different data type, but you don't want to change the data type of the original dataset. For example, you could create a computed field to change a zip code field from INTEGER to STRING to match it with a STRING zip code field in another input dataset (see the sketch after this list).
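
For the zip code example above, the computed field on the Input tab might use a simple type-conversion expression such as the following sketch. Here zip_code is a hypothetical field name, and TO_STRING is shown as an assumed conversion function, so check the expression reference for your Platfora version:

TO_STRING(zip_code)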

Do I have to map a field from every input dataset?

No. If you don't specify a field in an input dataset for a field mapping, Platfora will use a NULL value for that field from that input dataset.

How do I edit the field mappings in a Union dataset?

In the dataset workspace, go to the Data page and click Edit Union Properties.

How many input datasets can a Union dataset have?

Although there is no limit to the number of input datasets you can include in a Union dataset, you might want to limit each Union dataset to four input datasets. Using more than four datasets might be challenging to configure in the Platfora web application.

How does data lineage work with Union datasets?

Data lineage only goes back as far as a Union dataset because it is based on multiple datasets. It does not report on lineage information for the datasets and data sources used by the Union dataset.


Define Field Mappings in a Union Dataset

A field mapping associates a named field in the Union dataset with one or more fields from the input datasets. Field mappings determine how data is combined in a Union dataset.

1. Go to the Union step of the dataset creation wizard.

Or, open a Union dataset, go to the Data page, and click Edit Union Properties.

When creating a Union dataset, Platfora creates some default field mappings based on field names and data types. You can add, edit, or delete field mappings.

2. Click Add Field to create a new field mapping.

3. Enter the field name as you want it to appear in the Union dataset.

4. Select which fields in each input dataset to combine in this Union dataset field. You can choose None for an input dataset to select no field. Platfora will use NULL values when it builds the lens.

5. View how the field values will combine in the Union dataset.

6. (Optional) Click the Input tab for an input dataset to view the dataset and optionally create a computed field that can be used in this Union dataset.


The contents of the Input tab are very similar to the Data page of the dataset. For example, you can view details of each field in the Inspector panel, search the fields, and create a computed field.

7. (Optional) Delete an existing field mapping by clicking the X next to a field mapping, or all field mappings by clicking Clear All Fields.

8. (Optional) Click Remap Fields to discard all currently configured field mappings and restore the default field mappings that Platfora creates based on field names and data types.

Work with SQL Datasets

Data administrators can create a SQL dataset that manipulates existing Platfora datasets.

Understand SQL Datasets

A SQL dataset is a transformation dataset whose underlying data is produced from the results of a Hive query language (HiveQL) statement performed on existing Platfora datasets. When adding a new dataset, choose SQL Dataset and then enter a HiveQL statement.

Once a SQL dataset is defined from the HiveQL statement, you can use it as you would any other dataset in Platfora—you can edit it, add computed fields, and join it by reference to other datasets in the Platfora data catalog.

SQL datasets take in one or more existing datasets as their source. In that regard, SQL datasets are similar to relational database views, while datasets backed by files, such as on HDFS, are similar to relational database tables.

You might want to create a SQL dataset to pre-process data before defining other dataset properties, such as creating computed fields. For example, you can use SQL datasets to perform a union of rows from multiple datasets that have the same fields, or explode multiple values stored together in a single field, such as in a delimited list or a JSON array.


For example, you could write the following HiveQL statement to perform a union of two datasets:

SELECT `F1`, `F2`, `F3` FROM `Dataset1`
UNION ALL
SELECT `F1`, `F2`, `F3` FROM `Dataset2`

SQL datasets allow you to manipulate data in a way that changes both the number of rows and the number of columns. This is in contrast to computed fields, which only allow manipulations that change the number of columns. You could create new rows by exploding data, or decrease rows by filtering out rows.
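
For instance, to create new rows by exploding a comma-delimited field, you could use a HiveQL statement along these lines. This is a sketch only; the Articles dataset and its id and tags fields are hypothetical, and note the lower-case alias in the AS clause (a requirement described in the guidelines below):

-- Sketch: produce one row per tag from a comma-delimited `tags` field.
SELECT `id`, tag
FROM `Articles`
LATERAL VIEW EXPLODE(split(`tags`, ',')) t AS tag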

Consider the following rules and guidelines when creating and using SQL datasets:

• Best practice is to enclose Platfora field names and dataset names in the HiveQL statement in the grave accent character ( ` ), also known as the backtick character. For example, SELECT `Field1`, `Field2`, `Field3` FROM `Dataset1`. Enclosing names in the ` character is required for field and dataset names containing numerals only (such as 12345) or special characters (such as spaces and periods).

• When a lens contains a SQL dataset, Platfora runs the HiveQL expression in Spark every time the lens is built.

• The HiveQL statement can't use segment fields, elastic datasets, dynamically derived datasets, or datasets that use a data connector plugin as a data source.

• Data lineage only goes back as far as a SQL dataset when the SQL dataset queries multiple datasets, such as when performing a union. It does not report on lineage information for the datasets and data sources used by these SQL datasets.

• SQL datasets listed on the Data Catalog page show no size. This is because SQL datasets are backed by query statements, not raw source files.

• The HiveQL statement must return fields with data types supported by Platfora. For example, the HiveQL statement can't return a single field as an array.

• HiveQL statements can't bring in the following types of fields from the original dataset:

• Hidden fields

• Fields from referenced datasets

• Computed fields that contain fields from referenced datasets

• Computed fields that contain the PARTITION function

• The HiveQL IN and EXISTS clauses don't work currently due to a known limitation with Apache Spark. To work around this issue, use a WITH clause (see the sketch after this list).

• Field names used in a LATERAL VIEW explode() clause must use lower case for the aliased name in the AS clause.

• The number of rows of sample data shown in the dataset workspace of the SQL dataset is determined by the number of rows selected from the original datasets for sample data and the HiveQL processing performed on those rows. Therefore, if the HiveQL statement filters out all rows queried from the original datasets for sample data, no sample data will appear in the dataset workspace for the SQL dataset. To work around this, try increasing the number of sample rows. In some cases, you might need to edit the HiveQL statement.
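
As an illustration of the WITH clause workaround mentioned above, a filter that would normally use IN can be rewritten as a join against a named subquery. This is a sketch only; the Orders and VipCustomers dataset and field names are hypothetical:

-- Instead of: SELECT * FROM `Orders` WHERE `cust_id` IN (SELECT `id` FROM `VipCustomers`)
WITH vip AS (SELECT `id` FROM `VipCustomers`)
SELECT o.*
FROM `Orders` o
JOIN vip ON o.`cust_id` = vip.`id`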


SQL Dataset Functions

Platfora has created some functions that you can use in the HiveQL statement of a SQL dataset.

These SQL dataset functions can be used to return a Hive array of STRING or STRUCT values from an input XML string. You might want to use these functions with the HiveQL EXPLODE function to create new rows of data in a SQL dataset.

Platfora has created the following SQL dataset functions:

• PLAT_XPATH_ARRAY(xml_string, XPATH_expression)—This function returns a Hive ARRAY of STRING values. The string values it returns can be complex XML elements instead of just text or attribute values.

• PLAT_XPATH_ARRAY_INDEXED(xml_string, XPATH_expression)—This function returns an ARRAY of STRUCT values. The STRUCT content has two values, value (as a string) and index (as an integer). The string value returned in the STRUCT value is the same as the string value that would be returned when using the PLAT_XPATH_ARRAY function.

Suppose you have a Platfora dataset called xml_dataset that contains a field called xml containing the following XML data:

<?xml version="1.0" encoding="UTF-8"?>
<data>
  <customer name="Alice">
    <phone type="main">555-555-5555</phone>
    <phone>555-555-7777</phone>
    <order id="123">
      <payment-method>Visa</payment-method>
      <lineitem>
        <product>bananas</product>
        <quantity>4</quantity>
        <unit-price>0.99</unit-price>
      </lineitem>
      <lineitem>
        <product>steak</product>
        <quantity>1.4</quantity>
        <unit-price>5.99</unit-price>
      </lineitem>
    </order>
    <order id="124">
      <payment-method>cash</payment-method>
      <payment-method>Visa</payment-method>
      <lineitem>
        <product>bananas</product>
        <quantity>2</quantity>
        <unit-price>0.99</unit-price>
      </lineitem>
      <lineitem>
        <product>milk</product>
        <quantity>1</quantity>
        <unit-price>3.99</unit-price>
      </lineitem>
      <lineitem>
        <product>steak</product>
        <quantity>2</quantity>
        <unit-price>5.99</unit-price>
      </lineitem>
    </order>
  </customer>
  <customer name="Bob">
    <order id="125">
      <payment-method>Visa</payment-method>
      <lineitem>
        <product>milk</product>
        <quantity>2</quantity>
        <unit-price>3.99</unit-price>
      </lineitem>
    </order>
    <order id="126">
    </order>
  </customer>
</data>

You could create a SQL dataset and use the following HiveQL statement:

SELECT cust, ord, line,
       size(PLAT_XPATH_ARRAY(cust, '//phone')) as phone_count,
       concat_ws(',', PLAT_XPATH_ARRAY(line, '/lineitem/product/text()')) product
FROM `xml_dataset` as x
LATERAL VIEW EXPLODE(PLAT_XPATH_ARRAY(xml, '/data/customer')) c as cust
LATERAL VIEW OUTER EXPLODE(PLAT_XPATH_ARRAY(cust, '/customer/order')) o as ord
LATERAL VIEW OUTER EXPLODE(PLAT_XPATH_ARRAY(ord, '//lineitem')) l as line
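
This statement produces one row per customer, order, and line item combination: the first EXPLODE creates a row for each customer element, and the two OUTER explodes add a row for each of that customer's orders and each order's line items. The OUTER keyword keeps rows that would otherwise be dropped, such as order 126, which has no line items. The phone_count column counts the phone elements under each customer, and product extracts the product name text from each line item.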

Create a SQL Dataset

SQL datasets use the Hive query language (HiveQL) to manipulate existing Platfora datasets.

1. Go to the Data Catalog page and click Add Dataset.

The dataset creation wizard starts.


2. Click SQL Dataset.

3. On the SQL step, enter a valid HiveQL statement that queries existing Platfora datasets.

Best practice is to enclose Platfora field names and dataset names in the HiveQL statement in the grave accent character ( ` ), also known as the backtick character. For example, SELECT `Field1`, `Field2`, `Field3` FROM `Dataset1`. Enclosing names in the ` character is required for field and dataset names containing numerals only (such as 12345) or special characters (such as spaces and periods).

4. Click Next.

Platfora uses Apache Spark to query the data behind the datasets in the HiveQL statement and processes the records and fields (rows and columns) as necessary. The dataset creation wizard progresses to the Manage step. Platfora displays the new fields and example data.

5. Finish defining the dataset like other datasets.


Chapter 4: Use the Data Catalog to Find What's Available

The data catalog is a collection of data items available and visible to Platfora users. Data administrators build the data catalog by defining and modeling datasets in Platfora that point to source data in Hadoop. When users request data from a dataset, that request is materialized in Platfora as a lens. The data catalog shows all of the datasets (data available for request) and lenses (data that is ready for analysis) that have been created by Platfora users.

Topics:

• FAQs - Data Catalog Basics

• Find Available Datasets

• Find Available Lenses

• Find Available Segments

• Organize Datasets, Lenses, Segments, and Vizboards with Labels

FAQs - Data Catalog Basics

The data catalog is where users can find datasets, lenses, and segments that have been created in Platfora. This topic answers the basic questions about the data catalog.

How can I see the relationships between datasets?

There isn't one place in the data catalog where you can see how all of the datasets are related to each other. You can, however, open a particular dataset to see how it relates to other datasets in the data catalog.


The Relationships page of the dataset workspace shows all datasets that the current dataset is related to. Click the different tabs to view the different types of relationships.

The Relationships page shows the following tabs:

• References—Go here to view or define downstream datasets. When a reference is listed here, the current dataset is considered to be a fact dataset.

• Geo Locations—Go here to view or define any geo reference or location field defined in this dataset. Datasets containing a geo reference or location field can be used in a geo map visualization.

• Events—Go here to view or define an event reference. If a dimension dataset has an event or segment associated with it, then it is also considered an entity dataset. Entity datasets serve as a conforming dimension to join multiple fact datasets together. Entity datasets can be used as the focus of an event series lens.

• Upstream Datasets—Go here to view upstream datasets. If a dataset is the target of an incoming reference, it is considered a dimension dataset. Dimension datasets show upstream relationships on this tab.

What does it mean when a dataset or lens has a lock on it?

If you are browsing the data catalog and see datasets or lenses that are grayed-out and locked, this means that you do not have sufficient data access permissions to see the data in that dataset or lens. Contact your Platfora system administrator to ask if you can have access to the data.


You can choose to hide these objects from displaying in the Platfora application by editing your user profile and selecting Hide objects I don't have permission on.

What does it mean when a dataset has (static) or (dynamic) after its name?

This means that the dataset is a derived dataset. A derived dataset is defined from a viz (or lens query) in Platfora, whereas a regular dataset is defined from a data source outside of Platfora.

A static derived dataset takes a snapshot of the viz data at a point in time - the data does not change if the parent lens is updated. A dynamic derived dataset does not save the actual viz data, but instead saves the lens query used to produce the data - the data is dynamically updated whenever the parent lens is updated.

Why doesn't 'My Datasets' or 'My Lenses' have anything listed?

Even though you may work with certain datasets and lenses on a regular basis, they won't show in the My Datasets or My Lenses panels unless you were the original user who created them.


Find Available Datasets

Datasets represent a collection of source data in Hadoop that has been modeled for use in Platfora. You can browse or search the data catalog to find datasets that are available to you and that meet your data requirements. Once you find a dataset of interest, you can request that data by building a lens (or check if a lens already exists that has the data you need).

Search within Datasets

Using the Quick Find search, you can find datasets by name. Quick Find also searches the field names within the datasets.

1. Go to the Datasets tab in the Data Catalog.

2. Search by dataset name or by a field name within the dataset using the search.

Dataset List View

List view allows you to sort the available datasets by different criteria to find the dataset you want.

1. Go to the Datasets tab in the Data Catalog.


2. Select List view.

3. Click a column header to sort by that column.

4. Once you find the dataset you want, use the dataset action menu to access it.

5. While in list view, you can select multiple datasets to delete at once.

Find Available Lenses

Lenses contain data that is already loaded into Platfora and immediately available for analysis. Lenses are always built from the focus of a single dataset in the data catalog. Before you build a new lens, you should check if there is already a lens that has the data you need. You can browse the available lenses in the Platfora data catalog.

1. Go to the Lenses tab in the Data Catalog.

2. Choose the List view to easily sort and search for lenses.

3. Search by lens name or by a field name within the lens using the search.

4. Click a column header to sort by that column, such as finding lenses by their focus dataset.

5. Once you find a lens you want, use the lens action menu to access it.


Find Available Segments

You can browse or search the data catalog to find segments that have been created in Platfora. Once you find a segment of interest, you can view the segment details.

From the segment action menu you can Edit the segment definition, Delete the segment, edit its Permissions, define a Schedule for the segment lens to update, and more. Segments cannot be created from the data catalog; they can only be edited or deleted there.

Clicking a segment opens the Edit Segment dialog and shows the following information:

Segment Name The name given to the segment when it was defined in the vizboard.

Segment Info The number of rows of this dataset that met the segment conditions.

Segment of The focus dataset (and its associated reference to the current dataset) that was used to define the segment.

Occurring in Dataset This is always the same as the currently selected dataset name.

Origin Lens The lens that was queried to define this segment.

Last Built The last time the segment was updated.

Segment Conditions The criteria that a record must meet to be counted in the segment.

IN and NOT IN Value Labels The value labels given to records that are in the segment, and those that are not.

Organize Datasets, Lenses, Segments, and Vizboards with Labels

If you have datasets, lenses, segments, and vizboards that you use all of the time, you can tag them with a label so you can easily find them.


FAQs—Labels

This topic answers some frequently asked questions about labels.

What is a label?

A label is a tag that you can assign to Platfora datasets, lenses, segments, and vizboards to classify objects together.

Labels allow you to organize and categorize Platfora objects for easier search and collaboration. For example, you can label all datasets, lenses, segments, and vizboards associated with a particular department or project.

Who can create a label?

Anyone in Platfora can create a label and apply it to any object to which they have view data access.

Is there any security associated with labels?

Labels are just an organizational tool. They do not have any security or privacy settings associated with them. Labels can be created, viewed, applied, or deleted by any Platfora user, even labels created by other users. There is no ownership associated with labels.

When I view objects with a particular label and then move to a different page in the Platfora web application, will I still see only objects with that label?

Yes, Platfora keeps track of the label you're currently viewing in the URL (as part of the query parameter portion of the URL). This is sometimes referred to as "being sticky." For example, if you are viewing the "marketing" label on the Data Catalog page and then click the Vizboards page, only vizboards with the marketing label are shown on the Vizboards page.

What kinds of objects can I apply labels to?

You can apply labels to datasets, lenses, segments, and vizboards.

How many labels can I apply to an object?

You can apply as many labels as you like to a dataset, lens, segment, or vizboard.

How many levels of labels can I create?

Platfora allows you to create up to 10 levels of labels.

Can I use special characters in label names?

You can use all characters except the period character ( . ). Platfora uses periods to distinguish between label levels in the query parameter string of the web application URL.

When I export and import objects in Platfora, how does Platfora handle labels?

When you export objects using the platfora-export utility, Platfora includes the names of labels that are assigned to those objects by default. However, you can choose not to export label name information.


When you import objects using the platfora-import utility, Platfora uses the existing labels if they exist in the new data catalog, and creates new labels if they don't exist.

Also note the following considerations when importing objects that have labels assigned:

• If a label name in the JSON import file has a period, the import fails. You must manually change the label name in the JSON file by removing the period or changing it to an underscore.

• If an object has a label assigned that contains more than 10 levels, Platfora uses the first nine levels and then uses the final label name as level 10.

Can I use labels to make pages load in my web browser more quickly?

Yes! When you view a specific label in the Data Catalog or Vizboards page, you can bookmark the page in your web browser. You might want to do this if your data catalog has a lot of objects and takes more time to load than desired. By viewing only the objects with a specific label, the page loads more quickly.

Create a Label

Before you create new labels, first decide how you want to categorize and organize your data objects in Platfora. For example, do you want to tag objects by user names? by project? by department? by use case? a combination of these? Labels can be created for each category you want to search by, and within a category, you can create up to 10 levels of nested sub-label categories.

By default, there is one parent label category called All, which cannot be renamed or deleted. Any label you add will be a sublabel of All.

1. Go to the Data Catalog or the Vizboards area of Platfora. You can manage labels from either area.

2. Select Manage Labels from the Labels menu.


3. Select Create Sublabel from the desired parent label in the hierarchy. The default parent label category is All.

4. Enter a name for the label.

5. Click Create.

6. Click Close.

Apply Labels to an Object

You can apply as many labels as you like to a dataset, lens, segment, or vizboard. You can apply labels to one object at a time, or apply them in bulk to many objects.

Apply Labels to a Single Object

1. Go to the Data Catalog or the Vizboards area of Platfora. You can apply labels from either area.

2. Select Labels from the dataset, lens, segment, or vizboard action menu.

3. Click the plus sign to apply a label. Click the minus sign to remove a label that has been previously applied.

4. Click OK.


Apply Labels to Many Objects at a Time

1. Go to the Data Catalog or the Vizboards area of Platfora. You can apply labels from either area.

2. (Optional) Filter the objects displayed.

3. Select the objects to which you want to apply labels.

4. Click Assign Labels.

5. Click the plus sign to apply a label. Click the minus sign to remove a label that has been previously applied.

6. Click OK.

Remove Labels From an Object

You can remove labels applied to a dataset, lens, segment, or vizboard. You can remove labels from one object at a time, or remove them in bulk from many objects.


Remove Labels From a Single Object

1. Go to the Data Catalog or the Vizboards area of Platfora. You can remove labels from either area.

2. Select Labels from the dataset, lens, segment, or vizboard action menu.

3. Click the minus sign to remove a label that has been previously applied.

4. Click OK.

Remove Labels From Many Objects at a Time

1. Go to the Data Catalog or the Vizboards area of Platfora. You can remove labels from either area.

2. (Optional) Filter the objects displayed.

3. Select the objects from which you want to remove labels.

4. Click Clear Labels.

5. Click Clear to remove the labels from these objects.


Delete or Rename a Label

Deleting a label also deletes all of its child labels. When you delete a label, the label is removed from all objects to which it was applied. The objects themselves are not affected. When you rename a label, the label is updated to the new name wherever it is applied. You do not need to re-apply it to the objects after renaming.

If you are getting errors using a label after it has been renamed, try reloading the browser page. Sometimes old label names are cached by the browser and can cause unexpected results.

Search by Label Name

Once you have applied labels to your objects, you can use the label breadcrumbs and search to find objects by their assigned labels. You can search by label in the Data Catalog or Vizboards areas of the Platfora application.


1. Click any level in the breadcrumb hierarchy to filter by that label category.

2. Select an existing label to filter on.

Platfora only displays objects that have this filter applied. The My Datasets and other My Objects panes only show objects that have that label applied, so in some cases they might not list any objects if you didn't edit objects with the label recently.


Chapter 5: Define Lenses to Load Data

To request data from Hadoop and load it into Platfora, you must define and build a lens. A lens can be thought of as a dynamic, on-demand data mart purpose-built for a specific analysis project.

Topics:

• FAQs—Lens Basics

• Lens Best Practices

• About the Lens Builder Panel

• Understand the Lens Build Process

• Create a Lens

• Estimate Lens Size

• Manage Lenses

• Manage Segments—FAQs

FAQs—Lens Basics

A lens is a user-chosen slice of data from a dataset that is used as the source of your analysis in vizboards, or can be shared out with other systems by writing it to a common location or accessing it programmatically via the REST API. This topic answers some frequently asked questions about lenses.

What is a lens?

A lens is a type of data storage that is specific to Platfora. Platfora uses Hadoop as its data source and processing engine to build and store its lenses. Once a lens is built, this prepared data is copied to Platfora, where it is then available for analysis. A lens can be thought of as a dynamic, on-demand data mart purpose-built for a specific analysis project.

Who can create a lens?

Lenses can be created by any Platfora user with the Analyst system role (or above), provided that user also has the appropriate security permissions to the underlying source data and the dataset.


How do I create a lens?

You create a lens by first choosing a dataset in the Platfora data catalog, and then choosing Create Lens from the dataset detail page or the dataset action menu.

If the Create Lens option is grayed out, you don't have the appropriate security permissions on the dataset. Ask your system administrator or the dataset owner to grant you access.

Now that I have a lens, what can I do with it?

Lenses provide access to a subset of data and aggregates from a dataset that you or another user has created. With a lens chosen, you can create visualizations (vizzes) to explore the data further and discover trends, or share your findings with others.

How big can a lens be?

It depends on how much disk space and memory you have available in Platfora, and if your system administrator has set a limit on how much data you can request at once. As a general rule, a lens should not be bigger than the amount of memory you have available in your entire Platfora cluster.

For most Platfora users, your system administrator sets a lens quota which limits how big of a lens you can build. The default lens quota depends on your system role: 1 GB for Analysts, 1 TB for Data Administrators, and Unlimited for System Administrators. You can see your lens quota when you go to build a lens.

The lens quota for your role applies to the size of the built lens plus the temporary files created when building the lens.

Likely, your organization uses Hadoop because you are collecting and storing a lot of data. It probably doesn't make sense to request all of that data at once. You can limit the amount of data you request by using lens filters and only choosing the fields you need for your analysis.

How long does it take to build a lens?

It really depends - a lens build can take a few minutes or several hours.

There are a lot of factors that determine how long a lens build will take, and a lot of those factors depend on your Hadoop cluster, not necessarily on Platfora. Since the lens build jobs happen in Hadoop, the biggest factor is the resources that are available in your Hadoop cluster to run Platfora's MapReduce jobs. If the Hadoop cluster is busy with other workloads, or if there is not enough memory on the Hadoop task nodes, then Platfora's lens builds will take longer.

The time it takes to build a lens also depends on the size of the input data, the number and cardinality of the dimension fields you choose, and the complexity of the processing logic you have defined in your dataset definitions.


What are the different kinds of lenses?

Platfora has two types of lenses you can build: an Aggregate Lens or an Event Series Lens. The type of lens you build determines what kinds of visualizations you can create and what kinds of analyses you can perform when using the lens in a vizboard.

An aggregate lens can be built from any dataset. It contains aggregated measure data grouped by the various dimension fields you select from the dataset. Choose this lens type if you want to do ad hoc data analysis.

An event series lens can only be built from dimension datasets that have an Event reference defined in them. It contains non-aggregated events (fact dataset records), partitioned by the primary key of the selected dimension dataset, and sorted by the time the event occurred. Choose this lens type if you want to do time series analysis, such as funnel paths.

How does Platfora handle rows that can't be loaded?

When Platfora processes the data during a lens build, it logs any problem rows that it could not process according to the logic defined in the dataset. These 'dirty rows' are shown as lens build warnings. Platfora administrators can investigate these warnings to determine the extent of the problem.

Lens Best Practices

When you define a lens, you want the selection of fields to be broad enough to support all of the business questions you want to answer. A lens can be used by many visualizations and many users at the same time. On the other hand, you want to constrain the overall size of the lens so that it will fit into the available memory and so queries against the lens are fast.

Check for existing lenses before you build a new one.

Once you find a dataset that contains the data you want, first check for any existing lenses that have been built from that dataset. There may already be a lens that you can use for your analysis. Also, if there is an existing lens that contains some but not all of the fields you want, you can always modify the lens definition to add additional fields. This is more efficient than building a whole new lens from scratch.

Define lens filters to reduce the amount of data you request.

You can add a lens filter on any dimension field of a dataset. Lens filters constrain the number of records pulled into the lens from the data source. For example, if you store 10 years' worth of data in Hadoop, but only need to access the past year's worth of data, you can set a date-based filter so the lens gets only the data you need.

Keep in mind that you can also create filters within visualizations. Lens filters should be used to limit the number of records (and overall lens size). You don't want a lens to be so narrow in scope that it limits its analysis opportunities.


Don't include high cardinality fields that are not essential to your analysis.

The size of an aggregate lens depends mostly on the cardinality (number of unique values) of the dimension fields selected. The more granular the dimension data, the bigger the aggregate lens will be. For example, aggregating time-based data to the second granularity will make the lens significantly bigger than if you chose to analyze the data to the hour granularity.

For fields that you intend to use as measures only (you only need the aggregated values), make sure to deselect Original Value. When Original Value is selected, the field is also included in your lens as a dimension field.

Don't include DISTINCT measures unless they are essential to your analysis.

Measures that calculate DISTINCT counts must also include the original field values that they are counting. If you add a DISTINCT measure on a high-cardinality field, this can make your aggregate lens larger than expected. Only include DISTINCT measures in your lens when you are sure you need them for your analysis.

For any dimension field you have in your lens, you can also calculate a DISTINCT count in the vizboard using a vizboard computed field. DISTINCT is the one measure aggregation that doesn't have to be calculated at lens build time.
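As a rough SQL analogy of why distinct counts are expensive (a sketch only; the table and field names are illustrative, not part of Platfora):

    -- A distinct count must carry every underlying user_id value through the
    -- aggregation, while SUM or COUNT keeps only one running value per group.
    SELECT state,
           COUNT(DISTINCT user_id) AS distinct_users
    FROM visits
    GROUP BY state;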

About the Lens Builder Panel

When you create a new lens or edit an existing one, it opens the lens builder panel. The lens builder is where you choose and confirm the dataset fields that you want in your lens. The lens builder panel looks slightly different depending on the type of lens you are building (aggregate or event series lens). You can click any field to see its definition and description.

1. Lens Name

2. Focus Dataset Name

3. Focus Dataset Size

4. Lens Type (Aggregate or Event Series)

5. Lens Size and Quota Information

6. Field Selection Controls

7. Field Information and Descriptions

8. Quick Measure Controls

9. Lens Filter Controls

10. Lens Management Controls

11. Segment Controls

12. Collapse and Expand Lens Information Panel

13. Lens Actions (save lens definition and/or initiate a build job to get data from Hadoop)

Understand the Lens Build Process

The act of building a lens in Platfora generates a series of MapReduce jobs in Hadoop to select, process, aggregate, and prepare the data for use by Platfora's visual query engine, the vizboard. This section explains how source data is selected for processing, what happens to the data during lens build processing, and what resulting data to expect in the lens. By understanding the lens build process, administrators can make decisions to improve lens build performance and ensure the resulting data meets the expectations of business users.

Understand Lens MapReduce Jobs

When you build or update a lens in Platfora, it generates a series of MapReduce jobs in the Hadoop MapReduce cluster. The number of jobs and the time to complete each job depend on the number of datasets involved, the number and size of the fields selected, and whether that lens definition has been built before (incremental vs. non-incremental lens builds).

This topic explains all of the MapReduce jobs or steps that you might possibly see in a lens build, and what is happening in each step. These steps are listed in the order that they occur in the overall lens build process.

These MapReduce jobs appear on the Platfora System page as distinct steps of a lens build. Depending on the lens build, you might see all of these steps or just a few of them. Depending on the number of datasets involved in the lens build, you may see some steps more than once:

1. Inspect Source Data: This step scans the data source to determine the number and size of the files to be processed. If a lens was built previously using the same dataset and field selections, then the inspection checks for any new or changed files since the last build. If you have defined lens filters on an input partitioning field, these filters are applied at this time, before any other processing occurs.

2. Waiting for lens build slot to become available: To prevent Platfora from overwhelming the Hadoop cluster with too many concurrent lens build jobs, Platfora limits the number of concurrent jobs it runs. Any lens build submitted after that limit is reached waits for existing lens builds to finish before starting. The limit is 3 by default. This limit is controlled by the platfora.builder.lens.build.concurrency property.

3. Event series processing for computed_field_name: This step only occurs in lenses that include event series processing computed fields (computed fields defined using a PARTITION statement). This job does the value partitioning and multi-row pattern match processing of event series computed fields.

4. Build Data Dictionaries: This step scans the source files and determines the distinct values for each dimension (grouping column) in the lens. For example, a gender field might have two distinct values (Male, Female) and a state field might have 50 distinct values (CA, NY, WA, TX, etc.). For high-cardinality fields, you may see an additional Build Partitioned Data Dictionaries step preceding this one, which splits up the distinct values so that the dictionary can be distributed across multiple nodes. This job is run for each dataset included in the lens.

5. Encoding Attribute: This step encodes the dimension values (or attributes) using the data dictionaries. When data dictionaries are small, this step does not require its own job (it is performed as part of dictionary building). When a data dictionary is large, encoding attributes is a separate MapReduce job.

6. Encoding Reference: This step joins datasets that are connected by references. When data dictionaries are small, this step does not require its own job (it is performed as part of dictionary building). When a data dictionary is large, joining datasets is a separate MapReduce job.

7. Aggregate Datasets: For aggregate lenses, this step calculates the aggregated measure values for each dimension value and each unique combination of dimension values. For example, if the lens included a measure for SUM(sales), and the dimension fields gender and state, then the sum of sales would be calculated for each gender, each state, and each state/gender combination. For event series lenses, this step partitions the individual event records by the focus dataset key and orders the event records in each partition by time. This job is run for each dataset included in the lens.

8. Load Datasets: This step creates a columnar representation of the data and writes out the lens data structures to disk in the Hadoop file system. This job is run for each dataset included in the lens.

9. Index Datasets: For lenses that include fields from multiple datasets, this step creates indexes on key fields to allow joins between the datasets when they are queried. This job is run for each referenced dataset included in the lens.

10. Transfer to Final Location: This step is specific to cloud lens builds (either Amazon EMR or Google Dataproc). It copies lens output files from the intermediate directory in the Hadoop job flow to the final destination in S3 or Google Storage.

11. Preload Built Data Files to Local Disk: This step copies the lens data structures from the Hadoop file system to the data directory locations on the Platfora servers. Pre-fetching the lens data from Hadoop reduces the initial query time when a lens is first accessed in a vizboard.

Understand Source Data Input to a Lens Build

This section describes how Platfora determines what source data files to process for a given lens build.

Source data input refers to the raw data files in Hadoop that are considered for a particular lens build. A Platfora dataset points to a location in a data source (a directory in HDFS, an S3 bucket, a Hive table, etc.). By choosing a focus dataset for your lens, you set the scope of source data to be considered for that lens. Which source data files actually get processed by a lens build depends on other characteristics of the lens, such as whether the lens has been built before or whether any lens filters exclude source data files.

Understand Incremental vs Full Lens Builds

Whenever possible, Platfora tries to conserve processing resources on the Hadoop cluster by only processing the source data files it needs for the lens. If a source data file has already been processed once for a particular lens definition, Platfora can reuse that work from past lens builds and not process that file again. However, if the underlying data has changed in some way, Platfora must re-process all of the source data files in order to ensure data accuracy.

This section describes how a lens build determines if it needs to process all of the source data (full lens build) or just the new source data that was added since the last time the lens was built (incremental lens build). Incremental lens builds are more desirable because they are faster and use fewer resources.

When you first create a lens, Platfora builds the lens data using a full build. During the build, Platfora stores a record of the build inputs. Then, as it manages that lens, Platfora can determine if any build inputs changed. Platfora rebuilds a lens whenever a user manually fires a build by pressing a lens's Build button or a scheduled build is fired by the system.

Whenever a build is fired, Platfora first compares the last build inputs to the new build inputs. If nothing changed between the two builds, Platfora reuses the results of the last build. If there are changes and those changes fall within certain conditions, Platfora does an incremental lens build. If it cannot do an incremental build, Platfora does a full rebuild.

Platfora defaults to incremental builds because they are faster than full rebuilds. You can optimize lens build performance in your environment by understanding the conditions that determine whether a lens build is full or incremental.

An incremental build appends new data to an existing lens without changing any previously built data. So, Platfora can only incrementally build changes that add, but do not modify or delete, old build inputs. For this reason, Platfora can only incrementally build lenses that rely solely on HDFS or Hive data sources.

HDFS directories and Hive partitions permit incremental builds because they support wildcard configurations. Wildcard configurations typically acquire new data by pattern matching incoming data; they do not modify or delete existing data. An incremental lens build retrieves the newly added data, processes it, and appends it to the old data in Platfora. The old data is not changed.

However, a Hive or HDFS data source does not guarantee that a lens will always build incrementally. Under certain conditions, Platfora always builds the full lens. When any of the following happens between the last build and a new build, Platfora does a full lens build:

• The lens has a LAST X DAYS filter and the last build occurred outside the filter's parameters.

• The lens is modified. For example, a user changes the description or adds a field.

• The dataset is modified. For example, a user adds a field to the dataset.

• A referenced dimension dataset changes in any way.

• A data source is modified. For example, a file is modified or a file is deleted from an HDFS directory.

Additionally, Platfora builds the full lens under the following conditions:

• The lens includes event series processing fields. Due to the nature of pattern matching logic, lenses with ESP fields require full lens builds that scan all of a dataset's input data.

• The HDFS Delegated Authorization feature is enabled and HDFS is not configured to use ACLs (access control lists) for file permissions.

A full lens build can be resource intensive and can take a long time, which is why Platfora always tries to do an incremental build if it can. You can increase the chances that Platfora does an incremental build by relaxing the build behavior for dimension data.

Understand Input Partitioning Fields

An input partitioning field is a field in a dataset that contains information about how to locate the source files in the remote file system. Defining a filter on these special fields eliminates files from lens build processing as the very first step of the lens build process, as compared to other lens filters which are evaluated later in the process. Defining a lens filter on an input partitioning field is a way to reduce the amount of source data that is scanned by a lens build.

For Hive data sources, partitioning fields are defined on the data source by the Hive administrator. Hive partitioning fields appear in the PARTITIONED BY clause of the Hive table definition. Hive administrators use partitioning fields to organize the Hive data into separate files in the Hadoop file system. The goal of Hive table partitioning is to improve query performance by keeping records together for faster access.

For HDFS, Google Storage, or S3 data sources, Platfora administrators can define a partitioning field when they create a dataset. A partitioning field for HDFS or S3 is any computed field that uses a FILE_NAME() or FILE_PATH() function. File or directory path partitioning is useful when the source data that comprises a dataset comes from multiple files, and there is useful information in the directory or file names themselves. For example, useful path information includes dates or server names.

Not all datasets will have partitioning fields. If there are partitioning fields available, they are listed on the left side of the lens builder under the Input Data section. Additionally, a special icon is displayed next to each partitioning field.

Platfora applies filters on input partitioning fields as the first step of a lens build. Then, Platfora computes any event series processing computed fields. Any other lens field filters are applied later in the build process.

Event series processing computed fields are those that are defined using a PARTITION statement. The interaction of input partitioning fields and event series processing is important to understand if you are using event series processing computed fields.


You can apply a filter on a partitioning field in a SQL dataset. You can filter on a partitioning field in either of the following locations:

• Lens definition. Define a lens on a SQL dataset, and apply a lens filter to a partitioning field. For SQL datasets, the Platfora web application does not indicate which fields are input partitioning fields, so you need to know which fields these are, as shown in the base dataset the SQL dataset is based on. Note that Platfora reads all source files and then filters the source data when a lens filter expression uses the LIKE syntax.

• HiveQL statement in the SQL dataset definition. Use a WHERE clause in the HiveQL statement used in the SQL dataset to filter on a partitioning field. For example, you could use the following WHERE clause: WHERE myfilenamefield = 'Q42015.gz'

Platfora recommends filtering on a partitioning field using a lens filter instead of using a WHERE clause in the HiveQL statement whenever possible.

To ensure that Platfora only reads the appropriate source files, verify that the partitioning field is brought into the SQL dataset unchanged from the base dataset. For example, in the HiveQL statement, don't change the field name of the partitioning field.
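For example, a minimal HiveQL sketch of such a SQL dataset definition (the table and field names here are illustrative, not from Platfora):

    -- The partitioning field (sale_date) is selected unchanged from the base
    -- dataset, and the filter uses an exact match rather than LIKE so that
    -- source files can be eliminated before processing.
    SELECT order_id, total_amount, sale_date
    FROM web_sales
    WHERE sale_date = '2015-10-01';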

Understand Progressive Lens Builds

Very large lens builds can take a lot of time and resources, and as a result, money. Platfora uses the progressive lens build feature to help your organization save time and money rebuilding lenses after a failure.

When a lens builds progressively, Platfora divides the data in the focus dataset into increments and then builds the lens one increment at a time. After a failure, the next lens build reuses any increments that were successfully built.

By default, a lens is built progressively when the input data size is 500 GB or greater. System administrators can change this value using the platfora.builder.progressive.threshold configuration property. By default, a progressive lens build uses up to five increments. This can be configured using the platfora.builder.progressive.max.increments configuration property.


When a lens builds progressively, Platfora groups the lens build steps on the System > Activities page by increment.

Understand How Datasets are Joined

This topic explains how datasets are joined together during the lens build process, and what to expect in the resulting lens data. Joins only occur for datasets that have references to other datasets, and only when fields from the referenced datasets are included in the lens definition.

About Focus Datasets and Referenced Datasets

When you build a lens, you must choose one dataset as the starting point. This is called the focus dataset for the lens. The focus dataset may have references to other datasets, allowing you to choose dimension fields from both the focus dataset and the referenced datasets as well. If a lens includes fields from multiple datasets, then all of the selected fields are combined into one consolidated row in the lens output. This consolidation of fields is done by joining together the rows of the various datasets on the fields that they share in common.


The Default Join Behavior: (Left) Outer Joins

Consider a lens that includes fields from both the focus dataset and a referenced dataset. When Platfora builds this lens, it does an OUTER JOIN between the focus dataset and any referenced datasets. The OUTER JOIN operation compares rows in the focus dataset to related rows in the referenced datasets.

If a row in the focus dataset cannot join to a row in the referenced dataset, then Platfora still includes these unjoined focus rows in the lens results. However, the values for the referenced fields that did not join are treated as NULL values. These NULL values are then replaced with default values and joined to the consolidated focus row. Platfora notifies you with an "unjoined foreign keys" warning whenever there is a focus row that did not join.

During lens processing, all rows that have duplicate primary key values are dropped in the referenced dataset before joining. For example, if there are two rows that have the same primary key value, both rows will be discarded before joining to the fact dataset. Any referenced dimension fields for those discarded rows will use the default values instead. Platfora notifies you with a "duplicate primary keys" warning whenever there are two or more rows in the referenced dataset with the same primary key value that did not join.
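In rough SQL terms, the default join behaves like the following sketch (dataset and field names are illustrative, and Platfora substitutes its configured default values rather than a fixed literal):

    -- Unjoined focus rows are kept; their referenced fields come back NULL
    -- and are then replaced with default values.
    SELECT f.order_id,
           f.total_amount,
           COALESCE(s.region, 'DEFAULT') AS region
    FROM sales f
    LEFT OUTER JOIN stores s
      ON f.store_id = s.store_id;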

How Lens Filters can Change Join Behavior to (Right) Inner Joins

If a lens filter is on a field from the focus dataset, then the default join behavior is still an OUTER JOIN. The focus dataset rows are used as the basis of the join.


However, if the lens filter is on a field from a referenced dataset, the lens build process uses an INNER JOIN instead. The referenced dataset rows are used as the basis for comparison. This means that focus rows can potentially be excluded from the lens entirely.

Before doing the join, the lens build first filters the rows of the referenced dataset and discards any rows that don't match the filter criteria. Then, the build joins the filtered referenced dataset to the focus dataset.

When it uses an INNER JOIN, Platfora entirely excludes all unjoined rows from the lens results. Because the lens build performs the filter first and excludes unjoined rows, an INNER JOIN can return fewer focus rows than you may expect.
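In rough SQL terms, a lens filter on a referenced dataset field behaves like this sketch (names are illustrative):

    -- The referenced dataset is filtered first; focus rows that do not join
    -- to a surviving referenced row are excluded from the lens entirely.
    SELECT f.order_id,
           f.total_amount,
           s.region
    FROM sales f
    JOIN (SELECT * FROM stores WHERE region = 'West') s
      ON f.store_id = s.store_id;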

Create a Lens

A lens is always defined from the focal point of a single dataset in the data catalog. Once you have located a dataset that has the data you need, first check whether there are any existing lenses that you can use. If not, click Create Lens on the dataset details page to define and build a new lens from that dataset.


To create a new lens, you must have the Analyst system role or above. You must have data access permissions on the source data and at least Define Lens from Dataset object permissions on the focus dataset, as well as on any datasets that are included in the lens by reference.

1. Go to the Platfora Data Catalog.

2. Find the dataset you want, and open it.

3. In the dataset workspace, select the Lenses tab on the Properties page.

4. Click Add a Lens.

5. In the lens builder panel, define your lens and choose the data fields you want to analyze.

a) Name your lens. Choose the name carefully - lenses cannot be renamed.

b) Choose the lens type. An aggregate lens is the default lens type, but you can choose to build an event series lens if your datasets are modeled in a certain way.

c) Choose lens fields. The types of fields you choose depend on the type of lens you are building.

d) (Optional) Define lens filters. Filters limit the scope of data being requested.

e) (Optional) Allow ad-hoc segments. Choose whether or not to allow vizboard users to create segments based on members in a particular referenced dataset.

6. Save and build the lens.

Name a Lens

The first step of defining a lens is to give it a meaningful name. The lens name should help users understand what kind of data they can find in the lens, so they can decide if it will meet their analysis needs. Choose the lens name carefully - you cannot rename a lens after it has been saved or built for the first time.

You won't be able to save or build a lens until you give it a name. The lens name must be unique - the name can't be the same as any existing lens, dataset, or segment in Platfora. You can't change the lens name after you save the lens.

It is also a good idea to give the lens a description to help users understand what data is in the lens. You can always edit the description later.

Choose the Lens Type

There are two types of lenses you can create in Platfora: an Aggregate Lens or an Event Series Lens. The type of lens you can choose depends on the underlying characteristics of the dataset you pick as the focus of your lens. The type of lens you build also determines what kinds of visualizations you can create and what kinds of analyses you can perform when using the lens in a vizboard.

1. Aggregate lenses can be built from any dataset. Event series lenses can only be built from datasets that meet certain data modeling requirements. If your dataset does not meet the requirements for an event series lens, you will not see it as a choice.


2. When building an aggregate lens, you can choose any measure or dimension field from the current dataset. You can also choose additional dimension fields from any datasets that are referenced from the current dataset.

3. To build an event series lens, the dataset must have one or more event references created in it. Events are a special kind of reverse-reference that includes timestamp information. Events do not apply to aggregate lenses, only to event series lenses.

When building an event series lens, you can choose dimension fields from the focus dataset or any related event dataset. Measure fields are not always applicable to event series analysis, since the data in an event series lens is not aggregated.

About Aggregate Lenses

An aggregate lens can be built from any dataset. There are no special data modeling requirements to build an aggregate lens. Aggregate lenses contain aggregated measure data grouped by the various dimension fields you select from the dataset. Choose this lens type when you want to do ad hoc data analysis.

An aggregate lens contains a selection of measure and dimension fields chosen from the focal point of a single fact dataset. A completed or built lens can be thought of as a table that contains aggregated measure data values grouped by the selected dimension values.

For example, suppose you had the following simple dataset containing 6 rows:

id   date        customer    product  quantity  unit price  total amount
1    Jan 1 2013  smith       tea      2         1.00        2.00
2    Jan 1 2013  hutchinson  coffee   1         1.00        1.00
3    Jan 2 2013  smith       coffee   1         1.00        1.00
4    Jan 2 2013  smith       coffee   3         1.00        3.00
5    Jan 2 2013  smith       tea      1         1.00        1.00
6    Jan 3 2013  hutchinson  tea      1         1.00        1.00

In Platfora, a measure is always aggregated data. So in the example above, the field total amount would only be considered a measure if an aggregate function, such as SUM, were applied to that field.

A dimension is always used to group the aggregated measure data. Suppose we chose the product field as a dimension in our lens. There would be two groups in this case: coffee and tea.


If our lens only contained that one measure (sum of total amount) and that one dimension (product), then the data in the lens would look something like this:

dimension = product    measure = total amount (Sum)
tea                    4.00
coffee                 5.00

Suppose we added one more measure (sum of quantity) and one more dimension (customer) to our lens. The measure values are then calculated for each combination of dimension values. In this case, the data in the lens would look something like this, with one row per product, per customer, and per product+customer combination:

dimension value(s)    total amount (Sum)    quantity (Sum)
tea                   4.00                  4
coffee                5.00                  5
smith                 7.00                  7
hutchinson            2.00                  2
smith, tea            3.00                  3
smith, coffee         4.00                  4
hutchinson, tea       1.00                  1
hutchinson, coffee    1.00                  1
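In rough HiveQL terms, this lens is comparable to the following aggregation (a sketch only; Platfora produces this result with its own MapReduce jobs, not with this query, and the table name is illustrative):

    -- One aggregated row per product, per customer, and per
    -- product/customer combination.
    SELECT product,
           customer,
           SUM(total_amount) AS total_amount_sum,
           SUM(quantity)     AS quantity_sum
    FROM sales
    GROUP BY product, customer
    GROUPING SETS ((product), (customer), (product, customer));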

About Event Series Lenses

An event series lens can only be built from dimension datasets that have at least one event reference defined in them. It contains non-aggregated fact records, partitioned by the key of the focus dataset, sorted by the time an event occurred. Choose this lens type if you want to do time series analysis, such as funnel paths.

To build an event series lens, the dataset you choose as the focus of your lens must meet the following data model requirements:

• The dataset must have a primary key.

• The dataset must have at least one event reference modeled in it. Events are special reverse-references that associate a dimension dataset with a fact dataset, and designate a timestamp field for ordering of the fact records.


An event series lens contains a selection of dimension fields chosen from the focal point of a single dimension dataset, and from any event datasets associated to that dimension dataset.

Measure fields are not always applicable to event series lenses, since the data is not aggregated at lens build time. If you do decide to add measure fields, you can only choose measures from the event datasets (not from the focus dataset). These measures will be added to the lens, but will not always be visible in the vizboard, depending on the type of analysis you choose. For example, measures are hidden in the vizboard if you choose to do funnel analysis.

A completed or built lens can be thought of as a table that contains individual event records partitioned by the primary key of the dimension dataset, and ordered by a timestamp field.

An event series lens can contain records from multiple event datasets, as long as the event references have been modeled in the dimension dataset.

For example, suppose you had a dimension dataset that contained these 2 user records. This dataset has a primary key (a user_id field that is unique for each user record in the dataset):

user_id  name
A        smith
B        hutchinson

This user dataset contains a purchase event reference that points to a dataset containing these 6 purchase event records:

transaction  date        user_id  product  quantity  unit price  total amount
1            Jan 1 2014  A        tea      2         1.00        2.00
2            Jan 1 2014  B        coffee   1         1.00        1.00
3            Jan 2 2014  A        coffee   1         1.00        1.00
4            Jan 3 2014  A        coffee   3         1.00        3.00
5            Jan 4 2014  A        tea      1         1.00        1.00
6            Jan 3 2014  B        tea      1         1.00        1.00


In an event series lens, individual event records are partitioned by the primary key of the dimension dataset and sorted by time. If our event series lens contained one measure (sum of total amount) and one dimension (product), then the data in the lens would look something like this:

user_id  date        product  total amount
A        Jan 1 2014  tea      2.00
A        Jan 2 2014  coffee   1.00
A        Jan 3 2014  coffee   3.00
A        Jan 4 2014  tea      1.00
B        Jan 1 2014  coffee   1.00
B        Jan 3 2014  tea      1.00

Notice that there are a couple of differences between event series lens data and aggregate lens data:

• The key field (user_id) and timestamp field (date) of the event are automatically included in the lens.

• Measure data is not pre-aggregated. Instead, individual event records are partitioned by the key field and ordered by time.

Having the lens data structured in this way allows analysts to create special event series viz types in the vizboard. Event series lenses allow you to analyze sequences of events, including finding patterns between multiple types of events (purchases and returns, for example).
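In rough HiveQL terms, the organization of event series lens data resembles this sketch (a sketch only; the table name is illustrative):

    -- Event records are grouped by the dimension key and ordered by event
    -- time within each group; nothing is aggregated.
    SELECT user_id, transaction_date, product, total_amount
    FROM purchases
    DISTRIBUTE BY user_id
    SORT BY user_id, transaction_date;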

Choose Lens Fields

Choosing fields for a lens depends on the lens type you pick (Aggregate Lens or Event Series Lens), and the type of analysis you plan to do. Aggregate lenses need both measure fields (aggregated variable data) and dimension fields (categorical data). Event series lenses only need dimension fields; measures are optional and not always applicable to event series analysis.

About Lens Field Types

Fields are categorized into two basic roles: measures and dimensions. Measure fields are the quantitative data. Dimension fields are the categorical data. A field also has an associated data type, which describes the types of values the field contains (STRING, DATETIME, INTEGER, LONG, FIXED, or DOUBLE).


Fields are grouped by the dataset they originate from. As you choose fields for your lens, you will notice that each field has an icon to denote what kind of field it is, and where it originates from.

Measure (Numeric): Measure fields are quantitative data that have an aggregation applied, such as SUM or AVG. Measures always produce aggregated values in Platfora. Measure values are always a numeric data type (INTEGER, LONG, FIXED, or DOUBLE) and are always the result of an aggregation. Every aggregate lens must have at least one measure. The default measure is Total Records (a count of the records in the dataset). Measures are not applicable to event series lenses and funnel analysis visualizations.

Datetime Measure: Datetime measure fields are a special variety of measure fields. They are datetime data that have either the MIN or MAX aggregate functions applied to them. Datetime measure values are always the DATETIME data type. Datetime measures are not applicable to event series lenses and funnel analysis visualizations.

Categorical Dimension: Dimension fields are used to filter dataset records, group measure data (in an aggregate lens), or define set conditions (in an event series lens). Categorical dimension fields contain STRING type data.

Numeric Dimension: Dimension fields are used to filter dataset records, group measure data (in an aggregate lens), or define set conditions (in an event series lens). Numeric dimension fields contain INTEGER, LONG, FIXED, or DOUBLE type data. You can apply an aggregate function to a numeric dimension field to turn it into a measure.

Date Dimension: Dimension fields are used to filter dataset records and group measure data. Date dimension fields contain DATETIME type data. Every datetime field also auto-creates a reference to Platfora's built-in Date and Time datasets. These date and time references allow you to analyze the time-based data at different granularities (week, day, hour, and so on).

Location Field: Location fields are a special kind of dimension field used only in geo map visualizations. They are comprised of a set of geo coordinates (latitude, longitude) and optionally a label name.

Current Dataset Fields: Fields that are within the currently selected dataset are grouped together at the top of the lens field list.

References: A reference groups fields together that come from another dataset. A reference joins two datasets together on a common key. You can select dimension fields from any dataset; however, you can only choose measure fields from the current dataset (if building an aggregate lens).

Geo References: A geo reference is similar to a regular reference. The difference is that geo references are used specifically for the purpose of linking to datasets containing location fields.

Events: An event is like a reference, except that the direction of the join is reversed. An event groups fields together that come from another dataset containing fact records that are associated with a point in time. Event fields are only applicable to event series lenses.

Segments: Platfora groups together all segment fields that are based on members of a referenced dimension dataset. You can select segment fields originally defined in any lens as long as the segment is based on members in the referenced dimension dataset.

Segment Field: A segment is a special type of dimension field that groups together members of a population that meet some defined common criteria. A segment is based on members of a dimension dataset (such as customers) that have some behavior in common (such as purchasing a particular product). Any segment defined on a particular dimension dataset is available as a segment field in any lens that references that dataset. Segments are created in a viz based on the lens used in the viz. After creating a segment, Platfora creates a special lens build to populate the segment members. However, after segments are defined, you can optionally choose to include a segment field in any lens that references that dimension dataset. For more information, see Allow Ad-Hoc Segments.


Choose Fields for Aggregate Lenses

Every aggregate lens must have at least one measure field and one dimension field to be a valid lens. Choose only the fields you need to do your analysis. You can always come back and modify the lens later if you decide you need other fields. You can choose fields from the currently selected dataset, as well as from any datasets it references.

1. Click Add+ or Add- to add or remove all of the fields grouped under a dataset or reference. Note that this does not apply to nested references and events. You must select fields from each referenced dataset independently.

2. Click the plus icon to add a field to your lens. The plus sign means the field is not in the lens.

3. Click the minus icon to remove the field from your lens. The minus sign means the field is in the lens.

4. Open the quick measure selector to confirm the measure aggregations you have chosen on a field. Original Value (the default) means the field will be included in the lens as a dimension.

5. Expand references to find additional dimension fields.

6. Use the Fields added to lens tab to confirm the field selections you have made.


Choose Measure Fields (Aggregate Lens)

Every aggregate lens needs at least one measure. In Platfora, measure fields are always the result of an aggregate calculation. If you have metric fields in your dataset that you want to use as the basis for quantitative analysis, you must decide how to aggregate those metrics before you build a lens.

1. Some measures are pre-defined in the dataset. Pre-defined measures are always at the top of the dataset field list.

2. Other non-measure fields can be converted into a measure by choosing additional aggregation types in the lens definition.


Define Quick Measures

A quick measure is an aggregation applied to a dimension field to turn it into a measure. You can add quick measures to your lens based on any dimension field in the current focus dataset (for an aggregate lens) or event dataset (for an event series lens).

1. First check if the dataset has pre-defined measures that meet your needs.

Pre-defined measure fields are always listed at the top of the dataset. These measures are aggregated computed fields that have already been defined in the dataset. Clicking on a pre-defined measure will show the aggregate expression used to define the measure.

2. Find the field that you want to use as a measure and add it to your lens definition.

3. Click the gear icon to open the quick measure selector.

4. Choose the measure aggregations you want to apply to that field:

• Sum (total) is available for numeric type dimension fields only.

• Avg (average) is available for numeric type dimension fields only.

• Distinct (a count of the distinct values in a column) is available for all field types.

• Max (highest value) is available for numeric type dimension fields only.

• Min (lowest value) is available for numeric type dimension fields only.

Each selection will create a new measure in the lens when it is built. Quick measure fields appear in the built lens with a name such as field_name(Avg).


5. Original Value also keeps the field in the lens as a dimension (grouping column), as well as aggregating its values for use as a measure. For fields that have lots of unique values, it is probably best to deselect this option if building an aggregate lens.

Choose the Default Measure

The default lens measure is automatically added to the lens panel for any visualizations created from this lens. This way, if you add one dimension field to a drop zone for your viz, a visualization is automatically displayed because the default measure is automatically placed in the other drop zone. This allows a default chart to be shown in the vizboard immediately after the data analyst chooses a lens for their viz.

By default, the default measure is the Total Records field. If a lens does not have a default measure defined, Platfora uses Total Records as the default measure. If Total Records doesn't exist, Platfora uses a different configured measure field, or a quick measure if no measure field is defined.

1. Select the measure that you want to designate as the default lens measure. Only pre-defined measures can be used. You cannot designate quick measures as the default lens measure.

2. Make sure the measure field is added to the lens definition.

3. Click Default to This Measure.


Choose Dimension Fields (Aggregate Lens)

Every aggregate lens needs at least one dimension field. Dimension fields are used to group and filter measure data in an aggregate lens. You can add dimension fields from the currently selected focus dataset or any of its referenced datasets.

1. Dimension fields are denoted by different icons. See About Lens Field Types for more information.

2. Click Add+ or Add- to add or remove all of the dimension fields grouped under a particular dataset or reference. Add+ and Add- do not apply to nested references. You must select fields from each referenced dataset independently.

3. Expand references to see the dimension fields available in referenced datasets.

4. Click the plus icon to add a dimension field to your lens.


5. Click a dimension field to see the details about it. The following information is available about a field, depending on its field type and whether or not the dataset has been profiled. Data heuristic information is only applicable to aggregate lenses.

Field Type: Either Base, Computed, or Measure. Base field values come directly from the source data. Computed fields and measure values have been transformed or processed in some way.

Field Name: The name of the field as defined in the dataset.

Expression: If it is a computed field, the expression used to derive the field values.

Description: The description of the field that has been added to the Platfora dataset definition.

Example Data: Shows a sampling of the field values from 20 dataset rows. This is not available for certain types of computed fields, such as measures, event series computed fields, or computed fields that reference other datasets.

Data Type: The data type: STRING, DATETIME, INTEGER, LONG, FIXED, or DOUBLE.

Default Value: The default value that will be substituted for NULL dimension values when the lens is built. If n/a, then Platfora will use the defaults of January 1, 1970 for datetimes, NULL for strings, and 0 for numeric data types.

Estimated Distinct Values: If the dataset has been profiled, this is an estimate of how many unique values the field has. This information is only applicable to aggregate lenses.

Data Distribution: If the dataset has been profiled, this shows the top 20 values for that field and an estimation of how the top values are distributed across all the rows of the dataset. This information is only applicable to aggregate lenses.

Path: If it is a field from a referenced dataset, shows the dataset name, reference name, and field name.

Choose Segment Fields (Aggregate Lens)

Segments are members of a referenced dataset that have some behavior in common. After a segment is created in a visualization, the segment field is available to include in any lens that references that dataset. You might want to include a segment field in a lens if it is commonly used in visualizations or if you want to improve viz query performance.

1. Expand references to see the segments available in referenced datasets.

2. Expand Segments to see the segments available for a particular referenced dataset.

3. Segment fields are denoted by a cut-out cube icon.

4. Click the plus icon to add a segment field to your lens.

5. Click Add+ or Add- to add or remove all of the segment fields grouped under a particular referenced dataset. Add+ and Add- do not apply to nested references. You must select fields from each referenced dataset independently.


6. Click a segment field to see the details about it. The following information is available about a segment field.

Field Type: Always Segment.

Field Name: The name of the segment field as defined in the segment.

Built On: The date of the special segment lens build that populated the current members of the segment.

Segment Of: The referenced dimension dataset of which the segment values are members. This dataset matches the referenced dataset under which the segment field is located.

Occurring in Dataset: The fact dataset that includes the behaviors the segment members have in common. This dataset may be the focus dataset in the current lens, or it may be from a different dataset that references this dimension dataset.

Origin Lens: The lens used in the vizboard in which the segment was originally created.

Segment Conditions: The conditions defined in the segment that determine the segment members.

"IN" Value Label: The value labels for records that are members of the segment.

"NOT IN" Value Label: The value labels for records that are not members of the segment.

Selected Members: The number of segment members out of the total number of records in the referenced dataset.

Choose Fields for Event Series Lenses

For an event series lens, field selections are mostly dimension and timestamp fields. You can choose dimension fields from the currently selected dataset, and any fields from the event datasets it references.


Measure fields (aggregated variables) are not applicable to event series analysis, since data is not aggregated in an event series lens.

1. Click Add+ or Add- to add or remove all of the fields grouped under a dataset, reference, or event reference. Note that this does not apply to nested references and events. You must select fields from each referenced dataset independently.

2. Click the plus icon to add a field to your lens. The plus sign means the field is not in the lens.

3. Click the minus icon to remove the field from your lens. The minus sign means the field is in the lens.

4. Open the quick measure selector to confirm the selections are appropriate for an event series lens.

In an event series lens, aggregated measures (such as SUM or AVG) are not applicable. For example, if you want to do funnel analysis on some metric of the dataset, make sure that Original Value (the default) is selected. This means the field will be included in the lens as a dimension.

5. Expand event references (or regular references) to find additional dimension fields.

6. Use the Fields added to lens tab to confirm the field selections you have made.

Timestamp Fields and Event Series Lenses

Timestamp fields have a special purpose in an event series lens. They are used to order all fact records included in the lens, including fact records coming from multiple datasets. Event series lenses have a global Timestamp field that applies to all event records included in the lens. There are also global Timestamp Date and Timestamp Time references, which can be used to filter records on different granularities of date and time.

Dataset records are not aggregated in an event series lens. Records are partitioned (or grouped) by the key of the focus dataset and ordered by a datetime field in the event dataset(s).

For example, suppose you built an event series lens based on a customer dataset that had event references to a purchases dataset and a returns dataset. The lens would partition the event records by customer and order both types of events (purchases and returns) by the timestamp of the event record.

The global Timestamp field and the Timestamp Date and Timestamp Time references apply to all event records included in the lens. This is especially relevant if the lens includes links to multiple event datasets.

Because event series lenses order records by a designated event time (represented by the global Timestamp), other references to date and time may or may not be relevant to your event series analysis.

For example, suppose you were building an event series lens based on customers that contained both purchase and return events. The global Timestamp represents the purchase timestamp or the return timestamp of the corresponding event record. As an attribute of a customer, suppose you also had the date the customer first registered on your web site. This customer registration date may be useful for your analysis if you want to group or filter customers by how long they have been customers. For example, you might want to know: how does the purchase behavior of new customers differ from that of customers who registered over a year ago?


Measure Fields and Event Series Lenses

In Platfora, measure fields are always the result of an aggregate calculation. Since event series lenses do not contain aggregated data, measure fields are not always applicable to event series analysis. Measure fields may be included in an event series lens; however, they may not show up in the vizboard (depending on the type of analysis you choose).

1. For event series lenses, you can only choose measures from a referenced event dataset, not from the currently selected dataset.

2. Pre-defined measures are listed at the beginning of an event dataset.

If you add a measure to an event series lens, the aggregation will not be calculated at lens build time. Measure fields that are added to the lens will not show up in the Funnel viz type in Vizboards.

Even though measure fields are not needed for event series analysis, the lens builder still requires every lens to have at least one measure. Open an event dataset and choose any measure so you won't get a No measure fields error when you go to build the lens.

3. For event series lenses, quick measures are not applicable. If you want to use a field for funnel analysis, make sure that Original Value is selected and any aggregations are unselected. This adds the field to the lens as a dimension.

Define Lens Filters

One way to limit the size of a lens is to define a filter to constrain the number of rows pulled in from the data source. You can define filters only on dimension fields, with one filter condition per field. Filters are evaluated independently during a lens build, so the order in which they are added to the lens does not matter.

1. Select the dimension field to filter on.

2. Click the filter icon to the right of the field name. You can only define one filter per field.

3. Define the Filter expression.

Filter expressions are always Boolean expressions, meaning they must evaluate to either true or false.

Note that the selected field name serves as the first argument of the expression, followed by a comparison operator, and then the comparison value. The comparison value must be of the same data type as the field you are filtering on.

Some examples of lens filter expressions:

BETWEEN 2012-06-01 AND 2012-07-31

LAST 7 DAYS

LIKE("Plat*")

IN("Larry","Curly","Moe")

NOT IN ("Saturday","Sunday")


< 50.00

>= 21

BETWEEN 2009 AND 2012

IS NOT NULL
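
These expressions are evaluated against the field selected in step 1. For example, if the selected field were a hypothetical Weekday string field, the expression NOT IN ("Saturday","Sunday") would keep only rows whose Weekday value is a weekday.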

4. Click Save.

5. Make sure the filter is added in the Filters panel.

Lens Filters on DATETIME Type Fields

This section contains special considerations you must make when filtering on datetime type values.

Filter conditions on DATETIME type fields must be in the format yyyy-MM-ddTHH:mm:ss.SSSZ (or any shortened version of this format) without enclosing quotes or any other punctuation. If the date value is in string format rather than a datetime format, the value must be enclosed in quotes.
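
For illustration, assuming a shortened version simply drops trailing components of the format, each of the following is a valid comparison value (components that are left out are treated as zero, as described below):

2015-06-01T00:00:00.000Z

2015-06-01

2015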

If the lens filter expression is a shortened version of the full format, then any values not included are assigned a value of zero (0). For example, the expression BETWEEN 2015-06-01 AND 2015-07-31 is equivalent to the following expression:

BETWEEN 2015-06-01T00:00:00.000Z AND 2015-07-31T00:00:00.000Z

The expression above does not include any values from July 31, 2015. To include values from July 31, 2015, use BETWEEN 2015-06-01 AND 2015-08-01.

Date-based filters can do absolute or relative comparisons.

An absolute date comparison specifies a specific boundary such as:

>= 2015-01-01

The filter expression specifies a range of dates using particular dates in addition to the allowed comparison operators.

BETWEEN 2015-06-01 AND 2015-08-01

The expression above includes all values in the months of June and July, and no values from August.

When specifying a range of dates, the earlier date should always come first.

Relative comparisons are always relative to the current date. Relative date filters use the following format:

LAST <integer> DAYS

For example:

LAST 7 DAYS

LAST 0 DAYS

When using a relative date filter, the filter includes all data from the specified period up until the lens build start time. The current day is defined as the day in Coordinated Universal Time (UTC) when the lens build began. Therefore, the expression LAST 0 DAYS includes data from the current day only, and the expression LAST 1 DAYS includes data from the current day and the previous day.


You can use a relative date filter together with a lens build schedule to define a rolling time window. For example, you could define a lens filter expression of LAST 7 DAYS and schedule the lens to build nightly. This way, the lens always contains the previous week's worth of data.
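
A minimal sketch of this pattern, assuming a hypothetical OrderDate dimension field:

1. In the lens builder, add a filter on OrderDate with the expression LAST 7 DAYS.

2. Define a Day of Week schedule rule that builds the lens every day (see Schedule Lens Builds).

Because the relative filter is re-evaluated at each build, every nightly build covers the current day plus the seven days before it.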

Lens Filters on Input Partition Fields

An input partitioning field is a field in a dataset that contains information about how to locate the source files in the remote file system. Defining a filter on these special fields eliminates source data files as the very first step of the lens build process.

Not all datasets will have input partitioning fields. If there are partitioning fields available, they are listed on the left side of the lens builder under the Input Data section. Additionally, a special icon is displayed next to each partitioning field. You should look for these special fields when you build a lens. Adding a filter on these fields reduces the amount of source data to be scanned and processed during a lens build.

Click Add Filter next to a partitioning field to define a filter on that field and reduce the input size during a lens build.

See Understand Input Partitioning Fields for more information about these special fields and how they affect lens build processing.

Troubleshoot Lens Filter Expressions

Invalid lens filter expressions don't always result in an error in the web application. Some invalid filter expressions are only caught during a lens build and can cause the lens build to fail. This section covers some common lens filter expression mistakes that can cause an error or a lens build failure.


A comparison expression must compare values of the same data type. For example, if you create a filter on a field that is an INTEGER data type, you can't specify a comparison argument that is a STRING.

Lens filters often compare a field value to a literal value. Specifying a literal value correctly depends on its data type (string, numeric, or datetime). For example:

• Date literals must be in the format of yyyy-MM-dd without any enclosing quotation marks or other punctuation.

• String literals are enclosed in double quotes ("). If the string itself contains a quote, it must be escaped by doubling the double quote ("").

When specifying a range of dates, the earlier date should always come first. For example, when using the BETWEEN operator:

use BETWEEN 2013-07-01 AND 2013-07-15 (correct)

not BETWEEN 2013-07-15 AND 2012-07-01 (incorrect)

For relative date filter expressions, the only valid date range keyword is DAY or DAYS. For example:

use LAST 7 DAYS (correct)

not LAST 1 WEEK (incorrect)

Below are some more examples of incorrect lens filter expressions and their corrected versions.

Filtered Field   Lens Filter with Error                  Corrected Lens Filter               What's Wrong?
Date.Year        Date.Year = "2012"                      Date.Year = 2012                    Can't compare an Integer field to a String literal
OrderDate        BETWEEN "2013-07-01" AND "2013-07-15"   BETWEEN 2013-07-01 AND 2013-07-15   Can't compare a Datetime field to a String literal
Date.Year        2012                                    = 2012                              No comparison operator
Title            IN(Mrs,Ms,Miss)                         IN("Mrs","Ms","Miss")               String literals must be quoted
Height           = "60""                                 = "60"""                            Quotes in a string literal must be escaped
Height           LIKE("\d\'(\d)+"")                      LIKE("\d\'(\d)+""")                 Quotes in a regular expression must be escaped
PurchaseDate     LAST 1 WEEK                             LAST 7 DAYS                         Unsupported keyword for relative dates
PurchaseDate     BETWEEN 2013-07-15 AND 2012-07-01       BETWEEN 2013-07-01 AND 2013-07-15   Invalid date range

Allow Ad-Hoc Segments

When the focus dataset in a lens references other datasets, you can choose whether or not to allow vizboard users to create and use ad-hoc segments based on the members of the referenced datasets. You can enable this option per reference in the lens.

A segment is a special type of dimension field that vizboard users can create in a viz to group together members of a population that meet some defined common criteria.

You might want to allow users to create and use ad-hoc segments so they can use segmentation analysis to analyze members of a population and perform side-by-side comparisons.

When ad-hoc segments are enabled for a reference in a lens, vizboard users have the ability to create ad-hoc segments in a viz. Additionally, they can use other segments that have been created for that reference, if they are granted permission on the segment.

Allowing ad-hoc segments may increase the lens size, depending on the cardinality of the referenced dataset. By default, ad-hoc segments are not allowed for references in a lens due to lens size considerations. If the lens already includes the primary key from the referenced dataset, allowing ad-hoc segments for that reference doesn't significantly increase the lens size.

After a segment has been created, you can choose to include the segment field in the lens. Segment fields included in a lens perform faster in a viz than the equivalent ad-hoc segment. For more information on including a segment field in a lens, see Choose Segment Fields (Aggregate Lens).


To allow vizboard users to create and use segments based on members of a particular referenced dataset, click Ad-Hoc for that reference in the lens builder.

Estimate Lens Size

The size of an aggregate lens is determined by how much source data you request (the number of input rows), the number of dimension fields you select, and the cardinality (or number of unique values) of the dimension fields you select. Platfora can help estimate the size of a lens by profiling the data in the dataset.

About Dataset Profiles

Dataset profiling takes a sampling of rows (50,000 by default) to determine the characteristics of the data, such as the number of distinct values per field, the distribution of values, and the size of the various fields in the dataset.

Profiling a dataset runs a series of MapReduce jobs in Hadoop, and builds a special purpose lens called a profile lens. This lens cannot be opened or used in a vizboard like a regular lens. The sole purpose of the profile lens is to scan the source data and capture the data heuristics. These heuristics are then used to estimate lens output size in the lens builder.

Having more information about the source data can guide users to make better choices when they create a lens, which can reduce the overall lens size and build time.

Some good things to know about dataset profiles:

• When you profile a dataset, any of its referenced datasets are profiled too.


• You do not need to rerun dataset profile jobs every time new source data arrives in Hadoop. The data characteristics of a dataset typically do not change that often.

• You must have Define Lens from Dataset object permission on a dataset in order to profile it.

• The time it takes to run a profile job depends on the amount of source data in Hadoop. If there is a lot of data to scan and sample, it can take a while.

• Profile lenses use the naming convention dataset_name profile.

• The data heuristics collected during profiling are only applicable to estimating the output size of an aggregate lens. There is no lens output size estimation for event series lenses.

Profile a Dataset

You can profile a dataset as long as you have data access and Define Lens from Dataset permissions on the dataset. Profiling a dataset initiates a special lens build to sample the source data and collect its data characteristics.

1. Go to the Data Catalog and choose the Datasets tab.

2. Use the List view or the Quick Find to locate the dataset you want to profile.

3. Choose Profile Dataset from the dataset action menu.

4. Click Confirm.

5. To check the status of the profile job, go to the System page and choose the Activities tab.


After the profile job is finished, you can review the results that are collected when you build or edit a lens from that dataset. Dataset profile information can only be seen in the lens builder panel, and is only applicable when building aggregate lenses.

About Lens Size Estimates

Platfora uses the information collected by the dataset profile job to estimate the input and output size of a lens. Dataset profile and estimation information is shown in the lens builder workspace. Lens size estimates dynamically change as you add or remove fields and filters in the lens definition.

Lens estimations are calculated before the lens is built. As a result, they may differ from the actual size of a built lens. Profile data is not available for certain kinds of computed fields that require multi-row or multi-dataset processing, such as aggregated computed fields (measures), event series processing computed fields, or computed fields that refer to fields in other datasets. As a result, estimates may be off by a large factor when there are many of these types of computed fields in a lens. This is especially true for DISTINCT measures.

Lens input size estimations apply to both aggregate and event series lens types. However, lens output size estimates are only applicable to aggregate lenses.

Lens Input Size Estimates

Lens input size estimates reflect how much source data will be scanned in the very first stage of a lens build. Lens input size estimates are available on all datasets (even datasets that have not been profiled yet). Input size estimation is applicable to all lens types.


1. Platfora looks at the source files in the Hadoop data source location to estimate the overall dataset size. You do not need to profile a dataset to estimate this number. If the source data files are compressed, this represents the compressed file size.

2. The size of the input data to a lens build can be reduced if the dataset has input partitioning fields.

The partitioning fields of a dataset are denoted with a special filter icon.

Not all datasets will have partitioning fields, but if they do, the names of those fields will be shown under the Input data size estimate.

3. After applying a lens filter, the lens builder estimates the percentage of total dataset size that will be excluded from the lens based on the filter.

Lens Output Size Estimates

Lens output size refers to how big the final lens will be after it is built. Lens output size estimates are only available for datasets that have been profiled. Output size estimation is only applicable to aggregate lenses (not to event series lenses).

1. Based on the fields you have added to the lens, Platfora will use the profile data to estimate the output size of the lens. The estimate is shown as a range.

2. The relative dimension size helps you identify which dimension fields are the largest (have the most distinct values), and therefore are contributing most to the overall lens size. Hover your mouse over a mark in the distribution chart to see the field it represents.


3. Dimension fields are marked with small/medium/large size icons to denote the cost of adding that field to a lens. Small means the field has fewer than 1,000 unique values. Medium means the field has between 1,000 and 9,999 unique values. Large means the field has 10,000 or more unique values. (An example follows this list.)

4. When you select a dimension field, you can see the data characteristics for that field, such as:

• An estimate of how many unique values the field has.

• The top 20 values for that field and an estimation of how the top values are distributed across all the rows of the dataset.

• A sampling of the field values from 20 dataset rows. This is not available for certain types of computed fields, such as event series computed fields, or computed fields that reference other datasets.
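
For example (field names hypothetical): a Gender field with two unique values and a US State field with about 50 unique values would both be marked Small, while a ZIP Code field with over 40,000 unique values would be marked Large and would contribute heavily to the lens output size.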

Manage Lenses

After a lens is defined, you can edit it, build it, check its status, update/rebuild it (refresh its data), or delete it. Lenses can be managed from the Data Catalog or System page.

Edit a Lens Definition

After working with a lens, you may realize that there are fields that you do not need, or additional fields that you'd like to add to the lens. You can edit the lens definition to add or remove fields as long as you have data access and edit permissions on the lens. Lens definition changes will not be available in a vizboard until the lens is rebuilt.

You can edit a lens definition from the Data Catalog page or from a Vizboard. To edit a lens, you must have at least the Analyst role. You must have Edit object permissions on the lens, Define Lens from Dataset object permissions on the focus dataset and referenced datasets, and data access permissions on the focus dataset and referenced datasets.

Some things to know about changing a lens definition:

• If you remove fields from a lens definition, it may break visualizations that are using that lens and those fields.

• Changing the lens definition requires a full lens rebuild. Rather than just incrementally processing the most recently added source data, all of the source data must be re-processed.

Update Lens Data

Once a lens has been built, you can update it to refresh its data at any time. You may want to update a lens if new data has arrived in the source Hadoop system, or if you have changed the lens definition to include additional fields. Depending on what has changed since the last build, subsequent lens builds are usually much faster.


To rebuild a lens, you must have at least the Analyst role. You must have Edit object permissions on the lens and data access permissions on the dataset, as well as any datasets that are included in the lens by reference.

1. Go to the Data Catalog and choose the Lenses tab.

2. Use the List view or the Quick Find to locate the lens you want to update.

3. Choose Rebuild from the lens action menu.

4. Click Confirm.

5. To check the status of the lens build job, go to the System page and choose the Activities tab.

Delete or Unbuild a Lens

Deleting or unbuilding a lens is a way to free up space in Platfora for lenses that are no longer needed or used. Deleting a lens removes the lens definition as well as the lens data. Unbuilding a lens only removes the lens data (but keeps the lens definition in case you want to rebuild the lens at a later time).


To delete or unbuild a lens, you must have at least the Analyst role and Own object permissions on the lens.

Deleting or unbuilding a lens should be done with care, as doing so will invalidate all of the visualizations that depend on that lens.

1. Go to the Data Catalog and choose the Lenses tab.

2. Use the List view or the Quick Find to locate the lens you want to delete or unbuild.

3. Choose Unbuild or Delete from the lens action menu.

4. Click Confirm or Delete (depending on the action you chose).

When you unbuild or delete a lens, the lens data is immediately cleared from the Platfora servers' disk and memory. However, the built lens data will remain on disk in Hadoop for a period of time (the default is 24 hours) in case you change your mind and decide to rebuild it.


Check the Status of a Lens Build

Depending on the size of the source data requested, it may take a while to process the requested data in Hadoop to build a Platfora lens. The lens is not available for use in visualizations until the lens build is complete. You can check the status of a lens build on the System page.

1. Go to the System page.

2. Go to the Lens Builds tab.

3. In the Activities section, click In Progress.

4. Find the lens in the list and expand it to see the build jobs running in Hadoop.

5. Click on a build status message to see more detailed messages about the tasks of a particular job. This shows the messages from the Hadoop MapReduce JobTracker.

Manage Lens Notifications

You can configure a lens so Platfora sends an email message to users when it detects an anomaly in lens data. Do this by defining a lens notification. You might want to define a lens notification so data analysts know when to view a vizboard to analyze the data further.

When Platfora builds a lens, it queries the lens data to search for data that meets the defined criteria. When it detects that some data meets the criteria, it notifies users with the results of the query it ran against the lens. Analysts might then choose to view a vizboard based on the same lens so they can investigate the data further.

Consider the following rules and guidelines when working with lens notifications:

• Platfora must be configured to connect to an email server using SMTP.


• To define a rule, the lens must be built already.

• You can define multiple notification rules per lens. Each rule results in a different email message.

• The rule can only return results for one measure in the lens. However, it can filter on multiple measures.

• The email message only contains data on the fields chosen in the lens notification rule.

• Platfora sends email notifications after a lens is built, not after the lens notification is created or edited.

• You can disable a lens notification after it has been defined. You might want to do that to temporarily stop notifying users while retaining the logic defined in the notification rule.

Add a Lens Notification Rule

Define a lens notification rule so Platfora sends an email to notify users when the data in a lens build meets specified criteria. You can define multiple rules per lens. You can also define a lens notification rule to notify someone when a lens build fails.

To add a lens notification rule, you must have the permissions to edit a lens definition.

1. Go to the Lenses tab in Data Catalog and find the lens for which you want to define a lens notification rule.

2. Choose Notifications from the lens action menu.

3. In the Notifications dialog, click Add Notification.


A dialog appears where you can define a lens notification rule.

4. Enter a name for this notification rule.

5. Define the query to run against the lens after the lens is built.

You must select one measure field, and you may select zero or more dimension fields. You can group by dimension fields, and filter the scope of rows by either dimension or measure fields. Click the icon to include more fields in the query.

6. Define the criteria in the query results that trigger Platfora to send the email notification.

Select the icon to define additional criteria.

7. Enter one or more email addresses that should receive the notification messages.

Separate multiple email addresses with commas (,).

8. Choose whether the lens notification email should be sent when the criteria defined here are met or not met.

9. Click Save.


10. (Optional) Click the edit button under Failure Notifications to define a rule to send an email when a lens build fails.

a) Enter one or more email addresses that should receive the notification message when the lens build fails.

b) Click OK.

c) Enable the Failure Notifications feature.

d) Click Close.

11. Click Close in the Notifications dialog.

The lens notification rule is created and enabled by default.


Disable a Lens Notification Rule

Disabling a lens notification rule allows you to temporarily stop notifications while retaining the logic defined in the notification rule.

1. Go to the Lenses tab in Data Catalog and find the lens for which you want to disable a lens notification rule.

2. Choose Notifications from the lens action menu.

3. In the Notifications dialog, clear the check box for the notification rule you want to disable.

4. Click Close.

Schedule Lens Builds

You can configure a lens to be built at specific times on specific days. By default, lenses are built on demand, but when you define a schedule for a lens, it is built automatically at the times and days specified in the schedule. You might want to define a schedule so the lens is built nightly, outside of regular working hours.

About Lens Schedules

When you create or edit a schedule, you define one or more rules. A rule is a set of times and days that specify when to build a lens. You might want to create multiple rules so the lens builds at different times on different days. For example, you might want to build the lens at 1:00 a.m. on weekdays, and 8:00 p.m. on weekends.

Lens build start times are determined by the clock on the Platfora server.

Users who have permission to edit a lens can define and edit its schedule.

Lens builds are run by the user who last updated or created the lens build schedule. This is important because the user's lens build size limit applies to the lens build. For example, if a user with a role type that has permission to build unlimited size lenses creates a schedule, and then a user with a role type that has permission to build 100 GB lenses edits the schedule, the lens will only build successfully if it is less than 100 GB.

If a scheduled lens build occurs while the same lens is already being built, the scheduled lens build is skipped and the in-progress lens build continues.

You can define rules with the following day and time repeat patterns:

• Specific days of the week at specific times. For example, every Monday, Tuesday, Wednesday, Thursday, and Friday at 11:00 p.m.

• Specific days of the week at repeated hourly intervals. For example, every Saturday and Sunday, every four hours starting at 12:15 a.m.

• Specific days of the month at specific times. For example, on the first and 15th of every month at 1:00 a.m.

Create a Lens Schedule

You can configure a schedule for a lens so it is built at specific times on specific days. You define the lens schedule when you edit the lens. The schedule is saved whether or not you save your changes on the lens page.

To create a lens schedule, you must have the permissions to edit a lens definition.


1. Go to the Lenses tab in Data Catalog and find the lens for which you want to define a schedule.

2. Choose Schedule from the lens action menu.

3. Click Add Rule.

4. In the Lens Build Schedule dialog, define a rule for the schedule using the Day of Week or Day of Month rules.

5. (Optional) Click Add another rule to define an additional rule for the schedule.

Lenses are only built once if you define multiple overlapping rules for the same time and day.

6. (Optional) Select Export this lens once the build completes if you want to export the lens data in CSV format.

The lens definition export settings must be defined before you can configure a lens build schedule to export data.

The export files will be created in the remote file system location you specify. For example, to export to HDFS, the location URL would look something like this:

hdfs://10.80.231.123:8020/platfora/exports

7. Click OK.

8. Click OK.

View All Scheduled Builds

Users with the System Administrator role can view all scheduled lens builds. Additionally, they can pause (and later resume) a scheduled lens build, which might be useful during maintenance windows or a time of unusually high lens build demand.

1. Go to the System > Activities page.

2. Click View Full Schedule.


3. The Scheduled Activities dialog displays the upcoming scheduled lens builds and vizboard PDF emails.

a) (Optional) Click a column name to sort by that column.

b) (Optional) Click Pause for a scheduled lens build to prevent the lens from building at the scheduled time. It will remain paused until someone resumes the schedule.

4. Click OK.

Manage Segments—FAQs

After a segment is defined in a vizboard, you can edit it, update the segment members, delete it, schedule updates, and show the data lineage. After creating a segment, the segment appears as a catalog object on the Data Catalog > Segments page. This topic answers some frequently asked questions about managing segments.

How do I view a segment definition?

Any user with data access permission on the underlying datasets can view a segment definition. Click the segment on the Data Catalog > Segments page. You can view the segment and its conditions, but cannot edit it.

How do I edit a segment definition?

To edit a segment, you must have at least the Analyst role. You must have Edit object permissions on the segment and data access permissions on the datasets used in the segment.


You can edit a segment definition from the Data Catalog page or from a vizboard. When editing a segment, you can add or remove conditions and edit the segment value labels for members and non-members.

How do I update the segment members when the source data changes?

Once a segment is created, Platfora creates its own special type of lens behind the scenes to create and populate the members of the segment. To update the segment members from the latest source data, you rebuild the segment lens. To rebuild a segment lens, you must have Edit object permission on the segment.


Choose Build from the segment's menu on the Data Catalog > Segments page.

Can I delete a segment?

Yes. Deleting a segment removes the segment definition and its data from the Platfora catalog. To delete a segment, you must have Own object permission on the segment.

Any visualization using the segment in an analysis will show an error if the segment is deleted from the dataset. To use these visualizations without error, remove the deleted segment from the drop zone that contains it.

You can't delete segments that are currently included in a lens. To delete a segment that is included in a lens, remove it from the lens and then rebuild the lens.

Choose Delete from the segment's menu on the Data Catalog > Segments page.

Segments created from an aggregate lens can also be deleted using the Delete button when editing the segment from a viz or from the Data Catalog page.


Can I configure a segment lens build to build on a defined schedule?

Yes, you can configure segment lenses to be built at specific times on specific days like other lens builds. To schedule a segment, you must have Edit object permission on the segment. Choose Schedule from the segment's menu on the Data Catalog > Segments page.

For more details on how to define a schedule for a segment lens build, see Schedule Lens Builds.

Can I show the data lineage for a segment?

To show the data lineage for a segment, you must have Edit object permission on the segment.

The data lineage report for a segment shows the following types of information:

• Segment conditions

• Lens field names

• Reference field names

• Filter expressions

• Field expressions

• Lens names

• Dataset names

• Data source names

• Data source locations

• Lens build specific source file names including their paths

• Timestamps


Choose Show info & lineage from the segment's menu on the Data Catalog > Segments page.


Chapter 6

Export Lens Data

Platfora is open for business. Platfora allows you to export lens data for use in other tools or applications. You can export data to either a comma-separated values (CSV) formatted file or a file used by Tableau (TDE). You can also query a lens with a SQL-like language.

Topics:

• FAQs—Lens Data Export Basics

• Understand Exported Lens Data

• Configure Export Settings for CSV

• Configure Export Settings for Tableau

• Export Lens Data to a Remote System

• Export Lens Data to your Desktop

• Export Segment Data to a Remote System

• Download Segment Data to your Desktop

• Export a Partial Lens as CSV

• Query a Lens Using the REST API

FAQs—Lens Data Export Basics

This topic answers some frequently asked questions about exporting lens data out of Platfora.

Why would I want to export lens data?

Platfora is open. You might want to export lens data to take advantage of Platfora's ability to enrich and optimize raw data by partially aggregating it, and then leverage that with your existing investment in a different application, such as Tableau. This might be useful if business analysts in your organization want to consume the insights from Platfora in an existing business intelligence tool.

What kind of file formats can I export lens data into?

You can export data to either a comma-separated values (CSV) formatted file or a file used by Tableau (TDE). You can also query a lens with a SQL-like language.


Who can export lens data?

Any user who has the Analyst Limited system role or higher and has data access permissions on all datasets included in the lens can export lens data.

To export the lens data to a location in a remote file system (such as HDFS or S3), Platfora must have write permissions to the export directory you specify.

Where can I export lens data to?

You can export lens data either to your local machine or to a remote file system. When exporting remotely, CSV files are exported to a distributed file system, such as HDFS, Google Storage, or S3, and TDE files are exported to a Tableau server.

How do I export lens data?

There are multiple ways to export lens data!

• From the lens or segment contextual menu on the Data Catalog page, choose any of the Export or Download menu options. Platfora exports the data currently in the latest lens build.

For more information, see Export Lens Data to a Remote System, Export Lens Data to your Desktop, Export Segment Data to a Remote System, and Download Segment Data to your Desktop.

• Build the lens and export the newly built data when the lens build is complete. Platfora rebuilds the lens and then exports the newly built lens data.

For more information, see Export Lens Data when Building a Lens.

• From a visualization, export the data used in the viz. Click the viz export icon, and select Download Data (.csv), Download Data (.tde), or Export Data to File System (.csv). Platfora exports only part of the lens data in the current lens build: the part that's currently displayed in the viz.

For more information, see Export a Partial Lens as CSV.

• Programmatically export lens data by submitting a lens query using Platfora's REST API. Platfora exports only part of the lens data in the current lens build, the parts requested in the query statement.

For more information, see Query a Lens Using the REST API.

Some of the Download and Export menu options are disabled or hidden, why is that?

Your Platfora system administrator can choose to not allow users to export data as CSV files or TDE files, either locally or to a remote system.

To download lens data to your local machine, the platfora.permit.lens.to.desktop configuration property must be set to true. Additionally, to download lens data as a TDE file, the platfora.tableau.enabled configuration property must be set to true.


To export lens data to a remote system, the platfora.permit.export.data configuration property must be set to true. Additionally, to export lens data to TDE, the platfora.tableau.enabled configuration property must be set to true and the platfora.tableau.version configuration property must use version 9.0 or higher.
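
For reference, a minimal sketch of how these settings might look together, assuming a key=value properties format (how configuration properties are actually set in your deployment may differ):

platfora.permit.lens.to.desktop=true
platfora.tableau.enabled=true
platfora.permit.export.data=true
platfora.tableau.version=9.0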

How much lens data can I export?

When you export lens data from the Data Catalog page or when a lens is built, Platfora exports the entire lens. However, there are limits to the allowed file sizes. If a configured limit is reached, the export or download is aborted. You can decrease the size of an exported file by applying a lens filter or reducing the number of fields included in the lens.

When downloading lens data to your local machine, Platfora limits the number of rows allowed in a downloaded CSV or TDE file. By default, the limit is 10,000,000 rows, but your Platfora system administrator can change this value using the platfora.csv.download.max.rows configuration property. Platfora limits the number of rows allowed in a downloaded file because the data is exported via a browser client.

When exporting lens data to a Tableau server, the exported file must be 1 GB or less. Note that you are limited by the amount of disk space on the Tableau server and the amount of memory in the Platfora cluster. Very large lenses can be costly to export in terms of memory usage. To prevent a large lens export from using all of the memory on the Platfora cluster, Platfora allows only one lens export query at a time.

When exporting lens data to a CSV file on a remote server, there is no size limit.

One way to export partial lens data is to export the data currently shown in a viz.


Can I automate data exports following a scheduled lens build?

Yes. When you create a lens build schedule, there is an option to export the lens data to either CSV or Tableau after the lens build completes. Platfora exports the lens data to the location configured in the lens definition export settings.

The lens definition export settings must be defined before you can configure a lens build schedule to export data.

What is the performance impact of a large lens export?

In order to export data from a lens, Platfora uses the in-memory query engine to select the export data. This is the same query engine used when other Platfora users create and edit visualizations. Lens export data is treated just like any other lens query. This means that large lens exports can impact performance for other vizboard users by competing for in-memory resources.

How are the CSV files formatted?

The first row of an exported CSV file is a header row containing the lens field names. Measure field names are enclosed in square brackets []. Data values are enclosed in double quotes (") and separated by commas. If a data value contains a double-quote character, then it is escaped using another double quote ("").


The column order in the export file is dimension fields first (in alphabetical order) followed by measure fields (in alphabetical order).
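
As an illustration, an export of the small movie-ratings lens described in Understand Exported Lens Data might begin like this (the header quoting and the exact field names are assumptions):

Movie,State,[Rating (Avg)],[Rating (Sum)],[Total Records]
"Real Genius","CA","8.66","26","3"
"Thunderheart","CA","10","10","1"

The dimension columns (Movie, State) come first in alphabetical order, followed by the bracketed measure columns, also in alphabetical order.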

What are the file names of the exported files?

Platfora uses the following naming conventions:

dataset-name_lens-name_epoch-timestamp.csv

dataset-name_lens-name_epoch-timestamp.tde
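
For example, a hypothetical lens named daily-sales built on an orders dataset might export as orders_daily-sales_1471017600000.csv (whether the epoch timestamp is in seconds or milliseconds is an assumption here).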

Where are the exported files located?

When you download exported lens data to your local computer, Platfora creates a single file in the location determined by your browser.

For files exported to a remote system, you configure the export location in the export settings in the lens definition. For more information, see Configure Export Settings for CSV and Configure Export Settings for Tableau.

When you export lens data to a CSV file on a remote file system, Platfora creates a directory in the configured export location using the following directory naming convention:

export-location/lens-name/timestamp

When exporting to a CSV file on a remote file system, the lens data is exported in parallel and is usually split across multiple export files. The export location contains a series of csv.gz lens data files and a .success file if the export completed successfully.
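
For instance, using the HDFS export location shown in Create a Lens Schedule (hdfs://10.80.231.123:8020/platfora/exports) and a hypothetical lens named daily-sales, a single export would create a directory such as /platfora/exports/daily-sales/<timestamp>/ containing the csv.gz data files and the .success marker.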

When you export lens data to Tableau, Platfora creates a TDE file on the configured Tableau server, in the configured Tableau site and project.

Can I export all UTF-8 characters?

Platfora lenses support UTF-8 character encoding. When you export lens data to a CSV file, all characters are exported as expected.

Unfortunately, the third-party software Platfora uses to export lens data to a TDE file (either locally or to a Tableau server) doesn't support multibyte characters that are four (4) bytes or greater. Therefore, Platfora replaces multibyte characters that are four (4) bytes or greater in the exported TDE file. Platfora replaces each byte of such a multibyte character with a DEL character, which is represented as decimal character number 127 (hexadecimal 0x7F) in 7-bit ASCII.

For example, most emoji characters use four bytes, so if your source files contain emoji characters, they will not be preserved when exporting to TDE.
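
To make the substitution concrete, here is a worked example that follows the replacement rule described above: the emoji U+1F600 is encoded in UTF-8 as the four bytes 0xF0 0x9F 0x98 0x80, so in the exported TDE file it would appear as four DEL characters (0x7F 0x7F 0x7F 0x7F).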


Understand Exported Lens Data

A lens is a type of data storage that is specific to Platfora. This section explains some of the nuances about how data is stored in a lens and how that affects lens data exported to a CSV or TDE (Tableau) file. This topic only discusses aggregate lenses.

Platfora will change how lens data is exported to Tableau files in a future release.

Pre-Aggregated Data

A lens takes raw data from a source and pre-aggregates it to optimize the performance of big data queries. Platfora data administrators decide how to aggregate the metrics from a source in the lens definition. They do this by defining measure fields either in the dataset or at lens build time.

The data populated to a lens is pre-aggregated, compressed, and columnar.

A lens contains aggregated measure data grouped by the various dimension fields you select from the dataset. Each row in the exported file represents the data for a unique combination of dimension fields included in the lens, as well as the calculated aggregated value in each measure field for this particular combination of dimension fields.

For example, assume you have the following raw source data:

State   Movie          Rating
CA      Real Genius    9
CA      Real Genius    9
CA      Real Genius    8
OR      Real Genius    8
OR      Real Genius    7
CA      Thunderheart   10
OR      Thunderheart   9

If you calculate a COUNT (called Total Records in Platfora), AVG, and SUM on the Rating field and include all of those measure fields in a lens, then the data in an exported file looks like this:

State   Movie          [Rating (Avg)]   [Rating (Sum)]   [Total Records]
CA      Real Genius    8.66             26               3
CA      Thunderheart   10               10               1
OR      Real Genius    7.5              15               2
OR      Thunderheart   9                9                1

Notice that Platfora only includes one row for duplicate records and it includes a COUNT (called Total Records in Platfora) for that record. In this example, there are three duplicate records for ratings made in California (CA) for the movie Real Genius. The COUNT for this record is three (3).

The aggregate values for each row in the lens can be considered partially aggregated data. That is, the values given for each measure field apply to that record only.

Using Exported Lens Data in Tableau

Both Platfora and Tableau use the term measure, and while similar, they are treated slightly differently. In Platfora, a measure is a field that is calculated from another field in the dataset by applying an aggregate function. In Tableau, a measure is any quantitative field, usually numeric.

When exporting lens data to Tableau, Tableau treats all numeric fields from Platfora as measures, even numeric dimension fields. When working with a Tableau data source that originated in Platfora, you should move the numeric dimension fields from the Measures area to the Dimensions area in Tableau.

Additionally, due to the pre-aggregated nature of Platfora measure fields, when choosing Platfora measure fields to export to Tableau, Platfora recommends only including measure fields that use the following aggregate functions:

• SUM. When you use this field in Tableau as a measure, you can apply Sum to this field.

• COUNT. When you use this field in Tableau as a measure, you can apply Sum to this field.

• MIN. When you use this field in Tableau as a measure, you can apply Minimum to this field.

• MAX. When you use this field in Tableau as a measure, you can apply Maximum to this field.

NULL Value Handling

If a field or column value in a dataset is empty, it is considered a NULL value. During lens processing, Platfora replaces all NULL values with a default value instead. Platfora lenses and vizboards have no concept of NULL values. NULLs are always substituted with the default field values specified in the dataset definition.

Therefore, when you export lens data, the exported file contains the default value instead of a NULL value. Users who process the CSV or TDE file see only valid values and have no way of knowing which values were originally NULL values in the raw source.


Data administrators specify the default value for each field in a lens definition. By default, Platfora uses the following default values per data type:

Data Type                                             Default Value
LONG, INTEGER, DOUBLE, FIXED                          0
STRING                                                NULL (as a string)
DATETIME                                              January 1, 1970 12:00:00:000 GMT
LOCATION (latitude,longitude coordinate position)     0,0
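
For example, if a source record leaves a hypothetical INTEGER Rating field empty, the lens stores 0 for that field, and the exported file contains 0 for that value with no indication that the source value was NULL.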

Time Zone Encoding

Platfora stores all DATETIME fields in Coordinated Universal Time (UTC). Any third-party application that processes exported lens data can convert these DATETIME fields from UTC to whatever time zone they require.

Additionally, Platfora provides two built-in datasets, Date and Time, to allow analysts to more quickly analyze time-based data at more granular levels. When a dataset contains a DATETIME field, Platfora automatically creates references to the Date and Time datasets.

When defining a lens, you can include fields from the Date and Time datasets, such as Month or Hour. The data type of all fields in the Date and Time datasets is STRING. Because these fields are STRING and not DATETIME, any third-party application that processes exported lens data will not automatically convert these values to a different time zone. All values from fields in the Date and Time datasets will always be in UTC format.

When defining a lens that will be exported to a CSV or TDE file, Platfora recommends including DATETIME fields, but not including any fields from the Date or Time datasets.

Field Names

When you export lens data, the names of fields in the exported file depend on the field name in Platfora and the field role, either measure or dimension.

In an exported file, dimension field names are the same as defined in the dataset, and measure field names are the same, but enclosed in square brackets ( [ ] ).

Configure Export Settings for CSV

When exporting lens and segment data to a remote file system as a CSV file, you must specify how to connect to the remote file system.

You can configure the export settings for CSV files from the following locations:


• Lens builder. When editing a lens in the lens builder, click Export Settings and then define the export settings on the CSV tab.

• Data Catalog page. You can export a lens or segment to a CSV file by accessing the lens or segment contextual menu from the Data Catalog page. When you do this, Platfora prompts you to define the export settings. When done for a lens, any settings you make here get saved to the lens.

• Viz export menu. You can export the data comprising a viz to either your desktop or to a remote file system.

Make sure that Platfora has write permissions to the specified export location. Also make sure there is enough free space in the export location. If the export location does not exist, Platfora will create it if it has the appropriate permissions.

To define the export settings for exporting lens or segment data to a CSV file, first choose the type of export location in the Destination Type field.

Define the following export settings for each Destination Type:

Destination Type   Description
HDFS               Enter the fully qualified domain name or IP address of the Host machine, the Port, and the Root Path on the machine where you want to create the CSV file.
Google Storage     Enter the Google Storage Bucket Name and the Path where you want to create the CSV file.
Amazon S3          Enter the Amazon S3 Bucket Name and the Path where you want to create the CSV file. If exporting to S3, make sure Platfora also has your Amazon access key id and secret key entered in the properties platfora.S3.accesskey and platfora.S3.secretkey. Platfora needs these to authenticate to your Amazon Web Services (AWS) account.
MapR               Enter the fully qualified domain name or IP address of the MapR Host machine, the Port, and the Root Path on the machine where you want to create the CSV file.
Other              Enter the URL of any other machine where you want to create the CSV file. Use the following format for the URL: native_filesystem_protocol://hostname:port/path-to-export-location

Configure Export Settings for Tableau

When exporting lens data to a Tableau server as a TDE file, you must specify how to connect to the Tableau server and what file settings to use.

You can configure the export settings for Tableau from the following locations:

• Lens builder. When editing a lens in the lens builder, click Export Settings and then define the export settings on the Tableau tab.

• Data Catalog page. You can export a lens to Tableau by accessing the lens contextual menu from the Data Catalog page. When you do this, Platfora prompts you to define the export settings. Any settings you make here get saved to the lens.


Your Platfora administrator might have configured some global connection settings to use for connecting to a Tableau server. If these have been configured, you can choose to use those global settings or define different custom settings when exporting a lens.

Define the following export settings for exporting a lens to Tableau:

Setting                                     Description
Use Global Settings / Use Custom Settings   Choose whether to use connection settings that your Platfora administrator configured globally for all users, or to specify your own connection settings here. To use global settings, the Tableau connection information must be configured in the Platfora Global Settings on the System page.
Host                                        The fully qualified domain name or IP address of the machine that hosts the Tableau server that will receive the exported TDE file.
Site ID                                     The name of the site on the Tableau server that should contain the TDE file. This value is case sensitive and must match the sites configured in Tableau.
User Name                                   The name of the Tableau user that has permission to add a TDE file as a data source.
Password                                    The password of the configured Tableau user.
Test Connection                             Use this button to verify that Platfora can connect to the Tableau server using the configured connection settings.
Project Name                                The name of the Tableau project in the Tableau site that should contain the TDE file.
Data Source                                 The name of the Tableau data source you want to create that will be associated with the exported TDE file.

Export Lens Data to a Remote System

You can export all data in a Platfora lens to a remote system as either a CSV or TDE file. CSV files are exported to a distributed file system, such as HDFS or S3, and TDE files are exported to a Tableau server.

You can export lens data in the following ways:

• Export the current lens build data from the lens contextual menu on the Data Catalog page. Platfora exports the data currently in the latest lens build.

• Build the lens and export the newly built data when the lens build is complete. Platfora rebuilds thelens and then exports the newly built lens data.

Your Platfora administrator might disable the ability for users to export lens data to a remote system for either CSV files, TDE files, or both. Depending on how Platfora is configured, the menu choices to export might be grayed out or hidden.


Export Lens Data from the Data Catalog Page

1. Go to the Lenses tab in the Data Catalog.

2. From the lens contextual menu, select Export Data as CSV or Export Data to Tableau.

3. Enter the Export Settings.

For details, see Configure Export Settings for CSV and Configure Export Settings for Tableau.

4. Click Export.

A notification message appears when the lens export completes.

For exported CSV files, a directory is created in the specified export location in the remote file system, using the following directory naming convention:

export-location/lens-name/timestamp

The lens data is exported in parallel and is usually split across multiple export files. The export location contains a series of csv.gz lens data files and a .success file if the export completed successfully.

For exported TDE files, the TDE file is sent to the Tableau server and a data source is created in the specified Tableau project inside the specified Tableau site.


Export Lens Data when Building a Lens

1. Open the lens builder for the lens to export.

2. Click Export Settings to verify the export settings are configured for this lens to export to either CSV or TDE.

For details, see Configure Export Settings for CSV and Configure Export Settings for Tableau.

3. Click Build/Update.

4. In the Confirm Build dialog, choose to export to CSV or export to Tableau.

5. Click Confirm.

Platfora builds the lens. When the lens build is complete, Platfora queries the newly built lens data and exports it to the configured file formats.

For exported CSV files, a directory is created in the specified export location in the remote file system using the directory naming convention:

export-location/lens-name/timestamp

The lens data is exported in parallel and is usually split across multiple export files. The export location contains a series of csv.gz lens data files and a .success file if the export completed successfully.

For exported TDE files, the TDE file is sent to the Tableau server and a data source is created in the specified Tableau project inside the specified Tableau site.

Export Lens Data to your Desktop

Exporting lens data to your desktop allows you to download a single export file to your local machine. You can download lens data from the Platfora data catalog or from a viz in a vizboard.


Download an Entire Lens to your Desktop

When downloading lens data to your desktop, keep in mind that the data is exported via your browser client. Platfora limits the number of rows allowed in a downloaded CSV or TDE file. By default, the limit is 10,000,000 rows, but your Platfora system administrator can change that using the platfora.csv.download.max.rows configuration property. If this limit is reached, the download is aborted.

1. Go to the Data Catalog > Lenses page.

2. From the lens contextual menu, select Download Data as CSV or Download Data as TDE.

3. A CSV or TDE file is generated and downloaded to the directory determined by your browser. Depending on the size of the lens, it can take several minutes for the download to finish.

Depending on the size of the data requested, the export can take a while to complete. Stay on the page until the download starts or else the export will be cancelled.

The export file naming conventions are:

dataset-name_lens-name_epoch-timestamp.csv
dataset-name_lens-name_epoch-timestamp.tde
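For example, a lens named carrier_delays built on a dataset named flights (both names are illustrative), downloaded at epoch timestamp 1471024800000, would be named:

flights_carrier_delays_1471024800000.csv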


Download a Partial Lens to your Desktop

1. Open the vizboard containing the data you want to export.

2. From the viz toolbar, click the export icon.

3. Select Download Data (.csv) or Download Data (.tde).

4. A CSV or TDE file is generated and downloaded to the directory determined by your browser.

Depending on the size of the data requested, the export can take a while to complete. Stay on the page until the download starts or else the export will be cancelled.

The export file naming conventions are:

dataset-name_lens-name_epoch-timestamp.csv

dataset-name_lens-name_epoch-timestamp.tde


Export Segment Data to a Remote System

Exporting a segment writes out the data in parallel to a distributed file system, such as HDFS or S3. You can export an entire segment from the Platfora data catalog.

1. Go to the Data Catalog > Segments page.

2. From the segment contextual menu, select Export Data as CSV.

3. Enter the Export Settings.

For details, see Configure Export Settings for CSV and Configure Export Settings for Tableau.

4. Click Export.

5. A notification message displays when the segment export completes.

In the remote file system, a directory is created in the specified export location using the directory naming convention:

export-location/segment-name/timestamp

The segment data is exported in parallel and is usually split across multiple export files. The export location contains a series of csv.gz lens data files and a .success file if the export completed successfully.

Download Segment Data to your Desktop

Exporting segment data to your desktop allows you to download a single export file to your local machine.

When downloading segment data to your desktop, keep in mind that the data is exported via your browser client. Platfora limits the number of rows allowed in a downloaded CSV file. By default, the limit is 10,000,000 rows, but your Platfora system administrator can change that using the platfora.csv.download.max.rows configuration property. If this limit is reached, the download is aborted.

It can take several minutes for a large segment export query to execute and for the download to begin. If you navigate away from the page before the download starts, it will be cancelled.

1. Go to the Data Catalog > Segments page.

2. From the segment contextual menu, select Download Data as CSV.

3. A CSV file is generated and downloaded to the directory determined by your browser.

Depending on the size of the segment, it can take several minutes for the download to finish.

The export file naming convention is:

export-location/segment-name/timestamp.csv


Export a Partial Lens as CSV

Exporting a partial lens writes out the results of a lens query to a distributed file system, such as HDFS or S3. You can export a partial lens from a single viz in a vizboard.

Make sure you have the correct URL and path information for the remote file system, and that Platfora has write permissions to the specified export location. Also make sure there is enough free space in the export location. If the export location does not exist, Platfora will create it if it has the appropriate permissions.

1. Go to Vizboards and open the vizboard containing the data you want to export.

2. From the viz export menu, select Export Data to File System (.csv).

3. Enter the Export Settings.

For details, see Configure Export Settings for CSV and Configure Export Settings for Tableau.

4. Click Export.

5. A notification message will appear when the lens export completes.

Query a Lens Using the REST API

Platfora provides a SQL-like query language that you can use to programmatically access data in a lens. You can submit a SELECT statement using the REST API, and the query results are returned in CSV format.

The syntax used to define a lens query is similar to a SQL SELECT statement. Here is an overview of the syntax used to define a lens query:

[ DEFINE new-computed-field_alias AS computed_field_expression ]
SELECT measure-fields, dimension-fields
FROM aggregate-lens-name
[ WHERE dimension-filter-expression ]
GROUP BY dimension-fields
[ SORT BY measure-field [ASC | DESC] [LIMIT number] ]
[ HAVING measure-filter-expression ]

The LIMIT clause applies to the group formed by the GROUP BY clause, not the entire lens.

If you have been using Platfora vizboards, you have already been generating lens queries by creating visualizations.

For more information about the lens query language syntax and usage, see the Lens Query Language Reference.

1. Write a lens query SELECT statement.

For example:

SELECT [Total Records],[Lease Status],Carrier.Name
FROM Planes
WHERE Carrier.Name NOT IN ("NULL")
GROUP BY [Lease Status],Carrier.Name
HAVING [Total Records] > 100

Notice how the lens field names containing spaces are escaped by enclosing them in brackets. Also notice the dot notation used to refer to a field from a referenced dataset.
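Based on the grammar above, the optional SORT BY and LIMIT clauses attach to the GROUP BY clause. A sketch against the same lens (the sort field and limit value are illustrative):

SELECT [Total Records],[Lease Status],Carrier.Name
FROM Planes
GROUP BY [Lease Status],Carrier.Name SORT BY [Total Records] DESC LIMIT 10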

2. Depending on the REST client you are using, you may need to URL encode the query before submitting it via the REST API.


For example, here is the URL-encoded version of the previous lens query:

SELECT+%5BTotal+Records%5D%2C%5BLease+Status%5D%2CCarrier.Name+FROM+Planes+WHERE+Carrier.Name+NOT+IN+%28%22NULL%22%29+GROUP+BY+%5BLease+Status%5D%2CCarrier.Name+HAVING+%5BTotal+Records%5D+%3E+100

3. Submit the encoded query string via the REST API.

For example, using the cURL command-line utility:

curl -u admin:admin "http://localhost:8001/api/v1/query?query=SELECT+%5BTotal+Records%5D%2C%5BLease+Status%5D%2CCarrier.Name+FROM+Planes+WHERE+Carrier.Name+NOT+IN+%28%22NULL%22%29+GROUP+BY+%5BLease+Status%5D%2CCarrier.Name+HAVING+%5BTotal+Records%5D+%3E+100" >> query_output.csv

Notice the first part of the URL specifies the Platfora server hostname and port. This example is connecting to localhost using the default admin username and password.

Notice the latter part of the URL, which specifies the REST API endpoint: /api/v1/query

The GET method for this endpoint expects one input parameter, query, which is the encoded query string.

The output is returned in CSV format, which you can redirect to a file if you want to save the query results.
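If your client is cURL, you can also let it perform the URL encoding for you rather than encoding the query by hand. A sketch using cURL's -G and --data-urlencode options against the same endpoint:

curl -G -u admin:admin http://localhost:8001/api/v1/query \
  --data-urlencode 'query=SELECT [Total Records],[Lease Status],Carrier.Name FROM Planes WHERE Carrier.Name NOT IN ("NULL") GROUP BY [Lease Status],Carrier.Name HAVING [Total Records] > 100' \
  >> query_output.csv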


Chapter 7: Platfora Expressions

Platfora comes with a powerful, flexible built-in expression language that you can use to transform, manipulate, and query data. This section describes Platfora's expression language, and describes how to use it to define dataset computed fields, vizboard computed fields, measures, lens filters, and lens query statements.

Topics:

• Expression Building Blocks

• PARTITION Expressions and Event Series Processing (ESP)

• ROLLUP Measures and Window Expressions

• Computed Field Examples

• Troubleshoot Computed Field Errors

• Write a Lens Query

• FAQs - Expression Basics

• Platfora Expression Language Dictionary

Expression Building Blocks

This section explains the building blocks of an expression, and the general rules for constructing a valid expression.

Functions in an Expression

Functions perform common data processing tasks. While not all expressions contain functions, most do. This section describes basic concepts you need to know to use functions.

Function Inputs and Outputs

Functions take one or more input values and return an output value. Input values can be a literal value or the name of a field that contains a value. In both cases, the function expects the input value to be a particular data type such as STRING or INTEGER. For example, the CONCAT() function combines STRING inputs and outputs a new STRING.


This example shows how to use the CONCAT() function to concatenate the values in the month, day, and year fields separated by the literal forward slash character:

CONCAT(month,"/",day,"/",year)

A function's return value may be the same as its input type or it may be an entirely new data type. For example, the TO_DATE() function takes a STRING as input, but outputs a DATETIME value. If a function expects a STRING, but is passed another data type as input, the function returns an error.

Typically, functions are classified by what data type they take or what purpose they serve. For example, CONCAT() is a string function and TO_DATE() is a data type conversion function. You'll find a complete list of functions by type in Platfora's Expression Language Reference.

Nesting Functions

Functions can take other functions as arguments. For example, you can use the CONCAT() function as an argument to the TO_DATE() function. The final result is a DATETIME value in the format 10/31/2014.

TO_DATE(CONCAT(month,"/",day,"/",year),"MM/dd/yyyy")

The nested function must return the correct data type. So, because TO_DATE() expects string input and CONCAT() returns a string, the nesting succeeds.

Only row functions allow nesting. Aggregate functions do not allow nested expressions as input.

Aggregate Functions versus Row Functions

Most functions process one value from one row at a time. These are called row functions because they operate on one value from a single row at a time. Aggregate functions are a special class of functions. Unlike row functions, aggregate functions process the values from multiple rows together into a single return value. Some examples of aggregate functions are:

• SUM()

• MIN()

• VARIANCE()

Aggregate functions are also special because you use them to define measures. Measures always return numeric values that serve as the quantitative data in an analysis. Aggregate expressions are often referred to as measure expressions in Platfora.

Limitations of Aggregation Functions

Unlike row functions, aggregate functions can only take simple expressions as input (such as field names or literal values). Aggregate functions cannot take row functions as arguments. You also cannot use an aggregate function as input into a row function. You cannot mix aggregate functions and row functions together in one expression.
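For example, assuming a numeric sales field (illustrative), an expression such as SUM(ROUND(sales)) is not allowed because it nests a row function inside an aggregate function. Instead, first define a computed field (say, Rounded Sales) whose expression is:

ROUND(sales)

Then aggregate the interim field:

SUM([Rounded Sales])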

Finally, while you can build expressions in either the dataset or the vizboard, only the following aggregate functions are allowed in vizboard computed field expressions:

• DISTINCT()


• MIN()

• MAX()

• ROLLUP

Operators in an Expression

Platfora has a number of built-in operators for doing arithmetic, logical, and comparison operations. Often, you'll use operators to combine or compare values. The values can be literal values, field values, or even other expressions.

Arithmetic Operators

Arithmetic operators perform basic math operations on two values of the same data type. For example, you could calculate the gross profit margin percentage using the values of a total_revenue and total_cost field as follows:

((total_revenue - total_cost) / total_cost) * 100

Or you can use the plus (+) operator to combine STRING values:

"Firstname" + " " + "Lastname"

You can use the plus (+) and minus (-) operators to add or subtract DATETIME values. The following table lists the math operators:

Operator   Description      Example
+          Addition         amount + 10 (add 10 to the value of the amount field)
-          Subtraction      amount - 10 (subtract 10 from the value of the amount field)
*          Multiplication   amount * 100 (multiply the value of the amount field by 100)
/          Division         bytes / 1024 (divide the value of the bytes field by 1024 and return the quotient)


Comparison Operators

Comparison operators are used to define Boolean (true / false) expressions. They test whether two values are equivalent. Comparisons return 1 for true, 0 for false. If the comparison is invalid, for example comparing a STRING to an INTEGER, the comparison operator returns NULL.

For example, you could use comparison operators within a CASE expression:

CASE WHEN age <= 25 THEN "0-25" WHEN age <= 50 THEN "26-50" ELSE "over 50" END

This expression compares the value in the age field to a literal number value. If true, it returns the appropriate STRING value.

You cannot use comparison operators to test for equality between DATETIME values. The following table lists the comparison operators:

Operator          Meaning                                Example Expression
= or ==           Equal to                               order_date = "12/22/2011"
>                 Greater than                           age > 18
!>                Not greater than (equivalent to <=)    age !> 8
<                 Less than                              age < 30
!<                Not less than (equivalent to >=)       age !< 12
>=                Greater than or equal to               age >= 20
<=                Less than or equal to                  age <= 29
<> or != or ^=    Not equal to                           age <> 30
BETWEEN min_value AND max_value
                  Test whether a date or numeric value is within the min and max values (inclusive).
                  Example: year BETWEEN 2000 AND 2012
IN(list)          Test whether a value is within a set.
                  Example: product_type IN("tablet","phone","laptop")
LIKE("pattern")   Simple inclusive case-insensitive character pattern matching. The * character matches any number of characters. The ? character matches exactly one character.
                  Examples: last_name LIKE("?utch*") matches Kutcher or hutch, but not Krutcher or crutch; company_name LIKE("platfora") matches Platfora or platfora
value IS NULL     Check whether a field value or expression is null (empty).
                  Example: ship_date IS NULL evaluates to true when the ship_date field is empty

Logical Operators

Logical operators are used in expressions to test for a condition. Logical operators are often used in lens filters, CASE expressions, and PARTITION expressions. Filters test if a field or value meets some condition. For example, this tests if a date falls between two other dates:

BETWEEN 2013-06-01 AND 2013-07-31

Logical operators are also used to construct WHERE clauses in Platfora's query language. The following table lists the logical operators:

Operator   Meaning and Examples
AND        Test whether two conditions are true.
OR         Test if either of two conditions is true.
NOT        Reverses the value of other operators. Examples:
           • year NOT BETWEEN 2000 AND 2012
           • first_name NOT LIKE("Jo?n*") excludes John, jonny but not Jon or Joann
           • Date.Weekday NOT IN("Saturday","Sunday")
           • purchase_date IS NOT NULL evaluates to true when the purchase_date field is not empty
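Combining these operators, a filter that keeps only weekday rows from a range of years might look like this (the field names are taken from the examples above and are illustrative):

year BETWEEN 2000 AND 2012 AND Date.Weekday NOT IN("Saturday","Sunday")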

Fields in an Expression

Expressions often operate on the values of a field. This section explains how to use field names in expressions.

Referring to Fields in the Current Dataset

When you specify a field name in an expression, if the field name does not contain spaces or special characters, you can simply refer to the field by its name. For example, the following expression sums the values of the sales field:

SUM(sales)


Enclose field names with square brackets ([]) if they contain spaces, special characters, reserved keywords (such as function names), or start with numeric characters. For example:

SUM([Sale Amount])
SUM([2013_data])
SUM([count])

If a field name contains a ] (closing square bracket), you must escape the closing square bracket by doubling it (]]). So if the field name is:

Min([crs_flight_duration])

You enclose the entire field name in square brackets and escape the closing bracket that is part of the actual field name:

[Min([crs_flight_duration]])]

If you are using the expression builder, it provides the correct escapes for you.

Field is a synonym for dataset column. The documentation uses the word field because that is the terminology used in Platfora's user interface.

Use Dot Notation for Fields in a Referenced Dataset

Your expression might refer to a field in the focus dataset. (The focus dataset is simply the current dataset you are working with.) You might also include a field from a referenced dataset. When including fields from a referenced dataset, you must qualify the field name with the proper notation. The convention is reference_name.field_name.

Don't confuse a reference name with the dataset name; they are not the same. When you create a reference link in a dataset, you give that reference its own name. Use . (dot) notation to separate the two components.

For example, consider the Airports dataset, which goes by the Departure Airport reference name. To refer to the City field of the Departure Airport reference to the Airports dataset, you would use the notation:

[Departure Airport].City

Just as with field names, you must escape reference names if they contain spaces, special characters, reserved keywords (such as function names), or start with numeric characters.

Aggregate Functions and Fields in a Referenced Dataset

Aggregate functions can only operate on fields in the current focus dataset. You cannot directly calculate a measure on a field belonging to a referenced dataset. For example, the following expression is not allowed:

DISTINCT([Departure Airport].City)

Instead, use a two-step process to 'pull up' a referenced field into the current dataset. First, define a Departure Airport City computed field whose expression is just the path to the referenced dataset field:

[Departure Airport].City


Then, you can use the interim Departure Airport City computed field as an argument to the aggregate expression. For example:

DISTINCT([Departure Airport City])

Literal Values in an Expression

Sometimes you need to use a literal value in an expression, as opposed to a field value. How you specify a literal value depends on its data type (text, numeric, or date). This section explains how to use literals in expressions.

Literal STRING Values

To specify a literal or actual STRING value, enclose the value in double quotes ("). For example, this expression converts the values of a gender field to the literal values of male, female, or unknown:

CASE WHEN gender="M" THEN "male" WHEN gender="F" THEN "female" ELSE "unknown" END

To escape a literal quote within a literal value itself, double the literal quote character. For example:

CASE WHEN height="60""" THEN "5 feet" WHEN height="72""" THEN "6 feet" ELSE "other" END

The REGEX() function is a special case. In the REGEX() function, string expressions are also enclosed in quotes. When a string expression contains literal quotes, double the literal quote character. For example:

REGEX(height, "\d\'(\d)+""")

Literal DATE and DATETIME Values

To refer to a DATETIME value in a lens filter expression, the date format must be yyyy-MM-dd without any enclosing quotation marks or other punctuation.

order_date BETWEEN 2012-12-01 AND 2012-12-31

To refer to a literal date value in a computed field expression, you must specify the format of the date and time components using TO_DATE, which takes a string literal argument and a format string. For example:

CASE WHEN order_date=TO_DATE("2013-01-01 00:00:59 PST","yyyy-MM-dd HH:mm:ss z") THEN "free shipping" ELSE "standard shipping" END

Literal Numeric Values

For literal numeric values, you can just specify the number itself without any special escaping or formatting. For example:

CASE WHEN is_married=1 THEN "married" WHEN is_married=0 THEN "not_married" ELSE NULL END


PARTITION Expressions and Event Series Processing (ESP)

Computed fields that contain a PARTITION expression are considered event series processing (ESP) computed fields. You can add ESP computed fields to Platfora datasets only (not vizboards).

Event series processing is also referred to as pattern matching or event correlation. Use event series processing (ESP) to partition the rows of a dataset, order the rows sequentially (typically by a timestamp), and search for matching patterns among the rows.

ESP fields evaluate multiple rows in the dataset, and output one value (or column) per row. You can use the results of an ESP computed field in other expressions or (after lens build processing) in a viz.

How Event Series Processing Works

This section explains how event series processing works by walking you through a simple use of the PARTITION expression.

This example uses some weblog page view data. Each row represents a page view at a given point in time within a user session. Each session is unique and belongs to only one user. Users can have multiple sessions. Within any session a user can visit any page one or more times.

SessionID   UserID   Timestamp         Page
2A          2        3/4/13 2:02 AM    products.html
1A          1        12/1/13 9:00 AM   home.html
1A          1        12/1/13 9:10 AM   products.html
1A          1        12/1/13 9:05 AM   company.html
1B          1        3/1/13 9:45 PM    products.html
1B          1        3/1/13 9:40 PM    home.html
2A          2        3/4/13 2:56 AM    checkout.html
1B          1        3/1/13 9:46 PM    checkout.html
1A          1        12/1/13 9:20 AM   checkout.html
2A          2        3/4/13 2:20 AM    home.html
2A          2        3/4/13 2:33 AM    blogs.html
1A          1        12/1/13 9:15 AM   blogs.html


Consider the following partial PARTITION expression:

PARTITION BY SessionID
ORDER BY Timestamp
...

This partitions the rows by the SessionID. Within each partition, the function orders each row by Timestamp in ascending order (the default order).

Suppose you wanted to find sessions where users traversed the pages in order from home.html to products.html and then to the checkout.html page. To look for this page view pattern, you complete the expression like this:

PARTITION BY SessionID
ORDER BY Timestamp
PATTERN (A,B,C)
DEFINE A AS Page = "home.html",
       B AS Page = "products.html",
       C AS Page = "checkout.html"
OUTPUT "TRUE"

The PATTERN clause describes the sequence, and the DEFINE clause assigns values to the PATTERN elements. This pattern says that there is a match whenever there are 3 consecutive rows that meet criteria A, then B, then C. If the computed field containing this PARTITION expression was called Path=home,product,checkout, you would get output that looks like this:

SessionID   UserID   Timestamp         Page            Path=home,product,checkout
1A          1        12/1/13 9:00 AM   home.html       NULL
1A          1        12/1/13 9:05 AM   company.html    NULL
1A          1        12/1/13 9:10 AM   products.html   NULL
1A          1        12/1/13 9:15 AM   blogs.html      NULL
1A          1        12/1/13 9:20 AM   checkout.html   NULL
1B          1        3/1/13 9:40 PM    home.html       NULL
1B          1        3/1/13 9:45 PM    products.html   NULL
1B          1        3/1/13 9:46 PM    checkout.html   TRUE
2A          2        3/4/13 2:02 AM    products.html   NULL
2A          2        3/4/13 2:20 AM    home.html       NULL
2A          2        3/4/13 2:33 AM    blogs.html      NULL
2A          2        3/4/13 2:56 AM    checkout.html   NULL


The lens build processing that happens to produce these results is as follows:

1. Partition (or group) the rows of the dataset by session.

2. Order the rows in each partition by time (in ascending order by default).

3. Evaluate the rows against each DEFINE clause and bind the row to the symbol where there is a match.

4. Check if the PATTERN clause conditions are met in the specified order and frequency.

5. If the PATTERN criteria are met, output TRUE as the result value for the last row that caused the pattern to be true. Write the output results to a new computed field: Path=home,product,checkout. If a row does not cause the pattern to be true, output nothing (NULL).

Understand Pattern Match Processing Order

During lens processing, the build evaluates patterns row by row, starting from the partition's top row and going downward. A pattern match is evaluated based on the current row and any rows that come before it (in terms of their position in the partition). The pattern match only looks back from the current row; it does not look ahead to the next row in the partition.

Order processing is important to consider when you want to look for events that happened later or next (chronologically speaking). With the default sort order (ascending), the build sorts rows within a partition from oldest to most recent. This means that you can only pattern match backwards chronologically (or look for events that happened previously in time).

For example, to answer a question such as "what page did a user visit before they visited the product page?", the following expression would return the previous (chronologically) viewed page before the product page:

PARTITION BY SessionID
ORDER BY Timestamp ASC
PATTERN (^product_page?,A)
DEFINE product_page AS Page = "product.html",
       A AS TRUE
OUTPUT A.Page

If you want to pattern match forwards chronologically (or look for events that happened later in time), you would specify DESC sort order in the ORDER BY clause of your PARTITION expression. For example, to answer a question such as "what page did a user visit after they visited the product page?", the following expression would return the next (chronologically) viewed page after the product page:

PARTITION BY SessionID
ORDER BY Timestamp DESC
PATTERN (^product_page?,A)
DEFINE product_page AS Page = "product.html",
       A AS TRUE
OUTPUT A.Page

Understand Pattern Match Precedence

By default, pattern expressions are matched from left to right. The innermost parenthetical expressions are evaluated first, and evaluation then moves outward from there.

For example, the pattern:

PATTERN (((A,B)|(C,D)),E)

Would evaluate differently than:

PATTERN (A,B|C,D,E)

Understand Regex-Style Quantifiers (Greedy and Reluctant)

The PATTERN clause can use regex-style quantifiers to denote the frequency of a match.

By default, quantifiers are greedy. This means a quantifier matches as many rows as possible. For example:

PATTERN (A*,B?)

Causes symbol A to match zero or more rows. Symbol B can match at most one row.

Adding an additional question mark (?) to a quantifier makes it reluctant. This means that the PATTERN only matches to a row when the row cannot match to any other subsequent match criteria in the pattern. For example:

PATTERN (A*?,B)

Causes symbol A to match zero or more rows, but only when symbol B does not produce a match. You can use reluctant quantifiers to break ties when there is more than one possible match to the pattern.

A quantifier applies to a single match criteria symbol only. You cannot apply quantifiers to parenthetical expressions. For example, you cannot write ((A,B,C)*, D) to indicate that the asterisk quantifier applies to the whole (A,B,C) expression.

Best Practices for Event Series Processing (ESP)

Event series processing (ESP) computed fields, unlike other computed fields, require advanced processing during lens builds. This means they require more compute resources on your Hadoop cluster.


This section discusses what to consider when adding event series computed fields to your dataset definitions, and the best practices when using this feature.

Use Helpful Field Names and Descriptions

In the Data Catalog and Vizboards areas of the Platfora application, event series computed fields look just like any other dataset field. When defining event series computed fields, give them names and descriptions that help users understand the field's purpose. This cues users on how to use a field in an analysis.

For example, if describing an event series computed field that computes Next Page Viewed, it may be helpful for users to know that this field is best used in conjunction with the Page field. Whatever the current value is for the Page field, the Next Page Viewed field has the value of Page for the next click record immediately following the current page.

Increase Partition Limit for Larger Event Series Processing Jobs

The global configuration property platfora.max.pattern.events sets the maximum number of rows in a partition to evaluate for a pattern match. The default is one million rows.

If a partition exceeds this number of rows, the result of the PARTITION function is NULL for all the rows that exceed the limit. For example, if you had an event series computed field that partitioned by UserID and ordered by Timestamp, the build processes only the first million rows and ignores any rows beyond that, so the event series computed field is NULL for those rows.

If you are noticing a lot of default values in your lens data (for example: 'January 1, 1970' for dates or 'NULL' for strings), you may want to increase platfora.max.pattern.events so that all of the rows are processed. Keep in mind that increasing this limit will consume more memory resources on the Hadoop cluster during lens processing.
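For example, to raise the limit to two million rows, your administrator would set the property to a plain integer value. A sketch in properties-file form (where and how the property is set depends on your installation):

platfora.max.pattern.events=2000000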

Filter Partitioning Fields to Restrict Lens Build Scope

Platfora cannot incrementally build lenses that include event series processing fields. Due to the nature of pattern matching logic, lenses with ESP fields require full lens builds that scan all of a dataset's input data. You can limit the scope of these lens builds and improve processing time by adding a lens filter on a dataset partitioning field.

A dataset partitioning field is different from the partition criteria of the ESP field. For Hive data sources, partitioning fields are defined on the data source by the Hive administrator. For HDFS, Google Storage, or S3 data sources, partitioning fields are defined in a Platfora dataset. If there are partitioning fields available in a lens, the lens builder displays a special icon next to them.

Consider How Lens Filters Impact Event Series Processing Results

Lens builds always apply lens filters on dataset partitioning fields as the first step of a lens build. This means a build excludes some source data before processing any computed field expressions. If your lens includes both lens filters on partitioning fields and ESP computed fields, you should take this behavior into consideration as it can change the results of PARTITION expressions, and ultimately, your analysis conclusions.

For example, suppose you are analyzing web page visits by user on data from 2012 and 2013:

SessionID   UserID   Timestamp (partition field)   Page
1A          1        12/1/12 9:00 AM               home.html
1A          1        12/1/12 9:05 AM               company.html
1A          1        12/1/12 9:10 AM               products.html
1A          1        12/1/12 9:15 AM               blogs.html
1B          1        3/1/13 9:40 PM                home.html
1B          1        3/1/13 9:45 PM                products.html
1B          1        3/1/13 9:46 PM                checkout.html
2A          2        3/4/13 2:02 AM                products.html
2A          2        3/4/13 2:20 AM                home.html
2A          2        3/4/13 2:33 AM                blogs.html
2A          2        3/4/13 2:56 AM                checkout.html

Timestamp is a partitioning field and it has a filter that excludes 2012 sessions. Then, you create a computed field with an event series PARTITION function that returns a user's first visit date. When the lens builds, the PARTITION expression would process this filtered data:

SessionID   UserID   Timestamp        Page
1B          1        3/1/13 9:40 PM   home.html
1B          1        3/1/13 9:45 PM   products.html
1B          1        3/1/13 9:46 PM   checkout.html
2A          2        3/4/13 2:02 AM   products.html
2A          2        3/4/13 2:20 AM   home.html
2A          2        3/4/13 2:33 AM   blogs.html
2A          2        3/4/13 2:56 AM   checkout.html

Additionally, the results would say UserID 1 had a first visit date of 3/1/13 even though the user's first visit was actually 12/1/12. This discrepancy results from the build processing the lens filter on the partitioning field (Timestamp) before the event series processing field.

Lens filters on other, non-partitioning dataset fields are applied after event series processing.

ROLLUP Measures and Window Expressions

This section explains how to write ROLLUP and window expressions to calculate complex measures, such as running totals, benchmark comparisons, rank ordering, percentiles, and so on.

Understand ROLLUP Measures

ROLLUP is a modifier to a measure (or aggregate) expression that allows you to operate on a subset of rows within the overall result set of a query. Using ROLLUP you can build a frame around one or more rows in a dataset or query result, and then compute an aggregate result in relation to that frame only.

The result of a ROLLUP expression is always a measure. However, instead of just doing a simple aggregation, it does more complex aggregate processing over a specified set of rows (or marks in a viz).


If you are familiar with SQL, a ROLLUP expression in Platfora is equivalent to the OVER clause in SQL. For example, this SQL statement:

SELECT SUM(distance) OVER (PARTITION BY departure_date)

would be equivalent to this ROLLUP expression in Platfora:

ROLLUP SUM(Distance) TO [Departure Date]

What is the difference between a measure and a ROLLUP measure?

A measure is the result of an aggregate function (such as SUM) applied to a group of input data rows. For example, using the Flights tutorial data that comes with your Platfora installation, suppose you wanted to calculate the total distance flown by an airline. You could create a measure called Distance(Sum) with an aggregate expression such as this:

SUM(Distance)

The group of input records passed into this aggregate calculation is then determined by the dimension(s) used in a visualization or lens query. Records that have the same dimension members are grouped together in a single row, which then gets represented as a mark in a viz. For example, in such a viz there is one group or mark for each Carrier/Week combination in the input data.


A ROLLUP clause modifies another aggregate function to define additional partitioning, ordering, and window frame criteria. Like a regular aggregate function, ROLLUP also computes aggregate values over groups of input rows. However, a ROLLUP measure then partitions the overall rows returned by the viz query into subsets or buckets, and then computes the aggregate expression separately within each individual bucket.

A ROLLUP is useful when you want to compute an aggregation over a subset of rows (or marks) independently of the overall result of the viz query. The ROLLUP function specifies how to partition the subset of rows and how to compute the aggregation within that subset.

For example, suppose you wanted to calculate the percentage of all miles that were flown in a given week. You could write a ROLLUP expression that calculates the percent of total distance within the partition of a week (total distance for the week is 100%). The ROLLUP expression to define such a calculation would look something like this:

100 * [Distance(Sum)] / ROLLUP [Distance(Sum)] TO ([Departure Date].Week)

Then when this ROLLUP expression is used in a viz, the group of input records passed into the aggregate calculation is determined by the dimension(s) used in the viz (such as Carrier in this case); however, the aggregation is calculated independently within each week. In this case, you can see the percentage that each carrier contributed to the total distance flown in a given week.


How to calculate a ROLLUP over an 'adaptive' partition

A ROLLUP expression can have fixed or adaptive partitioning criteria. When you define the ROLLUP measure expression, the TO clause of the expression specifies how to partition the data. You can either specify an exact field name (fixed), a reference field name (adaptive), or no field name at all (adaptive).

In the previous example, the ROLLUP expression used a fixed partition of [Departure Date].Week. If we changed the partition criteria to use just [Departure Date] (a reference), the partition criteria becomes adaptive to any field of that reference that is used in a viz. The expression to define an adaptive date partition might look something like this:

100 * [Distance(Sum)] / ROLLUP [Distance(Sum)] TO ([Departure Date])

Since Departure Date is a reference that points to the Date dimension, the calculation dynamically changes if you drill down from week to day in the viz. This expression can then be used to partition by any granularity of Departure Date without having to rewrite the ROLLUP expression. The ROLLUP expression adapts to any granularity of Departure Date used in a viz.

Understand ROLLUP Window Expressions

Adding an ORDER BY plus an optional RANGE or ROWS clause to a ROLLUP expression turns it into a window expression. These clauses are used to specify an order inside of each partition, and a window frame around all, one, or several rows over which to compute the aggregate calculation. The window frame defines how to crop, shift, or fix the row set in relation to the position of the current row.

For example, suppose you wanted to calculate a cumulative total on a day-to-day basis. You could do this by adding a window frame to your ROLLUP expression that ordered the rows in each partition by date (using the ORDER BY clause), and then summed up the current row and all the days that came before it (using a ROWS UNBOUNDED PRECEDING clause). In the Flights tutorial data, an expression that calculated a cumulative total of flights per day would look something like this:

ROLLUP [Total Records] TO () ORDER BY ([Departure Date].Date) ROWS UNBOUNDED PRECEDING

When this ROLLUP expression is used in a viz, the Total Records measure is computed cumulatively by day for each partition group (the Date and Cancel Status dimensions in this case), allowing us to see the progression of cancelled flights in the month of October 2012. This allows us to see unusual growth patterns in the data, such as the dramatic spike in cancellations at the end of the month.

The RANK, DENSE_RANK, and NTILE functions are considered exclusively window functions because they can only be used in a ROLLUP expression, and they always require an ordered set of rows (or window) over which to compute their result.
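For example, a sketch of a RANK window measure that ranks days by their flight counts, following the clause layout of the cumulative-total example above (treat the exact field choices and clause details as illustrative):

ROLLUP RANK() TO () ORDER BY ([Total Records]) ROWS UNBOUNDED PRECEDING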


Computed Field Examples

This section contains examples of some common data processing tasks you can accomplish using Platfora computed fields.

The Expression Language Reference has examples for all of the built-in functions that Platfora provides.

Finding and Replacing Values

You may have particular values in your data that you want to find and change to something else, or reformat so they are all consistent. For example, find and replace values in a name field where name values are formatted as firstname lastname and replace them with name values formatted as lastname, firstname:

REGEX_REPLACE(name,"(.*) (.*)","$2, $1")

Or you may have field values that are not formatted exactly the same, and want to change them so that like values can be grouped and sorted together. For example, change all profession_title field values that contain the word "Retired" anywhere in the string to just be a value of "Retired":

REGEX_REPLACE(profession_title,".*(Retired).*","Retired")

Extracting Information from File Names and Directories

You may have a dataset where the information you need is not inside the source files, but in the Hadoop file name or directory path, such as dates or server names.

Suppose your dataset is based on daily log files that are organized into directories by date, and each file name is the IP address of the server that produced the log file.

For example, the URI path to a log file produced by server 172.12.131.118 on July 4, 2012 is:

hdfs://myhdfs-server.com/data/logs/20120704/172.12.131.118.log

The following expression uses FILE_PATH() in combination with REGEX() and TO_DATE() to create a date field from the date directory name:

TO_DATE(REGEX(FILE_PATH(),"hdfs://myhdfs-server.com/data/logs/(\d{8})/(?:\d{1,3}\.*)+\.log"),"yyyyMMdd")

And the following expression uses FILE_NAME() and REGEX() to extract the server IP address from the file name:

REGEX(FILE_NAME(),"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\.log")

Extracting a Portion of Field Values

You may have field values where only part of the value contains useful information. You can pull out a portion of a field value to define a new field. For example, suppose you had an email_address field with values in the format of username@provider.com, and you wanted to extract just the provider portion of the email address:

REGEX(email,"^[a-zA-Z0-9._%+-]+@([a-zA-Z0-9._-]+)\.[a-zA-Z]{2,4}$")


Renaming Field Values

Sometimes field values are not very user-friendly. For example, a Boolean field may have values of 0 and 1 that you want to change to more human-readable values.

CASE WHEN cancelled=0 THEN "Not Cancelled" WHEN cancelled=1 THEN "Cancelled" ELSE NULL END

Deriving a New Field from Other Fields

You may want to combine the values of other fields to create a new field. For example, you could combine a month, day, and year field into a single date field. This would then allow you to reference Platfora's built-in Date dimension dataset.

TO_DATE(CONCAT(month,"/",day,"/",year),"MM/dd/yyyy")

You can also use the values of other fields to calculate a new value. For example, you could calculate a gross profit margin percentage using the values of a revenue and cost field as follows:

((revenue - cost) / cost) * 100

Cleansing and Casting Field Values

Sometimes the data values in a column need to be transformed and cast to another data type in order to allow for further calculations on the data. For example, you might have some numeric data that you want to use as a measure; however, it has string values of "NA" to represent what should really be NULL values. You could transform the "NA" values to NULL and then cast the column to a numeric data type.

TO_INT(CASE WHEN delay_minutes="NA" THEN NULL ELSE delay_minutes END)

Troubleshoot Computed Field Errors

When you create a computed field, Platfora catches any syntax error in your expression when you try to save the field. This section describes the most common causes of expression syntax errors.

Function Arguments Don't Match the Expected Data Type

Functions expect input arguments to be of a certain data type. When a function uses another field as its input argument, and that field is not of the expected data type, you might see an error such as:

Function REGEX takes 2 arguments with types STRING, STRING, but one argument of type INTEGER was provided.

Look at the function's arguments that appear in the error message and verify they are the proper data types. If the argument is a field, you might need to change the data type of the base field or use a data type conversion function to convert the argument to the expected data type within the expression itself.

See also: Functions in an Expression


Not Escaping Field or Dataset Names

Field and dataset names used in an expression must be enclosed in square brackets ([ ]) if they contain spaces, special characters, reserved keywords, or start with numeric characters. When an expression contains a field or dataset name that meets one of these criteria and is not enclosed in square brackets, you might see an error such as:

Platfora expected the string `)', but instead received `F'.
TO_LONG(New Field)

Look at the bolded character in the expression to find the location of the error. Note the text that comes after this position. If it is part of a field or dataset name, you need to enclose the name with square brackets. To correct the expression in this example, use: TO_LONG([New Field])

See also: Escaping Spaces or Special Characters in Field and Dataset Names

Not Specifying the Full Path to Fields of a Referenced Dataset

Functions can use a field that is in a dataset referenced from the focus dataset. You must specify the field's full path by including the referenced dataset's reference name. If you forget to use the full path, you might see an error like:

Field not found: carrier_name

When you see the Field not found error, make sure the field is qualified with the reference name. In this example, carrier_name is a field in a referenced dataset. The reference name in this example is carriers. To correct this expression, use: carriers.carrier_name for the field name.

See also: Referring to Fields in a Referenced Dataset

Unenclosed Literal Strings

You can include a literal string value as a function argument, but it must be enclosed in double quotes ("). When an expression uses a literal string that isn't enclosed in double quotes, you might see an error such as:

Field not found: Platfora

When you see the Field not found error, one option is that the alleged field is meant to be a literal string and needs to be enclosed in double quotes. To correct this expression, use: "Platfora" for the string.

See also: Literal Values in an Expression

Unescaped Special Characters

Field and dataset names may contain a right square bracket (]), but it must be preceded by another right square bracket (]]). Literal strings may contain a double quote ("), but it must be preceded by another double quote (""). Suppose you want to concatenate the strings "Hello and world." to make the string "Hello world.". The double quotes in each string are special characters and must be escaped in the expression. If not, you might see an error like:

Platfora expected the string `)', but instead received `H'.


CONCAT(""Hello", " world."")

Look at the bolded character in the expression to find the location of the error. To correct this error, escape the double quotes with another double quote:

CONCAT("""Hello", " world.""")

Invalid Syntax

Functions have specific requirements, including required arguments and keywords. When an expression is missing a keyword, you might see an error such as:

Platfora expected a string matching the regular expression `(?i)\Qend\E', but instead received end of source.
CASE WHEN cancel_code=0 THEN "Not Cancelled" WHEN cancel_code=1 THEN "Cancelled" ELSE NULL

Look at the bolded character in the expression to find the location of the error. In this example, it expected the string END (indicated by (?i)\Qend\E), but instead it reached the end of the expression. The CASE function requires the END keyword at the end of its syntax string. To correct this error, add END to the end of the expression:

CASE WHEN cancel_code=0 THEN "Not Cancelled" WHEN cancel_code=1 THEN "Cancelled" ELSE NULL END

See also: Expression Language Reference

Using Row and Aggregate Functions Together in the Same Expression

Aggregate functions (functions used to define measures) cannot use nested expressions as their input arguments. Aggregate functions can only accept field names as input. You also cannot use an aggregate expression as input to a row function expression. Aggregate functions and row functions cannot be mixed together in one expression.
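For example, an expression like the following is invalid because it passes an aggregate result into a row function (the sales field is illustrative):

CONCAT("Total: ", SUM(sales))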

Write a Lens Query

Platfora includes a programmatic query access feature you can use to query a lens. This section describes support for querying lenses using Platfora's lens query language and the REST API.

Platfora allows you to make a query against an aggregate lens in your Platfora instance. This feature is not meant as an end-user feature. Rather, it is intended to allow you to write programs that issue SQL-like queries to a Platfora lens. For example, you could write a simple command-line client for querying a lens. Since programmatic query access is meant for use by programs rather than people, a caller makes the queries through REST API calls.

A query consists of a SELECT statement with one or more optional clauses. The statement and its clauses use the same expression language elements you encounter when building a computed field expression and/or a lens filter expression.


[ DEFINE alias-name AS expression [ DEFINE ... ] ]
SELECT measure-field [ AS alias-name ] | measure-expression AS alias-name
    [ , { dimension-field [ AS alias-name ] | row-expression AS alias-name } [ , ... ] ]
FROM lens-name
[ WHERE filter-expression [ AND filter-expression ] ]
[ GROUP BY dimension-field [ , group-ordering ] ]
[ HAVING measure-filter-expression ]

For example, you make a query like the following:

SELECT [device].[manufacturer], [user].[gender], [Num Users]
FROM bo_view2G_PSM
WHERE video.genre = "Action/Comedy" AND user.gender != "male"
GROUP BY [device].[manufacturer], [user].[gender]

Once you know the query structure, you make a REST call to the query endpoint. You can pass the query as a parameter to a GET request or as a JSON body to a POST request.

https://hostname:port/api/v1/query?query="URL-encoded SELECT statement ..."

Considerations for Using Programmatic Query Access

Here are some considerations to keep in mind when constructing lens queries:

• You can only query aggregate lenses. You cannot query event series lenses.

• Queries run against the currently built version of the lens.

• Queries that once worked can later fail because the underlying dataset or lens changed.

• You cannot do a SELECT * on a lens.

FAQs - Expression Basics

This section covers the basic concepts and common questions about the Platfora expression language.

What is an expression?

An expression computes or produces a value by combining fields (or columns), constant values, operators, and functions. An expression outputs a value of a particular data type, such as numeric, string, datetime, or Boolean (true/false) values. Simple expressions can be a single constant value, the values of a given column or field, or a function call. You can use operators to join two or more simple expressions into a complex expression.

How are expressions used in the Platfora application?

Platfora expressions allow you to select, process, transform, and manipulate data. Expressions are used in several ways in the Platfora application:


• In Datasets, they are used to define computed fields and measures that operate on the raw source data.

• In Lenses, they are used to define lens filters that limit the scope of raw data requested from Hadoop.

• In Vizboards, they are used to define computed fields that further manipulate the prepared data in a lens.

• In the Lens Query Language via the REST API, they are used to programmatically access and manipulate the prepared data in a lens from external applications or plugins.

What is the expression builder?

The expression builder helps you create computed field expressions in the Platfora application. It shows the available fields in the dataset or lens you are working with, plus the list of Platfora's built-in functions and statements. It validates your expressions for correct syntax, input data types, and so on. You can also access the help to view correct syntax and examples for all of the built-in functions and statements.

What is a computed field expression?

A computed field expression generates its values based on a calculation or condition, and returns a value for each input row. Computed field expressions can contain values from other fields, constants, mathematical operators, comparison operators, or built-in row functions.

What is a measure expression?

A measure expression generates its values as the result of an aggregate function. It takes input values from multiple rows and returns a single aggregated value.

How are expressions used in programmatic lens queries?

Platfora's lens query language does not have a graphical user interface like the expression builder. Instead, you can use the cURL command line, Chrome's Postman extension, or write your own plugin extension to submit a SQL-like SELECT query statement through Platfora's REST API.

The lens query language makes use of expressions in its SELECT statement, DEFINE clause, WHERE clause, and HAVING clause.

Programmatic lens queries are subject to some of the same expression limitations as vizboard computed fields, since they also operate on the pre-processed data in a lens.

Platfora Expression Language Dictionary

An expression computes or produces a value by combining field or column values, constant values, operators, and functions. Platfora has a built-in expression language. You use the language's functions and operators in dataset computed fields, vizboard computed fields, lens filters, and programmatic lens queries.


Expression Quick Dictionary

An expression is a combination of columns (or fields), constant values, operators, and functions used to evaluate, transform, or produce a value. Simple expressions can be combined to make more complex expressions. This quick reference describes the functions and operators that can be used to write expressions.

Platfora's built-in statements, functions and operators are divided into the following categories:

• Conditional and NULL Processing

• Event Series Processing

• String Processing

• Date and Time Processing

• URL Processing

• IP Address Processing

• Mathematical Processing

• Data Type Conversion

• Aggregation and Measure Processing

• ROLLUP and Window Calculations

• User Defined Functions

• Comparison Operators

• Logical Operators

• Arithmetic Operators

Conditional and NULL Processing

Conditional and NULL processing allows you to transform or manipulate data values based on certain defined conditions. Conditional processing (CASE) can be done at either the dataset or vizboard level. NULL processing (COALESCE and IS_VALID) is only applicable at the dataset level. During a lens build, any NULL values in the source data are converted to default values, so lenses and vizboards have no concept of NULL values.

CASE
    Evaluates each row in the dataset according to one or more input conditions, and outputs the specified result when the input conditions are met.
    Example: CASE WHEN gender = "M" THEN "Male" WHEN gender = "F" THEN "Female" ELSE "Unknown" END

COALESCE
    Returns the first valid value (NOT NULL value) from a comma-separated list of expressions.
    Example: COALESCE(hourly_wage * 40 * 52, salary)

IS_VALID
    Returns 0 if the returned value is NULL, and 1 if the returned value is NOT NULL.
    Example: IS_VALID(sale_amount)

Event Series Processing

Event series processing allows you to partition rows of input data, order the rows sequentially (typically by a timestamp), and search for matching patterns in a set of rows. Computed fields that are defined in a dataset using a PARTITION expression are considered event series processing computed fields. Event series processing computed fields are processed differently than regular computed fields. Instead of computing values from the input of a single row, they compute values from inputs of multiple rows in the dataset. Event series processing computed fields can only be defined in the dataset - not in the vizboard.

PACK_VALUES: returns multiple output values packed into a single string of key/value pairs separated by the Platfora default key and pair separators. Useful when the OUTPUT clause of a PARTITION expression returns multiple output values.
Example: PACK_VALUES("ID",custid,"Age",age)

PARTITION: partitions the rows of a dataset, orders the rows sequentially (typically by a timestamp), and searches for matching patterns in a set of rows.
Example: PARTITION BY SessionID ORDER BY Timestamp PATTERN (A,B,C) DEFINE A AS Page = "home.html", B AS Page = "product.html", C AS Page = "checkout.html" OUTPUT "TRUE"


String Functions

String functions allow you to manipulate and transform textual data, such as combining string values or extracting a portion of a string value.

ARRAY_CONTAINS: performs a whole string match against a string containing delimited values and returns a 1 or 0 depending on whether or not the string contains the search value.
Example: ARRAY_CONTAINS(device,",","iPad")

CONCAT: concatenates (combines together) the results of multiple string expressions.
Example: CONCAT(month,"/",day,"/",year)

FILE_NAME: returns the original file name from the source file system.
Example: TO_DATE(SUBSTRING(FILE_NAME(),0,8),"yyyyMMdd")

FILE_PATH: returns the full URI path from the source file system.
Example: TO_DATE(REGEX(FILE_PATH(),"hdfs://myhdfs-server.com/data/logs/(\d{8})/(?:\d{1,3}\.*)+\.log"),"yyyyMMdd")

EXTRACT_COOKIE: extracts the value of the given cookie identifier from a semi-colon delimited list of cookie key=value pairs.
Example: EXTRACT_COOKIE("SSID=ABC; vID=44","vID") returns 44

EXTRACT_VALUE: extracts the value for the given key from a string containing delimited key/value pairs.
Example: EXTRACT_VALUE("firstname;daria|lastname;hutch","lastname",";","|") returns hutch

INSTR: returns an integer indicating the position of a character within a string that is the first character of the occurrence of a substring.
Example: INSTR(url,"http://",-1,1)

JAVA_STRING: returns the unescaped version of a Java unicode character escape sequence as a string value.
Example: CASE WHEN currency == JAVA_STRING("\u00a5") THEN "yes" ELSE "no" END

JOIN_STRINGS: concatenates (combines together) the results of multiple string expressions with the separator in between each non-null value.
Example: JOIN_STRINGS("/",month,day,year)

JSON_ARRAY: extracts a JSON ARRAY as a STRING value from a field in a JSON object.
Example: JSON_ARRAY(friends, "f1")

JSON_ARRAY_CONTAINS: performs a whole string match against a string formatted as a JSON array and returns a 1 or 0 depending on whether or not the string contains the search value.
Example: JSON_ARRAY_CONTAINS(software,"platfora")

JSON_DOUBLE: extracts a DOUBLE value from a field in a JSON object.
Example: JSON_DOUBLE(top_scores,"test_scores.2")

JSON_FIXED: extracts a FIXED value from a field in a JSON object.
Example: JSON_FIXED(top_scores,"test_scores.2")

JSON_INTEGER: extracts an INTEGER value from a field in a JSON object.
Example: JSON_INTEGER(top_scores,"test_scores.2")

JSON_LONG: extracts a LONG value from a field in a JSON object.
Example: JSON_LONG(top_scores,"test_scores.2")

JSON_OBJECT: extracts a JSON OBJECT as a STRING value from a field in a JSON object.
Example: JSON_OBJECT(friends, "f1.0.f2.f3")

JSON_STRING: extracts a STRING value from a field in a JSON object.
Example: JSON_STRING(misc,"hobbies.0")

LENGTH: returns the count of characters in a string value.
Example: LENGTH(name)

REGEX: performs a whole string match against a string value with a regular expression and returns the portion of the string matching the first capturing group of the regular expression.
Example: REGEX(weblog.request_line,"GET\s/([a-zA-Z0-9._%-]+\.[html])\sHTTP/[0-9.]+")

REGEX_REPLACE: evaluates a string value against a regular expression to determine if there is a match, and replaces matched strings with the specified replacement value.
Example: REGEX_REPLACE(phone_number,"([0-9]{3})\.([0-9]{3})\.([0-9]{4})","\($1\)$2-$3")

SPLIT: breaks down a delimited input string into sections and returns the specified section of the string.
Example: SPLIT("Restaurants>Location>San Francisco",">", -1) returns San Francisco

SUBSTRING: returns the specified characters of a string value based on the given start and end position.
Example: SUBSTRING(name,0,1)

TO_LOWER: converts all alphabetic characters in a string to lower case.
Example: TO_LOWER("123 Main Street") returns 123 main street

TO_UPPER: converts all alphabetic characters in a string to upper case.
Example: TO_UPPER("123 Main Street") returns 123 MAIN STREET

TRIM: removes leading and trailing spaces from a string value.
Example: TRIM(area_code)

XPATH_STRING: takes an XML-formatted string and returns the first string matching the given XPath expression.
Example: XPATH_STRING(address,"//address[@type='home']/zipcode")

XPATH_STRINGS: takes an XML-formatted string and returns a newline-separated array of strings matching the given XPath expression.
Example: XPATH_STRINGS(address,"/list/address[1]/street")

XPATH_XML: takes an XML-formatted string and returns an XML-formatted string matching the given XPath expression.
Example: XPATH_XML(address,"//address[last()]")

Date and Time Functions

Date and time functions allow you to manipulate and transform datetime values, such as calculating time differences between two datetime values, or extracting a portion of a datetime value.

DAYS_BETWEEN: calculates the whole number of days (ignoring time) between two DATETIME values.
Example: DAYS_BETWEEN(ship_date,order_date)

DATE_ADD: adds the specified time interval to a DATETIME value.
Example: DATE_ADD(invoice_date,45,"day")

HOURS_BETWEEN: calculates the whole number of hours (ignoring minutes, seconds, and milliseconds) between two DATETIME values.
Example: HOURS_BETWEEN(NOW(),impressions.adview_timestamp)

EXTRACT: returns the specified portion of a DATETIME value.
Example: EXTRACT("hour",order_date)

MILLISECONDS_BETWEEN: calculates the whole number of milliseconds between two DATETIME values.
Example: MILLISECONDS_BETWEEN(request_timestamp,response_timestamp)

MINUTES_BETWEEN: calculates the whole number of minutes (ignoring seconds and milliseconds) between two DATETIME values.
Example: MINUTES_BETWEEN(impression_timestamp,conversion_timestamp)

NOW: returns the current system date and time as a DATETIME value.
Example: YEAR_DIFF(NOW(),users.birthdate)

SECONDS_BETWEEN: calculates the whole number of seconds (ignoring milliseconds) between two DATETIME values.
Example: SECONDS_BETWEEN(impression_timestamp,conversion_timestamp)

TRUNC: truncates a DATETIME value to the specified format.
Example: TRUNC(TO_DATE(order_date,"MM/dd/yyyy HH:mm:ss"),"day")

YEAR_DIFF: calculates the fractional number of years between two DATETIME values.
Example: YEAR_DIFF(NOW(),users.birthdate)

URL Functions

URL functions allow you to extract different portions of a URL string, and decode text that is URL-encoded.

URL_AUTHORITY: returns the authority portion of a URL string.
Example: URL_AUTHORITY("http://user:password@mycompany.com:8012/mypage.html") returns user:password@mycompany.com:8012

URL_FRAGMENT: returns the fragment portion of a URL string.
Example: URL_FRAGMENT("http://platfora.com/news.php?topic=press#Platfora%20News") returns Platfora%20News

URL_HOST: returns the host, domain, or IP address portion of a URL string.
Example: URL_HOST("http://user:password@mycompany.com:8012/mypage.html") returns mycompany.com

URL_PATH: returns the path portion of a URL string.
Example: URL_PATH("http://platfora.com/company/contact.html") returns /company/contact.html

URL_PORT: returns the port portion of a URL string.
Example: URL_PORT("http://user:password@mycompany.com:8012/mypage.html") returns 8012

URL_PROTOCOL: returns the protocol (or URI scheme name) portion of a URL string.
Example: URL_PROTOCOL("http://www.platfora.com") returns http

URL_QUERY: returns the query portion of a URL string.
Example: URL_QUERY("http://platfora.com/news.php?topic=press&timeframe=today") returns topic=press&timeframe=today

URLDECODE: decodes a string that has been encoded with the application/x-www-form-urlencoded media type.
Example: URLDECODE("N%2FA%20or%20%22not%20applicable%22")
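Because each URL function returns a STRING, the results can be fed into the string functions described earlier. For example, the following sketch rebuilds a URL without its query and fragment portions (the url field name is a hypothetical stand-in):

CONCAT(URL_PROTOCOL(url),"://",URL_HOST(url),URL_PATH(url))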

IP Address Functions

IP address functions allow you to manipulate and transform STRING data consisting of IP address values.

CIDR_MATCH: compares two STRING arguments representing a CIDR mask and an IP address, and returns 1 if the IP address falls within the specified subnet mask or 0 if it does not.
Example: CIDR_MATCH("60.145.56.0/24","60.145.56.246") returns 1

HEX_TO_IP: converts a hexadecimal-encoded STRING to a text representation of an IP address.
Example: HEX_TO_IP(AB20FE01) returns 171.32.254.1
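Since CIDR_MATCH returns 1 or 0, it can drive conditional logic. A minimal sketch that labels traffic as internal or external, assuming a hypothetical client_ip field and a 10.0.0.0/8 private subnet:

CASE WHEN CIDR_MATCH("10.0.0.0/8",client_ip) = 1 THEN "internal" ELSE "external" END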


Math Functions

Math functions allow you to perform basic math calculations on numeric values. You can also use the arithmetic operators to perform simple math calculations, such as addition, subtraction, division and multiplication.

DIV: divides two LONG values and returns a quotient value of type LONG.
Example: DIV(TO_LONG(file_size),1024)

EXP: raises the mathematical constant e to the power (exponent) of a numeric value and returns a value of type DOUBLE.
Example: EXP(Value)

FLOOR: returns the largest integer that is less than or equal to the input argument.
Example: FLOOR(32.6789) returns 32.0

HASH: evenly partitions data values into the specified number of buckets.
Example: HASH(username,20)

LN: returns the natural logarithm of a number.
Example: LN(2.718281828) returns 1

MOD: divides two LONG values and returns the remainder value of type LONG.
Example: MOD(TO_LONG(file_size),1024)

POW: raises a numeric value to the power (exponent) of another numeric value and returns a value of type DOUBLE.
Example: 100 * POW(end_value/start_value, 0.2) - 1

ROUND: rounds a DOUBLE value to the specified number of decimal places.
Example: ROUND(32.4678954,2) returns 32.47


Data Type Conversion Functions

Data type conversion functions allow you to cast data values from one data type to another. These functions are used implicitly whenever you set the data type of a field or column in the Platfora user interface. The supported data types are: INTEGER, LONG, DOUBLE, FIXED, DATETIME, and STRING.

EPOCH_MS_TO_DATE: converts LONG values to DATETIME values, where the input number represents the number of milliseconds since the epoch.
Example: EPOCH_MS_TO_DATE(1360260240000) returns 2013-02-07T18:04:00.000Z

TO_FIXED: converts STRING, INTEGER, LONG, or DOUBLE values to fixed-decimal values.
Example: TO_FIXED(opening_price)

TO_DATE: converts STRING values to DATETIME values, and specifies the format of the date and time elements in the string.
Example: TO_DATE(order_date,"yyyy.MM.dd 'at' HH:mm:ss z")

TO_DOUBLE: converts STRING, INTEGER, LONG, or DOUBLE values to DOUBLE (decimal) values.
Example: TO_DOUBLE(average_rating)

TO_INT: converts STRING, INTEGER, LONG, or DOUBLE values to INTEGER (whole number) values.
Example: TO_INT(average_rating)

TO_LONG: converts STRING, INTEGER, LONG, or DOUBLE values to LONG (whole number) values.
Example: TO_LONG(average_rating)

TO_STRING: converts values of other data types to STRING (character) values.
Example: TO_STRING(sku_number)


Aggregate Functions

An aggregate function groups the values of multiple rows together based on some defined input expression. Aggregate functions return one value for a group of rows, and are only valid for defining measures in Platfora. In the dataset, measures can be defined using any of the aggregate functions. In the vizboard, only the DISTINCT, MAX, or MIN aggregate functions are allowed.

AVG: returns the average of all valid numeric values.
Example: AVG(sale_amount)

COUNT: returns the number of rows in a dataset.
Example: COUNT(sales.customers)

COUNT_VALID: returns the number of rows for which the given expression is valid.
Example: COUNT_VALID(page_views)

DISTINCT: returns the number of distinct values for the given expression.
Example: DISTINCT(user_id)

MAX: returns the biggest value from the given input expression.
Example: MAX(sale_amount)

MIN: returns the smallest value from the given input expression.
Example: MIN(sale_amount)

SUM: returns the total of all values from the given input expression.
Example: SUM(sale_amount)

STDDEV: calculates the population standard deviation for a group of numeric values.
Example: STDDEV(sale_amount)

VARIANCE: calculates the population variance for a group of numeric values.
Example: VARIANCE(sale_amount)
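Aggregate functions can also be combined with arithmetic operators to build derived measures. For example, a sketch of an average-revenue-per-user measure, assuming hypothetical sale_amount and user_id fields:

SUM(sale_amount) / DISTINCT(user_id)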

ROLLUP and Window Functions

ROLLUP is a modifier to an aggregate expression that turns an aggregate into a windowed aggregate. Window functions (RANK, DENSE_RANK and NTILE) can only be used within a ROLLUP statement. The ROLLUP statement defines the partitioning and ordering of a rowset before the associated aggregate function or window function is applied.

ROLLUP defines a window or user-specified set of rows within a query result set. A window function then computes a value for each row in the window. You can use window functions to compute aggregated values such as moving averages, cumulative aggregates, running totals, or top-N-per-group results.

ROLLUP statements can be specified in either the dataset or the vizboard. When using a ROLLUP in a vizboard, the measure for which you are calculating the ROLLUP must already exist in the lens you are using in the vizboard.

DENSE_RANK: assigns the rank (position) of each row in a group (partition) of rows and does not skip rank numbers in the event of a tie.
Example: ROLLUP DENSE_RANK() TO () ORDER BY ([Sales(Sum)] DESC) ROWS UNBOUNDED PRECEDING

NTILE: divides a partitioned group of rows into the specified number of buckets, and returns the bucket number to which the current row belongs.
Example: ROLLUP NTILE(100) TO () ORDER BY ([Total Records] DESC) ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING

RANK: assigns the rank (position) of each row in a group (partition) of rows and skips rank numbers in the event of a tie.
Example: ROLLUP RANK() TO () ORDER BY ([Sales(Sum)] DESC) ROWS UNBOUNDED PRECEDING

ROLLUP: a modifier to an aggregate function that turns a regular aggregate function into a windowed, partitioned, or adaptive aggregate function.
Example: 100 * COUNT(Flights) / ROLLUP COUNT(Flights) TO ([Departure Date])

ROW_NUMBER: assigns a sequential number (position) to each row in a group (partition) of rows, according to the specified ordering.
Example: ROLLUP ROW_NUMBER() TO (Quarter) ORDER BY (Sum_Sales DESC) ROWS UNBOUNDED PRECEDING
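Putting the clauses together, a running total (one of the calculations mentioned above) can be sketched as a windowed aggregate: partition by one field, order the rows, and accumulate from the start of the window. The field names here are hypothetical, and the clause layout follows the examples in this table:

ROLLUP SUM(sale_amount) TO (Region) ORDER BY (Month ASC) ROWS UNBOUNDED PRECEDING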


User Defined Functions

User defined functions (UDFs) allow you to define your own per-row processing logic, and then expose that functionality to users in the Platfora application expression builder. See User Defined Functions (UDFs) for more information.

Comparison Operators

Comparison operators are used to compare the equivalency of two expressions of the same data type. The result of a comparison expression is a Boolean value (returns 1 for true, 0 for false, or NULL for invalid). Boolean expressions are most often used to specify data processing conditions or filters.

= or ==: Equal to.
Example: order_date = "12/22/2011"

>: Greater than.
Example: age > 18

!>: Not greater than (equivalent to <=).
Example: age !> 8

<: Less than.
Example: age < 30

!<: Not less than (equivalent to >=).
Example: age !< 12

>=: Greater than or equal to.
Example: age >= 20

<=: Less than or equal to.
Example: age <= 29

<> or != or ^=: Not equal to.
Example: age <> 30

BETWEEN min_value AND max_value: Tests whether a date or numeric value is within the min and max values (inclusive).
Example: year BETWEEN 2000 AND 2012

IN(list): Tests whether a value is within a set.
Example: product_type IN("tablet","phone","laptop")

LIKE("pattern"): Simple inclusive case-insensitive character pattern matching. The * character matches any number of characters. The ? character matches exactly one character.
Example: last_name LIKE("?utch*") matches Kutcher, hutch but not Krutcher or crutch
Example: company_name LIKE("platfora") matches Platfora or platfora

value IS NULL: Checks whether a field value or expression is null (empty).
Example: ship_date IS NULL evaluates to true when the ship_date field is empty

Logical Operators

Logical operators are used to define Boolean (true / false) expressions. Logical operators are used in expressions to test for a condition, and return 1 if the condition is true or 0 if it is false. Logical operators are often used in lens filters, CASE expressions, PARTITION expressions, and WHERE clauses of queries.

AND: Tests whether two conditions are true.

OR: Tests whether either of two conditions is true.

NOT: Reverses the value of other operators.
Examples:

• year NOT BETWEEN 2000 AND 2012

• first_name NOT LIKE("Jo?n*") excludes John, jonny but not Jon or Joann

• Date.Weekday NOT IN("Saturday","Sunday")

• purchase_date IS NOT NULL evaluates to true when the purchase_date field is not empty
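The AND and OR entries above carry no examples in this table; a minimal illustration, using hypothetical field names:

age >= 21 AND country = "US" returns 1 only when both conditions are true

status = "gold" OR status = "platinum" returns 1 when either condition is true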

Arithmetic Operators

Arithmetic operators perform basic math operations on two expressions of the same data type resulting in a numeric value. The plus (+) and minus (-) operators can also be used to perform arithmetic operations on DATETIME values.

+ (addition): amount + 10 adds 10 to the value of the amount field.

- (subtraction): amount - 10 subtracts 10 from the value of the amount field.

* (multiplication): amount * 100 multiplies the value of the amount field by 100.

/ (division): bytes / 1024 divides the value of the bytes field by 1024 and returns the quotient.

Comparison Operators

Comparison operators are used to compare the equivalency or inequivalency of two expressions of the same data type. The result of a comparison expression is a Boolean value (returns 1 for true, 0 for false, or NULL for invalid). Boolean expressions are most often used to specify data processing conditions or filter criteria.

Operator Definitions

= or ==: Equal to.
Example: order_date = "12/22/2011"

>: Greater than.
Example: age > 18

!>: Not greater than (equivalent to <=).
Example: age !> 8

<: Less than.
Example: age < 30

!<: Not less than (equivalent to >=).
Example: age !< 12

>=: Greater than or equal to.
Example: age >= 20

<=: Less than or equal to.
Example: age <= 29

<> or != or ^=: Not equal to.
Example: age <> 30

BETWEEN min_value AND max_value: Tests whether a date or numeric value is within the min and max values (inclusive).
Example: year BETWEEN 2000 AND 2012

IN(list): Tests whether a value is within a set.
Example: product_type IN("tablet","phone","laptop")

LIKE("pattern"): Simple inclusive case-insensitive character pattern matching. The * character matches any number of characters. The ? character matches exactly one character.
Example: last_name LIKE("?utch*") matches Kutcher, hutch but not Krutcher or crutch
Example: company_name LIKE("platfora") matches Platfora or platfora

value IS NULL: Checks whether a field value or expression is null (empty).
Example: ship_date IS NULL evaluates to true when the ship_date field is empty

If you are writing queries with REST and the query string includes an = (equal) character, you must URL encode it as %3D. Failure to encode the character can result in this error:

string matching regex `(?i)\Qnot\E\b' expected but end of source found
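For example, a query string containing the filter WHERE region = "West" would be encoded before submission roughly as follows (the %3D substitution for = is the requirement noted above; the other escapes are standard URL encoding):

WHERE%20region%20%3D%20%22West%22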

Logical Operators

Logical operators are used to define Boolean (true / false) expressions. Logical operators are used in expressions to test for a condition, and return 1 if the condition is true or 0 if it is false. Logical operators are often used in CASE expressions, PARTITION expressions, and WHERE clauses of queries.

AND: Tests whether two conditions are true.

OR: Tests whether either of two conditions is true.

NOT: Reverses the value of other operators.
Examples:

• year NOT BETWEEN 2000 AND 2012

• first_name NOT LIKE("Jo?n*") excludes John, jonny but not Jon or Joann

• Date.Weekday NOT IN("Saturday","Sunday")

• purchase_date IS NOT NULL evaluates to true when the purchase_date field is not empty

Arithmetic Operators

Arithmetic operators perform basic math operations on two expressions of the same data type resulting in a numeric value. The plus (+) and minus (-) operators can also be used to perform arithmetic operations on DATETIME values.

+ (addition): amount + 10 adds 10 to the value of the amount field.

- (subtraction): amount - 10 subtracts 10 from the value of the amount field.

* (multiplication): amount * 100 multiplies the value of the amount field by 100.

/ (division): bytes / 1024 divides the value of the bytes field by 1024 and returns the quotient.

Conditional and NULL Processing

Conditional and NULL processing allows you to transform or manipulate data values based on certain defined conditions. Conditional processing (CASE) can be done at either the dataset or vizboard level. NULL processing (COALESCE and IS_VALID) is only applicable at the dataset level. During a lens build, any NULL values in the source data are converted to default values, so lenses and vizboards have no concept of NULL values.

CASE

CASE is a row function that evaluates each row in the dataset according to one or more input conditions, and outputs the specified result when the input conditions are met.

CASE WHEN input_condition [AND|OR input_condition] THEN output_expression [...] [ELSE other_output_expression] END

Returns one value per row of the same type as the output expression. All output expressions must return the same data type.

If there are multiple output expressions that return different data types, then you will need to enclose your entire CASE expression in one of the data type conversion functions to explicitly cast all output values to a particular data type.
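For example, a CASE expression whose branches return an INTEGER and a STRING can be wrapped in TO_STRING so that every output value is cast to a single data type (the age field is a hypothetical stand-in):

TO_STRING(CASE WHEN age < 21 THEN age ELSE "21 or over" END)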

WHEN input_condition

Required. The WHEN keyword is used to specify one or more Boolean expressions (see Platfora's supported conditional operators). If an input value meets the condition, then the output expression is applied. Input conditions can include other row functions in their expression, but cannot contain aggregate functions or measure expressions. You can use the AND or OR keywords to combine multiple input conditions.

THEN output_expression

Required. The THEN keyword is used to specify an output expression when the specified conditions are met. Output expressions can include other row functions in their expression, but cannot contain aggregate functions or measure expressions.

ELSE other_output_expression

Optional. The ELSE keyword can be used to specify an alternate output expression to use when the specified conditions are not met. If an ELSE expression is not supplied, ELSE NULL is the default.

END

Required. Denotes the end of CASE function processing.

Convert values in the age column into range-based groupings (binning):

CASE WHEN age <= 25 THEN "0-25" WHEN age <= 50 THEN "26-50" ELSE "over 50" END

Transform values in the gender column from one string to another:

CASE WHEN gender = "M" THEN "Male" WHEN gender = "F" THEN "Female" ELSE "Unknown" END

The vehicle column contains the following values: truck, bus, car, scooter, wagon, bike, tricycle, and motorcycle. The following example converts multiple values in the vehicle column into a single value:


CASE WHEN vehicle IN ("bike","scooter","motorcycle") THEN "two-wheelers" ELSE "other" END

COALESCE

COALESCE is a row function that returns the first valid value (NOT NULL value) from a comma-separated list of expressions.

COALESCE(expression[,expression][,...])

Returns one value per row of the same type as the first valid input expression.

expression

At least one required. A field name or expression.

The following example shows an expression to calculate employee yearly income for exempt employees that have a salary and non-exempt employees that have an hourly_wage. This expression checks the values of both fields for each row, and returns the value of the first expression that is valid (NOT NULL).

COALESCE(hourly_wage * 40 * 52, salary)

IS_VALID

IS_VALID is a row function that returns 0 if the returned value is NULL, and 1 if the returned value is NOT NULL. This is useful for computing other calculations where you want to exclude NULL values (such as when computing averages).

IS_VALID(expression)

Returns 0 if the returned value is NULL, and 1 if the returned value is NOT NULL.

expression

Required. A field name or expression.

Define a computed field using IS_VALID. This returns a row count only for the rows where this field value is NOT NULL. If a value is NULL, it returns 0 for that row. In this example, we create a computed field (sale_amount_not_null) using the sale_amount field as the basis.

IS_VALID(sale_amount)

Then you can use the sale_amount_not_null computed field to calculate an accurate average for sale_amount that excludes NULL values:

SUM(sale_amount)/SUM(sale_amount_not_null)

This is what happens automatically when you use the AVG function.

Event Series Processing

Event series processing allows you to partition rows of input data, order the rows sequentially (typically by a timestamp), and search for matching patterns in a set of rows. Computed fields that are defined in a dataset using a PARTITION expression are considered event series processing computed fields. Event series processing computed fields are processed differently than regular computed fields. Instead of computing values from the input of a single row, they compute values from inputs of multiple rows in the dataset. Event series processing computed fields can only be defined in the dataset - not in the vizboard or a lens query.

PARTITION

PARTITION is an event series processing statement that partitions the rows of a dataset, orders the rows sequentially (typically by a timestamp), and searches for matching patterns in a set of rows. Computed fields that are defined in a dataset using a PARTITION expression are considered event series processing computed fields. Event series processing computed fields are processed differently than regular computed fields. Instead of computing values from the input of a single row, they compute values from inputs of multiple rows in the dataset.

The PARTITION function can only be used to define a computed field in the dataset definition (pre-lens build). PARTITION cannot be used to define a vizboard computed field. Unlike other expressions, PARTITION expressions cannot be embedded within other functions or expressions - it must be a top-level expression.

PARTITION BY field_name
ORDER BY field_name [ASC|DESC]
PATTERN (pattern_expression)
DEFINE symbol_1 AS filter_expression [,symbol_n AS filter_expression] [, ...]
OUTPUT output_expression

To understand how event series processing works, we'll walk through a simple example of a PARTITION expression.


This is a simple example of some weblog page view data. Each row represents a page view by a user at a given point in time. Session IDs are used to group together page views that happened in the same user session:

Suppose you wanted to know how many sessions included the path of page visits to 'home.html' then 'product.html' then 'checkout.html'. You could define a PARTITION expression that groups the rows by session, orders by time, and then iterates through the rows from top to bottom to find sessions that match the pattern:

PARTITION BY SessionID
ORDER BY Timestamp
PATTERN (A,B,C)
DEFINE A AS Page = "home.html",
       B AS Page = "product.html",
       C AS Page = "checkout.html"
OUTPUT "TRUE"

1. The PARTITION BY clause partitions (or groups) the rows of the dataset by session.

2. Within each partition, the ORDER BY clause sorts the rows by time (in ascending order by default).

3. Each DEFINE clause specifies a condition used to evaluate a row, and binds that condition to a symbol that is then used in the PATTERN clause.

4. The PATTERN clause checks if the conditions are met in the specified order and frequency. This pattern says that there is a match whenever there are 3 consecutive rows that meet criteria A then B then C.


5. For a row that satisfies all of the PATTERN criteria, the value of the OUTPUT clause is applied. Otherwise the output is NULL for rows that don't meet all of the PATTERN criteria.

Returns one value per row of the same type as the output_expression for rows that match the defined match pattern, otherwise returns NULL for rows that do not match the pattern.

PARTITION BY field_name

Required. The PARTITION BY clause is used to specify a field in the current dataset by which to partition the rows. Rows that share the same value for this field will be grouped together, and each group will then be processed independently according to the matching pattern criteria.

The partition field cannot be a field of a referenced dataset; it must be a field in the current focus dataset.

ORDER BY field_name

Optional. The ORDER BY clause specifies a field by which to sort the rows within each partition before applying the match pattern criteria. For event series processing, records are typically ordered by a DATETIME type field, such as a date or a timestamp. The default sort order is ascending (first to last or low to high).

The ordering field cannot be a field of a referenced dataset; it must be a field in the current focus dataset.

PATTERN (pattern_expression)

Required. The PATTERN clause specifies the matching pattern to search for within a partition of rows. The pattern_expression is expressed in a format similar to a regular expression. The pattern_expression can include:

• A symbol that represents some match criteria (as declared in the DEFINE clause).

• A symbol followed by one of the following regex quantifiers:


? (matches once or not at all - greedy construct)

?? (matches once or not at all - reluctant construct)

* (matches zero or more times - greedy construct)

*? (matches zero or more times - reluctant construct)

+ (matches one or more times - greedy construct)

+? (matches one or more times - reluctant construct)

** (matches the empty sequence, or one or more of the quantified symbol, with gaps allowed in between. The match need not begin or end with the quantified symbol)

*+ (matches the empty sequence, or one or more of the quantified symbol, with gaps allowed in between. The match must end with the quantified symbol)

++ (matches the quantified symbol, followed by zero or more of the quantified symbol, with gaps allowed in between. The match must end with the quantified symbol)

+* (matches the quantified symbol, followed by zero or more of the quantified symbol, with gaps allowed in between. The match need not end with the quantified symbol)

• A symbol or pattern of symbols anchored by the regex special character for the beginning of string:

^ (marks the beginning of the set of rows that match to the pattern)

• patternA|patternB - The alternation operator (pipe symbol) between two symbols or patterns signifies an OR match.

• patternA,patternB - The concatenation operator (comma) between two symbols or patterns signifies a match when pattern B immediately follows pattern A.

• patternA->patternB - The follows operator (minus and greater-than sign) between two symbols or patterns signifies a match when pattern B eventually follows pattern A.

• (pattern_expression) - By default, pattern expressions are matched from left to right. If parentheses are used to group sub-expressions, the sub-expression within the parentheses is evaluated first.

You cannot use quantifiers outside of parentheses. For example, you cannot write ((A,B,C)*) to indicate that the asterisk quantifier applies to the whole (A,B,C) expression.

DEFINE symbol AS filter_expression

Required. The DEFINE clause is used to enumerate symbols used in the PATTERN clause (or in the filter_expression of a subsequent symbol definition).

A symbol is a name used to refer to some pattern matching criteria. This can be any name or token that follows Platfora's object naming rules. For example, if the name contains spaces, special characters, keywords, or starts with a number, you must enclose the name in brackets [] to escape it. Otherwise, this can be any logical name that helps you identify a piece of pattern matching logic in your expression.

The filter_expression is a Boolean (true or false) expression that operates on each row of the partition.

A filter_expression can contain:

• The special expression TRUE or 1, meaning allow the match to occur for any row in the partition.

• Any field_name in the current dataset.

• symbol.field_name - A field from the dataset qualified by the name of a symbol that (1) appears only once in the PATTERN clause, (2) precedes this symbol in the PATTERN clause, and (3) is not followed by a repetition quantifier in the PATTERN clause.

For example:

PATTERN (A, B) DEFINE A AS TRUE, B AS product = A.product

This means that the expression for symbol B will match to a row if the product field for that row is also equal to the product field for the row that is bound to symbol A.

• Any of the comparison operators, such as greater than, less than, equals, and so on.

• The keywords AND or OR (for combining multiple criteria in a single filter expression).

• FIRST|LAST(symbol.field_name) - A field from the dataset, qualified by the name of a symbol that (1) only appears once in the PATTERN clause, (2) precedes this symbol in the PATTERN clause, and (3) is followed by a repetition quantifier in the PATTERN clause (*, *?, +, or +?). This returns the field value for the first or last row when the pattern matches to a set of rows.

For example:

PATTERN (A+) DEFINE A AS product = FIRST(A.product) OR COUNT(A)=0

The pattern A+ will match to a series of consecutive rows that all have the same value for the product field as the first row in the sequence. If the current row happens to be the first row in the sequence, then it will also be included in the match.

A FIRST or LAST expression evaluates to NULL if it refers to a symbol that ends up matching an empty sequence. Make sure your expression handles the row at the beginning or end of a sequence if you want that row to match as well.

• Any computed expression that operates on the fields or expressions listed above and/or on literal values.

OUTPUT output_expression

Required. An expression that specifies what the output value should be. The output expression can refer to:


• The field declared in the PARTITION BY clause.

• symbol.field_name - A field from the dataset, qualified by the name of a symbol that (1) appears only once in the PATTERN clause, and (2) is not followed by a repetition quantifier in the PATTERN clause. This will output the matching field value.

• COUNT(symbol) where symbol (1) appears only once in the PATTERN clause, and (2) is followed by a repetition quantifier in the PATTERN clause. This will output the sequence number of the row that matched the symbol pattern.

• FIRST | LAST | SUM | COUNT | AVG(symbol.field_name) where symbol (1) appears only once in the PATTERN clause, and (2) is followed by a repetition quantifier in the PATTERN clause. This will output an aggregated value for a set of rows that matched the symbol pattern.

• Since you can only output a single column value, you can use the PACK_VALUES function to output multiple results in a single column as key/value pairs.

'Session Start Time' Expression

Calculate a user session by partitioning by user and ordering by time. The matching logic represented by symbol A checks if the time of the current row is less than 30 minutes from the preceding row. If it is, then it is considered part of the same session as the previous row. Otherwise, the current row is considered the start of a new session. The PATTERN (A+) means that the matching logic represented by symbol A must be true for one or more consecutive rows. The output then returns the time of the first row in a session.

PARTITION BY UserID
ORDER BY Timestamp
PATTERN (A+)
DEFINE A AS COUNT(A)=0 OR MINUTES_BETWEEN(Timestamp,LAST(A.Timestamp)) < 30
OUTPUT FIRST(A.Timestamp)

'Click Number in Session' Expression

Calculate where a click happened in a session by partitioning by session and ordering by time. The matching logic represented by symbol A simply matches to any row in the session. The PATTERN (A+) means that the matching logic represented by symbol A must be true for one or more consecutive rows. The output then returns the count of the row within the partition (based on its order or position in the partition).

PARTITION BY [Session ID]
ORDER BY Timestamp
PATTERN (A+)
DEFINE A AS TRUE
OUTPUT COUNT(A)

'Path to Page' Expression

This is a complicated expression that looks back from the current row's position to determine the previous 4 pages viewed in a session. Since a PARTITION expression can only output one column value as its result, the OUTPUT clause uses the PACK_VALUES function to return the previous page positions 1, 2, 3, and 4 in one output value. You can then use a series of EXTRACT_VALUE expressions to create individual columns for each prior page view in the path.

PARTITION BY SessionID
ORDER BY Timestamp
PATTERN (^OtherPreviousPages*?, Page4Back??, Page3Back??, Page2Back??, Page1Back??, CurrentPage)
DEFINE OtherPreviousPages AS TRUE,
       Page4Back AS TRUE,
       Page3Back AS TRUE,
       Page2Back AS TRUE,
       Page1Back AS TRUE,
       CurrentPage AS TRUE
OUTPUT PACK_VALUES("Back4",Page4Back.Page, "Back3",Page3Back.Page, "Back2",Page2Back.Page, "Back1",Page1Back.Page)

'Page -1 Back' Expression

Use the output from the Path to Page expression and extract the last page viewed before the current page.

EXTRACT_VALUE([Path to Page],"Back1")

PACK_VALUES

PACK_VALUES is a row function that returns multiple output values packed into a single string of key/value pairs separated by the Platfora default key and pair separators. This is useful when the OUTPUT clause of a PARTITION expression returns multiple output values. The string returned is in a format that can be read by the EXTRACT_VALUE function. PACK_VALUES uses the same key and pair separator values that EXTRACT_VALUE uses (the Unicode escape sequences u0003 and u0002, respectively).

PACK_VALUES(key_string,value_expression[,key_string,value_expression][,...])

Returns one value per row of type STRING. If the value for either key_string or value_expression of a pair is null or contains either of the two separators, the full key/value pair is omitted from the return value.

key_string

At least one required. A field name of any type, a literal string or number, or an expression that returns any value.

value_expression

At least one required. A field name of any type, a literal string or number, or an expression that returns any value. The expression must include one value_expression instance for each key_string instance.

Combine the values of the custid and age fields into a single string field.

PACK_VALUES("ID",custid,"Age",age)

The following expression returns ID\u00035555\u0002Age\u000329 when the value of the custid field is 5555 and the value of the age field is 29:


PACK_VALUES("ID",custid,"Age",age)

The following expression returns Age\u000329 when the value of the age field is 29:

PACK_VALUES("ID",NULL,"Age",age)

The following expression returns 29 as a STRING value when the age field is an INTEGER and its value is 29:

EXTRACT_VALUE(PACK_VALUES("ID",custid,"Age",age),"Age")

You might want to use the PACK_VALUES function to combine multiple field values into a single value in the OUTPUT clause of the PARTITION (event series processing) function. Then you can use the EXTRACT_VALUE function in a different computed field in the dataset to get one of the values returned by the PARTITION function. For example, in the example below, the PARTITION function creates a set of rows that defines the previous five web pages accessed in a particular user session:

PARTITION BY Session
ORDER BY Time DESC
PATTERN (A?, B?, C?, D?, E)
DEFINE A AS true, B AS true, C AS true, D AS true, E AS true
OUTPUT PACK_VALUES("A", A.Page, "B", B.Page, "C", C.Page, "D", D.Page)

String Functions

String functions allow you to manipulate and transform textual data, such as combining string values or extracting a portion of a string value.

CONCAT

CONCAT is a row function that returns a string by concatenating (combining together) the results of multiple string expressions.

CONCAT(value_expression[,value_expression][,...])

Returns one value per row of type STRING.

value_expression

At least one required. A field name of any type, a literal string or number, or an expression that returns any value.

Combine the values of the month, day, and year fields into a single date field formatted as MM/DD/YYYY.

CONCAT(month,"/",day,"/",year)

ARRAY_CONTAINS

ARRAY_CONTAINS is a row function that performs a whole string match against a string containing delimited values and returns a 1 or 0 depending on whether or not the string contains the search value.

ARRAY_CONTAINS(array_string,"delimiter","search_string")


Returns one value per row of type INTEGER. A return value of 1 indicates a positive match, and a return value of 0 indicates no match.

array_string

Required. The name of a field or expression of type STRING (or a literal string) that contains a valid array.

delimiter

Required. The delimiter used between values in the array string. This can be a name of a field or expression of type STRING.

search_string

Required. The literal string that you want to search for. This can be a name of a field or expression of type STRING.

If you had a device field that contained a comma-delimited list formatted like this:

Safari,iPad

You could determine whether or not the device used was an iPad using the following expression:

ARRAY_CONTAINS(device,",","iPad")

The following expressions return 1:

ARRAY_CONTAINS("platfora","|","platfora")

ARRAY_CONTAINS("platfora|hadoop|2.3","|","hadoop")

The following expressions return 0:

ARRAY_CONTAINS("platfora","|","plat")

ARRAY_CONTAINS("platfora,hadoop","|","platfora")

FILE_NAME

FILE_NAME is a row function that returns the original file name from the source file system. This is useful when the source data that comprises a dataset comes from multiple files, and there is useful information in the file names themselves (such as dates or server names). You can use FILE_NAME in combination with other string processing functions to extract useful information from the file name.

FILE_NAME()

Returns one value per row of type STRING.

Your dataset is based on daily log files that use an 8 character date as part of the file name. For example, 20120704.log is the file name used for the log file created on July 4, 2012. The following expression uses FILE_NAME in combination with SUBSTRING and TO_DATE to create a date field from the first 8 characters of the file name.

TO_DATE(SUBSTRING(FILE_NAME(),0,8),"yyyyMMdd")


Your dataset is based on log files that use the server IP address as part of the file name. For example, 172.12.131.118.log is the log file name for server 172.12.131.118. The following expression uses FILE_NAME in combination with REGEX to extract the IP address from the file name.

REGEX(FILE_NAME(),"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\.log")

FILE_PATH

FILE_PATH is a row function that returns the full URI path from the source file system. This is useful when the source data that comprises a dataset comes from multiple files, and there is useful information in the directory names or file names themselves (such as dates or server names). You can use FILE_PATH in combination with other string processing functions to extract useful information from the file path.

FILE_PATH()

Returns one value per row of type STRING.

Your dataset is based on daily log files that are organized into directories by date on the source file system, and the file names are the server IP address of the server that produced the log file. For example, the URI path to a log file produced by server 172.12.131.118 on July 4, 2012 is hdfs://myhdfs-server.com/data/logs/20120704/172.12.131.118.log.

The following expression uses FILE_PATH in combination with REGEX and TO_DATE to create a date field from the date directory name.

TO_DATE(REGEX(FILE_PATH(),"hdfs://myhdfs-server.com/data/logs/(\d{8})/(?:\d{1,3}\.*)+\.log"),"yyyyMMdd")

And the following expression uses FILE_NAME and REGEX to extract the server IP address from the file name:

REGEX(FILE_NAME(),"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\.log")

EXTRACT_COOKIE

EXTRACT_COOKIE is a row function that extracts the value of the given cookie identifier from a semi-colon delimited list of cookie key=value pairs. This function can be used to extract a particular cookie value from a combined web access log Cookie column.

EXTRACT_COOKIE("cookie_list_string",cookie_key_string)

Returns the value of the specified cookie key as type STRING.

cookie_list_string

Required. A field or literal string that has a semi-colon delimited list of cookie key=value pairs.

cookie_key_string

Required. The cookie key name for which to extract the cookie value.

Extract the value of the vID cookie from a literal cookie string:


EXTRACT_COOKIE("SSID=ABC; vID=44", "vID") returns 44

Extract the value of the vID cookie from a field named Cookie:

EXTRACT_COOKIE(Cookie,"vID")

EXTRACT_VALUE

EXTRACT_VALUE is a row function that extracts the value for the given key from a string containing delimited key/value pairs.

EXTRACT_VALUE(string,key_name [,delimiter] [,pair_delimiter])

Returns the value of the specified key as type STRING.

string

Required. A field or literal string that contains a delimited list of key/value pairs.

key_name

Required. The key name for which to extract the value.

delimiter

Optional. The delimiter used between the key and the value. If not specified, the value u0003 is used. This is the Unicode escape sequence for the end of text character (which is the default delimiter used by Hive).

pair_delimiter

Optional. The delimiter used between key/value pairs when the input string contains more than one key/value pair. If not specified, the value u0002 is used. This is the Unicode escape sequence for the start of text character (which is the default delimiter used by Hive).

Extract the value of the lastname key from a literal string of key/value pairs:

EXTRACT_VALUE("firstname;daria|lastname;hutch","lastname",";","|") returns hutch

Extract the value of the email key from a string field named contact_info that contains strings in the format of key:value,key:value:

EXTRACT_VALUE(contact_info,"email",":",",")

INSTR

INSTR is a row function that returns an integer indicating the position of a character within a string that is the first character of the occurrence of a substring. Platfora's INSTR function is similar to the FIND function in Excel, except that the first letter is position 0 and the order of the arguments is reversed.

INSTR(string,substring,position,occurrence)

Returns one value per row of type INTEGER. The first position is indicated with the value of zero (0).

string


Required. The name of a field or expression of type STRING (or a literal string).

substring

Required. A literal string or name of a field that specifies the substring to search for in string. Note that to search for the double quotation mark ( " ) as a literal string, you must escape it with another double quotation mark: ""

position

Optional. An integer that specifies at which character in string to start searching for substring. A value of 0 (zero) starts the search at the beginning of string. Use a positive integer to start searching from the beginning of string, and use a negative integer to start searching from the end of string. When no position is specified, INSTR searches at the beginning of the string (0).

occurrence

Optional. A positive integer that specifies which occurrence of substring to search for. When no occurrence is specified, INSTR searches for the first occurrence of the substring (1).

Return the position of the first occurrence of the substring "http://" starting at the end of the url field:

INSTR(url,"http://",-1,1)

The following expression searches for the second occurrence of the substring "st" starting at the beginning of the string "bestteststring". INSTR finds that the substring starts at the seventh character in the string, so it returns 6:

INSTR("bestteststring","st",0,2)

The following expression searches backward for the second occurrence of the substring "st" starting at 7 characters before the end of the string "bestteststring". INSTR finds that the substring starts at the third character in the string, so it returns 2:

INSTR("bestteststring","st",-7,2)

JAVA_STRING

JAVA_STRING is a row function that returns the unescaped version of a Java unicode character escape sequence as a string value. This is useful when you want to specify unicode characters in an expression. For example, you can use JAVA_STRING to specify the unicode value representing a control character.

JAVA_STRING(unicode_escape_sequence)

Returns the unescaped version of the specified unicode character, one value per row of type STRING.

unicode_escape_sequence

Required. A STRING value containing a unicode character expressed as a Java unicode escape sequence. Unicode escape sequences consist of a backslash '\' (ASCII character 92, hex 0x5c), a 'u' (ASCII 117, hex 0x75), optionally one or more additional 'u' characters, and four hexadecimal digits (the characters '0' through '9' or 'a' through 'f' or 'A' through 'F'). Such sequences represent the UTF-16 encoding of a Unicode character. For example, the letter 'a' is equivalent to '\u0061'.


Evaluates whether the currency field is equal to the yen symbol.

CASE WHEN currency == JAVA_STRING("\u00a5") THEN "yes" ELSE "no" END

JOIN_STRINGS

JOIN_STRINGS is a row function that returns a string by concatenating (combining together) the results of multiple values with the separator in between each non-null value.

JOIN_STRINGS(separator,value_expression[,value_expression][,...])

Returns one value per row of type STRING.

separator

Required. A field name of type STRING, a literal string, or an expression that returns a string.

value_expression

At least one required. A field name of any type, a literal string or number, or an expression that returns any value.

Combine the values of the month, day, and year fields into a single date field formatted as MM/DD/YYYY.

JOIN_STRINGS("/",month,day,year)

The following expression returns NULL:

JOIN_STRINGS("+",NULL,NULL,NULL)

The following expression returns a+b:

JOIN_STRINGS("+","a","b",NULL)

JSON_ARRAY

JSON_ARRAY is a row function that extracts a JSON ARRAY as a STRING value from a field in a JSON object.

JSON_ARRAY(json_string,"json_field")

Returns one value per row of type STRING.

json_string

Required. The name of a field or expression of type STRING (or a literal string) that contains a valid JSON object.

json_field

Required. The key or name of the field value you want to extract.

For top-level fields, specify the name identifier (key) of the field.


To access fields within a nested object, specify a dot-separated path of field names (for example, top_level_field_name.nested_field_name).

To extract a value from an array, specify the dot-separated path of field names and the array position starting at 0 for the first value in an array, 1 for the second value, and so on (for example, field_name.0).

If the name identifier contains dot or period characters within the name itself, escape the name by enclosing it in brackets (for example, [field.name.with.dot].[another.dot.field.name]).

If the field name is null (empty), use brackets with nothing in between as the identifier, for example [].

If you had a friends field that contained a JSON object formatted like this:

{"f1":[{"id":0,"name":"Brenda Griffin","f2":{"id":0,"name":"Bowen Blair","f3":{"id":0,"name":"Maude Hoffman"}}},{"f4":{"f5":1}}]}

You could extract the f1 value using the following expression:

JSON_ARRAY(friends, "f1")

This expression would return the following value:

[{"id":0,"name":"Brenda Griffin","f2":{"id":0,"name":"Bowen Blair","f3":{"id":0,"name":"Maude Hoffman"}}},{"f4":{"f5":1}}]

Suppose you have a field called json_field that contains the following value:

{"int":10, "string": "hello world", "array": [1,2,3], "object": {"key":"value" }, "nested": [{"nkey":"nvalue"},[4,5,6]]}

The following expressions return the following results:

JSON_ARRAY(json_field, "array") returns [1,2,3]

JSON_ARRAY(json_field, "object") returns NULL

JSON_ARRAY(json_field, "int") returns NULL

JSON_ARRAY(json_field, "string") returns NULL

JSON_ARRAY(json_field, "nested.0") returns NULL

JSON_ARRAY(json_field, "nested.1") returns [4,5,6]

JSON_ARRAY(json_field, "array.0") returns NULL

JSON_ARRAY_CONTAINS

JSON_ARRAY_CONTAINS is a row function that performs a whole string match against a string formatted as a JSON array and returns a 1 or 0 depending on whether or not the string contains the search value.

JSON_ARRAY_CONTAINS(json_array_string,"search_string")


Returns one value per row of type INTEGER. A return value of 1 indicates a positive match, and a return value of 0 indicates no match.

json_array_string

Required. The name of a field or expression of type STRING (or a literal string) that contains a valid JSON array. A JSON array is an ordered sequence of values separated by commas and enclosed in square brackets.

search_string

Required. The literal string that you want to search for. This can be a name of a field or expression of type STRING.

If you have a software field that contains a JSON array formatted like this:

["hadoop","platfora"]

The following expression returns 1:

JSON_ARRAY_CONTAINS(software,"platfora")

JSON_DOUBLE

JSON_DOUBLE is a row function that extracts a DOUBLE value from a field in a JSON object.

JSON_DOUBLE(json_string,"json_field")

Returns one value per row of type DOUBLE.

json_string

Required. The name of a field or expression of type STRING (or a literal string) that contains a validJSON object.

json_field

Required. The key or name of the field value you want to extract.

For top-level fields, specify the name identifier (key) of the field.

To access fields within a nested object, specify a dot-separated path of field names (for example, top_level_field_name.nested_field_name).

To extract a value from an array, specify the dot-separated path of field names and the array position starting at 0 for the first value in an array, 1 for the second value, and so on (for example, field_name.0).

If the name identifier contains dot or period characters within the name itself, escape the name by enclosing it in brackets (for example, [field.name.with.dot].[another.dot.field.name]).

If the field name is null (empty), use brackets with nothing in between as the identifier, for example [].

If you had a top_scores field that contained a JSON object formatted like this (with the values contained in an array):


{"practice_scores":["538.67","674.99","1021.52"], "test_scores":["753.21","957.88","1032.87"]}

You could extract the third value of the test_scores array using the expression:

JSON_DOUBLE(top_scores,"test_scores.2")

JSON_FIXED

JSON_FIXED is a row function that extracts a FIXED value from a field in a JSON object.

JSON_FIXED(json_string,"json_field")

Returns one value per row of type FIXED.

json_string

Required. The name of a field or expression of type STRING (or a literal string) that contains a valid JSON object.

json_field

Required. The key or name of the field value you want to extract.

For top-level fields, specify the name identifier (key) of the field.

To access fields within a nested object, specify a dot-separated path of field names (for example, top_level_field_name.nested_field_name).

To extract a value from an array, specify the dot-separated path of field names and the array position starting at 0 for the first value in an array, 1 for the second value, and so on (for example, field_name.0).

If the name identifier contains dot or period characters within the name itself, escape the name by enclosing it in brackets (for example, [field.name.with.dot].[another.dot.field.name]).

If the field name is null (empty), use brackets with nothing in between as the identifier, for example [].

If you had a top_scores field that contained a JSON object formatted like this (with the values contained in an array):

{"practice_scores":["538.67","674.99","1021.52"], "test_scores":["753.21","957.88","1032.87"]}

You could extract the third value of the test_scores array using the expression:

JSON_FIXED(top_scores,"test_scores.2")

JSON_INTEGER

JSON_INTEGER is a row function that extracts an INTEGER value from a field in a JSON object.

JSON_INTEGER(json_string,"json_field")

Returns one value per row of type INTEGER.

json_string


Required. The name of a field or expression of type STRING (or a literal string) that contains a valid JSON object.

json_field

Required. The key or name of the field value you want to extract.

For top-level fields, specify the name identifier (key) of the field.

To access fields within a nested object, specify a dot-separated path of field names (for example, top_level_field_name.nested_field_name).

To extract a value from an array, specify the dot-separated path of field names and the array position starting at 0 for the first value in an array, 1 for the second value, and so on (for example, field_name.0).

If the name identifier contains dot or period characters within the name itself, escape the name by enclosing it in brackets (for example, [field.name.with.dot].[another.dot.field.name]).

If the field name is null (empty), use brackets with nothing in between as the identifier, for example [].

If you had an address field that contained a JSON object formatted like this:

{"street_address":"123 B Street", "city":"San Mateo", "state":"CA","zip_code":"94403"}

You could extract the zip_code value using the expression:

JSON_INTEGER(address,"zip_code")

If you had a top_scores field that contained a JSON object formatted like this (with the values contained in an array):

{"practice_scores":["538","674","1021"], "test_scores":["753","957","1032"]}

You could extract the third value of the test_scores array using the expression:

JSON_INTEGER(top_scores,"test_scores.2")

JSON_LONG

JSON_LONG is a row function that extracts a LONG value from a field in a JSON object.

JSON_LONG(json_string,"json_field")

Returns one value per row of type LONG.

json_string

Required. The name of a field or expression of type STRING (or a literal string) that contains a valid JSON object.

json_field

Required. The key or name of the field value you want to extract.


For top-level fields, specify the name identifier (key) of the field.

To access fields within a nested object, specify a dot-separated path of field names (for example, top_level_field_name.nested_field_name).

To extract a value from an array, specify the dot-separated path of field names and the array position starting at 0 for the first value in an array, 1 for the second value, and so on (for example, field_name.0).

If the name identifier contains dot or period characters within the name itself, escape the name by enclosing it in brackets (for example, [field.name.with.dot].[another.dot.field.name]).

If the field name is null (empty), use brackets with nothing in between as the identifier, for example [].

If you had a top_scores field that contained a JSON object formatted like this (with the values contained in an array):

{"practice_scores":["538","674","1021"], "test_scores":["753","957","1032"]}

You could extract the third value of the test_scores array using the expression:

JSON_LONG(top_scores,"test_scores.2")

JSON_OBJECT

JSON_OBJECT is a row function that extracts a JSON OBJECT as a STRING value from a field in a JSON object.

JSON_OBJECT(json_string,"json_field")

Returns one value per row of type STRING.

json_string

Required. The name of a field or expression of type STRING (or a literal string) that contains a valid JSON object.

json_field

Required. The key or name of the field value you want to extract.

For top-level fields, specify the name identifier (key) of the field.

To access fields within a nested object, specify a dot-separated path of field names (for example, top_level_field_name.nested_field_name).

To extract a value from an array, specify the dot-separated path of field names and the array position starting at 0 for the first value in an array, 1 for the second value, and so on (for example, field_name.0).

If the name identifier contains dot or period characters within the name itself, escape the name by enclosing it in brackets (for example, [field.name.with.dot].[another.dot.field.name]).

If the field name is null (empty), use brackets with nothing in between as the identifier, for example [].


If you had a friends field that contained a JSON object formatted like this:

{"f1":[{"id":0,"name":"Brenda Griffin","f2":{"id":0,"name":"Bowen Blair","f3":{"id":0,"name":"Maude Hoffman"}}},{"f4":{"f5":1}}]}

You could extract the f3 value using the following expression:

JSON_OBJECT(friends, "f1.0.f2.f3")

This expression would return the following value:

{"id":0,"name":"Maude Hoffman"}And the following expression:

JSON_OBJECT(friends, "f1.1")

Returns the following value:

{"f4":{"f5":1}}

Suppose you have a field called json_field that contains the following value:

{"int":10, "string": "hello world", "array": [1,2,3], "object": {"key":"value" }, "nested": [{"nkey":"nvalue"},[4,5,6]]}

The following expressions return the following results:

Expression Return Value

JSON_ARRAY(json_field, "array") NULL

JSON_ARRAY(json_field, "object") {"key":"value"}

JSON_ARRAY(json_field, "int") NULL

JSON_ARRAY(json_field, "string") NULL

JSON_ARRAY(json_field, "nested.0") {"nkey":"nvalue"}

JSON_ARRAY(json_field, "nested.1") NULL

JSON_OBJECT(json_field, "object.key") NULL

JSON_STRING

JSON_STRING is a row function that extracts a STRING value from a field in a JSON object.

JSON_STRING(json_string,"json_field")

Returns one value per row of type STRING.

json_string

Required. The name of a field or expression of type STRING (or a literal string) that contains a valid JSON object.

json_field


Required. The key or name of the field value you want to extract.

For top-level fields, specify the name identifier (key) of the field.

To access fields within a nested object, specify a dot-separated path of field names (for example, top_level_field_name.nested_field_name).

To extract a value from an array, specify the dot-separated path of field names and the array position starting at 0 for the first value in an array, 1 for the second value, and so on (for example, field_name.0).

If the name identifier contains dot or period characters within the name itself, escape the name by enclosing it in brackets (for example, [field.name.with.dot].[another.dot.field.name]).

If the field name is null (empty), use brackets with nothing in between as the identifier, for example [].

If you had an address field that contained a JSON object formatted like this:

{"street_address":"123 B Street", "city":"San Mateo", "state":"CA","zip":"94403"}

You could extract the state value using the expression:

JSON_STRING(address,"state")

If you had a misc field that contained a JSON object formatted like this (with the values contained in an array):

{"hobbies":["sailing","hiking","cooking"], "interests":["art","music","travel"]}

You could extract the first value of the hobbies array using the expression:

JSON_STRING(misc,"hobbies.0")

LENGTH

LENGTH is a row function that returns the count of characters in a string value.

LENGTH(string)

Returns one value per row of type INTEGER.

string

Required. The name of a field or expression of type STRING (or a literal string).

Return the count of characters from values in the name field. For example, the value Bob would return a length of 3, Julie would return a length of 5, and so on:

LENGTH(name)


REGEX

REGEX is a row function that performs a whole string match against a string value with a regular expression and returns the portion of the string matching the first capturing group of the regular expression.

REGEX(string_expression,"regex_matching_pattern")

Returns the matched STRING value of the first capturing group of the regular expression. If there is no match, returns NULL.

string_expression

Required. The name of a field or expression of type STRING (or a literal string).

regex_matching_pattern

Required. A regular expression pattern based on the regular expression pattern matching syntax of the Java programming language. To return a non-NULL value, the regular expression pattern must match the entire string value.

This section lists a summary of the most commonly used constructs for defining a regular expression matching pattern. See the Regular Expression Reference for more information about regular expression support in Platfora.

Literal and Special Characters

The most basic form of pattern matching is the match of literal characters. For example, if the regular expression is foo and the input string is foo, the match will succeed because the strings are identical. Certain characters are reserved for special use in regular expressions. These special characters are often called metacharacters. If you want to use special characters as literal characters, they must be escaped. You can escape a single character using a \ (backslash), or escape a character sequence by enclosing it in \Q ... \E.

To escape literal double-quotes, double the double-quotes ("").

Character Name Character Reserved For

opening bracket [ start of a character class

closing bracket ] end of a character class

hyphen - character ranges within a character class

backslash \ general escape character

caret ^ beginning of string, negating of a character class

dollar sign $ end of string

period . matching any single character


Character Name Character Reserved For

pipe | alternation (OR) operator

question mark ? optional quantifier, quantifier minimizer

asterisk * zero or more quantifier

plus sign + once or more quantifier

opening parenthesis ( start of a subexpression group

closing parenthesis ) end of a subexpression group

opening brace { start of min/max quantifier

closing brace } end of min/max quantifier

Character Class Constructs

A character class allows you to specify a set of characters, enclosed in square brackets, that can produce a single character match. There are also a number of special predefined character classes (backslash character sequences that are shorthand for the most common character sets).

Construct Type Description

[abc] simple matches a or b or c

[^abc] negation matches any character except a, b, or c

[a-zA-Z] range matches a through z, or A through Z (inclusive)

[a-d[m-p]] union matches a through d, or m through p

[a-z&&[def]] intersection matches d, e, or f

[a-z&&[^xq]] subtraction matches a through z, except for x and q

Predefined Character Classes


Predefined character classes offer convenient shorthands for commonly used regular expressions.

Construct Description Example

. matches any single character (except newline). Example: .at matches "cat", "hat", and also "bat" in the phrase "batch files"

\d matches any digit character (equivalent to [0-9]). Example: \d matches "3" in "C3PO" and "2" in "file_2.txt"

\D matches any non-digit character (equivalent to [^0-9]). Example: \D matches "S" in "900S" and "Q" in "Q45"

\s matches any single white-space character (equivalent to [ \t\n\x0B\f\r]). Example: \sbook matches "book" in "blue book" but nothing in "notebook"

\S matches any single non-white-space character. Example: \Sbook matches "book" in "notebook" but nothing in "blue book"

\w matches any alphanumeric character, including underscore (equivalent to [A-Za-z0-9_]). Example: r\w* matches "rm" and "root"

\W matches any non-alphanumeric character (equivalent to [^A-Za-z0-9_]). Example: \W matches "&" in "stmd &", "%" in "100%", and "$" in "$HOME"

Line and Word Boundaries

Boundary matching constructs are used to specify where in a string to apply a matching pattern. For example, you can search for a particular pattern within a word boundary, or search for a pattern at the beginning or end of a line.

Construct Description Example

^ matches from the beginning of a line (multi-line matches are currently not supported). Example: ^172 will match the "172" in IP address "172.18.1.11" but not in "192.172.2.33"

$ matches from the end of a line (multi-line matches are currently not supported). Example: d$ will match the "d" in "maid" but not in "made"

\b matches within a word boundary. Example: \bis\b matches the word "is" in "this is my island", but not the "is" part of "this" or "island". \bis matches both "is" and the "is" in "island", but not in "this".

\B matches within a non-word boundary. Example: \Bb matches "b" in "sbin" but not in "bash"

Quantifiers

Quantifiers specify how often the preceding regular expression construct should match. There are three classes of quantifiers: greedy, reluctant, and possessive. The difference between greedy, reluctant, and possessive quantifiers involves what part of the string to try for the initial match, and how to retry if the initial attempt does not produce a match.

Greedy Reluctant Possessive Description and Example

? ?? ?+ matches the previous character or construct once or not at all. Example: st?on matches "son" in "johnson" and "ston" in "johnston" but nothing in "clinton" or "version"

* *? *+ matches the previous character or construct zero or more times. Example: if* matches "if", "iff" in "diff", or "i" in "print"

+ +? ++ matches the previous character or construct one or more times. Example: if+ matches "if", "iff" in "diff", but nothing in "print"

{n} {n}? {n}+ matches the previous character or construct exactly n times. Example: o{2} matches "oo" in "lookup" and the first two o's in "fooooo" but nothing in "mount"

{n,} {n,}? {n,}+ matches the previous character or construct at least n times. Example: o{2,} matches "oo" in "lookup" and all five o's in "fooooo" but nothing in "mount"

{n,m} {n,m}? {n,m}+ matches the previous character or construct at least n times, but no more than m times. Example: F{2,4} matches "FF" in "#FF0000" and the last four F's in "#FFFFFF"

Groups are specified by a pair of parentheses around a subpattern in the regular expression. A pattern can have more than one group and the groups can be nested. The groups are numbered 1-n from left to right, starting with the first opening parenthesis. There is always an implicit group 0, which contains the entire match. For example, the pattern:

(a(b*))+(c)

contains three groups:

group 1: (a(b*))
group 2: (b*)
group 3: (c)

Capturing Groups

By default, a group captures the text that produces a match, and only the most recent match is captured. The REGEX function returns the string that matches the first capturing group in the regular expression. For example, if the input string to the expression above was abc, the entire REGEX function would match abc, but only return the result of group 1, which is ab.

Non-Capturing Groups

In some cases, you may want to use parentheses to group subpatterns, but not capture text. A non-capturing group starts with (?: (a question mark and colon following the opening parenthesis). For example, h(?:a|i|o)t matches hat or hit or hot, but does not capture the a, i, or o from the subexpression.

Match all possible email address strings with a pattern of username@provider.domain, but only return the provider portion of the email address from the email field:

REGEX(email,"^[a-zA-Z0-9._%+-]+@([a-zA-Z0-9._-]+)\.[a-zA-Z]{2,4}$")

Match the request line of a web log, where the value is in the format of:


GET /some_page.html HTTP/1.1

and return just the requested HTML page names:

REGEX(weblog.request_line,"GET\s/([a-zA-Z0-9._%-]+\.html)\sHTTP/[0-9.]+")

Extract the inches portion from a height field where example values are 6'2", 5'11" (notice the escaping of the literal quote with a double double-quote):

REGEX(height, "\d\'(\d+)""")

Extract all of the contents of the device field when the value is either iPod, iPad, or iPhone:

REGEX(device,"(iP[ao]d|iPhone)")

REGEX_REPLACE

REGEX_REPLACE is a row function that evaluates a string value against a regular expression to determine if there is a match, and replaces matched strings with the specified replacement value.

REGEX_REPLACE(string_expression,"regex_match_pattern","regex_replace_pattern")

Returns the regex_replace_pattern as a STRING value when regex_match_pattern produces a match. If there is no match, returns the value of string_expression as a STRING.

string_expression

Required. The name of a field or expression of type STRING (or a literal string).

regex_match_pattern

Required. A string literal or regular expression pattern based on the regular expression pattern matching syntax of the Java programming language. You can use capturing groups to create backreferences that can be used in the regex_replace_pattern. You might want to use a string literal to make a case-sensitive match. For example, when you enter jane as the match value, the function matches jane but not Jane. The function matches all occurrences of a string literal in the string expression.

regex_replace_pattern

Required. A string literal or regular expression pattern based on the regular expression pattern matching syntax of the Java programming language. You can refer to backreferences from the regex_match_pattern using the syntax $n (where n is the group number).

This section lists a summary of the most commonly used constructs for defining a regular expression matching pattern. See the Regular Expression Reference for more information.

Literal and Special Characters

The most basic form of pattern matching is the match of literal characters. For example, if the regular expression is foo and the input string is foo, the match will succeed because the strings are identical. Certain characters are reserved for special use in regular expressions. These special characters are often called metacharacters. If you want to use special characters as literal characters, they must be escaped. You can escape a single character using a \ (backslash), or escape a character sequence by enclosing it in \Q ... \E.

Character Name Character Reserved For

opening bracket [ start of a character class

closing bracket ] end of a character class

hyphen - character ranges within a character class

backslash \ general escape character

caret ^ beginning of string, negating of a character class

dollar sign $ end of string

period . matching any single character

pipe | alternation (OR) operator

question mark ? optional quantifier, quantifier minimizer

asterisk * zero or more quantifier

plus sign + once or more quantifier

opening parenthesis ( start of a subexpression group

closing parenthesis ) end of a subexpression group

opening brace { start of min/max quantifier

closing brace } end of min/max quantifier

Character Class Constructs


A character class allows you to specify a set of characters, enclosed in square brackets, that can produce a single character match. There are also a number of special predefined character classes (backslash character sequences that are shorthand for the most common character sets).

Construct Type Description

[abc] simple matches a or b or c

[^abc] negation matches any character except a, b, or c

[a-zA-Z] range matches a through z, or A through Z (inclusive)

[a-d[m-p]] union matches a through d, or m through p

[a-z&&[def]] intersection matches d, e, or f

[a-z&&[^xq]] subtraction matches a through z, except for x and q

Predefined Character Classes

Predefined character classes offer convenient shorthands for commonly used regular expressions.

Construct Description Example

. matches any single character (except newline). Example: .at matches "cat", "hat", and also "bat" in the phrase "batch files"

\d matches any digit character (equivalent to [0-9]). Example: \d matches "3" in "C3PO" and "2" in "file_2.txt"

\D matches any non-digit character (equivalent to [^0-9]). Example: \D matches "S" in "900S" and "Q" in "Q45"

\s matches any single white-space character (equivalent to [ \t\n\x0B\f\r]). Example: \sbook matches "book" in "blue book" but nothing in "notebook"

\S matches any single non-white-space character. Example: \Sbook matches "book" in "notebook" but nothing in "blue book"

\w matches any alphanumeric character, including underscore (equivalent to [A-Za-z0-9_]). Example: r\w* matches "rm" and "root"

\W matches any non-alphanumeric character (equivalent to [^A-Za-z0-9_]). Example: \W matches "&" in "stmd &", "%" in "100%", and "$" in "$HOME"

Line and Word Boundaries

Boundary matching constructs are used to specify where in a string to apply a matching pattern. For example, you can search for a particular pattern within a word boundary, or search for a pattern at the beginning or end of a line.

Construct Description Example

^ matches from the beginning of a line (multi-line matches are currently not supported). Example: ^172 will match the "172" in IP address "172.18.1.11" but not in "192.172.2.33"

$ matches from the end of a line (multi-line matches are currently not supported). Example: d$ will match the "d" in "maid" but not in "made"

\b matches within a word boundary. Example: \bis\b matches the word "is" in "this is my island", but not the "is" part of "this" or "island". \bis matches both "is" and the "is" in "island", but not in "this".

\B matches within a non-word boundary. Example: \Bb matches "b" in "sbin" but not in "bash"

Quantifiers

Quantifiers specify how often the preceding regular expression construct should match. There are three classes of quantifiers: greedy, reluctant, and possessive. The difference between greedy, reluctant, and possessive quantifiers involves what part of the string to try for the initial match, and how to retry if the initial attempt does not produce a match.

Greedy Reluctant Possessive Description and Example

? ?? ?+ matches the previous character or construct once or not at all. Example: st?on matches "son" in "johnson" and "ston" in "johnston" but nothing in "clinton" or "version"

* *? *+ matches the previous character or construct zero or more times. Example: if* matches "if", "iff" in "diff", or "i" in "print"

+ +? ++ matches the previous character or construct one or more times. Example: if+ matches "if", "iff" in "diff", but nothing in "print"

{n} {n}? {n}+ matches the previous character or construct exactly n times. Example: o{2} matches "oo" in "lookup" and the first two o's in "fooooo" but nothing in "mount"

{n,} {n,}? {n,}+ matches the previous character or construct at least n times. Example: o{2,} matches "oo" in "lookup" and all five o's in "fooooo" but nothing in "mount"

{n,m} {n,m}? {n,m}+ matches the previous character or construct at least n times, but no more than m times. Example: F{2,4} matches "FF" in "#FF0000" and the last four F's in "#FFFFFF"

Match the values in a phone_number field where phone number values are formatted as xxx.xxx.xxxx and replace them with phone number values formatted as (xxx) xxx-xxxx:

REGEX_REPLACE(phone_number,"([0-9]{3})\.([0-9]{3})\.([0-9]{4})","\($1\) $2-$3")

Match the values in a name field where name values are formatted as firstname lastname and replace them with name values formatted as lastname, firstname:


REGEX_REPLACE(name,"(.*) (.*)","$2, $1")

Match the string literal mrs in a title field and replace it with the string literal Mrs.

REGEX_REPLACE(title,"mrs","Mrs")

SPLIT

SPLIT is a row function that breaks down a delimited input string into sections and returns the specified section of the string. A section is any sub-string between occurrences of the specified delimiter.

SPLIT(input_string_expression,"delimiter_string",position_integer)

Returns one value per row of type STRING.

input_string_expression

Required. The name of a field or expression of type STRING (or a literal string).

delimiter_string

Required. A literal string representing the delimiter used to separate values in the input string. The delimiter can be a single character or multiple characters.

position_integer

Required. An integer representing the position of the section in the input string that you want to extract. Positive integers count the position from the beginning of the string, and negative integers count the position from the end of the string. A value of 0 returns NULL.

Return the last section (the third) of the literal delimited string Restaurants>Location>San Francisco:

SPLIT("Restaurants>Location>San Francisco",">", -1) returns San Francisco

Return the first section of a phone_number field where phone number values are in the format of 123-456-7890:

SPLIT(phone_number,"-",1)

SUBSTRING

SUBSTRING is a row function that returns the specified characters of a string value based on the given start and end position.

SUBSTRING(string,start,end)

Returns one value per row of type STRING.

string

Required. The name of a field or expression of type STRING (or a literal string).

start


Required. An integer that specifies where the returned characters start (inclusive), with 0 being the first character of the string. If start is greater than the number of characters, then an empty string is returned. If start is greater than end, then an empty string is returned.

end

Required. A positive integer that specifies where the returned characters end (exclusive), with the end character not being part of the return value. If end is greater than the number of characters, the whole string value (from start) is returned.

Return the first letter of the name field:

SUBSTRING(name,0,1)
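A few more illustrative cases that follow the rules above:

SUBSTRING("platfora",0,4) returns plat

SUBSTRING("platfora",4,100) returns fora (end exceeds the string length, so the rest of the string is returned)

SUBSTRING("platfora",5,2) returns an empty string (start is greater than end)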

TO_LOWER

TO_LOWER is a row function that converts all alphabetic characters in a string to lower case.

TO_LOWER(string_expression)

Returns one value per row of type STRING.

string_expression

Required. The name of a field or expression of type STRING (or a literal string).

Return the literal input string 123 Main Street in all lower case letters:

TO_LOWER("123 Main Street") returns 123 main street

TO_UPPER

TO_UPPER is a row function that converts all alphabetic characters in a string to upper case.

TO_UPPER(string_expression)

Returns one value per row of type STRING.

string_expression

Required. The name of a field or expression of type STRING (or a literal string).

Return the literal input string 123 Main Street in all upper case letters:

TO_UPPER("123 Main Street") returns 123 MAIN STREET

TRIM

TRIM is a row function that removes leading and trailing spaces from a string value.

TRIM(string_expression)

Returns one value per row of type STRING.

string_expression


Required. The name of a field or expression of type STRING (or a literal string).

Return the value of the area_code field without any leading or trailing spaces. For example, if the input string is " 650 ", then the return value would be "650":

TRIM(area_code)

Return the value of the phone_number field without any leading or trailing spaces. For example, if the input string is " 650 123-4567 ", then the return value would be "650 123-4567" (note that the extra spaces in the middle of the string are not removed, only the spaces at the beginning and end of the string):

TRIM(phone_number)

XPATH_STRING

XPATH_STRING is a row function that takes an XML-formatted string and returns the first string matching the given XPath expression.

XPATH_STRING(xml_formatted_string,"xpath_expression")

Returns one value per row of type STRING.

If the XPath expression matches more than one string in the given XML node, this function will return the first match only. To return all matches, use XPATH_STRINGS instead.

xml_formatted_string

Required. The name of a field or a literal string that contains a valid XML node (a snippet of XML consisting of a parent element and one or more child nodes).

xpath_expression

Required. An XPath expression that refers to a node, element, or attribute within the XML string passed to this expression. Any XPath expression that complies with the XML Path Language (XPath) Version 1.0 specification is valid.

These example XPATH_STRING expressions assume you have a field in your dataset named address that contains XML-formatted strings such as this:

<list> <address type="work"> <street>1300 So. El Camino Real</street1> <street>Suite 600</street2> <city>San Mateo</city> <state>CA</state> <zipcode>94403</zipcode> </address> <address type="home"> <street>123 Oakdale Street</street1> <street/> <city>San Francisco</city> <state>CA</state>

Page 342: Data Ingest Guide

Data Ingest Guide - Platfora Expressions

Page 342

<zipcode>94123</zipcode> </address></list>

Get the zipcode value from any address element where the type attribute equals home:

XPATH_STRING(address,"//address[@type='home']/zipcode")

returns: 94123

Get the city value from the second address element:

XPATH_STRING(address,"/list/address[2]/city")

returns: San Francisco

Get the values from all child elements of the first address element (as one string):

XPATH_STRING(address,"/list/address")

returns: 1300 So. El Camino RealSuite 600 San MateoCA94403

XPATH_STRINGS

XPATH_STRINGS is a row function that takes an XML-formatted string and returns a newline-separated array of strings matching the given XPath expression.

XPATH_STRINGS(xml_formatted_string,"xpath_expression")

Returns one value per row of type STRING.

If the XPath expression matches more than one string in the given XML node, this function will return all matches separated by a newline (you cannot specify a different delimiter).

xml_formatted_string

Required. The name of a field or a literal string that contains a valid XML node (a snippet of XML consisting of a parent element and one or more child nodes).

xpath_expression

Required. An XPath expression that refers to a node, element, or attribute within the XML string passed to this expression. Any XPath expression that complies with the XML Path Language (XPath) Version 1.0 specification is valid.

These example XPATH_STRINGS expressions assume you have a field in your dataset named address that contains XML-formatted strings such as this:

<list> <address type="work"> <street>1300 So. El Camino Real</street1> <street>Suite 600</street2> <city>San Mateo</city> <state>CA</state> <zipcode>94403</zipcode>

Page 343: Data Ingest Guide

Data Ingest Guide - Platfora Expressions

Page 343

</address> <address type="home"> <street>123 Oakdale Street</street1> <street/> <city>San Francisco</city> <state>CA</state> <zipcode>94123</zipcode> </address></list>

Get all zipcode values from all address elements:

XPATH_STRINGS(address,"//address/zipcode")

returns:

94123
94403

Get all street values from the first address element:

XPATH_STRINGS(address,"/list/address[1]/street")

returns:

1300 So. El Camino Real
Suite 600

Get the values from all child elements of all address elements (as one string per line):

XPATH_STRINGS(address,"/list/address")

returns:

123 Oakdale StreetSan FranciscoCA94123
1300 So. El Camino RealSuite 600 San MateoCA94403

XPATH_XML

XPATH_XML is a row function that takes an XML-formatted string and returns an XML-formatted string matching the given XPath expression.

XPATH_XML(xml_formatted_string,"xpath_expression")

Returns one value per row of type STRING in XML format.

xml_formatted_string

Required. The name of a field or a literal string that contains a valid XML node (a snippet of XML consisting of a parent element and one or more child nodes).

xpath_expression

Required. An XPath expression that refers to a node, element, or attribute within the XML string passed to this expression. Any XPath expression that complies with the XML Path Language (XPath) Version 1.0 specification is valid.


These example XPATH_XML expressions assume you have a field in your dataset named address that contains XML-formatted strings such as this:

<list> <address type="work"> <street>1300 So. El Camino Real</street1> <street>Suite 600</street2> <city>San Mateo</city> <state>CA</state> <zipcode>94403</zipcode> </address> <address type="home"> <street>123 Oakdale Street</street1> <street/> <city>San Francisco</city> <state>CA</state> <zipcode>94123</zipcode> </address></list>

Get the last address node and its child nodes in XML format:

XPATH_XML(address,"//address[last()]")

returns:

<address type="home"><street>123 Oakdale Street</street1><street/><city>San Francisco</city><state>CA</state><zipcode>94123</zipcode></address>

Get the city value from the second address node in XML format:

XPATH_XML(address,"/list/address[2]/city")

returns: <city>San Francisco</city>

Get the first address node and its child nodes in XML format:

XPATH_XML(address,"/list/address[1]")

returns:

<address type="work"><street>1300 So. El Camino Real</street1><street>Suite 600</street2><city>San Mateo</city><state>CA</state><zipcode>94403</zipcode></address>


URL Functions

URL functions allow you to extract different portions of a URL string, and decode text that is URL-encoded.

URL_AUTHORITY

URL_AUTHORITY is a row function that returns the authority portion of a URL string. The authority portion of a URL is the part that contains the information on how to locate and connect to the server.

URL_AUTHORITY(string)

Returns the authority portion of a URL as a STRING value, or NULL if the input string is not a valid URL.

For example, in the string http://www.platfora.com/company/contact.html, the authority portion is www.platfora.com.

In the string http://user:password@mycompany.com:8012/mypage.html, the authority portion is user:password@mycompany.com:8012.

In the string mailto:someone@example.com?subject=Topic, the authority portion is NULL.

string

Required. A field or expression that returns a STRING value in URI (uniform resource identifier) format of: protocol:authority[/path][?query][#fragment].

The authority portion of the URL contains the host information, which can be specified as a domain name (www.platfora.com), a host name (localhost), or an IP address (127.0.0.1). The host information can be preceded by optional user information terminated with @ (for example, username:password@www.platfora.com), and followed by an optional port number preceded by a colon (for example, localhost:8001).

Return the authority portion of URL string values in the referrer field:

URL_AUTHORITY(referrer)

Return the authority portion of a literal URL string:

URL_AUTHORITY("http://user:[email protected]:8012/mypage.html")returns user:[email protected]:8012

URL_FRAGMENT

URL_FRAGMENT is a row function that returns the fragment portion of a URL string.

URL_FRAGMENT(string)

Returns the fragment portion of a URL as a STRING value, NULL if the URL does not contain a fragment, or NULL if the input string is not a valid URL.


For example, in the string http://www.platfora.com/contact.html#phone, the fragment portion is phone.

In the string http://www.platfora.com/contact.html, the fragment portion is NULL.

In the string http://platfora.com/news.php?topic=press#Platfora%20News, the fragment portion is Platfora%20News.

string

Required. A field or expression that returns a STRING value in URI (uniform resource identifier) format of: protocol:authority[/path][?query][#fragment].

The optional fragment portion of the URL is separated by a hash mark (#) and provides direction to a secondary resource, such as a heading or anchor identifier.

Return the fragment portion of URL string values in the request field:

URL_FRAGMENT(request)

Return the fragment portion of a literal URL string:

URL_FRAGMENT("http://platfora.com/news.php?topic=press#Platfora%20News")returns Platfora%20News

Return and decode the fragment portion of a literal URL string:

URLDECODE(URL_FRAGMENT("http://platfora.com/news.php?topic=press#Platfora%20News")) returns Platfora News

URL_HOST

URL_HOST is a row function that returns the host, domain, or IP address portion of a URL string.

URL_HOST(string)

Returns the host portion of a URL as a STRING value, or NULL if the input string is not a valid URL.

For example, in the string http://www.platfora.com/company/contact.html, the host portion is www.platfora.com.

In the string http://admin:password@127.0.0.1:8001/index.html, the host portion is 127.0.0.1.

In the string mailto:someone@example.com?subject=Topic, the host portion is NULL.

string

Required. A field or expression that returns a STRING value in URI (uniform resource identifier) format of: protocol:authority[/path][?query][#fragment].

The authority portion of the URL contains the host information, which can be specified as a domain name (www.platfora.com), a host name (localhost), or an IP address (127.0.0.1).


Return the host portion of URL string values in the referrer field:

URL_HOST(referrer)

Return the host portion of a literal URL string:

URL_HOST("http://user:[email protected]:8012/mypage.html") returnsmycompany.com

URL_PATH

URL_PATH is a row function that returns the path portion of a URL string.

URL_PATH(string)

Returns the path portion of a URL as a STRING value, NULL if the URL does not contain a path, or NULL if the input string is not a valid URL.

For example, in the string http://www.platfora.com/company/contact.html, the path portion is /company/contact.html.

In the string http://admin:password@127.0.0.1:8001/index.html, the path portion is /index.html.

In the string mailto:someone@example.com?subject=Topic, the path portion is someone@example.com.

string

Required. A field or expression that returns a STRING value in URI (uniform resource identifier) format of: protocol:authority[/path][?query][#fragment].

The optional path portion of the URL is a sequence of resource location segments separated by a forward slash (/), conceptually similar to a directory path.

Return the path portion of URL string values in the request field:

URL_PATH(request)

Return the path portion of a literal URL string:

URL_PATH("http://platfora.com/company/contact.html") returns /company/contact.html

URL_PORT

URL_PORT is a row function that returns the port portion of a URL string.

URL_PORT(string)

Returns the port portion of a URL as an INTEGER value. If the URL does not specify a port, then returns -1. If the input string is not a valid URL, returns NULL.


For example, in the string http://localhost:8001, the port portion is 8001.

string

Required. A field or expression that returns a STRING value in URI (uniform resource identifier) format of: protocol:authority[/path][?query][#fragment].

The authority portion of the URL contains the host information, which can be specified as a domain name (www.platfora.com), a host name (localhost), or an IP address (127.0.0.1). The host information can be followed by an optional port number preceded by a colon (for example, localhost:8001).

Return the port portion of URL string values in the referrer field:

URL_PORT(referrer)

Return the port portion of a literal URL string:

URL_PORT("http://user:[email protected]:8012/mypage.html") returns8012

URL_PROTOCOL

URL_PROTOCOL is a row function that returns the protocol (or URI scheme name) portion of a URL string.

URL_PROTOCOL(string)

Returns the protocol portion of a URL as a STRING value, or NULL if the input string is not a valid URL.

For example, in the string http://www.platfora.com, the protocol portion is http.

In the string ftp://ftp.platfora.com/articles/platfora.pdf, the protocol portion is ftp.

string

Required. A field or expression that returns a STRING value in URI (uniform resource identifier) format of: protocol:authority[/path][?query][#fragment].

The protocol portion of a URL consists of a sequence of characters beginning with a letter and followed by any combination of letter, number, plus (+), period (.), or hyphen (-) characters, followed by a colon (:). For example: http:, ftp:, mailto:

Return the protocol portion of URL string values in the referrer field:

URL_PROTOCOL(referrer)

Return the protocol portion of the literal URL string:

URL_PROTOCOL("http://www.platfora.com") returns http


URL_QUERY

URL_QUERY is a row function that returns the query portion of a URL string.

URL_QUERY(string)

Returns the query portion of a URL as a STRING value, NULL if the URL does not contain a query, or NULL if the input string is not a valid URL.

For example, in the string http://www.platfora.com/contact.html, the query portion is NULL.

In the string http://platfora.com/news.php?topic=press&timeframe=today#Platfora%20News, the query portion is topic=press&timeframe=today.

In the string mailto:someone@example.com?subject=Topic, the query portion is subject=Topic.

string

Required. A field or expression that returns a STRING value in URI (uniform resource identifier) format of: protocol:authority[/path][?query][#fragment].

The optional query portion of the URL is separated by a question mark (?) and typically contains an unordered list of key=value pairs separated by an ampersand (&) or semicolon (;).

Return the query portion of URL string values in the request field:

URL_QUERY(request)

Return the query portion of a literal URL string:

URL_QUERY("http://platfora.com/news.php?topic=press&timeframe=today")returns topic=press&timeframe=today

URLDECODE

URLDECODE is a row function that decodes a string that has been encoded with the application/x-www-form-urlencoded media type. URL encoding, also known as percent-encoding, is a mechanism for encoding information in a Uniform Resource Identifier (URI). When sent in an HTTP GET request, application/x-www-form-urlencoded data is included in the query component of the request URI. When sent in an HTTP POST request, the data is placed in the body of the message, and the name of the media type is included in the message Content-Type header.

URLDECODE(string)

Returns a value of type STRING with characters decoded as follows:

• Alphanumeric characters (a-z, A-Z, 0-9) remain unchanged.

• The special characters hyphen (-), comma (,), underscore (_), period (.), and asterisk (*) remain unchanged.


• The plus sign (+) character is converted to a space character.

• The percent character (%) is interpreted as the start of a special escaped sequence, where in the sequence %HH, HH represents the hexadecimal value of the byte. For example, some common escape sequences are:

percent encoding sequence value

%20 space

%0A or %0D or %0D%0A newline

%22 double quote (")

%25 percent (%)

%2D hyphen (-)

%2E period (.)

%3C less than (<)

%3D greater than (>)

%5C backslash (\)

%7C pipe (|)

string

Required. A field or expression that returns a STRING value. It is assumed that all characters in the input string are one of the following: lower-case letters (a-z), upper-case letters (A-Z), numeric digits (0-9), or the hyphen (-), comma (,), underscore (_), period (.) or asterisk (*) character. The percent character (%) is allowed, but is interpreted as the start of a special escaped sequence. The plus character (+) is allowed, but is interpreted as a space character.

Decode the values of the url_query field:

URLDECODE(url_query)

Convert a literal URL encoded string (N%2FA%20or%20%22not%20applicable%22) to a human-readable value (N/A or "not applicable"):

URLDECODE("N%2FA%20or%20%22not%20applicable%22") returns N/A or "notapplicable"

IP Address Functions

IP address functions allow you to manipulate and transform STRING data consisting of IP address values.


CIDR_MATCH

CIDR_MATCH is a row function that compares two STRING arguments representing a CIDR mask and an IP address, and returns 1 if the IP address falls within the specified subnet mask or 0 if it does not.

CIDR_MATCH(CIDR_string, IP_string)

Returns an INTEGER value of 1 if the IP address falls within the subnet indicated by the CIDR mask and 0 if it does not.

CIDR_string

Required. A field or expression that returns a STRING value containing either an IPv4 or IPv6 CIDR mask (Classless Inter-Domain Routing subnet notation). An IPv4 CIDR mask can only successfully match IPv4 addresses, and an IPv6 CIDR mask can only successfully match IPv6 addresses.

IP_string

Required. A field or expression that returns a STRING value containing either an IPv4 or IPv6 internet protocol (IP) address.

Compare an IPv4 CIDR subnet mask to an IPv4 IP address:

CIDR_MATCH("60.145.56.0/24","60.145.56.246") returns 1

CIDR_MATCH("60.145.56.0/30","60.145.56.246") returns 0

Compare an IPv6 CIDR subnet mask to an IPv6 IP address:

CIDR_MATCH("fe80::/70","FE80::0202:B3FF:FE1E:8329") returns 1

CIDR_MATCH("fe80::/72","FE80::0202:B3FF:FE1E:8329") returns 0

HEX_TO_IP

HEX_TO_IP is a row function that converts a hexadecimal-encoded STRING to a text representation of an IP address.

HEX_TO_IP(string)

Returns a value of type STRING representing either an IPv4 or IPv6 address. The type of IP address returned depends on the input string. An 8-character hexadecimal string will return an IPv4 address. A 32-character hexadecimal string will return an IPv6 address. IPv6 addresses are represented in full length, without removing any leading zeros and without using the compressed :: notation. For example, 2001:0db8:0000:0000:0000:ff00:0042:8329 rather than 2001:db8::ff00:42:8329. Input strings that do not contain either 8 or 32 valid hexadecimal characters will return NULL.

string


Required. A field or expression that returns a hexadecimal-encoded STRING value. The hexadecimal string must be either 8 characters long (in which case it is converted to an IPv4 address) or 32 characters long (in which case it is converted to an IPv6 address).

Return a plain text IP address for each hexadecimal-encoded string value in the byte_encoded_ips column:

HEX_TO_IP(byte_encoded_ips)

Convert an 8-character hexadecimal-encoded string to a plain text IPv4 address:

HEX_TO_IP("AB20FE01") returns 171.32.254.1

Convert a 32-character hexadecimal-encoded string to a plain text IPv6 address:

HEX_TO_IP("FE800000000000000202B3FFFE1E8329") returns fe80:0000:0000:0000:0202:b3ff:fe1e:8329

Date and Time Functions

Date and time functions allow you to manipulate and transform datetime values, such as calculating time differences between two datetime values, or extracting a portion of a datetime value.

DAYS_BETWEEN

DAYS_BETWEEN is a row function that calculates the whole number of days (ignoring time) between two DATETIME values (value1-value2).

DAYS_BETWEEN(datetime_1,datetime_2)

Returns one value per row of type INTEGER.

datetime_1

Required. A field or expression of type DATETIME.

datetime_2

Required. A field or expression of type DATETIME.

Calculate the number of days to ship a product by subtracting the value of the order_date field from the ship_date field:

DAYS_BETWEEN(ship_date,order_date)

Calculate the number of days since a product's release by subtracting the value of the release_date field in the product dataset from the current date (the result of the NOW() expression):

DAYS_BETWEEN(NOW(),product.release_date)

DATE_ADD

DATE_ADD is a row function that adds the specified time interval to a DATETIME value.


DATE_ADD(datetime,quantity,"interval")

Returns a value of type DATETIME.

datetime

Required. A field name or expression that returns a DATETIME value.

quantity

Required. An integer value. To add time intervals, use a positive integer. To subtract time intervals, use a negative integer.

interval

Required. One of the following time intervals:

• millisecond - Adds the specified number of milliseconds to a datetime value.

• second - Adds the specified number of seconds to a datetime value.

• minute - Adds the specified number of minutes to a datetime value.

• hour - Adds the specified number of hours to a datetime value.

• day - Adds the specified number of days to a datetime value.

• week - Adds the specified number of weeks to a datetime value.

• month - Adds the specified number of months to a datetime value.

• quarter - Adds the specified number of quarters to a datetime value.

• year - Adds the specified number of years to a datetime value.

• weekyear - Adds the specified number of weekyears to a datetime value.

Add 45 days to the value of the invoice_date field to calculate the date a payment is due:

DATE_ADD(invoice_date,45,"day")

HOURS_BETWEEN

HOURS_BETWEEN is a row function that calculates the whole number of hours (ignoring minutes, seconds, and milliseconds) between two DATETIME values (value1-value2).

HOURS_BETWEEN(datetime_1,datetime_2)

Returns one value per row of type INTEGER.

datetime_1

Required. A field or expression of type DATETIME.

datetime_2

Required. A field or expression of type DATETIME.

Calculate the number of hours to ship a product by subtracting the value of the order_date field from the ship_date field:

HOURS_BETWEEN(ship_date,order_date)


Calculate the number of hours since an advertisement was viewed by subtracting the value of the adview_timestamp field in the impressions dataset from the current date and time (the result of the NOW() expression):

HOURS_BETWEEN(NOW(),impressions.adview_timestamp)

EXTRACT

EXTRACT is a row function that returns the specified portion of a DATETIME value.

EXTRACT("extract_value",datetime)

Returns the specified extracted value as type INTEGER. EXTRACT removes leading zeros. For example, the month of April returns a value of 4, not 04.

extract_value

Required. One of the following extract values:

• millisecond - Returns the millisecond portion of a datetime value. For example, an input datetime value of 2012-08-15 20:38:40.213 would return an integer value of 213.

• second - Returns the second portion of a datetime value. For example, an input datetime value of 2012-08-15 20:38:40.213 would return an integer value of 40.

• minute - Returns the minute portion of a datetime value. For example, an input datetime value of 2012-08-15 20:38:40.213 would return an integer value of 38.

• hour - Returns the hour portion of a datetime value. For example, an input datetime value of 2012-08-15 20:38:40.213 would return an integer value of 20.

• day - Returns the day portion of a datetime value. For example, an input datetime value of 2012-08-15 would return an integer value of 15.

• week - Returns the ISO week number for the input datetime value. For example, an input datetime value of 2012-01-02 would return an integer value of 1 (the first ISO week of 2012 starts on Monday January 2). An input datetime value of 2012-01-01 would return an integer value of 52 (January 1, 2012 is part of the last ISO week of 2011).

• month - Returns the month portion of a datetime value. For example, an input datetime value of 2012-08-15 would return an integer value of 8.

• quarter - Returns the quarter number for the input datetime value, where quarters start on January 1, April 1, July 1, or October 1. For example, an input datetime value of 2012-08-15 would return an integer value of 3.

• year - Returns the year portion of a datetime value. For example, an input datetime value of 2012-01-01 would return an integer value of 2012.

• weekyear - Returns the year value that corresponds to the ISO week number of the input datetime value. For example, an input datetime value of 2012-01-02 would return an integer value of 2012 (the first ISO week of 2012 starts on Monday January 2). An input datetime value of 2012-01-01 would return an integer value of 2011 (January 1, 2012 is part of the last ISO week of 2011).

datetime

Required. A field name or expression that returns a DATETIME value.


Extract the hour portion from the order_date datetime field:

EXTRACT("hour",order_date)

Cast the value of the order_date string field to a datetime value using TO_DATE, and extract the ISO week year:

EXTRACT("weekyear",TO_DATE(order_date,"MM/dd/yyyy HH:mm:ss"))

MILLISECONDS_BETWEEN

MILLISECONDS_BETWEEN is a row function that calculates the whole number of milliseconds between two DATETIME values (value1-value2).

MILLISECONDS_BETWEEN(datetime_1,datetime_2)

Returns one value per row of type INTEGER.

datetime_1

Required. A field or expression of type DATETIME.

datetime_2

Required. A field or expression of type DATETIME.

Calculate the number of milliseconds it took to serve a web page by subtracting the value of the request_timestamp field from the response_timestamp field:

MILLISECONDS_BETWEEN(response_timestamp,request_timestamp)

MINUTES_BETWEEN

MINUTES_BETWEEN is a row function that calculates the whole number of minutes (ignoring seconds and milliseconds) between two DATETIME values (value1-value2).

MINUTES_BETWEEN(datetime_1,datetime_2)

Returns one value per row of type INTEGER.

datetime_1

Required. A field or expression of type DATETIME.

datetime_2

Required. A field or expression of type DATETIME.

Calculate the number of minutes it took for a user to click on an advertisement by subtracting the value of the impression_timestamp field from the conversion_timestamp field:

MINUTES_BETWEEN(impression_timestamp,conversion_timestamp)

Calculate the number of minutes since a user last logged in by subtracting the login_timestamp field in the weblogs dataset from the current date and time (the result of NOW()):


MINUTES_BETWEEN(NOW(),weblogs.login_timestamp)

NOW

NOW is a scalar function that returns the current system date and time as a DATETIME value. It can be used in other expressions involving DATETIME type fields, such as YEAR_DIFF, DAYS_BETWEEN, or MINUTES_BETWEEN. Note that the value of NOW is only evaluated at the time a lens is built (it is not re-evaluated with each query).

NOW()

Returns the current system date and time as a DATETIME value.

Calculate a user's age using YEAR_DIFF to subtract the value of the birthdate field in the users dataset from the current date:

YEAR_DIFF(NOW(),users.birthdate)

Calculate the number of days since a product's release using DAYS_BETWEEN to subtract the value of the release_date field from the current date:

DAYS_BETWEEN(NOW(),release_date)
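
NOW also composes with the other datetime functions documented here. For instance, a small sketch using TRUNC (described later in this section) to drop the time-of-day portion of the current timestamp:

TRUNC(NOW(),"day")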

SECONDS_BETWEEN

SECONDS_BETWEEN is a row function that calculates the whole number of seconds (ignoring milliseconds) between two DATETIME values (value1-value2).

SECONDS_BETWEEN(datetime_1,datetime_2)

Returns one value per row of type INTEGER.

datetime_1

Required. A field or expression of type DATETIME.

datetime_2

Required. A field or expression of type DATETIME.

Calculate the number of seconds it took for a user to click on an advertisement by subtracting the value of the impression_timestamp field from the conversion_timestamp field:

SECONDS_BETWEEN(impression_timestamp,conversion_timestamp)

Calculate the number of seconds since a user last logged in by subtracting the login_timestamp field in the weblogs dataset from the current date and time (the result of NOW()):

SECONDS_BETWEEN(NOW(),weblogs.login_timestamp)

TRUNC

TRUNC is a row function that truncates a DATETIME value to the specified format.

TRUNC(datetime,"format")


Returns a value of type DATETIME truncated to the specified format.

datetime

Required. A field or expression that returns a DATETIME value.

format

Required. One of the following format values:

• millisecond - Returns a datetime value truncated to millisecond granularity. Has no effect since millisecond is already the most granular format for datetime values. For example, an input datetime value of 2012-08-15 20:38:40.213 would return a datetime value of 2012-08-15 20:38:40.213.

• second - Returns a datetime value truncated to second granularity. For example, an input datetime value of 2012-08-15 20:38:40.213 would return a datetime value of 2012-08-15 20:38:40.000.

• minute - Returns a datetime value truncated to minute granularity. For example, an input datetime value of 2012-08-15 20:38:40.213 would return a datetime value of 2012-08-15 20:38:00.000.

• hour - Returns a datetime value truncated to hour granularity. For example, an input datetime value of 2012-08-15 20:38:40.213 would return a datetime value of 2012-08-15 20:00:00.000.

• day - Returns a datetime value truncated to day granularity. For example, an input datetime value of 2012-08-15 20:38:40.213 would return a datetime value of 2012-08-15 00:00:00.000.

• week - Returns a datetime value truncated to the first day of the week (starting on a Monday). For example, an input datetime value of 2012-08-15 (a Wednesday) would return a datetime value of 2012-08-13 (the Monday prior).

• month - Returns a datetime value truncated to the first day of the month. For example, an input datetime value of 2012-08-15 would return a datetime value of 2012-08-01.

• quarter - Returns a datetime value truncated to the first day of the quarter (January 1, April 1, July 1, or October 1). For example, an input datetime value of 2012-08-15 20:38:40.213 would return a datetime value of 2012-07-01.

• year - Returns a datetime value truncated to the first day of the year (January 1). For example, an input datetime value of 2012-08-15 would return a datetime value of 2012-01-01.

• weekyear - Returns a datetime value truncated to the first day of the ISO week year (the ISO week starting with the Monday which is nearest in time to January 1). For example, an input datetime value of 2008-08-15 would return a datetime value of 2007-12-31. The first day of the ISO week year for 2008 is December 31, 2007 (the prior Monday closest to January 1).

Truncate the order_date datetime field to day granularity:

TRUNC(order_date,"day")

Cast the value of the order_date string field to a datetime value using TO_DATE, and truncate it to day granularity:

TRUNC(TO_DATE(order_date,"MM/dd/yyyy HH:mm:ss"),"day")


YEAR_DIFF

YEAR_DIFF is a row function that calculates the fractional number of years between two DATETIME values (value1-value2).

YEAR_DIFF(datetime_1,datetime_2)

Returns one value per row of type DOUBLE.

datetime_1

Required. A field or expression of type DATETIME.

datetime_2

Required. A field or expression of type DATETIME.

Calculate the number of years a user has been a customer by subtracting the value of the registration_date field from the current date (the result of NOW()):

YEAR_DIFF(NOW(),registration_date)

Calculate a user's age by subtracting the value of the birthdate field in the users dataset from the current date (the result of NOW()):

YEAR_DIFF(NOW(),users.birthdate)

Math Functions

Math functions allow you to perform basic math calculations on numeric values. You can also use arithmetic operators to perform simple math calculations.

DIV

DIV is a row function that divides two LONG values and returns a quotient value of type LONG (the result is truncated to 0 decimal places).

DIV(dividend,divisor)

Returns one value per row of type LONG.

dividend

Required. A field or expression of type LONG.

divisor

Required. A field or expression of type LONG.

Cast the value of the file_size field to LONG and divide by 1024:

DIV(TO_LONG(file_size),1024)
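
Because the quotient is truncated rather than rounded, any fractional part is simply dropped. A minimal illustration with literal values (assuming literals are accepted where LONG arguments are expected):

DIV(TO_LONG(7),TO_LONG(2)) returns 3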


EXP

EXP is a row function that raises the mathematical constant e to the power (exponent) of a numeric value and returns a value of type DOUBLE.

EXP(power)

Returns one value per row of type DOUBLE.

power

Required. A field or expression of a numeric type.

Raise e to the power in the Value field.

EXP(Value)

When the Value field value is 2.0, the result is equal to 7.3890 when truncated to four decimal places.

FLOOR

FLOOR is a row function that returns the largest integer that is less than or equal to the input argument.

FLOOR(double)

Returns one value per row of type DOUBLE.

double

Required. A field or expression of type DOUBLE.

Return the floor value of 32.6789:

FLOOR(32.6789) returns 32.0
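
For negative inputs, the largest integer less than or equal to the input is further from zero, so by the definition above:

FLOOR(-32.6789) returns -33.0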

HASH

HASH is a row function that evenly partitions data values into the specified number of buckets. It creates a hash of the input value and assigns that value a bucket number. Equal values will always hash to the same bucket number.

HASH(field_name,integer)

Returns one value per row of type INTEGER corresponding to the bucket number that the input value hashes to.

field_name

Required. The name of the field whose values you want to partition. When this value is NULL and the integer parameter is a value other than zero or NULL, the function returns zero; otherwise, it returns NULL.

integer


Required. The desired number of buckets. This parameter can be a numeric value of any data type, but when it is a non-integer value, Platfora truncates the value to an integer. When the value is zero or NULL, the function returns NULL. When the value is negative, the function uses its absolute value.

Partition the values of the username field into 20 buckets:

HASH(username,20)
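
Because equal values always hash to the same bucket, HASH also gives a stable way to sample data: the rows whose values land in one chosen bucket form a repeatable subset of roughly 1/N of the distinct values. A minimal sketch using a CASE expression (the "sampled"/"other" labels are illustrative; assumes bucket numbers include 1):

CASE WHEN HASH(username,20)=1 THEN "sampled" ELSE "other" END

This keeps roughly 1 in 20 distinct username values in the "sampled" group.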

LN

LN is a row function that returns the natural logarithm of a number. The natural logarithm is the logarithm to the base e, where e (Euler's number) is a mathematical constant approximately equal to 2.718281828. The natural logarithm of a number x is the power to which the constant e must be raised in order to equal x.

LN(positive_number)

Returns the exponent to which base e must be raised to obtain the input value, where e denotes the constant number 2.718281828. The return value is the same data type as the input value.

For example, LN(7.389) is 2, because e to the power of 2 is approximately 7.389.

positive_number

Required. A field or expression that returns a number greater than 0. Inputs can be of type INTEGER, LONG, DOUBLE, or FIXED.

Return the natural logarithm of base number e, which is approximately 2.718281828:

LN(2.718281828) returns 1

LN(3.0000) returns 1.098612

LN(300.0000) returns 5.703782

MOD

MOD is a row function that divides two LONG values and returns the remainder value of type LONG (the result is truncated to 0 decimal places).

MOD(dividend,divisor)

Returns one value per row of type LONG.

dividend

Required. A field or expression of type LONG.

divisor

Required. A field or expression of type LONG.

Cast the value of the file_size field to LONG and return the remainder of dividing it by 1024:

MOD(TO_LONG(file_size),1024)
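
MOD is the counterpart of DIV: DIV returns the whole quotient and MOD the remainder, so together they decompose a division. A minimal illustration with literal values (same assumption as the DIV example):

MOD(TO_LONG(7),TO_LONG(2)) returns 1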


POW

POW is a row function that raises a numeric value to the power (exponent) of another numeric value and returns a value of type DOUBLE.

POW(index,power)

Returns one value per row of type DOUBLE.

index

Required. A field or expression of a numeric type.

power

Required. A field or expression of a numeric type.

Calculate the compound annual growth rate (CAGR) percentage for a given investment over a five-year span (the exponent 0.2 corresponds to 1/5).

100 * (POW(end_value/start_value, 0.2) - 1)

Calculate the square of the Value field.

POW(Value,2)

Calculate the square root of the Value field.

POW(Value,0.5)

The following expression returns 1.

POW(0,0)

ROUND

ROUND is a row function that rounds a DOUBLE value to the specified number of decimal places.

ROUND(double,number_decimal_places)

Returns one value per row of type DOUBLE.

double

Required. A field or expression of type DOUBLE.

number_decimal_places

Required. An integer that specifies the number of decimal places to round to.

Round the number 32.4678954 to two decimal places:

ROUND(32.4678954,2) returns 32.47


Data Type Conversion Functions

Data type conversion functions allow you to cast data values from one data type to another. These functions are used implicitly whenever you set the data type of a field or column in the Platfora user interface. The supported data types are: INTEGER, LONG, DOUBLE, FIXED, DATETIME, and STRING.

EPOCH_MS_TO_DATE

EPOCH_MS_TO_DATE is a row function that converts LONG values to DATETIME values, where the input number represents the number of milliseconds since the epoch.

EPOCH_MS_TO_DATE(long_expression)

Returns one value per row of type DATETIME in UTC format yyyy-MM-dd HH:mm:ss:SSS Z.

long_expression

Required. A field or expression of type LONG representing the number of milliseconds since the epoch datetime (January 1, 1970 00:00:00:000 GMT).

Convert a number representing the number of milliseconds from the epoch to a human-readable date and time:

EPOCH_MS_TO_DATE(1360260240000) returns 2013-02-07T18:04:00:000Z or February 7, 2013 18:04:00:000 GMT

Or if your data is in seconds instead of milliseconds:

EPOCH_MS_TO_DATE(1360260240 * 1000) returns 2013-02-07T18:04:00:000Z or February 7, 2013 18:04:00:000 GMT

TO_CURRENCY

This function is deprecated. Use the TO_FIXED function instead.

TO_DATE

TO_DATE is a row function that converts STRING values to DATETIME values, and specifies the format of the date and time elements in the string.

TO_DATE(string_expression,"date_format")

Returns one value per row of type DATETIME (which by definition is in UTC).

string_expression

Required. A field or expression of type STRING.

date_format

Required. A pattern that describes how the date is formatted.

Use the following pattern symbols to define your date format. The count and ordering of the pattern letters determines the datetime format. Any characters in the pattern that are not in the ranges of a-z and A-Z are treated as quoted delimiter text. For instance, characters such as slash (/) or colon (:) will appear in the resulting output even if they are not escaped with single quotes.

Table 2: Date Pattern Symbols

Symbol | Meaning | Presentation | Examples | Notes
------ | ------- | ------------ | -------- | -----
G | era | text | AD |
C | century of era (0 or greater) | number | 20 |
Y | year of era (0 or greater) | year | 1996 |
x | week year | year | 1996 | Numeric presentation for year and week year fields are handled specially. For example, if the count of 'y' is 2, the year will be displayed as the zero-based year of the century, which is two digits.
w | week number of week year | number | 27 |
e | day of week (number) | number | 2 |
E | day of week (name) | text | Tuesday; Tue | If the number of pattern letters is 4 or more, the full form is used; otherwise a short or abbreviated form is used.
y | year | year | 1996 |
D | day of year | number | 189 |
M | month of year | month | July; Jul; 07 | 3 or more pattern letters use text, otherwise a number is used.
d | day of month | number | 10 | If the number of pattern letters is 3 or more, the text form is used; otherwise the number is used.
a | half day of day | text | PM |
K | hour of half day (0-11) | number | 0 |
h | clock hour of half day (1-12) | number | 12 |
H | hour of day (0-23) | number | 0 |
k | clock hour of day (1-24) | number | 24 |
m | minute of hour | number | 30 |
s | second of minute | number | 55 |
S | fraction of second | number | 978 |
z | time zone | text | Pacific Standard Time; PST | If the number of pattern letters is 4 or more, the full form is used; otherwise a short or abbreviated form is used.
Z | time zone offset/id | zone | -0800; -08:00; America/Los_Angeles | 'Z' outputs offset without a colon, 'ZZ' outputs the offset with a colon, 'ZZZ' or more outputs the zone id.
' | escape character for text-based delimiters | delimiter | |
'' | literal representation of a single quote | literal | ' |

Define a new DATETIME computed field based on the order_date base field, which contains timestamps in the format of: 2014.07.10 at 15:08:56 PDT:

TO_DATE(order_date,"yyyy.MM.dd 'at' HH:mm:ss z")

Define a new DATETIME computed field by first combining individual month, day, year, and depart_time fields (using CONCAT), and performing a transformation on depart_time to make sure three-digit times are converted to four-digit times (using REGEX_REPLACE):

TO_DATE(CONCAT(month,"/",day,"/",year,":",REGEX_REPLACE(depart_time,"\b(\d{3})\b","0$1")),"MM/dd/yyyy:HHmm")

Define a new DATETIME computed field based on the created_at base field, which contains timestamps in the format of: Sat Jan 25 16:35:23 +0800 2014 (this is the timestamp format returned by Twitter's API):

TO_DATE(created_at,"EEE MMM dd HH:mm:ss Z yyyy")

TO_DOUBLE

TO_DOUBLE is a row function that converts STRING, INTEGER, LONG, or DOUBLE values to DOUBLE (decimal) values.

TO_DOUBLE(expression)

Returns one value per row of type DOUBLE.

expression

Required. A field or expression of type STRING (must be numeric characters), INTEGER, LONG, or DOUBLE.

Convert the values of the average_rating field to a double data type:

TO_DOUBLE(average_rating)

Convert the average_rating field to a double data type, but first transform the occurrence of any N/A values to NULL values using a CASE expression:


TO_DOUBLE(CASE WHEN average_rating="N/A" then NULL ELSE average_rating END)

TO_FIXED

TO_FIXED is a row function that converts STRING, INTEGER, LONG, or DOUBLE values to fixed-decimal values. Using a FIXED data type to represent monetary values allows you to calculate and aggregate monetary values with accuracy to a ten-thousandth of a monetary unit.

TO_FIXED(expression)

Returns one value per row of type FIXED (fixed-decimal value to 10000th accuracy).

expression

Required. A field or expression of type STRING (must be numeric characters), INTEGER, LONG, orDOUBLE.

Convert the opening_price field to a fixed decimal data type:

TO_FIXED(opening_price)

Convert the sale_price field to a fixed decimal data type, but first transform the occurrence of any N/A string values to NULL values using a CASE expression:

TO_FIXED(CASE WHEN sale_price="N/A" then NULL ELSE sale_price END)

TO_INT

TO_INT is a row function that converts STRING, INTEGER, LONG, or DOUBLE values to INTEGER (whole number) values. When converting DOUBLE values, everything after the decimal will be truncated (not rounded up or down).

TO_INT(expression)

Returns one value per row of type INTEGER.

expression

Required. A field or expression of type STRING, INTEGER, LONG, or DOUBLE. If a STRING field contains non-numeric characters, the function returns NULL, which Platfora converts to the default value in a lens (by default, the default value for INTEGER fields is 0).

Convert the values of the average_rating field to an integer data type:

TO_INT(average_rating)

Convert the flight_duration field to an integer data type, but first transform the occurrence of any N/A values to NULL values using a CASE expression:

TO_INT(CASE WHEN flight_duration="N/A" then NULL ELSE flight_duration END)
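
Because DOUBLE inputs are truncated rather than rounded, the fractional part is simply dropped. For example:

TO_INT(3.9) returns 3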


TO_LONG

TO_LONG is a row function that converts STRING, INTEGER, LONG, or DOUBLE values to LONG (whole number) values. When converting DOUBLE values, everything after the decimal will be truncated (not rounded up or down).

TO_LONG(expression)

Returns one value per row of type LONG.

expression

Required. A field or expression of type STRING (must be numeric characters only, no period or comma), INTEGER, LONG, or DOUBLE. When a STRING field value includes a decimal, the function returns a NULL value.

Convert the values of the average_rating field to a long data type:

TO_LONG(average_rating)

Convert the average_rating field to a long data type, but first transform the occurrence of any N/A values to NULL values using a CASE expression:

TO_LONG(CASE WHEN average_rating="N/A" then NULL ELSE average_rating END)

TO_STRING

TO_STRING is a row function that converts values of other data types to STRING (character) values.

TO_STRING(expression)

TO_STRING(datetime_expression,date_format)

Returns one value per row of type STRING.

expression

A field or expression of type FIXED, STRING, INTEGER, LONG, or DOUBLE.

datetime_expression

A field or expression of type DATETIME.

date_format

If converting a DATETIME to a string, a pattern that describes how the date is formatted. See TO_DATE for the date format patterns.

Convert the values of the sku_number field to a string data type:

TO_STRING(sku_number)

Convert values in the age column into range-based groupings (binning), and cast the output values to a STRING:


TO_STRING(CASE WHEN age <= 25 THEN "0-25" WHEN age <= 50 THEN "26-50" ELSE "over 50" END)

Convert the values of a timestamp datetime field to a string, where the timestamp values are in the format of: 2002.07.10 at 15:08:56 PDT:

TO_STRING(timestamp,"yyyy.MM.dd 'at' HH:mm:ss z")

Aggregate Functions

An aggregate function groups the values of multiple rows together based on some defined input expression. Aggregate functions return one value for a group of rows, and are only valid for defining measures in Platfora. Aggregate functions cannot be combined with row functions.

AVG

AVG is an aggregate function that returns the average of all valid numeric values. It sums all values in the provided expression and divides by the number of valid (NOT NULL) rows. If you want to compute an average that includes all values in the row count (including NULL values), you can use a SUM/COUNT expression instead.

AVG(numeric_field)

Returns a value of type DOUBLE.

numeric_field

Required. A field of type INTEGER, LONG, DOUBLE, or FIXED. Unlike row functions, aggregate functions can only take field names as input.

Get the average of the valid sale_amount field values:

AVG(sale_amount)

Get the average of the valid net_worth field values in the billionaires dataset, which resides in the samples namespace:

AVG([(samples) billionaires].net_worth)

Get the average of all page_views field values in the web_logs dataset (including NULL values):

SUM(page_views)/COUNT(web_logs)

COUNT

COUNT is an aggregate function that returns the number of rows in a dataset.

COUNT([namespace_name]dataset_name)

Returns a value of type INTEGER.

namespace_name


Optional. The name of the namespace in which the dataset resides. If not specified, uses the default namespace.

dataset_name

Required. The name of the dataset for which to obtain a count of rows. If you want to count rows of a down-stream dataset that is related to the current dataset, you can specify the hierarchy of dataset names in the format of: parent_dataset_name.child_dataset_name.[...]

Count the rows in the sales dataset:

COUNT(sales)

Count the rows in the billionaires dataset, which resides in the samples namespace:

COUNT([(samples) billionaires])

Count the rows in the customer dataset, which is a related dataset down-stream of sales:

COUNT(sales.customers)

COUNT_VALID

COUNT_VALID is an aggregate function that returns the number of rows for which the given expression is valid (excludes NULL values).

COUNT_VALID(field)

Returns a numeric value of type INTEGER.

field

Required. A field name. Unlike row functions, aggregate functions can only take field names as input.

Count the valid values in the page_views field:

COUNT_VALID(page_views)

DISTINCT

DISTINCT is an aggregate function that returns the number of distinct values for the given expression.

DISTINCT(field)

Returns a numeric value of type INTEGER.

field

Required. A field name. Unlike row functions, aggregate functions can only take field names as input.

Count the unique values of the user_id field in the currently selected dataset:

DISTINCT(user_id)


Count the unique values of the name field in the billionaires dataset, which resides in the samples namespace:

DISTINCT([(samples) billionaires].name)

Count the unique values of the customer_id field in the customer dataset, which is a related dataset down-stream of web sales:

DISTINCT([web sales].customers.customer_id)

MAX

MAX is an aggregate function that returns the largest value from the given input expression.

MAX(numeric_or_datetime_field)

Returns a numeric or datetime value of the same type as the input expression.

numeric_or_datetime_field

Required. A field of type INTEGER, LONG, DOUBLE, FIXED, or DATETIME. Unlike row functions, aggregate functions can only take field names as input.

Get the highest value from the sale_amount field:

MAX(sale_amount)

Get the latest date from the Session Timestamp datetime field:

MAX([Session Timestamp])

MIN

MIN is an aggregate function that returns the smallest value from the given input expression.

MIN(numeric_or_datetime_field)

Returns a numeric or datetime value of the same type as the input expression.

numeric_or_datetime_field

Required. A field of type INTEGER, LONG, DOUBLE, FIXED, or DATETIME. Unlike row functions, aggregate functions can only take field names as input.

Get the lowest value from the sale_amount field:

MIN(sale_amount)

Get the earliest date from the Session Timestamp datetime field:

MIN([Session Timestamp])

SUM

SUM is an aggregate function that returns the total of all values from the given input expression.


SUM(numeric_field)

Returns a numeric value of the same type as the input expression.

numeric_field

Required. A field of type INTEGER, LONG, DOUBLE, or FIXED. Unlike row functions, aggregate functions can only take field names as input.

Add the values of the sale_amount field:

SUM(sale_amount)

Add the values of the session count field in the users dataset, which is a related dataset down-stream of clicks:

SUM(clicks.users.[session count])

STDDEV

STDDEV is an aggregate function that calculates the population standard deviation for a group of numeric values. Standard deviation is the square root of the variance.

STDDEV(numeric_field)

Returns a value of type DOUBLE. If there are fewer than two values in the input group, returns NULL.

numeric_field

Required. A field of type INTEGER, LONG, DOUBLE, or FIXED. Unlike row functions, aggregate functions can only take field names as input.

Calculate the standard deviation of the values contained in the sale_amount field:

STDDEV(sale_amount)

VARIANCE

VARIANCE is an aggregate function that calculates the population variance for a group of numeric values. Variance measures the amount by which all values in a group vary from the average value of the group. Data with low variance contains values that are identical or similar. Data with high variance contains values that are not similar. Variance is calculated as the average of the squares of the deviations from the mean. Squaring the deviations ensures that negative and positive deviations do not cancel each other out.

VARIANCE(numeric_field)

Returns a value of type DOUBLE. If there are fewer than two values in the input group, returns NULL.

numeric_field

Required. A field of type INTEGER, LONG, DOUBLE, or FIXED. Unlike row functions, aggregate functions can only take field names as input.


Get the population variance of the values contained in the sale_amount field:

VARIANCE(sale_amount)

ROLLUP and Window Functions

Window functions can only be used in conjunction with ROLLUP. ROLLUP is a modifier to an aggregate expression that determines the partitioning and ordering of a rowset before the associated aggregate function or window function is applied. ROLLUP defines a window or user-specified set of rows within a query result set. A window function then computes a value for each row in the window. You can use window functions to compute aggregated values such as moving averages, cumulative aggregates, running totals, or top-N-per-group results.

ROLLUP

ROLLUP is a modifier to an aggregate function that turns a regular aggregate function into a windowed, partitioned, or adaptive aggregate function. This is useful when you want to compute an aggregation over a subset of rows within the overall result of a viz query.

ROLLUP aggregate_expression
  [ WHERE input_group_condition [...] ]
  [ TO ([partitioning_columns])
    [ ORDER BY (ordering_column [ASC | DESC])
      ROWS|RANGE window_boundary | BETWEEN window_boundary AND window_boundary ] ]

where window_boundary can be one of:

UNBOUNDED PRECEDING
value PRECEDING
value FOLLOWING
UNBOUNDED FOLLOWING

A regular measure is the result of an aggregation (such as SUM or AVG) applied to some fact or metric column of a dataset. For example, suppose we had a dataset with the following rows and columns:

Date | Sale Amount | Product | Region
---- | ----------- | ------- | ------
05/01/2013 | 100 | gadget | west
05/01/2013 | 200 | widget | east
06/01/2013 | 100 | gadget | east
06/01/2013 | 400 | widget | west
07/01/2013 | 300 | widget | west
07/01/2013 | 200 | gadget | east

To define a regular measure called Total Sales, we would use the expression:

SUM([Sale Amount])

When this measure is used in a visualization, the group of input records passed into the aggregate calculation is determined by the dimensions selected by the user when they create the viz. For example, if the user chose Region as a dimension in the viz, there would be two input groups for which the measure would be calculated:

Region | Total Sales
------ | -----------
east | 500
west | 800

If an aggregate expression includes a ROLLUP clause, the column(s) specified in the TO clause of the ROLLUP expression determine the additional partitions over which to compute the aggregate expression. It divides the overall rows returned by the viz query into subsets or buckets, and then computes the aggregate expression within each bucket. Every ROLLUP expression has implicit partitioning defined: an absent TO clause treats the entire result set as one partition; an empty TO clause partitions by whatever dimension columns are present in the viz query.

The WHERE clause is used to filter the input rows that flow into each partition. Input rows that meet the WHERE clause criteria are partitioned; rows that do not are excluded.

The ORDER BY clause with a ROWS or RANGE clause is used to define a window frame within each partition over which to compute the aggregate expression.

When a ROLLUP measure is used in a visualization, the aggregate calculation is computed across a set of input rows that are related to, but separate from, the other dimension(s) used in the viz. This is similar to the type of calculation that is done with a regular measure. However, unlike a regular measure, a ROLLUP measure does not cause the input rows to be grouped into a single result set; the input rows still retain their separate identities. The ROLLUP clause determines how the input rows are split up for processing by the ROLLUP's aggregate function.

ROLLUP expressions can be written to make the partitioning adaptive to whatever dimension columns are selected in the visualization. This is done by using a reference name as the partitioning column, as opposed to a regular column. For example, suppose we wanted to be able to calculate the total sales for any granularity of date. We could create an adaptive measure called Rollup Sales to Date that partitions total sales by date as follows:

ROLLUP SUM([Sale Amount]) TO (Date)


When this measure is used in a visualization, the group of input records passed into the aggregate calculation is determined by the dimension fields selected by the user in the viz, but partitioned by the granularity of Date selected by the user. For example, if the user chose the dimensions Date.Month and Region in the viz, then total sales would be grouped by month and region, but the ROLLUP measure expression would aggregate the sales by month only.

Notice that the results for the east and west regions are the same - this is because the aggregation expression is only considering rows that share the same month when calculating the sum of sales.

Month | Region | Rollup Sales to Date
----- | ------ | --------------------
May 2013 | east | 300
May 2013 | west | 300
June 2013 | east | 500
June 2013 | west | 500
July 2013 | east | 500
July 2013 | west | 500

Suppose within the date partition, we wanted to calculate the cumulative total day to day. We could define a window measure called Running Total to Date that looks at each day and all preceding days as follows:

ROLLUP SUM([Sale Amount]) TO (Date) ORDER BY (Date.Date) ROWS UNBOUNDED PRECEDING

When this measure is used in a visualization, the group of input records passed into the aggregate calculation is determined by the dimension fields selected by the user in the viz, and partitioned by the granularity of Date selected by the user. Within each partition the rows are ordered chronologically (by Date.Date), and the sum amount is then calculated per date partition by looking at the current row (or mark), and all rows that come before it within the partition. For example, if the user chose the dimension Date.Month in the viz, then the ROLLUP measure expression would cumulatively aggregate the sales within each month.

Month | Date.Date | Running Total to Date
----- | --------- | ---------------------
May 2013 | 2013-05-01 | 300
June 2013 | 2013-06-01 | 500
July 2013 | 2013-07-01 | 500

Returns a numeric value per partition based on the output type of the aggregate_expression.

aggregate_expression


Required. An expression containing an aggregate or window function. Simple aggregate functions such as COUNT, AVG, SUM, MIN, and MAX are supported. Window functions such as RANK, DENSE_RANK, and NTILE are supported and can only be used in conjunction with ROLLUP.

Complex aggregate functions such as STDDEV and VARIANCE are not supported.

WHERE input_group_condition

The WHERE clause limits the group of input rows over which to compute the aggregate expression. The input group condition is a Boolean (true or false) condition defined using a comparison operator expression. Any row that does not satisfy the condition will be excluded from the input group used to calculate the aggregated measure value. For example (note that datetime values must be specified in yyyy-MM-dd format):

WHERE Date.Date BETWEEN 2012-06-01 AND 2012-07-31

WHERE Date.Year BETWEEN 2009 AND 2013

WHERE Company LIKE("Plat*")

WHERE Code IN("a","b","c")

WHERE Sales < 50.00

WHERE Age >= 21

You can specify multiple WHERE clauses in a ROLLUP expression.
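
For instance, a sketch that combines two of the conditions above to restrict the input rows to a range of years and smaller sale amounts (column names as in the examples; illustrative only):

ROLLUP SUM(Sales) WHERE Date.Year BETWEEN 2009 AND 2013 WHERE Sales < 50.00 TO ()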

TO ([partitioning_columns])

The TO clause is used to specify the dimension column(s) used to partition a group of input rows. This allows you to calculate a measure value for a specific dimension group (a subset of input rows) that is somehow related to the other dimension groups used in a visualization (all input rows). It is possible to define an empty group (meaning all rows) by using empty parentheses.

When used in a visualization, measure values are computed for groups of input rows that return the same value for the columns specified in the partitioning list. For example, if the Date.Month column is used as a partitioning column, then all records that have the same value for Date.Month will be grouped together in order to calculate the measure value. The aggregate expression is applied to the group specified in the TO clause independently of the other dimension groupings used in the visualization. Note that the partitioning column(s) specified in the TO clause of an adaptive measure expression must also be included as dimensions (or grouping columns) in the visualization.

A partitioning column can also be the name of a reference field. Using a reference field allows the partition criteria to dynamically adapt based on any field of the referenced dataset that is used in a viz. For example, if the partition column is a reference field pointing to the Date dimension, then any sub-field of Date (Date.Year, Date.Month, etc.) can be used as the partitioning column by selecting it in a viz.


A TO clause with an empty partitioning list treats each mark in the result set as an input group. For example, if the viz includes the Month and Region columns, then TO() would be equivalent to TO(Month,Region).

ORDER BY (ordering_column)

The optional ORDER BY clause orders the input rows using the values in the specified column within each partition identified in the TO clause. Use the ORDER BY clause along with the ROWS or RANGE clauses to define windows over which to compute the aggregate function. This is useful for computing moving averages, cumulative aggregates, running totals, or a top value per group of input rows. The ordering column specified in the ORDER BY clause can be a dimension, measure, or an aggregate expression (for example, ORDER BY (SUM(Sales))). If the ordering column is a dimension, it must be included in the viz.

By default, rows are sorted in ascending order (low to high values). You can use the DESC keyword to sort in descending order (high to low values).

ROWS | RANGE

Required when using ORDER BY. Further limits the rows within the partition by specifying start and end points within the partition. This is done by specifying a range of rows with respect to the current row either by logical association (RANGE) or physical association (ROWS). Use either a ROWS or RANGE clause to express the window boundary (the set of input rows in each partition, relative to the current row, over which to compute the aggregate expression). The window boundary can include one, several, or all rows of the partition.

When using the RANGE clause, the ordering column used in the ORDER BY clause must be a sub-column of a reference to Platfora's built-in Date dimension dataset.

window_boundary

A window boundary is required when using either ROWS or RANGE. This defines the set of rows, relative to the current row, over which to compute the aggregate expression. The row order is based on the ordering specified in the ORDER BY clause.

A PRECEDING clause defines a lower window boundary (the number of rows to include before the current row). The FOLLOWING clause defines an upper window boundary (the number of rows to include after the current row). The window boundary expression must include either a PRECEDING or FOLLOWING clause, or both. If PRECEDING is omitted, the current row is considered the first row in the window. Similarly, if FOLLOWING is omitted, the current row is considered the last row in the window. The UNBOUNDED keyword includes all rows in the direction specified. When you need to specify both a start and end of a window, use the BETWEEN and AND keywords.

For example:

ROWS 2 PRECEDING means that the window is three rows in size, starting with two rows preceding up to and including the current row.


ROWS BETWEEN 2 PRECEDING AND 5 FOLLOWING means that the window is eight rows in size, starting with two rows preceding, the current row, and five rows following the current row. The current row is included in the set of rows by default.

You can exclude the current row from the window by specifying a window start and end point before or after the current row. For example:

ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING starts the window with all rows that come before the current row, and ends the window one row before the current row, thereby excluding the current row from the window.

Calculate the percentage of flight records in the same departure date period. Note that the departure_date field is a reference to the Date dataset, meaning that the group to which the measure is applied can adapt to any downstream field of departure_date (departure_date.Year, departure_date.Month, and so on). When used in a viz, this will calculate the percentage of flights for each dimension group in the viz that share the same value for departure_date:

100 * COUNT(Flights) / ROLLUP COUNT(Flights) TO ([Departure Date])

Normalize the number of flights using the carrier American Airlines (AA) as the benchmark. This will allow you to compare the number of flights for other carriers against the fixed baseline number of flights for AA (if AA = 100 percent, then all other carriers will fall either above or below that percentage):

100 * COUNT(Flights) / ROLLUP COUNT(Flights) WHERE [Carrier Code]="AA"

Calculate a generic percentage of total sales. When this measure is used in a visualization, it will show the percentage of total sales that a mark in the viz is contributing to the total for all marks in the viz. The input rows depend on the dimensions selected in the viz.

100 * SUM(sales) / ROLLUP SUM(sales) TO ()

Calculate the cumulative total of sales for a given year on a month-to-month basis (year-to-month sales totals):

ROLLUP SUM(sales) TO (Date.Year) ORDER BY (Date.Month) ROWS UNBOUNDED PRECEDING

Calculate the cumulative total of sales (for all input rows) for all previous years, but exclude the current year from the total.

ROLLUP SUM(sales) TO () ORDER BY (Date.Year) ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING

DENSE_RANK

DENSE_RANK is a windowing aggregate function that orders rows by a measure value and assigns a rank number to each row in the given partition. Rank positions are not skipped in the event of a tie. DENSE_RANK must be used within a ROLLUP expression.

ROLLUP DENSE_RANK() TO ([partitioning_column])
  [ ORDER BY (measure_expression [ASC | DESC])
    ROWS|RANGE window_boundary | BETWEEN window_boundary AND window_boundary ]

where window_boundary can be one of:

UNBOUNDED PRECEDING
value PRECEDING
value FOLLOWING
UNBOUNDED FOLLOWING

DENSE_RANK is a window aggregate function used to assign a ranking number to each row in a group. If multiple rows have the same ranking value (there is a tie), then the tied rows are given the same rank value and subsequent rank positions are not skipped.

The TO clause of the ROLLUP is used to specify the dimension column(s) used to partition a group of input rows. To define a global ranking that can adapt to any dimension groupings used in a viz, specify an empty TO clause.

The ORDER BY clause of the ROLLUP expression determines how to order the rows before they are ranked. The ORDER BY clause should specify the measure field for which you want to calculate the ranks. The ranked rows in the partition are numbered starting at one.

For example, suppose we had a dataset with the following rows and columns, and you want to rank the Quarters and Regions according to the values in the Sales column.

Quarter | Region | Sales
------- | ------ | -----
2010 Q1 | North | 100
2010 Q1 | South | 200
2010 Q1 | East | 300
2010 Q1 | West | 400
2010 Q2 | North | 400
2010 Q2 | South | 250
2010 Q2 | East | 150
2010 Q2 | West | 250


Supposing the lens has an existing measure field called Sales(Sum), you could then define a measure called Sales_Dense_Rank using the following expression:

ROLLUP DENSE_RANK() TO () ORDER BY ([Sales(Sum)] DESC) ROWS UNBOUNDED PRECEDING

When you include the Quarter, Region, and Sales_Dense_Rank columns in the viz, you get the following data points. Notice that tied values are given the same rank number and no rank positions are skipped:

Quarter | Region | Sales_Dense_Rank
------- | ------ | ----------------
2010 Q1 | North | 6
2010 Q1 | South | 4
2010 Q1 | East | 2
2010 Q1 | West | 1
2010 Q2 | North | 1
2010 Q2 | South | 3
2010 Q2 | East | 5
2010 Q2 | West | 3

Returns a value of type LONG.

ROLLUP

Required. DENSE_RANK must be used within a ROLLUP expression in place of the aggregate_expression of the ROLLUP.

The TO clause of the ROLLUP expression specifies the dimension group(s) over which to calculate the window function. An empty TO calculates the window function over all rows in the query as one group.

The ORDER BY clause of the ROLLUP expression specifies a measure field or aggregate expression.

Rank the sum of all sales in descending order, so the highest sales is given the ranking of 1.

ROLLUP DENSE_RANK() TO () ORDER BY ([Sales(Sum)] DESC) ROWS UNBOUNDED PRECEDING

Rank the sum of all sales within a given quarter in descending order, so the highest sales in each quarter is given the ranking of 1.

ROLLUP DENSE_RANK() TO (Quarter) ORDER BY ([Sales(Sum)] DESC) ROWS UNBOUNDED PRECEDING


NTILE

NTILE is a windowing aggregate function that divides a partitioned group of rows into the specified number of buckets, and returns the bucket number to which the current row belongs. NTILE must be used within a ROLLUP expression.

ROLLUP NTILE(integer) TO ([partitioning_column])
  [ ORDER BY (measure_expression [ASC | DESC])
    ROWS|RANGE window_boundary | BETWEEN window_boundary AND window_boundary ]

where window_boundary can be one of:

UNBOUNDED PRECEDING
value PRECEDING
value FOLLOWING
UNBOUNDED FOLLOWING

NTILE is a window aggregate function typically used to calculate percentiles. A percentile (or centile) is a measure used in statistics indicating the value below which a given percentage of records in a group falls. For example, the 20th percentile is the value (or score) below which 20 percent of the records may be found. The term percentile is often used in the reporting of test scores. For example, if a score is in the 86th percentile, it is higher than 86% of the other scores. The 25th percentile is also known as the first quartile (Q1), the 50th percentile as the median or second quartile (Q2), and the 75th percentile as the third quartile (Q3). In general, percentiles, deciles, and quartiles are specific types of ntiles.

NTILE must be used within a ROLLUP expression in place of the aggregate_expression of the ROLLUP.

The TO clause of the ROLLUP is used to specify a fixed dimension column used to partition a group of input rows. To define a global NTILE ranking that can adapt to any dimension groupings used in a viz, specify an empty TO clause.

The ORDER BY clause of the ROLLUP expression determines how to order the rows before they are divided into buckets. The ORDER BY clause should specify the measure field for which you want to calculate NTILE bucket values. A centile would be 100 buckets, a decile would be 10 buckets, a quartile 4 buckets, and so on. The buckets in the partition are numbered starting at one.

For example, suppose we had a dataset with the following rows and columns, and you want to divide the year-to-date sales into four buckets (quartiles) with the highest quartile ranked as 1 and the lowest ranked as 4. Supposing a measure field has been defined called Sum_YTD_Sales, defined as SUM([Sales YTD]), you could then define a measure called YTD_Sales_Quartile using the following expression:

ROLLUP NTILE(4) TO () ORDER BY (Sum_YTD_Sales DESC) ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING

Name | Gender | Sales YTD | YTD_Sales_Quartile
---- | ------ | --------- | ------------------
Chen | F | 3,500,000 | 1
John | M | 3,100,000 | 1
Pete | M | 2,900,000 | 1
Daria | F | 2,500,000 | 2
Jennie | F | 2,200,000 | 2
Mary | F | 2,100,000 | 2
Mike | M | 1,900,000 | 3
Brian | M | 1,700,000 | 3
Molly | F | 1,500,000 | 3
Theresa | F | 1,200,000 | 4
Hans | M | 900,000 | 4
Ben | M | 500,000 | 4

Because the TO clause of the ROLLUP expression is empty, the quartile partitioning adapts to whatever dimensions are used in the viz. For example, if you include the Gender dimension field in the viz, the quartiles would then be computed per gender. The following example divides each gender into buckets, with each gender having 6 year-to-date sales values. The two extra values (the remainder of 6 / 4) are allocated to buckets 1 and 2, which therefore have one more value than buckets 3 or 4.

Name | Gender | Sales YTD | YTD_Sales_Quartile (partitioned by Gender)
---- | ------ | --------- | ------------------------------------------
Chen | F | 3,500,000 | 1
Daria | F | 2,500,000 | 1
Jennie | F | 2,200,000 | 2
Mary | F | 2,100,000 | 2
Molly | F | 1,500,000 | 3
Theresa | F | 1,200,000 | 4
John | M | 3,100,000 | 1
Pete | M | 2,900,000 | 1
Mike | M | 1,900,000 | 2
Brian | M | 1,700,000 | 2
Hans | M | 900,000 | 3
Ben | M | 500,000 | 4

Returns a value of type LONG.

ROLLUP

Required. NTILE must be used within a ROLLUP expression in place of the aggregate_expression of the ROLLUP.

The TO clause of the ROLLUP expression specifies the dimension group(s) over which to calculate the window function. An empty TO calculates the window function over all rows in the query as one group.

The ORDER BY clause of the ROLLUP expression specifies a measure field or aggregate expression.

integer

Required. An integer that specifies the number of buckets to divide the partitioned rows into.

Perhaps the most common use case for NTILE is to get a global ranking of result rows. For example, if you wanted to get the percentile of Total Records per City, you may think the expression to use is: ROLLUP NTILE(100) TO (City) ORDER BY ([Total Records] DESC) ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.

However, by leaving the TO clause blank, the percentile buckets can adapt to whatever dimension(s) you use in the viz. To calculate the Total Records percentiles by City, you could define a global Total_Records_Percentiles measure and then use this measure in conjunction with the City dimension in the viz (or any other dimension for that matter).

ROLLUP NTILE(100) TO () ORDER BY ([Total Records] DESC) ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING

RANK

RANK is a windowing aggregate function that orders rows by a measure value and assigns a rank number to each row in the given partition. Rank positions are skipped in the event of a tie. RANK must be used within a ROLLUP expression.

ROLLUP RANK() TO ([partitioning_column])
  [ ORDER BY (measure_expression [ASC | DESC])
    ROWS|RANGE window_boundary | BETWEEN window_boundary AND window_boundary ]

where window_boundary can be one of:

UNBOUNDED PRECEDING
value PRECEDING
value FOLLOWING
UNBOUNDED FOLLOWING

RANK is a window aggregate function used to assign a ranking number to each row in a group. If multiple rows have the same ranking value (there is a tie), then the tied rows are given the same rank value and the subsequent rank position is skipped.

The TO clause of the ROLLUP is used to specify the dimension column(s) used to partition a group of input rows. To define a global ranking that can adapt to any dimension groupings used in a viz, specify an empty TO clause.

The ORDER BY clause of the ROLLUP expression determines how to order the rows before they are ranked. The ORDER BY clause should specify the measure field for which you want to calculate the ranks. The ranked rows in the partition are numbered starting at one.

For example, suppose we had a dataset with the following rows and columns, and you want to rank the Quarters and Regions according to the values in the Sales column.

Quarter | Region | Sales
------- | ------ | -----
2010 Q1 | North | 100
2010 Q1 | South | 200
2010 Q1 | East | 300
2010 Q1 | West | 400
2010 Q2 | North | 400
2010 Q2 | South | 250
2010 Q2 | East | 150
2010 Q2 | West | 250


Supposing the lens has an existing measure field called Sales(Sum), you could then define a measure called Sales_Rank using the following expression:

ROLLUP RANK() TO () ORDER BY ([Sales(Sum)] DESC) ROWS UNBOUNDED PRECEDING

When you include the Quarter, Region, and Sales_Rank columns in the viz, you get the following data points. Notice that tied values are given the same rank number and the rank positions 2 and 5 are skipped:

Quarter | Region | Sales_Rank
------- | ------ | ----------
2010 Q1 | North | 8
2010 Q1 | South | 6
2010 Q1 | East | 3
2010 Q1 | West | 1
2010 Q2 | North | 1
2010 Q2 | South | 4
2010 Q2 | East | 7
2010 Q2 | West | 4

Returns a value of type LONG.

ROLLUP

Required. RANK must be used within a ROLLUP expression in place of the aggregate_expression of the ROLLUP.

The TO clause of the ROLLUP expression specifies the dimension group(s) over which to calculate the window function. An empty TO calculates the window function over all rows in the query as one group.

The ORDER BY clause of the ROLLUP expression specifies a measure field or aggregate expression.

Rank the sum of all sales in descending order, so the highest sales is given the ranking of 1.

ROLLUP RANK() TO () ORDER BY ([Sales(Sum)] DESC) ROWS UNBOUNDED PRECEDING

Rank the sum of all sales within a given quarter in descending order, so the highest sales in each quarteris given the ranking of 1.

ROLLUP RANK() TO (Quarter) ORDER BY ([Sales(Sum)] DESC) ROWS UNBOUNDED PRECEDING


ROW_NUMBER

ROW_NUMBER is a windowing aggregate function that assigns a unique, sequential number to each row in a group (partition) of rows, starting at 1 for the first row in each partition. ROW_NUMBER must be used within a ROLLUP expression, which acts as a modifier for ROW_NUMBER. Use the ORDER BY clause of the ROLLUP expression to determine the column on which the row numbers are based.

ROLLUP ROW_NUMBER() TO ([partitioning_column])
  [ ORDER BY (measure_expression [ASC | DESC])
    ROWS|RANGE window_boundary | BETWEEN window_boundary AND window_boundary ]

where window_boundary can be one of:

UNBOUNDED PRECEDING
value PRECEDING
value FOLLOWING
UNBOUNDED FOLLOWING

For example, suppose we had a dataset with the following rows and columns:

Quarter | Region | Sales
------- | ------ | -----
2010 Q1 | North | 100
2010 Q1 | South | 200
2010 Q1 | East | 300
2010 Q1 | West | 400
2010 Q2 | North | 400
2010 Q2 | South | 250
2010 Q2 | East | 150
2010 Q2 | West | 250

Suppose you want to assign a unique ID to the sales of each region by quarter in descending order. In this example, a measure field is defined called Sum_Sales with the expression SUM(Sales). You could then define a measure called SalesNumber using the following expression:

ROLLUP ROW_NUMBER() TO (Quarter) ORDER BY (Sum_Sales DESC) ROWS UNBOUNDED PRECEDING


When you include the Quarter, Region, and SalesNumber columns in the viz, you get the following data points:

Quarter | Region | SalesNumber
------- | ------ | -----------
2010 Q1 | North | 4
2010 Q1 | South | 3
2010 Q1 | East | 2
2010 Q1 | West | 1
2010 Q2 | North | 1
2010 Q2 | South | 2
2010 Q2 | East | 4
2010 Q2 | West | 3

Returns a value of type LONG.

None. ROW_NUMBER takes no arguments; the partitioning and ordering are controlled by the TO and ORDER BY clauses of the enclosing ROLLUP expression.

Assign a unique ID to the sales of each region by quarter in descending order, so the highest sales is given the number of 1.

ROLLUP ROW_NUMBER() TO (Quarter) ORDER BY (Sum_Sales DESC) ROWS UNBOUNDED PRECEDING

User Defined Functions (UDFs)

User defined functions (UDFs) allow you to define your own per-row processing logic, and then expose that functionality to users in the Platfora application expression builder.

User defined functions can only be used to implement new row functions, not aggregate functions. If a computed field that uses a UDF is included in a lens, the UDF will be executed once for each row during the lens build process. This is good to keep in mind when writing UDF Java programs, so you do not write programs that negatively impact lens build resources or execution times.

Writing a Platfora UDF Java Program

User defined functions (UDFs) are written in the Java programming language and implement the Platfora-provided Java interface, com.platfora.udf.UserDefinedFunction.

Verify that any JAR file that the UDF will use is compatible with the existing libraries Platfora uses. You can find those libraries in $PLATFORA_HOME/lib.


To define a user defined function for Platfora, you must have the Java Development Kit (JDK) version 6 or 7 installed on the machine where you plan to do your development.

You will also need the com.platfora.udf.UserDefinedFunction interface Java code from your Platfora master server installation. If you go to the $PLATFORA_HOME/tools/udf directory of your Platfora master server installation, you will find two files:

• platfora-udf.jar – This is the compiled code for the com.platfora.udf.UserDefinedFunction interface. You must link to this jar file (place it in the CLASSPATH) when you compile your UDF Java program.

• /com/platfora/udf/UserDefinedFunction.java – This is the source code for the Java interface that your UDF classes need to implement. The source code is provided as reference documentation of the Platfora UserDefinedFunction interface. You can refer to this file when writing your UDF Java programs.

1. Copy the file $PLATFORA_HOME/tools/udf/platfora-udf.jar to a directory on the machine where you plan to develop and compile your UDF program.

2. Write a Java program that implements the com.platfora.udf.UserDefinedFunction interface.

For example, here is a sample Java program that defines a REPEAT_STRING user defined function. This simple function repeats an input string a specified number of times.

import java.util.List;

/**
 * Sample user-defined function implementation that demonstrates
 * how to create a REPEAT_STRING function.
 */
public class RepeatString implements com.platfora.udf.UserDefinedFunction {

    /**
     * Returns the name of the user-defined function.
     * The first character in the name must be a letter,
     * and subsequent characters must be either letters,
     * digits, or underscores. You cannot name your function
     * the same name as an existing Platfora built-in function.
     * Names are case-insensitive.
     */
    @Override
    public String getFunctionName() {
        return "REPEAT_STRING";
    }

    /**
     * Returns one of the following values, reflecting the
     * return type of the user-defined function:
     * DATETIME, DOUBLE, FIXED, INTEGER, LONG, or STRING.
     */
    @Override
    public String getReturnType() {
        return "STRING";
    }

    /**
     * Returns an array of Strings, one for each of the
     * input arguments to the user-defined function,
     * specifying the required data type for each argument.
     * The Strings should be of the following values:
     * DATETIME, DOUBLE, FIXED, INTEGER, LONG, STRING.
     */
    @Override
    public String[] getArgumentTypes() {
        return new String[] { "STRING", "INTEGER" };
    }

    /**
     * Returns a human-readable description of what the function
     * does, to be displayed to Platfora users in the
     * Expression Builder. May return null.
     */
    @Override
    public String getDescription() {
        return "The REPEAT_STRING function returns an input string repeated "
                + "a specified number of times.";
    }

    /**
     * Returns a human-readable description explaining the
     * value that the function returns, to be displayed to
     * Platfora users in the Expression Builder. May return null.
     */
    @Override
    public String getReturnValueDescription() {
        return "Returns one value per row of type STRING";
    }

    /**
     * Returns a human-readable example of the function syntax,
     * to be displayed to Platfora users in the Expression
     * Builder. May return null.
     */
    @Override
    public String getExampleUsage() {
        return "CONCAT(\"It's a \", REPEAT_STRING(\"Mad \",4), \" World\")";
    }

    /**
     * The compute method performs the actual work of evaluating
     * the user-defined function. The method should operate on the
     * argument values provided to calculate the function return value
     * and return a Java object of the appropriate type to represent
     * the return value. The following mapping describes the Java
     * object type that is used to represent each Platfora data type:
     *
     *   DATETIME -> java.lang.Long
     *   DOUBLE   -> java.lang.Double
     *   FIXED    -> java.lang.Long
     *   INTEGER  -> java.lang.Integer
     *   LONG     -> java.lang.Long
     *   STRING   -> java.lang.String
     *
     * Note on DATETIME type: datetime values in Platfora are represented
     * as Longs in Unix Epoch Time (the number of milliseconds since the
     * epoch).
     *
     * Note on FIXED type: fixed-precision numbers in Platfora are
     * represented as Longs that have been scaled by a factor of 10,000.
     * For example, the fixed-precision value 2.5000 would be represented
     * as the Java value returned by the expression new Long(25000L).
     *
     * In the event that the user-defined function encounters invalid
     * inputs, or the function return value is not defined given the
     * inputs provided, the compute method should return null rather
     * than throwing an exception. The compute method should avoid
     * throwing any exceptions.
     *
     * @param arguments The values of the function inputs.
     *
     * The entries in this list will match the specification provided
     * by the getArgumentTypes method in type, number, and order: for
     * example, if getArgumentTypes returned an array of length 3 with
     * the values STRING, DOUBLE, STRING, then the arguments parameter
     * will be a list of 3 Java objects: a java.lang.String, a
     * java.lang.Double, and a java.lang.String. Any of the values
     * within the arguments List may be null.
     */
    @Override
    public String compute(List arguments) {
        // cast the inputs to the correct types
        final String toRepeat = (String) arguments.get(0);
        final Integer numberOfRepeats = (Integer) arguments.get(1);

        // check for invalid inputs
        if (toRepeat == null || numberOfRepeats == null || numberOfRepeats < 0)
            return null;

        // repeat the input string the specified number of times
        final StringBuilder builder = new StringBuilder();
        for (int i = 0; i < numberOfRepeats; i++) {
            builder.append(toRepeat);
        }
        return builder.toString();
    }
}

3. Compile your .java UDF program file into a .class file (make sure to link to the platfora-udf.jar file or place it in your Java CLASSPATH).

For example, to compile the RepeatString.java program using Java 1.7:

javac -source 1.7 -target 1.7 -cp platfora-udf.jar RepeatString.java

4. Create a Java archive file (.jar) containing your .class file.

For example:

jar cf repeat-string-udf.jar RepeatString.class
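Before installing the function, you can sanity-check the class locally with a small test harness. The harness below is illustrative only and is not part of the Platfora UDF interface; compile and run it with RepeatString.class and platfora-udf.jar on the CLASSPATH:

import java.util.Arrays;

public class RepeatStringTest {
    public static void main(String[] args) {
        RepeatString udf = new RepeatString();
        // valid input: prints "Mad Mad Mad "
        System.out.println(udf.compute(Arrays.asList("Mad ", 3)));
        // invalid input (negative repeat count): prints null per the interface contract
        System.out.println(udf.compute(Arrays.asList("Mad ", -1)));
    }
}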

After you have written and compiled your UDF Java program, you must then install and enable it on the Platfora master server. See Adding a UDF to the Platfora Expression Builder.

Adding a UDF to the Platfora Expression Builder

After you have written and compiled a user defined function (UDF) Java class, you must install your class on the Platfora master server and enable it so that it can be seen and used in the Platfora expression builder.

This task is performed on the Platfora master server.

Before you begin, you must have written and compiled a Java class for your user defined function. See Writing a Platfora UDF Java Program.

1. Create a directory named extlib in the Platfora data directory on the Platfora master server.

For example:

$ mkdir $PLATFORA_DATA_DIR/extlib

2. Copy the Java archive (.jar) file containing your UDF class to the $PLATFORA_DATA_DIR/extlib directory on the Platfora master server.

For example:

$ cp repeat-string-udf.jar $PLATFORA_DATA_DIR/extlib/

3. Set the Platfora server configuration property, platfora.udf.class.names, so it contains the name of your UDF Java class. If you have more than one class, separate the class names with a comma.


For example, to set this property using the platfora-config command-line utility:

$ $PLATFORA_HOME/bin/platfora-config set --key platfora.udf.class.names --value RepeatString

4. Restart the Platfora server:

$ platfora-services restart

The user defined function will then be available for defining computed field expressions in the Add Field dialog of the Platfora application.

Due to the way some web browsers cache Javascript files, the newly added function may not appear in the Functions list for up to 24 hours. However, the function is immediately available for use and recognized by the Expression auto-complete feature.

Regular Expression Reference

Regular expressions vary in complexity, using a combination of basic constructs to describe a string matching pattern. This reference describes the most common regular expression matching patterns, but is not a comprehensive list.

Regular expressions, also referred to as regex or regexp, are a standardized collection of special characters and constructs used for matching strings of text. They provide a flexible and precise language for matching particular characters, words, or patterns of characters.


Platfora regular expressions are based on the pattern matching syntax of the Java programming language. For more in-depth information on writing valid regular expressions, refer to the Java regular expression pattern documentation.

Platfora makes use of regular expressions in the following contexts:

• In computed field expressions that use the REGEX or REGEX_REPLACE functions.

• In PARTITION expression statements for event series processing computed fields.

• In the Regex file parser in data ingest.

• In the data source location path descriptor in data ingest.

• In lens filter expressions.

Regex Literal and Special Characters

The most basic form of regular expression pattern matching is the match of a literal character or string. Regular expressions also have a number of special characters that affect the way a pattern is matched. This section describes the regular expression syntax for referring to literal characters, special characters, non-printable characters (such as a tab or a newline), and special character escaping.

The most basic form of pattern matching is the match of literal characters. For example, if the regular expression is foo and the input string is foo, the match will succeed because the strings are identical.

Certain characters are reserved for special use in regular expressions. These special characters are often called metacharacters. If you want to use special characters as literal characters, they must be escaped.

Character Name Character Reserved For

opening bracket [ start of a character class

closing bracket ] end of a character class

hyphen - character ranges within a character class

backslash \ general escape character

caret ^ beginning of string, negating of a character class

dollar sign $ end of string

period . matching any single character

pipe | alternation (OR) operator

question mark ? optional quantifier, quantifier minimizer

asterisk * zero or more quantifier

plus sign + once or more quantifier


opening parenthesis ( start of a subexpression group

closing parenthesis ) end of a subexpression group

opening brace { start of min/max quantifier

closing brace } end of min/max quantifier

There are two ways to force a special character to be treated as an ordinary character:

• Precede the special character with a \ (backslash character). For example, to specify an asterisk as a literal character instead of a quantifier, use \*.

• Enclose the special character(s) within \Q (starting quote) and \E (ending quote). Everything between \Q and \E is then treated as literal characters.

• To escape literal double-quotes in a REGEX() expression, double the double-quotes (""). For example, to extract the inches portion from a height field where example values are 6'2", 5'11":

REGEX(height, "\'(\d+)""$")
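Because Platfora regular expressions follow Java pattern syntax, you can verify escaping behavior in a small standalone Java snippet before using a pattern in Platfora. This sketch is purely illustrative:

import java.util.regex.Pattern;

public class EscapeDemo {
    public static void main(String[] args) {
        // \* matches a literal asterisk rather than acting as a quantifier
        System.out.println(Pattern.matches("a\\*b", "a*b"));              // true
        // \Q...\E treats everything in between as literal characters
        System.out.println(Pattern.matches("\\Q(1+1)\\E=2", "(1+1)=2"));  // true
    }
}

Note that in Java source code each backslash in the pattern must itself be escaped as \\.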

You can use special character sequence constructs to specify non-printable characters in a regular expression. Some of the most commonly used constructs are:

Construct Matches

\n newline character

\r carriage return character

\t tab character

\f form feed character

Regex Character Classes

A character class allows you to specify a set of characters, enclosed in square brackets, that can produce a single character match. There are also a number of special predefined character classes (backslash character sequences that are shorthand for the most common character sets).

A character class matches only a single character. For example, gr[ae]y will match gray or grey, but not graay or graey. The order of the characters inside the brackets does not matter.

You can use a hyphen inside a character class to specify a range of characters. For example, [a-z] matches a single lower-case letter between a and z. You can also use more than one range, or a combination of ranges and single characters. For example, [0-9X] matches a numeric digit or the letter X. Again, the order of the characters and the ranges does not matter.


A caret following an opening bracket specifies characters to exclude from a match. For example, [^abc] will match any character except a, b, or c.

Construct Type Description

[abc] simple matches a or b or c

[^abc] negation matches any character except a, b, or c

[a-zA-Z] range matches a through z, or A through Z (inclusive)

[a-d[m-p]] union matches a through d, or m through p

[a-z&&[def]] intersection matches d, e, or f

[a-z&&[^xq]] subtraction matches a through z, except for x and q
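Because these constructs follow Java pattern syntax, you can try them out in a standalone Java snippet (illustrative only):

import java.util.regex.Pattern;

public class CharClassDemo {
    public static void main(String[] args) {
        System.out.println(Pattern.matches("gr[ae]y", "grey"));       // true: [ae] matches one character
        System.out.println(Pattern.matches("gr[ae]y", "graey"));      // false: a class matches a single character
        System.out.println(Pattern.matches("[^abc]", "d"));           // true: negation
        System.out.println(Pattern.matches("[a-z&&[^xq]]+", "quiz")); // false: x and q are subtracted from a-z
    }
}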

Predefined character classes offer convenient shorthands for commonly used regular expressions.

Construct Description Example

. matches any single character (except newline). Example: .at matches "cat", "hat", and also "bat" in the phrase "batch files"

\d matches any digit character (equivalent to [0-9]). Example: \d matches "3" in "C3PO" and "2" in "file_2.txt"

\D matches any non-digit character (equivalent to [^0-9]). Example: \D matches "S" in "900S" and "Q" in "Q45"

\s matches any single white-space character (equivalent to [ \t\n\x0B\f\r]). Example: \sbook matches "book" in "blue book" but nothing in "notebook"

\S matches any single non-white-space character. Example: \Sbook matches "book" in "notebook" but nothing in "blue book"

\w matches any alphanumeric character, including underscore (equivalent to [A-Za-z0-9_]). Example: r\w* matches "rm" and "root"

\W matches any non-alphanumeric character (equivalent to [^A-Za-z0-9_]). Example: \W matches "&" in "stmd &", "%" in "100%", and "$" in "$HOME"


POSIX has a set of character classes that denote certain common ranges. They are similar to bracket and predefined character classes, except they take into account the locale (the local language/coding system).

\p{Lower} a lower-case alphabetic character: [a-z]

\p{Upper} an upper-case alphabetic character: [A-Z]

\p{ASCII} an ASCII character: [\x00-\x7F]

\p{Alpha} an alphabetic character: [a-zA-Z]

\p{Digit} a decimal digit: [0-9]

\p{Alnum} an alphanumeric character: [a-zA-Z0-9]

\p{Punct} a punctuation character, one of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

\p{Graph} a visible character: [\p{Alnum}\p{Punct}]

\p{Print} a printable character: [\p{Graph}\x20]

\p{Blank} a space or tab: [ \t]

\p{Cntrl} a control character: [\x00-\x1F\x7F]

\p{XDigit} a hexadecimal digit: [0-9a-fA-F]

\p{Space} a whitespace character: [ \t\n\x0B\f\r]


Regex Line and Word Boundaries

Boundary matching constructs are used to specify where in a string to apply a matching pattern. For example, you can search for a particular pattern within a word boundary, or search for a pattern at the beginning or end of a line.

Construct Description Example

^ matches from the beginning of a line (multi-line matches are currently not supported). Example: ^172 will match the "172" in IP address "172.18.1.11" but not in "192.172.2.33"

$ matches from the end of a line (multi-line matches are currently not supported). Example: d$ will match the "d" in "maid" but not in "made"

\b matches within a word boundary. Example: \bis\b matches the word "is" in "this is my island", but not the "is" part of "this" or "island". \bis matches both "is" and the "is" in "island", but not in "this".

\B matches within a non-word boundary. Example: \Bb matches "b" in "sbin" but not in "bash"
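A quick illustrative Java snippet (Platfora regex follows Java pattern syntax) showing word boundary behavior:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BoundaryDemo {
    public static void main(String[] args) {
        Matcher m = Pattern.compile("\\bis\\b").matcher("this is my island");
        while (m.find()) {
            // prints 5: only the standalone word "is" matches,
            // not the "is" inside "this" or "island"
            System.out.println(m.start());
        }
    }
}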

Regex Quantifiers

Quantifiers specify how often the preceding regular expression construct should match. There are three classes of quantifiers: greedy, reluctant, and possessive. The difference between greedy, reluctant, and possessive quantifiers involves what part of the string to try for the initial match, and how to retry if the initial attempt does not produce a match.

By default, quantifiers are greedy. A greedy quantifier will first try for a match with the entire input string. If that produces a match, then the match is considered a success, and the engine can move on to the next construct in the regular expression. If the first try does not produce a match, the engine backs off one character at a time until a match is found. So a greedy quantifier checks for possible matches in order from the longest possible input string to the shortest possible input string, recursively trying from right to left.

Adding a ? (question mark) to a greedy quantifier makes it reluctant. A reluctant quantifier will first try for a match from the beginning of the input string, starting with the shortest possible piece of the string that matches the regex construct. If that produces a match, then the match is considered a success, and the engine can move on to the next construct in the regular expression. If the first try does not produce a match, the engine adds one character at a time until a match is found. So a reluctant quantifier checks for possible matches in order from the shortest possible input string to the longest possible input string, recursively trying from left to right.

Adding a + (plus sign) to a greedy quantifier makes it possessive. A possessive quantifier is like a greedy quantifier on the first attempt (it tries for a match with the entire input string). The difference is that unlike a greedy quantifier, a possessive quantifier does not retry a shorter string if a match is not found. If the initial match fails, the possessive quantifier reports a failed match. It does not make any more attempts.

Greedy / Reluctant / Possessive Description Example

? ?? ?+ matches the previous character or construct once or not at all. Example: st?on matches "son" in "johnson" and "ston" in "johnston" but nothing in "clinton" or "version"

* *? *+ matches the previous character or construct zero or more times. Example: if* matches "if", "iff" in "diff", or "i" in "print"

+ +? ++ matches the previous character or construct one or more times. Example: if+ matches "if", "iff" in "diff", but nothing in "print"

{n} {n}? {n}+ matches the previous character or construct exactly n times. Example: o{2} matches "oo" in "lookup" and the first two o's in "fooooo" but nothing in "mount"

{n,} {n,}? {n,}+ matches the previous character or construct at least n times. Example: o{2,} matches "oo" in "lookup" and all five o's in "fooooo" but nothing in "mount"

{n,m} {n,m}? {n,m}+ matches the previous character or construct at least n times, but no more than m times. Example: F{2,4} matches "FF" in "#FF0000" and the last four F's in "#FFFFFF"
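The difference between the three quantifier classes can be seen in a small illustrative Java snippet (Platfora regex follows Java pattern syntax):

public class QuantifierDemo {
    public static void main(String[] args) {
        String input = "<a><b>";
        // greedy: .* grabs as much as possible, so the whole string is one match
        System.out.println(input.replaceAll("<.*>", "X"));   // X
        // reluctant: .*? grabs as little as possible, so each tag matches separately
        System.out.println(input.replaceAll("<.*?>", "X"));  // XX
        // possessive: a*+ consumes every 'a' and never backtracks,
        // leaving nothing for the final 'a' in the pattern
        System.out.println("aaa".matches("a*+a"));           // false
        System.out.println("aaa".matches("a*a"));            // true (greedy backtracks)
    }
}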


Regex Capturing Groups

Groups are specified by a pair of parentheses around a subpattern in the regular expression. By placing part of a regular expression inside parentheses, you group that part of the regular expression together. This allows you to apply regex operators and quantifiers to the entire group at once. Besides grouping part of a regular expression together, parentheses also create a capturing group. Capturing groups are used to determine which matching values to save or return from your regular expression.

A regular expression can have more than one group and the groups can be nested. The groups are numbered 1-n from left to right, starting with the first opening parenthesis. There is always an implicit group 0, which contains the entire match. For example, the pattern:

(a(b*))+(c)

contains three groups:

group 1: (a(b*))
group 2: (b*)
group 3: (c)

By default, a group captures the text that produces a match. The portion of the string matched by the grouped subexpression is captured in memory for later retrieval or use.
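An illustrative Java snippet showing how the three groups of this pattern (plus the implicit group 0) are captured:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupDemo {
    public static void main(String[] args) {
        Matcher m = Pattern.compile("(a(b*))+(c)").matcher("abbc");
        if (m.matches()) {
            System.out.println(m.group(0)); // abbc (the entire match)
            System.out.println(m.group(1)); // abb
            System.out.println(m.group(2)); // bb
            System.out.println(m.group(3)); // c
        }
    }
}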

Capturing Groups and the Regex Line Parser

When you choose the Regex line parser during the Parse step of the data ingest process, Platfora uses capturing groups to determine what parts of the regular expression to return as columns. The Regex line parser applies the user-supplied regular expression against each line in the source file, and returns each capturing group in the regular expression as a column value.

For example, suppose you had user records in a file, and the lines were formatted like this:

Name: John Smith Address: 123 Main St. Age: 25 Comment: Active
Name: Sally R. Jones Address: 2 E. El Camino Real Age: 32
Name: Rod Rogers Address: 55 Elm Street Age: 47 Comment: Suspended

You could use the following regular expression to extract the Full Name, Last Name, Address, Age, and Comment column values:

Name: (.*\s(\p{Alpha}+)) Address:\s+(.*) Age:\s+([0-9]+)(?:\s+Comment:\s+(.*))?

Capturing Groups and the REGEX Function

The REGEX function can be used to extract a portion of a string value. For the REGEX function, only the value of the first capturing group is returned. For example, if you wanted to match all possible email address strings, but only return the provider portion of the email address from the email field:

REGEX(email,"^[a-zA-Z0-9._%+-]+@([a-zA-Z0-9._-]+)\.[a-zA-Z]{2,4}$")

Capturing Groups and the REGEX_REPLACE Function


The REGEX_REPLACE function is used to match a string value, and replace matched strings with another value. The REGEX_REPLACE function takes three arguments: an input string, a matching regex, and a replacement regex. Capturing groups can be used to capture backreferences (see Backreferences), but do not control what portions of the match are returned (the entire match is always returned).

Backreferences allow you to capture and reuse a subexpression match inside the same regular expression. You can reuse a capturing group as a backreference by referring to its group number preceded by a backslash (for example, \1 refers to capturing group 1, \2 refers to capturing group 2, and so on).

For example, if you wanted to match a pair of HTML tags and their enclosed text, you could capture the opening tag into a backreference, and then reuse it to match the corresponding closing tag:

(<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\2>)

This regular expression contains two capturing groups: the outermost capturing group (which captures the entire string), and one which captures the string matched by [A-Z][A-Z0-9]* into backreference number two. This backreference can then be reused with \2 (backslash two) to match the corresponding closing HTML tag.
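An illustrative Java snippet verifying this behavior (the CASE_INSENSITIVE flag is added here so the lowercase sample tags match the [A-Z] classes):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BackrefDemo {
    public static void main(String[] args) {
        Pattern p = Pattern.compile("(<([A-Z][A-Z0-9]*)\\b[^>]*>.*?</\\2>)",
                Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher("some <b>bold</b> text");
        if (m.find()) {
            System.out.println(m.group(1)); // <b>bold</b>
            System.out.println(m.group(2)); // b (the tag name reused by \2)
        }
    }
}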

When referring to capturing groups in the previous regular expression, the backreference syntax is slightly different. The backreference group number is preceded by a dollar sign instead of a backslash (for example, $1 refers to capturing group 1 of the previous expression). An example of this would be the REGEX_REPLACE function, which takes two regular expressions: one for the matching string, and one for the replacement string.

The following example matches the values in a phone_number field where phone number values are formatted as xxx.xxx.xxxx, and replaces them with phone number values formatted as (xxx) xxx-xxxx. Notice the backreferences in the replacement expression; they refer to the capturing groups of the previous matching expression:

REGEX_REPLACE(phone_number,"([0-9]{3})\.([0-9]{3})\.([0-9]{4})","\($1\) $2-$3")

In some cases, you may want to use parentheses to group subpatterns, but not capture text. A non-capturing group starts with (?: (a question mark and colon following the opening parenthesis). For example, h(?:a|i|o)t matches hat or hit or hot, but does not capture the a, i, or o from the subexpression.
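An illustrative Java snippet confirming that a non-capturing group contributes no capturing groups:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NonCapturingDemo {
    public static void main(String[] args) {
        Matcher m = Pattern.compile("h(?:a|i|o)t").matcher("a hot hat");
        System.out.println(m.find());        // true: matches "hot"
        System.out.println(m.groupCount());  // 0: (?:...) creates no capturing group
    }
}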


Appendix A: Platfora Expression Language Dictionary

An expression computes or produces a value by combining field or column values, constant values, operators, and functions. Platfora has a built-in expression language. You use the language's functions and operators in dataset computed fields, vizboard computed fields, lens filters, and programmatic lens queries.

Topics:

• Expression Quick Dictionary

• Comparison Operators

• Logical Operators

• Arithmetic Operators

• Conditional and NULL Processing

• Event Series Processing

• String Functions

• URL Functions

• IP Address Functions

• Date and Time Functions

• Math Functions

• Data Type Conversion Functions

• Aggregate Functions

• ROLLUP and Window Functions

• User Defined Functions (UDFs)

• Regular Expression Reference

Expression Quick Dictionary

An expression is a combination of columns (or fields), constant values, operators, and functions used to evaluate, transform, or produce a value. Simple expressions can be combined to make more complex expressions. This quick reference describes the functions and operators that can be used to write expressions.


Platfora's built-in statements, functions and operators are divided into the following categories:

• Conditional and NULL Processing

• Event Series Processing

• String Processing

• Date and Time Processing

• URL Processing

• IP Address Processing

• Mathematical Processing

• Data Type Conversion

• Aggregation and Measure Processing

• ROLLUP and Window Calculations

• User Defined Functions

• Comparison Operators

• Logical Operators

• Arithmetic Operators

Conditional and NULL Processing

Conditional and NULL processing allows you to transform or manipulate data values based on certain defined conditions. Conditional processing (CASE) can be done at either the dataset or vizboard level. NULL processing (COALESCE and IS_VALID) is only applicable at the dataset level. During a lens build, any NULL values in the source data are converted to default values, so lenses and vizboards have no concept of NULL values.

Function Description Example

CASE evaluates each row in the dataset according to one or more input conditions, and outputs the specified result when the input conditions are met. Example: CASE WHEN gender = "M" THEN "Male" WHEN gender = "F" THEN "Female" ELSE "Unknown" END

COALESCE returns the first valid value (NOT NULL value) from a comma-separated list of expressions. Example: COALESCE(hourly_wage * 40 * 52, salary)

IS_VALID returns 0 if the returned value is NULL, and 1 if the returned value is NOT NULL. Example: IS_VALID(sale_amount)


Event Series Processing

Event series processing allows you to partition rows of input data, order the rows sequentially (typically by a timestamp), and search for matching patterns in a set of rows. Computed fields that are defined in a dataset using a PARTITION expression are considered event series processing computed fields. Event series processing computed fields are processed differently than regular computed fields. Instead of computing values from the input of a single row, they compute values from inputs of multiple rows in the dataset. Event series processing computed fields can only be defined in the dataset - not in the vizboard.

Function Description Example

PACK_VALUES returns multiple output values packed into a single string of key/value pairs separated by the Platfora default key and pair separators; useful when the OUTPUT clause of a PARTITION expression returns multiple output values. Example: PACK_VALUES("ID",custid,"Age",age)

PARTITION partitions the rows of a dataset, orders the rows sequentially (typically by a timestamp), and searches for matching patterns in a set of rows. Example: PARTITION BY SessionID ORDER BY Timestamp PATTERN (A,B,C) DEFINE A AS Page = "home.html", B AS Page = "product.html", C AS Page = "checkout.html" OUTPUT "TRUE"

String Functions

String functions allow you to manipulate and transform textual data, such as combining string values or extracting a portion of a string value.

Function Description Example

ARRAY_CONTAINS performs a whole string match against a string containing delimited values and returns a 1 or 0 depending on whether or not the string contains the search value. Example: ARRAY_CONTAINS(device,",","iPad")


CONCAT concatenates (combines together) the results of multiple string expressions. Example: CONCAT(month,"/",day,"/",year)

FILE_NAME returns the original file name from the source file system. Example: TO_DATE(SUBSTRING(FILE_NAME(),0,8),"yyyyMMdd")

FILE_PATH returns the full URI path from the source file system. Example: TO_DATE(REGEX(FILE_PATH(),"hdfs://myhdfs-server.com/data/logs/(\d{8})/(?:\d{1,3}\.*)+\.log"),"yyyyMMdd")

EXTRACT_COOKIE extracts the value of the given cookie identifier from a semi-colon delimited list of cookie key=value pairs. Example: EXTRACT_COOKIE("SSID=ABC; vID=44","vID") returns 44

EXTRACT_VALUE extracts the value for the given key from a string containing delimited key/value pairs. Example: EXTRACT_VALUE("firstname;daria|lastname;hutch","lastname",";","|") returns hutch

INSTR returns an integer indicating the position of a character within a string that is the first character of the occurrence of a substring. Example: INSTR(url,"http://",-1,1)

JAVA_STRING returns the unescaped version of a Java unicode character escape sequence as a string value. Example: CASE WHEN currency == JAVA_STRING("\u00a5") THEN "yes" ELSE "no" END

JOIN_STRINGS concatenates (combines together) the results of multiple string expressions with the separator in between each non-null value. Example: JOIN_STRINGS("/",month,day,year)

JSON_ARRAY extracts a JSON ARRAY as a STRING value from a field in a JSON object. Example: JSON_ARRAY(friends, "f1")


JSON_ARRAY_CONTAINS performs a whole string match against a string formatted as a JSON array and returns a 1 or 0 depending on whether or not the string contains the search value. Example: JSON_ARRAY_CONTAINS(software,"platfora")

JSON_DOUBLE extracts a DOUBLE value from a field in a JSON object. Example: JSON_DOUBLE(top_scores,"test_scores.2")

JSON_FIXED extracts a FIXED value from a field in a JSON object. Example: JSON_FIXED(top_scores,"test_scores.2")

JSON_INTEGER extracts an INTEGER value from a field in a JSON object. Example: JSON_INTEGER(top_scores,"test_scores.2")

JSON_LONG extracts a LONG value from a field in a JSON object. Example: JSON_LONG(top_scores,"test_scores.2")

JSON_OBJECT extracts a JSON OBJECT as a STRING value from a field in a JSON object. Example: JSON_OBJECT(friends, "f1.0.f2.f3")

JSON_STRING extracts a STRING value from a field in a JSON object. Example: JSON_STRING(misc,"hobbies.0")

LENGTH returns the count of characters in a string value. Example: LENGTH(name)

REGEX performs a whole string match against a string value with a regular expression and returns the portion of the string matching the first capturing group of the regular expression. Example: REGEX(weblog.request_line,"GET\s/([a-zA-Z0-9._%-]+\.[html])\sHTTP/[0-9.]+")


REGEX_REPLACE evaluates a string value against a regular expression to determine if there is a match, and replaces matched strings with the specified replacement value. Example: REGEX_REPLACE(phone_number,"([0-9]{3})\.([0-9]{3})\.([0-9]{4})","\($1\) $2-$3")

SPLIT breaks down a delimited input string into sections and returns the specified section of the string. Example: SPLIT("Restaurants>Location>San Francisco",">", -1) returns San Francisco

SUBSTRING returns the specified characters of a string value based on the given start and end position. Example: SUBSTRING(name,0,1)

TO_LOWER converts all alphabetic characters in a string to lower case. Example: TO_LOWER("123 Main Street") returns 123 main street

TO_UPPER converts all alphabetic characters in a string to upper case. Example: TO_UPPER("123 Main Street") returns 123 MAIN STREET

TRIM removes leading and trailing spaces from a string value. Example: TRIM(area_code)

XPATH_STRING takes an XML-formatted string and returns the first string matching the given XPath expression. Example: XPATH_STRING(address,"//address[@type='home']/zipcode")

XPATH_STRINGS takes an XML-formatted string and returns a newline-separated array of strings matching the given XPath expression. Example: XPATH_STRINGS(address,"/list/address[1]/street")


XPATH_XML takes an XML-formatted string and returns an XML-formatted string matching the given XPath expression. Example: XPATH_XML(address,"//address[last()]")

Date and Time Functions

Date and time functions allow you to manipulate and transform datetime values, such as calculating time differences between two datetime values, or extracting a portion of a datetime value.

Function Description Example

DAYS_BETWEEN calculates the whole number of days (ignoring time) between two DATETIME values. Example: DAYS_BETWEEN(ship_date,order_date)

DATE_ADD adds the specified time interval to a DATETIME value. Example: DATE_ADD(invoice_date,45,"day")

HOURS_BETWEEN calculates the whole number of hours (ignoring minutes, seconds, and milliseconds) between two DATETIME values. Example: HOURS_BETWEEN(NOW(),impressions.adview_timestamp)

EXTRACT returns the specified portion of a DATETIME value. Example: EXTRACT("hour",order_date)

MILLISECONDS_BETWEEN calculates the whole number of milliseconds between two DATETIME values. Example: MILLISECONDS_BETWEEN(request_timestamp,response_timestamp)

MINUTES_BETWEEN calculates the whole number of minutes (ignoring seconds and milliseconds) between two DATETIME values. Example: MINUTES_BETWEEN(impression_timestamp,conversion_timestamp)

NOW returns the current system date and time as a DATETIME value. Example: YEAR_DIFF(NOW(),users.birthdate)


SECONDS_BETWEEN calculates the whole number of seconds (ignoring milliseconds) between two DATETIME values. Example: SECONDS_BETWEEN(impression_timestamp,conversion_timestamp)

TRUNC truncates a DATETIME value to the specified format. Example: TRUNC(TO_DATE(order_date,"MM/dd/yyyy HH:mm:ss"),"day")

YEAR_DIFF calculates the fractional number of years between two DATETIME values. Example: YEAR_DIFF(NOW(),users.birthdate)

URL Functions

URL functions allow you to extract different portions of a URL string, and decode text that is URL-encoded.

Function Description Example

URL_AUTHORITY returns the authority portion of a URL string. Example: URL_AUTHORITY("http://user:password@mycompany.com:8012/mypage.html") returns user:password@mycompany.com:8012

URL_FRAGMENT returns the fragment portion of a URL string. Example: URL_FRAGMENT("http://platfora.com/news.php?topic=press#Platfora%20News") returns Platfora%20News

URL_HOST returns the host, domain, or IP address portion of a URL string. Example: URL_HOST("http://user:password@mycompany.com:8012/mypage.html") returns mycompany.com

URL_PATH returns the path portion of a URL string. Example: URL_PATH("http://platfora.com/company/contact.html") returns /company/contact.html

URL_PORT returns the port portion of a URL string. Example: URL_PORT("http://user:password@mycompany.com:8012/mypage.html") returns 8012

URL_PROTOCOL returns the protocol (or URI scheme name) portion of a URL string. Example: URL_PROTOCOL("http://www.platfora.com") returns http


URL_QUERY returns the query portion of a URL string. Example: URL_QUERY("http://platfora.com/news.php?topic=press&timeframe=today") returns topic=press&timeframe=today

URLDECODE decodes a string that has been encoded with the application/x-www-form-urlencoded media type. Example: URLDECODE("N%2FA%20or%20%22not%20applicable%22")

IP Address Functions

IP address functions allow you to manipulate and transform STRING data consisting of IP address values.

Function Description Example

CIDR_MATCH compares two STRING arguments representing a CIDR mask and an IP address, and returns 1 if the IP address falls within the specified subnet mask or 0 if it does not. Example: CIDR_MATCH("60.145.56.0/24","60.145.56.246") returns 1

HEX_TO_IP converts a hexadecimal-encoded STRING to a text representation of an IP address. Example: HEX_TO_IP(AB20FE01) returns 171.32.254.1

Math Functions

Math functions allow you to perform basic math calculations on numeric values. You can also use the arithmetic operators to perform simple math calculations, such as addition, subtraction, division and multiplication.

Function Description Example

DIV divides two LONG values and returns a quotient value of type LONG. Example: DIV(TO_LONG(file_size),1024)


EXP raises the mathematical constant e to the power (exponent) of a numeric value and returns a value of type DOUBLE. Example: EXP(Value)

FLOOR returns the largest integer that is less than or equal to the input argument. Example: FLOOR(32.6789) returns 32.0

HASH evenly partitions data values into the specified number of buckets. Example: HASH(username,20)

LN returns the natural logarithm of a number. Example: LN(2.718281828) returns 1

MOD divides two LONG values and returns the remainder value of type LONG. Example: MOD(TO_LONG(file_size),1024)

POW raises a numeric value to the power (exponent) of another numeric value and returns a value of type DOUBLE. Example: 100 * POW(end_value/start_value, 0.2) - 1

ROUND rounds a DOUBLE value to the specified number of decimal places. Example: ROUND(32.4678954,2) returns 32.47


Data Type Conversion Functions

Data type conversion functions allow you to cast data values from one data type to another. These functions are used implicitly whenever you set the data type of a field or column in the Platfora user interface. The supported data types are: INTEGER, LONG, DOUBLE, FIXED, DATETIME, and STRING.

Function Description Example

EPOCH_MS_TO_DATE converts LONG values to DATETIME values, where the input number represents the number of milliseconds since the epoch. Example: EPOCH_MS_TO_DATE(1360260240000) returns 2013-02-07T18:04:00:000Z

TO_FIXED converts STRING, INTEGER, LONG, or DOUBLE values to fixed-decimal values. Example: TO_FIXED(opening_price)

TO_DATE converts STRING values to DATETIME values, and specifies the format of the date and time elements in the string. Example: TO_DATE(order_date,"yyyy.MM.dd 'at' HH:mm:ss z")

TO_DOUBLE converts STRING, INTEGER, LONG, or DOUBLE values to DOUBLE (decimal) values. Example: TO_DOUBLE(average_rating)

TO_INT converts STRING, INTEGER, LONG, or DOUBLE values to INTEGER (whole number) values. Example: TO_INT(average_rating)

TO_LONG converts STRING, INTEGER, LONG, or DOUBLE values to LONG (whole number) values. Example: TO_LONG(average_rating)

TO_STRING converts values of other data types to STRING (character) values. Example: TO_STRING(sku_number)


Aggregate Functions

An aggregate function groups the values of multiple rows together based on some defined input expression. Aggregate functions return one value for a group of rows, and are only valid for defining measures in Platfora. In the dataset, measures can be defined using any of the aggregate functions. In the vizboard, only the DISTINCT, MAX, or MIN aggregate functions are allowed.

Function Description Example

AVG returns the average of all valid numeric values. Example: AVG(sale_amount)

COUNT returns the number of rows in a dataset. Example: COUNT(sales.customers)

COUNT_VALID returns the number of rows for which the given expression is valid. Example: COUNT_VALID(page_views)

DISTINCT returns the number of distinct values for the given expression. Example: DISTINCT(user_id)

MAX returns the biggest value from the given input expression. Example: MAX(sale_amount)

MIN returns the smallest value from the given input expression. Example: MIN(sale_amount)

SUM returns the total of all values from the given input expression. Example: SUM(sale_amount)

STDDEV calculates the population standard deviation for a group of numeric values. Example: STDDEV(sale_amount)

VARIANCE calculates the population variance for a group of numeric values. Example: VARIANCE(sale_amount)

ROLLUP and Window Functions

ROLLUP is a modifier to an aggregate expression that turns an aggregate into a windowed aggregate. Window functions (RANK, DENSE_RANK and NTILE) can only be used within a ROLLUP statement. The ROLLUP statement defines the partitioning and ordering of a rowset before the associated aggregate function or window function is applied.


ROLLUP defines a window or user-specified set of rows within a query result set. A window function then computes a value for each row in the window. You can use window functions to compute aggregated values such as moving averages, cumulative aggregates, running totals, or top N per group results.

ROLLUP statements can be specified in either the dataset or the vizboard. When using a ROLLUP in a vizboard, the measure for which you are calculating the ROLLUP must already exist in the lens you are using in the vizboard.

Function Description Example

DENSE_RANK assigns the rank (position) of each row in a group (partition) of rows and does not skip rank numbers in the event of a tie. Example: ROLLUP DENSE_RANK() TO () ORDER BY ([Sales(Sum)] DESC) ROWS UNBOUNDED PRECEDING

NTILE divides a partitioned group of rows into the specified number of buckets, and returns the bucket number to which the current row belongs. Example: ROLLUP NTILE(100) TO () ORDER BY ([Total Records] DESC) ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING

RANK assigns the rank (position) of each row in a group (partition) of rows and skips rank numbers in the event of a tie. Example: ROLLUP RANK() TO () ORDER BY ([Sales(Sum)] DESC) ROWS UNBOUNDED PRECEDING

ROLLUP a modifier to an aggregate function that turns a regular aggregate function into a windowed, partitioned, or adaptive aggregate function. Example: 100 * COUNT(Flights) / ROLLUP COUNT(Flights) TO ([Departure Date])

ROW_NUMBER assigns a unique, sequential number to each row in a group (partition) of rows. Example: ROLLUP ROW_NUMBER() TO (Quarter) ORDER BY (Sum_Sales DESC) ROWS UNBOUNDED PRECEDING


User Defined Functions

User defined functions (UDFs) allow you to define your own per-row processing logic, and then expose that functionality to users in the Platfora application expression builder. See User Defined Functions (UDFs) for more information.

Comparison Operators

Comparison operators are used to compare the equivalency of two expressions of the same data type. The result of a comparison expression is a Boolean value (returns 1 for true, 0 for false, or NULL for invalid). Boolean expressions are most often used to specify data processing conditions or filters.

Operator Meaning Example Expression

= or == Equal to. Example: order_date = "12/22/2011"

> Greater than. Example: age > 18

!> Not greater than (equivalent to <=). Example: age !> 8

< Less than. Example: age < 30

!< Not less than (equivalent to >=). Example: age !< 12

>= Greater than or equal to. Example: age >= 20

<= Less than or equal to. Example: age <= 29

<> or != or ^= Not equal to. Example: age <> 30

BETWEEN min_value AND max_value Test whether a date or numeric value is within the min and max values (inclusive). Example: year BETWEEN 2000 AND 2012

IN(list) Test whether a value is within a set. Example: product_type IN("tablet","phone","laptop")

LIKE("pattern") Simple inclusive case-insensitive character pattern matching. The * character matches any number of characters. The ? character matches exactly one character. Examples: last_name LIKE("?utch*") matches Kutcher, hutch but not Krutcher or crutch; company_name LIKE("platfora") matches Platfora or platfora


value IS NULL Check whether a field value or expression is null (empty). Example: ship_date IS NULL evaluates to true when the ship_date field is empty

Logical Operators

Logical operators are used to define Boolean (true / false) expressions. Logical operators are used in expressions to test for a condition, and return 1 if the condition is true or 0 if it is false. Logical operators are often used in lens filters, CASE expressions, PARTITION expressions, and WHERE clauses of queries.

Operator Meaning Example Expression

AND Test whether two conditions are true.

OR Test if either of two conditions is true.

NOT Reverses the value of other operators. Examples:

• year NOT BETWEEN 2000 AND 2012

• first_name NOT LIKE("Jo?n*") excludes John, jonny but not Jon or Joann

• Date.Weekday NOT IN("Saturday","Sunday")

• purchase_date IS NOT NULL evaluates to true when the purchase_date field is not empty

Arithmetic Operators

Arithmetic operators perform basic math operations on two expressions of the same data type, resulting in a numeric value. The plus (+) and minus (-) operators can also be used to perform arithmetic operations on DATETIME values.

Operator Description Example

+ Addition. Example: amount + 10 (add 10 to the value of the amount field)

- Subtraction. Example: amount - 10 (subtract 10 from the value of the amount field)

* Multiplication. Example: amount * 100 (multiply the value of the amount field by 100)

/ Division. Example: bytes / 1024 (divide the value of the bytes field by 1024 and return the quotient)

Comparison Operators

Comparison operators are used to compare the equivalency or inequivalency of two expressions of the same data type. The result of a comparison expression is a Boolean value (returns 1 for true, 0 for false, or NULL for invalid). Boolean expressions are most often used to specify data processing conditions or filter criteria.

Operator Definitions

Operator Meaning Example Expression

= or == Equal to. Example: order_date = "12/22/2011"

> Greater than. Example: age > 18

!> Not greater than (equivalent to <=). Example: age !> 8

< Less than. Example: age < 30

!< Not less than (equivalent to >=). Example: age !< 12

>= Greater than or equal to. Example: age >= 20

<= Less than or equal to. Example: age <= 29

<> or != or ^= Not equal to. Example: age <> 30


BETWEEN min_value AND max_value Test whether a date or numeric value is within the min and max values (inclusive). Example: year BETWEEN 2000 AND 2012

IN(list) Test whether a value is within a set. Example: product_type IN("tablet","phone","laptop")

LIKE("pattern") Simple inclusive case-insensitive character pattern matching. The * character matches any number of characters. The ? character matches exactly one character. Examples: last_name LIKE("?utch*") matches Kutcher, hutch but not Krutcher or crutch; company_name LIKE("platfora") matches Platfora or platfora

value IS NULL Check whether a field value or expression is null (empty). Example: ship_date IS NULL evaluates to true when the ship_date field is empty

If you are writing queries with REST and the query string includes an = (equal) character, you must URL encode it as %3D. Failure to encode the character can result in this error:

string matching regex `(?i)\Qnot\E\b' expected but end of source found.

Logical Operators

Logical operators are used to define Boolean (true / false) expressions. Logical operators are used in expressions to test for a condition, and return 1 if the condition is true or 0 if it is false. Logical operators are often used in CASE expressions, PARTITION expressions, and WHERE clauses of queries.

Operator Meaning Example Expression

AND Test whether two conditions are true.

OR Test if either of two conditions is true.


NOT Reverses the value of other operators. Examples:

• year NOT BETWEEN 2000 AND 2012

• first_name NOT LIKE("Jo?n*") excludes John, jonny but not Jon or Joann

• Date.Weekday NOT IN("Saturday","Sunday")

• purchase_date IS NOT NULL evaluates to true when the purchase_date field is not empty

Arithmetic Operators

Arithmetic operators perform basic math operations on two expressions of the same data type, resulting in a numeric value. The plus (+) and minus (-) operators can also be used to perform arithmetic operations on DATETIME values.

Operator Description Example

+ Addition. Example: amount + 10 (add 10 to the value of the amount field)

- Subtraction. Example: amount - 10 (subtract 10 from the value of the amount field)

* Multiplication. Example: amount * 100 (multiply the value of the amount field by 100)

/ Division. Example: bytes / 1024 (divide the value of the bytes field by 1024 and return the quotient)


Conditional and NULL Processing

Conditional and NULL processing allows you to transform or manipulate data values based on certain defined conditions. Conditional processing (CASE) can be done at either the dataset or vizboard level. NULL processing (COALESCE and IS_VALID) is only applicable at the dataset level. During a lens build, any NULL values in the source data are converted to default values, so lenses and vizboards have no concept of NULL values.

CASE

CASE is a row function that evaluates each row in the dataset according to one or more input conditions, and outputs the specified result when the input conditions are met.

Syntax

CASE WHEN input_condition [AND|OR input_condition] THEN output_expression [...] [ELSE other_output_expression] END

Return Value

Returns one value per row of the same type as the output expression. All output expressions must return the same data type.

If there are multiple output expressions that return different data types, then you will need to enclose your entire CASE expression in one of the data type conversion functions to explicitly cast all output values to a particular data type.
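For example, a CASE expression whose branches return a STRING and an INTEGER could be wrapped in a conversion function. The following expression is an illustrative sketch using a hypothetical age field:

TO_STRING(CASE WHEN age < 21 THEN "minor" ELSE age END)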

Input Parameters

WHEN input_condition

Required. The WHEN keyword is used to specify one or more Boolean expressions (see Platfora's supported conditional operators). If an input value meets the condition, then the output expression is applied. Input conditions can include other row functions in their expression, but cannot contain aggregate functions or measure expressions. You can use the AND or OR keywords to combine multiple input conditions.

THEN output_expression

Required. The THEN keyword is used to specify an output expression when the specified conditions are met. Output expressions can include other row functions in their expression, but cannot contain aggregate functions or measure expressions.

ELSE other_output_expression

Optional. The ELSE keyword can be used to specify an alternate output expression to use when the specified conditions are not met. If an ELSE expression is not supplied, ELSE NULL is the default.

END

Required. Denotes the end of CASE function processing.


Examples

Convert values in the age column into range-based groupings (binning):

CASE WHEN age <= 25 THEN "0-25" WHEN age <= 50 THEN "26-50" ELSE "over 50" END

Transform values in the gender column from one string to another:

CASE WHEN gender = "M" THEN "Male" WHEN gender = "F" THEN "Female" ELSE"Unknown" END

The vehicle column contains the following values: truck, bus, car, scooter, wagon, bike, tricycle, and motorcycle. The following example converts multiple values in the vehicle column into a single value:

CASE WHEN vehicle IN ("bike","scooter","motorcycle") THEN "two-wheelers" ELSE "other" END

COALESCE

COALESCE is a row function that returns the first valid value (NOT NULL value) from a comma-separated list of expressions.

Syntax

COALESCE(expression[,expression][,...])

Return Value

Returns one value per row of the same type as the first valid input expression.

Input Parameters

expression

At least one required. A field name or expression.

Examples

The following example shows an expression to calculate employee yearly income for exempt employees that have a salary and non-exempt employees that have an hourly_wage. This expression checks the values of both fields for each row, and returns the value of the first expression that is valid (NOT NULL).

COALESCE(hourly_wage * 40 * 52, salary)

IS_VALID

IS_VALID is a row function that returns 0 if the returned value is NULL, and 1 if the returned value is NOT NULL. This is useful for computing other calculations where you want to exclude NULL values (such as when computing averages).

Syntax

IS_VALID(expression)


Return Value

Returns 0 if the returned value is NULL, and 1 if the returned value is NOT NULL.

Input Parameters

expression

Required. A field name or expression.

Examples

Define a computed field using IS_VALID. This returns a row count only for the rows where this field value is NOT NULL. If a value is NULL, it returns 0 for that row. In this example, we create a computed field (sale_amount_not_null) using the sale_amount field as the basis.

IS_VALID(sale_amount)

Then you can use the sale_amount_not_null computed field to calculate an accurate average for sale_amount that excludes NULL values:

SUM(sale_amount)/SUM(sale_amount_not_null)

This is what happens automatically when you use the AVG function.

Event Series Processing

Event series processing allows you to partition rows of input data, order the rows sequentially (typically by a timestamp), and search for matching patterns in a set of rows. Computed fields that are defined in a dataset using a PARTITION expression are considered event series processing computed fields. Event series processing computed fields are processed differently than regular computed fields. Instead of computing values from the input of a single row, they compute values from inputs of multiple rows in the dataset. Event series processing computed fields can only be defined in the dataset - not in the vizboard or a lens query.

PARTITION

PARTITION is an event series processing language statement that partitions the rows of a dataset, orders the rows sequentially (typically by a timestamp), and searches for matching patterns in a set of rows. Computed fields that are defined in a dataset using a PARTITION expression are considered event series processing computed fields. Event series processing computed fields are processed differently than regular computed fields. Instead of computing values from the input of a single row, they compute values from inputs of multiple rows in the dataset.

The PARTITION function can only be used to define a computed field in the dataset definition (pre-lens build). PARTITION cannot be used to define a vizboard computed field. Unlike other expressions, a PARTITION expression cannot be embedded within other functions or expressions - it must be a top-level expression.

Syntax

PARTITION BY field_name
ORDER BY field_name [ASC|DESC]
PATTERN (pattern_expression)
DEFINE symbol_1 AS filter_expression
    [, symbol_n AS filter_expression] [, ...]
OUTPUT output_expression

Description

To understand how event series processing works, we'll walk through a simple example of a PARTITION expression.

This is a simple example of some weblog page view data. Each row represents a page view by a user at a given point in time. Session IDs are used to group together page views that happened in the same user session:

Suppose you wanted to know how many sessions included the path of page visits to ‘home.html’ then ‘products.html’ then ‘checkout.html’. You could define a PARTITION expression that groups the rows by session, orders by time, and then iterates through the rows from top to bottom to find sessions that match the pattern:

PARTITION BY SessionID
ORDER BY Timestamp
PATTERN (A,B,C)
DEFINE A AS Page = "home.html",
       B AS Page = "product.html",
       C AS Page = "checkout.html"
OUTPUT "TRUE"

1. The PARTITION BY clause partitions (or groups) the rows of the dataset by session.

2. Within each partition, the ORDER BY clause sorts the rows by time (in ascending order by default).

3. Each DEFINE clause specifies a condition used to evaluate a row, and binds that condition to a symbol that is then used in the PATTERN clause.

4. The PATTERN clause checks if the conditions are met in the specified order and frequency. This pattern says that there is a match whenever there are 3 consecutive rows that meet criteria A then B then C.

5. For a row that satisfies all of the PATTERN criteria, the value of the OUTPUT clause is applied. Otherwise the output is NULL for rows that don’t meet all of the PATTERN criteria.

Return Value

Returns one value per row of the same type as the output_expression for rows that match the defined match pattern, otherwise returns NULL for rows that do not match the pattern.

Input Parameters

PARTITION BY field_name

Required. The PARTITION BY clause is used to specify a field in the current dataset by which to partition the rows. Rows that share the same value for this field will be grouped together, and each group will then be processed independently according to the matching pattern criteria.

The partition field cannot be a field of a referenced dataset; it must be a field in the current focus dataset.

ORDER BY field_name

Optional. The ORDER BY clause specifies a field by which to sort the rows within each partition before applying the match pattern criteria. For event series processing, records are typically ordered by a DATETIME type field, such as a date or a timestamp. The default sort order is ascending (first to last or low to high).

The ordering field cannot be a field of a referenced dataset; it must be a field in the current focus dataset.

PATTERN (pattern_expression)

Required. The PATTERN clause specifies the matching pattern to search for within a partition of rows. The pattern_expression is expressed in a format similar to a regular expression. The pattern_expression can include:

• A symbol that represents some match criteria (as declared in the DEFINE clause).

• A symbol followed by one of the following regex quantifiers:

  ? (matches once or not at all - greedy construct)
  ?? (matches once or not at all - reluctant construct)
  * (matches zero or more times - greedy construct)
  *? (matches zero or more times - reluctant construct)
  + (matches one or more times - greedy construct)
  +? (matches one or more times - reluctant construct)
  ** (matches the empty sequence, or one or more of the quantified symbol, with gaps allowed in between; the match need not begin or end with the quantified symbol)
  *+ (matches the empty sequence, or one or more of the quantified symbol, with gaps allowed in between; the match must end with the quantified symbol)
  ++ (matches the quantified symbol, followed by zero or more of the quantified symbol, with gaps allowed in between; the match must end with the quantified symbol)
  +* (matches the quantified symbol, followed by zero or more of the quantified symbol, with gaps allowed in between; the match need not end with the quantified symbol)

• A symbol or pattern of symbols anchored by the regex special characters for the beginning of string.

^ (marks the beginning of the set of rows that match to the pattern)


• patternA|patternB - The alternation operator (pipe symbol) between two symbols or patterns signifies an OR match.

• patternA,patternB - The concatenation operator (comma) between two symbols or patterns signifies a match when pattern B immediately follows pattern A.

• patternA->patternB - The follows operator (minus and greater-than sign) between two symbols or patterns signifies a match when pattern B eventually follows pattern A.

• (pattern_expression) - By default, pattern expressions are matched from left to right. If parentheses are used to group sub-expressions, the sub-expression within the parentheses is evaluated first.

You cannot use quantifiers outside of parentheses. For example, you cannot write ((A,B,C)*) to indicate that the asterisk quantifier applies to the whole (A,B,C) expression.
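For example, the following sketch (the page names are hypothetical, reusing the SessionID, Timestamp, and Page fields from the weblog example) combines concatenation with the * quantifier to match sessions that went from the home page, through zero or more product pages, directly to checkout:

PARTITION BY SessionID
ORDER BY Timestamp
PATTERN (Home, Product*, Checkout)
DEFINE Home AS Page = "home.html",
       Product AS Page = "products.html",
       Checkout AS Page = "checkout.html"
OUTPUT "TRUE"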

DEFINE symbol AS filter_expression

Required. The DEFINE clause is used to enumerate symbols used in the PATTERN clause (or in the filter_expression of a subsequent symbol definition).

A symbol is a name used to refer to some pattern matching criteria. This can be any name or token that follows Platfora's object naming rules. For example, if the name contains spaces, special characters, keywords, or starts with a number, you must enclose the name in brackets [] to escape it. Otherwise, this can be any logical name that helps you identify a piece of pattern matching logic in your expression.
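For example, a symbol name containing spaces must be escaped in brackets (a hypothetical sketch; the field and page names are placeholders):

PARTITION BY SessionID
ORDER BY Timestamp
PATTERN ([Add To Cart]+)
DEFINE [Add To Cart] AS Page = "cart.html"
OUTPUT COUNT([Add To Cart])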

The filter_expression is a Boolean (true or false) expression that operates on each row of the partition.

A filter_expression can contain:

• The special expression TRUE or 1, meaning allow the match to occur for any row in the partition.

• Any field_name in the current dataset.

• symbol.field_name - A field from the dataset qualified by the name of a symbol that (1) appears only once in the PATTERN clause, (2) precedes this symbol in the PATTERN clause, and (3) is not followed by a repetition quantifier in the PATTERN clause.

For example:

PATTERN (A, B) DEFINE A AS TRUE, B AS product = A.product

This means that the expression for symbol B will match to a row if the product field for that row is also equal to the product field for the row that is bound to symbol A.

• Any of the comparison operators, such as greater than, less than, equals, and so on.

• The keywords AND or OR (for combining multiple criteria in a single filter expression)

• FIRST|LAST(symbol.field_name) - A field from the dataset, qualified by the name of a symbol that (1) only appears once in the PATTERN clause, (2) precedes this symbol in the PATTERN clause, and (3) is followed by a repetition quantifier in the PATTERN clause (*, *?, +, or +?). This returns the field value for the first or last row when the pattern matches to a set of rows. For example:

PATTERN (A+) DEFINE A AS product = FIRST(A.product) OR COUNT(A)=0

The pattern A+ will match to a series of consecutive rows that all have the same value for the product field as the first row in the sequence. If the current row happens to be the first row in the sequence, then it will also be included in the match.

A FIRST or LAST expression evaluates to NULL if it refers to a symbol that ends up matching an empty sequence. Make sure your expression handles the row at the beginning or end of a sequence if you want that row to match as well.

• Any computed expression that operates on the fields or expressions listed above and/or on literal values.

OUTPUT output_expression

Required. An expression that specifies what the output value should be. The output expression can refer to:

• The field declared in the PARTITION BY clause.

• symbol.field_name - A field from the dataset, qualified by the name of a symbol that (1) appears only once in the PATTERN clause, and (2) is not followed by a repetition quantifier in the PATTERN clause. This will output the matching field value.

• COUNT(symbol) where symbol (1) appears only once in the PATTERN clause, and (2) is followed by a repetition quantifier in the PATTERN clause. This will output the sequence number of the row that matched the symbol pattern.

• FIRST | LAST | SUM | COUNT | AVG(symbol.field_name) where symbol (1) appears only once in the PATTERN clause, and (2) is followed by a repetition quantifier in the PATTERN clause. This will output an aggregated value for a set of rows that matched the symbol pattern.

• Since you can only output a single column value, you can use the PACK_VALUES function to output multiple results in a single column as key/value pairs.

Examples

'Session Start Time' Expression

Calculate a user session by partitioning by user and ordering by time. The matching logic represented by symbol A checks if the time of the current row is less than 30 minutes from the preceding row. If it is, then it is considered part of the same session as the previous row. Otherwise, the current row is considered the start of a new session. The PATTERN (A+) means that the matching logic represented by symbol A must be true for one or more consecutive rows. The output then returns the time of the first row in a session.

PARTITION BY UserID
ORDER BY Timestamp
PATTERN (A+)
DEFINE A AS COUNT(A)=0 OR MINUTES_BETWEEN(Timestamp,LAST(A.Timestamp)) < 30
OUTPUT FIRST(A.Timestamp)
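A variant of this expression (a sketch, not from the original guide) returns the session end time instead by using LAST in the OUTPUT clause:

PARTITION BY UserID
ORDER BY Timestamp
PATTERN (A+)
DEFINE A AS COUNT(A)=0 OR MINUTES_BETWEEN(Timestamp,LAST(A.Timestamp)) < 30
OUTPUT LAST(A.Timestamp)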

'Click Number in Session' Expression

Calculate where a click happened in a session by partitioning by session and ordering by time. The matching logic represented by symbol A simply matches to any row in the session. The PATTERN (A+) means that the matching logic represented by symbol A must be true for one or more consecutive rows. The output then returns the count of the row within the partition (based on its order or position in the partition).

PARTITION BY [Session ID]
ORDER BY Timestamp
PATTERN (A+)
DEFINE A AS TRUE
OUTPUT COUNT(A)

'Path to Page' Expression

This is a complicated expression that looks back from the current row's position to determine the previous 4 pages viewed in a session. Since a PARTITION expression can only output one column value as its result, the OUTPUT clause uses the PACK_VALUES function to return the previous page positions 1, 2, 3, and 4 in one output value. You can then use a series of EXTRACT_VALUE expressions to create individual columns for each prior page view in the path.

PARTITION BY SessionID
ORDER BY Timestamp
PATTERN (^OtherPreviousPages*?, Page4Back??, Page3Back??, Page2Back??, Page1Back??, CurrentPage)
DEFINE OtherPreviousPages AS TRUE,
       Page4Back AS TRUE,
       Page3Back AS TRUE,
       Page2Back AS TRUE,
       Page1Back AS TRUE,
       CurrentPage AS TRUE
OUTPUT PACK_VALUES("Back4",Page4Back.Page, "Back3",Page3Back.Page, "Back2",Page2Back.Page, "Back1",Page1Back.Page)

‘Page -1 Back’ Expression

Use the output from the Path to Page expression and extract the last page viewed before the current page.

EXTRACT_VALUE([Path to Page],"Back1")


PACK_VALUES

PACK_VALUES is a row function that returns multiple output values packed into a single string of key/value pairs separated by the Platfora default key and pair separators. This is useful when the OUTPUT clause of a PARTITION expression returns multiple output values. The string returned is in a format that can be read by the EXTRACT_VALUE function. PACK_VALUES uses the same key and pair separator values that EXTRACT_VALUE uses (the Unicode escape sequences \u0003 and \u0002, respectively).

Syntax

PACK_VALUES(key_string,value_expression[,key_string,value_expression][,...])

Return Value

Returns one value per row of type STRING. If the value for either key_string or value_expression of a pair is null or contains either of the two separators, the full key/value pair is omitted from the return value.

Input Parameters

key_string

At least one required. A field name of any type, a literal string or number, or an expression that returns any value.

value_expression

At least one required. A field name of any type, a literal string or number, or an expression that returns any value. The expression must include one value_expression instance for each key_string instance.

Examples

Combine the values of the custid and age fields into a single string field.

PACK_VALUES("ID",custid,"Age",age)

The following expression returns ID\u00035555\u0002Age\u000329 when the value of the custid field is 5555 and the value of the age field is 29:

PACK_VALUES("ID",custid,"Age",age)

The following expression returns Age\u000329 when the value of the age field is 29:

PACK_VALUES("ID",NULL,"Age",age)

The following expression returns 29 as a STRING value when the age field is an INTEGER and its value is 29:

EXTRACT_VALUE(PACK_VALUES("ID",custid,"Age",age),"Age")

You might want to use the PACK_VALUES function to combine multiple field values into a single value in the OUTPUT clause of the PARTITION (event series processing) function. Then you can use the EXTRACT_VALUE function in a different computed field in the dataset to get one of the values returned by the PARTITION function. For example, in the example below, the PARTITION function creates a set of rows that defines the previous five web pages accessed in a particular user session:

PARTITION BY Session
ORDER BY Time DESC
PATTERN (A?, B?, C?, D?, E)
DEFINE A AS true, B AS true, C AS true, D AS true, E AS true
OUTPUT PACK_VALUES("A", A.Page, "B", B.Page, "C", C.Page, "D", D.Page)

String Functions

String functions allow you to manipulate and transform textual data, such as combining string values or extracting a portion of a string value.

CONCAT

CONCAT is a row function that returns a string by concatenating (combining together) the results of multiple string expressions.

Syntax

CONCAT(value_expression[,value_expression][,...])

Return Value

Returns one value per row of type STRING.

Input Parameters

value_expression

At least one required. A field name of any type, a literal string or number, or an expression that returns any value.

Examples

Combine the values of the month, day, and year fields into a single date field formatted as MM/DD/YYYY.

CONCAT(month,"/",day,"/",year)

ARRAY_CONTAINS

ARRAY_CONTAINS is a row function that performs a whole string match against a string containing delimited values and returns a 1 or 0 depending on whether or not the string contains the search value.

Syntax

ARRAY_CONTAINS(array_string,"delimiter","search_string")


Return Value

Returns one value per row of type INTEGER. A return value of 1 indicates a positive match, and a return value of 0 indicates no match.

Input Parameters

array_string

Required. The name of a field or expression of type STRING (or a literal string) that contains a valid array.

delimiter

Required. The delimiter used between values in the array string. This can be a name of a field or expression of type STRING.

search_string

Required. The literal string that you want to search for. This can be a name of a field or expression of type STRING.

Examples

If you had a device field that contained a comma delimited list formatted like this:

Safari,iPad

You could determine whether or not the device used was an iPad using the following expression:

ARRAY_CONTAINS(device,",","iPad")

The following expressions return 1:

ARRAY_CONTAINS("platfora","|","platfora")

ARRAY_CONTAINS("platfora|hadoop|2.3","|","hadoop")

The following expressions return 0:

ARRAY_CONTAINS("platfora","|","plat")

ARRAY_CONTAINS("platfora,hadoop","|","platfora")

FILE_NAME

FILE_NAME is a row function that returns the original file name from the source file system. This is useful when the source data that comprises a dataset comes from multiple files, and there is useful information in the file names themselves (such as dates or server names). You can use FILE_NAME in combination with other string processing functions to extract useful information from the file name.

Syntax

FILE_NAME()


Return Value

Returns one value per row of type STRING.

Examples

Your dataset is based on daily log files that use an 8 character date as part of the file name. For example, 20120704.log is the file name used for the log file created on July 4, 2012. The following expression uses FILE_NAME in combination with SUBSTRING and TO_DATE to create a date field from the first 8 characters of the file name.

TO_DATE(SUBSTRING(FILE_NAME(),0,8),"yyyyMMdd")

Your dataset is based on log files that use the server IP address as part of the file name. For example, 172.12.131.118.log is the log file name for server 172.12.131.118. The following expression uses FILE_NAME in combination with REGEX to extract the IP address from the file name.

REGEX(FILE_NAME(),"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\.log")

FILE_PATH

FILE_PATH is a row function that returns the full URI path from the source file system. This is useful when the source data that comprises a dataset comes from multiple files, and there is useful information in the directory names or file names themselves (such as dates or server names). You can use FILE_PATH in combination with other string processing functions to extract useful information from the file path.

Syntax

FILE_PATH()

Return Value

Returns one value per row of type STRING.

Examples

Your dataset is based on daily log files that are organized into directories by date on the source file system, and the file names are the server IP address of the server that produced the log file. For example, the URI path to a log file produced by server 172.12.131.118 on July 4, 2012 is hdfs://myhdfs-server.com/data/logs/20120704/172.12.131.118.log.

The following expression uses FILE_PATH in combination with REGEX and TO_DATE to create a date field from the date directory name.

TO_DATE(REGEX(FILE_PATH(),"hdfs://myhdfs-server.com/data/logs/(\d{8})/(?:\d{1,3}\.*)+\.log"),"yyyyMMdd")

And the following expression uses FILE_NAME and REGEX to extract the server IP address from the filename:

REGEX(FILE_NAME(),"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\.log")


EXTRACT_COOKIE

EXTRACT_COOKIE is a row function that extracts the value of the given cookie identifier from a semi-colon delimited list of cookie key=value pairs. This function can be used to extract a particular cookie value from a combined web access log Cookie column.

Syntax

EXTRACT_COOKIE("cookie_list_string",cookie_key_string)

Return Value

Returns the value of the specified cookie key as type STRING.

Input Parameters

cookie_list_string

Required. A field or literal string that has a semi-colon delimited list of cookie key=value pairs.

cookie_key_string

Required. The cookie key name for which to extract the cookie value.

Examples

Extract the value of the vID cookie from a literal cookie string:

EXTRACT_COOKIE("SSID=ABC; vID=44", "vID") returns 44

Extract the value of the vID cookie from a field named Cookie:

EXTRACT_COOKIE(Cookie,"vID")

EXTRACT_VALUE

EXTRACT_VALUE is a row function that extracts the value for the given key from a string containing delimited key/value pairs.

Syntax

EXTRACT_VALUE(string,key_name [,delimiter] [,pair_delimiter])

Return Value

Returns the value of the specified key as type STRING.

Input Parameters

string

Required. A field or literal string that contains a delimited list of key/value pairs.

key_name

Required. The key name for which to extract the value.


delimiter

Optional. The delimiter used between the key and the value. If not specified, the value \u0003 is used. This is the Unicode escape sequence for the start of text character (which is the default delimiter used by Hive).

pair_delimiter

Optional. The delimiter used between key/value pairs when the input string contains more than one key/value pair. If not specified, the value \u0002 is used. This is the Unicode escape sequence for the end of text character (which is the default delimiter used by Hive).
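Because these defaults are the same separators that PACK_VALUES writes, a string produced by PACK_VALUES can be read back without specifying either delimiter. For example (reusing the custid and age fields from the PACK_VALUES examples):

EXTRACT_VALUE(PACK_VALUES("ID",custid,"Age",age),"ID")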

Examples

Extract the value of the lastname key from a literal string of key/value pairs:

EXTRACT_VALUE("firstname;daria|lastname;hutch","lastname",";","|")returns hutch

Extract the value of the email key from a string field named contact_info that contains strings in the format of key:value,key:value:

EXTRACT_VALUE(contact_info,"email",":",",")

INSTR

INSTR is a row function that returns an integer indicating the position of a character within a string that is the first character of the occurrence of a substring. Platfora's INSTR function is similar to the FIND function in Excel, except that the first letter is position 0 and the order of the arguments is reversed.

Syntax

INSTR(string,substring,position,occurrence)

Return Value

Returns one value per row of type INTEGER. The first position is indicated with the value of zero (0).

Input Parameters

string

Required. The name of a field or expression of type STRING (or a literal string).

substring

Required. A literal string or name of a field that specifies the substring to search for in string. Note that to search for the double quotation mark ( " ) as a literal string, you must escape it with another double quotation mark: ""

position

Optional. An integer that specifies at which character in string to start searching for substring. A value of 0 (zero) starts the search at the beginning of string. Use a positive integer to start searching from the beginning of string, and use a negative integer to start searching from the end of string. When no position is specified, INSTR searches at the beginning of the string (0).

occurrence

Optional. A positive integer that specifies which occurrence of substring to search for. When no occurrence is specified, INSTR searches for the first occurrence of the substring (1).

Examples

Return the position of the first occurrence of the substring "http://" starting at the end of the url field:

INSTR(url,"http://",-1,1)

The following expression searches for the second occurrence of the substring "st" starting at the beginning of the string "bestteststring". INSTR finds that the substring starts at the seventh character in the string, so it returns 6:

INSTR("bestteststring","st",0,2)

The following expression searches backward for the second occurrence of the substring "st" starting at 7 characters before the end of the string "bestteststring". INSTR finds that the substring starts at the third character in the string, so it returns 2:

INSTR("bestteststring","st",-7,2)

JAVA_STRING

JAVA_STRING is a row function that returns the unescaped version of a Java unicode character escape sequence as a string value. This is useful when you want to specify unicode characters in an expression. For example, you can use JAVA_STRING to specify the unicode value representing a control character.

Syntax

JAVA_STRING(unicode_escape_sequence)

Return Value

Returns the unescaped version of the specified unicode character, one value per row of type STRING.

Input Parameters

unicode_escape_sequence

Required. A STRING value containing a unicode character expressed as a Java unicode escape sequence. Unicode escape sequences consist of a backslash '\' (ASCII character 92, hex 0x5c), a 'u' (ASCII 117, hex 0x75), optionally one or more additional 'u' characters, and four hexadecimal digits (the characters '0' through '9' or 'a' through 'f' or 'A' through 'F'). Such sequences represent the UTF-16 encoding of a Unicode character. For example, the letter 'a' is equivalent to '\u0061'.

Examples

Evaluates whether the currency field is equal to the yen symbol.


CASE WHEN currency == JAVA_STRING("\u00a5") THEN "yes" ELSE "no" END
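JAVA_STRING is also handy for passing non-printable characters to other functions that accept an expression as a delimiter. For example (a sketch; device_list is a hypothetical tab-delimited field), check a tab-delimited list with ARRAY_CONTAINS:

ARRAY_CONTAINS(device_list,JAVA_STRING("\u0009"),"iPad")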

JOIN_STRINGS

JOIN_STRINGS is a row function that returns a string by concatenating (combining together) the results of multiple values with the separator in between each non-null value.

Syntax

JOIN_STRINGS(separator,value_expression[,value_expression][,...])

Return Value

Returns one value per row of type STRING.

Input Parameters

separator

Required. A field name of type STRING, a literal string, or an expression that returns a string.

value_expression

At least one required. A field name of any type, a literal string or number, or an expression that returns any value.

Examples

Combine the values of the month, day, and year fields into a single date field formatted as MM/DD/YYYY.

JOIN_STRINGS("/",month,day,year)

The following expression returns NULL:

JOIN_STRINGS("+",NULL,NULL,NULL)

The following expression returns a+b:

JOIN_STRINGS("+","a","b",NULL)

JSON_ARRAY

JSON_ARRAY is a row function that extracts a JSON ARRAY as a STRING value from a field in a JSON object.

Syntax

JSON_ARRAY(json_string,"json_field")

Return Value

Returns one value per row of type STRING.


Input Parameters

json_string

Required. The name of a field or expression of type STRING (or a literal string) that contains a valid JSON object.

json_field

Required. The key or name of the field value you want to extract.

For top-level fields, specify the name identifier (key) of the field.

To access fields within a nested object, specify a dot-separated path of field names (for example top_level_field_name.nested_field_name).

To extract a value from an array, specify the dot-separated path of field names and the array position starting at 0 for the first value in an array, 1 for the second value, and so on (for example, field_name.0).

If the name identifier contains dot or period characters within the name itself, escape the name by enclosing it in brackets (for example, [field.name.with.dot].[another.dot.field.name]).

If the field name is null (empty), use brackets with nothing in between as the identifier, for example [].

Examples

If you had a friends field that contained a JSON object formatted like this:

{"f1":[{"id":0,"name":"Brenda Griffin","f2":{"id":0,"name":"Bowen Blair","f3":{"id":0,"name":"Maude Hoffman"}}},{"f4":{"f5":1}}]}

You could extract the f1 value using the following expression:

JSON_ARRAY(friends, "f1")

This expression would return the following value:

[{"id":0,"name":"Brenda Griffin","f2":{"id":0,"name":"Bowen Blair","f3":{"id":0,"name":"Maude Hoffman"}}},{"f4":{"f5":1}}]

Suppose you have a field called json_field that contains the following value:

{"int":10, "string": "hello world", "array": [1,2,3], "object": {"key":"value" }, "nested": [{"nkey":"nvalue"},[4,5,6]]}

The following expressions return the following results:

Expression                              Return Value

JSON_ARRAY(json_field, "array")         [1,2,3]
JSON_ARRAY(json_field, "object")        NULL
JSON_ARRAY(json_field, "int")           NULL
JSON_ARRAY(json_field, "string")        NULL
JSON_ARRAY(json_field, "nested.0")      NULL
JSON_ARRAY(json_field, "nested.1")      [4,5,6]
JSON_ARRAY(json_field, "array.0")       NULL

JSON_ARRAY_CONTAINS

JSON_ARRAY_CONTAINS is a row function that performs a whole string match against a string formatted as a JSON array and returns a 1 or 0 depending on whether or not the string contains the search value.

Syntax

JSON_ARRAY_CONTAINS(json_array_string,"search_string")

Return Value

Returns one value per row of type INTEGER. A return value of 1 indicates a positive match, and a return value of 0 indicates no match.

Input Parameters

json_array_string

Required. The name of a field or expression of type STRING (or a literal string) that contains a valid JSON array. A JSON array is an ordered sequence of values separated by commas and enclosed in square brackets.

search_string

Required. The literal string that you want to search for. This can be a name of a field or expression of type STRING.

Examples

If you have a software field that contains a JSON array formatted like this:

["hadoop","platfora"]

The following expression returns 1:

JSON_ARRAY_CONTAINS(software,"platfora")

JSON_DOUBLE

JSON_DOUBLE is a row function that extracts a DOUBLE value from a field in a JSON object.

Syntax

JSON_DOUBLE(json_string,"json_field")


Return Value

Returns one value per row of type DOUBLE.

Input Parameters

json_string

Required. The name of a field or expression of type STRING (or a literal string) that contains a valid JSON object.

json_field

Required. The key or name of the field value you want to extract.

For top-level fields, specify the name identifier (key) of the field.

To access fields within a nested object, specify a dot-separated path of field names (for example top_level_field_name.nested_field_name).

To extract a value from an array, specify the dot-separated path of field names and the array position starting at 0 for the first value in an array, 1 for the second value, and so on (for example, field_name.0).

If the name identifier contains dot or period characters within the name itself, escape the name by enclosing it in brackets (for example, [field.name.with.dot].[another.dot.field.name]).

If the field name is null (empty), use brackets with nothing in between as the identifier, for example [].

Examples

If you had a top_scores field that contained a JSON object formatted like this (with the values contained in an array):

{"practice_scores":["538.67","674.99","1021.52"], "test_scores":["753.21","957.88","1032.87"]}

You could extract the third value of the test_scores array using the expression:

JSON_DOUBLE(top_scores,"test_scores.2")

JSON_FIXED

JSON_FIXED is a row function that extracts a FIXED value from a field in a JSON object.

Syntax

JSON_FIXED(json_string,"json_field")

Return Value

Returns one value per row of type FIXED.

Input Parameters

json_string


Required. The name of a field or expression of type STRING (or a literal string) that contains a valid JSON object.

json_field

Required. The key or name of the field value you want to extract.

For top-level fields, specify the name identifier (key) of the field.

To access fields within a nested object, specify a dot-separated path of field names (for example top_level_field_name.nested_field_name).

To extract a value from an array, specify the dot-separated path of field names and the array position starting at 0 for the first value in an array, 1 for the second value, and so on (for example, field_name.0).

If the name identifier contains dot or period characters within the name itself, escape the name by enclosing it in brackets (for example, [field.name.with.dot].[another.dot.field.name]).

If the field name is null (empty), use brackets with nothing in between as the identifier, for example [].

Examples

If you had a top_scores field that contained a JSON object formatted like this (with the values contained in an array):

{"practice_scores":["538.67","674.99","1021.52"], "test_scores":["753.21","957.88","1032.87"]}

You could extract the third value of the test_scores array using the expression:

JSON_FIXED(top_scores,"test_scores.2")

JSON_INTEGER

JSON_INTEGER is a row function that extracts an INTEGER value from a field in a JSON object.

Syntax

JSON_INTEGER(json_string,"json_field")

Return Value

Returns one value per row of type INTEGER.

Input Parameters

json_string

Required. The name of a field or expression of type STRING (or a literal string) that contains a valid JSON object.

json_field

Required. The key or name of the field value you want to extract.


For top-level fields, specify the name identifier (key) of the field.

To access fields within a nested object, specify a dot-separated path of field names (for example top_level_field_name.nested_field_name).

To extract a value from an array, specify the dot-separated path of field names and the array position starting at 0 for the first value in an array, 1 for the second value, and so on (for example, field_name.0).

If the name identifier contains dot or period characters within the name itself, escape the name by enclosing it in brackets (for example, [field.name.with.dot].[another.dot.field.name]).

If the field name is null (empty), use brackets with nothing in between as the identifier, for example [].

Examples

If you had an address field that contained a JSON object formatted like this:

{"street_address":"123 B Street", "city":"San Mateo", "state":"CA","zip_code":"94403"}

You could extract the zip_code value using the expression:

JSON_INTEGER(address,"zip_code")

If you had a top_scores field that contained a JSON object formatted like this (with the values contained in an array):

{"practice_scores":["538","674","1021"], "test_scores":["753","957","1032"]}

You could extract the third value of the test_scores array using the expression:

JSON_INTEGER(top_scores,"test_scores.2")

JSON_LONG

JSON_LONG is a row function that extracts a LONG value from a field in a JSON object.

Syntax

JSON_LONG(json_string,"json_field")

Return Value

Returns one value per row of type LONG.

Input Parameters

json_string

Required. The name of a field or expression of type STRING (or a literal string) that contains a valid JSON object.

json_field


Required. The key or name of the field value you want to extract.

For top-level fields, specify the name identifier (key) of the field.

To access fields within a nested object, specify a dot-separated path of field names (for example top_level_field_name.nested_field_name).

To extract a value from an array, specify the dot-separated path of field names and the array position starting at 0 for the first value in an array, 1 for the second value, and so on (for example, field_name.0).

If the name identifier contains dot or period characters within the name itself, escape the name by enclosing it in brackets (for example, [field.name.with.dot].[another.dot.field.name]).

If the field name is null (empty), use brackets with nothing in between as the identifier, for example [].

Examples

If you had a top_scores field that contained a JSON object formatted like this (with the values contained in an array):

{"practice_scores":["538","674","1021"], "test_scores":["753","957","1032"]}

You could extract the third value of the test_scores array using the expression:

JSON_LONG(top_scores,"test_scores.2")

JSON_OBJECT

JSON_OBJECT is a row function that extracts a JSON OBJECT as a STRING value from a field in a JSON object.

Syntax

JSON_OBJECT(json_string,"json_field")

Return Value

Returns one value per row of type STRING.

Input Parameters

json_string

Required. The name of a field or expression of type STRING (or a literal string) that contains a valid JSON object.

json_field

Required. The key or name of the field value you want to extract.

For top-level fields, specify the name identifier (key) of the field.


To access fields within a nested object, specify a dot-separated path of field names (for example top_level_field_name.nested_field_name).

To extract a value from an array, specify the dot-separated path of field names and the array position starting at 0 for the first value in an array, 1 for the second value, and so on (for example, field_name.0).

If the name identifier contains dot or period characters within the name itself, escape the name by enclosing it in brackets (for example, [field.name.with.dot].[another.dot.field.name]).

If the field name is null (empty), use brackets with nothing in between as the identifier, for example [].

Examples

If you had a friends field that contained a JSON object formatted like this:

{"f1":[{"id":0,"name":"Brenda Griffin","f2":{"id":0,"name":"Bowen Blair","f3":{"id":0,"name":"Maude Hoffman"}}},{"f4":{"f5":1}}]}

You could extract the f3 value using the following expression:

JSON_OBJECT(friends, "f1.0.f2.f3")

This expression would return the following value:

{"id":0,"name":"Maude Hoffman"}And the following expression:

JSON_OBJECT(friends, "f1.1")

Returns the following value:

{"f4":{"f5":1}}

Suppose you have a field called json_field that contains the following value:

{"int":10, "string": "hello world", "array": [1,2,3], "object": {"key":"value" }, "nested": [{"nkey":"nvalue"},[4,5,6]]}

The following expressions return the following results:

Expression                                Return Value

JSON_OBJECT(json_field, "array")          NULL
JSON_OBJECT(json_field, "object")         {"key":"value"}
JSON_OBJECT(json_field, "int")            NULL
JSON_OBJECT(json_field, "string")         NULL
JSON_OBJECT(json_field, "nested.0")       {"nkey":"nvalue"}
JSON_OBJECT(json_field, "nested.1")       NULL
JSON_OBJECT(json_field, "object.key")     NULL


JSON_STRING

JSON_STRING is a row function that extracts a STRING value from a field in a JSON object.

Syntax

JSON_STRING(json_string,"json_field")

Return Value

Returns one value per row of type STRING.

Input Parameters

json_string

Required. The name of a field or expression of type STRING (or a literal string) that contains a valid JSON object.

json_field

Required. The key or name of the field value you want to extract.

For top-level fields, specify the name identifier (key) of the field.

To access fields within a nested object, specify a dot-separated path of field names (for example top_level_field_name.nested_field_name).

To extract a value from an array, specify the dot-separated path of field names and the array position starting at 0 for the first value in an array, 1 for the second value, and so on (for example, field_name.0).

If the name identifier contains dot or period characters within the name itself, escape the name by enclosing it in brackets (for example, [field.name.with.dot].[another.dot.field.name]).

If the field name is null (empty), use brackets with nothing in between as the identifier, for example [].

Examples

If you had an address field that contained a JSON object formatted like this:

{"street_address":"123 B Street", "city":"San Mateo", "state":"CA","zip":"94403"}

You could extract the state value using the expression:

JSON_STRING(address,"state")

If you had a misc field that contained a JSON object formatted like this (with the values contained in an array):

{"hobbies":["sailing","hiking","cooking"], "interests":["art","music","travel"]}

You could extract the first value of the hobbies array using the expression:


JSON_STRING(misc,"hobbies.0")

LENGTH

LENGTH is a row function that returns the count of characters in a string value.

Syntax

LENGTH(string)

Return Value

Returns one value per row of type INTEGER.

Input Parameters

string

Required. The name of a field or expression of type STRING (or a literal string).

Examples

Return count of characters from values in the name field. For example, the value Bob would return a length of 3, Julie would return a length of 5, and so on:

LENGTH(name)

REGEX

REGEX is a row function that performs a whole string match against a string value with a regular expression and returns the portion of the string matching the first capturing group of the regular expression.

Syntax

REGEX(string_expression,"regex_matching_pattern")

Return Value

Returns the matched STRING value of the first capturing group of the regular expression. If there is no match, returns NULL.

Input Parameters

string_expression

Required. The name of a field or expression of type STRING (or a literal string).

regex_matching_pattern

Required. A regular expression pattern based on the regular expression pattern matching syntax of the Java programming language. To return a non-NULL value, the regular expression pattern must match the entire string value.


Regular Expression Constructs

This section lists a summary of the most commonly used constructs for defining a regular expression matching pattern. See the Regular Expression Reference for more information about regular expression support in Platfora.

Literal and Special Characters

The most basic form of pattern matching is the match of literal characters. For example, if the regular expression is foo and the input string is foo, the match will succeed because the strings are identical. Certain characters are reserved for special use in regular expressions. These special characters are often called metacharacters. If you want to use special characters as literal characters, they must be escaped. You can escape a single character using a \ (backslash), or escape a character sequence by enclosing it in \Q ... \E.

To escape literal double-quotes, double the double-quotes ("").

Character Name Character Reserved For

opening bracket [ start of a character class

closing bracket ] end of a character class

hyphen - character ranges within a character class

backslash \ general escape character

caret ^ beginning of string, negating of a character class

dollar sign $ end of string

period . matching any single character

pipe | alternation (OR) operator

question mark ? optional quantifier, quantifier minimizer

asterisk * zero or more quantifier

plus sign + once or more quantifier

opening parenthesis ( start of a subexpression group

closing parenthesis ) end of a subexpression group

opening brace { start of min/max quantifier

closing brace } end of min/max quantifier

Character Class Constructs


A character class allows you to specify a set of characters, enclosed in square brackets, that can produce a single character match. There are also a number of special predefined character classes (backslash character sequences that are shorthand for the most common character sets).

Construct Type Description

[abc]          simple         matches a or b or c

[^abc]         negation       matches any character except a, b, or c

[a-zA-Z]       range          matches a through z, or A through Z (inclusive)

[a-d[m-p]]     union          matches a through d, or m through p

[a-z&&[def]]   intersection   matches d, e, or f

[a-z&&[^xq]]   subtraction    matches a through z, except for x and q

Predefined Character Classes

Predefined character classes offer convenient shorthands for commonly used regular expressions.

Construct Description Example

.    matches any single character (except newline)
     Example: .at matches "cat", "hat", and also "bat" in the phrase "batch files"

\d   matches any digit character (equivalent to [0-9])
     Example: \d matches "3" in "C3PO" and "2" in "file_2.txt"

\D   matches any non-digit character (equivalent to [^0-9])
     Example: \D matches "S" in "900S" and "Q" in "Q45"

\s   matches any single white-space character (equivalent to [ \t\n\x0B\f\r])
     Example: \sbook matches "book" in "blue book" but nothing in "notebook"

\S   matches any single non-white-space character
     Example: \Sbook matches "book" in "notebook" but nothing in "blue book"

\w   matches any alphanumeric character, including underscore (equivalent to [A-Za-z0-9_])
     Example: r\w* matches "rm" and "root"

\W   matches any non-alphanumeric character (equivalent to [^A-Za-z0-9_])
     Example: \W matches "&" in "stmd &", "%" in "100%", and "$" in "$HOME"

Line and Word Boundaries

Boundary matching constructs are used to specify where in a string to apply a matching pattern. For example, you can search for a particular pattern within a word boundary, or search for a pattern at the beginning or end of a line.

Construct Description Example

^    matches from the beginning of a line (multi-line matches are currently not supported)
     Example: ^172 will match the "172" in IP address "172.18.1.11" but not in "192.172.2.33"

$    matches from the end of a line (multi-line matches are currently not supported)
     Example: d$ will match the "d" in "maid" but not in "made"

\b   matches within a word boundary
     Example: \bis\b matches the word "is" in "this is my island", but not the "is" part of "this" or "island". \bis matches both "is" and the "is" in "island", but not in "this".

\B   matches within a non-word boundary
     Example: \Bb matches "b" in "sbin" but not in "bash"

Quantifiers

Quantifiers specify how often the preceding regular expression construct should match. There are three classes of quantifiers: greedy, reluctant, and possessive. The difference between greedy, reluctant, and possessive quantifiers involves what part of the string to try for the initial match, and how to retry if the initial attempt does not produce a match.

Greedy / Reluctant / Possessive    Description and Example

? ?? ?+                matches the previous character or construct once or not at all.
                       Example: st?on matches "son" in "johnson" and "ston" in "johnston", but nothing in "clinton" or "version".

* *? *+                matches the previous character or construct zero or more times.
                       Example: if* matches "if", "iff" in "diff", or "i" in "print".

+ +? ++                matches the previous character or construct one or more times.
                       Example: if+ matches "if" and "iff" in "diff", but nothing in "print".

{n} {n}? {n}+          matches the previous character or construct exactly n times.
                       Example: o{2} matches "oo" in "lookup" and the first two o's in "fooooo", but nothing in "mount".

{n,} {n,}? {n,}+       matches the previous character or construct at least n times.
                       Example: o{2,} matches "oo" in "lookup" and all five o's in "fooooo", but nothing in "mount".

{n,m} {n,m}? {n,m}+    matches the previous character or construct at least n times, but no more than m times.
                       Example: F{2,4} matches "FF" in "#FF0000" and the last four F's in "#FFFFFF".

Capturing and Non-Capturing Groups

Groups are specified by a pair of parentheses around a subpattern in the regular expression. A pattern can have more than one group and the groups can be nested. The groups are numbered 1-n from left to right, starting with the first opening parenthesis. There is always an implicit group 0, which contains the entire match. For example, the pattern:

(a(b*))+(c)


contains three groups:

group 1: (a(b*))
group 2: (b*)
group 3: (c)

Capturing Groups

By default, a group captures the text that produces a match, and only the most recent match is captured. The REGEX function returns the string that matches the first capturing group in the regular expression. For example, if the input string to the expression above was abc, the entire REGEX function would match to abc, but only return the result of group 1, which is ab.

Non-Capturing Groups

In some cases, you may want to use parentheses to group subpatterns, but not capture text. A non-capturing group starts with (?: (a question mark and colon following the opening parenthesis). For example, h(?:a|i|o)t matches hat or hit or hot, but does not capture the a, i, or o from the subexpression.

Examples

Match all possible email address strings with a pattern of username@provider.tld, but only return the provider portion of the email address from the email field:

REGEX(email,"^[a-zA-Z0-9._%+-]+@([a-zA-Z0-9._-]+)\.[a-zA-Z]{2,4}$")

Match the request line of a web log, where the value is in the format of:

GET /some_page.html HTTP/1.1

and return just the requested HTML page names:

REGEX(weblog.request_line,"GET\s/([a-zA-Z0-9._%-]+\.html)\sHTTP/[0-9.]+")

Extract the inches portion from a height field where example values are 6'2", 5'11" (notice the escaping of the literal quote with a double double-quote):

REGEX(height, "\d\'(\d+)""")

Extract all of the contents of the device field when the value is either iPod, iPad, or iPhone:

REGEX(device,"(iP[ao]d|iPhone)")

REGEX_REPLACE

REGEX_REPLACE is a row function that evaluates a string value against a regular expression to determine if there is a match, and replaces matched strings with the specified replacement value.

Syntax

REGEX_REPLACE(string_expression,"regex_match_pattern","regex_replace_pattern")


Return Value

Returns the regex_replace_pattern as a STRING value when regex_match_pattern produces a match. If there is no match, returns the value of string_expression as a STRING.

Input Parameters

string_expression

Required. The name of a field or expression of type STRING (or a literal string).

regex_match_pattern

Required. A string literal or regular expression pattern based on the regular expression pattern matching syntax of the Java programming language. You can use capturing groups to create backreferences that can be used in the regex_replace_pattern. You might want to use a string literal to make a case-sensitive match. For example, when you enter jane as the match value, the function matches jane but not Jane. The function matches all occurrences of a string literal in the string expression.

regex_replace_pattern

Required. A string literal or regular expression pattern based on the regular expression pattern matching syntax of the Java programming language. You can refer to backreferences from the regex_match_pattern using the syntax $n (where n is the group number).

Regular Expression Constructs

The regular expression constructs supported by REGEX_REPLACE (literal and special characters, character class constructs, predefined character classes, line and word boundaries, and quantifiers) are the same as those summarized above for the REGEX function. See the Regular Expression Reference for more information.

Examples

Match the values in a phone_number field where phone number values are formatted as xxx.xxx.xxxx and replace them with phone number values formatted as (xxx) xxx-xxxx:

REGEX_REPLACE(phone_number,"([0-9]{3})\.([0-9]{3})\.([0-9]{4})","\($1\) $2-$3")

Match the values in a name field where name values are formatted as firstname lastname and replace them with name values formatted as lastname, firstname:

REGEX_REPLACE(name,"(.*) (.*)","$2, $1")

Match the string literal mrs in a title field and replace it with the string literal Mrs.

REGEX_REPLACE(title,"mrs","Mrs")

SPLIT

SPLIT is a row function that breaks down a delimited input string into sections and returns the specified section of the string. A section is considered any sub-string between the specified delimiter.

Syntax

SPLIT(input_string_expression,"delimiter_string",position_integer)

Return Value

Returns one value per row of type STRING.

Input Parameters

input_string_expression


Required. The name of a field or expression of type STRING (or a literal string).

delimiter_string

Required. A literal string representing the delimiter used to separate values in the input string. The delimiter can be a single character or multiple characters.

position_integer

Required. An integer representing the position of the section in the input string that you want to extract. Positive integers count the position from the beginning of the string, and negative integers count the position from the end of the string. A value of 0 returns NULL.

Examples

Return the third section of the literal delimited string: Restaurants>Location>San Francisco:

SPLIT("Restaurants>Location>San Francisco",">", -1) returns San Francisco

Return the first section of a phone_number field where phone number values are in the format of 123-456-7890:

SPLIT(phone_number,"-",1)

SUBSTRING

SUBSTRING is a row function that returns the specified characters of a string value based on the given start and end position.

Syntax

SUBSTRING(string,start,end)

Return Value

Returns one value per row of type STRING.

Input Parameters

string

Required. The name of a field or expression of type STRING (or a literal string).

start

Required. An integer that specifies where the returned characters start (inclusive), with 0 being the first character of the string. If start is greater than the number of characters, then an empty string is returned. If start is greater than end, then an empty string is returned.

end

Required. A positive integer that specifies where the returned characters end (exclusive), with the end character not being part of the return value. If end is greater than the number of characters, the whole string value (from start) is returned.


Examples

Return the first letter of the name field:

SUBSTRING(name,0,1)
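As a further illustration with a literal string, return characters 2 through 5 (start is inclusive, end is exclusive):

SUBSTRING("platfora",2,6) returns atfo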

TO_LOWER

TO_LOWER is a row function that converts all alphabetic characters in a string to lower case.

Syntax

TO_LOWER(string_expression)

Return Value

Returns one value per row of type STRING.

Input Parameters

string_expression

Required. The name of a field or expression of type STRING (or a literal string).

Examples

Return the literal input string 123 Main Street in all lower case letters:

TO_LOWER("123 Main Street") returns 123 main street

TO_UPPER

TO_UPPER is a row function that converts all alphabetic characters in a string to upper case.

Syntax

TO_UPPER(string_expression)

Return Value

Returns one value per row of type STRING.

Input Parameters

string_expression

Required. The name of a field or expression of type STRING (or a literal string).

Examples

Return the literal input string 123 Main Street in all upper case letters:

TO_UPPER("123 Main Street") returns 123 MAIN STREET


TRIM

TRIM is a row function that removes leading and trailing spaces from a string value.

Syntax

TRIM(string_expression)

Return Value

Returns one value per row of type STRING.

Input Parameters

string_expression

Required. The name of a field or expression of type STRING (or a literal string).

Examples

Return the value of the area_code field without any leading or trailing spaces. For example, if the input string is " 650 ", then the return value would be "650":

TRIM(area_code)

Return the value of the phone_number field without any leading or trailing spaces. For example, if the input string is " 650 123-4567 ", then the return value would be "650 123-4567" (note that the extra spaces in the middle of the string are not removed, only the spaces at the beginning and end of the string):

TRIM(phone_number)

XPATH_STRING

XPATH_STRING is a row function that takes an XML-formatted string and returns the first string matching the given XPath expression.

Syntax

XPATH_STRING(xml_formatted_string,"xpath_expression")

Return Value

Returns one value per row of type STRING.

If the XPath expression matches more than one string in the given XML node, this function will return the first match only. To return all matches, use XPATH_STRINGS instead.

Input Parameters

xml_formatted_string

Required. The name of a field or a literal string that contains a valid XML node (a snippet of XML consisting of a parent element and one or more child nodes).


xpath_expression

Required. An XPath expression that refers to a node, element, or attribute within the XML string passed to this expression. Any XPath expression that complies with the XML Path Language (XPath) Version 1.0 specification is valid.

Examples

These example XPATH_STRING expressions assume you have a field in your dataset named address that contains XML-formatted strings such as this:

<list> <address type="work"> <street>1300 So. El Camino Real</street1> <street>Suite 600</street2> <city>San Mateo</city> <state>CA</state> <zipcode>94403</zipcode> </address> <address type="home"> <street>123 Oakdale Street</street1> <street/> <city>San Francisco</city> <state>CA</state> <zipcode>94123</zipcode> </address></list>

Get the zipcode value from any address element where the type attribute equals home:

XPATH_STRING(address,"//address[@type='home']/zipcode")

returns: 94123

Get the city value from the second address element:

XPATH_STRING(address,"/list/address[2]/city")

returns: San Francisco

Get the values from all child elements of the first address element (as one string):

XPATH_STRING(address,"/list/address")

returns: 1300 So. El Camino RealSuite 600 San MateoCA94403

XPATH_STRINGS

XPATH_STRINGS is a row function that takes an XML-formatted string and returns a newline-separated array of strings matching the given XPath expression.

Syntax

XPATH_STRINGS(xml_formatted_string,"xpath_expression")


Return Value

Returns one value per row of type STRING.

If the XPath expression matches more than one string in the given XML node, this function will return all matches separated by a newline (you cannot specify a different delimiter).

Input Parameters

xml_formatted_string

Required. The name of a field or a literal string that contains a valid XML node (a snippet of XML consisting of a parent element and one or more child nodes).

xpath_expression

Required. An XPath expression that refers to a node, element, or attribute within the XML string passed to this expression. Any XPath expression that complies with the XML Path Language (XPath) Version 1.0 specification is valid.

Examples

These example XPATH_STRINGS expressions assume you have a field in your dataset named address that contains XML-formatted strings such as this:

<list> <address type="work"> <street>1300 So. El Camino Real</street1> <street>Suite 600</street2> <city>San Mateo</city> <state>CA</state> <zipcode>94403</zipcode> </address> <address type="home"> <street>123 Oakdale Street</street1> <street/> <city>San Francisco</city> <state>CA</state> <zipcode>94123</zipcode> </address></list>

Get all zipcode values from all address elements:

XPATH_STRINGS(address,"//address/zipcode")

returns:

94123
94403

Get all street values from the first address element:

XPATH_STRINGS(address,"/list/address[1]/street")


returns:

1300 So. El Camino Real
Suite 600

Get the values from all child elements of all address elements (as one string per line):

XPATH_STRINGS(address,"/list/address")

returns:

123 Oakdale StreetSan FranciscoCA94123
1300 So. El Camino RealSuite 600 San MateoCA94403

XPATH_XML

XPATH_XML is a row function that takes an XML-formatted string and returns an XML-formatted string matching the given XPath expression.

Syntax

XPATH_XML(xml_formatted_string,"xpath_expression")

Return Value

Returns one value per row of type STRING in XML format.

Input Parameters

xml_formatted_string

Required. The name of a field or a literal string that contains a valid XML node (a snippet of XML consisting of a parent element and one or more child nodes).

xpath_expression

Required. An XPath expression that refers to a node, element, or attribute within the XML string passed to this expression. Any XPath expression that complies with the XML Path Language (XPath) Version 1.0 specification is valid.

Examples

These example XPATH_XML expressions assume you have a field in your dataset named address that contains XML-formatted strings such as this:

<list> <address type="work"> <street>1300 So. El Camino Real</street1> <street>Suite 600</street2> <city>San Mateo</city> <state>CA</state> <zipcode>94403</zipcode> </address> <address type="home"> <street>123 Oakdale Street</street1>

Page 462: Data Ingest Guide

Data Ingest Guide - Platfora Expression Language Dictionary

Page 462

<street/> <city>San Francisco</city> <state>CA</state> <zipcode>94123</zipcode> </address></list>

Get the last address node and its child nodes in XML format:

XPATH_XML(address,"//address[last()]")

returns:

<address type="home"><street>123 Oakdale Street</street1><street/><city>San Francisco</city><state>CA</state><zipcode>94123</zipcode></address>

Get the city value from the second address node in XML format:

XPATH_XML(address,"/list/address[2]/city")

returns: <city>San Francisco</city>

Get the first address node and its child nodes in XML format:

XPATH_XML(address,"/list/address[1]")

returns:

<address type="work"><street>1300 So. El Camino Real</street1><street>Suite 600</street2><city>San Mateo</city><state>CA</state><zipcode>94403</zipcode></address>

URL Functions

URL functions allow you to extract different portions of a URL string, and decode text that is URL-encoded.

URL_AUTHORITY

URL_AUTHORITY is a row function that returns the authority portion of a URL string. The authority portion of a URL is the part that has the information on how to locate and connect to the server.


Syntax

URL_AUTHORITY(string)

Return Value

Returns the authority portion of a URL as a STRING value, or NULL if the input string is not a valid URL.

For example, in the string http://www.platfora.com/company/contact.html, the authority portion is www.platfora.com.

In the string http://user:[email protected]:8012/mypage.html, the authority portion is user:[email protected]:8012.

In the string mailto:[email protected]?subject=Topic, the authority portion is NULL.

Input Parameters

string

Required. A field or expression that returns a STRING value in URI (uniform resource identifier) format of: protocol:authority[/path][?query][#fragment].

The authority portion of the URL contains the host information, which can be specified as a domain name (www.platfora.com), a host name (localhost), or an IP address (127.0.0.1). The host information can be preceded by optional user information terminated with @ (for example, username:[email protected]), and followed by an optional port number preceded by a colon (for example, localhost:8001).

Examples

Return the authority portion of URL string values in the referrer field:

URL_AUTHORITY(referrer)

Return the authority portion of a literal URL string:

URL_AUTHORITY("http://user:[email protected]:8012/mypage.html")returns user:[email protected]:8012

URL_FRAGMENT

URL_FRAGMENT is a row function that returns the fragment portion of a URL string.

Syntax

URL_FRAGMENT(string)


Return Value

Returns the fragment portion of a URL as a STRING value, NULL if the URL does not contain a fragment, or NULL if the input string is not a valid URL.

For example, in the string http://www.platfora.com/contact.html#phone, the fragment portion is phone.

In the string http://www.platfora.com/contact.html, the fragment portion is NULL.

In the string http://platfora.com/news.php?topic=press#Platfora%20News, the fragment portion is Platfora%20News.

Input Parameters

string

Required. A field or expression that returns a STRING value in URI (uniform resource identifier) format of: protocol:authority[/path][?query][#fragment].

The optional fragment portion of the URL is separated by a hash mark (#) and provides direction to a secondary resource, such as a heading or anchor identifier.

Examples

Return the fragment portion of URL string values in the request field:

URL_FRAGMENT(request)

Return the fragment portion of a literal URL string:

URL_FRAGMENT("http://platfora.com/news.php?topic=press#Platfora%20News")returns Platfora%20News

Return and decode the fragment portion of a literal URL string:

URLDECODE(URL_FRAGMENT("http://platfora.com/news.php?topic=press#Platfora%20News")) returns Platfora News

URL_HOST

URL_HOST is a row function that returns the host, domain, or IP address portion of a URL string.

Syntax

URL_HOST(string)

Return Value

Returns the host portion of a URL as a STRING value, or NULL if the input string is not a valid URL.

For example, in the string http://www.platfora.com/company/contact.html, the host portion is www.platfora.com.


In the string http://admin:[email protected]:8001/index.html, the host portion is 127.0.0.1.

In the string mailto:[email protected]?subject=Topic, the host portion is NULL.

Input Parameters

string

Required. A field or expression that returns a STRING value in URI (uniform resource identifier) format of: protocol:authority[/path][?query][#fragment].

The authority portion of the URL contains the host information, which can be specified as a domain name (www.platfora.com), a host name (localhost), or an IP address (127.0.0.1).

Examples

Return the host portion of URL string values in the referrer field:

URL_HOST(referrer)

Return the host portion of a literal URL string:

URL_HOST("http://user:[email protected]:8012/mypage.html") returnsmycompany.com

URL_PATH

URL_PATH is a row function that returns the path portion of a URL string.

Syntax

URL_PATH(string)

Return Value

Returns the path portion of a URL as a STRING value, NULL if the URL does not contain a path, or NULL if the input string is not a valid URL.

For example, in the string http://www.platfora.com/company/contact.html, the path portion is /company/contact.html.

In the string http://admin:[email protected]:8001/index.html, the path portion is /index.html.

In the string mailto:[email protected]?subject=Topic, the path portion is [email protected].

Input Parameters

string


Required. A field or expression that returns a STRING value in URI (uniform resource identifier) format of: protocol:authority[/path][?query][#fragment].

The optional path portion of the URL is a sequence of resource location segments separated by a forward slash (/), conceptually similar to a directory path.

Examples

Return the path portion of URL string values in the request field:

URL_PATH(request)

Return the path portion of a literal URL string:

URL_PATH("http://platfora.com/company/contact.html") returns /company/contact.html

URL_PORT

URL_PORT is a row function that returns the port portion of a URL string.

Syntax

URL_PORT(string)

Return Value

Returns the port portion of a URL as an INTEGER value. If the URL does not specify a port, then returns -1. If the input string is not a valid URL, returns NULL.

For example, in the string http://localhost:8001, the port portion is 8001.

Input Parameters

string

Required. A field or expression that returns a STRING value in URI (uniform resource identifier) format of: protocol:authority[/path][?query][#fragment].

The authority portion of the URL contains the host information, which can be specified as a domain name (www.platfora.com), a host name (localhost), or an IP address (127.0.0.1). The host information can be followed by an optional port number preceded by a colon (for example, localhost:8001).

Examples

Return the port portion of URL string values in the referrer field:

URL_PORT(referrer)

Return the port portion of a literal URL string:


URL_PORT("http://user:[email protected]:8012/mypage.html") returns8012

URL_PROTOCOL

URL_PROTOCOL is a row function that returns the protocol (or URI scheme name) portion of a URL string.

Syntax

URL_PROTOCOL(string)

Return Value

Returns the protocol portion of a URL as a STRING value, or NULL if the input string is not a valid URL.

For example, in the string http://www.platfora.com, the protocol portion is http.

In the string ftp://ftp.platfora.com/articles/platfora.pdf, the protocol portion is ftp.

Input Parameters

string

Required. A field or expression that returns a STRING value in URI (uniform resource identifier) format of: protocol:authority[/path][?query][#fragment].

The protocol portion of a URL consists of a sequence of characters beginning with a letter and followed by any combination of letter, number, plus (+), period (.), or hyphen (-) characters, followed by a colon (:). For example: http:, ftp:, mailto:

Examples

Return the protocol portion of URL string values in the referrer field:

URL_PROTOCOL(referrer)

Return the protocol portion of the literal URL string:

URL_PROTOCOL("http://www.platfora.com") returns http

URL_QUERY

URL_QUERY is a row function that returns the query portion of a URL string.

Syntax

URL_QUERY(string)


Return Value

Returns the query portion of a URL as a STRING value, NULL if the URL does not contain a query, or NULL if the input string is not a valid URL.

For example, in the string http://www.platfora.com/contact.html, the query portion is NULL.

In the string http://platfora.com/news.php?topic=press&timeframe=today#Platfora%20News, the query portion is topic=press&timeframe=today.

In the string mailto:[email protected]?subject=Topic, the query portion is subject=Topic.

Input Parameters

string

Required. A field or expression that returns a STRING value in URI (uniform resource identifier) format of: protocol:authority[/path][?query][#fragment].

The optional query portion of the URL is separated by a question mark (?) and typically contains an unordered list of key=value pairs separated by an ampersand (&) or semicolon (;).

Examples

Return the query portion of URL string values in the request field:

URL_QUERY(request)

Return the query portion of a literal URL string:

URL_QUERY("http://platfora.com/news.php?topic=press&timeframe=today")returns topic=press&timeframe=today

URLDECODE

URLDECODE is a row function that decodes a string that has been encoded with the application/x-www-form-urlencoded media type. URL encoding, also known as percent-encoding, is a mechanism for encoding information in a Uniform Resource Identifier (URI). When sent in an HTTP GET request, application/x-www-form-urlencoded data is included in the query component of the request URI. When sent in an HTTP POST request, the data is placed in the body of the message, and the name of the media type is included in the message Content-Type header.

Syntax

URLDECODE(string)

Return Value

Returns a value of type STRING with characters decoded as follows:


• Alphanumeric characters (a-z, A-Z, 0-9) remain unchanged.

• The special characters hyphen (-), comma (,), underscore (_), period (.), and asterisk (*) remain unchanged.

• The plus sign (+) character is converted to a space character.

• The percent character (%) is interpreted as the start of a special escaped sequence, where in the sequence %HH, HH represents the hexadecimal value of the byte. For example, some common escape sequences are:

Percent encoding sequence | Value
%20                       | space
%0A or %0D or %0D%0A      | newline
%22                       | double quote (")
%25                       | percent (%)
%2D                       | hyphen (-)
%2E                       | period (.)
%3C                       | less than (<)
%3E                       | greater than (>)
%5C                       | backslash (\)
%7C                       | pipe (|)

Input Parameters

string

Required. A field or expression that returns a STRING value. It is assumed that all characters in the input string are one of the following: lower-case letters (a-z), upper-case letters (A-Z), numeric digits (0-9), or the hyphen (-), comma (,), underscore (_), period (.) or asterisk (*) character. The percent character (%) is allowed, but is interpreted as the start of a special escaped sequence. The plus character (+) is allowed, but is interpreted as a space character.

Examples

Decode the values of the url_query field:

URLDECODE(url_query)

Convert a literal URL-encoded string (N%2FA%20or%20%22not%20applicable%22) to a human-readable value (N/A or "not applicable"):


URLDECODE("N%2FA%20or%20%22not%20applicable%22") returns N/A or "notapplicable"

IP Address Functions

IP address functions allow you to manipulate and transform STRING data consisting of IP address values.

CIDR_MATCH

CIDR_MATCH is a row function that compares two STRING arguments representing a CIDR mask and an IP address, and returns 1 if the IP address falls within the specified subnet mask or 0 if it does not.

Syntax

CIDR_MATCH(CIDR_string, IP_string)

Return Value

Returns an INTEGER value of 1 if the IP address falls within the subnet indicated by the CIDR mask and 0 if it does not.

Input Parameters

CIDR_string

Required. A field or expression that returns a STRING value containing either an IPv4 or IPv6 CIDR mask (Classless Inter-Domain Routing subnet notation). An IPv4 CIDR mask can only successfully match IPv4 addresses, and an IPv6 CIDR mask can only successfully match IPv6 addresses.

IP_string

Required. A field or expression that returns a STRING value containing either an IPv4 or IPv6 internet protocol (IP) address.

Examples

Compare an IPv4 CIDR subnet mask to an IPv4 IP address:

CIDR_MATCH("60.145.56.0/24","60.145.56.246") returns 1

CIDR_MATCH("60.145.56.0/30","60.145.56.246") returns 0

Compare an IPv6 CIDR subnet mask to an IPv6 IP address:

CIDR_MATCH("fe80::/70","FE80::0202:B3FF:FE1E:8329") returns 1

CIDR_MATCH("fe80::/72","FE80::0202:B3FF:FE1E:8329") returns 0


HEX_TO_IP

HEX_TO_IP is a row function that converts a hexadecimal-encoded STRING to a text representation of an IP address.

Syntax

HEX_TO_IP(string)

Return Value

Returns a value of type STRING representing either an IPv4 or IPv6 address. The type of IP address returned depends on the input string. An 8-character hexadecimal string will return an IPv4 address. A 32-character hexadecimal string will return an IPv6 address. IPv6 addresses are represented in full length, without removing any leading zeros and without using the compressed :: notation. For example, 2001:0db8:0000:0000:0000:ff00:0042:8329 rather than 2001:db8::ff00:42:8329. Input strings that do not contain either 8 or 32 valid hexadecimal characters will return NULL.

Input Parameters

string

Required. A field or expression that returns a hexadecimal-encoded STRING value. The hexadecimal string must be either 8 characters long (in which case it is converted to an IPv4 address) or 32 characters long (in which case it is converted to an IPv6 address).

Examples

Return a plain text IP address for each hexadecimal-encoded string value in the byte_encoded_ips column:

HEX_TO_IP(byte_encoded_ips)

Convert an 8 character hexadecimal-encoded string to a plain text IPv4 address:

HEX_TO_IP("AB20FE01") returns 171.32.254.1
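
Each pair of hexadecimal digits corresponds to one octet of the IPv4 address: AB = 171, 20 = 32, FE = 254, and 01 = 1.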

Convert a 32 character hexadecimal-encoded string to a plain text IPv6 address:

HEX_TO_IP("FE800000000000000202B3FFFE1E8329") returns fe80:0000:0000:0000:0202:b3ff:fe1e:8329

Date and Time Functions

Date and time functions allow you to manipulate and transform datetime values, such as calculating time differences between two datetime values, or extracting a portion of a datetime value.


DAYS_BETWEEN

DAYS_BETWEEN is a row function that calculates the whole number of days (ignoring time) between two DATETIME values (value1-value2).

Syntax

DAYS_BETWEEN(datetime_1,datetime_2)

Return Value

Returns one value per row of type INTEGER.

Input Parameters

datetime_1

Required. A field or expression of type DATETIME.

datetime_2

Required. A field or expression of type DATETIME.

Examples

Calculate the number of days to ship a product by subtracting the value of the order_date field from the ship_date field:

DAYS_BETWEEN(ship_date,order_date)

Calculate the number of days since a product's release by subtracting the value of the release_date field in the product dataset from the current date (the result of the NOW() expression):

DAYS_BETWEEN(NOW(),product.release_date)

DATE_ADD

DATE_ADD is a row function that adds the specified time interval to a DATETIME value.

Syntax

DATE_ADD(datetime,quantity,"interval")

Return Value

Returns a value of type DATETIME.

Input Parameters

datetime

Required. A field name or expression that returns a DATETIME value.

quantity


Required. An integer value. To add time intervals, use a positive integer. To subtract time intervals, use a negative integer.

interval

Required. One of the following time intervals:

• millisecond - Adds the specified number of milliseconds to a datetime value.

• second - Adds the specified number of seconds to a datetime value.

• minute - Adds the specified number of minutes to a datetime value.

• hour - Adds the specified number of hours to a datetime value.

• day - Adds the specified number of days to a datetime value.

• week - Adds the specified number of weeks to a datetime value.

• month - Adds the specified number of months to a datetime value.

• quarter - Adds the specified number of quarters to a datetime value.

• year - Adds the specified number of years to a datetime value.

• weekyear - Adds the specified number of weekyears to a datetime value.

Examples

Add 45 days to the value of the invoice_date field to calculate the date a payment is due:

DATE_ADD(invoice_date,45,"day")

HOURS_BETWEEN

HOURS_BETWEEN is a row function that calculates the whole number of hours (ignoring minutes, seconds, and milliseconds) between two DATETIME values (value1-value2).

Syntax

HOURS_BETWEEN(datetime_1,datetime_2)

Return Value

Returns one value per row of type INTEGER.

Input Parameters

datetime_1

Required. A field or expression of type DATETIME.

datetime_2

Required. A field or expression of type DATETIME.

Examples

Calculate the number of hours to ship a product by subtracting the value of the order_date field from the ship_date field:


HOURS_BETWEEN(ship_date,order_date)

Calculate the number of hours since an advertisement was viewed by subtracting the value of the adview_timestamp field in the impressions dataset from the current date and time (the result of the NOW() expression):

HOURS_BETWEEN(NOW(),impressions.adview_timestamp)

EXTRACT

EXTRACT is a row function that returns the specified portion of a DATETIME value.

Syntax

EXTRACT("extract_value",datetime)

Return Value

Returns the specified extracted value as type INTEGER. EXTRACT removes leading zeros. For example, the month of April returns a value of 4, not 04.

Input Parameters

extract_value

Required. One of the following extract values:

• millisecond - Returns the millisecond portion of a datetime value. For example, an input datetime value of 2012-08-15 20:38:40.213 would return an integer value of 213.

• second - Returns the second portion of a datetime value. For example, an input datetime value of 2012-08-15 20:38:40.213 would return an integer value of 40.

• minute - Returns the minute portion of a datetime value. For example, an input datetime value of 2012-08-15 20:38:40.213 would return an integer value of 38.

• hour - Returns the hour portion of a datetime value. For example, an input datetime value of 2012-08-15 20:38:40.213 would return an integer value of 20.

• day - Returns the day portion of a datetime value. For example, an input datetime value of 2012-08-15 would return an integer value of 15.

• week - Returns the ISO week number for the input datetime value. For example, an input datetime value of 2012-01-02 would return an integer value of 1 (the first ISO week of 2012 starts on Monday January 2). An input datetime value of 2012-01-01 would return an integer value of 52 (January 1, 2012 is part of the last ISO week of 2011).

• month - Returns the month portion of a datetime value. For example, an input datetime value of 2012-08-15 would return an integer value of 8.

• quarter - Returns the quarter number for the input datetime value, where quarters start on January 1, April 1, July 1, or October 1. For example, an input datetime value of 2012-08-15 would return an integer value of 3.

• year - Returns the year portion of a datetime value. For example, an input datetime value of 2012-01-01 would return an integer value of 2012.


• weekyear - Returns the year value that corresponds to the ISO week number of the input datetime value. For example, an input datetime value of 2012-01-02 would return an integer value of 2012 (the first ISO week of 2012 starts on Monday January 2). An input datetime value of 2012-01-01 would return an integer value of 2011 (January 1, 2012 is part of the last ISO week of 2011).

datetime

Required. A field name or expression that returns a DATETIME value.

Examples

Extract the hour portion from the order_date datetime field:

EXTRACT("hour",order_date)

Cast the value of the order_date string field to a datetime value using TO_DATE, and extract the ISO week year:

EXTRACT("weekyear",TO_DATE(order_date,"MM/dd/yyyy HH:mm:ss"))

MILLISECONDS_BETWEEN

MILLISECONDS_BETWEEN is a row function that calculates the whole number of milliseconds between two DATETIME values (value1-value2).

Syntax

MILLISECONDS_BETWEEN(datetime_1,datetime_2)

Return Value

Returns one value per row of type INTEGER.

Input Parameters

datetime_1

Required. A field or expression of type DATETIME.

datetime_2

Required. A field or expression of type DATETIME.

Examples

Calculate the number of milliseconds it took to serve a web page by subtracting the value of the request_timestamp field from the response_timestamp field:

MILLISECONDS_BETWEEN(response_timestamp,request_timestamp)

MINUTES_BETWEEN

MINUTES_BETWEEN is a row function that calculates the whole number of minutes (ignoring seconds and milliseconds) between two DATETIME values (value1-value2).


Syntax

MINUTES_BETWEEN(datetime_1,datetime_2)

Return Value

Returns one value per row of type INTEGER.

Input Parameters

datetime_1

Required. A field or expression of type DATETIME.

datetime_2

Required. A field or expression of type DATETIME.

Examples

Calculate the number of minutes it took for a user to click on an advertisement by subtracting the value of the impression_timestamp field from the conversion_timestamp field:

MINUTES_BETWEEN(conversion_timestamp,impression_timestamp)

Calculate the number of minutes since a user last logged in by subtracting the login_timestamp field in the weblogs dataset from the current date and time (the result of the NOW() expression):

MINUTES_BETWEEN(NOW(),weblogs.login_timestamp)

NOW

NOW is a scalar function that returns the current system date and time as a DATETIME value. It can be used in other expressions involving DATETIME type fields, such as DAYS_BETWEEN, HOURS_BETWEEN, or YEAR_DIFF. Note that the value of NOW is only evaluated at the time a lens is built (it is not re-evaluated with each query).

Syntax

NOW()

Return Value

Returns the current system date and time as a DATETIME value.

Examples

Calculate a user's age using YEAR_DIFF to subtract the value of the birthdate field in the users dataset from the current date:

YEAR_DIFF(NOW(),users.birthdate)

Calculate the number of days since a product's release using DAYS_BETWEEN to subtract the value of the release_date field from the current date:

DAYS_BETWEEN(NOW(),release_date)


SECONDS_BETWEEN

SECONDS_BETWEEN is a row function that calculates the whole number of seconds (ignoring milliseconds) between two DATETIME values (value1-value2).

Syntax

SECONDS_BETWEEN(datetime_1,datetime_2)

Return Value

Returns one value per row of type INTEGER.

Input Parameters

datetime_1

Required. A field or expression of type DATETIME.

datetime_2

Required. A field or expression of type DATETIME.

Examples

Calculate the number of seconds it took for a user to click on an advertisement by subtracting the value of the impression_timestamp field from the conversion_timestamp field:

SECONDS_BETWEEN(conversion_timestamp,impression_timestamp)

Calculate the number of seconds since a user last logged in by subtracting the login_timestamp field in the weblogs dataset from the current date and time (the result of the NOW() expression):

SECONDS_BETWEEN(NOW(),weblogs.login_timestamp)

TRUNC

TRUNC is a row function that truncates a DATETIME value to the specified format.

Syntax

TRUNC(datetime,"format")

Return Value

Returns a value of type DATETIME truncated to the specified format.

Input Parameters

datetime

Required. A field or expression that returns a DATETIME value.

format

Required. One of the following format values:


• millisecond - Returns a datetime value truncated to millisecond granularity. Has no effect, since millisecond is already the most granular format for datetime values. For example, an input datetime value of 2012-08-15 20:38:40.213 would return a datetime value of 2012-08-15 20:38:40.213.

• second - Returns a datetime value truncated to second granularity. For example, an input datetime value of 2012-08-15 20:38:40.213 would return a datetime value of 2012-08-15 20:38:40.000.

• minute - Returns a datetime value truncated to minute granularity. For example, an input datetime value of 2012-08-15 20:38:40.213 would return a datetime value of 2012-08-15 20:38:00.000.

• hour - Returns a datetime value truncated to hour granularity. For example, an input datetime value of 2012-08-15 20:38:40.213 would return a datetime value of 2012-08-15 20:00:00.000.

• day - Returns a datetime value truncated to day granularity. For example, an input datetime value of 2012-08-15 20:38:40.213 would return a datetime value of 2012-08-15 00:00:00.000.

• week - Returns a datetime value truncated to the first day of the week (starting on a Monday). For example, an input datetime value of 2012-08-15 (a Wednesday) would return a datetime value of 2012-08-13 (the Monday prior).

• month - Returns a datetime value truncated to the first day of the month. For example, an input datetime value of 2012-08-15 would return a datetime value of 2012-08-01.

• quarter - Returns a datetime value truncated to the first day of the quarter (January 1, April 1, July 1, or October 1). For example, an input datetime value of 2012-08-15 20:38:40.213 would return a datetime value of 2012-07-01.

• year - Returns a datetime value truncated to the first day of the year (January 1). For example, an input datetime value of 2012-08-15 would return a datetime value of 2012-01-01.

• weekyear - Returns a datetime value truncated to the first day of the ISO week year (the ISO week starting with the Monday which is nearest in time to January 1). For example, an input datetime value of 2008-08-15 would return a datetime value of 2007-12-31. The first day of the ISO week year for 2008 is December 31, 2007 (the prior Monday closest to January 1).

Examples

Truncate the order_date datetime field to day granularity:

TRUNC(order_date,"day")

Cast the value of the order_date string field to a datetime value using TO_DATE, and truncate it to day granularity:

TRUNC(TO_DATE(order_date,"MM/dd/yyyy HH:mm:ss"),"day")

YEAR_DIFF

YEAR_DIFF is a row function that calculates the fractional number of years between two DATETIME values (value1-value2).

Syntax

YEAR_DIFF(datetime_1,datetime_2)


Return Value

Returns one value per row of type DOUBLE.

Input Parameters

datetime_1

Required. A field or expression of type DATETIME.

datetime_2

Required. A field or expression of type DATETIME.

Examples

Calculate the number of years a user has been a customer by subtracting the value of the registration_date field from the current date (the result of the NOW() expression):

YEAR_DIFF(NOW(),registration_date)

Calculate a user's age by subtracting the value of the birthdate field in the users dataset from the current date (the result of the NOW() expression):

YEAR_DIFF(NOW(),users.birthdate)

Math Functions

Math functions allow you to perform basic math calculations on numeric values. You can also use arithmetic operators to perform simple math calculations.

DIV

DIV is a row function that divides two LONG values and returns a quotient value of type LONG (the result is truncated to 0 decimal places).

Syntax

DIV(dividend,divisor)

Return Value

Returns one value per row of type LONG.

Input Parameters

dividend

Required. A field or expression of type LONG.

divisor

Required. A field or expression of type LONG.


Examples

Cast the value of the file_size field to LONG and divide by 1024:

DIV(TO_LONG(file_size),1024)
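
To make the truncation concrete, a hypothetical example with literal values (7/2 = 3.5, and the fractional part is discarded):

DIV(7,2) returns 3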

EXP

EXP is a row function that raises the mathematical constant e to the power (exponent) of a numeric value and returns a value of type DOUBLE.

Syntax

EXP(power)

Return Value

Returns one value per row of type DOUBLE.

Input Parameters

power

Required. A field or expression of a numeric type.

Examples

Raise e to the power given by the Value field.

EXP(Value)

When the Value field value is 2.0, the result is equal to 7.3890 when truncated to four decimal places.

FLOOR

FLOOR is a row function that returns the largest integer that is less than or equal to the input argument.

Syntax

FLOOR(double)

Return Value

Returns one value per row of type DOUBLE.

Input Parameters

double

Required. A field or expression of type DOUBLE.

Examples

Return the floor value of 32.6789:

FLOOR(32.6789) returns 32.0
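
For negative inputs, the result still rounds toward negative infinity; for example:

FLOOR(-32.6789) returns -33.0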


HASH

HASH is a row function that evenly partitions data values into the specified number of buckets. It creates a hash of the input value and assigns that value a bucket number. Equal values will always hash to the same bucket number.

Syntax

HASH(field_name,integer)

Return Value

Returns one value per row of type INTEGER corresponding to the bucket number that the input value hashes to.

Input Parameters

field_name

Required. The name of the field whose values you want to partition. When this value is NULL and the integer parameter is a value other than zero or NULL, the function returns zero; otherwise it returns NULL.

integer

Required. The desired number of buckets. This parameter can be a numeric value of any data type, but when it is a non-integer value, Platfora truncates the value to an integer. When the value is zero or NULL, the function returns NULL. When the value is negative, the function uses the absolute value.

Examples

Partition the values of the username field into 20 buckets:

HASH(username,20)

LN

LN is a row function that returns the natural logarithm of a number. The natural logarithm is the logarithm to the base e, where e (Euler's number) is a mathematical constant approximately equal to 2.718281828. The natural logarithm of a number x is the power to which the constant e must be raised in order to equal x.

Syntax

LN(positive_number)

Return Value

Returns the exponent to which base e must be raised to obtain the input value, where e denotes the constant number 2.718281828. The return value is the same data type as the input value.

For example, LN(7.389) is 2, because e to the power of 2 is approximately 7.389.


Input Parameters

positive_number

Required. A field or expression that returns a number greater than 0. Inputs can be of type INTEGER, LONG, DOUBLE, or FIXED.

Examples

Return the natural logarithm of base number e, which is approximately 2.718281828:

LN(2.718281828) returns 1

LN(3.0000) returns 1.098612

LN(300.0000) returns 5.703782

MOD

MOD is a row function that divides two LONG values and returns the remainder value of type LONG (the result is truncated to 0 decimal places).

Syntax

MOD(dividend,divisor)

Return Value

Returns one value per row of type LONG.

Input Parameters

dividend

Required. A field or expression of type LONG.

divisor

Required. A field or expression of type LONG.

Examples

Cast the value of the file_size field to LONG and return the remainder of dividing it by 1024:

MOD(TO_LONG(file_size),1024)
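
A hypothetical example with literal values makes the remainder behavior concrete (7 divided by 2 is 3 with remainder 1):

MOD(7,2) returns 1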

POW

POW is a row function that raises a numeric value to the power (exponent) of another numeric value and returns a value of type DOUBLE.

Syntax

POW(index,power)


Return Value

Returns one value per row of type DOUBLE.

Input Parameters

index

Required. A field or expression of a numeric type.

power

Required. A field or expression of a numeric type.

Examples

Calculate the compound annual growth rate (CAGR) percentage for a given investment over a five-year span. Note the parentheses: the growth ratio is raised to the 1/5 power (0.2), 1 is subtracted, and the result is multiplied by 100:

100 * (POW(end_value/start_value, 0.2) - 1)

Calculate the square of the Value field.

POW(Value,2)

Calculate the square root of the Value field.

POW(Value,0.5)

The following expression returns 1.

POW(0,0)

ROUND

ROUND is a row function that rounds a DOUBLE value to the specified number of decimal places.

Syntax

ROUND(double,number_decimal_places)

Return Value

Returns one value per row of type DOUBLE.

Input Parameters

double

Required. A field or expression of type DOUBLE.

number_decimal_places

Required. An integer that specifies the number of decimal places to round to.


Examples

Round the number 32.4678954 to two decimal places:

ROUND(32.4678954,2) returns 32.47

Data Type Conversion Functions

Data type conversion functions allow you to cast data values from one data type to another. These functions are used implicitly whenever you set the data type of a field or column in the Platfora user interface. The supported data types are: INTEGER, LONG, DOUBLE, FIXED, DATETIME, and STRING.

EPOCH_MS_TO_DATE

EPOCH_MS_TO_DATE is a row function that converts LONG values to DATETIME values, where the input number represents the number of milliseconds since the epoch.

Syntax

EPOCH_MS_TO_DATE(long_expression)

Return Value

Returns one value per row of type DATETIME in UTC format yyyy-MM-dd HH:mm:ss:SSS Z.

Input Parameters

long_expression

Required. A field or expression of type LONG representing the number of milliseconds since the epoch datetime (January 1, 1970 00:00:00:000 GMT).

Examples

Convert a number representing the number of milliseconds from the epoch to a human-readable date and time:

EPOCH_MS_TO_DATE(1360260240000) returns 2013-02-07T18:04:00:000Z or February 7, 2013 18:04:00:000 GMT

Or if your data is in seconds instead of milliseconds:

EPOCH_MS_TO_DATE(1360260240 * 1000) returns 2013-02-07T18:04:00:000Z or February 7, 2013 18:04:00:000 GMT

TO_CURRENCY

This function is deprecated. Use the TO_FIXED function instead.


TO_DATE

TO_DATE is a row function that converts STRING values to DATETIME values, and specifies the format of the date and time elements in the string.

Syntax

TO_DATE(string_expression,"date_format")

Return Value

Returns one value per row of type DATETIME (which by definition is in UTC).

Input Parameters

string_expression

Required. A field or expression of type STRING.

date_format

Required. A pattern that describes how the date is formatted.

Date Pattern Format

Use the following pattern symbols to define your date format. The count and ordering of the pattern letters determines the datetime format. Any characters in the pattern that are not in the ranges of a-z and A-Z are treated as quoted delimiter text. For instance, characters such as slash (/) or colon (:) will appear in the resulting output even if they are not escaped with single quotes.

Table 3: Date Pattern Symbols

Symbol | Meaning | Presentation | Examples | Notes
G  | era | text | AD |
C  | century of era (0 or greater) | number | 20 |
Y  | year of era (0 or greater) | year | 1996 | Numeric presentation for year and weekyear fields is handled specially. For example, if the count of 'y' is 2, the year will be displayed as the zero-based year of the century, which is two digits.
x  | week year | year | 1996 | See the note for Y.
w  | week number of week year | number | 27 |
e  | day of week (number) | number | 2 |
E  | day of week (name) | text | Tuesday; Tue | If the number of pattern letters is 4 or more, the full form is used; otherwise a short or abbreviated form is used.
y  | year | year | 1996 |
D  | day of year | number | 189 |
M  | month of year | month | July; Jul; 07 | If the number of pattern letters is 3 or more, the text form is used; otherwise the number is used.
d  | day of month | number | 10 |
a  | half day of day | text | PM |
K  | hour of half day (0-11) | number | 0 |
h  | clock hour of half day (1-12) | number | 12 |
H  | hour of day (0-23) | number | 0 |
k  | clock hour of day (1-24) | number | 24 |
m  | minute of hour | number | 30 |
s  | second of minute | number | 55 |
S  | fraction of second | number | 978 |
z  | time zone | text | Pacific Standard Time; PST | If the number of pattern letters is 4 or more, the full form is used; otherwise a short or abbreviated form is used.
Z  | time zone offset/id | zone | -0800; -08:00; America/Los_Angeles | 'Z' outputs offset without a colon, 'ZZ' outputs the offset with a colon, 'ZZZ' or more outputs the zone id.
'  | escape character for text-based delimiters | delimiter | |
'' | literal representation of a single quote | literal | ' |

Examples

Define a new DATETIME computed field based on the order_date base field, which contains timestamps in the format of: 2014.07.10 at 15:08:56 PDT:

TO_DATE(order_date,"yyyy.MM.dd 'at' HH:mm:ss z")


Define a new DATETIME computed field by first combining individual month, day, year, and depart_time fields (using CONCAT), and performing a transformation on depart_time to make sure three-digit times are converted to four-digit times (using REGEX_REPLACE):

TO_DATE(CONCAT(month,"/",day,"/",year,":",REGEX_REPLACE(depart_time,"\b(\d{3})\b","0$1")),"MM/dd/yyyy:HHmm")

Define a new DATETIME computed field based on the created_at base field, which contains timestamps in the format of: Sat Jan 25 16:35:23 +0800 2014 (this is the timestamp format returned by Twitter's API):

TO_DATE(created_at,"EEE MMM dd HH:mm:ss Z yyyy")

TO_DOUBLE

TO_DOUBLE is a row function that converts STRING, INTEGER, LONG, or DOUBLE values to DOUBLE (decimal) values.

Syntax

TO_DOUBLE(expression)

Return Value

Returns one value per row of type DOUBLE.

Input Parameters

expression

Required. A field or expression of type STRING (must be numeric characters), INTEGER, LONG, orDOUBLE.

Examples

Convert the values of the average_rating field to a double data type:

TO_DOUBLE(average_rating)

Convert the average_rating field to a double data type, but first transform the occurrence of any N/A values to NULL values using a CASE expression:

TO_DOUBLE(CASE WHEN average_rating="N/A" then NULL ELSE average_rating END)

TO_FIXED

TO_FIXED is a row function that converts STRING, INTEGER, LONG, or DOUBLE values to fixed-decimal values. Using a FIXED data type to represent monetary values allows you to calculate and aggregate monetary values with accuracy to a ten-thousandth of a monetary unit.

Syntax

TO_FIXED(expression)


Return Value

Returns one value per row of type FIXED (fixed-decimal value to 10000th accuracy).

Input Parameters

expression

Required. A field or expression of type STRING (must be numeric characters), INTEGER, LONG, orDOUBLE.

Examples

Convert the opening_price field to a fixed decimal data type:

TO_FIXED(opening_price)

Convert the sale_price field to a fixed decimal data type, but first transform the occurrence of any N/A string values to NULL values using a CASE expression:

TO_FIXED(CASE WHEN sale_price="N/A" then NULL ELSE sale_price END)

TO_INT

TO_INT is a row function that converts STRING, INTEGER, LONG, or DOUBLE values to INTEGER (whole number) values. When converting DOUBLE values, everything after the decimal will be truncated (not rounded up or down).

Syntax

TO_INT(expression)

Return Value

Returns one value per row of type INTEGER.

Input Parameters

expression

Required. A field or expression of type STRING, INTEGER, LONG, or DOUBLE. If a STRING field contains non-numeric characters, the function returns NULL, which Platfora converts to the default value in a lens (by default, the default value for INTEGER fields is 0).

Examples

Convert the values of the average_rating field to an integer data type:

TO_INT(average_rating)

Convert the flight_duration field to an integer data type, but first transform the occurrence of any N/A values to NULL values using a CASE expression:

TO_INT(CASE WHEN flight_duration="N/A" then NULL ELSE flight_duration END)
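
A hypothetical example of the truncation behavior with a literal decimal value:

TO_INT(7.9) returns 7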

TO_LONG

TO_LONG is a row function that converts STRING, INTEGER, LONG, or DOUBLE values to LONG (whole number) values. When converting DOUBLE values, everything after the decimal will be truncated (not rounded up or down).

Syntax

TO_LONG(expression)

Return Value

Returns one value per row of type LONG.

Input Parameters

expression

Required. A field or expression of type STRING (must be numeric characters only, no period or comma), INTEGER, LONG, or DOUBLE. When a STRING field value includes a decimal, the function returns a NULL value.

Examples

Convert the values of the average_rating field to a long data type:

TO_LONG(average_rating)

Convert the average_rating field to a long data type, but first transform the occurrence of any N/A values to NULL values using a CASE expression:

TO_LONG(CASE WHEN average_rating="N/A" then NULL ELSE average_rating END)

TO_STRING

TO_STRING is a row function that converts values of other data types to STRING (character) values.

Syntax

TO_STRING(expression)

TO_STRING(datetime_expression,date_format)

Return Value

Returns one value per row of type STRING.

Input Parameters

expression


A field or expression of type FIXED, STRING, INTEGER, LONG, or DOUBLE.

datetime_expression

A field or expression of type DATETIME.

date_format

If converting a DATETIME to a string, a pattern that describes how the date is formatted. See TO_DATE for the date format patterns.

Examples

Convert the values of the sku_number field to a string data type:

TO_STRING(sku_number)

Convert values in the age column into range-based groupings (binning), and cast the output values to a STRING:

TO_STRING(CASE WHEN age <= 25 THEN "0-25" WHEN age <= 50 THEN "26-50" ELSE "over 50" END)

Convert the values of a timestamp datetime field to a string, where the timestamp values are in the format of: 2002.07.10 at 15:08:56 PDT:

TO_STRING(timestamp,"yyyy.MM.dd 'at' HH:mm:ss z")

Aggregate Functions

An aggregate function groups the values of multiple rows together based on some defined input expression. Aggregate functions return one value for a group of rows, and are only valid for defining measures in Platfora. Aggregate functions cannot be combined with row functions.

AVG

AVG is an aggregate function that returns the average of all valid numeric values. It sums all values in the provided expression and divides by the number of valid (NOT NULL) rows. If you want to compute an average that includes all values in the row count (including NULL values), you can use a SUM/COUNT expression instead.

Syntax

AVG(numeric_field)

Return Value

Returns a value of type DOUBLE.

Input Parameters

numeric_field


Required. A field of type INTEGER, LONG, DOUBLE, or FIXED. Unlike row functions, aggregate functions can only take field names as input.

Examples

Get the average of the valid sale_amount field values:

AVG(sale_amount)

Get the average of the valid net_worth field values in the billionaires dataset, which resides in the samples namespace:

AVG([(samples) billionaires].net_worth)

Get the average of all page_views field values in the web_logs dataset (including NULL values):

SUM(page_views)/COUNT(web_logs)

COUNT

COUNT is an aggregate function that returns the number of rows in a dataset.

Syntax

COUNT([namespace_name]dataset_name)

Return Value

Returns a value of type INTEGER.

Input Parameters

namespace_name

Optional. The name of the namespace in which the dataset resides. If not specified, uses the default namespace.

dataset_name

Required. The name of the dataset for which to obtain a count of rows. If you want to count rows of a down-stream dataset that is related to the current dataset, you can specify the hierarchy of dataset names in the format of: parent_dataset_name.child_dataset_name.[...]

Examples

Count the rows in the sales dataset:

COUNT(sales)

Count the rows in the billionaires dataset, which resides in the samples namespace:

COUNT([(samples) billionaires])


Count the rows in the customer dataset, which is a related dataset down-stream of sales:

COUNT(sales.customers)

COUNT_VALID

COUNT_VALID is an aggregate function that returns the number of rows for which the given expression is valid (excludes NULL values).

Syntax

COUNT_VALID(field)

Return Value

Returns a numeric value of type INTEGER.

Input Parameters

field

Required. A field name. Unlike row functions, aggregate functions can only take field names as input.

Examples

Count the valid values in the page_views field:

COUNT_VALID(page_views)

DISTINCT

DISTINCT is an aggregate function that returns the number of distinct values for the given expression.

Syntax

DISTINCT(field)

Return Value

Returns a numeric value of type INTEGER.

Input Parameters

field

Required. A field name. Unlike row functions, aggregate functions can only take field names as input.

Examples

Count the unique values of the user_id field in the currently selected dataset:

DISTINCT(user_id)

Count the unique values of the name field in the billionaires dataset, which resides in the samples namespace:


DISTINCT([(samples) billionaires].name)

Count the unique values of the customer_id field in the customer dataset, which is a related dataset down-stream of web sales:

DISTINCT([web sales].customers.customer_id)

MAX

MAX is an aggregate function that returns the biggest value from the given input expression.

Syntax

MAX(numeric_or_datetime_field)

Return Value

Returns a numeric or datetime value of the same type as the input expression.

Input Parameters

numeric_or_datetime_field

Required. A field of type INTEGER, LONG, DOUBLE, FIXED, or DATETIME. Unlike row functions, aggregate functions can only take field names as input.

Examples

Get the highest value from the sale_amount field:

MAX(sale_amount)

Get the latest date from the Session Timestamp datetime field:

MAX([Session Timestamp])

MIN

MIN is an aggregate function that returns the smallest value from the given input expression.

Syntax

MIN(numeric_or_datetime_field)

Return Value

Returns a numeric or datetime value of the same type as the input expression.

Input Parameters

numeric_or_datetime_field

Required. A field of type INTEGER, LONG, DOUBLE, FIXED, or DATETIME. Unlike row functions, aggregate functions can only take field names as input.


Examples

Get the lowest value from the sale_amount field:

MIN(sale_amount)

Get the earliest date from the Session Timestamp datetime field:

MIN([Session Timestamp])

SUM

SUM is an aggregate function that returns the total of all values from the given input expression.

Syntax

SUM(numeric_field)

Return Value

Returns a numeric value of the same type as the input expression.

Input Parameters

numeric_field

Required. A field of type INTEGER, LONG, DOUBLE, or FIXED. Unlike row functions, aggregate functions can only take field names as input.

Examples

Add the values of the sale_amount field:

SUM(sale_amount)

Add the values of the session count field in the users dataset, which is a related dataset down-stream of clicks:

SUM(clicks.users.[session count])

STDDEV

STDDEV is an aggregate function that calculates the population standard deviation for a group of numeric values. Standard deviation is the square root of the variance.

Syntax

STDDEV(numeric_field)

Return Value

Returns a value of type DOUBLE. If there are fewer than two values in the input group, returns NULL.


Input Parameters

numeric_field

Required. A field of type INTEGER, LONG, DOUBLE, or FIXED. Unlike row functions, aggregate functions can only take field names as input.

Examples

Calculate the standard deviation of the values contained in the sale_amount field:

STDDEV(sale_amount)

VARIANCE

VARIANCE is an aggregate function that calculates the population variance for a group of numeric values. Variance measures the amount by which all values in a group vary from the average value of the group. Data with low variance contains values that are identical or similar. Data with high variance contains values that are not similar. Variance is calculated as the average of the squares of the deviations from the mean. Squaring the deviations ensures that negative and positive deviations do not cancel each other out.

Syntax

VARIANCE(numeric_field)

Return Value

Returns a value of type DOUBLE. If there are fewer than two values in the input group, returns NULL.

Input Parameters

numeric_field

Required. A field of type INTEGER, LONG, DOUBLE, or FIXED. Unlike row functions, aggregate functions can only take field names as input.

Examples

Get the population variance of the values contained in the sale_amount field:

VARIANCE(sale_amount)

ROLLUP and Window Functions

Window functions can only be used in conjunction with ROLLUP. ROLLUP is a modifier to an aggregate expression that determines the partitioning and ordering of a rowset before the associated aggregate function or window function is applied. ROLLUP defines a window, or user-specified set of rows, within a query result set. A window function then computes a value for each row in the window. You can use window functions to compute aggregated values such as moving averages, cumulative aggregates, running totals, or top-N-per-group results.


ROLLUP

ROLLUP is a modifier to an aggregate function that turns a regular aggregate function into a windowed, partitioned, or adaptive aggregate function. This is useful when you want to compute an aggregation over a subset of rows within the overall result of a viz query.

Syntax

ROLLUP aggregate_expression
  [ WHERE input_group_condition [...] ]
  [ TO ([partitioning_columns])
    [ ORDER BY (ordering_column [ASC | DESC])
      ROWS|RANGE window_boundary | BETWEEN window_boundary AND window_boundary ] ]

where window_boundary can be one of:

UNBOUNDED PRECEDING
value PRECEDING
value FOLLOWING
UNBOUNDED FOLLOWING

Description

A regular measure is the result of an aggregation (such as SUM or AVG) applied to some fact or metric column of a dataset. For example, suppose we had a dataset with the following rows and columns:

Date         Sale Amount   Product   Region
05/01/2013   100           gadget    west
05/01/2013   200           widget    east
06/01/2013   100           gadget    east
06/01/2013   400           widget    west
07/01/2013   300           widget    west
07/01/2013   200           gadget    east

To define a regular measure called Total Sales, we would use the expression:

SUM([Sale Amount])


When this measure is used in a visualization, the group of input records passed into the aggregate calculation is determined by the dimensions selected by the user when they create the viz. For example, if the user chose Region as a dimension in the viz, there would be two input groups for which the measure would be calculated:

Total Sales / Region

east   west
500    800

If an aggregate expression includes a ROLLUP clause, the column(s) specified in the TO clause of the ROLLUP expression determine the additional partitions over which to compute the aggregate expression. It divides the overall rows returned by the viz query into subsets or buckets, and then computes the aggregate expression within each bucket. Every ROLLUP expression has implicit partitioning defined: an absent TO clause treats the entire result set as one partition; an empty TO clause partitions by whatever dimension columns are present in the viz query.
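For example, both of the following are valid; the first (no TO clause) computes one total over the entire result set, while the second (empty TO clause) partitions by the dimension columns present in the viz query. A minimal sketch using the Sale Amount field from the example dataset above:

ROLLUP SUM([Sale Amount])

ROLLUP SUM([Sale Amount]) TO ()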

The WHERE clause is used to filter the input rows that flow into each partition. Input rows that meet the WHERE clause criteria flow into the partitions; rows that do not are excluded.

The ORDER BY clause, together with a RANGE or ROWS clause, is used to define a window frame within each partition over which to compute the aggregate expression.

When a ROLLUP measure is used in a visualization, the aggregate calculation is computed across a set of input rows that are related to, but separate from, the other dimension(s) used in the viz. This is similar to the type of calculation that is done with a regular measure. However, unlike a regular measure, a ROLLUP measure does not cause the input rows to be grouped into a single result set; the input rows still retain their separate identities. The ROLLUP clause determines how the input rows are split up for processing by the ROLLUP's aggregate function.

ROLLUP expressions can be written to make the partitioning adaptive to whatever dimension columns are selected in the visualization. This is done by using a reference name as the partitioning column, as opposed to a regular column. For example, suppose we wanted to be able to calculate the total sales for any granularity of date. We could create an adaptive measure called Rollup Sales to Date that partitions total sales by date as follows:

ROLLUP SUM([Sale Amount]) TO (Date)

When this measure is used in a visualization, the group of input records passed into the aggregate calculation is determined by the dimension fields selected by the user in the viz, but partitioned by the granularity of Date selected by the user. For example, if the user chose the dimensions Date.Month and Region in the viz, then total sales would be grouped by month and region, but the ROLLUP measure expression would aggregate the sales by month only.


Notice that the results for the east and west regions are the same; this is because the aggregation expression is only considering rows that share the same month when calculating the sum of sales.

Month / (Measures) / Region

                       May 2013        June 2013       July 2013
                       east    west    east    west    east    west
Rollup Sales to Date   300     300     500     500     500     500

Suppose within the date partition, we wanted to calculate the cumulative total day to day. We could define a window measure called Running Total to Date that looks at each day and all preceding days as follows:

ROLLUP SUM([Sale Amount]) TO (Date) ORDER BY (Date.Date) ROWS UNBOUNDED PRECEDING

When this measure is used in a visualization, the group of input records passed into the aggregate calculation is determined by the dimension fields selected by the user in the viz, and partitioned by the granularity of Date selected by the user. Within each partition the rows are ordered chronologically (by Date.Date), and the sum amount is then calculated per date partition by looking at the current row (or mark), and all rows that come before it within the partition. For example, if the user chose the dimension Date.Month in the viz, then the ROLLUP measure expression would cumulatively aggregate the sales within each month.

Month / (Measures) / Date.Date

                        May 2013     June 2013    July 2013
                        2013-05-01   2013-06-01   2013-07-01
Running Total to Date   300          500          500

Return Value

Returns a numeric value per partition based on the output type of the aggregate_expression.

Input Parameters

aggregate_expression

Required. An expression containing an aggregate or window function. Simple aggregate functions such as COUNT, AVG, SUM, MIN, and MAX are supported. Window functions such as RANK, DENSE_RANK, and NTILE are supported and can only be used in conjunction with ROLLUP.

Complex aggregate functions such as STDDEV and VARIANCE are not supported.

WHERE input_group_condition

The WHERE clause limits the group of input rows over which to compute the aggregate expression. The input group condition is a Boolean (true or false) condition defined using a comparison operator expression. Any row that does not satisfy the condition will be excluded from the input group used to calculate the aggregated measure value. For example (note that datetime values must be specified in yyyy-MM-dd format):

WHERE Date.Date BETWEEN 2012-06-01 AND 2012-07-31

WHERE Date.Year BETWEEN 2009 AND 2013

WHERE Company LIKE("Plat*")

WHERE Code IN("a","b","c")

WHERE Sales < 50.00

WHERE Age >= 21

You can specify multiple WHERE clauses in a ROLLUP expression.

TO ([partitioning_columns])

The TO clause is used to specify the dimension column(s) used to partition a group of input rows. This allows you to calculate a measure value for a specific dimension group (a subset of input rows) that is somehow related to the other dimension groups used in a visualization (all input rows). It is possible to define an empty group (meaning all rows) by using empty parentheses.

When used in a visualization, measure values are computed for groups of input rows that return the same value for the columns specified in the partitioning list. For example, if the Date.Month column is used as a partitioning column, then all records that have the same value for Date.Month will be grouped together in order to calculate the measure value. The aggregate expression is applied to the group specified in the TO clause independently of the other dimension groupings used in the visualization. Note that the partitioning column(s) specified in the TO clause of an adaptive measure expression must also be included as dimensions (or grouping columns) in the visualization.

A partitioning column can also be the name of a reference field. Using a reference field allows the partition criteria to dynamically adapt based on any field of the referenced dataset that is used in a viz. For example, if the partition column is a reference field pointing to the Date dimension, then any sub-field of Date (Date.Year, Date.Month, etc.) can be used as the partitioning column by selecting it in a viz.

A TO clause with an empty partitioning list treats each mark in the result set as an input group. For example, if the viz includes the Month and Region columns, then TO() would be equivalent to TO(Month,Region).

ORDER BY (ordering_column)


The optional ORDER BY clause orders the input rows using the values in the specified column within each partition identified in the TO clause. Use the ORDER BY clause along with the ROWS or RANGE clauses to define windows over which to compute the aggregate function. This is useful for computing moving averages, cumulative aggregates, running totals, or a top value per group of input rows. The ordering column specified in the ORDER BY clause can be a dimension, measure, or an aggregate expression (for example ORDER BY (SUM(Sales))). If the ordering column is a dimension, it must be included in the viz.

By default, rows are sorted in ascending order (low to high values). You can use the DESC keyword to sort in descending order (high to low values).

ROWS | RANGE

Required when using ORDER BY. Further limits the rows within the partition by specifying start and end points within the partition. This is done by specifying a range of rows with respect to the current row either by logical association (RANGE) or physical association (ROWS). Use either a ROWS or RANGE clause to express the window boundary (the set of input rows in each partition, relative to the current row, over which to compute the aggregate expression). The window boundary can include one, several, or all rows of the partition.

When using the RANGE clause, the ordering column used in the ORDER BY clause must be a sub-column of a reference to Platfora's built-in Date dimension dataset.

window_boundary

A window boundary is required when using either ROWS or RANGE. This defines the set of rows, relative to the current row, over which to compute the aggregate expression. The row order is based on the ordering specified in the ORDER BY clause.

A PRECEDING clause defines a lower window boundary (the number of rows to include before the current row). The FOLLOWING clause defines an upper window boundary (the number of rows to include after the current row). The window boundary expression must include either a PRECEDING or FOLLOWING clause, or both. If PRECEDING is omitted, the current row is considered the first row in the window. Similarly, if FOLLOWING is omitted, the current row is considered the last row in the window. The UNBOUNDED keyword includes all rows in the direction specified. When you need to specify both a start and end of a window, use the BETWEEN and AND keywords.

For example:

ROWS 2 PRECEDING means that the window is three rows in size, starting with the two rows preceding, up to and including the current row.

ROWS BETWEEN 2 PRECEDING AND 5 FOLLOWING means that the window is eight rows in size: the two rows preceding, the current row, and the five rows following the current row. The current row is included in the set of rows by default.

You can exclude the current row from the window by specifying a window start and end point before or after the current row. For example:


ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING starts the window with all rows that come before the current row, and ends the window one row before the current row, thereby excluding the current row from the window.

Examples

Calculate the percentage of flight records in the same departure date period. Note that the departure_date field is a reference to the Date dataset, meaning that the group to which the measure is applied can adapt to any downstream field of departure_date (departure_date.Year, departure_date.Month, and so on). When used in a viz, this will calculate the percentage of flights for each dimension group in the viz that share the same value for departure_date:

100 * COUNT(Flights) / ROLLUP COUNT(Flights) TO ([Departure Date])

Normalize the number of flights using the carrier American Airlines (AA) as the benchmark. This will allow you to compare the number of flights for other carriers against the fixed baseline number of flights for AA (if AA = 100 percent, then all other carriers will fall either above or below that percentage):

100 * COUNT(Flights) / ROLLUP COUNT(Flights) WHERE [Carrier Code]="AA"

Calculate a generic percentage of total sales. When this measure is used in a visualization, it will show the percentage of total sales that a mark in the viz is contributing to the total for all marks in the viz. The input rows depend on the dimensions selected in the viz.

100 * SUM(sales) / ROLLUP SUM(sales) TO ()

Calculate the cumulative total of sales for a given year on a month-to-month basis (year-to-month sales totals):

ROLLUP SUM(sales) TO (Date.Year) ORDER BY (Date.Month) ROWS UNBOUNDED PRECEDING

Calculate the cumulative total of sales (for all input rows) for all previous years, but exclude the current year from the total.

ROLLUP SUM(sales) TO () ORDER BY (Date.Year) ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING

DENSE_RANK

DENSE_RANK is a windowing aggregate function that orders rows by a measure value and assigns a rank number to each row in the given partition. Rank positions are not skipped in the event of a tie. DENSE_RANK must be used within a ROLLUP expression.

Syntax

ROLLUP DENSE_RANK() TO ([partitioning_column])
  [ ORDER BY (measure_expression [ASC | DESC])
    ROWS|RANGE window_boundary | BETWEEN window_boundary AND window_boundary ]

where window_boundary can be one of:

UNBOUNDED PRECEDING
value PRECEDING
value FOLLOWING
UNBOUNDED FOLLOWING

Description

DENSE_RANK is a window aggregate function used to assign a ranking number to each row in a group. If multiple rows have the same ranking value (there is a tie), then the tied rows are given the same rank value and subsequent rank positions are not skipped.

The TO clause of the ROLLUP is used to specify the dimension column(s) used to partition a group of input rows. To define a global ranking that can adapt to any dimension groupings used in a viz, specify an empty TO clause.

The ORDER BY clause of the ROLLUP expression determines how to order the rows before they are ranked. The ORDER BY clause should specify the measure field for which you want to calculate the ranks. The ranked rows in the partition are numbered starting at one.

For example, suppose we had a dataset with the following rows and columns and you want to rank the Quarters and Regions according to the values in the Sales column.

Quarter   Region   Sales
2010 Q1   North    100
2010 Q1   South    200
2010 Q1   East     300
2010 Q1   West     400
2010 Q2   North    400
2010 Q2   South    250
2010 Q2   East     150
2010 Q2   West     250


Supposing the lens has an existing measure field called Sales(Sum), you could then define a measure called Sales_Dense_Rank using the following expression:

ROLLUP DENSE_RANK() TO () ORDER BY ([Sales(Sum)] DESC) ROWS UNBOUNDED PRECEDING

When you include the Quarter, Region, and Sales_Dense_Rank columns in the viz, you get the following data points. Notice that tied values are given the same rank number and no rank positions are skipped:

Quarter   Region   Sales_Dense_Rank
2010 Q1   North    6
2010 Q1   South    4
2010 Q1   East     2
2010 Q1   West     1
2010 Q2   North    1
2010 Q2   South    3
2010 Q2   East     5
2010 Q2   West     3

Return Value

Returns a value of type LONG.

Input Parameters

ROLLUP

Required. DENSE_RANK must be used within a ROLLUP expression in place of the aggregate_expression of the ROLLUP.

The TO clause of the ROLLUP expression specifies the dimension group(s) over which to calculate the window function. An empty TO calculates the window function over all rows in the query as one group.

The ORDER BY clause of the ROLLUP expression specifies a measure field or aggregate expression.

Examples

Rank the sum of all sales in descending order, so the highest sales is given the ranking of 1.

ROLLUP DENSE_RANK() TO () ORDER BY ([Sales(Sum)] DESC) ROWS UNBOUNDED PRECEDING


Rank the sum of all sales within a given quarter in descending order, so the highest sales in each quarter is given the ranking of 1.

ROLLUP DENSE_RANK() TO (Quarter) ORDER BY ([Sales(Sum)] DESC) ROWS UNBOUNDED PRECEDING

NTILE

NTILE is a windowing aggregate function that divides a partitioned group of rows into the specified number of buckets, and returns the bucket number to which the current row belongs. NTILE must be used within a ROLLUP expression.

Syntax

ROLLUP NTILE(integer) TO ([partitioning_column])
  [ ORDER BY (measure_expression [ASC | DESC])
    ROWS|RANGE window_boundary | BETWEEN window_boundary AND window_boundary ]

where window_boundary can be one of:

UNBOUNDED PRECEDING
value PRECEDING
value FOLLOWING
UNBOUNDED FOLLOWING

Description

NTILE is a window aggregate function typically used to calculate percentiles. A percentile (or centile) is a measure used in statistics indicating the value below which a given percentage of records in a group falls. For example, the 20th percentile is the value (or score) below which 20 percent of the records may be found. The term percentile is often used in the reporting of test scores. For example, if a score is in the 86th percentile, it is higher than 86% of the other scores. The 25th percentile is also known as the first quartile (Q1), the 50th percentile as the median or second quartile (Q2), and the 75th percentile as the third quartile (Q3). In general, percentiles, deciles, and quartiles are specific types of ntiles.

NTILE must be used within a ROLLUP expression in place of the aggregate_expression of the ROLLUP.

The TO clause of the ROLLUP is used to specify a fixed dimension column used to partition a group of input rows. To define a global NTILE ranking that can adapt to any dimension groupings used in a viz, specify an empty TO clause.

The ORDER BY clause of the ROLLUP expression determines how to order the rows before they are divided into buckets. The ORDER BY clause should specify the measure field for which you want to calculate NTILE bucket values. A centile would be 100 buckets, a decile would be 10 buckets, a quartile 4 buckets, and so on. The buckets in the partition are numbered starting at one.
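For example, a decile ranking (10 buckets) over an existing Sales(Sum) measure field would follow the same pattern as the quartile example below:

ROLLUP NTILE(10) TO () ORDER BY ([Sales(Sum)] DESC) ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING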

For example, suppose we had a dataset with the following rows and columns and you want to divide the year-to-date sales into four buckets (quartiles) with the highest quartile ranked as 1 and the lowest ranked as 4. Supposing a measure field has been defined called Sum_YTD_Sales, defined as SUM([Sales YTD]), you could then define a measure called YTD_Sales_Quartile using the following expression:

ROLLUP NTILE(4) TO () ORDER BY (Sum_YTD_Sales DESC) ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING

Name      Gender   Sales YTD   YTD_Sales_Quartile
Chen      F        3,500,000   1
John      M        3,100,000   1
Pete      M        2,900,000   1
Daria     F        2,500,000   2
Jennie    F        2,200,000   2
Mary      F        2,100,000   2
Mike      M        1,900,000   3
Brian     M        1,700,000   3
Molly     F        1,500,000   3
Theresa   F        1,200,000   4
Hans      M        900,000     4
Ben       M        500,000     4

Because the TO clause of the ROLLUP expression is empty, the quartile partitioning adapts to whatever dimensions are used in the viz. For example, if you include the Gender dimension field in the viz, the quartiles would then be computed per gender. The following example divides each gender into buckets, with each gender having 6 year-to-date sales values. The two extra values (the remainder of 6 / 4) are allocated to buckets 1 and 2, which therefore have one more value than buckets 3 or 4.

Name      Gender   Sales YTD   YTD_Sales_Quartile (partitioned by Gender)
Chen      F        3,500,000   1
Daria     F        2,500,000   1
Jennie    F        2,200,000   2
Mary      F        2,100,000   2
Molly     F        1,500,000   3
Theresa   F        1,200,000   4
John      M        3,100,000   1
Pete      M        2,900,000   1
Mike      M        1,900,000   2
Brian     M        1,700,000   2
Hans      M        900,000     3
Ben       M        500,000     4

Return Value

Returns a value of type LONG.

Input Parameters

ROLLUP

Required. NTILE must be used within a ROLLUP expression in place of the aggregate_expression of the ROLLUP.

The TO clause of the ROLLUP expression specifies the dimension group(s) over which to calculate the window function. An empty TO calculates the window function over all rows in the query as one group.

The ORDER BY clause of the ROLLUP expression specifies a measure field or aggregate expression.

integer

Required. An integer that specifies the number of buckets to divide the partitioned rows into.

Examples

Perhaps the most common use case for NTILE is to get a global ranking of result rows. For example, if you wanted to get the percentile of Total Records per City, you may think the expression to use is: ROLLUP NTILE(100) TO (City) ORDER BY ([Total Records] DESC) ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.

However, by leaving the TO clause blank, the percentile buckets can adapt to whatever dimension(s) you use in the viz. To calculate the Total Records percentiles by City, you could define a global Total_Records_Percentiles measure and then use this measure in conjunction with the City dimension in the viz (or any other dimension for that matter).

ROLLUP NTILE(100) TO () ORDER BY ([Total Records] DESC) ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING

RANK

RANK is a windowing aggregate function that orders rows by a measure value and assigns a rank number to each row in the given partition. Rank positions are skipped in the event of a tie. RANK must be used within a ROLLUP expression.

Syntax

ROLLUP RANK() TO ([partitioning_column])
  [ ORDER BY (measure_expression [ASC | DESC])
    ROWS|RANGE window_boundary | BETWEEN window_boundary AND window_boundary ]

where window_boundary can be one of:

UNBOUNDED PRECEDING
value PRECEDING
value FOLLOWING
UNBOUNDED FOLLOWING

Description

RANK is a window aggregate function used to assign a ranking number to each row in a group. If multiple rows have the same ranking value (there is a tie), then the tied rows are given the same rank value and the subsequent rank position is skipped.

The TO clause of the ROLLUP is used to specify the dimension column(s) used to partition a group of input rows. To define a global ranking that can adapt to any dimension groupings used in a viz, specify an empty TO clause.

The ORDER BY clause of the ROLLUP expression determines how to order the rows before they are ranked. The ORDER BY clause should specify the measure field for which you want to calculate the ranks. The ranked rows in the partition are numbered starting at one.

For example, suppose we had a dataset with the following rows and columns and you want to rank the Quarters and Regions according to the values in the Sales column.

Quarter   Region   Sales
2010 Q1   North    100
2010 Q1   South    200
2010 Q1   East     300
2010 Q1   West     400
2010 Q2   North    400
2010 Q2   South    250
2010 Q2   East     150
2010 Q2   West     250

Supposing the lens has an existing measure field called Sales(Sum), you could then define a measure called Sales_Rank using the following expression:

ROLLUP RANK() TO () ORDER BY ([Sales(Sum)] DESC) ROWS UNBOUNDED PRECEDING

When you include the Quarter, Region, and Sales_Rank columns in the viz, you get the following data points. Notice that tied values are given the same rank number and the rank positions 2 and 5 are skipped:

Quarter   Region   Sales_Rank
2010 Q1   North    8
2010 Q1   South    6
2010 Q1   East     3
2010 Q1   West     1
2010 Q2   North    1
2010 Q2   South    4
2010 Q2   East     7
2010 Q2   West     4

Return Value

Returns a value of type LONG.


Input Parameters

ROLLUP

Required. RANK must be used within a ROLLUP expression in place of the aggregate_expression of the ROLLUP.

The TO clause of the ROLLUP expression specifies the dimension group(s) over which to calculate the window function. An empty TO calculates the window function over all rows in the query as one group.

The ORDER BY clause of the ROLLUP expression specifies a measure field or aggregate expression.

Examples

Rank the sum of all sales in descending order, so the highest sales is given the ranking of 1.

ROLLUP RANK() TO () ORDER BY ([Sales(Sum)] DESC) ROWS UNBOUNDED PRECEDING

Rank the sum of all sales within a given quarter in descending order, so the highest sales in each quarter is given the ranking of 1.

ROLLUP RANK() TO (Quarter) ORDER BY ([Sales(Sum)] DESC) ROWS UNBOUNDED PRECEDING

ROW_NUMBER

ROW_NUMBER is a windowing aggregate function that assigns a unique, sequential number to each row in a group (partition) of rows, starting at 1 for the first row in each partition. ROW_NUMBER must be used within a ROLLUP expression, which acts as a modifier for ROW_NUMBER. Use the ORDER BY clause of the ROLLUP to determine the column that controls the row numbering.

Syntax

ROLLUP ROW_NUMBER() TO ([partitioning_column])
  [ ORDER BY (measure_expression [ASC | DESC])
    ROWS|RANGE window_boundary | BETWEEN window_boundary AND window_boundary ]

where window_boundary can be one of:

UNBOUNDED PRECEDING
value PRECEDING
value FOLLOWING
UNBOUNDED FOLLOWING


Description

ROW_NUMBER assigns a unique, sequential number to each row within a partition; unlike RANK and DENSE_RANK, rows with tied values receive distinct numbers. For example, suppose we had a dataset with the following rows and columns:

Quarter   Region   Sales
2010 Q1   North    100
2010 Q1   South    200
2010 Q1   East     300
2010 Q1   West     400
2010 Q2   North    400
2010 Q2   South    250
2010 Q2   East     150
2010 Q2   West     250

Suppose you want to assign a unique ID to the sales of each region by quarter in descending order. In this example, a measure field is defined called Sum_Sales with the expression SUM(Sales). You could then define a measure called SalesNumber using the following expression:

ROLLUP ROW_NUMBER() TO (Quarter) ORDER BY (Sum_Sales DESC) ROWS UNBOUNDED PRECEDING

When you include the Quarter, Region, and SalesNumber columns in the viz, you get the following data points:

Quarter   Region   SalesNumber
2010 Q1   North    4
2010 Q1   South    3
2010 Q1   East     2
2010 Q1   West     1
2010 Q2   North    1
2010 Q2   South    2
2010 Q2   East     4
2010 Q2   West     3


Return Value

Returns a value of type LONG.

Input Parameters

None

Examples

Assign a unique ID to the sales of each region by quarter in descending order, so the highest sales is given the number of 1.

ROLLUP ROW_NUMBER() TO (Quarter) ORDER BY (Sum_Sales DESC) ROWS UNBOUNDED PRECEDING

User Defined Functions (UDFs)

User defined functions (UDFs) allow you to define your own per-row processing logic, and then expose that functionality to users in the Platfora application expression builder.

User defined functions can only be used to implement new row functions, not aggregate functions. If a computed field that uses a UDF is included in a lens, the UDF will be executed once for each row during the lens build process. This is good to keep in mind when writing UDF Java programs, so you do not write programs that negatively impact lens build resources or execution times.

Writing a Platfora UDF Java Program

User defined functions (UDFs) are written in the Java programming language and implement the Platfora-provided Java interface, com.platfora.udf.UserDefinedFunction.

Verify that any JAR file that the UDF will use is compatible with the existing libraries Platfora uses. You can find those libraries in $PLATFORA_HOME/lib.

To define a user defined function for Platfora, you must have the Java Development Kit (JDK) version 6 or 7 installed on the machine where you plan to do your development.

You will also need the com.platfora.udf.UserDefinedFunction interface Java code from your Platfora master server installation. If you go to the $PLATFORA_HOME/tools/udf directory of your Platfora master server installation, you will find two files:

• platfora-udf.jar – This is the compiled code for the com.platfora.udf.UserDefinedFunction interface. You must link to this jar file (place it in the CLASSPATH) when you compile your UDF Java program.

• /com/platfora/udf/UserDefinedFunction.java – This is the source code for the Java interface that your UDF classes need to implement. The source code is provided as reference documentation of the Platfora UserDefinedFunction interface. You can refer to this file when writing your UDF Java programs.

1. Copy the file $PLATFORA_HOME/tools/udf/platfora-udf.jar to a directory on the machine where you plan to develop and compile your UDF program.

2. Write a Java program that implements the com.platfora.udf.UserDefinedFunction interface.

For example, here is a sample Java program that defines a REPEAT_STRING user defined function. This simple function repeats an input string a specified number of times.

import java.util.List;

/**
 * Sample user-defined function implementation that demonstrates
 * how to create a REPEAT_STRING function.
 */
public class RepeatString implements com.platfora.udf.UserDefinedFunction {

    /**
     * Returns the name of the user-defined function.
     * The first character in the name must be a letter,
     * and subsequent characters must be either letters,
     * digits, or underscores. You cannot name your function
     * the same name as an existing Platfora built-in function.
     * Names are case-insensitive.
     */
    @Override
    public String getFunctionName() {
        return "REPEAT_STRING";
    }

    /**
     * Returns one of the following values, reflecting the
     * return type of the user-defined function:
     * DATETIME, DOUBLE, FIXED, INTEGER, LONG, or STRING.
     */
    @Override
    public String getReturnType() {
        return "STRING";
    }

    /**
     * Returns an array of Strings, one for each of the
     * input arguments to the user-defined function,
     * specifying the required data type for each argument.
     * The Strings should be of the following values:
     * DATETIME, DOUBLE, FIXED, INTEGER, LONG, STRING.
     */
    @Override
    public String[] getArgumentTypes() {
        return new String[] { "STRING", "INTEGER" };
    }

    /**
     * Returns a human-readable description of what the function
     * does, to be displayed to Platfora users in the
     * Expression Builder. May return null.
     */
    @Override
    public String getDescription() {
        return "The REPEAT_STRING function returns an input string repeated "
            + "a specified number of times.";
    }

    /**
     * Returns a human-readable description explaining the
     * value that the function returns, to be displayed to
     * Platfora users in the Expression Builder. May return null.
     */
    @Override
    public String getReturnValueDescription() {
        return "Returns one value per row of type STRING";
    }

    /**
     * Returns a human-readable example of the function syntax,
     * to be displayed to Platfora users in the Expression
     * Builder. May return null.
     */
    @Override
    public String getExampleUsage() {
        return "CONCAT(\"It's a \", REPEAT_STRING(\"Mad \",4), \" World\")";
    }

    /**
     * The compute method performs the actual work of evaluating
     * the user-defined function. The method should operate on the
     * argument values provided to calculate the function return value
     * and return a Java object of the appropriate type to represent
     * the return value. The following mapping describes the Java
     * object type that is used to represent each Platfora data type:
     *   DATETIME -> java.lang.Long
     *   DOUBLE   -> java.lang.Double
     *   FIXED    -> java.lang.Long
     *   INTEGER  -> java.lang.Integer
     *   LONG     -> java.lang.Long
     *   STRING   -> java.lang.String
     *
     * Note on DATETIME type: datetime values in Platfora are represented as
     * Longs in Unix Epoch Time (the number of milliseconds since the epoch).
     *
     * Note on FIXED type: fixed-precision numbers in Platfora are represented
     * as Longs that have been scaled by a factor of 10,000. For example, the
     * fixed-precision value 2.5000 would be represented as the Java value
     * returned by the expression new Long(25000L).
     *
     * In the event that the user-defined function encounters invalid inputs,
     * or the function return value is not defined given the inputs provided,
     * the compute method should return null rather than throwing an exception.
     * The compute method should avoid throwing any exceptions.
     *
     * @param arguments The values of the function inputs.
     *        The entries in this list will match the specification
     *        provided by the getArgumentTypes method in type, number, and
     *        order: for example, if getArgumentTypes returned an array of
     *        length 3 with the values STRING, DOUBLE, STRING, then the
     *        arguments parameter will be a list of 3 Java objects: a
     *        java.lang.String, a java.lang.Double, and a java.lang.String.
     *        Any of the values within the arguments List may be null.
     */
    @Override
    public String compute(List arguments) {
        // cast the inputs to the correct types
        final String toRepeat = (String) arguments.get(0);
        final Integer numberOfRepeats = (Integer) arguments.get(1);

        // check for invalid inputs
        if (toRepeat == null || numberOfRepeats == null || numberOfRepeats < 0)
            return null;

        // repeat the input string the specified number of times
        final StringBuilder builder = new StringBuilder();
        for (int i = 0; i < numberOfRepeats; i++) {
            builder.append(toRepeat);
        }
        return builder.toString();
    }
}


3. Compile your .java UDF program file into a .class file (make sure to link to the platfora-udf.jar file or place it in your Java CLASSPATH).

For example, to compile the RepeatString.java program using Java 1.7:

javac -source 1.7 -target 1.7 -cp platfora-udf.jar RepeatString.java

4. Create a Java archive file (.jar) containing your .class file.

For example:

jar cf repeat-string-udf.jar RepeatString.class

After you have written and compiled your UDF Java program, you must then install and enable it on the Platfora master server. See Adding a UDF to the Platfora Expression Builder.

Adding a UDF to the Platfora Expression Builder

After you have written and compiled a user defined function (UDF) Java class, you must install your class on the Platfora master server and enable it so that it can be seen and used in the Platfora expression builder.

This task is performed on the Platfora master server.

Before you begin, you must have written and compiled a Java class for your user defined function. See Writing a Platfora UDF Java Program.

1. Create a directory named extlib in the Platfora data directory on the Platfora master server.

For example:

$ mkdir $PLATFORA_DATA_DIR/extlib

2. Copy the Java archive (.jar) file containing your UDF class to the $PLATFORA_DATA_DIR/extlib directory on the Platfora master server.

For example:

$ cp repeat-string-udf.jar $PLATFORA_DATA_DIR/extlib/

3. Set the Platfora server configuration property, platfora.udf.class.names, so it contains the name of your UDF Java class. If you have more than one class, separate the class names with a comma.

For example, to set this property using the platfora-config command-line utility:

$ $PLATFORA_HOME/bin/platfora-config set --key platfora.udf.class.names --value RepeatString
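For example, to register two UDF classes at once (MyOtherUdf is a hypothetical second class name):

$ $PLATFORA_HOME/bin/platfora-config set --key platfora.udf.class.names --value RepeatString,MyOtherUdf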

4. Restart the Platfora server:

$ platfora-services restart


The user defined function will then be available for defining computed field expressions in the Add Field dialog of the Platfora application.

Due to the way some web browsers cache Javascript files, the newly added function may not appear in the Functions list for up to 24 hours. However, the function is immediately available for use and recognized by the Expression auto-complete feature.

Regular Expression Reference

Regular expressions vary in complexity, using a combination of basic constructs to describe a string matching pattern. This reference describes the most common regular expression matching patterns, but is not a comprehensive list.

Regular expressions, also referred to as regex or regexp, are a standardized collection of special characters and constructs used for matching strings of text. They provide a flexible and precise language for matching particular characters, words, or patterns of characters.

Platfora regular expressions are based on the pattern matching syntax of the Java programming language. For more in-depth information on writing valid regular expressions, refer to the Java regular expression pattern documentation.

Platfora makes use of regular expressions in the following contexts:

• In computed field expressions that use the REGEX or REGEX_REPLACE functions.


• In PARTITION expression statements for event series processing computed fields.

• In the Regex file parser in data ingest.

• In the data source location path descriptor in data ingest.

• In lens filter expressions.

Regex Literal and Special Characters

The most basic form of regular expression pattern matching is the match of a literal character or string. Regular expressions also have a number of special characters that affect the way a pattern is matched. This section describes the regular expression syntax for referring to literal characters, special characters, non-printable characters (such as a tab or a newline), and special character escaping.

Literal Characters

The most basic form of pattern matching is the match of literal characters. For example, if the regular expression is foo and the input string is foo, the match will succeed because the strings are identical.

Special Characters

Certain characters are reserved for special use in regular expressions. These special characters are often called metacharacters. If you want to use special characters as literal characters, they must be escaped.

Character Name        Character   Reserved For
opening bracket       [           start of a character class
closing bracket       ]           end of a character class
hyphen                -           character ranges within a character class
backslash             \           general escape character
caret                 ^           beginning of string, negation of a character class
dollar sign           $           end of string
period                .           matching any single character
pipe                  |           alternation (OR) operator
question mark         ?           optional quantifier, quantifier minimizer
asterisk              *           zero or more quantifier
plus sign             +           once or more quantifier
opening parenthesis   (           start of a subexpression group
closing parenthesis   )           end of a subexpression group
opening brace         {           start of min/max quantifier
closing brace         }           end of min/max quantifier

Escaping Special Characters

There are two ways to force a special character to be treated as an ordinary character:

• Precede the special character with a \ (backslash character). For example, to specify an asterisk as a literal character instead of a quantifier, use \*.

• Enclose the special character(s) within \Q (starting quote) and \E (ending quote). Everything between \Q and \E is then treated as literal characters (see the example after this list).

• To escape literal double-quotes in a REGEX() expression, double the double-quotes (""). For example, to extract the inches portion from a height field where example values are 6'2", 5'11":

REGEX(height, "\'(\d)+""$")
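As an example of the \Q and \E quoting described above, the pattern \Q(1+1)\E matches the literal text (1+1), with no need to escape the parentheses and plus sign individually.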

Non-Printing Characters

You can use special character sequence constructs to specify non-printable characters in a regular expression. Some of the most commonly used constructs are:

Construct   Matches
\n          newline character
\r          carriage return character
\t          tab character
\f          form feed character

Regex Character Classes

A character class allows you to specify a set of characters, enclosed in square brackets, that can produce a single character match. There are also a number of special predefined character classes (backslash character sequences that are shorthand for the most common character sets).

Character Class Constructs

A character class matches only a single character. For example, gr[ae]y will match gray or grey, but not graay or graey. The order of the characters inside the brackets does not matter.

You can use a hyphen inside a character class to specify a range of characters. For example, [a-z] matches a single lower-case letter between a and z. You can also use more than one range, or a combination of ranges and single characters. For example, [0-9X] matches a numeric digit or the letter X. Again, the order of the characters and the ranges does not matter.

A caret following an opening bracket specifies characters to exclude from a match. For example, [^abc] will match any character except a, b, or c.

Construct      Type           Description
[abc]          simple         matches a or b or c
[^abc]         negation       matches any character except a, b, or c
[a-zA-Z]       range          matches a through z, or A through Z (inclusive)
[a-d[m-p]]     union          matches a through d, or m through p
[a-z&&[def]]   intersection   matches d, e, or f
[a-z&&[^xq]]   subtraction    matches a through z, except for x and q

Predefined Character Classes

Predefined character classes offer convenient shorthands for commonly used regular expressions.

Construct   Description and Example
.           matches any single character (except newline). Example: .at matches "cat", "hat", and also "bat" in the phrase "batch files"
\d          matches any digit character (equivalent to [0-9]). Example: \d matches "3" in "C3PO" and "2" in "file_2.txt"
\D          matches any non-digit character (equivalent to [^0-9]). Example: \D matches "S" in "900S" and "Q" in "Q45"
\s          matches any single white-space character (equivalent to [ \t\n\x0B\f\r]). Example: \sbook matches "book" in "blue book" but nothing in "notebook"
\S          matches any single non-white-space character. Example: \Sbook matches "book" in "notebook" but nothing in "blue book"
\w          matches any alphanumeric character, including underscore (equivalent to [A-Za-z0-9_]). Example: r\w* matches "rm" and "root"
\W          matches any non-alphanumeric character (equivalent to [^A-Za-z0-9_]). Example: \W matches "&" in "stmd &", "%" in "100%", and "$" in "$HOME"

POSIX Character Classes (US-ASCII)

POSIX has a set of character classes that denote certain common ranges. They are similar to bracket and predefined character classes, except they take into account the locale (the local language/coding system).

Construct    Description
\p{Lower}    a lower-case alphabetic character, [a-z]
\p{Upper}    an upper-case alphabetic character, [A-Z]
\p{ASCII}    an ASCII character, [\x00-\x7F]
\p{Alpha}    an alphabetic character, [a-zA-Z]
\p{Digit}    a decimal digit, [0-9]
\p{Alnum}    an alphanumeric character, [a-zA-Z0-9]
\p{Punct}    a punctuation character, one of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
\p{Graph}    a visible character, [\p{Alnum}\p{Punct}]
\p{Print}    a printable character, [\p{Graph}\x20]
\p{Blank}    a space or tab, [ \t]
\p{Cntrl}    a control character, [\x00-\x1F\x7F]
\p{XDigit}   a hexadecimal digit, [0-9a-fA-F]
\p{Space}    a whitespace character, [ \t\n\x0B\f\r]

Regex Line and Word Boundaries

Boundary matching constructs are used to specify where in a string to apply a matching pattern. For example, you can search for a particular pattern within a word boundary, or search for a pattern at the beginning or end of a line.

Construct   Description and Example
^           matches from the beginning of a line (multi-line matches are currently not supported). Example: ^172 will match the "172" in IP address "172.18.1.11" but not in "192.172.2.33"
$           matches from the end of a line (multi-line matches are currently not supported). Example: d$ will match the "d" in "maid" but not in "made"
\b          matches within a word boundary. Example: \bis\b matches the word "is" in "this is my island", but not the "is" part of "this" or "island"; \bis matches both "is" and the "is" in "island", but not in "this"
\B          matches within a non-word boundary. Example: \Bb matches "b" in "sbin" but not in "bash"

Regex Quantifiers

Quantifiers specify how often the preceding regular expression construct should match. There are three classes of quantifiers: greedy, reluctant, and possessive. The difference between greedy, reluctant, and possessive quantifiers involves what part of the string to try for the initial match, and how to retry if the initial attempt does not produce a match.


Quantifier Constructs

By default, quantifiers are greedy. A greedy quantifier will first try for a match with the entire input string. If that produces a match, then the match is considered a success, and the engine can move on to the next construct in the regular expression. If the first try does not produce a match, the engine backs off one character at a time until a match is found. So a greedy quantifier checks for possible matches in order from the longest possible input string to the shortest possible input string, recursively trying from right to left.

Adding a ? (question mark) to a greedy quantifier makes it reluctant. A reluctant quantifier will first try for a match from the beginning of the input string, starting with the shortest possible piece of the string that matches the regex construct. If that produces a match, then the match is considered a success, and the engine can move on to the next construct in the regular expression. If the first try does not produce a match, the engine adds one character at a time until a match is found. So a reluctant quantifier checks for possible matches in order from the shortest possible input string to the longest possible input string, recursively trying from left to right.

Adding a + (plus sign) to a greedy quantifier makes it possessive. A possessive quantifier is like a greedy quantifier on the first attempt (it tries for a match with the entire input string). The difference is that unlike a greedy quantifier, a possessive quantifier does not retry a shorter string if a match is not found. If the initial match fails, the possessive quantifier reports a failed match. It does not make any more attempts.

Greedy    Reluctant   Possessive   Description and Example
?         ??          ?+           matches the previous character or construct once or not at all. Example: st?on matches "son" in "johnson" and "ston" in "johnston" but nothing in "clinton" or "version"
*         *?          *+           matches the previous character or construct zero or more times. Example: if* matches "if", "iff" in "diff", or "i" in "print"
+         +?          ++           matches the previous character or construct one or more times. Example: if+ matches "if", "iff" in "diff", but nothing in "print"
{n}       {n}?        {n}+         matches the previous character or construct exactly n times. Example: o{2} matches "oo" in "lookup" and the first two o's in "fooooo" but nothing in "mount"
{n,}      {n,}?       {n,}+        matches the previous character or construct at least n times. Example: o{2,} matches "oo" in "lookup", all five o's in "fooooo", but nothing in "mount"
{n,m}     {n,m}?      {n,m}+       matches the previous character or construct at least n times, but no more than m times. Example: F{2,4} matches "FF" in "#FF0000" and the last four F's in "#FFFFFF"

Regex Capturing Groups

Groups are specified by a pair of parentheses around a subpattern in the regular expression. By placing part of a regular expression inside parentheses, you group that part of the regular expression together. This allows you to apply regex operators and quantifiers to the entire group at once. Besides grouping part of a regular expression together, parentheses also create a capturing group. Capturing groups are used to determine which matching values to save or return from your regular expression.

Group Numbering

A regular expression can have more than one group and the groups can be nested. The groups are numbered 1-n from left to right, starting with the first opening parenthesis. There is always an implicit group 0, which contains the entire match. For example, the pattern:

(a(b*))+(c)

contains three groups:

group 1: (a(b*))
group 2: (b*)
group 3: (c)

Capturing Groups

By default, a group captures the text that produces a match. Besides grouping part of a regular expression together, parentheses also create a capturing group or a backreference. The portion of the string matched by the grouped subexpression is captured in memory for later retrieval or use.

Capturing Groups and the Regex Line Parser

When you choose the Regex line parser during the Parse step of the data ingest process, Platfora uses capturing groups to determine what parts of the regular expression to return as columns. The Regex line parser applies the user-supplied regular expression against each line in the source file, and returns each capturing group in the regular expression as a column value.

For example, suppose you had user records in a file, and the lines were formatted like this:

Name: John Smith Address: 123 Main St. Age: 25 Comment: Active
Name: Sally R. Jones Address: 2 E. El Camino Real Age: 32
Name: Rod Rogers Address: 55 Elm Street Age: 47 Comment: Suspended

You could use the following regular expression to extract the Full Name, Last Name, Address, Age, and Comment column values:

Name: (.*\s(\p{Alpha}+)) Address:\s+(.*) Age:\s+([0-9]+)(?:\s+Comment:\s+(.*))?

Capturing Groups and the REGEX Function

The REGEX function can be used to extract a portion of a string value. For the REGEX function, only the value of the first capturing group is returned. For example, if you wanted to match all possible email address strings with a pattern of username@provider.domain, but only return the provider portion of the email address from the email field:

REGEX(email,"^[a-zA-Z0-9._%+-]+@([a-zA-Z0-9._-]+)\.[a-zA-Z]{2,4}$")

Capturing Groups and the REGEX_REPLACE Function

The REGEX_REPLACE function is used to match a string value, and replace matched strings with another value. The REGEX_REPLACE function takes three arguments: an input string, a matching regex, and a replacement regex. Capturing groups can be used to capture backreferences (see Backreferences), but do not control what portions of the match are returned (the entire match is always returned).

Backreferences

Backreferences allow you to capture and reuse a subexpression match inside the same regular expression. You can reuse a capturing group as a backreference by referring to its group number preceded by a backslash (for example, \1 refers to capturing group 1, \2 refers to capturing group 2, and so on).

For example, if you wanted to match a pair of HTML tags and their enclosed text, you could capture the opening tag into a backreference, and then reuse it to match the corresponding closing tag:

(<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\2>)

This regular expression contains two capturing groups: the outermost capturing group (which captures the entire string), and one which captures the string matched by [A-Z][A-Z0-9]* into backreference number two. This backreference can then be reused with \2 (backslash two) to match the corresponding closing HTML tag.

When referring to capturing groups in the previous regular expression, the backreference syntax is slightly different. The backreference group number is preceded by a dollar sign instead of a backslash (for example, $1 refers to capturing group 1 of the previous expression). An example of this would be the REGEX_REPLACE function, which takes two regular expressions: one for the matching string, and one for the replacement string.

The following example matches the values in a phone_number field where phone number values are formatted as xxx.xxx.xxxx, and replaces them with phone number values formatted as (xxx) xxx-xxxx. Notice the backreferences in the replacement expression; they refer to the capturing groups of the previous matching expression:

REGEX_REPLACE(phone_number,"([0-9]{3})\.([0-9]{3})\.([0-9]{4})","\($1\) $2-$3")

Non-Capturing Groups

In some cases, you may want to use parentheses to group subpatterns, but not capture text. A non-capturing group starts with (?: (a question mark and colon following the opening parenthesis). For example, h(?:a|i|o)t matches hat or hit or hot, but does not capture the a, i, or o from the subexpression.


Appendix B: Lens Query Language Reference

Platfora's lens query language is a SQL-like language for programmatically querying the prepared data in a lens. This reference describes the query language syntax and usage.

Topics:

• SELECT Statement

SELECT Statement

Queries an aggregate lens. A SELECT statement is the input to a programmatic lens query.

Syntax

[ DEFINE alias-name AS expression [ DEFINE ... ] ]
SELECT measure-field [ AS alias-name ] | measure-expression AS alias-name
  [ , { dimension-field [ AS alias-name ] | row-expression AS alias-name } [ , ...] ]
FROM lens-name
[ WHERE filter-expression [ AND filter-expression ] ]
[ GROUP BY dimension-field [ , group-ordering ] ]
[ HAVING measure-filter-expression ]

Description

Use SELECT to query an aggregate lens. You cannot query an event series lens. The SELECT must include at least one measure field (column) or expression. Once you've supplied a measure value, your SELECT can contain additional measures or dimensions.

If you include non-measure columns in the SELECT, you must include those columns in a GROUP BY clause. Use the DEFINE clause to add one or more computed fields to the lens.

Platfora always queries the current version of the lens-name. Keep in mind lens definitions can change. If you write a query against a column that is later dropped from the lens, a previously working query can fail as a result and return an error message.


Querying via REST

A lens query is meant to support external applications that want to access Platfora lens data. For this reason, you query a lens by making API calls to the query REST resource:

https://hostname:port/api/v1/query

The resource supports passing the statement as a GET or POST with application/x-www-form-urlencoded URL/form parameters. The caller must authenticate as a user with the Analyst (Limited) role or higher to execute a query. To query a specific lens, the caller must have Data Access on all datasets the lens references.

A query returns comma-separated values (CSV) by default. You have the option of receiving the results in a JSON body. For detailed information about using this REST API, see the Platfora API Reference.
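As a sketch only, a POST might look like the following; the form parameter name query, the credentials, and the lens contents are illustrative assumptions, not documented values (see the Platfora API Reference for the actual parameter names):

# Sketch: parameter name "query" and basic-auth usage are assumptions.
curl -u username:password \
  --data-urlencode "query=SELECT [Total Records] FROM [Web Log-2014]" \
  "https://hostname:port/api/v1/query"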

Writing a SELECT Expression

The SELECT expression can contain multiple columns and/or expressions. You must specify at least one measure column or measure expression. Once you meet this requirement, you can include additional dimension columns or row expressions. Recall that a measure is an aggregate numeric value, whereas a dimension is a numeric, text, or time-based value.

A measure expression supports addition, subtraction, multiplication, and division. Input values can be column references (fields) or expressions that contain any of these supported functions:

• aggregate functions (that is, AVG(), COUNT(), SUM(), and so forth)

• ROLLUP()

• EXP()

• POW()

• SQRT()

Measure expressions can also include literal (integer or string) values. When constructing a measure expression, make sure you understand the expression syntax rules and the limitations of aggregate functions. See the Expression and Query Language Reference for information on the aggregate function limitations and expression syntax.

If the SELECT statement includes a dimension value, you must include the column in your GROUP BY clause. A dimension or row expression supports addition, subtraction, multiplication, and division of row values. Your SELECT can reference columns or supply row expressions that include the following functions:

• data type conversion

• date and time

• general processing

• math

• string

• URL

An expression can include literal (integer or string) values or other expressions. Make sure you understand the expression syntax rules. See the Expression and Query Language Reference for information on the expression syntax.

When specifying an expression, supply an alias (AS clause) if you want to refer to the expression elsewhere in other clauses. You cannot use an * (asterisk) to retrieve all of the rows in a lens.

Specifying Lens and Column Names

When you specify the lens-name, use the name as it appears in the Data Catalog user interface. Enclose the name in [ ] (brackets) if it contains spaces or special characters. For example, you would refer to the Web Log-2014 lens as:

[Web Log-2014]

When specifying a column name, you should follow the expression language rules for field (column) references. This means that for columns belonging to a reference dataset, you must qualify the name using dot notation as follows:

{ [ reference-dataset . [...] ] column-name | alias-name }

For example, use device.manufacturer to refer to the manufacturer column in the device dataset. If you define an alias, use the alias to refer to the column in other parts of your query.

DEFINE Clause

Defines a computed field to include in a SELECT statement.

Syntax

DEFINE alias-name AS { expression }

Description

Use a DEFINE clause to include new computed fields that aren't in the original lens. Using the DEFINE clause is optional. Platfora applies the DEFINE statement before the main SELECT clause. New computed fields can only use fields already in the lens.

The expression you write must be a valid expression for a vizboard computed field. This means your computed field is subject to the following restrictions:

• You can only define a computed field that operates on fields that exist in the lens.

• A vizboard computed field can break if it operates on fields that are later removed from the lens or a focus or referenced dataset.

• You cannot use aggregate functions to add new measures from dimension data in the lens.

• You can compute new measures from existing measures already in the lens. For example, if an AVG(sales) aggregate exists in the data, you can define a SUM(sales) field because SUM(sales) is necessary to compute AVG(sales). A sketch of this case follows the list.
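A minimal sketch of that last case, assuming a lens named [Sales Lens] whose definition already includes an AVG(sales) measure and a region dimension (both names are hypothetical):

DEFINE [Total Sales] AS SUM(sales)
SELECT [Total Sales], region
FROM [Sales Lens]
GROUP BY region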

If you specify multiple DEFINE clauses, separate each new DEFINE with a space.

A computed field can depend on any fields pre-existing in the lens or on other fields created in the query's scope. For example, a computed field you DEFINE can depend on fields also created through other DEFINE statements.
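In the following sketch (reusing field names from the examples later in this reference; the " segment" label is illustrative), the second DEFINE depends on the alias created by the first:

DEFINE Manu_Genre AS CONCAT([device].[manufacturer], [video].[genre])
DEFINE Manu_Genre_Label AS CONCAT(Manu_Genre, " segment")
SELECT [Num Views], Manu_Genre_Label
FROM movie_view2G_PSM
GROUP BY Manu_Genre_Label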

WHERE Clause

Filters a lens query by one or more predicate expressions.

Syntax

WHERE predicate-expression [ AND predicate-expression ]

A predicate-expression can be a comparison:

column-name { = | < | > | <= | >= | != } literal

Or the predicate-expression can be a list-expression such as this:

column-name [ NOT ] { IN list | LIKE pattern | BETWEEN literal AND literal }

Description

Use the WHERE clause to filter a lens query by one or more predicate expressions. Use the AND keyword to join multiple expressions. A WHERE clause can include expressions that make use of the comparison operators or list expressions. For detailed information about expression syntax, see the Platfora Expression and Query Language Reference.

You cannot use IS NULL or IS NOT NULL comparisons in the WHERE clause. You also cannot use relative date filters (LAST integer DAYS). You can use the NOT keyword to negate any list expression.

The following example illustrates several different permutations of expression structures you can use:

SELECT count() FROM [View Summary]
WHERE prior_views NOT IN (3,5,7,11,13,17)
AND TO_LONG(prior_views) NOT IN (4294967296)
AND avebitrate_double NOT IN (3101.0, 2598.0, 804.0)
AND video.genre NOT IN ("Silent", "Exercise")
AND video.genre NOT LIKE ("*a*")
AND date.Date NOT IN (2011-08-04, 2011-06-04, 2011-07-04)
AND prior_views > 23
AND avebitrate_double < 3101.0
AND TO_FIXED(avebitrate_double) != 3101.0
AND TO_LONG(prior_views) != 4294967296
AND video.genre <= "Silent"
AND date.Date > 2011-08-04
AND date.Date NOT BETWEEN 2012-01-01 AND 2013-01-01
AND video.genre BETWEEN "Exercise" AND "Silent"
AND prior_views BETWEEN 0 AND 100
AND avebitrate_double NOT BETWEEN 1234.5678 AND 2345.6789

When comparing literal dates, make sure you use the format yyyy-MM-dd without any enclosing quotation marks or other punctuation.

GROUP BY Clause

Orders and optionally limits the results of a SELECT statement.

Syntax

GROUP BY group-ordering [ , group-ordering ]

The group-ordering clause has the following syntax:

column-name [ SORT [ BY measure-name ] [ { ASC | DESC } ] [ LIMIT integer [ WITH OTHERS ] ] ]

Description

Use a GROUP BY clause to order and optionally limit the results of a SELECT. If the SELECT statement includes a dimension column, you must supply a GROUP BY clause that includes the dimension column. Otherwise, the GROUP BY clause is optional.

A GROUP BY can include more than one column. To do this, delimit each column with a , (comma) as illustrated here:

GROUP BY col_A, col_B, col_C

You can GROUP BY a new computed field that is not defined in the lens. To do this, you add the field using the DEFINE clause and then use the field in the GROUP BY clause. Alternatively, you can define the computed field in the SELECT list, associate an alias with the field, and use the alias in the GROUP BY clause.

A SORT specification is optional. If you do not specify SORT, the query returns results in an unspecified order. To sort the columns by their values ("natural sorting order"), simply specify ASC (ascending) or DESC (descending). ASC is the default SORT order when sorting by natural values.

To SORT a particular column by another measure or measure expression, use the SORT BY phrase. You can specify a measure-name in the SORT BY clause that need not be in the SELECT list. You can also order the sort in either ASC or DESC order. Unlike natural value sorts, SORT BY defaults to the DESC (descending) sorting order.

GROUP BY col_A SORT BY meas_1 ASC, col_B SORT DESC, col_C SORT BY measure_expression ASC

Using GROUP BY with multiple SORT BY combinations allows you to group values with respect to one another. Consider three potential grouping columns, say Fee, Fi, and Foe. Sorting on column Fee sorts the records on the Fee value. Another SORT BY clause on column Fi sorts Fi values within the existing Fee sort.
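A sketch of that nesting, reusing the placeholder names from the example above:

GROUP BY Fee SORT ASC, Fi SORT BY meas_1 DESC

Records are first ordered by their Fee values; within each Fee group, the Fi values are ordered by the meas_1 measure.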

Use the LIMIT keyword to reduce the number of groups returned. For example, if you are sorting airports by the number of departing flights in DESC order (most flights to least flights), you could LIMIT the SORT to the 10 busiest airports.

GROUP BY airports SORT BY total_departures DESC LIMIT 10

The LIMIT restricts the results to the top 10 busiest departure airports. The LIMIT clause excludes other airports. You can use the WITH OTHERS keywords to combine all the other airports not in the top 10 under a single Others group.
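For example, extending the query above:

GROUP BY airports SORT BY total_departures DESC LIMIT 10 WITH OTHERS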

HAVING Clause

Filters a SELECT statement by a measure expression.

Syntax

HAVING measure-predicate-expression [ AND measure-predicate-expression ]

A measure-predicate-expression has the following form:

{ measure-column | measure-expression } { = | < | > | <= | >= | != } literal

Description

The HAVING clause filters the result of the GROUP BY clause by a measure or measure expression. The HAVING conditions apply to the GROUP BY clause.

SELECT device.manufacturer, [duration (Avg)]
FROM movie_view2G_PSM
GROUP BY device.manufacturer
HAVING [duration (Max)] = 10800

In the example above, you see a reference to two quick measure fields. Both the duration AVG() and MAX() quick measures are already defined on the lens.
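Per the syntax above, multiple HAVING conditions chain with AND. This variation is a sketch with arbitrary threshold values:

SELECT device.manufacturer, [duration (Avg)]
FROM movie_view2G_PSM
GROUP BY device.manufacturer
HAVING [duration (Max)] = 10800 AND [duration (Avg)] > 3600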

Example of Lens Queries

This section provides some tips and examples for querying a lens.

Discovering Lens Fields

When querying a lens, you use the query REST resource described above. Before constructing your query, it is a good idea to list the lens fields with a REST call to the lens resource. One suggested method is to make the calls listed below; a sketch of the first two calls follows the list.

• List the lens by calling GET on the http://hostname:port/api/v1/lenses resource.

• Locate the lens id value in the lens list.

• Get the lens by calling GET on the http://hostname:port/api/v1/lenses/id resource.

• Review the lens fields.
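A sketch of the first two calls using curl; the credentials and the lens id value are placeholders (authentication details are in the Platfora API Reference):

# Sketch: list all lenses, then fetch one lens by id (42 is hypothetical).
curl -u username:password "http://hostname:port/api/v1/lenses"
curl -u username:password "http://hostname:port/api/v1/lenses/42"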

This is one way to discover existing aggregate expressions and quick measures in the lens. For example, listing lens fields gives you output such as the following:

... "fields": { "Active Clusters (Total)": { "name": "Active Clusters (Total)", "expression": "DISTINCT([Temp Field for Count Active Clusters])", "lensExpression": false, "platforaManaged": false, "role": "MEASURE", "type": "LONG" }, "Are there new Active Clusters since Yesterday?": { "name": "Are there new Active Clusters since Yesterday?", "expression": "[Active Clusters (Total)] - ROLLUP [Active Clusters (Total)] TO ([Log DateTime Date].Date) ORDER BY ([Log DateTime Date].Date) ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING", "lensExpression": false, "platforaManaged": false, "role": "MEASURE", "type": "LONG" }, "Avg Page Views per Session": { "name": "Avg Page Views per Session", "expression": "[Total Records]/(DISTINCT(sessionId))", "lensExpression": false, "platforaManaged": false, "role": "MEASURE", "type": "DOUBLE" }, ...

Using the JSON description of a lens, you can quickly see the measures used in your lens without navigating the lens in Platfora's UI.

Complex DEFINE Clauses

This example illustrates the use of multiple DEFINE clauses. Notice the descriptive names for the ROLLUP() computed fields.

DEFINE Manu_Genre AS CONCAT([device].[manufacturer], [video].[genre])
DEFINE [ROLLUP num_views TO Manu] AS ROLLUP COUNT() TO (device.manufacturer)
DEFINE [ROLLUP num_views TO Manu_Genre] AS ROLLUP COUNT() TO ([Manu_Genre])
SELECT device.manufacturer, Manu_Genre, [Num Views],
  [ROLLUP num_views TO Manu], [ROLLUP num_views TO Manu_Genre]
FROM movie_view2G_PSM
WHERE Manu_Genre LIKE ("*Action/Comedy", "*Anime", "*Drama/Silent")
GROUP BY device.manufacturer SORT ASC, Manu_Genre SORT ASC
HAVING [ROLLUP num_views TO Manu] > 30000
  AND [ROLLUP num_views TO Manu_Genre] > 1000

Build a WHERE Clause

The following example shows a WHERE clause using mixed predicates and row comparisons. It also uses the NOT keyword to negate list expressions.

SELECT count() FROM [(test) View Summary]
WHERE prior_views NOT IN (3,5,7,11,13,17)
AND TO_LONG(prior_views) NOT IN (4294967296)
AND avebitrate_double NOT IN (3101.0, 2598.0, 804.0)
AND video.genre NOT IN ("Silent", "Exercise")
AND video.genre NOT LIKE ("*a*")
AND date.Date NOT IN (2011-08-04, 2011-06-04, 2011-07-04)
AND prior_views > 23
AND avebitrate_double < 3101.0
AND TO_FIXED(avebitrate_double) != 3101.0
AND TO_LONG(prior_views) != 4294967296
AND video.genre <= "Silent"
AND date.Date > 2011-08-04
AND date.Date NOT BETWEEN 2012-01-01 AND 2013-01-01
AND video.genre BETWEEN "Exercise" AND "Silent"
AND prior_views BETWEEN 0 AND 100
AND avebitrate_double NOT BETWEEN 1234.5678 AND 2345.6789

You cannot use IS NULL or IS NOT NULL comparisons. You also cannot use relative date filters (LAST integer DAYS).

Complex Measure Expression

The following example illustrates a measure expression that includes both a ROLLUP and the use of aggregate functions.

SELECT device.manufacturer,
  CONCAT([device].[manufacturer], [video].[genre]) AS Manu_Genre,
  [Num Views],
  ROLLUP COUNT() TO (device.manufacturer) AS [ROLLUP num_views TO Manu],
  ROLLUP COUNT() TO ([Manu_Genre]) AS [ROLLUP num_views TO Manu_Genre]
FROM movie_view2G_PSM
WHERE Manu_Genre LIKE ("*Action/Comedy", "*Anime", "*Drama/Silent")
GROUP BY device.manufacturer SORT ASC, Manu_Genre SORT ASC
HAVING [ROLLUP num_views TO Manu] > 30000
  AND [ROLLUP num_views TO Manu_Genre] > 1000

Complex Row Expressions

This row expression uses multiple row terms and factors:

SELECT duration + [days after release] + user.age + user.location.estimatedpopulation
  AS [Row-Expression multi-factors],
  [Num Views]
FROM movie_view2G_PSM
GROUP BY [Row-Expression multi-factors] SORT ASC

You'll notice that the Row-Expression multi-factors alias for the SELECT complex expression is reused in the GROUP BY clause.