
  • Spectrum™ Location Intelligence for Big Data Version 3.1

    Spectrum™ Location Intelligence for Big Data User Guide

  • Table of Contents

    1 - Welcome

    What is Spectrum™ Location Intelligence for Big Data? 4

    Spectrum™ Location Intelligence for Big Data Architecture 5

    System Requirements and Dependencies 5

    2 - Spatial

    Installing the SDK 8
    Hive User-Defined Spatial Functions 9
    MapReduce Jobs 40
    Spark Jobs 44
    Hexagons 47

    3 - Samples

    Risk Assessment 50
    Fire Protection Assessment 72
    Geohash Aggregation 79
    Consuming Results 82

    4 - Appendix

    PGD Builder 85
    Download Permissions 86

  • 1 - Welcome

    In this section

    What is Spectrum™ Location Intelligence for Big Data? 4
    Spectrum™ Location Intelligence for Big Data Architecture 5
    System Requirements and Dependencies 5


    What is Spectrum™ Location Intelligence for Big Data?

    The Pitney Bowes Spectrum™ Location Intelligence for Big Data is a toolkit for large-scale spatial analysis of enterprise data. Billions of records can be processed in parallel using MapReduce, Hive, and Apache Spark's cluster processing framework, yielding results faster than ever. Data processing that took weeks with traditional techniques can now be completed in a few hours with this product.


    Spectrum™ Location Intelligence for Big Data Architecture


    Spectrum™ Location Intelligence for Big Data transforms and packages Location Intelligence components into an SDK for Big Data platforms such as Hadoop, for use with MapReduce, Spark, and Hive.

    SDK provides:

    • Integration APIs for Location Intelligence
    • Input datasets and metadata

    API Types:

    • Pre-built MapReduce, Spark, and Hive UDF wrappers for Location Intelligence operations
    • Core Location Intelligence APIs with sample MapReduce/Hive/Spark programs (security enabled via Kerberos and Apache Sentry for Hive)

    System Requirements and Dependencies

    Spectrum™ Location Intelligence for Big Data is a collection of jar files that can be deployed to your Hadoop system.

    This product is verified on the following Hadoop distributions.

    • Cloudera 5.12 and 6.0
    • Hortonworks 2.6
    • EMR 5.10
    • MapR 6.0 and above, with MapR Expansion Pack (MEP) 5.0.0

    To use these jar files, you must be familiar with configuring Hadoop in Hortonworks, Cloudera, EMR, or MapR, and with developing applications for distributed processing. For more information, refer to the Hortonworks (http://docs.hortonworks.com/index.html), Cloudera (http://www.cloudera.com/documentation.html), EMR (https://aws.amazon.com/documentation/emr/), or MapR (https://mapr.com/docs/) documentation.

    To use the product, the following must be installed on your system:

    for Hive:

    • Hive version 1.2.1 or above

    for a Hive client:

    • Beeline, for example

    for Spark and Zeppelin Notebook:

    • Java JDK version 1.8 or above
    • Hadoop version 2.6.0 or above
    • Spark version 1.6.0 or above (2.0 or above required for MapR and Cloudera 6.0)
    • Zeppelin Notebook is not supported in Cloudera


  • 2 - Spatial

    This section describes the MapReduce jobs, Spark jobs and Hive user defined functions (UDFs) for geometry and coordinate operations and the ability to read TAB files.

    MapReduce and Spark jobs use the Location Intelligence SDK (LI SDK) API in map and reduce operations to use the big data processing systems for spatial data analysis. The LI SDK provides geometry and coordinate operations, the ability to read TAB files, and in-memory r-tree creation and searching.

    Hive UDFs also use the LI SDK API to provide SQL-like functions for spatial analysis in Hive.

    In this section

    Installing the SDK 8
    Hive User-Defined Spatial Functions 9
    MapReduce Jobs 40
    Spark Jobs 44
    Hexagons 47


    Installing the SDK

    To use spatial functions for Spectrum™ Location Intelligence for Big Data, the Hadoop cluster must have reference data and libraries accessible from each master and data node at the file-system level.

    For the purposes of this guide, we will:

    • use a user called pbuser
    • install everything into /pb

    Perform the following steps from a node in your cluster, such as the master node.

    1. Create the install directory and give ownership to pbuser.

    sudo mkdir /pb
    sudo chown pbuser:pbuser /pb

    2. Add the Location Intelligence distribution zip to the node at a temporary location, for example:

    /pb/temp/spectrum-bigdata-locationintelligence-version.zip

    3. Extract the Location Intelligence distribution.

    mkdir /pb/li
    mkdir /pb/li/sdk
    unzip /pb/temp/spectrum-bigdata-locationintelligence-version.zip -d /pb/li/sdk


    Hive User-Defined Spatial Functions

    Hive user-defined functions (UDFs) let you run MapReduce jobs using SQL-like syntax, so there is no need to write code. Spectrum™ Location Intelligence for Big Data and Spectrum Geocoding for Big Data provide Hive user-defined functions for geometry operations and for working with grids in the spectrum-bigdata-spatial-li-hive-version.jar.
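
    For instance, once the functions are registered (see the Setup topic), a spatial operation can be run directly from a Hive query. The sketch below is an illustration only; the table and column names are assumed rather than taken from this guide:

    SELECT ToWKT(Buffer(FromWKT(t.geometry, 'epsg:4326'), 5.0, 'km', 12, 'SPHERICAL')) FROM hivetable t;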

    Refer to the table below to quickly navigate to Hive UDFs described in this document:

    • Constructor Functions on page 16: construct an instance of WritableGeometry from supported geometry representation formats. Functions: FromGeoJSON, FromKML, FromWKB, FromWKT, ST_Point
    • Grid Index Functions on page 31: grid processing functions. Functions: GeoHashBoundary, GeoHashID, HexagonBoundary, HexagonID, SquareHashBoundary, SquareHashID
    • Measurement Functions on page 22: geometry measurement functions. Functions: Area, ClosestPoints, Distance, Length, Perimeter
    • Observer Functions on page 29: geometry observer functions. Functions: ST_X, ST_Y, ST_XMax, ST_XMin, ST_YMax, ST_YMin
    • Persistence Functions on page 18: serialize an instance of WritableGeometry to supported geometry representation formats. Functions: ToGeoJSON, ToKML, ToWKB, ToWKT
    • Predicate Functions on page 20: geometry predicate functions. Functions: Disjoint, Intersects, Overlaps, Within
    • Processing Functions on page 27: geometry processing functions. Functions: Buffer, ConvexHull, Intersection, Transform, Union
    • Search Functions on page 36: spatial search functions. Functions: LocalPointInPolygon, LocalSearchNearest


    Setup

    This topic assumes the product is installed to /pb/li/sdk as described in Installing the SDK on page 8. To set up user-defined spatial functions for Hive, perform the following steps:

    1. Proceed according to your platform.

    On this platform, do this:

    Cloudera: Copy the Hive jar for Location Intelligence to the HiveServer node.

    /pb/li/sdk/hive/lib/spectrum-bigdata-li-hive-version.jar

    In Cloudera Manager, navigate to the Hive Configuration page. Search for the Hive Auxiliary JARs Directory setting. If the value is already set, then move the Hive jar into the specified folder. If the value is not set, then set it to the parent folder of the Hive jar.

    /pb/li/sdk/hive/lib/

    Hortonworks: On the HiveServer2 node, create the Hive auxlib folder if one does not already exist.

    sudo mkdir /usr/hdp/current/hive-server2/auxlib/

    Copy the Hive jar for Location Intelligence to the auxlib folder on the HiveServer2 node:

    sudo cp /pb/li/sdk/hive/lib/spectrum-bigdata-li-hive-version.jar /usr/hdp/current/hive-server2/auxlib/

    MapR: Copy the Hive jar for Location Intelligence to the HiveServer node.

    /pb/li/sdk/hive/lib/spectrum-bigdata-li-hive-version.jar

    2. Restart all Hive services.

    3. Launch Beeline, or some other Hive client, for the remaining steps.

    beeline -u jdbc:hive2://localhost:10000/default -n pbuser


    4. Register spatial user-defined functions. Add the temporary keyword after create if you want a temporary function (this step would need to be redone for every new Hive session).

    create temporary function FromWKT as 'com.pb.bigdata.spatial.hive.construct.FromWKT';
    create temporary function FromWKB as 'com.pb.bigdata.spatial.hive.construct.FromWKB';
    create temporary function FromKML as 'com.pb.bigdata.spatial.hive.construct.FromKML';
    create temporary function FromGeoJSON as 'com.pb.bigdata.spatial.hive.construct.FromGeoJSON';
    create temporary function ST_Point as 'com.pb.bigdata.spatial.hive.construct.ST_Point';

    create temporary function ToWKT as 'com.pb.bigdata.spatial.hive.persistence.ToWKT';
    create temporary function ToWKB as 'com.pb.bigdata.spatial.hive.persistence.ToWKB';
    create temporary function ToKML as 'com.pb.bigdata.spatial.hive.persistence.ToKML';
    create temporary function ToGeoJSON as 'com.pb.bigdata.spatial.hive.persistence.ToGeoJSON';

    create temporary function Disjoint as 'com.pb.bigdata.spatial.hive.predicate.Disjoint';
    create temporary function Overlaps as 'com.pb.bigdata.spatial.hive.predicate.Overlaps';
    create temporary function Within as 'com.pb.bigdata.spatial.hive.predicate.Within';
    create temporary function Intersects as 'com.pb.bigdata.spatial.hive.predicate.Intersects';

    create temporary function Area as 'com.pb.bigdata.spatial.hive.measurement.Area';
    create temporary function ClosestPoints as 'com.pb.bigdata.spatial.hive.measurement.ClosestPoints';
    create temporary function Distance as 'com.pb.bigdata.spatial.hive.measurement.Distance';
    create temporary function Length as 'com.pb.bigdata.spatial.hive.measurement.Length';
    create temporary function Perimeter as 'com.pb.bigdata.spatial.hive.measurement.Perimeter';

    create temporary function ConvexHull as 'com.pb.bigdata.spatial.hive.processing.ConvexHull';
    create temporary function Intersection as 'com.pb.bigdata.spatial.hive.processing.Intersection';
    create temporary function Buffer as 'com.pb.bigdata.spatial.hive.processing.Buffer';
    create temporary function Union as 'com.pb.bigdata.spatial.hive.processing.Union';
    create temporary function GeometryTransform as 'com.pb.bigdata.spatial.hive.processing.Transform';

    create temporary function ST_X as 'com.pb.bigdata.spatial.hive.observer.ST_X';
    create temporary function ST_XMax as 'com.pb.bigdata.spatial.hive.observer.ST_XMax';
    create temporary function ST_XMin as 'com.pb.bigdata.spatial.hive.observer.ST_XMin';
    create temporary function ST_Y as 'com.pb.bigdata.spatial.hive.observer.ST_Y';
    create temporary function ST_YMax as 'com.pb.bigdata.spatial.hive.observer.ST_YMax';
    create temporary function ST_YMin as 'com.pb.bigdata.spatial.hive.observer.ST_YMin';

    create temporary function GeoHashBoundary as 'com.pb.bigdata.spatial.hive.grid.GeoHashBoundary';
    create temporary function GeoHashID as 'com.pb.bigdata.spatial.hive.grid.GeoHashID';
    create temporary function HexagonBoundary as 'com.pb.bigdata.spatial.hive.grid.HexagonBoundary';
    create temporary function HexagonID as 'com.pb.bigdata.spatial.hive.grid.HexagonID';
    create temporary function SquareHashBoundary as 'com.pb.bigdata.spatial.hive.grid.SquareHashBoundary';
    create temporary function SquareHashID as 'com.pb.bigdata.spatial.hive.grid.SquareHashID';

    create temporary function LocalSearchNearest as 'com.pb.bigdata.spatial.hive.search.LocalSearchNearest';
    create temporary function LocalPointInPolygon as 'com.pb.bigdata.spatial.hive.search.LocalPointInPolygon';

    Note: If you want to view the complete stack trace for any encountered error, enable logging in DEBUG mode and then restart the job execution.

    5. MapR only: Set hive.aux.jars.path in hive-site.xml and (for Hive v2.1 and earlier only) HIVE_AUX_JARS_PATH in hive-env.sh, using full paths to the jar files (not to folders), on only the nodes that are running HiveServer2 or the Hive metastore (that is, the master node or nodes).

    • /opt/mapr/hive/hive-version/conf/hive-site.xml

    Qualify the hive.aux.jars.path entries with the file:// URI prefix and separate multiple paths with a comma.

    hive.aux.jars.path


    file:///pb/li/sdk/hive/lib/spectrum-bigdata-li-hive-version.jar,file:///pb/geocoding/sdk/hive/lib/spectrum-bigdata-geocoding-hive-version.jar

    • /opt/mapr/hive/hive-version-2.1-or-earlier/conf/hive-env.sh

    Export the environment variable and separate multiple paths with a colon (:).

    export HIVE_AUX_JARS_PATH=/pb/li/sdk/hive/lib/spectrum-bigdata-li-hive-version.jar:/pb/geocoding/sdk/hive/lib/spectrum-bigdata-geocoding-hive-version.jar

    • The first time you run a job, it may take a while if the reference data has to be downloaded remotely from HDFS or S3. The job may also time out when using a large number of datasets that are stored in remote locations such as HDFS or S3. If you are using Hive with the MapReduce engine, you can adjust the value of the mapreduce.task.timeout property, as shown in the example following these notes.

    • Some types of queries will cause Hive to evaluate UDFs in the HiveServer2 process space instead of on a data node. The Routing UDFs in particular use a significant amount of memory and can shut down the Hive server due to memory constraints. To process these queries, we recommend increasing the amount of memory available to the HiveServer2 process (for example, by setting HADOOP_HEAPSIZE in hive-env.sh).
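
    For example, when Hive runs on the MapReduce engine, the timeout can be raised for the current session before running a query. This is only an illustration; the value shown is arbitrary and should be tuned for your environment:

    SET mapreduce.task.timeout=1200000;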


    WritableGeometry

    This is an implementation of Hadoop's Writable interface for geometry.

    Spatial Hive user defined functions (UDFs) use WritableGeometry to exchange data between two functions. Constructor Hive functions provide a mechanism to get an instance of WritableGeometry from standard geometry formats like WKT, WKB, GeoJSON and KML. For example:

    To get an instance of WritableGeometry from WKT:

    SELECT FromWKT(t.geometry,'epsg:4267') FROM hivetable t;

    To get an instance of WritableGeometry from WKB string:

    SELECT FromWKB(t.geometry,'epsg:4267') FROM hivetable t;

    Persistence Hive UDFs convert an instance of WritableGeometry to standard formats like WKT, WKB, GeoJSON and KML. For example:

    To serialize an instance of WritableGeometry to WKT:

    SELECT ToWKT(t.geometry) FROM hivetable t;

    The output of Constructor functions can be supplied as input to other Hive functions that perform some operations on it:

    For example:

    To calculate the length of a geometry:

    SELECT Length(FromWKT(t.geometry, 'epsg:4267'), 'm', 'SPHERICAL') FROM hivetable t;

    To get the distance between two geometries:

    SELECT Distance(FromWKT(t.geometry,'epsg:4267'), FromWKT(t.geometry2,'epsg:4267'), 'm','SPHERICAL') FROM hivetable t;

    For more information, see https://hadoop.apache.org/docs/r2.6.1/api/org/apache/hadoop/io/Writable.html.


    Geometry Functions

    • Constructor Functions
    • Grid Functions
    • Measurement Functions
    • Observer Functions
    • Persistence Functions
    • Predicate Functions
    • Processing Functions


    Constructor Functions

    The following Constructor functions are available:

    • FromWKT
    • FromGeoJSON
    • FromWKB
    • FromKML
    • ST_Point

    FromGeoJSON

    The FromGeoJSON function returns a WritableGeometry instance from a GeoJSON representation of a geometry.

    Example:

    SELECT FromGeoJSON('{ "type": "Point", "coordinates": [100.0, 0.0] }');
    SELECT FromGeoJSON(t.geometry) FROM hivetable t;

    FromKML

    The FromKML function returns a WritableGeometry instance from the text formatted in KML (Keyhole Markup Language).

    Example:

    SELECT FromKML(t.geometry) FROM hivetable t;

    FromWKB

    The FromWKB function returns a WritableGeometry instance from a Well-Known Binary (WKB) of a geometry. The geometry will be created using the specified coordinate system.

    Example:

    SELECT FromWKB (t.geometry, 'epsg:4267') FROM hivetable t;

    FromWKT

    The FromWKT function returns a WritableGeometry instance from a Well-Known Text (WKT) representation of a geometry. The geometry is created using the specified coordinate system.

    Examples:

    SELECT FromWKT(t.geometry,'epsg:4267') FROM hivetable t;

    SELECT FromWKT ('POINT (30 20)', 'epsg:4267');


    ST_Point

    The ST_Point function constructs a point geometry from the provided X and Y, and an optional CRS. X and Y can be either of String or Numeric types. If the CRS is not provided or null or empty, then EPSG:4326 will be used as the default CRS. If any of the argument values are invalid, then null will be returned in the output.

    To create a temporary Hive function:

    create temporary function ST_Point as 'com.pb.bigdata.spatial.hive.construct.ST_Point';

    Examples:

    SELECT ST_Point(-73.750333 , 42.736103, 'epsg:4326');

    SELECT ST_Point('-73.750333' , '42.736103', 'epsg:4326');

    SELECT ST_Point(-73.750333 , 42.736103);

    SELECT ST_Point('-73.750333' , 42.736103);

    SELECT ST_Point(p.x, p.y, p.crs) FROM points p;


    Persistence Functions

    The following Persistence functions are available:

    • ToWKT
    • ToGeoJSON
    • ToWKB
    • ToKML

    ToGeoJSON

    The ToGeoJSON function returns text in GeoJSON format representing the geometry, as serialized from the specified WritableGeometry instance.

    Example:

    SELECT ToGeoJSON(FromGeoJSON(t.geometry)) FROM hivetable t;
    SELECT ToGeoJSON(Buffer(FromGeoJSON(t.geometry), 5.0, 'km', 12, 'SPHERICAL')) FROM hivetable t;

    ToKML

    The ToKML function returns text formatted as KML in the OGC standard KML 2.2 namespace (http://schemas.opengis.net/kml/2.2.0/ogckml22.xsd), as serialized from the specified WritableGeometry instance.

    Example:

    SELECT ToKML(FromKML(t.geometry)) FROM hivetable t;
    SELECT ToKML(Buffer(FromGeoJSON(t.geometry), 5.0, 'km', 12, 'SPHERICAL')) FROM hivetable t;

    ToWKB

    The ToWKB function returns the Well-Known Binary (WKB) representation of a geometry, as serialized from the specified WritableGeometry instance.

    Example:

    SELECT ToWKB(FromWKB(t.geometry, 'epsg:4326')) FROM hivetable t;
    SELECT ToWKB(Buffer(FromGeoJSON(t.geometry), 5.0, 'km', 12, 'SPHERICAL')) FROM hivetable t;

    ToWKT

    The ToWKT function returns a Well-Known Text (WKT) representation of a geometry from the specified WritableGeometry instance.


    Example:

    SELECT ToWKT(FromWKT(t.geometry, 'epsg:4326')) FROM hivetable t;
    SELECT ToWKT(Buffer(FromGeoJSON(t.geometry), 5.0, 'km', 12, 'SPHERICAL')) FROM hivetable t;


    Predicate Functions

    The following Predicate functions are available:

    • Disjoint
    • Intersects
    • Overlaps
    • Within

    Disjoint

    The Disjoint function returns True if two geometry objects have no points in common, otherwise False is returned.

    If either geometry1 or geometry2 are null, Null is returned.

    Example:

    SELECT Disjoint(FromWKT(t1.geometry, 'epsg:4326'), FromWKT(t2.geometry, 'epsg:4326')) FROM hivetable1 t1, hivetable2 t2;

    Intersects

    The Intersects function determines whether or not one geometry object intersects another geometry object. It returns True if there is any direct position in common between the two geometries, or else False is returned.

    If either geometry1 or geometry2 are null, Null is returned.

    Example:

    SELECT Intersects(FromWKT(t1.geometry, 'epsg:4326'), FromWKT(t2.geometry, 'epsg:4326')) FROM hivetable1 t1, hivetable2 t2;

    SELECT R.highway FROM USA_RIVERS L, usa_highways R where L.name='Hudson River' and Intersects(FromWKT(L.geom), FromWKT(R.geom));

    Overlaps

    The Overlaps function determines whether or not one geometry object overlaps another geometry object. This function returns True if the geometry1 overlaps the geometry2, otherwise False is returned.

    If either geometry1 or geometry2 are null, Null is returned.

    Example:

    SELECT Overlaps(FromWKT(t1.geometry, 'epsg:4326'), FromWKT(t2.geometry, 'epsg:4326')) FROM hivetable1 t1, hivetable2 t2;


    Within

    The Within function returns whether or not one geometry object is entirely within another geometry object. It returns True if the geometry2 entirely contains geometry1, otherwise False is returned.

    If either the testGeometry or the containerGeometry are null, Null is returned.

    Example:

    SELECT Within(FromWKT(t1.geometry, 'epsg:4326'), FromWKT(t2.geometry, 'epsg:4326')) as Result FROM hivetable1 t1, hivetable2 t2;

    SELECT L.zipcode as zipcode, SUM(L.insurance) as TotalInsuredAmount, AVG(R.riskdesc) as RiskScore
    FROM book_of_business L, FIRE_RISK_BOUNDRIES R
    WHERE Within(FromWKT(L.location), FromWKT(R.geom))
    GROUP BY L.zipcode;


    Measurement Functions

    The following Measurement functions are available:

    • Area
    • Length
    • Perimeter
    • Distance
    • ClosestPoints

    Area

    The Area function calculates and returns the area of given Geometry in the desired unit. The unit must be specified as a parameter while calling the function. The area of a polygon is computed as the area of its exterior ring minus the areas of its interior rings. Points and curves have zero area.

    Example:

    SELECT Area (FromWkt(t.geometry,'epsg:4267'), 'sq mi', 'SPHERICAL') FROM hivetable t;

    SELECT Area (FromWkt(t.geometry,'epsg:4267'), 'sq mi') FROM hivetable t;

    Area Units:

    Valid values for unit are the following area units:

    Value Description

    sq mi square miles

    sq km square kilometers

    sq in square inches

    sq ft square feet

    sq yd square yards

    sq mm square millimeters

    sq cm square centimeters

    sq m square meters

    sq survey ft square US Survey feet


    sq nmi square nautical miles

    acre acres

    ha hectares

    ClosestPoints

    The ClosestPoints function returns the closest points between the two geometries. The geometries that intersect are at distance zero from each other, and in this case a shared point is returned.

    Example:

    SELECT res[0], res[1] FROM (SELECT ClosestPoints(FromWkt(t.geometry1,'epsg:4267'), FromWkt(t.geometry2,'epsg:4267'), 'SPHERICAL') as res FROM hivetable t) temp;

    Distance

    The Distance function calculates and returns the distance between two geometries specified in parameters. This function returns the distance value in the unit specified. Distance is always non-negative. The geometries that intersect are at distance zero from each other.

    Example:

    SELECT Distance (FromWkt(t.geometry,'epsg:4267'), FromWkt(t.geometry2,'epsg:4267'), 'm','SPHERICAL') FROM hivetable t;

    SELECT Distance (FromWkt(t.geometry,'epsg:4267'), FromWkt(t.geometry2,'epsg:4267'), 'm') FROM hivetable t;

    Linear Units:

    Valid values for unit are the following distance units:

    Value Description

    mi miles

    km kilometers

    in inches

    ft feet


    yd yards

    mm millimeters

    cm centimeters

    m meters

    survey ft US Survey feet

    nmi nautical miles

    Length

    The Length function calculates and returns the geographic length of a line or polyline geometry object in the desired unit. The unit must be specified as a parameter while calling the function.

    Example:

    SELECT Length(FromWkt(t.geometry,'epsg:4267'), 'm', 'SPHERICAL') FROM hivetable t;

    SELECT Length(FromWkt(t.geometry,'epsg:4267'), 'm') FROM hivetable t;

    Linear Units:

    Valid values for unit are the following distance units:

    Value Description

    mi miles

    km kilometers

    in inches

    ft feet

    yd yards

    mm millimeters


    cm centimeters

    m meters

    survey ft US Survey feet

    nmi nautical miles

    Perimeter

    The Perimeter function calculates and returns the total perimeter of a given geometry in the desired unit. The unit must be specified as a parameter while calling the function. The Perimeter of a polygon is the sum of the lengths of its rings (both exterior and holes). The curves are considered as thin polygons.

    Example:

    SELECT Perimeter (FromWkt(t.geometry,'epsg:4267'), 'm', 'SPHERICAL') FROM hivetable t;

    SELECT Perimeter (FromWkt(t.geometry,'epsg:4267'), 'm') FROM hivetable t;

    Linear Units:

    Valid values for unit are the following distance units:

    Value Description

    mi miles

    km kilometers

    in inches

    ft feet

    yd yards

    mm millimeters

    cm centimeters


    m meters

    survey ft US Survey feet

    nmi nautical miles


    Processing Functions

    The following Processing functions are available:

    • Buffer
    • Intersection
    • Transform
    • ConvexHull
    • Union

    Buffer

    The Buffer function returns an instance of WritableGeometry having a MultiPolygon geometry inside it which represents a buffered distance around another geometry object.

    Example:

    SELECT Buffer(FromWKT(t.geometry,'epsg:4267'), 5.0, 'km', 12, 'SPHERICAL') FROM hivetable t;

    SELECT Buffer(ST_POINT(5, 6, 'epsg:4267'), 5.0, 'km', 12, 'SPHERICAL' );

    ConvexHull

    The ConvexHull function computes the convex hull of a geometry. The convex hull is the smallest convex geometry that contains all the points in the input geometry.

    Example:

    SELECT ConvexHull (FromWKT(geometry, 'epsg:4326')) FROM hivetable;

    SELECT ToWKT ( ConvexHull( FromWKT (table.geometry,'epsg:4267'))) as result FROM hivetable;

    SELECT ConvexHull (FromWKT('MULTIPOLYGON (((40 40, 20 45, 45 30, 40 40)), ((20 35, 10 30, 10 10, 30 5, 45 20, 20 35), (30 20, 20 15, 20 25, 30 20)))', 'epsg:4267'));

    Intersection

    The Intersection function returns the geometry (point, line, or curve) that is common to two geometry objects (such as lines, curves, planes, and surfaces). It returns the geometry consisting of direct positions that lie in both specified geometries.

    Example:

    SELECT Intersection (FromWKT(t1.geometry,'epsg:4326'), FromWKT("WKT_String",'epsg:4267')) FROM hivetable t1;

    SELECT Intersection (FromWKT(t1.geometry,'epsg:4267'), FromWKT(t2.geometry,'epsg:4267')) FROM hivetable1 t1, hivetable2 t2;


    Transform

    The Transform function transforms a given geometry from one coordinate system to another.

    Example:

    SELECT GeometryTransform (FromWKT(t.geometry,'epsg:4326'), 'epsg:3857') FROM hivetable t;

    SELECT GeometryTransform (ST_POINT(30, 20),'epsg:3857');

    Union

    The Union function returns a geometry object which represents the union of two input geometry objects.

    Example:

    SELECT Union (FromWKT(geometry1, 'epsg:4326'), FromWKT(geometry2, 'epsg:4326')) FROM hivetable;

    SELECT ToWKT(Union(FromWKT(t1.geometry,'epsg:4267'), FromWKT(t2.geometry,'epsg:4267'))) FROM hivetable1 t1, hivetable2 t2;


    Observer Functions

    Obtaining the X and Y ordinates of a geometry is important when dealing with XY tables. For example, the Transform UDF accepts and returns a geometry, which means an XY table cannot be transformed from one coordinate system to another directly. The ST_X and ST_Y UDFs allow the transformation of an XY table from one coordinate system to another, as illustrated below.

    Another common need is the ability to filter records in an XY table by the bounds of a geometry. The ST_XMax, ST_XMin, ST_YMax, and ST_YMin UDFs provide a way to get the values of the MBR for a writeable geometry.
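
    For example, an XY table can be transformed by constructing a point, transforming it, and reading the ordinates back out. The sketch below is illustrative only; the xytable table and its x and y columns are assumptions, not part of this guide:

    SELECT ST_X(GeometryTransform(ST_Point(t.x, t.y, 'epsg:4326'), 'epsg:3857')),
           ST_Y(GeometryTransform(ST_Point(t.x, t.y, 'epsg:4326'), 'epsg:3857'))
    FROM xytable t;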

    The following Observer functions are available:

    • ST_X
    • ST_XMax
    • ST_XMin
    • ST_Y
    • ST_YMax
    • ST_YMin

    ST_X

    The ST_X function returns the X ordinate of the geometry if the geometry is a point, or null if the geometry is not a point or is null. The result type is a double.

    Example

    SELECT ST_X(ST_Point(x, y, 'epsg:4326')) FROM src;

    ST_XMax

    The ST_XMax function returns the X maxima of a geometry, or NULL if the specified value is not a geometry. The output will be of type double.

    Example

    SELECT ST_XMax(FromWKT(…, 'epsg:4326')) FROM src;

    ST_XMin

    The ST_XMin function returns the X minima of a geometry, or NULL if the specified value is not a geometry. The output will be of type double.

    Example

    SELECT ST_XMin(FromWKT(…, 'epsg:4326')) FROM src;


    ST_Y

    The ST_Y function returns the Y ordinate of the geometry if the geometry is a point, or null if the geometry is not a point or is null. The result type is a double.

    Example

    SELECT ST_Y(ST_Point(x, y, 'epsg:4326')) FROM src;

    ST_YMax

    The ST_YMax function returns the Y maxima of a geometry, or NULL if the specified value is not a geometry. The output will be of type double.

    Example

    SELECT ST_YMax(FromWKT(…, 'epsg:4326')) FROM src;

    ST_YMin

    The ST_YMin function returns the Y minima of a geometry, or NULL if the specified value is not a geometry. The output will be of type double.

    Example

    SELECT ST_YMin(FromWKT(…, 'epsg:4326')) FROM src;


    Grid Index Functions

    A grid is a way of dividing the surface of the earth into contiguous cells with no gaps in between. This makes grids very useful for spatial indexing and aggregating.

    Spectrum™ Location Intelligence for Big Data provides Hive user-defined functions (UDFs) for hashing that allow you to manage grid cells for a variety of use cases. Hashing is a way of encoding and decoding the grid cell using the cell boundary and a unique identifier.

    We provide three types of UDFs for processing three grid cell shapes: rectangular (geohash), square (square hash) and hexagon (hexagon hash). Hashes are useful for analysis and interoperability with other systems.

    Square hash is similar to GeoHash but has the advantage that when displayed in Popular Mercator, the cells appear as squares.

    Hexagons are often used in telecommunication solutions as they approximate circles while covering the surface of the earth without gaps.
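
    As an illustration (not taken from the original samples), the three hash families can be computed for the same point, and a cell ID can be decoded back into its boundary; the coordinates and precision values below are arbitrary:

    SELECT GeoHashID(-73.750333, 42.736103, 3), SquareHashID(-73.750333, 42.736103, 3), HexagonID(-73.750333, 42.736103, 3);

    SELECT ToWKT(GeoHashBoundary(GeoHashID(-73.750333, 42.736103, 7)));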

    The following Grid index functions are available:

    • GeoHashBoundary
    • GeoHashID
    • SquareHashBoundary
    • SquareHashID
    • HexagonBoundary
    • HexagonID

    GeoHashBoundary

    The GeoHashBoundary function returns a WritableGeometry that defines the boundary of a cell in a grid if given a unique ID for the location. It also can return the unique ID for a given pair of coordinates and a precision. The shape of the cell is rectangular.

    Syntax: GeoHashBoundary(hashStringId)

    Examples:

    SELECT GeoHashBoundary (hashStringId) FROM hivetable;

    SELECT GeoHashBoundary ("ebvnk");

    Syntax: GeoHashBoundary(x, y, precision)

    Where

    X is the longitude value for the specified point.

    Y is the latitude value for the specified point.


    precision is the length of the string key to be returned. The precision determines how large the grid cells are (longer strings mean higher precision and smaller grid cells). It returns the string ID of the grid cell at the specified precision that contains the point.

    Examples:

    SELECT GeoHashBoundary(x, y, precision) FROM hivetable;

    SELECT GeoHashBoundary ("-73.750333", "42.736103", 3);

    SELECT GeoHashBoundary (-73.750333, 42.736103, 3);

    GeoHashID

    The GeoHashID function returns a unique, well-known string ID for the grid cell that corresponds to the specified X, Y, and precision. The ID is sortable and searchable.

    Syntax: GeoHashID(x, y, precision)

    Where

    X is the longitude value for the specified point.

    Y is the latitude value for the specified point.

    PRECISION is the length of the string key to be returned. The precision determines how large the grid cells are (longer strings mean higher precision and smaller grid cells). It returns the string ID of the grid cell at the specified precision that contains the point.

    Examples:

    SELECT GeoHashID(x, y, precision) FROM hivetable;

    SELECT GeoHashID ("-73.750333", "42.736103", 3);

    SELECT GeoHashID (-73.750333, 42.736103, 3);

    CREATE TEMPORARY TABLE tmptbl AS
    SELECT *, (GeoHashID(x, y, 10)) AS hashID
    FROM coordinates ORDER BY hashID;

    INSERT INTO TABLE coordinates_with_hash SELECT *, (GeoHashID(x, y, 10)) as hashID FROM coordinates ORDER BY hashID;

    SELECT c.hashID, ToWKT (GeoHashBoundary (c.hashID)), count (*) as quantity FROM (SELECT GeoHashID(x, y, 10) as hashID FROM coordinates) c GROUP BY c.hashID;


    HexagonBoundary

    The HexagonBoundary function returns a WritableGeometry that defines the boundary of a cell in a grid if given a unique ID for the location. It also can return the unique ID for a given pair of coordinates and a precision. The shape of the cell is a hexagon.

    Syntax: HexagonBoundary(hashStringId)

    Examples:

    SELECT HexagonBoundary (hashStringId) FROM hivetable;

    SELECT HexagonBoundary ("PF625028642");

    Syntax: HexagonBoundary(x, y, precision)

    Where

    X is the longitude value for the specified point.

    Y is the latitude value for the specified point.

    precision is the length of the string key to be returned. The precision determines how large the grid cells are (longer strings mean higher precision and smaller grid cells). It returns the string ID of the grid cell at the specified precision that contains the point.

    Examples:

    SELECT HexagonBoundary(x, y, precision) FROM hivetable;

    SELECT HexagonBoundary (-73.750333, 42.736103, 3);

    HexagonID

    The HexagonID function returns a unique, well-known string ID for the grid cell that corresponds to the specified X, Y, and precision. The ID is sortable and searchable.

    Syntax: HexagonID(x, y, precision)

    Where

    X is the longitude value for the specified point.

    Y is the latitude value for the specified point.

    PRECISION is the length of the string key to be returned. The precision determines how large the grid cells are (longer strings mean higher precision and smaller grid cells). It returns the string ID of the grid cell at the specified precision that contains the point.


    Examples:

    SELECT HexagonID(x, y, precision) FROM hivetable;

    SELECT HexagonID (-73.750333, 42.736103, 3);

    CREATE TEMPORARY TABLE tmptbl AS
    SELECT *, (HexagonID(x, y, 10)) AS hashID
    FROM coordinates ORDER BY hashID;

    INSERT INTO TABLE coordinates_with_hash SELECT *, (HexagonID(x, y, 10)) as hashID FROM coordinates ORDER BY hashID;

    SELECT c.hashID, ToWKT (HexagonBoundary (c.hashID)), count (*) as quantity FROM (SELECT HexagonID(x, y, 10) as hashID FROM coordinates) c GROUP BY c.hashID;

    SquareHashBoundary

    The SquareHashBoundary function returns a WritableGeometry that defines the boundary of a cell in a grid if given a unique ID for the location. It also can return the unique ID for a given pair of coordinates and a precision. Square hash cells appear square when displayed on a Popular Mercator map.

    Syntax: SquareHashBoundary(hashStringId)

    Examples:

    SELECT SquareHashBoundary (hashStringId) FROM hivetable;

    SELECT SquareHashBoundary ("03332");

    Syntax: SquareHashBoundary(x, y, precision)

    Where

    X is the longitude value for the specified point.

    Y is the latitude value for the specified point.

    precision is the length of the string key to be returned. The precision determines how large the grid cells are (longer strings mean higher precision and smaller grid cells). It returns the string ID of the grid cell at the specified precision that contains the point.


    Examples:

    SELECT SquareHashBoundary(x, y, precision) FROM hivetable;

    SELECT SquareHashBoundary ("-73.750333", "42.736103", 3);

    SELECT SquareHashBoundary (-73.750333, 42.736103, 3);

    SquareHashID

    The SquareHashID function takes a longitude, latitude (in WGS 84) and a precision. The precision determines how large the grid cells are (higher precision means smaller grid cells). It returns the string ID of the grid cell at the specified precision that contains the point. Square hash cells appear square when displayed on a Popular Mercator map.

    Syntax: SquareHashID(x, y, precision)

    Where

    X is the longitude value for the specified point.

    Y is the latitude value for the specified point.

    PRECISION is the length of the string key to be returned. The precision determines how large the grid cells are (longer strings mean higher precision and smaller grid cells). It returns the string ID of the grid cell at the specified precision that contains the point.

    Examples:

    SELECT SquareHashID(x, y, precision) FROM hivetable;

    SELECT SquareHashID ("-73.750333", "42.736103", 3);

    SELECT SquareHashID (-73.750333, 42.736103, 3);

    CREATE TEMPORARY TABLE tmptbl AS
    SELECT *, (SquareHashID(x, y, 10)) AS hashID
    FROM coordinates ORDER BY hashID;

    INSERT INTO TABLE coordinates_with_hash SELECT *, (SquareHashID(x, y, 10)) as hashID FROM coordinates ORDER BY hashID;

    SELECT c.hashID, ToWKT (SquareHashBoundary (c.hashID)), count (*) as quantity FROM (SELECT SquareHashID(x, y, 10) as hashID FROM coordinates) c GROUP BY c.hashID;


    Search Functions

    LocalPointInPolygon

    The LocalPointInPolygon UDTF returns the polygon (contained in the specified TAB or shapefile) in which an input point resides. All geometries that the search point lies within are returned by this UDTF; that is, if the search point lies on a Line, Polyline, Point, Polygon, or MultiPolygon geometry, the respective geometries will be returned in the output.

    Syntax: LocalPointInPolygon(inputPoint, dataSourcePath, map(options))

    Where:

    • inputPoint is a WritableGeometry representing a point
    • dataSourcePath is the path to the data source to be searched. The path can be either a relative path based on the remote resource or a local path to a file that must be available on the master node and every data node.

    Note: If you are storing and distributing your data remotely using HDFS or S3, you must set the option for remoteDataSourceLocation and also specify the download location as described in the table below.

    • options allow you to optionally set return criteria, in format:

    • shpCharset: the charset to use when reading a shapefile. Example: 'shpCharset', 'utf-8'
    • shpCrs: the coordinate reference system to use when reading a shapefile. Example: 'shpCrs', 'epsg:4326'
    • remoteDataSourceLocation: the path to the directory or archive that contains the data source (required only if you are storing and distributing data remotely on HDFS or S3). Example: 'remoteDataSourceLocation', 'hdfs:///data/mydata.zip'
    • downloadLocation: the local file system location to which resources get downloaded (required only if you are storing and distributing data remotely on HDFS or S3). Note: If you are also using Spectrum™ Geocoding for Big Data and have already set the pb.download.location Hive variable, then you do not need to set this option here as well. Example: 'downloadLocation', '/pb/downloads'
    • downloadGroup: the operating system group which should be applied to downloaded data on a local file system; the default is the value from the Hive property pb.download.group (required only if you are storing and distributing data remotely on HDFS or S3). Example: 'downloadGroup', 'pbdownloads'

    For more information, see Download Permissions on page 86.

    Example (using HDFS)

    SELECT pip_points.id, pipresult.capital, pipresult.state
    FROM pip_points
    LATERAL VIEW LocalPointInPolygon(FromWKT(pip_points.geometry, pip_points.crs),
      '/STATECAP.TAB',
      map('remoteDataSourceLocation', 'hdfs:///data/pip/capitals.zip',
          'downloadLocation', '/pb/pip/download',
          'downloadGroup', 'pbdownloads')) pipresult

    In the above example, id is a field from the pip_points table, which is the table being used to get the points we are searching from. The pipresult.capital and pipresult.state fields are from the STATECAP TAB file that we want in our query result.

    Tip: To improve performance when searching TAB files, consider creating PGD (prepared geometry) index files. For more information, see PGD Builder on page 85.

    LocalSearchNearest

    The LocalSearchNearest UDTF function returns the nearest geometry or geometries contained in the specified TAB or shapefile to an input point.

    Syntax: LocalSearchNearest(inputPoint, dataSourcePath, map(options))

    Where:

    • inputPoint is a WritableGeometry representing the point to search near


    • dataSourcePath is the location of the input TAB or shapefile. The path can be either a relative path based on the remote resource or a local path to a file that must be available on the master node and every data node.

    Note: If you are storing and distributing your data remotely using HDFS or S3, you must set the option for remoteDataSourceLocation and also specify the download location as described in the table below.

    • options allow you to optionally return more than one value, return additional information, or set other return criteria, in format:

    • maxCandidates: the maximum number of results to return (if not set, the default value is 1). Example: 'maxCandidates', '3'
    • maxDistance: the maximum distance to search for results (if not set, the default value is no limit). Example: 'maxDistance', '25'
    • distanceUnit: the distance unit (if not set, the default value is m for meters). See the Distance function on page 23 for examples of supported distance units. Example: 'distanceUnit', 'mi'
    • returnDistanceColumnName: the name of the column to use for returning the distance. Example: 'returnDistanceColumnName', 'Miles'
    • shpCharset: the charset to use when reading a shapefile. Example: 'shpCharset', 'utf-8'
    • shpCrs: the coordinate reference system to use when reading a shapefile. Example: 'shpCrs', 'epsg:4326'
    • remoteDataSourceLocation: the path to the directory or archive that contains the data source (required only if you are storing and distributing data remotely on HDFS or S3). Example: 'remoteDataSourceLocation', 'hdfs:///data/mydata.zip'
    • downloadLocation: the local file system location to which resources get downloaded (required only if you are storing and distributing data remotely on HDFS or S3). Note: If you are also using Spectrum™ Geocoding for Big Data and have already set the pb.download.location Hive variable, then you do not need to set this option here as well. Example: 'downloadLocation', '/pb/downloads'
    • downloadGroup: the operating system group which should be applied to downloaded data on a local file system; the default is the value from the Hive property pb.download.group (required only if you are storing and distributing data remotely on HDFS or S3). Example: 'downloadGroup', 'pbdownloads'

    For more information, see Download Permissions on page 86.

    Example (using HDFS)

    SELECT search_points.id, nearestresult.capital, nearestresult.state
    FROM search_points
    LATERAL VIEW OUTER LocalSearchNearest(FromWKT(search_points.geometry, search_points.crs),
      '/STATECAP.TAB',
      map('maxCandidates', '3',
          'remoteDataSourceLocation', 'hdfs:///data/search/capitals.zip',
          'downloadLocation', '/pb/search/download',
          'downloadGroup', 'pbdownloads')) nearestresult

    In the above example, id is a field from the search_points table, which is the table being used to get the points we are searching from. The nearestresult.capital and nearestresult.state fields are from the STATECAP TAB file that we want in our query result. In this particular example, the maxCandidates option limits the results to 3 records for each search point.

    Tip: To improve performance when searching TAB files, consider creating PGD (prepared geometry) index files. For more information, see PGD Builder on page 85.


    MapReduce Jobs

    MapReduce jobs are provided with Spectrum Location Intelligence for Big Data to process or produce large sets of data.

    • Polygon Filter
    • Hexagon Generator

    Polygon Filter

    The Polygon Filter is a MapReduce job that accepts a comma or tab-delimited file containing points (longitude/latitude) and attribute data, and matches them to a given boundary (polygon). This preprocessing operation determines whether or not your data resides inside (or outside) the polygon. The records that match the criteria are returned.

    To filter data with the polygon filter:

    1. Deploy the spectrum-bigdata-spatial-li-mapreduce-filter-version.jar on a Hadoop cluster on which input must be available.

    /dir/on/server/spectrum-bigdata-spatial-li-mapreduce-filter-version.jar

    2. Copy input data and boundary file to HDFS using the following commands:

    hadoop fs -copyFromLocal /dir/on/server/myinput.txt /dir/on/hdfs/input
    hadoop fs -copyFromLocal /dir/on/server/boundary.wkt /dir/on/hdfs/wkt

    3. Start Hadoop job using the following command:

    hadoop jar /dir/on/server/spectrum-bigdata-spatial-li-mapreduce-filter-version.jar
    com.pb.hadoop.mapreduce.filter.PolygonFilterDriver
    -input /dir/on/hdfs/input -output /dir/on/hdfs/output
    -boundary /dir/on/hdfs/wkt/boundary.wkt
    -longitudeColumn 0 -latitudeColumn 1
    -delimiter "\t" -quote " -escape "\\" -overwrite -contains true

    where:

    • -input The HDFS path to the input directory
    • -output The HDFS path to the output directory
    • -boundary The HDFS path to the boundary file
    • -longitudeColumn The 0-based index of the longitude column
    • -latitudeColumn The 0-based index of the latitude column

    Optional parameters:


    To control whether the data is evaluated against the inside or outside of the polygon, include the optional -contains parameter. If true, points within the boundary are included in the output. If false, points outside the boundary are included. If not specified, true is assumed.

    Polygon Filter supports tab- and comma-delimited input files. The default is tab. Include as appropriate:

    • For files with tabs: -delimiter "\t" • For files with commas: -delimiter ","

    It supports configurable quote characters used in the input files. The default is double quotes. Include as appropriate:

    • For files with a quote character as double quotes: -quote "”””" • For files with a quote character as a grave accent: -quote "`"

    It also supports configurable escape characters used in input files. If you do not specify the escape character in the input, it is configured as no escape. Include as appropriate:

    • For files with an escape character as backslash: -escape "\\"

    A file is returned containing every line from the input file for which the point is either inside or outside the boundary, depending on the -contains parameter. This file can now be the input to a Hive query for aggregating the data by hexagon.
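
    As an illustration of that follow-on step (not from the original guide), the filter output could be exposed to Hive as an external table and aggregated with a grid UDF such as HexagonID; the id, longitude, and latitude columns, the tab delimiter, and the output path are assumptions:

    CREATE EXTERNAL TABLE filtered_points (id STRING, longitude DOUBLE, latitude DOUBLE)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/dir/on/hdfs/output';

    SELECT c.hexID, count(*) as quantity
    FROM (SELECT HexagonID(longitude, latitude, 9) as hexID FROM filtered_points) c
    GROUP BY c.hexID;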


    Hexagon Generator

    This MapReduce job generates the hexagons within a bounding box (for example, the bounding box of the continental USA). Hexagon output can be used for map display.

    To create hexagons for a given bounding box:

    1. Deploy the jar file and configuration to the Hadoop cluster.

    /dir/on/server/spectrum-bigdata-li-mapreduce-hexgen-version.jar

    2. Modify the configuration according to the hexagons to be generated. Change the bounding box coordinates and hexagon level to suit your needs. Refer to Hexagons to learn about hexagon levels.

    • MinLongitude (example: -73.728200): the bottom left longitude of the bounding box.
    • MinLatitude (example: 40.979800): the bottom left latitude of the bounding box.
    • MaxLongitude (example: -71.787480): the upper right longitude of the bounding box.
    • MaxLatitude (example: 42.050496): the upper right latitude of the bounding box.
    • HexLevel (example: 9): the level to generate hexagons for. Must be between 1 and 11.
    • ContainerLevel (example: 2): a hint for providing some parallel hexagon generation. Must be less than the HexLevel property.

    3. Start the Hadoop job using the following command:

    Usage:

    hadoop jar spectrum-bigdata-li-mapreduce-hexgen-version.jar
    com.pb.bigdata.spatial.hex.mapreduce.HexGenDriver
    -conf /dir/on/server/config.xml
    -output /dir/on/hdfs/output


    The output of the Hexagon Generator is a list of Well Known Text (WKT) that represents the hexagons. Refer to Consuming Results for more information on how to use the output.

    Sample Output


    Spark Jobs

    Spark jobs are provided with Spectrum Location Intelligence and Spectrum Geocoding for Big Data to process large sets of data.

    • Polygon Filter
    • Hexagon Generator

    Polygon Filter

    The Polygon Filter is a Spark job that accepts a comma or tab-delimited file containing points (longitude/latitude) and attribute data, and matches them to a given boundary (polygon). This preprocessing operation determines whether or not the data resides inside (or outside) the polygon. The records that match the criteria are returned. Next, the data can be processed to assign geohashes or hexagons to each location and aggregate the data. The Polygon Filter is also useful for testing with a small subset of your data. This topic assumes the product is installed to /pb/li/sdk as described in Installing the SDK on page 8.

    To filter data with the polygon filter:

    1. Copy the data to /pb/temp/data.

    2. Copy the jar file to the Hadoop cluster.

    copy li-distrib/spark/filter/lib/spectrum-bigdata-li-spark1-filter-version.jar to /pb/li/sdk

    3. Create directories for the input data and boundary file, for example:

    hdfs dfs -mkdir -p hdfs:///pb/li/data/input
    hdfs dfs -mkdir -p hdfs:///pb/li/data/wkt

    4. Copy the input data and boundary file to HDFS, for example:

    hadoop fs -copyFromLocal /pb/temp/data/311data.txt hdfs:///pb/li/data/input
    hadoop fs -copyFromLocal /pb/temp/data/manhattan.wkt hdfs:///pb/li/data/wkt

    5. Start the Spark job using the following command, for example:

    spark-submit
    --class com.pb.bigdata.spatial.filter.spark.app.PolygonFilterDriver
    --master yarn --deploy-mode cluster
    /pb/li/sdk/spectrum-bigdata-li-spark1-filter-version.jar
    -input hdfs:///pb/li/data/input/311data.txt
    -boundary hdfs:///pb/li/data/wkt/manhattan.wkt
    -output /user/pbuser/filter/output -delimiter "\t"
    -longitudeColumn 1 -latitudeColumn 2 -overwrite -contains true


    where:

    • --input The path to the input directory
    • --boundary The path to the boundary file
    • --output The path to the output directory
    • --longitudeColumn The 0-based index of the longitude column
    • --latitudeColumn The 0-based index of the latitude column

    Optional parameters:

    To control whether the job overwrites the output directory, include the --overwrite parameter. Otherwise the job will fail if this directory already has content. This parameter does not have a value.

    To control whether the data is evaluated against the inside or outside of the polygon, include the optional -contains parameter. If true, points within the boundary are included in the output. If false, points outside the boundary are included. If not specified, true is assumed.

    Polygon Filter supports tab- and comma-delimited input files. The default is tab. Include as appropriate:

    • For files with tabs: -delimiter "\t" • For files with commas: -delimiter ","

    It supports configurable quote characters used in the input files. The default is double quotes. Include as appropriate:

    • For files with a quote character as double quotes: -quote "”””" • For files with a quote character as a grave accent: -quote "`"

    It also supports configurable escape characters used in input files. If you do not specify the escape character in the input, it is configured as no escape. Include as appropriate:

    • For files with an escape character as backslash: -escape "\\"

The job returns files containing every line from the input file whose point is inside or outside the boundary, depending on the -contains parameter. This output can then be used as the input to another Spark job, such as the Geohash aggregation sample.
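Conceptually, the job applies a point-in-polygon test to every record and keeps or drops it according to the -contains setting. The sketch below is a minimal, stand-alone illustration of that test in plain Java, using an invented boundary and record layout; the actual job reads the boundary from the WKT file and performs the test with the SDK's geometry operations.

    // Minimal point-in-polygon sketch (ray casting). Illustrative only; the
    // Polygon Filter job itself uses the SDK's geometry operations, not this code.
    public class PointInPolygonSketch {

        // Returns true if the point (x, y) falls inside the simple polygon
        // described by the vertex arrays px/py (no holes, not self-intersecting).
        static boolean contains(double[] px, double[] py, double x, double y) {
            boolean inside = false;
            for (int i = 0, j = px.length - 1; i < px.length; j = i++) {
                boolean crosses = (py[i] > y) != (py[j] > y)
                        && x < (px[j] - px[i]) * (y - py[i]) / (py[j] - py[i]) + px[i];
                if (crosses) {
                    inside = !inside;
                }
            }
            return inside;
        }

        public static void main(String[] args) {
            // Hypothetical rectangular boundary around lower Manhattan (longitude/latitude).
            double[] lon = {-74.03, -73.96, -73.96, -74.03};
            double[] lat = {40.69, 40.69, 40.75, 40.75};

            // A tab-delimited record with longitude in column 1 and latitude in column 2,
            // mirroring -longitudeColumn 1 -latitudeColumn 2 in the example above.
            String record = "311-complaint\t-74.005\t40.712\tNoise";
            String[] cols = record.split("\t");
            double x = Double.parseDouble(cols[1]);
            double y = Double.parseDouble(cols[2]);

            // With -contains true, only records inside the boundary are kept.
            System.out.println(contains(lon, lat, x, y) ? "keep" : "drop");
        }
    }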


    Hexagon Generator

    This Spark job generates the hexagons within a bounding box (for example, the bounding box of the continental USA). Hexagon output can be used for map display.

    To create hexagons for a given bounding box:

    1. Modify the configuration according to the hexagons to be generated. Change the bounding box coordinates and hexagon level to suit your needs. See Hexagons to learn about hexagon levels.

2. Deploy the jar and configuration to the Hadoop cluster.
3. Start the Spark job using the following command:

spark-submit --class com.pb.bigdata.spatial.hex.spark.app.HexGenDriver
    --master yarn --deploy-mode cluster --name
    /dir/on/server/spectrum-bigdata-li-spark1-hexgen-version.jar
    -output /dir/on/hdfs/output -conf -overwrite

The output of the Hexagon Generator is a list of WKT strings that represent the hexagons. See Consuming Results for how to use the output.

    Sample Output


    Hexagons

A hexagon is an effective way to represent data related to circular wave propagation, such as cell tower strength or noise pollution. Because hexagons closely approximate circles, they capture edge data better than rectangles, and they also tile a space with no gaps.

Spectrum™ Location Intelligence for Big Data provides an API for assigning locations to hexagons and aggregating the data in the hexagons for further analysis. The com.pb.hadoop.core.hex package contains classes for working with hexagons and retrieving information about them. Refer to Geohash Aggregation on page 79 for details about using the API. Javadocs are located in the /core folder of the zip file.

    The API provides an interface that assigns a hexagon and ID to each location and that ID is used to aggregate the data associated with the hexagon.

    One important hexagon parameter is the hexagon level. This, along with the longitude and latitude of a record, is used to get the hexagon or its ID for the location.


    The hexagon level refers to a hierarchy of hexagons that divide the earth's surface. Level 1 refers to the whole earth. Subsequent levels divide the previous level evenly into smaller units. The smaller the number, the higher the level and the larger the hexagon size. These hexagons form a fixed network with each hexagon having a specific unique identifier (ID).

Spectrum™ Location Intelligence for Big Data supports levels 1 through 11, with level 9 as the default. Level 9 consists of hexagons with an edge distance of approximately 56 meters at the equator. For a given level, the same longitude/latitude always produces the same hexagon with the same unique ID.
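The exact hierarchy and ID scheme used by the SDK are internal to the com.pb.hadoop.core.hex package; the sketch below only illustrates the general idea that a longitude/latitude pair maps deterministically to a hexagon cell whose ID can serve as an aggregation key. It uses a simple flat axial hexagon grid with an assumed cell size, not the SDK's levels.

    // Illustrative sketch only: assigns a longitude/latitude to a cell in a flat
    // axial hexagon grid and derives a repeatable ID for aggregation. The SDK's
    // actual hexagon hierarchy, levels, and ID scheme are provided by the
    // com.pb.hadoop.core.hex package and are not reproduced here.
    public class HexAssignSketch {

        // Edge length of a hexagon in degrees; a stand-in for the SDK's hexagon level.
        static final double SIZE = 0.0005;

        // Converts a point to axial hexagon coordinates (pointy-top layout).
        static long[] toHex(double lon, double lat) {
            double q = (Math.sqrt(3.0) / 3.0 * lon - lat / 3.0) / SIZE;
            double r = (2.0 / 3.0 * lat) / SIZE;
            return roundHex(q, r);
        }

        // Rounds fractional axial coordinates to the nearest hexagon center.
        static long[] roundHex(double q, double r) {
            double x = q, z = r, y = -x - z;
            long rx = Math.round(x), ry = Math.round(y), rz = Math.round(z);
            double dx = Math.abs(rx - x), dy = Math.abs(ry - y), dz = Math.abs(rz - z);
            if (dx > dy && dx > dz) {
                rx = -ry - rz;
            } else if (dy > dz) {
                ry = -rx - rz;
            } else {
                rz = -rx - ry;
            }
            return new long[] {rx, rz};
        }

        public static void main(String[] args) {
            // The same longitude/latitude always maps to the same cell ID,
            // which is what makes the ID usable as an aggregation key.
            long[] cell = toHex(-73.98, 40.75);
            String id = cell[0] + ":" + cell[1];
            System.out.println("hexagon id = " + id);
        }
    }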


  • 3 - Samples

    In this section

Risk Assessment 50
Fire Protection Assessment 72
Geohash Aggregation 79
Consuming Results 82

  • Samples

    Risk Assessment

A property and casualty insurer gains a competitive edge when it can effectively understand the risk in its book of business. To characterize the risk associated with a property, the insurer needs to associate many different kinds of risk factors with each property in its book of business. To understand overall portfolio risk, these individual risks are aggregated over various geographical regions to summarize total portfolio risk. These aggregated views help drive decisions about managing exposure to risk and optimizing productivity when writing insurance policies.

    The Risk Assessment sample application showcases the capabilities of the Pitney Bowes Spectrum™ Location Intelligence for Big Data with Apache Spark and MapReduce.


    Fire and Coastal Risk Determination

The Risk Assessment sample covers two types of risk: Fire and Coastal. The score for each risk is determined by running Point in Polygon and Search Nearest geometry operations against datasets specific to that risk.

    Fire Risk:

The Fire Risk is determined using the Fire Risk Pro dataset. The geocoded location is used to search for a record in the Fire Risk Pro dataset using a Point in Polygon search. The Fire Risk Pro table contains a RISKDESC field with the following values, each associated with a numeric risk score:

Numeric Value of Risk    Description
10                       Very High
7                        High
5                        Moderate
3                        Low
2                        Smoke Risk
0                        No Record Found

    Coastal Risk:

    The coastal risk is based on the distance from the geocoded location to the shoreline found in the US Shoreline table. The geocoded location is used to find the nearest shoreline record which will contain a line type geometry. The minimum distance between the geocoded location and the shoreline geometry is used to calculate a risk score as follows:

Numeric Value of Risk    Description
10                       100 ft or less
7                        250 ft or less (but greater than 100 ft)
5                        1000 ft or less (but greater than 250 ft)
3                        1 mile or less (but greater than 1000 ft)
2                        2 miles or less (but greater than 1 mile)
1                        5 miles or less (but greater than 2 miles)
0                        greater than 5 miles

Total Risk: Total risk is the sum of the Fire Risk and Coastal Risk scores.
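The two tables above translate directly into score lookups. The sketch below shows one way to express them in plain Java and to sum the two scores into a total risk value; the method names and the assumption that shoreline distance is available in feet are for illustration only.

    // Sketch of the score mappings from the tables above. The actual sample
    // applications compute these values with the SDK's spatial operations; this
    // code only illustrates how the lookup results translate into scores.
    public class RiskScoreSketch {

        static final double FEET_PER_MILE = 5280.0;

        // Maps the RISKDESC value from the Fire Risk Pro search to a numeric score.
        static int fireRiskScore(String riskDesc) {
            if (riskDesc == null) return 0;          // no record found
            switch (riskDesc) {
                case "Very High":  return 10;
                case "High":       return 7;
                case "Moderate":   return 5;
                case "Low":        return 3;
                case "Smoke Risk": return 2;
                default:           return 0;
            }
        }

        // Maps the distance (in feet) to the nearest shoreline to a coastal score.
        static int coastalRiskScore(double distanceInFeet) {
            if (distanceInFeet <= 100) return 10;
            if (distanceInFeet <= 250) return 7;
            if (distanceInFeet <= 1000) return 5;
            if (distanceInFeet <= FEET_PER_MILE) return 3;
            if (distanceInFeet <= 2 * FEET_PER_MILE) return 2;
            if (distanceInFeet <= 5 * FEET_PER_MILE) return 1;
            return 0;
        }

        public static void main(String[] args) {
            int fire = fireRiskScore("High");          // hypothetical lookup result
            int coastal = coastalRiskScore(800.0);     // hypothetical distance in feet
            System.out.println("total risk = " + (fire + coastal));
        }
    }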

The data used to determine the Fire and Coastal Risk scores is sample data from the Pitney Bowes Risk Data Suite, a set of MapInfo TAB files against which the geometry operations are performed. The way this data is used in the sample is for demonstration purposes only. More details about this product can be found at https://www.pitneybowes.com/us/data/boundary-data.html


    Application Flow

This application processes an entire input (book of business) to assign a geocoded location to each property, assign a risk score to each property, and then aggregate the individual property risks by geographical region.

    This application includes these processing stages:

• Geocode
• Boundary Risk Determination
• Shoreline Risk Determination
• Join
• Aggregate

Spatial operations require longitude and latitude columns. If the input contains address columns, the Geocode stage geocodes the input records to obtain longitude and latitude from those columns. The Geocode stage is not required if the input already contains longitude and latitude columns.

    Boundary Risk Determination uses the Point In Polygon spatial operation to verify whether or not the input record location is inside a risk boundary. If it is inside the risk boundary, then this process assigns the risk score of the boundary to the input record.

Shoreline Risk Determination performs the Nearest Search spatial operation to assign a risk score based on the distance between the input record and the shoreline boundary.

You can run Boundary Risk Determination and Shoreline Risk Determination in any order; they do not depend on other stages of the application. Because they are spatial operations, input records must have longitude and latitude columns.

The Join stage performs a join operation on two inputs: a left input and a right input.

The output of the Boundary Risk Determination, Shoreline Risk Determination, and Join stages can be used as input to any other stage of the application.

For example, use Boundary Risk Determination with fire risk boundaries and Shoreline Risk Determination with flood risk boundaries to get fire and flood risk scores, then use the Join stage to join these two risk scores into one record. You could then use the output of the Join stage as input to another Boundary Risk Determination stage with crime data.

The Aggregate stage aggregates records using the provided group-by column, aggregate column, and risk score columns.
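As a rough illustration of the Aggregate stage, the sketch below groups hypothetical records by a region column and sums a risk score column; the column positions and values are invented, since the sample reads them from its configuration.

    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch of the Aggregate stage's behavior: group records by a region column
    // and sum a risk score column. Column positions are assumptions for
    // illustration; the sample application reads them from its configuration.
    public class AggregateSketch {

        public static void main(String[] args) {
            // Hypothetical records: region (column 0) and total risk score (column 1).
            List<String[]> records = new ArrayList<>();
            records.add(new String[] {"06073", "12"});
            records.add(new String[] {"06073", "7"});
            records.add(new String[] {"06075", "3"});

            // Sum the risk score per region.
            Map<String, Integer> totals = new LinkedHashMap<>();
            for (String[] record : records) {
                String region = record[0];
                int score = Integer.parseInt(record[1]);
                totals.merge(region, score, Integer::sum);
            }

            totals.forEach((region, total) ->
                    System.out.println(region + "\t" + total));
        }
    }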

    The Application Flow is shown in the figure below:


    Sample data

Sample data for running the risk assessment application is available in the product distribution bundle in the data\riskAssessment\ directory, which has a TAB directory containing risk boundaries and a BOB.txt file containing input records.

    Geocode

Spatial operations require longitude and latitude columns. If the input contains address columns, the Geocode stage geocodes the input records to obtain longitude and latitude from those columns. The Geocode stage is not required if the input already contains longitude and latitude columns.

Steps to Execute Spark Job

    These steps assume you have installed Spectrum™ Geocoding for Big Data as outlined in the Spectrum™ Location Intelligence for Big Data Geocoding Install Guide on the Spectrum Spatial for Big Data documentation landing page.

1. Copy the input data to a Hadoop cluster. This data is available in the data\riskAssessment directory of the product distribution bundle.

    2. Copy the input data to HDFS using the following commands:

hadoop fs -copyFromLocal /PB/spectrum-bigdata-samples/data/riskAssessment/BOB.txt /dir/on/hdfs/input

    3. Start the Spark job using the command appropriate for your version of Spark:

    Spark 1.0

spark-submit --class com.pb.bigdata.geocoding.spark.app.GeocodeDriver
    --master yarn --deploy-mode cluster
    /pb/geocoding/sdk/spark1/driver/spectrum-bigdata-geocoding-spark1drivers-version-all.jar
    --input /dir/on/hdfs/input/BOB.txt
    --output /dir/on/hdfs/output
    --geocoding-output-fields x y
    --geocoding-config-location hdfs:///pb/geocoding/sdk/resources/config/
    --geocoding-binaries-location hdfs:///pb/geocoding/sdk/resources/nativeLibraries/bin/linux64/
    --download-location /pb/downloads
    --geocoding-preferences-filepath hdfs:///pb/geocoding/sdk/resources/config/geocodePreferences.xml
    --geocoding-input-fields streetName=0 areaName3=1 areaName1=2 postCode1=3
    --geocoding-country USA --num-partitions=15

    Spark 2.0

spark-submit --class com.pb.bigdata.geocoding.spark.app.GeocodeDriver
    --master yarn --deploy-mode cluster
    /pb/geocoding/sdk/spark2/driver/spectrum-bigdata-geocoding-spark2drivers-version-all.jar
    --input /user/pbuser/customers/addresses.csv
    --output /user/pbuser/customers_geocoded
    --geocoding-output-fields x y
    --geocoding-config-location hdfs:///pb/geocoding/sdk/resources/config/
    --geocoding-binaries-location hdfs:///pb/geocoding/sdk/resources/nativeLibraries/bin/linux64/
    --download-location /pb/downloads
    --geocoding-preferences-filepath hdfs:///pb/geocoding/sdk/resources/config/geocodePreferences.xml
    --geocoding-input-fields streetName=0 areaName3=1 areaName1=2 postCode1=3
    --geocoding-country USA --num-partitions=15

    The longitude and latitude columns are appended to the input record as output.

    Boundary Risk Determination

After the book of business is geocoded, the application assesses the risk for each property by reading the file generated in the first step (containing the geocoded book of business) and performing a risk determination for wildfire damage based on the location of the property. For more information, refer to Risk Determination.

This step generates an output file containing the input book of business (addresses and insured value), the geocoded location, and a fire risk score; records that are not successfully processed are saved in the failures directory.

Steps to Execute MapReduce Job


    1. Deploy spectrum-bigdata-samples-riskassessment-mapreduce-version.jar and input data to the Hadoop cluster.

    2. Upload boundary risk data (.TAB format) to HDFS.

hadoop fs -copyFromLocal /PB/spectrum-bigdata-samples/data/fireProtectionAssessment/FireRiskPro/FireRiskPro_FIPS_06073.* /dir/on/hdfs/referenceData

Note: The LI SDK API needs the TAB data on the local file system in order to create a native table from it. The MapReduce job downloads the data to the local file system at run time.

3. Copy the input data to the Hadoop cluster. This data is available in the sampleData\riskAssessment directory of the product distribution bundle.

    4. Copy input data to HDFS using the following commands:

hadoop fs -copyFromLocal /PB/spectrum-bigdata-samples/data/riskAssessment/BOB.txt /dir/on/hdfs/input

    5. Start Hadoop job using the following command:

hadoop jar spectrum-bigdata-samples-riskassessment-mapreduce-version.jar
    com.pb.bigdata.spatial.mapreduce.app.BoundaryMRDriver
    -input /dir/on/hdfs/input
    -output /dir/on/hdfs/output
    -config /PB/spectrum-bigdata-samples/riskAssessment/mapReduce/data/config.xml
    -overwrite

    where:

• -input The HDFS path to the input directory
• -output The HDFS path to the output directory
• -config The configuration file path on the server

    Optional parameters:

    The Boundary Driver also supports -overwrite to overwrite the old output folder on HDFS.

    The Boundary Driver supports tab- and comma-delimited input files. The default is tab. Include as appropriate:

• For files with tabs: -delimiter "\t"
• For files with commas: -delimiter ","

The quote character used in the input files is configurable. The default is a double quote. Include as appropriate:

• For files with a double quote as the quote character: -quote "\""
• For files with a grave accent as the quote character: -quote "`"

It also supports a configurable escape character, used to escape the quote character in the input files. If you do not specify an escape character, no escape character is used. Include as appropriate:

    • For files with an escape character as backslash: -escape "\\"


    Steps to Execute Spark Job

    1. Deploy spectrum-bigdata-samples-riskassessment-spark-version.jar and input data to the Hadoop cluster.

    2. Upload boundary risk data (.TAB format) to HDFS.

hadoop fs -copyFromLocal /PB/spectrum-bigdata-samples/data/fireProtectionAssessment/FireRiskPro/FireRiskPro_FIPS_06073.* /dir/on/hdfs/referenceData

3. Copy the input data to the Hadoop cluster. This data is available in the sampleData\riskAssessment directory of the product distribution bundle.

    4. Copy the input data to HDFS using the following commands:

hadoop fs -copyFromLocal /PB/spectrum-bigdata-samples/data/riskAssessment/BOB.txt /dir/on/hdfs/input

    5. Start Spark job using the following command:

    export SPARK_MAJOR_VERSION=2

spark-submit --class com.pb.bigdata.spatial.sample.riskassessment.spark.app.BoundaryDriver
    --master yarn --deploy-mode client
    /home/pbuser/riskAssessment/spectrum-bigdata-samples-riskassessment-spark-version.jar
    -conf /home/pbuser/riskAssessment/config_RiskAssesment.xml
    -input /user/pbuser/boundaryDriver/input/BOB.txt
    -output /user/pbuser/boundaryDriver/output
    -overwrite -failures /user/pbuser/boundaryDriver/failures

    where:

• -input The HDFS path to the input directory
• -output The HDFS path to the output directory
• -conf The configuration file path on the server
• -master The master node's address

    Optional parameters:

The Boundary Driver also supports -overwrite to overwrite the old output folder on HDFS and -failures to specify the failures directory on HDFS. Additionally, --name can be passed as an optional parameter to set the application name.

    The Boundary Driver supports tab- and comma-delimited input files. The default is tab. Include as appropriate:

• For files with tabs: -delimiter "\t"
• For files with commas: -delimiter ","

The quote character used in the input files is configurable. The default is a double quote. Include as appropriate:

• For files with a double quote as the quote character: -quote "\""
• For files with a grave accent as the quote character: -quote "`"

It also supports a configurable escape character, used to escape the quote character in the input files. If you do not specify an escape character, no escape character is used. Include as appropriate:

    • For files with an escape character as backslash: -escape "\\"

The boundary risk score is appended to the input record as output.

Configuration

This XML file contains the configuration required for the Spark and MapReduce jobs. Each property is listed below with its sample value and description:

• dataNodeTabPath (/PB/spectrum-bigdata-geocoding/TAB/): Location of TAB files to download from HDFS.
• fireRiskBoundaryTabPath (/PB/spectrum-bigdata-geocoding/TAB/FireRiskPro/FireRiskPro_FIPS_06073.TAB): Location of the fire risk boundary data in .TAB format.
• riskDescColName (RISKDESC): Name of the risk description column in the risk boundary data.
• delimiter (\t): The delimiter to use for parsing the input file.
• quote ("): The quote character used in the input file.
• escape (\): The escape character used for escaping the quote character in the input file.
• numOfColumns (12): The number of columns in an input record.
• coordinateSystem (EPSG:4326): The coordinate system used to create a point geometry from the longitude and latitude in the input record.
• longitudeColumn (8): The 0-based index of the longitude column.
• latitudeColumn (7): The 0-based index of the latitude column.
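To make the column-related settings above concrete, the sketch below parses one hypothetical input record under this configuration (tab delimiter, 12 columns, longitude in column 8 and latitude in column 7) and extracts the coordinates used by the spatial lookups; the record content itself is invented.

    // Sketch of how an input record is interpreted under the configuration above:
    // tab-delimited, 12 columns, longitude in column 8 and latitude in column 7
    // (0-based). The record content is hypothetical.
    public class RecordParseSketch {

        public static void main(String[] args) {
            String record = String.join("\t",
                    "policy-001", "123 Main St", "San Diego", "CA", "92101",
                    "500000", "dwelling", "32.7157", "-117.1611", "extra1",
                    "extra2", "extra3");

            String[] columns = record.split("\t", -1);   // keep empty trailing columns
            if (columns.length != 12) {
                throw new IllegalArgumentException("expected 12 columns, got " + columns.length);
            }

            double latitude = Double.parseDouble(columns[7]);
            double longitude = Double.parseDouble(columns[8]);

            // The point built here (in EPSG:4326) is what the boundary and
            // shoreline lookups operate on.
            System.out.println("point = (" + longitude + ", " + latitude + ")");
        }
    }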


    Shoreline Risk Determination

    After the fire risk has been calculated, the application assesses risk associated with storm surge or coastal flooding based on the property’s distance to the shoreline. For more information, refer to Risk Determination.

This step outputs a file containing the input book of business (addresses and insured value), the geocoded location, and a flood risk score; records that are not successfully processed are saved in the failures directory.

Steps to Execute MapReduce Job

    1. Deploy spectrum-bigdata-samples-riskassessment-mapreduce-version.jar and input data to the Hadoop cluster.

    2. Upload boundary risk data (.TAB format) to HDFS.

hadoop fs -copyFromLocal /PB/spectrum-bigdata-samples/data/riskAssessment/TAB/ShorelinePlus/Shoreline_Plus_FIPS_06073.* /dir/on/hdfs/referenceData

Note: The LI SDK API needs the TAB data on the local file system in order to create a native table from it. The MapReduce job downloads the data to the local file system at run time.

3. Copy the input data to the Hadoop cluster. This data is available in the sampleData\riskAssessment directory of the product distribution bundle.

    4. Copy input data to HDFS using the following commands:

hadoop fs -copyFromLocal /PB/spectrum-bigdata-samples/data/riskAssessment/BOB.txt /dir/on/hdfs/input

    5. Start a Hadoop job using the following command:

hadoop jar spectrum-bigdata-samples-riskassessment-mapreduce-version.jar
    com.pb.bigdata.spatial.mapreduce.app.ShorelineMRDriver
    -input /dir/on/hdfs/input
    -output /dir/on/hdfs/output
    -config /PB/spectrum-bigdata-samples/riskAssessment/mapReduce/data/config.xml
    -overwrite

    where:

• -input The HDFS path to the input directory
• -output The HDFS path to the output directory
• -config The configuration file path on the server

    Optional Parameters:

    The Shoreline Driver also supports -overwrite to overwrite the old output folder on HDFS.

    The Shoreline Driver supports tab- and comma-delimited input files. The default is tab. Include as appropriate:

• For files with tabs: -delimiter "\t"
• For files with commas: -delimiter ","

The quote character used in the input files is configurable. The default is a double quote. Include as appropriate:

• For files with a double quote as the quote character: -quote "\""
• For files with a grave accent as the quote character: -quote "`"

It also supports a configurable escape character, used to escape the quote character in the input files. If you do not specify an escape character, no escape character is used. Include as appropriate:

    • For files with an escape character as backslash: -escape "\\"

    Steps to Execute Spark Job

    1. Deploy spectrum-bigdata-samples-riskassessment-spark-version.jar and input data to the Hadoop cluster.

    2. Upload boundary risk data (.TAB format) to HDFS.

hadoop fs -copyFromLocal /PB/spectrum-bigdata-samples/data/riskAssessment/TAB/ShorelinePlus/Shoreline_Plus_FIPS_06073.* /dir/on/hdfs/referenceData

3. Copy the input data to the Hadoop cluster. This data is available in the sampleData\riskAssessment directory of the product distribution bundle.

    4. Copy the input data to HDFS using the following commands:

hadoop fs -copyFromLocal /PB/spectrum-bigdata-samples/data/riskAssessment/BOB.txt /dir/on/hdfs/input

    5. Start the Spark job using the following command:

spark-submit --class com.pb.bigdata.spatial.spark.app.ShorelineDriver
    --master local[*] --deploy-mode client
    /home/pbuser/riskAssessment/spectrum-bigdata-samples-riskassessment-spark-version.jar
    -config /PB/spectrum-bigdata-samples/riskAssessment/spark/data/config.xml
    -input /dir/on/hdfs/input
    -output /dir/on/hdfs/output
    -overwrite -failures /dir/on/hdfs/failures

    where:

• -input The HDFS path to the input directory
• -output The HDFS path to the output directory
• -config The configuration file path on the server
• -master The master node's address

    Optional Parameters:


The Shoreline Driver also supports -overwrite to overwrite the old output folder on HDFS and -failures to specify the failures directory on HDFS. Additionally, --name can be passed as an optional parameter to set the application name.

    The Shoreline Driver supports tab- and comma-delimited input files. The default is tab. Include as appropriate:

• For files with tabs: -delimiter "\t"
• For files with commas: -delimiter ","

The quote character used in the input files is configurable. The default is a double quote. Include as appropriate:

• For files with a double quote as the quote character: -quote "\""
• For files with a grave accent as the quote character: -quote "`"

It also supports a configurable escape character, used to escape the quote character in the input files. If you do not specify an escape character, no escape character is used. Include as appropriate:

    • For files with an escape character as backslash: -escape "\\"

The shoreline risk score is appended to the input record as output.

Configuration

This XML file contains the configuration required for the Spark and MapReduce jobs. Each property is listed below with its sample value and description:

• dataNodeTabPath (/PB/spectrum-bigdata-geocoding/TAB/): Location of TAB files to download from HDFS.
• shorelineBoundaryTabPath (/PB/spectrum-bigdata-geocoding/TAB/ShorelinePlus/Shoreline_Plus_FIPS_06073.TAB): Location of the shoreline risk boundary data in .TAB format.
• riskDescColName (RISKDESC): Name of the risk description column in the risk boundary data.
• delimiter (\t): The delimiter to use for parsing the input file.
• quote ("): The quote character used in the input file.
• escape (\): The escape character used for escaping the quote character in the input file.
• numOfColumns (12): The number of columns in an input record.
• coordinateSystem (EPSG:4326): The coordinate system used to create a point geometry from the longitude and latitude in the input record.
• longitudeColumn (8): The 0-based index of the longitude column.
• latitudeColumn (7): The 0-based index of the latitude column.

    Join

A Join operation is performed on the results of the BoundaryDriver stage and the ShorelineDriver stage. The join returns the records that are present in both the left input and the right input.
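The sketch below illustrates that behavior with plain Java: the right input is indexed by a shared record key and each matching left record is emitted with the right record's risk score appended. The key column and record layouts are assumptions for illustration; the sample drivers read their layout from the configuration file.

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch of the Join stage's behavior: match left and right records on a
    // shared key and append the right record's risk score to the left record.
    // The key column and score positions are assumptions for illustration.
    public class JoinSketch {

        public static void main(String[] args) {
            // Hypothetical shoreline output (left): record id, address, coastal score.
            List<String[]> left = Arrays.asList(
                    new String[] {"policy-001", "123 Main St", "5"},
                    new String[] {"policy-002", "456 Oak Ave", "0"});

            // Hypothetical fire-risk output (right): record id, fire score.
            List<String[]> right = Arrays.asList(
                    new String[] {"policy-001", "7"},
                    new String[] {"policy-002", "3"});

            // Index the right input by its key so each left record can find its match.
            Map<String, String[]> rightByKey = new HashMap<>();
            for (String[] r : right) {
                rightByKey.put(r[0], r);
            }

            // Emit only records present in both inputs, with the right-hand score appended.
            for (String[] l : left) {
                String[] match = rightByKey.get(l[0]);
                if (match != null) {
                    System.out.println(String.join("\t", l[0], l[1], l[2], match[1]));
                }
            }
        }
    }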

Steps to Execute MapReduce Job

    1. Deploy spectrum-bigdata-samples-riskassessment-mapreduce-version.jar and input data to the Hadoop cluster.

    2. Copy input data to HDFS using the following commands:

hadoop fs -copyFromLocal /PB/spectrum-bigdata-samples/leftInput.txt /dir/on/hdfs/input
hadoop fs -copyFromLocal /PB/spectrum-bigdata-samples/rightInput.wkt /dir/on/hdfs/wkt

    3. Start Hadoop job using the following command:

hadoop jar spectrum-bigdata-samples-riskassessment-mapreduce-version.jar
    com.pb.bigdata.spatial.mapreduce.app.JoinMRDriver
    -input /dir/on/hdfs/input
    -output /dir/on/hdfs/output
    -config /PB/spectrum-bigdata-samples/mapReduce/data/config.xml
    -overwrite

    where:

• -input The HDFS path to the input directory
• -output The HDFS path to the output directory
• -config The configuration file path on the server

    Optional parameters:

    The Join Driver also supports -overwrite to overwrite the old output folder on HDFS.

    Join Driver supports tab- and comma-delimited input files. The default is tab. Include as appropriate:

• For files with tabs: -delimiter "\t"
• For files with commas: -delimiter ","


The quote character used in the input files is configurable. The default is a double quote. Include as appropriate:

• For files with a double quote as the quote character: -quote "\""
• For files with a grave accent as the quote character: -quote "`"

It also supports a configurable escape character, used to escape the quote character in the input files. If you do not specify an escape character, no escape character is used. Include as appropriate:

    • For files with an escape character as backslash: -escape "\\"

    Steps to Execute Spark Job

    1. Deploy spectrum-bigdata-samples-riskassessment-spark-version.jar and input data to the Hadoop cluster.

    2. Copy the input data to HDFS using the following commands:

hadoop fs -copyFromLocal /PB/spectrum-bigdata-geocoding/leftInput.txt /dir/on/hdfs/leftInput

    3. Start the Spark job using the following command:

spark-submit --class com.pb.bigdata.spatial.spark.app.JoinDriver
    --master local[*] --deploy-mode cluster
    /home/centos/risk-spark/spectrum-bigdata-samples-riskassessment-spark-version.jar
    -leftInput /home/centos/MR_Risk_Assessment/shoreline_risk
    -rightInput /home/centos/MR_Risk_Assessment/fire_risk
    -output /home/centos/MR_Risk_Assessment/join_output
    -overwrite -config /home/centos/risk-spark/config_spark.xml
    -failures /home/centos/MR_Risk_Assessment/join_failures

    where:

• -leftInput The HDFS path to the left input directory
• -rightInput The HDFS path to the right input directory
• -output The HDFS path to the output directory
• -config The configuration file path on the server
• -master The master node's address

    Optional parameters:

The Join Driver also supports -overwrite to overwrite the old output folder on HDFS and -failures to specify the failures directory on HDFS. Additionally, --name can be passed as an optional parameter to set the application name.

    Join Driver supports tab- and comma-delimited input files. The default is tab. Include as appropriate:

• For files with tabs: -delimiter "\t"
• For files with commas: -delimiter ","

The quote character used in the input files is configurable. The default is a double quote. Include as appropriate:

• For files with a double quote as the quote character: -quote "\""
• For files with a grave accent as the quote character: -quote "`"

It also supports a configurable escape character, used to escape the quote character in the input files. If you do not specify an escape character, no escape character is used. Include as appropriate:

    • For files with an escape character as backslash: -escape "\\"

The risk scores from the matching record in the right input are appended to the record from the left input.

Configuration

This XML file contains the configuration required for the Spark and MapReduce jobs. Each property is listed below with its sample value and description:

• delimiter (\t): The delimiter to use for parsing the input file.
• quote ("): The quote character used in the input file.

    esc