
  • Spectrum™ Location Intelligence for Big Data Version 3.1

    Spectrum™ Location Intelligence for Big Data User Guide

  • Table of Contents

    1 - Welcome

    What is Spectrum™ Location Intelligence for Big Data? 4

    Spectrum™ Location Intelligence for Big Data Architecture 5

    System Requirements and Dependencies 5

    2 - Spatial

    Installing the SDK 8
    Hive User-Defined Spatial Functions 9
    MapReduce Jobs 40
    Spark Jobs 44
    Hexagons 47

    3 - Samples

    Risk Assessment 50
    Fire Protection Assessment 72
    Geohash Aggregation 79
    Consuming Results 82

    4 - Appendix

    PGD Builder 85
    Download Permissions 86

  • 1 - Welcome

    In this section

    What is Spectrum™ Location Intelligence for Big Data? 4
    Spectrum™ Location Intelligence for Big Data Architecture 5
    System Requirements and Dependencies 5


    What is Spectrum™ Location Intelligence for Big Data?

    The Pitney Bowes Spectrum™ Location Intelligence for Big Data is a toolkit for large-scale spatial analysis of enterprise data. Billions of records can be processed in parallel using MapReduce, Hive, and Apache Spark's cluster processing framework, yielding results faster than ever. Data processing that took weeks with traditional techniques can now be completed in a few hours with this product.


    Spectrum™ Location Intelligence for Big Data Architecture


    Spectrum™ Location Intelligence for Big Data transforms and packages Location Intelligence components into an SDK for Big Data platforms such as Hadoop, for use with MapReduce, Spark, and Hive.

    SDK provides:

    • Integration APIs for Location Intelligence
    • Input datasets and metadata

    API Types:

    • Pre-built MapReduce, Spark, and Hive UDF wrappers for Location Intelligence operations
    • Core Location Intelligence APIs with sample MapReduce/Hive/Spark programs (security enabled via Kerberos and Apache Sentry for Hive)

    System Requirements and Dependencies

    Spectrum™ Location Intelligence for Big Data is a collection of jar files that can be deployed to your Hadoop system.

    This product is verified on the following Hadoop distributions.

    • Cloudera 5.12 and 6.0
    • Hortonworks 2.6
    • EMR 5.10
    • MapR 6.0 and above, with MapR Expansion Pack (MEP) 5.0.0

    To use these jar files, you must be familiar with configuring Hadoop in Hortonworks, Cloudera, EMR, or MapR, and with developing applications for distributed processing. For more information, refer to the Hortonworks (http://docs.hortonworks.com/index.html), Cloudera (http://www.cloudera.com/documentation.html), EMR (https://aws.amazon.com/documentation/emr/), or MapR (https://mapr.com/docs/) documentation.

    To use the product, the following must be installed on your system:

    for Hive:

    • Hive version 1.2.1 or above

    for a Hive client:

    • Beeline, for example

    for Spark and Zeppelin Notebook:

    • Java JDK version 1.8 or above
    • Hadoop version 2.6.0 or above
    • Spark version 1.6.0 or above (2.0 or above required for MapR and Cloudera 6.0)
    • Zeppelin Notebook is not supported in Cloudera


  • 2 - Spatial

    This section describes the MapReduce jobs, Spark jobs and Hive user defined functions (UDFs) for geometry and coordinate operations and the ability to read TAB files.

    MapReduce and Spark jobs use the Location Intelligence SDK (LI SDK) API in map and reduce operations to use the big data processing systems for spatial data analysis. The LI SDK provides geometry and coordinate operations, the ability to read TAB files, and in-memory r-tree creation and searching.

    Hive UDFs also use the LI SDK API to provide SQL-like functions for spatial analysis in Hive.

    In this section

    Installing the SDK 8
    Hive User-Defined Spatial Functions 9
    MapReduce Jobs 40
    Spark Jobs 44
    Hexagons 47


    Installing the SDK

    To use spatial functions for Spectrum™ Location Intelligence for Big Data, the Hadoop cluster must have reference data and libraries accessible from each master and data node at the file-system level.

    For the purposes of this guide, we will:

    • use a user called pbuser
    • install everything into /pb

    Perform the following steps from a node in your cluster, such as the master node.

    1. Create the install directory and give ownership to pbuser.

    sudo mkdir /pb
    sudo chown pbuser:pbuser /pb

    2. Add the Location Intelligence distribution zip to the node at a temporary location, for example:

    /pb/temp/spectrum-bigdata-locationintelligence-version.zip

    3. Extract the Location Intelligence distribution.

    mkdir /pb/li
    mkdir /pb/li/sdk
    unzip /pb/temp/spectrum-bigdata-locationintelligence-version.zip -d /pb/li/sdk


    Hive User-Defined Spatial Functions

    Hive user-defined functions (UDFs) let you run MapReduce jobs using SQL-like syntax, so there is no need to write code. Spectrum™ Location Intelligence for Big Data and Spectrum Geocoding for Big Data provide Hive user-defined functions for geometry operations and for working with grids in the spectrum-bigdata-spatial-li-hive-version.jar.
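
    For instance, once the functions are registered (see the Setup topic), a spatial operation can be run directly from a Hive query. The sketch below is an illustration only; the table and column names are assumed rather than taken from this guide:

    SELECT ToWKT(Buffer(FromWKT(t.geometry, 'epsg:4326'), 5.0, 'km', 12, 'SPHERICAL')) FROM hivetable t;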

    Refer to the table below to quickly navigate to Hive UDFs described in this document:

    • Constructor Functions on page 16: construct an instance of WritableGeometry from supported geometry representation formats. Functions: FromGeoJSON, FromKML, FromWKB, FromWKT, ST_Point
    • Grid Index Functions on page 31: grid processing functions. Functions: GeoHashBoundary, GeoHashID, HexagonBoundary, HexagonID, SquareHashBoundary, SquareHashID
    • Measurement Functions on page 22: geometry measurement functions. Functions: Area, ClosestPoints, Distance, Length, Perimeter
    • Observer Functions on page 29: geometry observer functions. Functions: ST_X, ST_Y, ST_XMax, ST_XMin, ST_YMax, ST_YMin
    • Persistence Functions on page 18: serialize an instance of WritableGeometry to supported geometry representation formats. Functions: ToGeoJSON, ToKML, ToWKB, ToWKT
    • Predicate Functions on page 20: geometry predicate functions. Functions: Disjoint, Intersects, Overlaps, Within
    • Processing Functions on page 27: geometry processing functions. Functions: Buffer, ConvexHull, Intersection, Transform, Union
    • Search Functions on page 36: spatial search functions. Functions: LocalPointInPolygon, LocalSearchNearest


    Setup

    This topic assumes the product is installed to /pb/li/sdk as described in Installing the SDK on page 8. To set up user-defined spatial functions for Hive, perform the following steps:

    1. Proceed according to your platform.

    On this platform, do this:

    Cloudera: Copy the Hive jar for Location Intelligence to the HiveServer node.

    /pb/li/sdk/hive/lib/spectrum-bigdata-li-hive-version.jar

    In Cloudera Manager, navigate to the Hive Configuration page. Search for the Hive Auxiliary JARs Directory setting. If the value is already set, then move the Hive jar into the specified folder. If the value is not set, then set it to the parent folder of the Hive jar.

    /pb/li/sdk/hive/lib/

    Hortonworks: On the HiveServer2 node, create the Hive auxlib folder if one does not already exist.

    sudo mkdir /usr/hdp/current/hive-server2/auxlib/

    Copy the Hive jar for Location Intelligence to the auxlib folder on the HiveServer2 node:

    sudo cp /pb/li/sdk/hive/lib/spectrum-bigdata-li-hive-version.jar /usr/hdp/current/hive-server2/auxlib/

    MapR: Copy the Hive jar for Location Intelligence to the HiveServer node.

    /pb/li/sdk/hive/lib/spectrum-bigdata-li-hive-version.jar

    2. Restart all Hive services.

    3. Launch Beeline, or some other Hive client, for the remaining steps.

    beeline -u jdbc:hive2://localhost:10000/default -n pbuser


    4. Register spatial user-defined functions. Add the temporary keyword after create if you want a temporary function (this step would need to be redone for every new Hive session).

    create temporary function FromWKT as 'com.pb.bigdata.spatial.hive.construct.FromWKT';
    create temporary function FromWKB as 'com.pb.bigdata.spatial.hive.construct.FromWKB';
    create temporary function FromKML as 'com.pb.bigdata.spatial.hive.construct.FromKML';
    create temporary function FromGeoJSON as 'com.pb.bigdata.spatial.hive.construct.FromGeoJSON';
    create temporary function ST_Point as 'com.pb.bigdata.spatial.hive.construct.ST_Point';

    create temporary function ToWKT as 'com.pb.bigdata.spatial.hive.persistence.ToWKT';
    create temporary function ToWKB as 'com.pb.bigdata.spatial.hive.persistence.ToWKB';
    create temporary function ToKML as 'com.pb.bigdata.spatial.hive.persistence.ToKML';
    create temporary function ToGeoJSON as 'com.pb.bigdata.spatial.hive.persistence.ToGeoJSON';

    create temporary function Disjoint as 'com.pb.bigdata.spatial.hive.predicate.Disjoint';
    create temporary function Overlaps as 'com.pb.bigdata.spatial.hive.predicate.Overlaps';
    create temporary function Within as 'com.pb.bigdata.spatial.hive.predicate.Within';
    create temporary function Intersects as 'com.pb.bigdata.spatial.hive.predicate.Intersects';

    create temporary function Area as 'com.pb.bigdata.spatial.hive.measurement.Area';
    create temporary function ClosestPoints as 'com.pb.bigdata.spatial.hive.measurement.ClosestPoints';
    create temporary function Distance as 'com.pb.bigdata.spatial.hive.measurement.Distance';
    create temporary function Length as 'com.pb.bigdata.spatial.hive.measurement.Length';
    create temporary function Perimeter as 'com.pb.bigdata.spatial.hive.measurement.Perimeter';

    create temporary function ConvexHull as 'com.pb.bigdata.spatial.hive.processing.ConvexHull';
    create temporary function Intersection as 'com.pb.bigdata.spatial.hive.processing.Intersection';
    create temporary function Buffer as 'com.pb.bigdata.spatial.hive.processing.Buffer';
    create temporary function Union as 'com.pb.bigdata.spatial.hive.processing.Union';
    create temporary function GeometryTransform as 'com.pb.bigdata.spatial.hive.processing.Transform';

    create temporary function ST_X as 'com.pb.bigdata.spatial.hive.observer.ST_X';
    create temporary function ST_XMax as 'com.pb.bigdata.spatial.hive.observer.ST_XMax';
    create temporary function ST_XMin as 'com.pb.bigdata.spatial.hive.observer.ST_XMin';
    create temporary function ST_Y as 'com.pb.bigdata.spatial.hive.observer.ST_Y';
    create temporary function ST_YMax as 'com.pb.bigdata.spatial.hive.observer.ST_YMax';
    create temporary function ST_YMin as 'com.pb.bigdata.spatial.hive.observer.ST_YMin';

    create temporary function GeoHashBoundary as 'com.pb.bigdata.spatial.hive.grid.GeoHashBoundary';
    create temporary function GeoHashID as 'com.pb.bigdata.spatial.hive.grid.GeoHashID';
    create temporary function HexagonBoundary as 'com.pb.bigdata.spatial.hive.grid.HexagonBoundary';
    create temporary function HexagonID as 'com.pb.bigdata.spatial.hive.grid.HexagonID';
    create temporary function SquareHashBoundary as 'com.pb.bigdata.spatial.hive.grid.SquareHashBoundary';
    create temporary function SquareHashID as 'com.pb.bigdata.spatial.hive.grid.SquareHashID';

    create temporary function LocalSearchNearest as 'com.pb.bigdata.spatial.hive.search.LocalSearchNearest';
    create temporary function LocalPointInPolygon as 'com.pb.bigdata.spatial.hive.search.LocalPointInPolygon';

    Note: If you want to view the complete stack trace for any encountered error, enable logging in DEBUG mode and then restart the job execution.

    5. MapR only: Set hive.aux.jars.path in hive-site.xml and (for Hive v2.1 and earlier only) HIVE_AUX_JARS_PATH in hive-env.sh, using full paths to the jar files (not to folders), on only the nodes that are running HiveServer2 or the Hive metastore (that is, the master node or nodes).

    • /opt/mapr/hive/hive-version/conf/hive-site.xml

    Qualify the hive.aux.jars.path entries with the file:// URI prefix and separate multiple paths with a comma.

    hive.aux.jars.path


    file:///pb/li/sdk/hive/lib/spectrum-bigdata-li-hive-version.jar,file:///pb/geocoding/sdk/hive/lib/spectrum-bigdata-geocoding-hive-version.jar

    • /opt/mapr/hive/hive-version-2.1-or-earlier/conf/hive-env.sh

    Export the environment variable and separate multiple paths with a colon (:).

    export HIVE_AUX_JARS_PATH=/pb/li/sdk/hive/lib/spectrum-bigdata-li-hive-version.jar:/pb/geocoding/sdk/hive/lib/spectrum-bigdata-geocoding-hive-version.jar

    • The first time you run a job, it may take a while if the reference data has to be downloaded remotely from HDFS or S3. The job may also time out when using a large number of datasets that are stored in remote locations such as HDFS or S3. If you are using Hive with the MapReduce engine, you can adjust the value of the mapreduce.task.timeout property, as shown in the example following these notes.

    • Some types of queries will cause Hive to evaluate UDFs in the HiveServer2 process space instead of on a data node. The Routing UDFs in particular use a significant amount of memory and can shut down the Hive server due to memory constraints. To process these queries, we recommend increasing the amount of memory available to the HiveServer2 process (for example, by setting HADOOP_HEAPSIZE in hive-env.sh).
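
    For example, when Hive runs on the MapReduce engine, the timeout can be raised for the current session before running a query. This is only an illustration; the value shown is arbitrary and should be tuned for your environment:

    SET mapreduce.task.timeout=1200000;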


    WritableGeometry

    This is an implementation of Hadoop's Writable interface for geometry.

    Spatial Hive user defined functions (UDFs) use WritableGeometry to exchange data between two functions. Constructor Hive functions provide a mechanism to get an instance of WritableGeometry from standard geometry formats like WKT, WKB, GeoJSON and KML. For example:

    To get an instance of WritableGeometry from WKT:

    SELECT FromWKT(t.geometry,'epsg:4267') FROM hivetable t;

    To get an instance of WritableGeometry from WKB string:

    SELECT FromWKB(t.geometry,'epsg:4267') FROM hivetable t;

    Persistence Hive UDFs convert an instance of WritableGeometry to standard formats like WKT, WKB, GeoJSON and KML. For example:

    To serialize an instance of WritableGeometry to WKT:

    SELECT ToWKT(t.geometry) FROM hivetable t;

    The output of Constructor functions can be supplied as input to other Hive functions that perform some operations on it:

    For example:

    To calculate the length of a geometry:

    SELECT Length(FromWKT(t.geometry, 'epsg:4267'), 'm', 'SPHERICAL') FROM hivetable t;

    To get the distance between two geometries:

    SELECT Distance(FromWKT(t.geometry,'epsg:4267'), FromWKT(t.geometry2,'epsg:4267'), 'm','SPHERICAL') FROM hivetable t;

    For more information, see https://hadoop.apache.org/docs/r2.6.1/api/org/apache/hadoop/io/Writable.html.


    Geometry Functions

    • Constructor Functions
    • Grid Functions
    • Measurement Functions
    • Observer Functions
    • Persistence Functions
    • Predicate Functions
    • Processing Functions


    Constructor Functions

    The following Constructor functions are available:

    • FromWKT
    • FromGeoJSON
    • FromWKB
    • FromKML
    • ST_Point

    FromGeoJSON

    The FromGeoJSON function returns a WritableGeometry instance from a GeoJSON representation of a geometry.

    Example:

    SELECT FromGeoJSON('{ "type": "Point", "coordinates": [100.0, 0.0] }');
    SELECT FromGeoJSON(t.geometry) FROM hivetable t;

    FromKML

    The FromKML function returns a WritableGeometry instance from the text formatted in KML (Keyhole Markup Language).

    Example:

    SELECT FromKML(t.geometry) FROM hivetable t;

    FromWKB

    The FromWKB function returns a WritableGeometry instance from a Well-Known Binary (WKB) of a geometry. The geometry will be created using the specified coordinate system.

    Example:

    SELECT FromWKB (t.geometry, 'epsg:4267') FROM hivetable t;

    FromWKT

    The FromWKT function returns a WritableGeometry instance from a Well-Known Text (WKT) representation of a geometry. The geometry is created using the specified coordinate system.

    Examples:

    SELECT FromWKT(t.geometry,'epsg:4267') FROM hivetable t;

    SELECT FromWKT ('POINT (30 20)', 'epsg:4267');


    ST_Point

    The ST_Point function constructs a point geometry from the provided X and Y, and an optional CRS. X and Y can be either of String or Numeric types. If the CRS is not provided or null or empty, then EPSG:4326 will be used as the default CRS. If any of the argument values are invalid, then null will be returned in the output.

    To create a temporary Hive function:

    create temporary function ST_Point as 'com.pb.bigdata.spatial.hive.construct.ST_Point';

    Examples:

    SELECT ST_Point(-73.750333 , 42.736103, 'epsg:4326');

    SELECT ST_Point('-73.750333' , '42.736103', 'epsg:4326');

    SELECT ST_Point(-73.750333 , 42.736103);

    SELECT ST_Point('-73.750333' , 42.736103);

    SELECT ST_Point(p.x, p.y, p.crs) FROM points p;


    Persistence Functions

    The following Persistence functions are available:

    • ToWKT
    • ToGeoJSON
    • ToWKB
    • ToKML

    ToGeoJSON

    The ToGeoJSON function returns text in GeoJSON format representing the geometry, as serialized from the specified WritableGeometry instance.

    Example:

    SELECT ToGeoJSON(FromGeoJSON(t.geometry)) FROM hivetable t;
    SELECT ToGeoJSON(Buffer(FromGeoJSON(t.geometry), 5.0, 'km', 12, 'SPHERICAL')) FROM hivetable t;

    ToKML

    The ToKML function returns text formatted as KML in the OGC standard KML 2.2 namespace (http://schemas.opengis.net/kml/2.2.0/ogckml22.xsd), as serialized from the specified WritableGeometry instance.

    Example:

    SELECT ToKML(FromKML(t.geometry)) FROM hivetable t;
    SELECT ToKML(Buffer(FromGeoJSON(t.geometry), 5.0, 'km', 12, 'SPHERICAL')) FROM hivetable t;

    ToWKB

    The ToWKB function returns the Well-Known Binary (WKB) representation of a geometry, as serialized from the specified WritableGeometry instance.

    Example:

    SELECT ToWKB(FromWKB(t.geometry, 'epsg:4326')) FROM hivetable t;
    SELECT ToWKB(Buffer(FromGeoJSON(t.geometry), 5.0, 'km', 12, 'SPHERICAL')) FROM hivetable t;

    ToWKT

    The ToWKT function returns a Well-Known Text (WKT) representation of a geometry from the specified WritableGeometry instance.


    Example:

    SELECT ToWKT(FromWKT(t.geometry, 'epsg:4326')) FROM hivetable t;
    SELECT ToWKT(Buffer(FromGeoJSON(t.geometry), 5.0, 'km', 12, 'SPHERICAL')) FROM hivetable t;


    Predicate Functions

    The following Predicate functions are available:

    • Disjoint
    • Intersects
    • Overlaps
    • Within

    Disjoint

    The Disjoint function returns True if two geometry objects have no points in common, otherwise False is returned.

    If either geometry1 or geometry2 are null, Null is returned.

    Example:

    SELECT Disjoint(FromWKT(t1.geometry, 'epsg:4326'), FromWKT(t2.geometry, 'epsg:4326')) FROM hivetable1 t1, hivetable2 t2;

    Intersects

    The Intersects function determines whether or not one geometry object intersects another geometry object. It returns True if there is any direct position in common between the two geometries, or else False is returned.

    If either geometry1 or geometry2 are null, Null is returned.

    Example:

    SELECT Intersects(FromWKT(t1.geometry, 'epsg:4326'), FromWKT(t2.geometry, 'epsg:4326')) FROM hivetable1 t1, hivetable2 t2;

    SELECT R.highway FROM USA_RIVERS L, usa_highways R where L.name='Hudson River' and Intersects(FromWKT(L.geom), FromWKT(R.geom));

    Overlaps

    The Overlaps function determines whether or not one geometry object overlaps another geometry object. This function returns True if the geometry1 overlaps the geometry2, otherwise False is returned.

    If either geometry1 or geometry2 are null, Null is returned.

    Example:

    SELECT Overlaps(FromWKT(t1.geometry, 'epsg:4326'), FromWKT(t2.geometry, 'epsg:4326')) FROM hivetable1 t1, hivetable2 t2;


    Within

    The Within function returns whether or not one geometry object is entirely within another geometry object. It returns True if the geometry2 entirely contains geometry1, otherwise False is returned.

    If either the testGeometry or the containerGeometry are null, Null is returned.

    Example:

    SELECT Within(FromWKT(t1.geometry, 'epsg:4326'), FromWKT(t2.geometry, 'epsg:4326')) as Result FROM hivetable1 t1, hivetable2 t2;

    SELECT L.zipcode as zipcode, SUM(L.insurance) as TotalInsuredAmount, AVG(R.riskdesc) as RiskScore
    FROM book_of_business L, FIRE_RISK_BOUNDRIES R
    WHERE Within(FromWKT(L.location), FromWKT(R.geom))
    GROUP BY L.zipcode;


    Measurement Functions

    The following Measurement functions are available:

    • Area
    • Length
    • Perimeter
    • Distance
    • ClosestPoints

    Area

    The Area function calculates and returns the area of given Geometry in the desired unit. The unit must be specified as a parameter while calling the function. The area of a polygon is computed as the area of its exterior ring minus the areas of its interior rings. Points and curves have zero area.

    Example:

    SELECT Area (FromWkt(t.geometry,'epsg:4267'), 'sq mi', 'SPHERICAL') FROM hivetable t;

    SELECT Area (FromWkt(t.geometry,'epsg:4267'), 'sq mi') FROM hivetable t;

    Area Units:

    Valid values for unit are the following area units:

    Value Description

    sq mi square miles

    sq km square kilometers

    sq in square inches

    sq ft square feet

    sq yd square yards

    sq mm square millimeters

    sq cm square centimeters

    sq m square meters

    sq survey ft square US Survey feet


    sq nmi square nautical miles

    acre acres

    ha hectares

    ClosestPoints

    The ClosestPoints function returns the closest points between the two geometries. The geometries that intersect are at distance zero from each other, and in this case a shared point is returned.

    Example:

    SELECT res[0], res[1] FROM (SELECT ClosestPoints(FromWkt(t.geometry1,'epsg:4267'), FromWkt(t.geometry2,'epsg:4267'), 'SPHERICAL') as res FROM hivetable t) temp;

    Distance

    The Distance function calculates and returns the distance between two geometries specified in parameters. This function returns the distance value in the unit specified. Distance is always non-negative. The geometries that intersect are at distance zero from each other.

    Example:

    SELECT Distance (FromWkt(t.geometry,'epsg:4267'), FromWkt(t.geometry2,'epsg:4267'), 'm','SPHERICAL') FROM hivetable t;

    SELECT Distance (FromWkt(t.geometry,'epsg:4267'), FromWkt(t.geometry2,'epsg:4267'), 'm') FROM hivetable t;

    Linear Units:

    Valid values for unit are the following distance units:

    Value Description

    mi miles

    km kilometers

    in inches

    ft feet


    yd yards

    mm millimeters

    cm centimeters

    m meters

    survey ft US Survey feet

    nmi nautical miles

    Length

    The Length function calculates and returns the geographic length of a line or polyline geometry object in the desired unit. The unit must be specified as a parameter while calling the function.

    Example:

    SELECT Length(FromWkt(t.geometry,'epsg:4267'), 'm', 'SPHERICAL') FROM hivetable t;

    SELECT Length(FromWkt(t.geometry,'epsg:4267'), 'm') FROM hivetable t;

    Linear Units:

    Valid values for unit are the following distance units:

    Value Description

    mi miles

    km kilometers

    in inches

    ft feet

    yd yards

    mm millimeters


    cm centimeters

    m meters

    survey ft US Survey feet

    nmi nautical miles

    Perimeter

    The Perimeter function calculates and returns the total perimeter of a given geometry in the desired unit. The unit must be specified as a parameter while calling the function. The Perimeter of a polygon is the sum of the lengths of its rings (both exterior and holes). The curves are considered as thin polygons.

    Example:

    SELECT Perimeter (FromWkt(t.geometry,'epsg:4267'), 'm', 'SPHERICAL') FROM hivetable t;

    SELECT Perimeter (FromWkt(t.geometry,'epsg:4267'), 'm') FROM hivetable t;

    Linear Units:

    Valid values for unit are the following distance units:

    Value Description

    mi miles

    km kilometers

    in inches

    ft feet

    yd yards

    mm millimeters

    cm centimeters


    m meters

    survey ft US Survey feet

    nmi nautical miles


    Processing Functions

    The following Processing functions are available:

    • Buffer
    • Intersection
    • Transform
    • ConvexHull
    • Union

    Buffer

    The Buffer function returns an instance of WritableGeometry having a MultiPolygon geometry inside it which represents a buffered distance around another geometry object.

    Example:

    SELECT Buffer(FromWKT(t.geometry,'epsg:4267'), 5.0, 'km', 12, 'SPHERICAL') FROM hivetable t;

    SELECT Buffer(ST_POINT(5, 6, 'epsg:4267'), 5.0, 'km', 12, 'SPHERICAL' );

    ConvexHull

    The ConvexHull function computes the convex hull of a geometry. The convex hull is the smallest convex geometry that contains all the points in the input geometry.

    Example:

    SELECT ConvexHull (FromWKT(geometry, 'epsg:4326')) FROM hivetable;

    SELECT ToWKT ( ConvexHull( FromWKT (table.geometry,'epsg:4267'))) as result FROM hivetable;

    SELECT ConvexHull (FromWKT('MULTIPOLYGON (((40 40, 20 45, 45 30, 40 40)), ((20 35, 10 30, 10 10, 30 5, 45 20, 20 35), (30 20, 20 15, 20 25, 30 20)))', 'epsg:4267'));

    Intersection

    The Intersection function returns the geometry (point, line, or curve) that is common to two geometry objects (such as lines, curves, planes, and surfaces). It returns the geometry consisting of direct positions that lie in both specified geometries.

    Example:

    SELECT Intersection (FromWKT(t1.geometry,'epsg:4326'), FromWKT("WKT_String",'epsg:4267')) FROM hivetable t1;

    SELECT Intersection (FromWKT(t1.geometry,'epsg:4267'), FromWKT(t2.geometry,'epsg:4267')) FROM hivetable1 t1, hivetable2 t2;


    Transform

    The Transform function transforms a given geometry from one coordinate system to another.

    Example:

    SELECT GeometryTransform (FromWKT(t.geometry,'epsg:4326'), 'epsg:3857') FROM hivetable t;

    SELECT GeometryTransform (ST_POINT(30, 20),'epsg:3857');

    Union

    The Union function returns a geometry object which represents the union of two input geometry objects.

    Example:

    SELECT Union (FromWKT(geometry1, 'epsg:4326'), FromWKT(geometry2, 'epsg:4326')) FROM hivetable;

    SELECT ToWKT(Union(FromWKT(t1.geometry,'epsg:4267'), FromWKT(t2.geometry,'epsg:4267'))) FROM hivetable1 t1, hivetable2 t2;


    Observer Functions

    Obtaining the X and Y ordinates of a geometry is important when dealing with XY tables. For example, the Transform UDF accepts and returns a geometry, which means an XY table cannot be transformed from one coordinate system to another directly. The ST_X and ST_Y UDFs allow the transformation of an XY table from one coordinate system to another, as illustrated below.

    Another common need is the ability to filter records in an XY table by the bounds of a geometry. The ST_XMax, ST_XMin, ST_YMax, and ST_YMin UDFs provide a way to get the values of the MBR for a writeable geometry.
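
    For example, an XY table can be transformed by constructing a point, transforming it, and reading the ordinates back out. The sketch below is illustrative only; the xytable table and its x and y columns are assumptions, not part of this guide:

    SELECT ST_X(GeometryTransform(ST_Point(t.x, t.y, 'epsg:4326'), 'epsg:3857')),
           ST_Y(GeometryTransform(ST_Point(t.x, t.y, 'epsg:4326'), 'epsg:3857'))
    FROM xytable t;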

    The following Observer functions are available:

    • ST_X
    • ST_XMax
    • ST_XMin
    • ST_Y
    • ST_YMax
    • ST_YMin

    ST_X

    The ST_X function returns the X ordinate of the geometry if the geometry is a point, or null if the geometry is not a point or is null. The result type is a double.

    Example

    SELECT ST_X(ST_Point(x, y, 'epsg:4326')) FROM src;

    ST_XMax

    The ST_XMax function returns the X maxima of a geometry, or NULL if the specified value is not a geometry. The output will be of type double.

    Example

    SELECT ST_XMax(FromWKT(…, 'epsg:4326')) FROM src;

    ST_XMin

    The ST_XMin function returns the X minima of a geometry, or NULL if the specified value is not a geometry. The output will be of type double.

    Example

    SELECT ST_XMin(FromWKT(…, 'epsg:4326')) FROM src;


    ST_Y

    The ST_Y function returns the Y ordinate of the geometry if the geometry is a point, or null if the geometry is not a point or is null. The result type is a double.

    Example

    SELECT ST_Y(ST_Point(x, y, 'epsg:4326')) FROM src;

    ST_YMax

    The ST_YMax function returns the Y maxima of a geometry, or NULL if the specified value is not a geometry. The output will be of type double.

    Example

    SELECT ST_YMax(FromWKT(…, 'epsg:4326')) FROM src;

    ST_YMin

    The ST_YMin function returns the Y minima of a geometry, or NULL if the specified value is not a geometry. The output will be of type double.

    Example

    SELECT ST_YMin(FromWKT(…, 'epsg:4326')) FROM src;


    Grid Index Functions

    A grid is a way of dividing the surface of the earth into contiguous cells with no gaps in between. This makes grids very useful for spatial indexing and aggregating.

    Spectrum™ Location Intelligence for Big Data provides Hive user-defined functions (UDFs) for hashing that allow you to manage grid cells for a variety of use cases. Hashing is a way of encoding and decoding the grid cell using the cell boundary and a unique identifier.

    We provide three types of UDFs for processing three grid cell shapes: rectangular (geohash), square (square hash) and hexagon (hexagon hash). Hashes are useful for analysis and interoperability with other systems.

    Square hash is similar to GeoHash but has the advantage that when displayed in Popular Mercator, the cells appear as squares.

    Hexagons are often used in telecommunication solutions as they approximate circles while covering the surface of the earth without gaps.
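
    As an illustration (not taken from the original samples), the three hash families can be computed for the same point, and a cell ID can be decoded back into its boundary; the coordinates and precision values below are arbitrary:

    SELECT GeoHashID(-73.750333, 42.736103, 3), SquareHashID(-73.750333, 42.736103, 3), HexagonID(-73.750333, 42.736103, 3);

    SELECT ToWKT(GeoHashBoundary(GeoHashID(-73.750333, 42.736103, 7)));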

    The following Grid index functions are available:

    • GeoHashBoundary
    • GeoHashID
    • SquareHashBoundary
    • SquareHashID
    • HexagonBoundary
    • HexagonID

    GeoHashBoundary

    The GeoHashBoundary function returns a WritableGeometry that defines the boundary of a cell in a grid if given a unique ID for the location. It also can return the unique ID for a given pair of coordinates and a precision. The shape of the cell is rectangular.

    Syntax: GeoHashBoundary(hashStringId)

    Examples:

    SELECT GeoHashBoundary (hashStringId) FROM hivetable;

    SELECT GeoHashBoundary ("ebvnk");

    Syntax: GeoHashBoundary(x, y, precision)

    Where

    X is the longitude value for the specified point.

    Y is the latitude value for the specified point.


    precision is the length of the string key to be returned. The precision determines how large the grid cells are (longer strings mean higher precision and smaller grid cells). It returns the string ID of the grid cell at the specified precision that contains the point.

    Examples:

    SELECT GeoHashBoundary(x, y, precision) FROM hivetable;

    SELECT GeoHashBoundary ("-73.750333", "42.736103", 3);

    SELECT GeoHashBoundary (-73.750333, 42.736103, 3);

    GeoHashID

    The GeoHashID function returns a unique, well-known string ID for the grid cell that corresponds to the specified X, Y, and precision. The ID is sortable and searchable.

    Syntax: GeoHashID(x, y, precision)

    Where

    X is the longitude value for the specified point.

    Y is the latitude value for the specified point.

    PRECISION is the length of the string key to be returned. The precision determines how large the grid cells are (longer strings mean higher precision and smaller grid cells). It returns the string ID of the grid cell at the specified precision that contains the point.

    Examples:

    SELECT GeoHashID(x, y, precision) FROM hivetable;

    SELECT GeoHashID ("-73.750333", "42.736103", 3);

    SELECT GeoHashID (-73.750333, 42.736103, 3);

    CREATE TEMPORARY TABLE tmptbl AS
    SELECT *, (GeoHashID(x, y, 10)) AS hashID
    FROM coordinates ORDER BY hashID;

    INSERT INTO TABLE coordinates_with_hash SELECT *, (GeoHashID(x, y, 10)) as hashID FROM coordinates ORDER BY hashID;

    SELECT c.hashID, ToWKT (GeoHashBoundary (c.hashID)), count (*) as quantity FROM (SELECT GeoHashID(x, y, 10) as hashID FROM coordinates) c GROUP BY c.hashID;


    HexagonBoundary

    The HexagonBoundary function returns a WritableGeometry that defines the boundary of a cell in a grid if given a unique ID for the location. It also can return the unique ID for a given pair of coordinates and a precision. The shape of the cell is a hexagon.

    Syntax: HexagonBoundary(hashStringId)

    Examples:

    SELECT HexagonBoundary (hashStringId) FROM hivetable;

    SELECT HexagonBoundary ("PF625028642");

    Syntax: HexagonBoundary(x, y, precision)

    Where

    X is the longitude value for the specified point.

    Y is the latitude value for the specified point.

    precision is the length of the string key to be returned. The precision determines how large the grid cells are (longer strings mean higher precision and smaller grid cells). It returns the string ID of the grid cell at the specified precision that contains the point.

    Examples:

    SELECT HexagonBoundary(x, y, precision) FROM hivetable;

    SELECT HexagonBoundary (-73.750333, 42.736103, 3);

    HexagonID

    The HexagonID function returns a unique, well-known string ID for the grid cell that corresponds to the specified X, Y, and precision. The ID is sortable and searchable.

    Syntax: HexagonID(x, y, precision)

    Where

    X is the longitude value for the specified point.

    Y is the latitude value for the specified point.

    PRECISION is the length of the string key to be returned. The precision determines how large the grid cells are (longer strings mean higher precision and smaller grid cells). It returns the string ID of the grid cell at the specified precision that contains the point.


    Examples:

    SELECT HexagonID(x, y, precision) FROM hivetable;

    SELECT HexagonID (-73.750333, 42.736103, 3);

    CREATE TEMPORARY TABLE tmptbl AS
    SELECT *, (HexagonID(x, y, 10)) AS hashID
    FROM coordinates ORDER BY hashID;

    INSERT INTO TABLE coordinates_with_hash SELECT *, (HexagonID(x, y, 10)) as hashID FROM coordinates ORDER BY hashID;

    SELECT c.hashID, ToWKT (HexagonBoundary (c.hashID)), count (*) as quantity FROM (SELECT HexagonID(x, y, 10) as hashID FROM coordinates) c GROUP BY c.hashID;

    SquareHashBoundary

    The SquareHashBoundary function returns a WritableGeometry that defines the boundary of a cell in a grid if given a unique ID for the location. It also can return the unique ID for a given pair of coordinates and a precision. Square hash cells appear square when displayed on a Popular Mercator map.

    Syntax: SquareHashBoundary(hashStringId)

    Examples:

    SELECT SquareHashBoundary (hashStringId) FROM hivetable;

    SELECT SquareHashBoundary ("03332");

    Syntax: SquareHashBoundary(x, y, precision)

    Where

    X is the longitude value for the specified point.

    Y is the latitude value for the specified point.

    precision is the length of the string key to be returned. The precision determines how large the grid cells are (longer strings mean higher precision and smaller grid cells). It returns the string ID of the grid cell at the specified precision that contains the point.


    Examples:

    SELECT SquareHashBoundary(x, y, precision) FROM hivetable;

    SELECT SquareHashBoundary ("-73.750333", "42.736103", 3);

    SELECT SquareHashBoundary (-73.750333, 42.736103, 3);

    SquareHashID

    The SquareHashID function takes a longitude, latitude (in WGS 84) and a precision. The precision determines how large the grid cells are (higher precision means smaller grid cells). It returns the string ID of the grid cell at the specified precision that contains the point. Square hash cells appear square when displayed on a Popular Mercator map.

    Syntax: SquareHashID(x, y, precision)

    Where

    X is the longitude value for the specified point.

    Y is the latitude value for the specified point.

    PRECISION is the length of the string key to be returned. The precision determines how large the grid cells are (longer strings mean higher precision and smaller grid cells). It returns the string ID of the grid cell at the specified precision that contains the point.

    Examples:

    SELECT SquareHashID(x, y, precision) FROM hivetable;

    SELECT SquareHashID ("-73.750333", "42.736103", 3);

    SELECT SquareHashID (-73.750333, 42.736103, 3);

    CREATE TEMPORARY TABLE tmptbl AS
    SELECT *, (SquareHashID(x, y, 10)) AS hashID
    FROM coordinates ORDER BY hashID;

    INSERT INTO TABLE coordinates_with_hash SELECT *, (SquareHashID(x, y, 10)) as hashID FROM coordinates ORDER BY hashID;

    SELECT c.hashID, ToWKT (SquareHashBoundary (c.hashID)), count (*) as quantity FROM (SELECT SquareHashID(x, y, 10) as hashID FROM coordinates) c GROUP BY c.hashID;


    Search Functions

    LocalPointInPolygon

    The LocalPointInPolygon UDTF returns the polygon (contained in the specified TAB or shapefile) in which an input point resides. All geometries that the search point lies within are returned by this UDTF; that is, if the search point lies on a Line, Polyline, Point, Polygon, or MultiPolygon geometry, the respective geometries will be returned in the output.

    Syntax: LocalPointInPolygon(inputPoint, dataSourcePath, map(options))

    Where:

    • inputPoint is a WritableGeometry representing a point
    • dataSourcePath is the path to the data source to be searched. The path can be either a relative path based on the remote resource or a local path to a file that must be available on the master node and every data node.

    Note: If you are storing and distributing your data remotely using HDFS or S3, you must set the option for remoteDataSourceLocation and also specify the download location as described in the table below.

    • options allow you to optionally set return criteria, in format:

    • shpCharset: the charset to use when reading a shapefile. Example: 'shpCharset', 'utf-8'
    • shpCrs: the coordinate reference system to use when reading a shapefile. Example: 'shpCrs', 'epsg:4326'
    • remoteDataSourceLocation: the path to the directory or archive that contains the data source (required only if you are storing and distributing data remotely on HDFS or S3). Example: 'remoteDataSourceLocation', 'hdfs:///data/mydata.zip'
    • downloadLocation: the local file system location to which resources get downloaded (required only if you are storing and distributing data remotely on HDFS or S3). Note: If you are also using Spectrum™ Geocoding for Big Data and have already set the pb.download.location Hive variable, then you do not need to set this option here as well. Example: 'downloadLocation', '/pb/downloads'
    • downloadGroup: the operating system group which should be applied to downloaded data on a local file system; the default is the value from the Hive property pb.download.group (required only if you are storing and distributing data remotely on HDFS or S3). Example: 'downloadGroup', 'pbdownloads'

    For more information, see Download Permissions on page 86.

    Example (using HDFS)

    SELECT pip_points.id, pipresult.capital, pipresult.state
    FROM pip_points
    LATERAL VIEW LocalPointInPolygon(FromWKT(pip_points.geometry, pip_points.crs),
      '/STATECAP.TAB',
      map('remoteDataSourceLocation', 'hdfs:///data/pip/capitals.zip',
          'downloadLocation', '/pb/pip/download',
          'downloadGroup', 'pbdownloads')) pipresult

    In the above example, id is a field from the pip_points table, which is the table being used to get the points we are searching from. The pipresult.capital and pipresult.state fields are from the STATECAP TAB file that we want in our query result.

    Tip: To improve performance when searching TAB files, consider creating PGD (prepared geometry) index files. For more information, see PGD Builder on page 85.

    LocalSearchNearest

    The LocalSearchNearest UDTF function returns the nearest geometry or geometries contained in the specified TAB or shapefile to an input point.

    Syntax: LocalSearchNearest(inputPoint, dataSourcePath, map(options))

    Where:

    • inputPoint is a WritableGeometry representing the point to search near


    • dataSourcePath is the location of the input TAB or shapefile. The path can be either a relative path based on the remote resource or a local path to a file that must be available on the master node and every data node.

    Note: If you are storing and distributing your data remotely using HDFS or S3, you must set the option for remoteDataSourceLocation and also specify the download location as described in the table below.

    • options allow you to optionally return more than one value, return additional information, or set other return criteria, in format:

    • maxCandidates: the maximum number of results to return (if not set, the default value is 1). Example: 'maxCandidates', '3'
    • maxDistance: the maximum distance to search for results (if not set, the default value is no limit). Example: 'maxDistance', '25'
    • distanceUnit: the distance unit (if not set, the default value is m for meters). See the Distance function on page 23 for examples of supported distance units. Example: 'distanceUnit', 'mi'
    • returnDistanceColumnName: the name of the column to use for returning the distance. Example: 'returnDistanceColumnName', 'Miles'
    • shpCharset: the charset to use when reading a shapefile. Example: 'shpCharset', 'utf-8'
    • shpCrs: the coordinate reference system to use when reading a shapefile. Example: 'shpCrs', 'epsg:4326'
    • remoteDataSourceLocation: the path to the directory or archive that contains the data source (required only if you are storing and distributing data remotely on HDFS or S3). Example: 'remoteDataSourceLocation', 'hdfs:///data/mydata.zip'
    • downloadLocation: the local file system location to which resources get downloaded (required only if you are storing and distributing data remotely on HDFS or S3). Note: If you are also using Spectrum™ Geocoding for Big Data and have already set the pb.download.location Hive variable, then you do not need to set this option here as well. Example: 'downloadLocation', '/pb/downloads'
    • downloadGroup: the operating system group which should be applied to downloaded data on a local file system; the default is the value from the Hive property pb.download.group (required only if you are storing and distributing data remotely on HDFS or S3). Example: 'downloadGroup', 'pbdownloads'

    For more information, see Download Permissions on page 86.

    Example (using HDFS)

    SELECT search_points.id, nearestresult.capital, nearestresult.state
    FROM search_points
    LATERAL VIEW OUTER LocalSearchNearest(FromWKT(search_points.geometry, search_points.crs),
      '/STATECAP.TAB',
      map('maxCandidates', '3',
          'remoteDataSourceLocation', 'hdfs:///data/search/capitals.zip',
          'downloadLocation', '/pb/search/download',
          'downloadGroup', 'pbdownloads')) nearestresult

    In the above example, id is a field from the search_points table, which is the table being used to get the points we are searching from. The nearestresult.capital and nearestresult.state fields are from the STATECAP TAB file that we want in our query result. In this particular example, the maxCandidates option limits the results to 3 records for each search point.

    Tip: To improve performance when searching TAB files, consider creating PGD (prepared geometry) index files. For more information, see PGD Builder on page 85.


    MapReduce Jobs

    MapReduce jobs are provided with Spectrum Location Intelligence for Big Data to process or produce large sets of data.

    • Polygon Filter
    • Hexagon Generator

    Polygon Filter

    The Polygon Filter is a MapReduce job that accepts a comma or tab-delimited file containing points (longitude/latitude) and attribute data, and matches them to a given boundary (polygon). This preprocessing operation determines whether or not your data resides inside (or outside) the polygon. The records that match the criteria are returned.

    To filter data with the polygon filter:

    1. Deploy the spectrum-bigdata-spatial-li-mapreduce-filter-version.jar on a Hadoop cluster on which input must be available.

    /dir/on/server/spectrum-bigdata-spatial-li-mapreduce-filter-version.jar

    2. Copy input data and boundary file to HDFS using the following commands:

    hadoop fs -copyFromLocal /dir/on/server/myinput.txt /dir/on/hdfs/input
    hadoop fs -copyFromLocal /dir/on/server/boundary.wkt /dir/on/hdfs/wkt

    3. Start Hadoop job using the following command:

    hadoop jar /dir/on/server/spectrum-bigdata-spatial-li-mapreduce-filter-version.jar
    com.pb.hadoop.mapreduce.filter.PolygonFilterDriver
    -input /dir/on/hdfs/input -output /dir/on/hdfs/output
    -boundary /dir/on/hdfs/wkt/boundary.wkt
    -longitudeColumn 0 -latitudeColumn 1
    -delimiter "\t" -quote " -escape "\\" -overwrite -contains true

    where:

    • -input The HDFS path to the input directory
    • -output The HDFS path to the output directory
    • -boundary The HDFS path to the boundary file
    • -longitudeColumn The 0-based index of the longitude column
    • -latitudeColumn The 0-based index of the latitude column

    Optional parameters:


    To control whether the data is evaluated against the inside or outside of the polygon, include the optional -contains parameter. If true, points within the boundary are included in the output. If false, points outside the boundary are included. If not specified, true is assumed.

    Polygon Filter supports tab- and comma-delimited input files. The default is tab. Include as appropriate:

    • For files with tabs: -delimiter "\t" • For files with commas: -delimiter ","

    It supports configurable quote characters used in the input files. The default is double quotes. Include as appropriate:

    • For files with a quote character as double quotes: -quote "”””" • For files with a quote character as a grave accent: -quote "`"

    It also supports configurable escape characters used in input files. If you do not specify the escape character in the input, it is configured as no escape. Include as appropriate:

    • For files with an escape character as backslash: -escape "\\"

    A file is returned containing every line from the input file for which the point is either inside or outside the boundary, depending on the -contains parameter. This file can now be the input to a Hive query for aggregating the data by hexagon.
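
    As an illustration of that follow-on step (not from the original guide), the filter output could be exposed to Hive as an external table and aggregated with a grid UDF such as HexagonID; the id, longitude, and latitude columns, the tab delimiter, and the output path are assumptions:

    CREATE EXTERNAL TABLE filtered_points (id STRING, longitude DOUBLE, latitude DOUBLE)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/dir/on/hdfs/output';

    SELECT c.hexID, count(*) as quantity
    FROM (SELECT HexagonID(longitude, latitude, 9) as hexID FROM filtered_points) c
    GROUP BY c.hexID;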


    Hexagon Generator

    This MapReduce job generates the hexagons within a bounding box (for example, the bounding box of the continental USA). Hexagon output can be used for map display.

    To create hexagons for a given bounding box:

    1. Deploy the jar file and configuration to the Hadoop cluster.

    /dir/on/server/spectrum-bigdata-li-mapreduce-hexgen-version.jar

    2. Modify the configuration according to the hexagons to be generated. Change the bounding box coordinates and hexagon level to suit your needs. Refer to Hexagons to learn about hexagon levels.

    • MinLongitude (example: -73.728200): the bottom left longitude of the bounding box.
    • MinLatitude (example: 40.979800): the bottom left latitude of the bounding box.
    • MaxLongitude (example: -71.787480): the upper right longitude of the bounding box.
    • MaxLatitude (example: 42.050496): the upper right latitude of the bounding box.
    • HexLevel (example: 9): the level to generate hexagons for. Must be between 1 and 11.
    • ContainerLevel (example: 2): a hint for providing some parallel hexagon generation. Must be less than the HexLevel property.

    3. Start the Hadoop job using the following command:

    Usage:

    hadoop jar spectrum-bigdata-li-mapreduce-hexgen-version.jar
    com.pb.bigdata.spatial.hex.mapreduce.HexGenDriver
    -conf /dir/on/server/config.xml
    -output /dir/on/hdfs/output


    The output of the Hexagon Generator is a list of Well Known Text (WKT) that represents the hexagons. Refer to Consuming Results for more information on how to use the output.

    Sample Output


    Spark Jobs

    Spark jobs are provided with Spectrum Location Intelligence and Spectrum Geocoding for Big Data to process large sets of data.

    • Polygon Filter
    • Hexagon Generator

    Polygon Filter

    The Polygon Filter is a Spark job that accepts a comma or tab-delimited file containing points (longitude/latitude) and attribute data, and matches them to a given boundary (polygon). This preprocessing operation determines whether or not the data resides inside (or outside) the polygon. The records that match the criteria are returned. Next, the data can be processed to assign geohashes or hexagons to each location and aggregate the data. The Polygon Filter is also useful for testing with a small subset of your data. This topic assumes the product is installed to /pb/li/sdk as described in Installing the SDK on page 8.

    To filter data with the polygon filter:

    1. Copy the data to /pb/temp/data.

    2. Copy the jar file to the Hadoop cluster.

    copy li-distrib/spark/filter/lib/spectrum-bigdata-li-spark1-filter-version.jar to /pb/li/sdk

    3. Create directories for the input data and boundary file, for example:

    hdfs dfs -mkdir -p hdfs:///pb/li/data/input
    hdfs dfs -mkdir -p hdfs:///pb/li/data/wkt

    4. Copy the input data and boundary file to HDFS, for example:

    hadoop fs -copyFromLocal /pb/temp/data/311data.txt hdfs:///pb/li/data/input
    hadoop fs -copyFromLocal /pb/temp/data/manhattan.wkt hdfs:///pb/li/data/wkt

    5. Start the Spark job using the following command, for example:

    spark-submit
    --class com.pb.bigdata.spatial.filter.spark.app.PolygonFilterDriver
    --master yarn --deploy-mode cluster
    /pb/li/sdk/spectrum-bigdata-li-spark1-filter-version.jar
    -input hdfs:///pb/li/data/input/311data.txt
    -boundary hdfs:///pb/li/data/wkt/manhattan.wkt
    -output /user/pbuser/filter/output -delimiter "\t"
    -longitudeColumn 1 -latitudeColumn 2 -overwrite -contains true


    where:

    • --input The path to the input directory
    • --boundary The path to the boundary file
    • --output The path to the output directory
    • --longitudeColumn The 0-based index of the longitude column
    • --latitudeColumn The 0-based index of the latitude column

    Optional parameters:

    To control whether the job overwrites the output directory, include the --overwrite parameter. Otherwise the job will fail if this directory already has content. This parameter does not have a value.

    To control whether the data is evaluated against the inside or outside of the polygon, include the optional -contains parameter. If true, points within the boundary are included in the output. If false, points outside the boundary are included. If not specified, true is assumed.

    Polygon Filter supports tab- and comma-delimited input files. The default is tab. Include as appropriate:

    • For files with tabs: -delimiter "\t" • For files with commas: -delimiter ","

    It supports configurable quote characters used in the input files. The default is double quotes. Include as appropriate:

    • For files with a quote character as double quotes: -quote "”””" • For files with a quote character as a grave accent: -quote "`"

    It also supports configurable escape characters used in input files. If you do not specify the escape character in the input, it is configured as no escape. Include as appropriate:

    • For files with an escape character as backslash: -escape "\\"

The job returns files containing every line from the input file whose point is inside or outside the boundary, depending on the -contains parameter. This output can then be used as the input to another Spark job, such as the Geohash aggregation sample.
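Conceptually, the job applies a point-in-polygon test to every record and keeps or drops it according to the -contains setting. The sketch below is a minimal, stand-alone illustration of that test in plain Java, using an invented boundary and record layout; the actual job reads the boundary from the WKT file and performs the test with the SDK's geometry operations.

    // Minimal point-in-polygon sketch (ray casting). Illustrative only; the
    // Polygon Filter job itself uses the SDK's geometry operations, not this code.
    public class PointInPolygonSketch {

        // Returns true if the point (x, y) falls inside the simple polygon
        // described by the vertex arrays px/py (no holes, not self-intersecting).
        static boolean contains(double[] px, double[] py, double x, double y) {
            boolean inside = false;
            for (int i = 0, j = px.length - 1; i < px.length; j = i++) {
                boolean crosses = (py[i] > y) != (py[j] > y)
                        && x < (px[j] - px[i]) * (y - py[i]) / (py[j] - py[i]) + px[i];
                if (crosses) {
                    inside = !inside;
                }
            }
            return inside;
        }

        public static void main(String[] args) {
            // Hypothetical rectangular boundary around lower Manhattan (longitude/latitude).
            double[] lon = {-74.03, -73.96, -73.96, -74.03};
            double[] lat = {40.69, 40.69, 40.75, 40.75};

            // A tab-delimited record with longitude in column 1 and latitude in column 2,
            // mirroring -longitudeColumn 1 -latitudeColumn 2 in the example above.
            String record = "311-complaint\t-74.005\t40.712\tNoise";
            String[] cols = record.split("\t");
            double x = Double.parseDouble(cols[1]);
            double y = Double.parseDouble(cols[2]);

            // With -contains true, only records inside the boundary are kept.
            System.out.println(contains(lon, lat, x, y) ? "keep" : "drop");
        }
    }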


    Hexagon Generator

    This Spark job generates the hexagons within a bounding box (for example, the bounding box of the continental USA). Hexagon output can be used for map display.

    To create hexagons for a given bounding box:

    1. Modify the configuration according to the hexagons to be generated. Change the bounding box coordinates and hexagon level to suit your needs. See Hexagons to learn about hexagon levels.

2. Deploy the jar and configuration to the Hadoop cluster.
3. Start the Spark job using the following command:

spark-submit --class com.pb.bigdata.spatial.hex.spark.app.HexGenDriver
    --master yarn --deploy-mode cluster --name
    /dir/on/server/spectrum-bigdata-li-spark1-hexgen-version.jar
    -output /dir/on/hdfs/output -conf -overwrite

The output of the Hexagon Generator is a list of WKT strings that represent the hexagons. See Consuming Results for how to use the output.

    Sample Output


    Hexagons

A hexagon is an effective way to represent data related to circular wave propagation, such as cell tower strength or noise pollution. Because hexagons closely approximate circles, they capture edge data better than rectangles, and they also tile a space with no gaps.

Spectrum™ Location Intelligence for Big Data provides an API for assigning locations to hexagons and aggregating the data in the hexagons for further analysis. The com.pb.hadoop.core.hex package contains classes for working with hexagons and retrieving information about them. Refer to Geohash Aggregation on page 79 for details about using the API. Javadocs are located in the /core folder of the zip file.

    The API provides an interface that assigns a hexagon and ID to each location and that ID is used to aggregate the data associated with the hexagon.

    One important hexagon parameter is the hexagon level. This, along with the longitude and latitude of a record, is used to get the hexagon or its ID for the location.


    The hexagon level refers to a hierarchy of hexagons that divide the earth's surface. Level 1 refers to the whole earth. Subsequent levels divide the previous level evenly into smaller units. The smaller the number, the higher the level and the larger the hexagon size. These hexagons form a fixed network with each hexagon having a specific unique identifier (ID).

Spectrum™ Location Intelligence for Big Data supports levels 1 through 11, with level 9 as the default. Level 9 consists of hexagons with an edge distance of approximately 56 meters at the equator. For a given level, the same longitude/latitude always produces the same hexagon with the same unique ID.
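The exact hierarchy and ID scheme used by the SDK are internal to the com.pb.hadoop.core.hex package; the sketch below only illustrates the general idea that a longitude/latitude pair maps deterministically to a hexagon cell whose ID can serve as an aggregation key. It uses a simple flat axial hexagon grid with an assumed cell size, not the SDK's levels.

    // Illustrative sketch only: assigns a longitude/latitude to a cell in a flat
    // axial hexagon grid and derives a repeatable ID for aggregation. The SDK's
    // actual hexagon hierarchy, levels, and ID scheme are provided by the
    // com.pb.hadoop.core.hex package and are not reproduced here.
    public class HexAssignSketch {

        // Edge length of a hexagon in degrees; a stand-in for the SDK's hexagon level.
        static final double SIZE = 0.0005;

        // Converts a point to axial hexagon coordinates (pointy-top layout).
        static long[] toHex(double lon, double lat) {
            double q = (Math.sqrt(3.0) / 3.0 * lon - lat / 3.0) / SIZE;
            double r = (2.0 / 3.0 * lat) / SIZE;
            return roundHex(q, r);
        }

        // Rounds fractional axial coordinates to the nearest hexagon center.
        static long[] roundHex(double q, double r) {
            double x = q, z = r, y = -x - z;
            long rx = Math.round(x), ry = Math.round(y), rz = Math.round(z);
            double dx = Math.abs(rx - x), dy = Math.abs(ry - y), dz = Math.abs(rz - z);
            if (dx > dy && dx > dz) {
                rx = -ry - rz;
            } else if (dy > dz) {
                ry = -rx - rz;
            } else {
                rz = -rx - ry;
            }
            return new long[] {rx, rz};
        }

        public static void main(String[] args) {
            // The same longitude/latitude always maps to the same cell ID,
            // which is what makes the ID usable as an aggregation key.
            long[] cell = toHex(-73.98, 40.75);
            String id = cell[0] + ":" + cell[1];
            System.out.println("hexagon id = " + id);
        }
    }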


  • 3 - Samples

    In this section

Risk Assessment 50
Fire Protection Assessment 72
Geohash Aggregation 79
Consuming Results 82

  • Samples

    Risk Assessment

A property and casualty insurer gains a competitive edge when it can effectively understand the risk in its book of business. To characterize the risk associated with a property, the insurer needs to associate many different kinds of risk factors with each property in its book of business. To understand overall portfolio risk, these individual risks are aggregated over various geographical regions to summarize total portfolio risk. These aggregated views help drive decisions about managing exposure to risk and optimizing productivity when writing insurance policies.

    The Risk Assessment sample application showcases the capabilities of the Pitney Bowes Spectrum™ Location Intelligence for Big Data with Apache Spark and MapReduce.


    Fire and Coastal Risk Determination

The Risk Assessment sample covers two types of risk: Fire and Coastal. The score for each risk is determined by running Point in Polygon and Search Nearest geometry operations against datasets specific to that risk.

    Fire Risk:

The Fire Risk is determined using the Fire Risk Pro dataset. The geocoded location is used to search for a record in the Fire Risk Pro dataset using a Point in Polygon search. The Fire Risk Pro table contains a RISKDESC field with the following values, each associated with a numeric risk score:

Numeric Value of Risk    Description
10                       Very High
7                        High
5                        Moderate
3                        Low
2                        Smoke Risk
0                        No Record Found

    Coastal Risk:

    The coastal risk is based on the distance from the geocoded location to the shoreline found in the US Shoreline table. The geocoded location is used to find the nearest shoreline record which will contain a line type geometry. The minimum distance between the geocoded location and the shoreline geometry is used to calculate a risk score as follows:

Numeric Value of Risk    Description
10                       100 ft or less
7                        250 ft or less (but greater than 100 ft)
5                        1000 ft or less (but greater than 250 ft)
3                        1 mile or less (but greater than 1000 ft)
2                        2 miles or less (but greater than 1 mile)
1                        5 miles or less (but greater than 2 miles)
0                        greater than 5 miles

Total Risk: Total risk is the sum of the Fire Risk and Coastal Risk scores.
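The two tables above translate directly into score lookups. The sketch below shows one way to express them in plain Java and to sum the two scores into a total risk value; the method names and the assumption that shoreline distance is available in feet are for illustration only.

    // Sketch of the score mappings from the tables above. The actual sample
    // applications compute these values with the SDK's spatial operations; this
    // code only illustrates how the lookup results translate into scores.
    public class RiskScoreSketch {

        static final double FEET_PER_MILE = 5280.0;

        // Maps the RISKDESC value from the Fire Risk Pro search to a numeric score.
        static int fireRiskScore(String riskDesc) {
            if (riskDesc == null) return 0;          // no record found
            switch (riskDesc) {
                case "Very High":  return 10;
                case "High":       return 7;
                case "Moderate":   return 5;
                case "Low":        return 3;
                case "Smoke Risk": return 2;
                default:           return 0;
            }
        }

        // Maps the distance (in feet) to the nearest shoreline to a coastal score.
        static int coastalRiskScore(double distanceInFeet) {
            if (distanceInFeet <= 100) return 10;
            if (distanceInFeet <= 250) return 7;
            if (distanceInFeet <= 1000) return 5;
            if (distanceInFeet <= FEET_PER_MILE) return 3;
            if (distanceInFeet <= 2 * FEET_PER_MILE) return 2;
            if (distanceInFeet <= 5 * FEET_PER_MILE) return 1;
            return 0;
        }

        public static void main(String[] args) {
            int fire = fireRiskScore("High");          // hypothetical lookup result
            int coastal = coastalRiskScore(800.0);     // hypothetical distance in feet
            System.out.println("total risk = " + (fire + coastal));
        }
    }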

The data used to determine the Fire and Coastal Risk scores is sample data from the Pitney Bowes Risk Data Suite, a set of MapInfo TAB files against which the geometry operations are performed. The way this data is used in the sample is for demonstration purposes only. More details about this product can be found at https://www.pitneybowes.com/us/data/boundary-data.html


    Application Flow

This application processes an entire input (book of business) to assign a geocoded location to each property, assign a risk score to each property, and then aggregate the individual property risks by geographical region.

    This application includes these processing stages:

• Geocode
• Boundary Risk Determination
• Shoreline Risk Determination
• Join
• Aggregate

Spatial operations require longitude and latitude columns. If the input contains address columns, the Geocode stage geocodes the input records to obtain longitude and latitude from those columns. The Geocode stage is not required if the input already contains longitude and latitude columns.

    Boundary Risk Determination uses the Point In Polygon spatial operation to verify whether or not the input record location is inside a risk boundary. If it is inside the risk boundary, then this process assigns the risk score of the boundary to the input record.

Shoreline Risk Determination performs the Nearest Search spatial operation to assign a risk score based on the distance between the input record and the shoreline boundary.

You can run Boundary Risk Determination and Shoreline Risk Determination in any order; they do not depend on other stages of the application. Because they are spatial operations, input records must have longitude and latitude columns.

The Join stage performs a join operation on two inputs: a left input and a right input.

The output of the Boundary Risk Determination, Shoreline Risk Determination, and Join stages can be used as input to any other stage of the application.

For example, use Boundary Risk Determination with fire risk boundaries and Shoreline Risk Determination with flood risk boundaries to get fire and flood risk scores, then use the Join stage to join these two risk scores into one record. You could then use the output of the Join stage as input to another Boundary Risk Determination stage with crime data.

The Aggregate stage aggregates records using the provided group-by column, aggregate column, and risk score columns.
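As a rough illustration of the Aggregate stage, the sketch below groups hypothetical records by a region column and sums a risk score column; the column positions and values are invented, since the sample reads them from its configuration.

    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch of the Aggregate stage's behavior: group records by a region column
    // and sum a risk score column. Column positions are assumptions for
    // illustration; the sample application reads them from its configuration.
    public class AggregateSketch {

        public static void main(String[] args) {
            // Hypothetical records: region (column 0) and total risk score (column 1).
            List<String[]> records = new ArrayList<>();
            records.add(new String[] {"06073", "12"});
            records.add(new String[] {"06073", "7"});
            records.add(new String[] {"06075", "3"});

            // Sum the risk score per region.
            Map<String, Integer> totals = new LinkedHashMap<>();
            for (String[] record : records) {
                String region = record[0];
                int score = Integer.parseInt(record[1]);
                totals.merge(region, score, Integer::sum);
            }

            totals.forEach((region, total) ->
                    System.out.println(region + "\t" + total));
        }
    }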

    The Application Flow is shown in the figure below:


    Sample data

Sample data for running the risk assessment application is available in the product distribution bundle in the data\riskAssessment\ directory, which has a TAB directory containing risk boundaries and a BOB.txt file containing input records.

    Geocode

Spatial operations require longitude and latitude columns. If the input contains address columns, the Geocode stage geocodes the input records to obtain longitude and latitude from those columns. The Geocode stage is not required if the input already contains longitude and latitude columns.

Steps to Execute Spark Job

    These steps assume you have installed Spectrum™ Geocoding for Big Data as outlined in the Spectrum™ Location Intelligence for Big Data Geocoding Install Guide on the Spectrum Spatial for Big Data documentation landing page.

1. Copy the input data to a Hadoop cluster. This data is available in the data\riskAssessment directory of the product distribution bundle.

    2. Copy the input data to HDFS using the following commands:

hadoop fs -copyFromLocal /PB/spectrum-bigdata-samples/data/riskAssessment/BOB.txt /dir/on/hdfs/input

    3. Start the Spark job using the command appropriate for your version of Spark:

    Spark 1.0

spark-submit --class com.pb.bigdata.geocoding.spark.app.GeocodeDriver
    --master yarn --deploy-mode cluster
    /pb/geocoding/sdk/spark1/driver/spectrum-bigdata-geocoding-spark1drivers-version-all.jar
    --input /dir/on/hdfs/input/BOB.txt
    --output /dir/on/hdfs/output
    --geocoding-output-fields x y
    --geocoding-config-location hdfs:///pb/geocoding/sdk/resources/config/
    --geocoding-binaries-location hdfs:///pb/geocoding/sdk/resources/nativeLibraries/bin/linux64/
    --download-location /pb/downloads
    --geocoding-preferences-filepath hdfs:///pb/geocoding/sdk/resources/config/geocodePreferences.xml
    --geocoding-input-fields streetName=0 areaName3=1 areaName1=2 postCode1=3
    --geocoding-country USA --num-partitions=15

    Spark 2.0

spark-submit --class com.pb.bigdata.geocoding.spark.app.GeocodeDriver
    --master yarn --deploy-mode cluster
    /pb/geocoding/sdk/spark2/driver/spectrum-bigdata-geocoding-spark2drivers-version-all.jar
    --input /user/pbuser/customers/addresses.csv
    --output /user/pbuser/customers_geocoded
    --geocoding-output-fields x y
    --geocoding-config-location hdfs:///pb/geocoding/sdk/resources/config/
    --geocoding-binaries-location hdfs:///pb/geocoding/sdk/resources/nativeLibraries/bin/linux64/
    --download-location /pb/downloads
    --geocoding-preferences-filepath hdfs:///pb/geocoding/sdk/resources/config/geocodePreferences.xml
    --geocoding-input-fields streetName=0 areaName3=1 areaName1=2 postCode1=3
    --geocoding-country USA --num-partitions=15

    The longitude and latitude columns are appended to the input record as output.

    Boundary Risk Determination

After the book of business is geocoded, the application assesses the risk for each property by reading the file generated in the first step (containing the geocoded book of business) and performing a risk determination for wildfire damage based on the location of the property. For more information, refer to Risk Determination.

This step generates an output file containing the input book of business (addresses and insured value), the geocoded location, and a fire risk score; records that are not successfully processed are saved in the failures directory.

Steps to Execute MapReduce Job


    1. Deploy spectrum-bigdata-samples-riskassessment-mapreduce-version.jar and input data to the Hadoop cluster.

    2. Upload boundary risk data (.TAB format) to HDFS.

hadoop fs -copyFromLocal /PB/spectrum-bigdata-samples/data/fireProtectionAssessment/FireRiskPro/FireRiskPro_FIPS_06073.* /dir/on/hdfs/referenceData

Note: The LI SDK API needs the TAB data on the local file system in order to create a native table from it. The MapReduce job downloads the data to the local file system at run time.

3. Copy the input data to the Hadoop cluster. This data is available in the sampleData\riskAssessment directory of the product distribution bundle.

    4. Copy input data to HDFS using the following commands:

hadoop fs -copyFromLocal /PB/spectrum-bigdata-samples/data/riskAssessment/BOB.txt /dir/on/hdfs/input

    5. Start Hadoop job using the following command:

hadoop jar spectrum-bigdata-samples-riskassessment-mapreduce-version.jar
    com.pb.bigdata.spatial.mapreduce.app.BoundaryMRDriver
    -input /dir/on/hdfs/input
    -output /dir/on/hdfs/output
    -config /PB/spectrum-bigdata-samples/riskAssessment/mapReduce/data/config.xml
    -overwrite

    where:

• -input The HDFS path to the input directory
• -output The HDFS path to the output directory
• -config The configuration file path on the server

    Optional parameters:

    The Boundary Driver also supports -overwrite to overwrite the old output folder on HDFS.

    The Boundary Driver supports tab- and comma-delimited input files. The default is tab. Include as appropriate:

• For files with tabs: -delimiter "\t"
• For files with commas: -delimiter ","

The quote character used in the input files is configurable. The default is a double quote. Include as appropriate:

• For files with a double quote as the quote character: -quote "\""
• For files with a grave accent as the quote character: -quote "`"

It also supports a configurable escape character, used to escape the quote character in the input files. If you do not specify an escape character, no escape character is used. Include as appropriate:

    • For files with an escape character as backslash: -escape "\\"


    Steps to Execute Spark Job

    1. Deploy spectrum-bigdata-samples-riskassessment-spark-version.jar and input data to the Hadoop cluster.

    2. Upload boundary risk data (.TAB format) to HDFS.

hadoop fs -copyFromLocal /PB/spectrum-bigdata-samples/data/fireProtectionAssessment/FireRiskPro/FireRiskPro_FIPS_06073.* /dir/on/hdfs/referenceData

3. Copy the input data to the Hadoop cluster. This data is available in the sampleData\riskAssessment directory of the product distribution bundle.

    4. Copy the input data to HDFS using the following commands:

hadoop fs -copyFromLocal /PB/spectrum-bigdata-samples/data/riskAssessment/BOB.txt /dir/on/hdfs/input

    5. Start Spark job using the following command:

    export SPARK_MAJOR_VERSION=2

spark-submit --class com.pb.bigdata.spatial.sample.riskassessment.spark.app.BoundaryDriver
    --master yarn --deploy-mode client
    /home/pbuser/riskAssessment/spectrum-bigdata-samples-riskassessment-spark-version.jar
    -conf /home/pbuser/riskAssessment/config_RiskAssesment.xml
    -input /user/pbuser/boundaryDriver/input/BOB.txt
    -output /user/pbuser/boundaryDriver/output
    -overwrite -failures /user/pbuser/boundaryDriver/failures

    where:

• -input The HDFS path to the input directory
• -output The HDFS path to the output directory
• -conf The configuration file path on the server
• -master The master node's address

    Optional parameters:

The Boundary Driver also supports -overwrite to overwrite the old output folder on HDFS and -failures to specify the failures directory on HDFS. Additionally, --name can be passed as an optional parameter to set the application name.

    The Boundary Driver supports tab- and comma-delimited input files. The default is tab. Include as appropriate:

• For files with tabs: -delimiter "\t"
• For files with commas: -delimiter ","

The quote character used in the input files is configurable. The default is a double quote. Include as appropriate:

• For files with a double quote as the quote character: -quote "\""
• For files with a grave accent as the quote character: -quote "`"

It also supports a configurable escape character, used to escape the quote character in the input files. If you do not specify an escape character, no escape character is used. Include as appropriate:

    • For files with an escape character as backslash: -escape "\\"

The boundary risk score is appended to the input record as output.

Configuration

This XML file contains the configuration required for the Spark and MapReduce jobs. Each property is listed below with its sample value and description:

• dataNodeTabPath (/PB/spectrum-bigdata-geocoding/TAB/): Location of TAB files to download from HDFS.
• fireRiskBoundaryTabPath (/PB/spectrum-bigdata-geocoding/TAB/FireRiskPro/FireRiskPro_FIPS_06073.TAB): Location of the fire risk boundary data in .TAB format.
• riskDescColName (RISKDESC): Name of the risk description column in the risk boundary data.
• delimiter (\t): The delimiter to use for parsing the input file.
• quote ("): The quote character used in the input file.
• escape (\): The escape character used for escaping the quote character in the input file.
• numOfColumns (12): The number of columns in an input record.
• coordinateSystem (EPSG:4326): The coordinate system used to create a point geometry from the longitude and latitude in the input record.
• longitudeColumn (8): The 0-based index of the longitude column.
• latitudeColumn (7): The 0-based index of the latitude column.
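To make the column-related settings above concrete, the sketch below parses one hypothetical input record under this configuration (tab delimiter, 12 columns, longitude in column 8 and latitude in column 7) and extracts the coordinates used by the spatial lookups; the record content itself is invented.

    // Sketch of how an input record is interpreted under the configuration above:
    // tab-delimited, 12 columns, longitude in column 8 and latitude in column 7
    // (0-based). The record content is hypothetical.
    public class RecordParseSketch {

        public static void main(String[] args) {
            String record = String.join("\t",
                    "policy-001", "123 Main St", "San Diego", "CA", "92101",
                    "500000", "dwelling", "32.7157", "-117.1611", "extra1",
                    "extra2", "extra3");

            String[] columns = record.split("\t", -1);   // keep empty trailing columns
            if (columns.length != 12) {
                throw new IllegalArgumentException("expected 12 columns, got " + columns.length);
            }

            double latitude = Double.parseDouble(columns[7]);
            double longitude = Double.parseDouble(columns[8]);

            // The point built here (in EPSG:4326) is what the boundary and
            // shoreline lookups operate on.
            System.out.println("point = (" + longitude + ", " + latitude + ")");
        }
    }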


    Shoreline Risk Determination

    After the fire risk has been calculated, the application assesses risk associated with storm surge or coastal flooding based on the property’s distance to the shoreline. For more information, refer to Risk Determination.

This step outputs a file containing the input book of business (addresses and insured value), the geocoded location, and a flood risk score; records that are not successfully processed are saved in the failures directory.

Steps to Execute MapReduce Job

    1. Deploy spectrum-bigdata-samples-riskassessment-mapreduce-version.jar and input data to the Hadoop cluster.

    2. Upload boundary risk data (.TAB format) to HDFS.

hadoop fs -copyFromLocal /PB/spectrum-bigdata-samples/data/riskAssessment/TAB/ShorelinePlus/Shoreline_Plus_FIPS_06073.* /dir/on/hdfs/referenceData

Note: The LI SDK API needs the TAB data on the local file system in order to create a native table from it. The MapReduce job downloads the data to the local file system at run time.

3. Copy the input data to the Hadoop cluster. This data is available in the sampleData\riskAssessment directory of the product distribution bundle.

    4. Copy input data to HDFS using the following commands:

hadoop fs -copyFromLocal /PB/spectrum-bigdata-samples/data/riskAssessment/BOB.txt /dir/on/hdfs/input

    5. Start a Hadoop job using the following command:

hadoop jar spectrum-bigdata-samples-riskassessment-mapreduce-version.jar
    com.pb.bigdata.spatial.mapreduce.app.ShorelineMRDriver
    -input /dir/on/hdfs/input
    -output /dir/on/hdfs/output
    -config /PB/spectrum-bigdata-samples/riskAssessment/mapReduce/data/config.xml
    -overwrite

    where:

• -input The HDFS path to the input directory
• -output The HDFS path to the output directory
• -config The configuration file path on the server

    Optional Parameters:

    The Shoreline Driver also supports -overwrite to overwrite the old output folder on HDFS.

    The Shoreline Driver supports tab- and comma-delimited input files. The default is tab. Include as appropriate:

• For files with tabs: -delimiter "\t"
• For files with commas: -delimiter ","

The quote character used in the input files is configurable. The default is a double quote. Include as appropriate:

• For files with a double quote as the quote character: -quote "\""
• For files with a grave accent as the quote character: -quote "`"

It also supports a configurable escape character, used to escape the quote character in the input files. If you do not specify an escape character, no escape character is used. Include as appropriate:

    • For files with an escape character as backslash: -escape "\\"

    Steps to Execute Spark Job

    1. Deploy spectrum-bigdata-samples-riskassessment-spark-version.jar and input data to the Hadoop cluster.

    2. Upload boundary risk data (.TAB format) to HDFS.

hadoop fs -copyFromLocal /PB/spectrum-bigdata-samples/data/riskAssessment/TAB/ShorelinePlus/Shoreline_Plus_FIPS_06073.* /dir/on/hdfs/referenceData

3. Copy the input data to the Hadoop cluster. This data is available in the sampleData\riskAssessment directory of the product distribution bundle.

    4. Copy the input data to HDFS using the following commands:

hadoop fs -copyFromLocal /PB/spectrum-bigdata-samples/data/riskAssessment/BOB.txt /dir/on/hdfs/input

    5. Start the Spark job using the following command:

spark-submit --class com.pb.bigdata.spatial.spark.app.ShorelineDriver
    --master local[*] --deploy-mode client
    /home/pbuser/riskAssessment/spectrum-bigdata-samples-riskassessment-spark-version.jar
    -config /PB/spectrum-bigdata-samples/riskAssessment/spark/data/config.xml
    -input /dir/on/hdfs/input
    -output /dir/on/hdfs/output
    -overwrite -failures /dir/on/hdfs/failures

    where:

• -input The HDFS path to the input directory
• -output The HDFS path to the output directory
• -config The configuration file path on the server
• -master The master node's address

    Optional Parameters:


The Shoreline Driver also supports -overwrite to overwrite the old output folder on HDFS and -failures to specify the failures directory on HDFS. Additionally, --name can be passed as an optional parameter to set the application name.

    The Shoreline Driver supports tab- and comma-delimited input files. The default is tab. Include as appropriate:

• For files with tabs: -delimiter "\t"
• For files with commas: -delimiter ","

The quote character used in the input files is configurable. The default is a double quote. Include as appropriate:

• For files with a double quote as the quote character: -quote "\""
• For files with a grave accent as the quote character: -quote "`"

It also supports a configurable escape character, used to escape the quote character in the input files. If you do not specify an escape character, no escape character is used. Include as appropriate:

    • For files with an escape character as backslash: -escape "\\"

The shoreline risk score is appended to the input record as output.

Configuration

This XML file contains the configuration required for the Spark and MapReduce jobs. Each property is listed below with its sample value and description:

• dataNodeTabPath (/PB/spectrum-bigdata-geocoding/TAB/): Location of TAB files to download from HDFS.
• shorelineBoundaryTabPath (/PB/spectrum-bigdata-geocoding/TAB/ShorelinePlus/Shoreline_Plus_FIPS_06073.TAB): Location of the shoreline risk boundary data in .TAB format.
• riskDescColName (RISKDESC): Name of the risk description column in the risk boundary data.
• delimiter (\t): The delimiter to use for parsing the input file.
• quote ("): The quote character used in the input file.
• escape (\): The escape character used for escaping the quote character in the input file.
• numOfColumns (12): The number of columns in an input record.
• coordinateSystem (EPSG:4326): The coordinate system used to create a point geometry from the longitude and latitude in the input record.
• longitudeColumn (8): The 0-based index of the longitude column.
• latitudeColumn (7): The 0-based index of the latitude column.

    Join

A Join operation is performed on the results of the BoundaryDriver stage and the ShorelineDriver stage. The join returns the records that are present in both the left input and the right input.
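The sketch below illustrates that behavior with plain Java: the right input is indexed by a shared record key and each matching left record is emitted with the right record's risk score appended. The key column and record layouts are assumptions for illustration; the sample drivers read their layout from the configuration file.

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch of the Join stage's behavior: match left and right records on a
    // shared key and append the right record's risk score to the left record.
    // The key column and score positions are assumptions for illustration.
    public class JoinSketch {

        public static void main(String[] args) {
            // Hypothetical shoreline output (left): record id, address, coastal score.
            List<String[]> left = Arrays.asList(
                    new String[] {"policy-001", "123 Main St", "5"},
                    new String[] {"policy-002", "456 Oak Ave", "0"});

            // Hypothetical fire-risk output (right): record id, fire score.
            List<String[]> right = Arrays.asList(
                    new String[] {"policy-001", "7"},
                    new String[] {"policy-002", "3"});

            // Index the right input by its key so each left record can find its match.
            Map<String, String[]> rightByKey = new HashMap<>();
            for (String[] r : right) {
                rightByKey.put(r[0], r);
            }

            // Emit only records present in both inputs, with the right-hand score appended.
            for (String[] l : left) {
                String[] match = rightByKey.get(l[0]);
                if (match != null) {
                    System.out.println(String.join("\t", l[0], l[1], l[2], match[1]));
                }
            }
        }
    }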

Steps to Execute MapReduce Job

    1. Deploy spectrum-bigdata-samples-riskassessment-mapreduce-version.jar and input data to the Hadoop cluster.

    2. Copy input data to HDFS using the following commands:

hadoop fs -copyFromLocal /PB/spectrum-bigdata-samples/leftInput.txt /dir/on/hdfs/input
hadoop fs -copyFromLocal /PB/spectrum-bigdata-samples/rightInput.wkt /dir/on/hdfs/wkt

    3. Start Hadoop job using the following command:

hadoop jar spectrum-bigdata-samples-riskassessment-mapreduce-version.jar
    com.pb.bigdata.spatial.mapreduce.app.JoinMRDriver
    -input /dir/on/hdfs/input
    -output /dir/on/hdfs/output
    -config /PB/spectrum-bigdata-samples/mapReduce/data/config.xml
    -overwrite

    where:

• -input The HDFS path to the input directory
• -output The HDFS path to the output directory
• -config The configuration file path on the server

    Optional parameters:

    The Join Driver also supports -overwrite to overwrite the old output folder on HDFS.

    Join Driver supports tab- and comma-delimited input files. The default is tab. Include as appropriate:

• For files with tabs: -delimiter "\t"
• For files with commas: -delimiter ","


The quote character used in the input files is configurable. The default is a double quote. Include as appropriate:

• For files with a double quote as the quote character: -quote "\""
• For files with a grave accent as the quote character: -quote "`"

It also supports a configurable escape character, used to escape the quote character in the input files. If you do not specify an escape character, no escape character is used. Include as appropriate:

    • For files with an escape character as backslash: -escape "\\"

    Steps to Execute Spark Job

    1. Deploy spectrum-bigdata-samples-riskassessment-spark-version.jar and input data to the Hadoop cluster.

    2. Copy the input data to HDFS using the following commands:

hadoop fs -copyFromLocal /PB/spectrum-bigdata-geocoding/leftInput.txt /dir/on/hdfs/leftInput

    3. Start the Spark job using the following command:

spark-submit --class com.pb.bigdata.spatial.spark.app.JoinDriver
    --master local[*] --deploy-mode cluster
    /home/centos/risk-spark/spectrum-bigdata-samples-riskassessment-spark-version.jar
    -leftInput /home/centos/MR_Risk_Assessment/shoreline_risk
    -rightInput /home/centos/MR_Risk_Assessment/fire_risk
    -output /home/centos/MR_Risk_Assessment/join_output
    -overwrite -config /home/centos/risk-spark/config_spark.xml
    -failures /home/centos/MR_Risk_Assessment/join_failures

    where:

• -leftInput The HDFS path to the left input directory
• -rightInput The HDFS path to the right input directory
• -output The HDFS path to the output directory
• -config The configuration file path on the server
• -master The master node's address

    Optional parameters:

The Join Driver also supports -overwrite to overwrite the old output folder on HDFS and -failures to specify the failures directory on HDFS. Additionally, --name can be passed as an optional parameter to set the application name.

    Join Driver supports tab- and comma-delimited input files. The default is tab. Include as appropriate:

• For files with tabs: -delimiter "\t"
• For files with commas: -delimiter ","

The quote character used in the input files is configurable. The default is a double quote. Include as appropriate:

• For files with a double quote as the quote character: -quote "\""
• For files with a grave accent as the quote character: -quote "`"

It also supports a configurable escape character, used to escape the quote character in the input files. If you do not specify an escape character, no escape character is used. Include as appropriate:

    • For files with an escape character as backslash: -escape "\\"

The risk scores from the matching record in the right input are appended to the record from the left input.

Configuration

This XML file contains the configuration required for the Spark and MapReduce jobs. Each property is listed below with its sample value and description:

• delimiter (\t): The delimiter to use for parsing the input file.
• quote ("): The quote character used in the input file.

    esc