postgresql workshop: working with network data

40
PostgreSQL Workshop: Working with Network Data PGCon 2014 Workshop James Hanson 21-May-14

Upload: galia

Post on 25-Feb-2016

38 views

Category:

Documents


1 download

DESCRIPTION

PostgreSQL Workshop: Working with Network Data. PGCon 2014 Workshop James Hanson 21-May-14. About the presenter …. James Hanson [email protected] , http://www.linkedin.com/in/jamesphanson http://jamesphanson.wikispaces.com/ Reformed Oracle RDBMS developer / administrator - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: PostgreSQL Workshop: Working with Network Data

PostgreSQL Workshop:Working with Network Data

PGCon 2014 WorkshopJames Hanson

21-May-14

Page 2: PostgreSQL Workshop: Working with Network Data

2

About the presenter …

• James [email protected],http://www.linkedin.com/in/jamesphansonhttp://jamesphanson.wikispaces.com/

• Reformed Oracle RDBMS developer / administrator

• I am interested in network data and I see PostgreSQL as powerful, underutilized tool for storing and analyzing network data

21-May-14 PGCon14 Working with Network Data Types

Page 3: PostgreSQL Workshop: Working with Network Data

3

About the presenter …

• James Hanson Parsons, Inc. [email protected]

21-May-14 PGCon14 Working with Network Data Types

Page 4: PostgreSQL Workshop: Working with Network Data

4

What are we doing in this workshop?

• Using PostgreSQL to explore the question “What can I find out about the requester_IP addresses in my Apache log?”…plus some side trips …– Parsing & loading an Apache access log and

associating the requesting IPs with GeoLocated, known malicious and other IP lists.

NOTE: All workshop data is free and publicly available.

21-May-14 PGCon14 Working with Network Data Types

Page 5: PostgreSQL Workshop: Working with Network Data

5

What will we get from the workshop?• Hands-on experience using PG to load, relate and

analyze moderate-sized network data sets• Simple & useful techniques for loading network data• Simple and useful techniques for exploring network

data• Experience exploring some public data sets• Get jazzed to use the techniques and code samples

on your network data• Meet other folks with complementary interests and

expertise21-May-14 PGCon14

Working with Network Data Types

Page 6: PostgreSQL Workshop: Working with Network Data

6

What’s not in this workshop?• GUI displays or application integration.

We focus on the data• PG optimization, tuning or extensions other than

PostGIS• RDBMS normalizing, surrogate keys or orthodoxy*• Automated workflow• Handling of exceptions, errors and boundary

conditions* We’ll revisit this later

21-May-14 PGCon14 Working with Network Data Types

Page 7: PostgreSQL Workshop: Working with Network Data

7

Workshop setup and logistics• Using an EnterpriseDB PostgreSQL 9.3 + PostGIS

with no additional extensions• Hosted on a Centos 6.5 x64 virtual machine with

3GB RAM and 2* CPU cores• Using three workshop downloads …– The data and DDL queries are zipped and available on

GoogleDocs– The PowerPoint is on Slide Share– The SQL and Python files are on GitHub– Everything is linked off PGCon14 site

21-May-14 PGCon14 Working with Network Data Types

Page 8: PostgreSQL Workshop: Working with Network Data

8

Workshop setup errata• PGAdmin Query Tool

Show NULL values as <NULL>• Centos yum packages:

policycoreutils-gui setroubleshoot lzop sambados2unix system-config-lvm.noarch

• PostgreSQL non-default parameters Workshop value Enterprise DB default

max_connections 10 100 effective_cache_size 262,144 16,384maintenance_work_mem 524,288 16,384max_stack_depth 2,048 100shared_buffers 98,304 1,024temp_buffers 131,072 1,024work_mem 1,048,576 1,024fsync off onsynchronous_commit off onencoding UTF-8

21-May-14 PGCon14 Working with Network Data Types

Page 9: PostgreSQL Workshop: Working with Network Data

9

PostgreSQL network data typesName Storage Size Description

cidr 7 or 19 bytes IPv4 and IPv6 networks

inet 7 or 19 bytes IPv4 and IPv6 hosts and networks

macaddr 6 bytes MAC addresses

21-May-14 PGCon14 Working with Network Data Types

An inetrange type is a natural extension that complements subnets

CREATE TYPE inetrange AS RANGE ( SUBTYPE = inet);

Page 10: PostgreSQL Workshop: Working with Network Data

10

PostgreSQL network operators

21-May-14 PGCon14 Working with Network Data Types

Operator Description Example

<= is less than or equal inet '192.168.1.5' <= inet '192.168.1.5'

= equals inet '192.168.1.5' = inet '192.168.1.5'

<> is not equal inet '192.168.1.5' <> inet '192.168.1.4'

<< is contained within inet '192.168.1.5' << inet '192.168.1/24'

<<= is contained within or equals inet '192.168.1/24' <<= inet '192.168.1/24'

>> contains inet '192.168.1/24' >> inet '192.168.1.5'

Page 11: PostgreSQL Workshop: Working with Network Data

11

PostgreSQL ranges and arrays• Ranges– 6 built in types. Easy to create custom types– Can by inclusive, [, or exclusive, (

We’ll use inclusive lower and upper bounds, […]– Use the “contains element”, @>, and “element is

contained by”, <@, operators• Arrays … just the tip or the iceberg– Use for parsing raw data and alternative storage

models– Search arrays with the ANY operator

21-May-14 PGCon14 Working with Network Data Types

Page 12: PostgreSQL Workshop: Working with Network Data

12

Q: Why are we doing this with PostgreSQL?

I was told that “the cloud” was the only place to work with network data!

21-May-14 PGCon14 Working with Network Data Types

Page 13: PostgreSQL Workshop: Working with Network Data

13

A1: Because RDBMSs have some advantages

• Cloud technologies have a difficult time with semantically similar but syntactically different data such as:– Subnets:

192.168.100.24/25 =192.168.100.24 255.255.255.128

– Date / time:21-May-14 = 05/21/14 = 1400638460

• RDBMS are (still) much faster with GUI responses on small / moderate data sets

21-May-14 PGCon14 Working with Network Data Types

Page 14: PostgreSQL Workshop: Working with Network Data

14

A2: Because PostgreSQL has some useful and unique capabilities

• PostgreSQL network data types and operators enable:– Much faster Layer3-type operations … because you

can think like a router– Data validation (NOTE: custom solutions are nearly

impossible for IPv6)• PG arrays can simplify SQL syntax and ERDs.• PG regexp_split_to_array simplifies ETL

… but COPY still lacks important functionality

21-May-14 PGCon14 Working with Network Data Types

Page 15: PostgreSQL Workshop: Working with Network Data

15

Our workshop approach

• Our approach: What useful and tough problems can I solve with PostrgreSQL’s unique capabilities and low cost?

• ! Our approach: Given that you are using a relational database management system, these are the reasons why PostgreSQL is the best brand to choose.

21-May-14 PGCon14 Working with Network Data Types

Page 16: PostgreSQL Workshop: Working with Network Data

16

Workshop logistics• Preamble / instruction• Brief overview of our datasets• Transform and load the data together

-- Intermezzo --• Query the data together• Questions, follow-up & next steps.NOTE: Code snippets are at the end of the PPT.

All files are linked off of PGCon14

21-May-14 PGCon14 Working with Network Data Types

Page 17: PostgreSQL Workshop: Working with Network Data

17

Our buffet of data …All free and publicly available …• OurAirports Airport GeoLoc

http://ourairports.com• CSC InfoChimps US IPGeo

http://www.infochimps.com/datasets • Dragon Research Group Malicious IPs

http://www.dragonresearchgroup.org/ • National Vulnerabilities DatabaseIAVAs

http://nvd.nist.gov/home.cfm • GeoLite IPv6 Geo

http://geolite.maxmind.com/21-May-14 PGCon14

Working with Network Data Types

Page 18: PostgreSQL Workshop: Working with Network Data

18

… Our buffet of data• MyIP Malicious IPs

http://myip.ms• Cowgar Apache log

http://notebook.cowgar.com• The Internet Traffic Archive tracert

ftp://ita.ee.lbl.gov/• Cooperative Association for Internet Data Analysis

(CAIDA) DNShttp://www.caida.org

• NIL ARP table http://wiki.nil.com/

21-May-14 PGCon14 Working with Network Data Types

Page 19: PostgreSQL Workshop: Working with Network Data

19

Dive into Airport data …

../020_Airports/• 4 csv files:– airports, runways, countries and regions

AirportDDL.sql

21-May-14 PGCon14 Working with Network Data Types

Page 20: PostgreSQL Workshop: Working with Network Data

20

INTERMEZZO FOR 10 MINUTES

21-May-14 PGCon14 Working with Network Data Types

Page 21: PostgreSQL Workshop: Working with Network Data

21

Let’s query our data …

../Queries/• Where else have the Apache_log

requesting_ip addresses appeared?

USIPGeo_Queries.sql

21-May-14 PGCon14 Working with Network Data Types

Page 22: PostgreSQL Workshop: Working with Network Data

22

REVISIT NORMALIZATION, QUESTIONS AND WRAP UP

21-May-14 PGCon14 Working with Network Data Types

Page 23: PostgreSQL Workshop: Working with Network Data

23

*Normalizing and surrogate keysNOTE: This is a revisited topic from the workshop pre-amble. • Normalization is very useful for handling ACID

transactions and transaction-entered data (vs. bulk-loaded).

• Surrogate keys are loved by Hibernate, may help in some situations and are part or many ORM methodologies.

But that does not generally apply to immutable network data intended for analysis … like we have.

21-May-14 PGCon14 Working with Network Data Types

Page 24: PostgreSQL Workshop: Working with Network Data

24

*Normalizing and surrogate keys(2)

• And normalization has a steep cost in …– Increased complexity to load and query data– Hard coded assumptions in a complex ERD– Reduced performance **– Intellectual distance from the data

• The reasons to normalize and use surrogate keys don’t apply here but the costs do, so I don’t do it.(feel free to explore this topic off-line with me)

** This is not true in all cases. I refer to index-based joins and searches

21-May-14 PGCon14 Working with Network Data Types

Page 25: PostgreSQL Workshop: Working with Network Data

25

Q: Why did you use arrays and custom data types?

Doesn’t that violate RDBMS principles plus lock you in to custom PostgreSQL syntax?

A: Yes. Here’s why I suggest you consider that:Dr’s. Boyce and Codd did not include arrays in their

RDBMS theoretical work and arrays are not in the ANSI-SQL standard … but that work is focused on ACID transactions, which do not apply here.

PostgreSQL network types are unique … so you are already committed to custom code. Why not take full advantage of PGs capabilities.

21-May-14 PGCon14 Working with Network Data Types

Page 26: PostgreSQL Workshop: Working with Network Data

26

Wrap up and follow-on …

• IP and IP/subnet data types enable you to think of network data as a Layer-3 device does– This will always be faster and simpler than

treating IPs as strings or numbers.• Ex. “Which of these IPs is contained in that

large set if IP subnets and ranges?”– 1-line WHERE clause in PostgreSQL– … still working on it in other environments

21-May-14 PGCon14 Working with Network Data Types

Page 27: PostgreSQL Workshop: Working with Network Data

27

Wrap up and follow-on …

• An inetrange type is a natural extension that complements subnetsCREATE TYPE inetrange AS RANGE ( SUBTYPE = inet);

• If you are using PG network data types with read-only data for analysis …… consider using custom data types and arrays– You have already committed to custom-code so you

may as well reap the (substantial) benefits.

21-May-14 PGCon14 Working with Network Data Types

Page 28: PostgreSQL Workshop: Working with Network Data

28

XML and XPath notes

• The idealized ETL process for XML (and JSON) is ETL … don’t Extract or Transform, just Load

• PostgreSQL has good, but not great, XML support• Whatever you do, don’t write custom code to

shred XML to RDBMS then build XML again for delivering to applications. Please.

• Very good XPath resources are:http://zvon.org/comp/r/tut-XPath_1.html#Pages~List_of_XPaths http://stackoverflow.com/questions/17799790/using-xpath-to-extract-data-from-an-xml-column-in-postgres for details

21-May-14 PGCon14 Working with Network Data Types

Page 29: PostgreSQL Workshop: Working with Network Data

29

PostgreSQL: Working with Network Data

• James Hanson [email protected],http://www.linkedin.com/in/jamesphansonhttp://jamesphanson.wikispaces.com/

• Please provide feedback.This presentation is new and I’d like to make it better.

(… and please let me know if you see / have any cool data sets too!)

21-May-14 PGCon14 Working with Network Data Types

Page 30: PostgreSQL Workshop: Working with Network Data

30

QUESTIONS AND COMMENTS …James Hanson / [email protected]

21-May-14 PGCon14 Working with Network Data Types

Page 31: PostgreSQL Workshop: Working with Network Data

31

Code snippets and techniques from AirportsDDL.sql

-- CountriesCREATE TABLE countries ( id BIGINT PRIMARY KEY, code CHAR(10) NOT NULL UNIQUE, name VARCHAR(100), continent CHAR(4), wikipedia_link VARCHAR(500), keywords VARCHAR(300));

COMMENT ON TABLE countries IS 'Data from http://ourairports.com/data/';

COPY countriesFROM '/Pgcon14/Scripts/020_Airports/countries.csv' WITH DELIMITER ',' CSV HEADER;

21-May-14 PGCon14 Working with Network Data Types

Page 32: PostgreSQL Workshop: Working with Network Data

32

Code snippets and techniques from AirportsDDL.sql

-- Update the GEOMETRY column from the long / latUPDATE airports SET location = ST_GeomFromText('POINT('||longitude_deg||' '|| latitude_deg||')',4326);

21-May-14 PGCon14 Working with Network Data Types

Page 33: PostgreSQL Workshop: Working with Network Data

33

Code snippets and techniques from USIPGeoDDL.sql

-- COPY into temporary table for processing-- Create temporary tableCREATE TABLE us_ipgeo_temp ( id SERIAL, col1 TEXT);

-- Load temporary tableCOPY us_ipgeo_temp( col1) FROM '/Pgcon14/Scripts/030_USIPGeo/infochimps_dataset_13358_download_16680/ip_blocks_us_geo.tsv‘WITH DELIMITER '|' -- use a character that is not in the datasetCSV;

21-May-14 PGCon14 Working with Network Data Types

Page 34: PostgreSQL Workshop: Working with Network Data

34

Code snippets and techniques from USIPGeoDDL.sql-- Use a CTE (Common Table Expression) syntax and-- regexp_split_to_array to look at the data in our temporary table.WITH us_ipgeo_array AS (SELECT regexp_split_to_array(col1, '\t') AS col1FROM us_ipgeo_temp-- LIMIT 10)SELECT col1[1] AS col1_1, col1[2] AS col1_2, col1[3] AS col1_3, col1[4] AS col1_4, col1[5] AS col1_5, col1[6] AS col1_6, col1[7] AS col1_7, col1[8] AS col1_8, col1[9] AS col1_9, col1[10] AS col1_10, col1[11] AS col1_11, col1[12] AS col1_12, col1[13] AS col1_13, col1[14] AS col1_14FROM us_ipgeo_array;

21-May-14 PGCon14 Working with Network Data Types

Page 35: PostgreSQL Workshop: Working with Network Data

35

Code snippets and techniques from USIPGeoDDL.sql-- Manipulate and CAST the fields as neededWITH us_ipgeo_array AS (SELECT regexp_split_to_array(col1, '\t') AS col1FROM us_ipgeo_tempLIMIT 10000 -- check more records too)SELECT -- col1[1] AS col1_1, -- col1[2] AS col1_2, -- col1[3] AS col1_3, -- col1[4] AS col1_4, col1[5]::inet AS ip, REPLACE(col1[6], '"', '')::CHAR(2) AS country, REPLACE(col1[7], '"', '')::CHAR(2) AS state, REPLACE(col1[8], '"', '')::VARCHAR(80) AS city, REPLACE(col1[9], '"', '')::CHAR(5) AS zip_code, col1[10]::real AS latitude, col1[11]::real AS longitude, ST_GeomFromText('POINT('||col1[11]::real||' '|| col1[10]::real||')',4326) AS location, NULLIF(col1[12], '')::int AS metro_code, -- A few values were empty string, IS NOT NULL and does not CAST to INT col1[13]::text AS area_code -- col1[14] AS col1_14FROM us_ipgeo_arrayORDER BY metro_code NULLS LAST;

21-May-14 PGCon14 Working with Network Data Types

Page 36: PostgreSQL Workshop: Working with Network Data

36

Code snippets and techniques from USIPGeoDDL.sql-- Create and populate us_ipgeo from us_ipgeo_temp-- Try it once with a LIMIT statementSELECT * INTO us_ipgeoFROM (WITH us_ipgeo_array AS (SELECT regexp_split_to_array(col1, '\t') AS col1FROM us_ipgeo_temp-- LIMIT 10000 -- check more records too)SELECT -- col1[1] AS col1_1, -- col1[2] AS col1_2, -- col1[3] AS col1_3, -- col1[4] AS col1_4, col1[5]::inet AS ip, REPLACE(col1[6], '"', '')::CHAR(2) AS country, REPLACE(col1[7], '"', '')::CHAR(2) AS state, REPLACE(col1[8], '"', '')::VARCHAR(80) AS city, REPLACE(col1[9], '"', '')::CHAR(5) AS zip_code, col1[10]::real AS latitude, col1[11]::real AS longitude, ST_GeomFromText('POINT('||col1[11]::real||' '|| col1[10]::real||')',4326) AS location, NULLIF(col1[12], '')::int AS metro_code, -- A few values were empty string, IS NOT NULL and does not CAST to INT col1[13]::text AS area_code -- col1[14] AS col1_14FROM us_ipgeo_array) AS tmp_table;

21-May-14 PGCon14 Working with Network Data Types

Page 37: PostgreSQL Workshop: Working with Network Data

37

Code snippets and techniques from ApacheAccessLogs_DDL.sql

-- COPY into temporary table for processing-- Create temporary tableCREATE TABLE apache_log_temp ( id SERIAL, col1 TEXT);

-- Set the client_encoding for the same reason as beforeSET client_encoding TO 'latin1';

-- Load temporary tableCOPY apache_log_temp( col1) FROM '/Pgcon14/Scripts/080_ApacheLogs/access.log' WITH DELIMITER '^' -- use a character that is not in the dataset -- CSV ;

21-May-14 PGCon14 Working with Network Data Types

Page 38: PostgreSQL Workshop: Working with Network Data

38

Code snippets and techniques from ApacheAccessLogs_DDL.sql-- Try it once with a LIMIT statementSELECT *INTO apache_logFROM (WITH apache_log_array AS (SELECT regexp_split_to_array(col1, ' ') AS col1FROM apache_log_temp-- LIMIT 1000)SELECT col1[1]::inet AS requesting_ip, to_timestamp(RIGHT(col1[4], -1) || LEFT(col1[5], -6), 'DD/Mon/YYYY:HH24:MI:SS') AT TIME ZONE INTERVAL '-04:00' AS time, RIGHT(col1[6], -1)::VARCHAR(10) AS action, col1[7]::VARCHAR(100) AS resource, LEFT(col1[8], -1)::VARCHAR(10) AS protocol, NULLIF(col1[9], '-')::int AS status_to_client, NULLIF(col1[10], '-')::int AS object_size_bytes, REPLACE(NULLIF(col1[11], '"-"'), '"', '')::VARCHAR(100) AS referrer, REPLACE(COALESCE(col1[12], '') || COALESCE(col1[13], '') || COALESCE(col1[14], '') || COALESCE(col1[15], '') || COALESCE(col1[16], '') || COALESCE(col1[17], '') || COALESCE(col1[18], '') || COALESCE(col1[19], '') || COALESCE(col1[20], ''), '"', '')::VARCHAR(200) AS user_agentFROM apache_log_array) AS tmp_table;

21-May-14 PGCon14 Working with Network Data Types

Page 39: PostgreSQL Workshop: Working with Network Data

39

Code snippets and techniques from tracert_DDL.sql

-- Lets play a bit with an array structure for the tracert table-- DROP TYPE tracert_trip;CREATE TYPE tracert_trip AS ( hop INTEGER, from_ip INET, to_ip INET);

SELECT source_file, dest_name, dest_ip, array_agg((hop, from_ip, to_ip)::tracert_trip ORDER BY hop) tracert_tripFROM tracertGROUP BY source_file, dest_name, dest_ipLIMIT 10000;

21-May-14 PGCon14 Working with Network Data Types

Page 40: PostgreSQL Workshop: Working with Network Data

40

Code snippets and techniques from 03_MyIP_500TopHosts_Array_DDL.sql

-- Want to use a range for the IPs-- Create custom typeCREATE TYPE inetrange AS RANGE ( SUBTYPE = inet);

-- Check out our data-- SELECT inclusive, [], range for our inetrangeSELECT website, ww_site_rank, hosting_country, hosting_company, owner_country, parent_hosting_company, parent_country, inetrange(owner_ip_from, owner_ip_to, '[]') AS owner_range, -- [] means upper and lower inclusive parent_ip_rangeFROM tophostsORDER BY owner_range;

21-May-14 PGCon14 Working with Network Data Types