gi2016 ppt shi (big data analytics on the internet)

BIG DATA ANALYTICS ON

THE INTERNET

Dr. Shaozhong SHI

[email protected]

Drawing data from geographically

dispersed data stores over the Internet

A showcase of international remote access to and

use of open data and applications over the Internet is

presented.

It shows how automation in big data analytics can be

achieved on the Internet.

It shows the importance of standardisation and

accessibility of data.

It illustrates, with a live example, how Open Source

tools can be utilised for advancing big data analytics.

Drawing data from geographically

dispersed data stores over the Internet

It shows the design of a new application that uses

Open Source tools such as Pandas, NumPy and

Matplotlib.

It explains how full automation in sourcing and

processing data and generating analytical output can

be achieved.

It shows the importance of the standardisation of data

and the role of geographical identifiers in automated

data processing.

Some key solutions for working

across multiple Pandas dataframes

(tables)

This PowerPoint show covers some key solutions that

are important to data linkage, data integration,

working across multiple Pandas dataframes

(tables), and automation in processing.

These are key solutions for automated, exact

processing of records.

The showcase implementation is provided in an

IPython notebook. See the link below:

http://dev.mapofagriculture.com:9999/ipython/notebooks/sshaozhong/2016-05-16_Automatic_Aggregation_Disaggregation_Showcase.ipynb

Original Online Data from USGS

The original data are held in a large, well-structured

Excel sheet on the following USGS website:

http://water.usgs.gov/pubs/sir/2006/5012/excel/Nutrient_Inputs_1982-2001jan06.xls

It is used as the input to the program. It is

geo-indexed with Federal Information Processing

Standards (FIPS) codes.

The data are read into the newly developed program

and stored as a Pandas dataframe table.

A subset of the data was extracted to create a

Pandas dataframe table that serves as the input table.
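
The sourcing step described above might look roughly like the following sketch. It assumes that `pandas.read_excel` can reach the USGS URL (which needs network access and an `.xls` engine such as `xlrd`), and the column names (`FIPS`, `State`, `farm_1987`, `nonfarm_1987`) are hypothetical stand-ins, since the slides do not list the sheet's real headers. The subset step is shown on a small made-up table so that it runs offline.

```python
import pandas as pd

# URL quoted in the slides; reading it needs network access and an .xls engine.
USGS_URL = ("http://water.usgs.gov/pubs/sir/2006/5012/excel/"
            "Nutrient_Inputs_1982-2001jan06.xls")


def read_source(source=USGS_URL):
    """Read the first sheet of the workbook into a dataframe (not called below)."""
    return pd.read_excel(source)


def take_subset(raw, id_columns, prefixes, years):
    """Keep the identifier columns plus one column per (prefix, year) pair.

    Names such as 'farm_1987' are hypothetical; the real sheet has its own headers.
    """
    wanted = list(id_columns)
    for year in years:
        for prefix in prefixes:
            wanted.append(f"{prefix}_{year}")
    return raw.loc[:, wanted].copy()


# Tiny stand-in for the USGS sheet, with made-up values.
raw = pd.DataFrame({
    "FIPS": ["01001", "01003"],
    "State": ["AL", "AL"],
    "farm_1986": [1.0, 2.0],
    "farm_1987": [10.0, 20.0],
    "nonfarm_1987": [1.0, 2.0],
})
subset = take_subset(raw, ["FIPS", "State"], ["farm", "nonfarm"], range(1987, 1988))
print(subset.columns.tolist())
```

With the real sheet, `read_source()` would supply `raw`, and `years` would be `range(1987, 2002)`.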


Original data:

Nitrogen Input from Fertilizer Use (kilograms)

in each year between 1987 and 2001

A subset of a large spreadsheet

Characterisation of the new algorithm for spatial

statistical aggregation and disaggregation

The primary questions that this work set out to answer are

whether automated means can be designed and developed

for use in data integration and integrated processing of

agricultural census datasets,

and whether aggregation by state and dis-aggregation

of state-level values into county-level values can be

automated.

To this end, exploratory design, development and testing

were carried out. An integrated set of algorithms was

researched, designed, implemented and tested on the Map

of Agriculture platform.

The integrated algorithms are collectively called Data

Linkage for Data Integration and Automated Aggregation

and Dis-aggregation.



The Automated Aggregation and Dis-aggregation is a

prototype program that was developed in order to

enable rapid development of data integration and

integrated processing with Open Source Python tools

and libraries.

The Automated Aggregation and Dis-aggregation program

uses Python with the Pandas and NumPy libraries.

It is characterised by efficient and exact data integration,

data inflow to and outflow from Pandas dataframe tables,

and integrated processing.



The two sets of algorithmic solutions implemented are

automatic online sourcing of structured data

and Automated Aggregation and Dis-aggregation itself.

The first accesses the data, reads them in and takes a subset.

The second carries out integrated processing,

aggregating county-level statistics into state-level

statistics and dis-aggregating state-level statistics into

county-level statistics by rule.

Automated aggregation uses addition and summation.

A loop sums the farm and non-farm statistics at

county level for each year from 1987 to 2001.

Aggregated state-level statistics are then produced by

using the State FIPS codes as the key.
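
A minimal sketch of this aggregation step follows. It assumes hypothetical column names (`farm_<year>`, `nonfarm_<year>`, `StateFIPS`) and made-up values; the notebook's actual names are not given in the slides.

```python
import pandas as pd

YEARS = range(1987, 2002)  # 1987 to 2001 inclusive


def aggregate_by_state(county, years=YEARS):
    """Add farm and non-farm inputs for each year, then total them by state."""
    totals = county[["StateFIPS"]].copy()
    for year in years:
        totals[f"total_{year}"] = county[f"farm_{year}"] + county[f"nonfarm_{year}"]
    # groupby makes StateFIPS the index of the resulting table.
    return totals.groupby("StateFIPS").sum()


# Made-up county records for two states and two years.
county = pd.DataFrame({
    "StateFIPS": ["01", "01", "04"],
    "FIPS": ["01001", "01003", "04001"],
    "farm_1987": [10.0, 20.0, 5.0],
    "nonfarm_1987": [1.0, 2.0, 0.5],
    "farm_1988": [11.0, 21.0, 6.0],
    "nonfarm_1988": [1.0, 2.0, 0.5],
})
state = aggregate_by_state(county, years=[1987, 1988])
print(state)
```

The loop mirrors the per-year addition described on the slide, and the final `groupby` uses the State FIPS code as the key, so the result has one row per state.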

Working of the processing

Aggregation: Input

Working of the processing

Aggregation: Output of adding farm and non-farm statistics, repeated for all years

Working of the processing

Aggregation: Output of applying groupby with the use of StateFIPS



The output of automatic aggregation is a Pandas dataframe

table which is indexed with the State FIPS codes.

Dis-aggregation:

The showcase uses a rule which assumes that county-level

statistics contribute to state-level statistics in proportion

to the county's share of the area within the state.

The total land areas of the states are collected from the

aggregated output table through vLookup. They

are stored in a Python dictionary as a geo-referenced

dataset.

These are mapped exactly into the right positions in a new

column in the intermediary table for producing

dis-aggregated statistics.
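
The dictionary-based lookup described above can be sketched as follows. The column names (`area`, `StateFIPS`, `state_area`) and the values are assumptions made for illustration only. `Series.map` plays the role of the vLookup: each county row receives the total for its own state, whatever the row order.

```python
import pandas as pd

# Made-up county table; rows are deliberately not sorted by state.
county = pd.DataFrame({
    "StateFIPS": ["04", "01", "01"],
    "FIPS": ["04001", "01001", "01003"],
    "area": [100.0, 100.0, 200.0],
})

# Aggregated table: total land area per state, indexed by StateFIPS.
state_table = county.groupby("StateFIPS")[["area"]].sum()

# Store the state totals in a dictionary keyed by State FIPS code.
state_area = state_table["area"].to_dict()

# Look up each county's state total and place it in a new column.
county["state_area"] = county["StateFIPS"].map(state_area)

# A missing key would produce NaN, so this check guards the linkage.
assert county["state_area"].notna().all()
print(county)
```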

Output of dis-aggregating statistics on Nitrogen Input

Characterisation of the new algorithm

Then, the ratio between each county and its state is

calculated.

A loop is used to calculate dis-aggregated statistics for all

counties for each year from 1987 to 2001.

This results in a Pandas dataframe table of

dis-aggregated statistics.



New approach to dis-aggregating tabular statistics into

smaller geographical units (no intersection of geometric

objects is required):

The ratio between each county and its state is

calculated. A loop is used to calculate dis-aggregated

nitrogen input statistics for all counties for each year

between 1987 and 2001. The total of a state times the

ratio yields a dis-aggregated sum for the county. This

logic of dis-aggregation has been used in areal

interpolation as a technique for spatial disaggregation

(Flowerdew and Green, 1992, 1994; Goodchild, Anselin

and Deichmann, 1993). This results in a Pandas

dataframe table of dis-aggregated statistics.
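
The rule "state total times ratio" can be sketched as below. It assumes the ratio is each county's area divided by its state's area, and uses hypothetical column names (`area`, `ratio`, `total_<year>`, `est_<year>`) with made-up values. Only table operations are involved, and no polygon intersection is performed.

```python
import pandas as pd


def disaggregate(county, state, years):
    """Estimate county values as state total x (county area / state area)."""
    out = county[["StateFIPS", "FIPS", "area"]].copy()
    state_area = out.groupby("StateFIPS")["area"].transform("sum")
    out["ratio"] = out["area"] / state_area
    for year in years:
        # Look up each county's state total by State FIPS code.
        state_total = out["StateFIPS"].map(state[f"total_{year}"])
        out[f"est_{year}"] = state_total * out["ratio"]
    return out


county = pd.DataFrame({
    "StateFIPS": ["01", "01", "04"],
    "FIPS": ["01001", "01003", "04001"],
    "area": [100.0, 300.0, 50.0],
})
state = pd.DataFrame(
    {"total_1987": [400.0, 60.0]},
    index=pd.Index(["01", "04"], name="StateFIPS"),
)
result = disaggregate(county, state, years=[1987])

# The county estimates add back up to the state totals.
check = result.groupby("StateFIPS")["est_1987"].sum()
print(result)
print(check)
```

Because the ratios within each state sum to one, the county estimates reproduce the state total exactly, which is a useful sanity check on the linkage.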



Hitherto, areal interpolation and dasymetric mapping

(Flowerdew and Green, 1992, 1994; Goodchild, Anselin

and Deichmann, 1993) have been the only known approaches

and methods for spatially dis-aggregating statistics that are

relevant to the current work, particularly regarding

the processing of tabular statistics in vector GIS

datasets. The current work uses the logic of areal

interpolation, as far as the datasets involved

currently allow. The difference between the current

implementation of calculations and areal interpolation

is that the current implementation does not involve

intersection of area features/polygons.

Characterisation of the new algorithm for

spatial statistical aggregation and

disaggregation

There is a degree of uncertainty in the

estimates. Improving the estimation requires further

research. Nevertheless, it is a step

forward in enabling estimation in situations

where no data are collected at county level. It offers a

means to provide a quantitative indication. It is

particularly useful for the processing of tabular

statistics or when patterns need to be visualised at

large scales.



The algorithmic solutions are characterised by their

capability to track the geo-referenced data entries

throughout cycles of processing, and by exact

geo-referenced data retrieval and mapping, namely data

inflow to and outflow from Pandas dataframe tables.

The dis-aggregation algorithm/procedure can be used

for direct processing of tabular statistics without

involving intersection of polygons, particularly in

situations where neatly nested geospatial boundary

files of US states and counties are used.



The new algorithm can carry out automatic online

sourcing of datasets and integrated processing with

Open Source Python libraries. It

can be further extended for linking geodata from

various sources, and for the creation of indexed tabular

datasets with geographical identifiers.

It can carry out automatic aggregation and

dis-aggregation of agricultural census datasets for all

states and counties in the USA.



The Federal Information Processing Standards (FIPS)

codes were used as geographical identifiers for

geo-referenced data entries. They play a critical role in retrieving

data from databases and mapping data into the right

positions. They efficiently enable vLookup

solutions for retrieving data and mapping them to exact

positions in tables as desired.

Geographical identifiers serve as the key and are critically

important in linking data between tables and creating

geo-indexed tabular datasets. Geographical identifiers track

attribute data entries in reference to geospatial objects.

This vLookup solution can be modified and used for other

geodata projects.
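
As an illustration of FIPS codes acting as the linking key, the sketch below joins a county table to a state table. The table layouts are assumptions; what it relies on is that the first two digits of a five-digit county FIPS code identify the state. `DataFrame.merge` is used here as one possible form of the lookup, and is not necessarily what the notebook does.

```python
import pandas as pd

county = pd.DataFrame({
    "FIPS": ["01001", "01003", "04001"],
    "County": ["Autauga", "Baldwin", "Apache"],
})
# The first two digits of a five-digit county FIPS code identify the state.
county["StateFIPS"] = county["FIPS"].str[:2]

state = pd.DataFrame({
    "StateFIPS": ["01", "04"],
    "State": ["AL", "AZ"],
})

# validate="many_to_one" raises an error if a state code is duplicated
# in the state table, which would silently multiply county rows.
linked = county.merge(state, on="StateFIPS", how="left", validate="many_to_one")
print(linked)
```

Keeping the FIPS columns as text preserves leading zeros, so `"01001"` is not turned into `1001` and the two-digit state prefix stays intact.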

Output

The output of the program includes an

aggregated statistical table by states and a

dis-aggregated table by counties.

Dis-aggregating wheat statistics into

all counties

Data columns of StateFIPS, State

Abbreviation, County name, county FIPS

and ratio are taken from the table of

dis-aggregated nitrogen input to form a new

Pandas DataFrame table.

Data on wheat extracted from

QuickStats are used. These data are state-level

statistics. The data are dis-aggregated

into all counties.
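
Reusing the ratio column for another state-level statistic could look like this sketch. The wheat figures are invented for illustration and are not QuickStats values; the column names are likewise hypothetical.

```python
import pandas as pd

# Identifier and ratio columns carried over from the dis-aggregated table.
ratios = pd.DataFrame({
    "StateFIPS": ["01", "01", "04"],
    "State": ["AL", "AL", "AZ"],
    "County": ["Autauga", "Baldwin", "Apache"],
    "FIPS": ["01001", "01003", "04001"],
    "ratio": [0.25, 0.75, 1.0],
})

# Made-up state-level wheat totals keyed by State FIPS code.
state_wheat = {"01": 800.0, "04": 120.0}

wheat = ratios.copy()
wheat["wheat_est"] = wheat["StateFIPS"].map(state_wheat) * wheat["ratio"]
print(wheat[["FIPS", "County", "wheat_est"]])
```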

Dis-aggregating wheat statistics into

all counties

Output of dis-aggregating wheat statistics

Issues encountered

Data type issues were encountered and resolved.

A clear understanding of data types and of methods for

changing and handling them is required.

After the groupby command is applied to a Pandas

dataframe, the original indexing is no longer meaningful.

The use of FIPS codes ensures that data indexing and

linkage in records are maintained throughout

processing cycles. Mapping geo-referenced data into

exact positions in columns is very important.
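
Both issues can be reproduced on a small made-up table. The sketch assumes one common form of the data type problem, namely FIPS codes read as integers and losing their leading zeros; the slides do not say which type issues actually arose. The helper `normalise_fips` is hypothetical.

```python
import pandas as pd


def normalise_fips(codes, width):
    """Turn FIPS codes that were read as numbers back into zero-padded text."""
    return codes.astype(int).astype(str).str.zfill(width)


county = pd.DataFrame({
    "FIPS": [1001, 1003, 4001],  # leading zeros lost when read as integers
    "value": [10.0, 20.0, 5.0],
})
county["FIPS"] = normalise_fips(county["FIPS"], 5)
county["StateFIPS"] = county["FIPS"].str[:2]

# After groupby the row labels 0, 1, 2 are gone: the result is indexed
# by StateFIPS, so positions in the original table no longer apply.
by_state = county.groupby("StateFIPS")[["value"]].sum()

# reset_index turns the key back into an ordinary column if needed.
flat = by_state.reset_index()
print(by_state)
print(flat)
```

Linking by the FIPS key rather than by row position is what keeps records aligned after such a regrouping.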

Update Geo-databases and Create

digital models in Geographical Information

Systems to visualise spatial variation

A standard Geographical Information System has a digital

map associated with a tabular database of records.

Areal interpolation and dasymetric mapping techniques

have gained popularity for combining tabular records

with area boundary files to create

map models.

The approach presented in this talk is based on the use

of neatly nested area boundary files in the

administrative hierarchy of areas of the USA.

No intersection of digital boundaries is needed.

Analytical example: Change over time

Analytical example: Rate of Change
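
The slides do not show how these two analyses were computed. One plausible reading, sketched here under that assumption, is the year-on-year difference of the state totals and that difference relative to the previous year. The function, column names and values are hypothetical.

```python
import pandas as pd


def rate_of_change(table, years):
    """Year-on-year change and relative rate for columns named total_<year>."""
    change = pd.DataFrame(index=table.index)
    rate = pd.DataFrame(index=table.index)
    for prev, curr in zip(years[:-1], years[1:]):
        diff = table[f"total_{curr}"] - table[f"total_{prev}"]
        change[f"change_{curr}"] = diff
        rate[f"rate_{curr}"] = diff / table[f"total_{prev}"]
    return change, rate


# Made-up state totals for two years.
state = pd.DataFrame(
    {"total_1987": [100.0, 50.0], "total_1988": [110.0, 40.0]},
    index=pd.Index(["01", "04"], name="StateFIPS"),
)
change, rate = rate_of_change(state, [1987, 1988])
print(change)
print(rate)
```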

References

https://www.nass.usda.gov/Quick_Stats/

https://www.python.org/downloads/

https://www.scipy.org/scipylib/download.html

http://matplotlib.org/downloads.html

https://pypi.python.org/pypi/pylab

Contact

4 Haythrop Close, Downhead Park, Milton Keynes,

Buckinghamshire, United Kingdom, MK15 9DD

Mobile: +44-7909844462

EMail: [email protected]
