GI2016 PPT, Shi (Big Data Analytics on the Internet)
Posted on 08-Feb-2017
BIG DATA ANALYTICS ON
THE INTERNET
Dr. Shaozhong SHI
drshishaozhong@gmail.com
Drawing data from geographically
dispersed data stores over the Internet
A showcase of international remote access to, and
use of, open data and applications over the Internet
is presented.
It shows how automation in big data analytics can be
achieved on the Internet.
It shows the importance of standardisation and
accessibility of data.
It illustrates, with a live example, how Open Source
tools can be utilised for advancing big data analytics.
Drawing data from geographically
dispersed data stores over the Internet
It shows the design of a new application built with
Open Source tools such as Pandas, NumPy and
Matplotlib.
It explains how full automation in sourcing and
processing data and generating analytical output can
be achieved.
It shows the importance of the standardisation of data
and the role of geographical identifiers in automated
data processing.
Some key solutions for working
across multiple Pandas dataframes (tables)
This PowerPoint show covers some keys which are
important to data linkage, data integration,
working across multiple Pandas dataframes
(tables), and automation in processing.
These are key solutions for automated, exact
processing of records.
The showcase implementation is provided in an
IPython notebook. See the link below: http://dev.mapofagriculture.com:9999/ipython/notebooks/sshaozhong/
2016-05-16_Automatic_Aggregation_Disaggregation_Showcase.ipynb
Original Online Data from USGS
The original data used is a large, well-structured
Excel spreadsheet at the following USGS website:
http://water.usgs.gov/pubs/sir/2006/5012/excel/Nutri
ent_Inputs_1982-2001jan06.xls
It is used as the input to the program. It is geo-
indexed with Federal Information Processing
Standards (FIPS) codes.
The data is read in the newly developed program
and stored as a Pandas dataframe table.
A subset of data was extracted for creation of a
Pandas dataframe table to serve as the input table.
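The read-and-subset step can be sketched as follows. This is a minimal illustration with a small synthetic table standing in for the USGS spreadsheet; the column names are assumptions, not those of the actual file:

```python
import pandas as pd

# Synthetic stand-in for the USGS spreadsheet; in the real program the
# table would come from: pd.read_excel(url, dtype={"FIPS": str})
# (reading FIPS codes as strings preserves their leading zeros).
df = pd.DataFrame({
    "FIPS": ["01001", "01003", "06037"],   # county FIPS codes
    "State": ["AL", "AL", "CA"],
    "farm_N_1987": [100.0, 200.0, 300.0],  # illustrative column names
    "nonfarm_N_1987": [10.0, 20.0, 30.0],
    "notes": ["a", "b", "c"],              # a column not needed downstream
})

# Extract only the columns required for the analysis as the input table.
input_table = df[["FIPS", "State", "farm_N_1987", "nonfarm_N_1987"]].copy()
print(input_table.shape)  # (3, 4)
```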
Original data:
Nitrogen Input from Fertilizer Use (kilograms)
in each year between 1987 and 2001
A subset of a large spreadsheet
Characterisation of the new algorithm for spatial
statistical aggregation and disaggregation
The primary questions that this work set out to answer are
whether automated means can be designed and developed
for data integration and integrated processing of
agricultural census datasets,
and whether automated aggregation of values by state, and
dis-aggregation of values at state level into values at county
level, can be achieved.
To this end, an exploratory design, development and testing
were carried out. An integrated set of algorithms was
researched, designed, implemented and tested on the Map
of Agriculture platform.
The integrated algorithms are collectively called Data
Linkage for Data Integration and Automated Aggregation
and Dis-aggregation.
Characterisation of the new algorithm for spatial
statistical aggregation and disaggregation
The Automated Aggregation and Dis-aggregation is a
prototype program developed to enable rapid
development of data integration and integrated
processing with Open Source Python tools and
libraries.
The Automated Aggregation and Dis-aggregation uses
Python with the Pandas and Numpy libraries.
It offers efficient, exact data integration, data inflow and
outflow through Pandas dataframe tables, and
integrated processing.
Characterisation of the new algorithm for spatial
statistical aggregation and disaggregation
The two sets of algorithmic solutions implemented are
automatic online sourcing of structured data
and Automated Aggregation and Dis-aggregation itself.
The first accesses, reads in and ingests a set of data.
The second is to carry out an integrated processing for
aggregating county level statistics into state level
statistics and dis-aggregating state level statistics into
county level statistics by rule.
Automated aggregation: addition and summing are used.
A loop sums farm and non-farm statistics at
county level for each year from 1987 to 2001.
Aggregated state level statistics are produced by using
the State FIPS codes as the key.
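The aggregation steps above can be sketched as follows, with synthetic data and illustrative column names (only two years are shown in place of 1987 to 2001):

```python
import pandas as pd

counties = pd.DataFrame({
    "StateFIPS": ["01", "01", "06"],
    "CountyFIPS": ["01001", "01003", "06037"],
    "farm_1987": [100.0, 200.0, 300.0],
    "nonfarm_1987": [10.0, 20.0, 30.0],
    "farm_1988": [110.0, 210.0, 310.0],
    "nonfarm_1988": [11.0, 21.0, 31.0],
})

years = [1987, 1988]

# Loop over the years, adding farm and non-farm statistics per county.
for y in years:
    counties[f"total_{y}"] = counties[f"farm_{y}"] + counties[f"nonfarm_{y}"]

# Aggregate the county totals to state level, using StateFIPS as the key.
states = counties.groupby("StateFIPS")[[f"total_{y}" for y in years]].sum()
print(states.loc["01", "total_1987"])  # 330.0
```

The resulting `states` table is indexed by the State FIPS codes, matching the aggregated output described on the following slides.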
Working of the processing
Aggregation: input table.
Working of the processing
Aggregation: output of adding farm and non-farm
statistics, carried out for all years.
Working of the processing
Aggregation: output of applying groupby with StateFIPS.
Characterisation of the new algorithm for spatial
statistical aggregation and disaggregation
The output of automatic aggregation is a Pandas dataframe
table which is indexed with the State FIPS codes.
Dis-aggregation:
The showcase uses a rule assuming that county-level
statistics contribute to state-level statistics in proportion
to the county's area within the state.
The total land areas of the states are collected from the
aggregated output table through a vLookup-style operation
and stored as a Python dictionary, forming a geo-referenced
dataset.
These totals are mapped exactly into the right positions in a
new column of the intermediary table for producing
dis-aggregated statistics.
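The dictionary-and-mapping step can be sketched as below, with illustrative column names and synthetic areas: the state totals become a Python dictionary keyed by State FIPS, then are mapped into a new column of the county table (a vLookup-style operation via `Series.map`):

```python
import pandas as pd

# Aggregated state-level output table (indexed by State FIPS).
state_table = pd.DataFrame(
    {"state_area": [4000.0, 10000.0]},
    index=pd.Index(["01", "06"], name="StateFIPS"),
)

# Store the state land-area totals as a geo-referenced Python dictionary.
state_area = state_table["state_area"].to_dict()

# Intermediary county-level table used for dis-aggregation.
counties = pd.DataFrame({
    "StateFIPS": ["01", "01", "06"],
    "CountyFIPS": ["01001", "01003", "06037"],
    "county_area": [1000.0, 3000.0, 10000.0],
})

# Map each state's total into the right rows of a new column,
# keyed by StateFIPS (the vLookup-style lookup).
counties["state_area"] = counties["StateFIPS"].map(state_area)
print(counties["state_area"].tolist())  # [4000.0, 4000.0, 10000.0]
```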
Output of dis-aggregating statistics on Nitrogen Input
Characterisation of the new algorithm
Then, calculation of the ratio between each county and its state
takes place.
A loop is used to calculate dis-aggregated statistics for all
counties for each year from 1987 to 2001.
This results in a Pandas dataframe table as a dis-
aggregated table.
Characterisation of the new algorithm for spatial
statistical aggregation and disaggregation
New approach of dis-aggregating tabular statistics into
smaller geographical units (no intersection of geometric
objects is required):
Calculation of the ratio between each county and its state
takes place. A loop is used to calculate dis-aggregated
Nitrogen input statistics for all counties for each year
between 1987 and 2001. The total of a state times the
ratio yields a dis-aggregated sum for the county. This
logic of dis-aggregation has been used in areal
interpolation as a technique for spatial disaggregation
(Flowerdew and Green, 1992, 1994; Goodchild, Anselin
and Deichmann, 1993). This results in a Pandas
dataframe table as a dis-aggregated table.
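The paragraph above can be expressed directly in code. This is a sketch under the area-proportional rule, with synthetic areas and a single illustrative statistic in place of the 1987–2001 loop:

```python
import pandas as pd

counties = pd.DataFrame({
    "StateFIPS": ["01", "01", "06"],
    "CountyFIPS": ["01001", "01003", "06037"],
    "county_area": [1000.0, 3000.0, 10000.0],
    "state_area": [4000.0, 4000.0, 10000.0],  # looked up by StateFIPS
})

# State-level statistic to be dis-aggregated (synthetic values).
state_total = {"01": 330.0, "06": 300.0}

# Ratio between each county and its state.
counties["ratio"] = counties["county_area"] / counties["state_area"]

# The total of a state times the ratio yields the county estimate.
counties["estimate"] = counties["StateFIPS"].map(state_total) * counties["ratio"]
print(counties["estimate"].tolist())  # [82.5, 247.5, 300.0]
```

By construction, the county estimates within each state sum back to the state total, which is the property that areal interpolation relies on.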
Characterisation of the new algorithm for spatial
statistical aggregation and disaggregation
Hitherto, areal interpolation and dasymetric mapping
(Flowerdew and Green, 1992, 1994; Goodchild, Anselin
and Deichmann, 1993) have been the only known
approaches and methods for spatially dis-aggregating
statistics relevant to the current work, particularly
regarding the processing of tabular statistics in vector
GIS datasets. The current work uses the logic of areal
interpolation, as far as the datasets involved currently
allow. The difference between the current implementation
and areal interpolation is that the current implementation
does not involve intersection of area features/polygons.
Characterisation of the new algorithm for
spatial statistical aggregation and
disaggregation
There is a degree of uncertainty related to the
estimates. Improvement in estimation requires further
research in the future. Nevertheless, it is a step
forward in enabling estimation given the situation
where no data are collected at county level. It offers a
means to provide a quantitative indication. It is
particularly useful to the processing of tabular
statistics or when patterns need to be visualised at
large scales.
Characterisation of the new algorithm for
spatial statistical aggregation and
disaggregation
The algorithmic solutions are characterised by their
capabilities to track the geo-referenced data entries
throughout cycles of processing, and exact geo-
referenced data retrieval and mapping, namely data
inflow and outflow from Pandas dataframe tables.
The dis-aggregation algorithm/procedure can be used
for direct processing of tabular statistics without
involving intersection of polygons, particularly in
situations when neatly nested geospatial boundaries
files of US states and counties are used.
Characterisation of the new algorithm for
spatial statistical aggregation and
disaggregation
The new algorithm can carry out automatic online
sourcing of datasets and integrated processing with
Open Source Python libraries. The new algorithm
can be further extended for linking geodata from
various sources, and for creation of indexed tabular
datasets with geographical identifiers.
It can carry out automatic aggregation and dis-
aggregation of agricultural census datasets for all
states and counties in the USA.
Characterisation of the new algorithm for
spatial statistical aggregation and
disaggregation
The Federal Information Processing Standards (FIPS)
codes were used as geographical identifiers for
geo-referenced data entries. They play a critical role in
retrieving data from databases and mapping data into the
right positions, and an efficient role in enabling
vLookup-style solutions for retrieving data and mapping it
to exact positions in tables as desired.
Geographical identifiers serve as the key and are critically
important in linking data between tables and creating geo-
indexed tabular datasets. Geographical identifiers track
attribute data entries in reference to geospatial objects.
This vLookup solution can be modified and used for other
geodata projects.
Output
The output of the program includes an
aggregated statistical table by states and a
dis-aggregated table by counties.
Dis-aggregating wheat statistics into
all counties
Data columns of StateFIPS, State
Abbreviation, County name, county FIPS
and ratio are taken from the table of dis-
aggregated nitrogen input to form a new
Pandas DataFrame table.
Data on wheat extracted from QuickStats
are used. These data are state-level
statistics. The data are dis-aggregated
into all counties.
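Reusing the ratio columns for a second variable can be sketched as follows (synthetic values; the column names and the QuickStats figures are assumptions):

```python
import pandas as pd

# Columns carried over from the dis-aggregated nitrogen-input table.
nitrogen = pd.DataFrame({
    "StateFIPS": ["01", "01"],
    "StateAbbr": ["AL", "AL"],
    "CountyName": ["Autauga", "Baldwin"],
    "CountyFIPS": ["01001", "01003"],
    "ratio": [0.25, 0.75],
    "estimate": [82.5, 247.5],  # nitrogen column, not carried over
})
wheat = nitrogen[["StateFIPS", "StateAbbr", "CountyName",
                  "CountyFIPS", "ratio"]].copy()

# State-level wheat statistics as drawn from QuickStats (values synthetic).
wheat_state = {"01": 1000.0}

# Dis-aggregate the state figure into all counties by the same ratios.
wheat["wheat_estimate"] = wheat["StateFIPS"].map(wheat_state) * wheat["ratio"]
print(wheat["wheat_estimate"].tolist())  # [250.0, 750.0]
```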
Dis-aggregating wheat statistics into
all counties
Output of dis-aggregating wheat statistics
Issues encountered
Data type issues were encountered and resolved.
A clear understanding of data types, and of methods for
changing and handling them, is required.
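One data-type pitfall of this kind (an illustrative assumption; the slides do not name the specific issues encountered): FIPS codes read as integers lose their leading zeros, so they must be converted to zero-padded strings before being used as keys:

```python
import pandas as pd

# FIPS codes read as integers drop leading zeros (1001 instead of "01001").
raw = pd.Series([1001, 1003, 6037])

# Convert to zero-padded five-character strings before using them as keys.
fips = raw.astype(str).str.zfill(5)
print(fips.tolist())  # ['01001', '01003', '06037']
```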
After application of the groupby command to a Pandas
dataframe, the original indexing is no longer meaningful.
The use of FIPS codes ensures that data indexing and
record linkage are maintained throughout
processing cycles. Mapping geo-referenced data into
exact positions in columns is very important.
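The indexing behaviour can be seen directly: after `groupby`, the group key replaces the original row index, so positional indices from the input table no longer apply. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({
    "StateFIPS": ["01", "01", "06"],
    "value": [1.0, 2.0, 3.0],
})

# After groupby, StateFIPS becomes the index; the original 0..2 index is gone.
grouped = df.groupby("StateFIPS")["value"].sum()
print(grouped.index.tolist())  # ['01', '06']

# as_index=False keeps StateFIPS as an ordinary column instead,
# so the FIPS key remains available for further linkage.
flat = df.groupby("StateFIPS", as_index=False)["value"].sum()
print(flat.columns.tolist())  # ['StateFIPS', 'value']
```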
Update Geo-databases and Create
digital models in Geographical Information
Systems to visualise spatial variation
A standard Geographical Information System has a digital
map associated with a tabular database of records.
Areal interpolation and dasymetric mapping techniques
have gained popularity by taking tabular records and
combining them with area boundary files to create
map models.
The approach presented in this talk is based on the use
of neatly nested area boundary files in the
administrative hierarchy of areas of the USA.
No intersection of digital boundaries is needed.
Analytical example: Change over time
Analytical example: Rate of Change
References
https://www.nass.usda.gov/Quick_Stats/
https://www.python.org/downloads/
https://www.scipy.org/scipylib/download.html
http://matplotlib.org/downloads.html
https://pypi.python.org/pypi/pylab
Contact
4 Haythrop Close, Downhead Park, Milton Keynes,
Buckinghamshire, United Kingdom, MK15 9DD
Mobile: +44-7909844462
Email: drshishaozhong@gmail.com