a framework for comparison of spatiotemporal and time ... · particularly those with...

1
A Framework for Comparison of Spatiotemporal and Time Series Datasets NREL is a national laboratory of the U.S. Department of Energy, Office of Energy Efficiency & Renewable Energy, operated by the Alliance for Sustainable Energy, LLC. David Biagioni, Brian Bush, Ryan Elmore, Dan Getman, Danny Inman This presentation does not contain any proprietary, confidential, or otherwise restricted information Evaluation of Data Formats for Statistical Analysis of Large Matrices on HPC Objective To determine the best storage and analysis data format for statistical analysis of large matrices on high performance computing resources Approach Part of our preliminary experimentation involved running benchmarks to see how standard analyses would scale on Peregrine. The following results showing the average time to complete principal component analyses on data sets of varying size (7.6 MB, 76MB, 763 MB, and 7.45 GB), differing numbers of cores (96, 144, 192, 240, and 288) and cores per node (16 vs 24). The left panel shows the timings for all combinations and the right panel shows the speedups gained in using 144, 192, 240, and 288 cores relative to using 96 cores. Comparing Datasets using Non-Parametric Methods Objective Develop a non-parametric approach for making inter- and intra-dataset comparisons of large resource datasets. Approach Time series were assessed for normality. Box-Cox was used to transform the data to quasi-normal. Temporal dependency was removed from the Box-Cox transformed data using autoregression. Residuals were compared using Canonical Redundancy Analysis (RDA). Non-parametric significance tests were performed using bootstrapping. Illustrative Example National Solar Resource Database (NSRDB). Are stations within a region the same over a given time period? Results Are stations within a region the same over a given time period? Results of the RDA analysis suggest that three of the four stations have strong similarities in their underlying data structure. This approach may be applied across a range of data sets, including comparing multiple parameters across disparate datasets. Figure 1. Direct solar irradiance for four NSRDB stations in the Western US. Data presented in this figure have been minimally processed to remove NA values. The time period presented is 300 hours. Figure 2. Quantile-quantile (Q-Q) plot for the Twenty Nine Palms, CA station. Comparing the sample quantiles (open circles) to the theoretical normal quantiles (red line) the data do not follow a normal distribution. Figure 3. Residual plots of four NSRDB stations in the Western US. An autoregressive model was used to remove the effects of autocorrelation on the data. From this plot, it is salient that a strong underlying structure exists in the data. Figure 4. Biplot of the redundancy analysis (RDA). The green triangles are represent the fitted site scores of the dissimilarity measures and the blue lines represent the station data. The two axes are the first (x) and second (y) principle coordinates. The cosine of the angle between any two stations (blue lines) represents the correlation. Exploration of Simultaneous Variability using Principal Component Analysis Objective To develop a method of deriving where, when, and at what aggregation level we can detect and express correlation between the variability of two parameters within a large spatiotemporal dataset. Approach Using NSRDB Stations data, calculate variability for wind speed and solar irradiance at daily, weekly, monthly, and annual aggregation levels Calculate a difference in the simultaneous variability at each aggregation level Perform PCA on the difference value at each aggregation level, including monthly PCA calculated using the daily values Calculate the variability of the difference values across each aggregation level Classify results of the PCA based on the variability and maximum values of the difference calculations Results Figure 1. Mapping the difference between the variability of DIR and wind speed for several individual days and at weekly, monthly, and annual aggregation levels. Figure 2. Mapping the contribution of each station to the first 10 Eigenvectors Figure 4. Contribution of all stations to the first 10 eigenvectors grouped by high and low values representing the variability, range, and maximum difference between wind speed and DIR. Figure 5. Quintiles for variability, range, and maximum difference between wind speed and DIR for stations grouped by low and high contribution to the first 10 Eigenvectors Figure 3.Variability of the difference between the variability of wind speed and DIR for two stations. The station represented in red is a high contributor to the first 10 eigenvectors and the station represented in blue is a low contributor. Values represent the month of July across all years at each aggregation level Stations with low contributions to the first 10 eigenvectors demonstrate lower variability, range, and maximum values of the difference between variability of wind speed and solar irradiance (DIR). Stations with higher contributions demonstrate higher values in these parameters. Goal: To develop statistical methods for the cross-comparison and relative-quality evaluation of datasets focused on the interpretation of analytical results and the validation of modeled data through comparison to known or source datasets. Next Steps: This research is being undertaken in a series of phases that will ensure its availability and applicability to researchers at NREL. Currently, we are completing the initial assessment of potential statistical and computational methods and moving into the second phase of research in which those methods are applied. This will involve several in-depth analyses focused on I. Comparing multiple spatiotemporal datasets with overlapping coverage in space and time II. Comparing time series between different spatial locations within the same spatiotemporal dataset III. Comparing spatial data between multiple time slices within the same spatiotemporal dataset IV.Comparing multiple time windows within one or more temporal datasets V. Comparing irregularly spaced point datasets to gridded date representing similar parameters. Impact: Results from this project have the potential to routinely add value to a wide range of projects across most, if not all, NREL centers. Developing a methodology to routinely apply techniques for the cross-comparison and relative-quality evaluation of large spatiotemporal datasets constitutes a significant and novel contribution to energy data science. Increased visibility in the data science community will foster the development of high-value collaborations with academic institutions, enlarge the energy-data territory that NREL The addition of such a capability to NREL can be harnessed in marketing sophisticated, complex analysis projects to multiple sponsors, putting NREL another step ahead of competitors. Goals, Plans, Impacts Computational Sciences Center - Strategic Energy Analysis Center Motivation Non-parametric statistical methods combine power with robustness. Accessible and safe for non-statisticians Apply broadly to many types of EE/RE data Don’t require strong assumptions about data Avoid false positives Perform nearly as well as parametric methods Well suited for rapid calculation in distributed computing environments Approach Our research focuses on multidimensional applications of non-parametric methods, particularly those with spatio-temporal extensions. Datasets comparison often fits into a common data-processing pattern. Such data-processing patterns can be implemented in HPC environments as a map-reduce operation. Example (see diagram to the right) Study Question: How do NSRDB station’s irradiance “measurements” (dataset #1) compare with those measurements from the same hour of the same day five years previously (dataset #2)? Results of Sign Test: There is a statistically significant bias detectable between these datasets in many geographical regions. Diagnosis: There is a several W/m2 bias of the dataset #1relative to dataset #2. Group data into bins by LST hour Pair observations Perform sign test on pairs Diagnose problems Visualize Hour 7 9 11 13 15 17 19 0 100 200 300 400 500 600 0 100 200 300 400 500 600 USAF 0% 20% 40% 60% 80% 100% % of Total Number of Records 726573 "Unbiased" 726579 "Biased" Counts 0 50 100 150 200 250 300 Average of Diffuse Irradiance [W/m^2] -60 -40 -20 0 20 40 60 80 100 Dierence vs Average Comparison of Diuse Irradiance < = > USAF 726573 "Unbiased" 726579 "Biased" USAF 726573 "Unbiased" 726579 "Biased" 1 2 3 4 5 6 7 8 Average Bias USAF 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% % of Total Number of Records 726579 "Biased" Comparison of Diuse Irradiance with 5 Years Previous < = > Bias mostly absent (dataset at 10:00 LST compared with five years previously) significant difference insignificant difference examination of USAF 726573 and 726579 Map-Reduce Approach to Comparison of Spatiotemporal Data Raw Data Source N User-Defined Features of Interest N Statistical Summary for Testing N Statistical Test Anomalous Features Flagged N Visualization Statistical Inversion Anomalous Raw Data Flagged 2 Assessment by Subject-Matter Expert Raw Data Source 1 Raw Data Source 2 User-Defined Features of Interest 2 Statistical Summary for Testing 2 User-Defined Features of Interest 1 Statistical Summary for Testing 1 Anomalous Features Flagged 2 Anomalous Features Flagged 1 Anomalous Raw Data Flagged 1 Used-Dened Applications Anomalous Raw Data Flagged N . . . Investigation by Subject- Matter Expert Analysts are typically faced with deciding which of N datasets is most appropriate for their application. Analysts rarely use datasets in their raw form, but typically aggregate, transform, or summarize them into a set of features for their application. One finds that rigorous statistical comparison of raw datasets usually indicates that they are statistically different, which is not particularly informative to analysts. Only by applying statistical tests to the user-defined features of interest, however, can one determine if the datasets differ for the analyst’s application. Statistical tests for comparing datasets typically rely on transforming, binning, or summarizing the data before applying the test. The result of the test is the identification of the anomalous features, which can then be visualized and used in assessing the consequences of using each dataset. Statistical inversion techniques can allow one to trace back the anomalies identified in the features of interest back to the characteristics of the raw datasets. Subject-matter experts can then focus on determining the fundamental cause of the anomalies and assess the severity of their impact on applications. . . . . . . . . . . . . NREL/PR-6A20-62648

Upload: others

Post on 05-Jul-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: A Framework for Comparison of Spatiotemporal and Time ... · particularly those with spatio-temporal extensions. Datasets comparison often fits into a common data-processing pattern

A Framework for Comparison of Spatiotemporal and Time Series Datasets

NREL is a national laboratory of the U.S. Department of Energy, Office of Energy Efficiency & Renewable Energy, operated by the Alliance for Sustainable Energy, LLC.

David Biagioni, Brian Bush, Ryan Elmore, Dan Getman, Danny Inman

This presentation does not contain any proprietary, confidential, or otherwise restricted information

Evaluation of Data Formats for Statistical Analysis of Large Matrices on HPCObjectiveTo determine the best storage and analysis data format for statistical analysis of large matrices on high performance computing resources

ApproachPart of our preliminary experimentation involved running benchmarks to see how standard analyses would scale on Peregrine. The following results showing the average time to complete principal component analyses on data sets of varying size (7.6 MB, 76MB, 763 MB, and 7.45 GB), differing numbers of cores (96, 144, 192, 240, and 288) and cores per node (16 vs 24). The left panel shows the timings for all combinations and the right panel shows the speedups gained in using 144, 192, 240, and 288 cores relative to using 96 cores.

Comparing Datasets using Non-Parametric MethodsObjective• Develop a non-parametric approach for

making inter- and intra-datasetcomparisons of large resource datasets.

Approach• Time series were assessed for normality.• Box-Cox was used to transform the data

to quasi-normal.• Temporal dependency was removed from

the Box-Cox transformed data usingautoregression.

• Residuals were compared using CanonicalRedundancy Analysis (RDA).

• Non-parametric significance tests wereperformed using bootstrapping.

Illustrative Example • National Solar Resource Database

(NSRDB).• Are stations within a region the same over

a given time period?Results • Are stations within a region the same over

a given time period?• Results of the RDA analysis suggest that

three of the four stations have strongsimilarities in their underlying datastructure.

• This approach may be applied across arange of data sets, including comparingmultiple parameters across disparatedatasets.

Figure 1. Direct solar irradiance for four NSRDB stations in the Western US. Data presented in this figure have been minimally processed to remove NA

values. The time period presented is 300 hours.

Figure 2. Quantile-quantile (Q-Q) plot for the Twenty Nine Palms, CA station. Comparing the sample quantiles (open circles) to the theoretical normal quantiles (red line) the data do not follow a normal distribution.

Figure 3. Residual plots of four NSRDB stations in the Western US. An autoregressive model was used to remove the effects of autocorrelation on the data. From this plot, it is salient that a strong underlying structure exists

in the data.

Figure 4. Biplot of the redundancy analysis (RDA). The green triangles are represent the fitted site scores of the dissimilarity measures and the blue lines

represent the station data. The two axes are the first (x) and second (y) principle coordinates. The cosine of the angle between any two stations (blue

lines) represents the correlation.

Exploration of Simultaneous Variability using Principal Component AnalysisObjectiveTo develop a method of deriving where, when, and at what aggregation level we can detect and express correlation between the variability of two parameters within a large spatiotemporal dataset. Approach• Using NSRDB Stations data, calculate variability for wind speed and solar irradiance at daily, weekly, monthly, and

annual aggregation levels• Calculate a difference in the simultaneous variability at each aggregation level• Perform PCA on the difference value at each aggregation level, including monthly PCA calculated using the daily

values• Calculate the variability of the difference values across each aggregation level• Classify results of the PCA based on the variability and maximum values of the difference calculations

Results

Figure 1. Mapping the difference between the variability of DIR and wind speed for several individual days and at weekly, monthly, and annual aggregation levels.

Figure 2. Mapping the contribution of each station to the first 10 EigenvectorsFigure 4. Contribution of all stations to the first 10 eigenvectors grouped by high and low values representing the variability, range, and maximum difference between wind speed and DIR.

Figure 5. Quintiles for variability, range, and maximum difference between wind speed and DIR for stations grouped by low and high contribution to the first 10 Eigenvectors

Figure 3. Variability of the difference between the variability of wind speed and DIR for two stations.

The station represented in red is a high contributor to the first 10 eigenvectors and the station represented in blue is a low contributor.

Values represent the month of July across all years at each aggregation level

Stations with low contributions to the first 10 eigenvectors demonstrate lower variability, range, and maximum values of the difference between variability of wind speed and solar irradiance (DIR). Stations with higher contributions demonstrate higher values in these parameters.

Goal: To develop statistical methods for the cross-comparison and relative-quality evaluation of datasets focused on the interpretation of analytical results and the validation of modeled data through comparison to known or source datasets.

Next Steps: This research is being undertaken in a series of phases that will ensure its availability and applicability to researchers at NREL. Currently, we are completing the initial assessment of potential statistical and computational methods and moving into the second phase of research in which those methods are applied.

This will involve several in-depth analyses focused on I. Comparing multiple spatiotemporal datasets with overlapping

coverage in space and timeII. Comparing time series between different spatial locations within the

same spatiotemporal datasetIII. Comparing spatial data between multiple time slices within the same

spatiotemporal datasetIV. Comparing multiple time windows within one or more temporal

datasetsV. Comparing irregularly spaced point datasets to gridded date

representing similar parameters.

Impact: Results from this project have the potential to routinely add value to a wide range of projects across most, if not all, NREL centers.

• Developing a methodology to routinely apply techniques for thecross-comparison and relative-quality evaluation of largespatiotemporal datasets constitutes a significant and novelcontribution to energy data science.

• Increased visibility in the data science community will foster thedevelopment of high-value collaborations with academic institutions,enlarge the energy-data territory that NREL

• The addition of such a capability to NREL can be harnessed inmarketing sophisticated, complex analysis projects to multiplesponsors, putting NREL another step ahead of competitors.

Goals, Plans, Impacts

Computational Sciences Center - Strategic Energy Analysis Center

MotivationNon-parametric statistical methods combine power with robustness.

Accessible and safe for non-statisticiansApply broadly to many types of EE/RE dataDon’t require strong assumptions about dataAvoid false positivesPerform nearly as well as parametric methodsWell suited for rapid calculation in distributed computing environments

ApproachOur research focuses on multidimensional applications of non-parametric methods, particularly those with spatio-temporal extensions.Datasets comparison often fits into a common data-processing pattern.Such data-processing patterns can be implemented in HPC environments as a map-reduce operation.

Example (see diagram to the right)Study Question: How do NSRDB station’s irradiance “measurements” (dataset #1) compare with those measurements from the same hour of the same day five years previously (dataset #2)?Results of Sign Test: There is a statistically significant bias detectable between these datasets in many geographical regions.Diagnosis: There is a several W/m2 bias of the dataset #1relative to dataset #2.

Group data into bins by LST hour

Pair observations

Perform sign test on pairs

Diagnose problems

VisualizeHour

7 9 11 13 15 17 19

0

100

200

300

400

500

600

0

100

200

300

400

500

600

USAF

0% 20% 40% 60% 80% 100%% of Total Number of Records

726573 "Unbiased"726579 "Biased"

Counts

0 50 100 150 200 250 300Average of Diffuse Irradiance [W/m^2]

-60

-40-20

0

2040

60

80100

Di!erence vs Average

Comparison of Di!useIrradiance

< = >

USAF726573 "Unbiased"726579 "Biased"

USAF

726573"Unbiased"

726579"Biased"

12345678

Average Bias

USAF

0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%% of Total Number of Records

726579"Biased"

Comparison of Di!use Irradiance with 5 Years Previous< = >

Bias mostly absent

(dataset at 10:00 LST compared with five years previously)

significant difference insignificant difference

examination of U

SAF 726573 and 726579

Map-Reduce Approach to

Comparison of Spatiotemporal

Data

Raw Data Source N

User-Defined Features of Interest N

Statistical Summary for

Testing N

Statistical Test

Anomalous Features Flagged N

Visualization

Statistical Inversion

Anomalous Raw Data Flagged 2

Assessment by Subject-Matter Expert

Raw Data Source 1

Raw Data Source 2

User-Defined Features of Interest 2

Statistical Summary for

Testing 2

User-Defined Features of Interest 1

Statistical Summary for

Testing 1

Anomalous Features Flagged 2

Anomalous Features Flagged 1

Anomalous Raw Data Flagged 1

Used-De!ned Applications

Anomalous Raw Data Flagged N

. . .

Investigation by Subject-Matter Expert

Analysts are typically faced with deciding which of N datasets is most appropriate for their application.Analysts rarely use datasets in their raw form, but typically aggregate, transform, or summarize them into a set of features for their application.One finds that rigorous statistical comparison of raw datasets usually indicates that they are statistically different, which is not particularly informative to analysts. Only by applying statistical tests to the user-defined features of interest, however, can one determine if the datasets differ for the analyst’s application.Statistical tests for comparing datasets typically rely on transforming, binning, or summarizing the data before applying the test. The result of the test is the identification of the anomalous features, which can then be visualized and used in assessing the consequences of using each dataset.Statistical inversion techniques can allow one to trace back the anomalies identified in the features of interest back to the characteristics of the raw datasets. Subject-matter experts can then focus on determining the fundamental cause of the anomalies and assess the severity of their impact on applications.

. . .

. . .

. . .

. . .

NREL/PR-6A20-62648