Aggregate Data and Statistics Statistics Canada Data Liberation Initiative Wendy Watkins Carleton University Chuck Humphrey University of Alberta CAPDU/DLI Training May 29th, 2002

Upload: arianna-roach

Post on 27-Mar-2015


TRANSCRIPT

Page 1

Aggregate Data and Statistics

Statistics Canada Data Liberation Initiative

Wendy Watkins Carleton University

Chuck Humphrey University of Alberta

CAPDU/DLI Training May 29th, 2002

Page 2

Outline

What are aggregate data? Why aggregate? How to aggregate? Computing exercise

Page 3

Let’s start with the relationship between statistics and data.

What are aggregate data?

Page 4

Statistics and Data

Data
• numeric files created and organized for analysis
• require processing
• not ready for display

Statistics
• numeric facts/figures
• created from data, i.e., already processed
• presentation-ready

Page 5

Statistics and Data

Page 6
Page 7

Statistics and Data

Page 8

In short, statistics are created from data and represent summaries of the detail observed in the data.

Statistics and Data

Page 9

Building on this previous example, let’s explore aggregation.

We see a table with the number of smokers summarized over categories for age, education, sex, geography, and different time points.

What is aggregation?

Page 10

Categories of Region

Categories of Sex

Categories of Periods

A Statistic

Page 11

Aggregation involves tabulating a summary statistic across all of the categories or levels of a set of variables.

What is aggregation?

Page 12

The summary statistic in this example is the total number of smokers.

The summary statistic

Page 13

The variables and their categories are:

Region (11): Canada and the ten provinces

Age (5): Total, 15-19, 20-44, 45-64, 65+

Sex (3): Total, Female, Male

Education (4): Total, Some secondary or less, Secondary graduate or more, Not stated

Periods (5): 1985, 1989, 1991, 1994-95, 1996-97

Variables and categories

Page 14

The tabulation consists of determining the combinations of all categories across variables and then counting the number of smokers within each of these combinations.

11 x 5 x 3 x 4 x 5 = 3,300 category combinations

Variables and categories
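The count of category combinations can be checked with a short sketch. This is only an illustration in Python, not part of the original training material; the category labels are copied from the smoking example, and the province names are filled in on the assumption that "the ten provinces" means the ten Canadian provinces.

```python
from itertools import product
from math import prod

# Categories from the smoking example; province names are filled in
# as an assumption for "Canada and the ten provinces".
categories = {
    "region": [
        "Canada", "Newfoundland and Labrador", "Prince Edward Island",
        "Nova Scotia", "New Brunswick", "Quebec", "Ontario", "Manitoba",
        "Saskatchewan", "Alberta", "British Columbia",
    ],
    "age": ["Total", "15-19", "20-44", "45-64", "65+"],
    "sex": ["Total", "Female", "Male"],
    "education": [
        "Total", "Some secondary or less",
        "Secondary graduate or more", "Not stated",
    ],
    "period": ["1985", "1989", "1991", "1994-95", "1996-97"],
}

# Multiply the number of levels of each variable: 11 x 5 x 3 x 4 x 5.
n_cells = prod(len(levels) for levels in categories.values())

# Enumerate every combination; each tuple is one cell of the tabulation.
cells = list(product(*categories.values()))

print(n_cells)     # 3300
print(len(cells))  # 3300
```

Each tuple in `cells` is one combination within which the number of smokers would be counted.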

Page 15

Tabulating or aggregating

One might wonder whether there is a difference between tabulating and aggregating.

Usually, they are the same thing.

Page 16

Tabulating = aggregating

In creating tables from data, the variables are arranged in various combinations along the columns and the rows.

Page 17

Tabulating = aggregating

Placing multiple variables along the columns or rows is called nesting.

Tables may have variables nested on both the columns and rows.

Page 18

Categories of Sex nested within Periods

Page 19

Categories of Education nested within Sex

Categories of Sex nested within Region

Page 20

Up to this point, we have noted that:

• statistics are created from data

• aggregations consist of tabulating statistics within the categories of select variables

• variables may be nested within columns and rows to display these tabulations

A quick summary

Page 21

What is the difference between a tabulation (or aggregation) and aggregate data?

The difference lies in the display of the aggregation, that is, the structure of the tabulated output.

What are aggregate data?

Page 22

A statistical data structure is a fixed, two-dimensional matrix with the variables in the columns and cases in the rows.

What are aggregate data?

         V1  V2  V3  V4  V5  V6  V7
Case 1
Case 2
Case 3
Case 4
Case 5
Case 6
Case 7

Page 23

Aggregate data require the same type of statistical data structure.

Consequently, aggregate data are a special type of tabulation where variables are nested along the rows but not along the columns.

What are aggregate data?

Page 24

(11) (5) (3) (4)

Page 25

To create an aggregate data structure for the example tabulation, the combination of categories representing geography (region), three social variables (age, sex, and education), and time (period) must all be nested along the rows, as shown in the previous slide.

Aggregate Data Structure
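A minimal sketch of such a structure, written in Python for illustration only: the respondent-level records below are invented, and the layout (one row per combination of region, age, sex, education, and period, followed by the count of smokers) is an assumption about how the nested rows could be stored.

```python
from collections import Counter

# Hypothetical respondent-level records, invented for illustration:
# (region, age, sex, education, period, is_smoker)
records = [
    ("Alberta", "20-44", "Female", "Secondary graduate or more", "1991", True),
    ("Alberta", "20-44", "Female", "Secondary graduate or more", "1991", True),
    ("Alberta", "20-44", "Male", "Some secondary or less", "1991", False),
    ("Ontario", "45-64", "Male", "Some secondary or less", "1985", True),
    ("Ontario", "45-64", "Male", "Some secondary or less", "1985", False),
]

# Count smokers within each combination of the five grouping variables.
# Combinations with no smokers do not appear in this simple sketch.
smokers = Counter(rec[:5] for rec in records if rec[5])

# One row per combination: the grouping categories are nested along
# the rows, and the statistic occupies a single column at the end.
aggregate_rows = [cell + (count,) for cell, count in sorted(smokers.items())]

for row in aggregate_rows:
    print(row)
```

Every variable sits in its own column and every category combination is a case, so the result fits the same case-by-variable matrix used for any statistical data file.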

Page 26

This time the table consists of the average length of stay in hospital by sex, age, diagnostic chapter, region, and time period.

Another example

Page 27
Page 28

diagnostic chapter : 19 levels

sex : 3 levels

age : 6 levels

region : 13 levels

period : 28 levels

Variables and categories

Page 29

The number of category combinations is equal to:

13 x 28 x 3 x 6 x 19 = 124,488 category combinations

Aggregate data structure

Page 30
Page 31

The aggregate structure is represented by the 124,488 cells created by the combination of all categories from these five variables.

The statistic is the average length of stay in the hospital in days.

Aggregate average length of hospital stays in days
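Here the statistic is a mean rather than a count. A rough sketch in Python, assuming invented discharge records and hypothetical labels for the age groups and diagnostic chapters (the slides give only the number of levels, not their names):

```python
from collections import defaultdict
from statistics import mean

# Hypothetical discharge records, invented for illustration:
# (sex, age, diagnostic chapter, region, period, days in hospital)
stays = [
    ("Female", "65+", "Circulatory", "Alberta", "1995", 6),
    ("Female", "65+", "Circulatory", "Alberta", "1995", 10),
    ("Male", "20-44", "Injury", "Ontario", "1995", 3),
]

# Collect the lengths of stay that fall into each category combination.
days_by_cell = defaultdict(list)
for *cell, days in stays:
    days_by_cell[tuple(cell)].append(days)

# The aggregate statistic for each cell is the average length of stay.
avg_stay = {cell: mean(days) for cell, days in days_by_cell.items()}

for cell, value in avg_stay.items():
    print(cell, value)
```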

Page 32

Definition: Statistical summaries over categorical variables representing social phenomena, geography, and time that are organized in a specific data structure.

What are aggregate data?

Page 33

When the data structure of the summaries is organized around time, these aggregate statistics are called a time series.

Time series aggregate data

Page 34

Time series aggregate data structure

Page 35

Annual Time Series

Page 36

When the data structure of the summaries is organized around geography, we recognize these aggregations as geo-spatial or geo-referenced statistics.

Geo-spatial aggregate data

Page 37

Geo-spatial aggregate data structure

Page 38

Census Sub-divisions

Census Divisions

Province

Page 39

Why aggregate?

Statistics Canada creates aggregate statistics from its major surveys, including the Census, as a way of publishing selected findings.

The release of aggregate statistics is a partial safeguard against the possible disclosure of respondents.

Page 40

Why aggregate?

Furthermore, the geographic distribution of statistics in Canada is important. As a result, aggregate statistics are released by Statistics Canada for different levels of geography – from the nation to small areas.

Page 41

Why aggregate?

Organizing statistics into time series is another way in which Statistics Canada publishes a large amount of statistical information. These time series reflect summaries of data that are repeatedly collected over time and permit studies of trends and change.

Page 42

Why aggregate?

To publish findings

To safeguard against disclosure

To provide geographic distributions of statistics

To present statistics over time

Page 43

Why aggregate?

Other reasons to aggregate

To modify geo-referenced statistics for GIS applications

– for example, finding postal codes within their corresponding enumeration area (EA) and then aggregating data from the postal code level up to the EA level
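The postal code example might look roughly like this in Python. The lookup table, the EA identifiers, and the counts are all invented for illustration; a real conversion would rely on a postal code conversion file rather than a hand-built dictionary.

```python
from collections import defaultdict

# Hypothetical lookup from postal code to EA (identifiers are invented).
postal_to_ea = {
    "T6G 2E1": "EA-001",
    "T6G 2H4": "EA-001",
    "K1S 5B6": "EA-002",
}

# Hypothetical statistic recorded at the postal code level.
counts_by_postal = {"T6G 2E1": 12, "T6G 2H4": 8, "K1S 5B6": 5}

# Sum the postal code values within each EA.
counts_by_ea = defaultdict(int)
for postal_code, count in counts_by_postal.items():
    counts_by_ea[postal_to_ea[postal_code]] += count

print(dict(counts_by_ea))  # {'EA-001': 20, 'EA-002': 5}
```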

Page 44

Why aggregate?

Other reasons to aggregate

To change the unit of analysis

– for the purposes of a specific research question

– to create a common, higher-level unit of analysis that can be used in merging files

Page 45

How does one aggregate?

Identify the grouping structure that represents all of the variables and their categories over which the aggregation is to be conducted.

– This group structure defines a new unit of analysis.

Page 46

How does one aggregate?

Establish the sort order for the grouping variables, i.e., decide which variable increments the fastest, the next fastest, until you reach the variable that changes the slowest.

Select the summary statistics, such as sums, averages, minimums, maximums, etc.
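The three steps (grouping structure, sort order, summary statistics) can be sketched as follows. This uses plain Python rather than the statistical packages used in the training, and the case-level records and variable names are invented for illustration.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical case-level records (invented values).
cases = [
    {"region": "Ontario", "sex": "Male", "days": 4},
    {"region": "Alberta", "sex": "Female", "days": 7},
    {"region": "Ontario", "sex": "Male", "days": 2},
    {"region": "Alberta", "sex": "Female", "days": 3},
    {"region": "Alberta", "sex": "Male", "days": 5},
]

# Step 1: the grouping structure, which defines the new unit of analysis.
# Variables are listed from slowest-changing to fastest-changing.
group_vars = ("region", "sex")
cell_key = itemgetter(*group_vars)

# Step 2: the sort order. Region changes slowest, sex changes fastest.
ordered = sorted(cases, key=cell_key)

# Step 3: the summary statistics computed within each group.
aggregated = []
for cell, members in groupby(ordered, key=cell_key):
    days = [member["days"] for member in members]
    row = dict(zip(group_vars, cell))
    row.update(
        n=len(days),
        total=sum(days),
        mean=sum(days) / len(days),
        minimum=min(days),
        maximum=max(days),
    )
    aggregated.append(row)

for row in aggregated:
    print(row)
```

Each dictionary in `aggregated` is one case in the new, higher-level file: its unit of analysis is a region-by-sex group rather than an individual record.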

Page 47

How does one aggregate?

The actual aggregation is performed using statistical software such as SAS or SPSS.

SAS offers several procedures that can be used to aggregate data, including Proc Summary, Proc Tabulate, and Proc Means, as well as the Data step.

Page 48
Page 49

How does one aggregate?

SPSS has the Aggregate command.

Page 50

Only nesting of the row variables

Multiple levels of geography and time

Page 51

Tabulating = aggregating

Furthermore, geography and time may not play a prominent role in the data and, consequently, tables from these data will not include variables for geography and time.

Page 52

Sex and age nested in the column variables

Geography and time are each a single category