8323 stats - lesson 1 - 02 introduction general 2008

Upload: untellectualism

Post on 01-Nov-2014

Page 1: 8323 Stats - Lesson 1 - 02 Introduction General 2008

STATISTICS FOR ECONOMICS AND BUSINESS

The course I loved to hate… (S.B.)

Page 2: 8323 Stats - Lesson 1 - 02 Introduction General 2008

The goals

1. The key aim is providing you with basic skills in multivariate data analysis. In particular, we focus on techniques useful to analyze and synthesize data sets with many variables and/or many observations.

2. Great attention is devoted to applications. You will learn to identify a proper multivariate technique for a given problem, to analyze the data with the statistical software SAS, to interpret results and to formulate the conclusions of the statistical analysis. We will (try to) refer to datasets relevant to your studies.

We briefly present the course. A document with a more detailed description of rules and criteria has already been uploaded to Learning Space.

Page 3: 8323 Stats - Lesson 1 - 02 Introduction General 2008

The tools

1. Frontal lessons (theory): PowerPoint slides online before each lesson

2. Lab classes (applications): familiarize with the statistical software SAS, interpret results. Extended solutions online after each lesson

3. Word documents with detailed descriptions of SAS programs

4. Tutor: Chiara Castellano, CLEMIT grad student (2 times a week)

5. Discussion list (LS or specific; please avoid personal email)

6. SAS installed on your laptop (see the Library for details)

7. Textbooks (see the Library for details; my slides should be sufficient, but if you do not attend the lessons, referring to the textbook is recommended)

Page 4: 8323 Stats - Lesson 1 - 02 Introduction General 2008

Changes and Enhancements

1. Introduction of graded assignments/group work. Some students experienced problems because they postponed studying. Many students asked for (graded) incentives to study day by day. An assessment method specific to attending students has been introduced and is strongly recommended.

2. Some variations in the organization of the lessons

Page 5: 8323 Stats - Lesson 1 - 02 Introduction General 2008

Assessment Methods

For attending students the course grade is based on:

The analysis of a real data set (PC-lab session, 4 hours). Here the focus is on the proper use of statistical techniques and on the adequacy of the economic conclusions drawn from the obtained results. Documents with SAS procedures can be used during the exam (no other material is allowed).

A written exam concerning the methodological issues discussed during the course (content of the theoretical slides).

The two exams will be graded separately (max grades = 21 and 6, respectively).

2 assignments (group work)

Lessons (at least 2) are dedicated to the discussion of the 2 assignments. All group members must be present at the discussion. In these lessons, one person picked at random from each group will illustrate (part of) the obtained results (material may be consulted). If that person answers reasonably, the assignment of the group will be graded (0-2 for each assignment). Otherwise, all group members receive 0.

Non-attending students (those who did not hand in both assignments): extended practical and theoretical exams (max grades = 23 and 8, respectively).

Page 6: 8323 Stats - Lesson 1 - 02 Introduction General 2008

Prerequisites

1. Univariate Descriptive Statistics. Synthesis Measures (mean, median, quartiles, percentiles, variance, standard deviation). Graphical tools (histogram, box-plot). Extreme values

2. Bivariate Descriptive Statistics. Contingency table, joint, marginal and conditional distributions, measures of association. Conditional means and variances. Scatterplots, covariance, correlation coefficient

3. Inference: random sample, estimators (point and interval) of the mean and of the variance. Hypothesis testing: notion of p-value.

Page 7: 8323 Stats - Lesson 1 - 02 Introduction General 2008

Multivariate Data Analysis

Techniques to analyze/synthesize data sets with many variables and/or many observations.

MOTIVATION

Page 8: 8323 Stats - Lesson 1 - 02 Introduction General 2008

Multivariate Data Analysis – Motivation

Example 1. Innovation and Research in Europe (Source: Eurostat)

Geo                Country code
Country            Country name
Region             European region
Educ_Exp           Spending on human resources (total public expenditure on education), as a % of GDP
GERD               Gross domestic expenditure on R&D (GERD), as a % of GDP
GERD_industry      % of GERD financed by industry
GERD_govern        % of GERD financed by government
GERD_abroad        % of GERD financed from abroad
Internet_Acc       Level of Internet access: % of households that have Internet access at home
ST_grad            Tertiary graduates in science and technology (S&T) per 1000 persons aged 20-29
ST_grad_f          Female tertiary graduates in S&T per 1000 females aged 20-29
ST_grad_m          Male tertiary graduates in S&T per 1000 males aged 20-29
EPO                Number of patent applications to the European Patent Office per million inhabitants
USTPO              Number of patents granted by the US Patent and Trademark Office per million inhabitants
IT_Expenditure     Expenditure on Information Technology as a % of GDP
Telec_Expenditure  Expenditure on Telecommunications as a % of GDP
Y_Educ_Lev         Youth education attainment level (total): % of the population aged 20-24 who completed at least upper secondary education
Y_Educ_Lev_f       % of females aged 20-24 who completed at least upper secondary education
Y_Educ__Lev_m      % of males aged 20-24 who completed at least upper secondary education
E_gov_avail        E-government online availability: online availability of 20 basic public services
HT_Exports         Exports of high technology products as a share of total exports

Page 9: 8323 Stats - Lesson 1 - 02 Introduction General 2008

Multivariate Data Analysis – Motivation

Example 1 (continued). Innovation and Research in Europe. For the sake of simplicity, we limit attention to a few observations and a few variables, transformed so that all variables have the same unit of measurement (we will show later how we obtain this result).

Country         Region    Educ_Exp   GERD  GERD_industry  GERD_govern  GERD_abroad  Internet_Acc  ST_grad    EPO  E_gov_avail  HT_Exports
Romania         Eastern      -1.51  -1.24          -0.52         0.66         0.04         -1.67    -1.11  -1.12        -1.53       -0.83
Czech Republic  Eastern      -0.60  -0.48          -0.11         0.72        -1.07         -1.05    -1.08  -1.03        -1.18       -0.20
Lithuania       Northern     -0.09  -0.98          -1.42         1.95        -0.25         -1.38     0.51  -1.11        -0.49       -1.25
Ireland         Northern     -0.72  -0.57           1.11        -1.03        -0.36         -0.04     1.60  -0.43         0.21        2.20
Norway          Northern      1.91  -0.10          -0.18         0.35        -0.16          0.92    -0.76   0.07         0.63       -0.93
United Kingdom  Northern      0.08   0.12          -0.70        -0.72         2.20          0.72     1.56  -0.03         0.84        1.57
Finland         Northern      1.00   1.52           1.46        -1.04        -1.01          0.49     1.03   1.61         1.39        0.74
Sweden          Northern      1.79   2.42           1.52        -1.45        -0.85          1.54     0.27   1.47         1.88        0.01
Greece          Southern     -1.10  -1.01          -1.77         1.01         1.94         -1.14    -0.71  -1.05        -1.04       -0.72
Italy           Southern     -0.45  -0.58          -0.56         1.03        -0.60         -0.33    -0.82  -0.39         0.42       -0.62
Spain           Southern     -0.81  -0.75          -0.56         0.36        -0.05         -0.33     0.01  -0.86         0.56       -0.83
Netherlands     Western      -0.18   0.09          -0.16        -0.04         0.56          1.25    -0.97   1.05        -1.04        0.53
Belgium         Western       0.63   0.36           0.83        -1.38         0.77          0.44    -0.25   0.13        -0.84       -0.62
Germany         Western      -0.47   0.72           1.02        -0.47        -1.01          0.92    -0.69   1.53         0.00        0.11
France          Western       0.51   0.47           0.04         0.07        -0.14         -0.33     1.41   0.15         0.21        0.84

How can we study the relationships among all the variables to understand the main tendencies in the data, i.e., whether there are groups of variables acting in the same or in the opposite direction?
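One simple way to check whether two variables act in the same or in the opposite direction is their correlation coefficient (listed among the prerequisites). The sketch below is only an illustration in Python, not the SAS workflow used in the course: it copies three columns from the table above and computes Pearson correlations with a small helper, `pearson`, which is a name introduced here for the example.

```python
import math

# Values copied from the table above, in the same row order
# (Romania, Czech Republic, Lithuania, ..., Germany, France).
GERD = [-1.24, -0.48, -0.98, -0.57, -0.10, 0.12, 1.52, 2.42,
        -1.01, -0.58, -0.75, 0.09, 0.36, 0.72, 0.47]
EPO = [-1.12, -1.03, -1.11, -0.43, 0.07, -0.03, 1.61, 1.47,
       -1.05, -0.39, -0.86, 1.05, 0.13, 1.53, 0.15]
GERD_govern = [0.66, 0.72, 1.95, -1.03, 0.35, -0.72, -1.04, -1.45,
               1.01, 1.03, 0.36, -0.04, -1.38, -0.47, 0.07]


def pearson(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


r_same = pearson(GERD, EPO)              # positive: same direction
r_opposite = pearson(GERD, GERD_govern)  # negative: opposite direction
print(f"GERD vs EPO:         {r_same:+.2f}")
print(f"GERD vs GERD_govern: {r_opposite:+.2f}")
```

With ten variables there are already 45 such pairs to inspect, which is part of the motivation for techniques that summarize all the relationships at once.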

Page 10: 8323 Stats - Lesson 1 - 02 Introduction General 2008

Multivariate Data Analysis – Motivation

Example 1 (continued). Innovation and Research in Europe (subset). How can we study the relationships among all the variables?

Obtain a line plot for VARIABLES.

A line is associated with each variable. We can observe groups of vars with similar tendencies with respect to some variables, for example the orange-red ones, the green ones, or the blue ones. These three groups of vars show different tendencies.

[Figure: line plot with one line per variable. x-axis (countries): ro, gr, lt, cz, it, es, nl, no, fr, be, ie, uk, de, fi, se. Legend: HT_Exports, GERD, GERD_industry, Internet_Acc, EPO, Educ_Exp, E_gov_avail, ST_grad, GERD_govern.]

Page 11: 8323 Stats - Lesson 1 - 02 Introduction General 2008

Multivariate Data Analysis – Motivation

Example 1 (continued). Innovation and Research in Europe (subset). How can we combine the information provided by all the vars to compare the innovation/research performance of each country?

Should we consider the means for the previously observed groups OF VARIABLES? Are they sufficient to explain ALL the vars? Should we consider the 3 means, one for each group, and compare obs on the basis of them?

Which is the most important index/mean? Should the 3 indices have the same weight when comparing variables? What if we want a single index? Is it possible, and how much information do we lose?

[Figure: line plot of the three group means (mean group1, mean group2, mean group3) by country. x-axis: ro, gr, lt, cz, it, es, nl, no, fr, be, ie, uk, de, fi, se. y-axis: from -2.5 to 2.5.]

Group 1: GERD, GERD_industry, Internet_Acc, EPO, Educ_Exp, E_gov_avail
Group 2: ST_grad, HT_Exports
Group 3: GERD_govern
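As a rough illustration of the "one mean per group" idea, the following Python sketch computes the three group means for two countries, using the values from the Example 1 table and the grouping listed above. The names `GROUPS`, `PROFILES` and `group_means` are introduced here for the example only; the course itself works in SAS.

```python
from statistics import mean

# Grouping of variables as listed on the slide.
GROUPS = {
    "group1": ["GERD", "GERD_industry", "Internet_Acc", "EPO",
               "Educ_Exp", "E_gov_avail"],
    "group2": ["ST_grad", "HT_Exports"],
    "group3": ["GERD_govern"],
}

# Values copied from the Example 1 table (two countries only).
PROFILES = {
    "Sweden": {"Educ_Exp": 1.79, "GERD": 2.42, "GERD_industry": 1.52,
               "GERD_govern": -1.45, "Internet_Acc": 1.54, "ST_grad": 0.27,
               "EPO": 1.47, "E_gov_avail": 1.88, "HT_Exports": 0.01},
    "Romania": {"Educ_Exp": -1.51, "GERD": -1.24, "GERD_industry": -0.52,
                "GERD_govern": 0.66, "Internet_Acc": -1.67, "ST_grad": -1.11,
                "EPO": -1.12, "E_gov_avail": -1.53, "HT_Exports": -0.83},
}


def group_means(profile):
    """Return the mean of each group of variables for one country."""
    return {name: mean([profile[v] for v in variables])
            for name, variables in GROUPS.items()}


indices = {country: group_means(p) for country, p in PROFILES.items()}
for country, idx in indices.items():
    print(country, {g: round(value, 2) for g, value in idx.items()})
```

The three means give every variable in a group the same weight, which is exactly the kind of subjective choice the questions above call into doubt.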

Page 12: 8323 Stats - Lesson 1 - 02 Introduction General 2008

Multivariate Data Analysis – Motivation

Example 1 (continued). Innovation and Research in Europe. How can we study the relationships among all the variables?

Things become complicated when we consider more vars/obs. FINDING GROUPS OF VARIABLES WITH SIMILAR PATTERNS IS DIFFICULT.

[Figure: line plot with one line per variable, full data set. y-axis: from -3 to 4. Legend: Educ_Exp, GERD, GERD_industry, GERD_govern, GERD_abroad, Internet_Acc, ST_grad, ST_grad_f, ST_grad_m, EPO, USTPO, Y_Educ_Lev, Y_Educ_Lev_f, Y_Educ__Lev_m, HT_Exports.]

Page 13: 8323 Stats - Lesson 1 - 02 Introduction General 2008

Multivariate Data Analysis – Motivation - Vars

High number of (numerical) variables:

• Analyzing the relationships among variables

• Synthesizing the variables

Techniques: Principal Component Analysis, Factor Analysis

Page 14: 8323 Stats - Lesson 1 - 02 Introduction General 2008

Multivariate Data Analysis – Motivation

Example 1 (continued). Innovation and Research in Europe (subset).

How can we describe the main tendencies of European countries with respect to innovation? Are there countries with similar characteristics? Which are the main patterns/profiles in this data set?

Obtain a line plot FOR OBSERVATIONS.

A line is associated with each observation. We can observe groups of obs with similar tendencies (for example the orange-red ones). Tendencies are similar only with respect to some vars. Which vars should be considered most?

[Figure: line plot with one line per country. Legend: ro, gr, cz, lt, it, es, uk, ie, fr, no, nl, de, be, fi, se.]

Who is “close” to whom? How can we describe, in a simple way, the similarity or dissimilarity between countries?
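One possible way to make "closeness" concrete is a distance between country profiles. The Python sketch below uses the Euclidean distance over the ten variables of the Example 1 table; this choice is an assumption made for illustration (the slides do not name a specific measure here), and the names `PROFILES` and `nearest` are introduced only for the example.

```python
import math

# Profiles copied from the Example 1 table, variables in table order:
# Educ_Exp, GERD, GERD_industry, GERD_govern, GERD_abroad,
# Internet_Acc, ST_grad, EPO, E_gov_avail, HT_Exports
PROFILES = {
    "Finland": [1.00, 1.52, 1.46, -1.04, -1.01, 0.49, 1.03, 1.61, 1.39, 0.74],
    "Sweden":  [1.79, 2.42, 1.52, -1.45, -0.85, 1.54, 0.27, 1.47, 1.88, 0.01],
    "Romania": [-1.51, -1.24, -0.52, 0.66, 0.04, -1.67, -1.11, -1.12, -1.53, -0.83],
}

# Euclidean distance between two profiles (math.dist needs Python 3.8+).
d_fi_se = math.dist(PROFILES["Finland"], PROFILES["Sweden"])
d_fi_ro = math.dist(PROFILES["Finland"], PROFILES["Romania"])


def nearest(country):
    """Return the other country whose profile is closest to `country`."""
    others = [c for c in PROFILES if c != country]
    return min(others, key=lambda c: math.dist(PROFILES[country], PROFILES[c]))


print(f"Finland-Sweden:  {d_fi_se:.2f}")
print(f"Finland-Romania: {d_fi_ro:.2f}")
print("Closest to Finland:", nearest("Finland"))
```

Because all variables have been brought to the same unit of measurement, no single variable dominates the distance; with raw percentages and counts that would not hold.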

Page 15: 8323 Stats - Lesson 1 - 02 Introduction General 2008

Multivariate Data Analysis – Motivation

Example 1 (continued). Innovation and Research in Europe (subset). How can we identify groups of cases (countries) with similar characteristics?

Sometimes the grouping is obtained on the basis of a priori knowledge. In this case, for example, we can group by region.

Grouping obs according to the region is not a good idea: countries in the same region show different patterns.

[Figure: line plot with one line per country. x-axis (variables): GERD, GERD_industry, Internet_Acc, EPO, Educ_Exp, E_gov_avail, ST_grad, HT_Exports, GERD_govern. Legend: ro, cz, gr, it, es, fr, nl, de, be, fi, se, lt, uk, ie, no.]

Page 16: 8323 Stats - Lesson 1 - 02 Introduction General 2008

Multivariate Data Analysis – Motivation

Example 1 (continued). Innovation and Research in Europe. How can we describe the main tendencies of European countries with respect to innovation?

Things become complicated when we consider more vars/obs. FINDING GROUPS OF OBSERVATIONS WITH SIMILAR PATTERNS IS DIFFICULT.

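[Figure: line plot with one line per country, full data set. y-axis: from -3 to 4. Legend: bg, cz, hu, pl, ro, sk, dk, ee, fi, is, ie, lv, lt, no, se, uk, hr, gr, it, mt, pt, si, es, at, be, fr, de, lu, nl, ch, cy, tr.]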

Page 17: 8323 Stats - Lesson 1 - 02 Introduction General 2008

Multivariate Data Analysis – Motivation

High number of observations (either numerical or categorical):

• Describing cases

• Analysis of similarity/dissimilarity between cases

• Identification of the main tendencies (groups of cases) in a database

Finding groups: Cluster Analysis

Visualizing differences: Factor Analysis/Multidimensional Scaling

Page 18: 8323 Stats - Lesson 1 - 02 Introduction General 2008

18

Multivariate Data Analysis – Motivation

Example 2. Information about projects financed by EU in 1995-1996

Record              Project id

Information about the Responsible (organization which is coordinating the project)

Country             Nationality of the Responsible
Type                Type of organisation (Industry, Education, Research, Commercial) of the Responsible
REV                 Revenues of the Responsible
EMP                 Employees of the Responsible
Activity_partner    Evaluation of the activity of the Responsible as a partner in other projects before 1995 (8-point scale; 1 = very poor, 8 = excellent)
Proj_resp_1995      Number of projects coordinated by the Responsible before 1995
Proj_resp_end_1995  Number of projects coordinated by the Responsible ended before 1995

Information about the Project

Duration            Duration of the project
Size                Number of organisations involved in the project
Topic               Topic of the project

Page 19: 8323 Stats - Lesson 1 - 02 Introduction General 2008

Multivariate Data Analysis

RECORD  COUNTRY      TYPE                 REV    EMP  ACTIVITY_PARTNER  PROJ_RESP_1995  PROJ_RESP_END_1995  DURATION  SIZE  TOPIC
23376   France       Research            6353     49                 2               1                   0        24     6  MATERIALS TECHNOLOGY
23386   Belgium      Education          39707    310                 6              10                   2        24    15  MATERIALS TECHNOLOGY
23590   Italy        Industry         9969376  34217                 6               6                   1        18     5  ENERGY
23596   UK           Research           99404    572                 1               3                   0        24     5  NATURAL_RESOURCES
23601   Netherlands  Research           18400    163                 2               1                   0        36     7  NATURAL_RESOURCES
23611   Germany      Education       15930801  78701                 2               1                   0        24     3  ENERGY
23682   Germany      Education        4547875  10343                 6               3                   1        24     4  ENERGY
23770   France       Research             974     12                 2               2                   0        36     6  NATURAL_RESOURCES
23806   France       Non Commercial    168066    594                 7              10                   0        24    10  SAFETY
23985   UK           Education       15297220  53164                 7               7                   0        24     4  SAFETY
23988   Italy        Industry           23859    199                 1               1                   0        33     7  SAFETY
24171   Italy        Industry            5706    259                 2               2                   0        24     6  TELECOMMUNICATIONS
24174   UK           Industry           18947    363                 2               6                   4        18     3  TELECOMMUNICATIONS
24175   Netherlands  Industry          208312   1394                 1               1                   1        18     4  TELECOMMUNICATIONS
27410   UK           Industry          248824   2633                 1               3                   0        30     5  STANDARDS

Example 2 (continued). Projects financed by EU in 1995-1996 (partial input)

Is there an association between the country, the type of organization and the topic? Are there organizations/countries specialized in particular topics?

If there is association, what is it due to? Who is attracted by what?
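A first look at such an association is the contingency table (listed among the prerequisites). The Python sketch below is an illustration only, not the SAS analysis used in the course: it cross-tabulates TYPE by TOPIC for the 15 records shown above and derives the conditional distribution of topics within one type of organization. The names `RECORDS`, `counts`, `type_totals` and `topic_profile` are introduced for the example.

```python
from collections import Counter

# (COUNTRY, TYPE, TOPIC) copied from the 15 records shown above.
RECORDS = [
    ("France", "Research", "MATERIALS TECHNOLOGY"),
    ("Belgium", "Education", "MATERIALS TECHNOLOGY"),
    ("Italy", "Industry", "ENERGY"),
    ("UK", "Research", "NATURAL_RESOURCES"),
    ("Netherlands", "Research", "NATURAL_RESOURCES"),
    ("Germany", "Education", "ENERGY"),
    ("Germany", "Education", "ENERGY"),
    ("France", "Research", "NATURAL_RESOURCES"),
    ("France", "Non Commercial", "SAFETY"),
    ("UK", "Education", "SAFETY"),
    ("Italy", "Industry", "SAFETY"),
    ("Italy", "Industry", "TELECOMMUNICATIONS"),
    ("UK", "Industry", "TELECOMMUNICATIONS"),
    ("Netherlands", "Industry", "TELECOMMUNICATIONS"),
    ("UK", "Industry", "STANDARDS"),
]

# Joint counts (cells of the TYPE x TOPIC contingency table)
# and marginal counts per TYPE (row totals).
counts = Counter((org_type, topic) for _, org_type, topic in RECORDS)
type_totals = Counter(org_type for _, org_type, _ in RECORDS)


def topic_profile(org_type):
    """Conditional distribution of TOPIC given one TYPE of organization."""
    return {topic: n / type_totals[org_type]
            for (t, topic), n in counts.items() if t == org_type}


for org_type in type_totals:
    print(org_type, topic_profile(org_type))
```

On this small extract the row profiles already differ by type, but 15 records are far too few to draw conclusions; the sketch only shows the kind of table the questions above refer to.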

Page 20: 8323 Stats - Lesson 1 - 02 Introduction General 2008

Multivariate Data Analysis – Motivation

Categorical variables (two or more) with many values:

• Describing the association between categorical variables, i.e., understanding the main attraction/repulsion forces between categories

• Identification of profiles of categories (i.e., typical combinations of categories)

Technique: Correspondence Analysis (simple and multiple)

Page 21: 8323 Stats - Lesson 1 - 02 Introduction General 2008

Multivariate Data Analysis

When dealing with many vars and/or obs it may be difficult to

• Describe, analyze, and synthesize obs taking into account all the vars, identifying “typical” cases or tendencies in OBS

• Study the relationships among vars and/or synthesize them jointly

A common shortcut is grouping vars and/or obs according to some “natural” or somehow “intuitive” rules (e.g., the mean for the variables, the region or the richness for countries, and so on).

These approaches:

• Are subjective

• Reproduce what we already know about the data and do not help in gaining further knowledge about them

• Sometimes cannot be applied (no natural grouping available, or difficulty in identifying similar patterns)

Page 22: 8323 Stats - Lesson 1 - 02 Introduction General 2008

Multivariate Data Analysis

The aim of Multivariate Statistical Techniques is to extract the information contained in a given data set by simplifying and summarizing observations and/or variables using DATA-DRIVEN TOOLS.

The tool – i.e., the compression/simplification/synthesis of data – used to make information available depends upon the aim of the analysis and on the nature of the variables taken into account