8323 stats - lesson 1 - 02 introduction general 2008
1
STATISTICS FOR
ECONOMICS AND BUSINESS
The course I loved to hate… (S.B.)
2
STATISTICS FOR ECONOMICS AND BUSINESS
The goals
1. The key aim is providing you with basic skills in multivariate data analysis. In particular, we focus on techniques useful to analyze and synthesize data sets with many variables and/or many observations.
2. Great attention is devoted to applications. You will learn to identify a proper multivariate technique for a given problem, to analyze the data with the statistical software SAS, to interpret results and to formulate the conclusions of the statistical analysis. We will (try to) refer to datasets relevant to your studies.
We briefly present the course. A document with a more detailed description of rules and criteria has already been uploaded to Learning space.
3
STATISTICS FOR ECONOMICS AND BUSINESS
The tools
1. Frontal lessons (theory). PowerPoint slides online before each lesson.
2. Lab classes (applications). Familiarize with the statistical software SAS and interpret results. Extended solutions online after each lesson.
3. Word documents with detailed descriptions of SAS programs.
4. Tutor: Chiara Castellano, CLEMIT grad student (2 times a week).
5. Discussion list (LS or specific; please avoid personal email).
6. SAS installed on your laptop (see the Library for details).
7. Textbooks (see the Library for details; my slides should be sufficient, but if you do not attend the lessons, referring to the textbook is recommended).
4
STATISTICS FOR ECONOMICS AND BUSINESS
Changes and Enhancements
1. Introduction of graded assignments/group work. Some students experienced problems because they postponed studying. Many students asked for (graded) incentives to study day by day. An assessment method specific to attending students has been introduced and is strongly recommended.
2. Some variations in the organization of the lessons.
5
STATISTICS FOR ECONOMICS AND BUSINESS
Assessment Methods
For attending students the course grade is based on:
The analysis of a real data set (PC-lab session, 4 hours). Here the focus is on the proper use of statistical techniques and on the adequacy of the economic conclusions drawn from the results obtained. Documents with SAS procedures can be used during the exam (no other material is allowed).
A written exam on the methodological issues discussed during the course (content of the theoretical slides).
The two exams will be graded separately (max grades = 21 and 6 respectively).
2 assignments (group work).
Lessons (at least 2) dedicated to the discussion of the 2 assignments. All group members must be present at the discussion. In these lessons one person picked at random from each group will illustrate (part of) the results obtained (material may be consulted). If that person answers reasonably, the group's assignment will be graded (0-2 for each assignment). Otherwise, all group members receive 0.
Non-attending students (those who did not hand in both assignments): extended practical and theoretical exams (max grades = 23 and 8 respectively).
6
STATISTICS FOR ECONOMICS AND BUSINESS
Prerequisites
1. Univariate Descriptive Statistics. Synthesis Measures (mean, median, quartiles, percentiles, variance, standard deviation). Graphical tools (histogram, box-plot). Extreme values
2. Bivariate Descriptive Statistics. Contingency table, joint, marginal and conditional distributions, measures of association. Conditional means and variances. Scatterplots, covariance, correlation coefficient
3. Inference: random sample, estimators (point and interval) of the mean and of the variance. Hypothesis testing: notion of p-value.
7
Multivariate Data Analysis
Techniques to analyze/synthesize data sets with many variables and/or many observations.
MOTIVATION
8
Multivariate Data Analysis – Motivation
Example 1. Innovation and Research in Europe (Source: Eurostat)
Geo                Country code
Country            Country name
Region             European region
Educ_Exp           Spending on human resources (total public expenditure on education) as a % of GDP
GERD               Gross domestic expenditure on R&D (GERD) as a % of GDP
GERD_industry      % of GERD financed by industry
GERD_govern        % of GERD financed by government
GERD_abroad        % of GERD financed from abroad
Internet_Acc       Level of Internet access: % of households with Internet access at home
ST_grad            Tertiary graduates in science and technology (S&T) per 1000 persons aged 20-29
ST_grad_f          Female tertiary graduates in S&T per 1000 females aged 20-29
ST_grad_m          Male tertiary graduates in S&T per 1000 males aged 20-29
EPO                Number of patent applications to the European Patent Office per million inhabitants
USTPO              Number of patents granted by the US Patent and Trademark Office per million inhabitants
IT_Expenditure     Expenditure on Information Technology as a % of GDP
Telec_Expenditure  Expenditure on Telecommunications as a % of GDP
Y_Educ_Lev         Youth education attainment level (total): % of the population aged 20-24 who completed at least upper secondary education
Y_Educ_Lev_f       % of females aged 20-24 who completed at least upper secondary education
Y_Educ__Lev_m      % of males aged 20-24 who completed at least upper secondary education
E_gov_avail        E-government online availability: online availability of 20 basic public services
HT_Exports         Exports of high technology products as a share of total exports
9
Multivariate Data Analysis – Motivation
Example 1 (continued). Innovation and Research in Europe.
For the sake of simplicity, we limit attention to a few observations and a few variables, transformed so that all variables have the same unit of measurement (we will show later how we obtain this result).
country region Educ_Exp GERD GERD_industry GERD_govern GERD_abroad Internet_Acc ST_grad EPO E_gov_avail HT_Exports
Romania Eastern -1.51 -1.24 -0.52 0.66 0.04 -1.67 -1.11 -1.12 -1.53 -0.83
Czech Republic Eastern -0.60 -0.48 -0.11 0.72 -1.07 -1.05 -1.08 -1.03 -1.18 -0.20
Lithuania Northern -0.09 -0.98 -1.42 1.95 -0.25 -1.38 0.51 -1.11 -0.49 -1.25
Ireland Northern -0.72 -0.57 1.11 -1.03 -0.36 -0.04 1.60 -0.43 0.21 2.20
Norway Northern 1.91 -0.10 -0.18 0.35 -0.16 0.92 -0.76 0.07 0.63 -0.93
United Kingdom Northern 0.08 0.12 -0.70 -0.72 2.20 0.72 1.56 -0.03 0.84 1.57
Finland Northern 1.00 1.52 1.46 -1.04 -1.01 0.49 1.03 1.61 1.39 0.74
Sweden Northern 1.79 2.42 1.52 -1.45 -0.85 1.54 0.27 1.47 1.88 0.01
Greece Southern -1.10 -1.01 -1.77 1.01 1.94 -1.14 -0.71 -1.05 -1.04 -0.72
Italy Southern -0.45 -0.58 -0.56 1.03 -0.60 -0.33 -0.82 -0.39 0.42 -0.62
Spain Southern -0.81 -0.75 -0.56 0.36 -0.05 -0.33 0.01 -0.86 0.56 -0.83
Netherlands Western -0.18 0.09 -0.16 -0.04 0.56 1.25 -0.97 1.05 -1.04 0.53
Belgium Western 0.63 0.36 0.83 -1.38 0.77 0.44 -0.25 0.13 -0.84 -0.62
Germany Western -0.47 0.72 1.02 -0.47 -1.01 0.92 -0.69 1.53 0.00 0.11
France Western 0.51 0.47 0.04 0.07 -0.14 -0.33 1.41 0.15 0.21 0.84
How can we study the relationships among all the variables to understand the main tendencies of the data, i.e., whether there are groups of variables acting in the same or in the opposite direction?
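One simple way to start answering this question is to look at the pairwise correlations between the variables. The sketch below is only an illustration, not the method the course will develop: it assumes the Pearson correlation coefficient (listed among the prerequisites) and uses the 15 rows of the table above. The helper names (`pearson`, `column`) are hypothetical. A positive value suggests two variables acting in the same direction, a negative value in the opposite direction.

```python
from math import sqrt

VARIABLES = ["Educ_Exp", "GERD", "GERD_industry", "GERD_govern", "GERD_abroad",
             "Internet_Acc", "ST_grad", "EPO", "E_gov_avail", "HT_Exports"]

# Values copied from the table above (one row per country).
DATA = {
    "Romania":        [-1.51, -1.24, -0.52,  0.66,  0.04, -1.67, -1.11, -1.12, -1.53, -0.83],
    "Czech Republic": [-0.60, -0.48, -0.11,  0.72, -1.07, -1.05, -1.08, -1.03, -1.18, -0.20],
    "Lithuania":      [-0.09, -0.98, -1.42,  1.95, -0.25, -1.38,  0.51, -1.11, -0.49, -1.25],
    "Ireland":        [-0.72, -0.57,  1.11, -1.03, -0.36, -0.04,  1.60, -0.43,  0.21,  2.20],
    "Norway":         [ 1.91, -0.10, -0.18,  0.35, -0.16,  0.92, -0.76,  0.07,  0.63, -0.93],
    "United Kingdom": [ 0.08,  0.12, -0.70, -0.72,  2.20,  0.72,  1.56, -0.03,  0.84,  1.57],
    "Finland":        [ 1.00,  1.52,  1.46, -1.04, -1.01,  0.49,  1.03,  1.61,  1.39,  0.74],
    "Sweden":         [ 1.79,  2.42,  1.52, -1.45, -0.85,  1.54,  0.27,  1.47,  1.88,  0.01],
    "Greece":         [-1.10, -1.01, -1.77,  1.01,  1.94, -1.14, -0.71, -1.05, -1.04, -0.72],
    "Italy":          [-0.45, -0.58, -0.56,  1.03, -0.60, -0.33, -0.82, -0.39,  0.42, -0.62],
    "Spain":          [-0.81, -0.75, -0.56,  0.36, -0.05, -0.33,  0.01, -0.86,  0.56, -0.83],
    "Netherlands":    [-0.18,  0.09, -0.16, -0.04,  0.56,  1.25, -0.97,  1.05, -1.04,  0.53],
    "Belgium":        [ 0.63,  0.36,  0.83, -1.38,  0.77,  0.44, -0.25,  0.13, -0.84, -0.62],
    "Germany":        [-0.47,  0.72,  1.02, -0.47, -1.01,  0.92, -0.69,  1.53,  0.00,  0.11],
    "France":         [ 0.51,  0.47,  0.04,  0.07, -0.14, -0.33,  1.41,  0.15,  0.21,  0.84],
}


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def column(name):
    """Values of one variable across all countries in DATA."""
    j = VARIABLES.index(name)
    return [row[j] for row in DATA.values()]


# Correlation for every ordered pair of variables.
corr = {(a, b): pearson(column(a), column(b)) for a in VARIABLES for b in VARIABLES}

# Print the matrix, one row per variable.
for a in VARIABLES:
    print(f"{a:>14}", " ".join(f"{corr[a, b]:6.2f}" for b in VARIABLES))
```

With 10 variables this already gives 45 distinct pairs to inspect, which hints at why reading the matrix by eye does not scale to the full data set.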
10
Multivariate Data Analysis – Motivation
Example 1 (continued). Innovation and Research in Europe (subset)
How can we study the relationships among all the variables?
Obtain a line plot for VARIABLES
A line is associated with each variable. We can observe groups of vars with similar tendencies with respect to some variables, for example the orange-red ones, the green ones, or the blue ones. These three groups of vars show different tendencies.
[Figure: line plot with one line per variable; x-axis: ro, gr, lt, cz, it, es, nl, no, fr, be, ie, uk, de, fi, se; legend: HT_Exports, GERD, GERD_industry, Internet_Acc, EPO, Educ_Exp, E_gov_avail, ST_grad, GERD_govern]
11
Multivariate Data Analysis – Motivation
Example 1 (continued). Innovation and Research in Europe (subset)
How can we combine the information provided by all the vars to compare innovation/research performance for each country?
Should we consider the means for the previously observed groups OF VARIABLES? Are they sufficient to explain ALL the vars? Should we consider the 3 means, one for each group, and compare obs on the basis of them?
Which is the most important index/mean? Should the 3 indices have the same weight when comparing variables? What if we want a single index? Is it possible, and how much information do we lose?
[Figure: line plot of the three group means (mean group1, mean group2, mean group3) by country; x-axis: ro, gr, lt, cz, it, es, nl, no, fr, be, ie, uk, de, fi, se; y-axis from -2.5 to 2.5]
Group 1: GERD, GERD_industry, Internet_Acc, EPO, Educ_Exp, E_gov_avail
Group 2: ST_grad, HT_Exports
Group 3: GERD_govern
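The three group means can be computed directly from the table of Example 1. The sketch below does this for three of the countries (Romania, Ireland, Sweden), using the group definitions above. The `single_index` function and its equal weights are assumptions added only to illustrate the "single index" question; the slides do not say how such an index should be built.

```python
VARIABLES = ["Educ_Exp", "GERD", "GERD_industry", "GERD_govern", "GERD_abroad",
             "Internet_Acc", "ST_grad", "EPO", "E_gov_avail", "HT_Exports"]

# Groups of variables as listed on the slide.
GROUPS = {
    "group1": ["GERD", "GERD_industry", "Internet_Acc", "EPO", "Educ_Exp", "E_gov_avail"],
    "group2": ["ST_grad", "HT_Exports"],
    "group3": ["GERD_govern"],
}

# Three rows copied from the table of Example 1.
ROWS = {
    "Romania": [-1.51, -1.24, -0.52,  0.66,  0.04, -1.67, -1.11, -1.12, -1.53, -0.83],
    "Ireland": [-0.72, -0.57,  1.11, -1.03, -0.36, -0.04,  1.60, -0.43,  0.21,  2.20],
    "Sweden":  [ 1.79,  2.42,  1.52, -1.45, -0.85,  1.54,  0.27,  1.47,  1.88,  0.01],
}


def group_means(values):
    """Mean of the variables in each group for one country."""
    by_name = dict(zip(VARIABLES, values))
    return {g: sum(by_name[v] for v in members) / len(members)
            for g, members in GROUPS.items()}


def single_index(means, weights):
    """Weighted average of the group means (hypothetical single index)."""
    total = sum(weights.values())
    return sum(weights[g] * means[g] for g in means) / total


# Assumption: equal weights, chosen only for illustration.
EQUAL_WEIGHTS = {"group1": 1.0, "group2": 1.0, "group3": 1.0}

indices = {country: group_means(values) for country, values in ROWS.items()}
for country, means in indices.items():
    summary = ", ".join(f"{g}={m:+.2f}" for g, m in means.items())
    print(f"{country:8s} {summary}  single={single_index(means, EQUAL_WEIGHTS):+.2f}")
```

Changing the weights changes the ranking of the countries, which is exactly the subjectivity the questions above point at.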
12
Multivariate Data Analysis – Motivation
Things become complicated when we consider more vars/obs. FINDING GROUPS OF VARIABLES WITH SIMILAR PATTERNS IS DIFFICULT
Example 1 (continued). Innovation and Research in Europe. How can we study the relationships among all the variables?
[Figure: line plot with one line per variable for the full data set; y-axis from -3 to 4; legend: Educ_Exp, GERD, GERD_industry, GERD_govern, GERD_abroad, Internet_Acc, ST_grad, ST_grad_f, ST_grad_m, EPO, USTPO, Y_Educ_Lev, Y_Educ_Lev_f, Y_Educ__Lev_m, HT_Exports]
13
Multivariate Data Analysis – Motivation - Vars
High number of (numerical) variables:
•Analyzing the relationships among variables
•Synthesizing the variables
Principal Component Analysis
Factor Analysis
14
Multivariate Data Analysis – Motivation
Example 1 (continued). Innovation and Research in Europe (subset)
How can we describe the main tendencies of European countries with respect to innovation? Are there countries with similar characteristics? Which are the main patterns/profiles in this data set?
Obtain a line plot FOR OBSERVATIONS
A line is associated with each observation. We can observe groups of obs with similar tendencies (for example the orange-red ones). Tendencies are similar only with respect to some vars. Which vars should be mostly considered?
[Figure: line plot with one line per country: ro, gr, cz, lt, it, es, uk, ie, fr, no, nl, de, be, fi, se]
Who is “close” to whom? How can we describe in a simple way similarity or dissimilarity between countries?
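One possible way to make "closeness" concrete is to compute a distance between the rows of the table. The sketch below assumes the Euclidean distance over all ten variables, which is a choice made here for illustration and not one stated in the slides. It uses four rows copied from the table of Example 1; the helpers `distance` and `nearest` are hypothetical names.

```python
from itertools import combinations
from math import dist  # Euclidean distance, available from Python 3.8

# Four rows copied from the table of Example 1 (same variable order as the table).
ROWS = {
    "Finland": [ 1.00,  1.52,  1.46, -1.04, -1.01,  0.49,  1.03,  1.61,  1.39,  0.74],
    "Sweden":  [ 1.79,  2.42,  1.52, -1.45, -0.85,  1.54,  0.27,  1.47,  1.88,  0.01],
    "Romania": [-1.51, -1.24, -0.52,  0.66,  0.04, -1.67, -1.11, -1.12, -1.53, -0.83],
    "Greece":  [-1.10, -1.01, -1.77,  1.01,  1.94, -1.14, -0.71, -1.05, -1.04, -0.72],
}

# Distance for every unordered pair of countries.
distances = {
    frozenset(pair): dist(ROWS[pair[0]], ROWS[pair[1]])
    for pair in combinations(ROWS, 2)
}


def distance(a, b):
    """Euclidean distance between two countries over all variables."""
    return distances[frozenset((a, b))]


def nearest(country):
    """Country with the smallest distance from the given one."""
    others = [c for c in ROWS if c != country]
    return min(others, key=lambda c: distance(country, c))


for a, b in combinations(ROWS, 2):
    print(f"{a:8s} - {b:8s}: {distance(a, b):.2f}")
```

Every variable counts equally in this distance, so the question of which vars should matter most remains open.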
15
Multivariate Data Analysis – Motivation
Sometimes the grouping is obtained on the basis of a priori knowledge. In this case, for example, we can group by referring to the region.
Example 1 (continued). Innovation and Research in Europe (subset)
How can we identify groups of cases (countries) with similar characteristics?
Grouping obs according to the region is not a good idea: countries in the same region show different patterns.
[Figure: line plot with one line per country; x-axis: GERD, GERD_industry, Internet_Acc, EPO, Educ_Exp, E_gov_avail, ST_grad, HT_Exports, GERD_govern; legend: ro, cz, gr, it, es, fr, nl, de, be, fi, se, lt, uk, ie, no]
16
Multivariate Data Analysis – Motivation
Example 1 (continued). Innovation and Research in Europe. How can we describe the main tendencies of European countries with respect to innovation?
Things become complicated when we consider more vars/obs. FINDING GROUPS OF OBSERVATIONS WITH SIMILAR PATTERNS IS DIFFICULT
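[Figure: line plot with one line per country for the full data set; y-axis from -3 to 4; legend: bg, cz, hu, pl, ro, sk, dk, ee, fi, is, ie, lv, lt, no, se, uk, hr, gr, it, mt, pt, si, es, at, be, fr, de, lu, nl, ch, cy, tr]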
17
Multivariate Data Analysis – Motivation
High number of observations (either numerical or categorical)
•Describing cases
•Analysis of similarity/dissimilarity between cases
•Identification of the main tendencies (groups of cases) in a database
Finding groups: Cluster Analysis
Visualizing differences: Factor Analysis/Multidimensional Scaling
18
Multivariate Data Analysis – Motivation
Example 2. Information about projects financed by the EU in 1995-1996
Record              Project id
Information about the Responsible (organization which is coordinating the project)
Country             Nationality of the Responsible
Type                Type of organisation (Industry, Education, Research, Commercial) of the Responsible
REV                 Revenues of the Responsible
EMP                 Employees of the Responsible
Activity_partner    Evaluation of the activity of the Responsible as a partner in other projects before 1995 (8-point scale; 1=very poor, 8=excellent)
Proj_resp_1995      Number of projects coordinated by the Responsible before 1995
Proj_resp_end_1995  Number of projects coordinated by the Responsible ended before 1995
Information about the Project
Duration            Duration of the project
Size                Number of organisations involved in the project
Topic               Topic of the project
19
Multivariate Data Analysis
RECORD COUNTRY TYPE REV EMP ACTIVITY_PARTNER PROJ_RESP_1995 PROJ_RESP_END_1995 DURATION SIZE TOPIC
23376 France Research 6353 49 2 1 0 24 6 MATERIALS TECHNOLOGY
23386 Belgium Education 39707 310 6 10 2 24 15 MATERIALS TECHNOLOGY
23590 Italy Industry 9969376 34217 6 6 1 18 5 ENERGY
23596 UK Research 99404 572 1 3 0 24 5 NATURAL_RESOURCES
23601 Netherlands Research 18400 163 2 1 0 36 7 NATURAL_RESOURCES
23611 Germany Education 15930801 78701 2 1 0 24 3 ENERGY
23682 Germany Education 4547875 10343 6 3 1 24 4 ENERGY
23770 France Research 974 12 2 2 0 36 6 NATURAL_RESOURCES
23806 France Non Commercial 168066 594 7 10 0 24 10 SAFETY
23985 UK Education 15297220 53164 7 7 0 24 4 SAFETY
23988 Italy Industry 23859 199 1 1 0 33 7 SAFETY
24171 Italy Industry 5706 259 2 2 0 24 6 TELECOMMUNICATIONS
24174 UK Industry 18947 363 2 6 4 18 3 TELECOMMUNICATIONS
24175 Netherlands Industry 208312 1394 1 1 1 18 4 TELECOMMUNICATIONS
27410 UK Industry 248824 2633 1 3 0 30 5 STANDARDS
Example 2 (continued). Projects financed by EU in 1995-1996 (partial input)
Is there an association between the country, the type of organization and the topic? Are there organizations/countries specialized in particular topics?
If there is association, what is it due to? Who is attracted by what?
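A first look at these questions can be taken with a contingency table, one of the prerequisites listed earlier. The sketch below cross-tabulates TYPE by TOPIC for the 15 records shown above and compares each observed count with the count expected if the two variables were independent (row total × column total / n). This is only the descriptive starting point, not the multivariate technique itself; the helper names (`contingency`, `margins`, `expected`, `row_profile`) are hypothetical.

```python
from collections import Counter

# (COUNTRY, TYPE, TOPIC) for the 15 records of the partial input above.
PROJECTS = [
    ("France", "Research", "MATERIALS TECHNOLOGY"),
    ("Belgium", "Education", "MATERIALS TECHNOLOGY"),
    ("Italy", "Industry", "ENERGY"),
    ("UK", "Research", "NATURAL_RESOURCES"),
    ("Netherlands", "Research", "NATURAL_RESOURCES"),
    ("Germany", "Education", "ENERGY"),
    ("Germany", "Education", "ENERGY"),
    ("France", "Research", "NATURAL_RESOURCES"),
    ("France", "Non Commercial", "SAFETY"),
    ("UK", "Education", "SAFETY"),
    ("Italy", "Industry", "SAFETY"),
    ("Italy", "Industry", "TELECOMMUNICATIONS"),
    ("UK", "Industry", "TELECOMMUNICATIONS"),
    ("Netherlands", "Industry", "TELECOMMUNICATIONS"),
    ("UK", "Industry", "STANDARDS"),
]
COUNTRY, TYPE, TOPIC = 0, 1, 2


def contingency(rows, i, j):
    """Joint counts of the categories in positions i and j."""
    return Counter((r[i], r[j]) for r in rows)


def margins(table):
    """Row and column totals of a contingency table."""
    row_tot, col_tot = Counter(), Counter()
    for (a, b), count in table.items():
        row_tot[a] += count
        col_tot[b] += count
    return row_tot, col_tot


table = contingency(PROJECTS, TYPE, TOPIC)
row_tot, col_tot = margins(table)
n = sum(table.values())


def expected(a, b):
    """Count expected under independence of the two variables."""
    return row_tot[a] * col_tot[b] / n


def row_profile(a):
    """Conditional distribution of the column variable given row category a."""
    return {b: table[(a, b)] / row_tot[a] for b in col_tot}


# Observed vs expected counts for every non-empty cell.
for (a, b), observed in sorted(table.items()):
    print(f"{a:15s} {b:22s} observed={observed}  expected={expected(a, b):.2f}")
```

A cell whose observed count is well above the expected one (for example Industry and TELECOMMUNICATIONS in this small extract) suggests "attraction" between the two categories, and a count well below suggests "repulsion". With only 15 records this is purely illustrative.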
20
Multivariate Data Analysis – Motivation
•Describing the association between categorical variables, i.e., understanding the main attraction/repulsion forces between categories
•Identification of profiles of categories (i.e., typical combinations of categories)
Categorical variables (two or more) with many values
Correspondence Analysis (Simple and Multiple)
21
Multivariate Data Analysis
When dealing with many vars and/or obs it may be difficult to
•Describe, analyze and synthesize obs taking into account all the vars, identifying “typical” cases or tendencies in OBS
•Study the relationships among vars and/or synthesize them jointly
Grouping of vars and/or obs according to some “natural” or somehow “intuitive” rules (e.g., the mean for the variables, the region or the wealth for countries, and so on)
These approaches:
•Are subjective
•Reproduce what we already know about the data and do not help us learn more about them
•Sometimes cannot be applied (no natural grouping available, or difficulty in identifying similar patterns)
22
Multivariate Data Analysis
The aim of Multivariate Statistical Techniques is to extract the information contained in a given data set, by simplifying and summarizing observations and/or variables using DATA-DRIVEN TOOLS
The tool – i.e., the compression/simplification/synthesis of data – used to make information available depends upon the aim of the analysis and on the nature of the variables taken into account