engaging everyone with open data science - courses...engaging everyone with open data science iase...

37
Engaging Everyone with Open Data Science IASE 2019 Satellite Conference ”Decision Making Based on Data” 13-16 Aug 2019 in Kuala Lumpur, Malaysia Kimmo Vehkalahti Centre for Social Data Science University of Helsinki, Finland

Upload: others

Post on 25-May-2020

8 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Engaging Everyone with Open Data Science - Courses...Engaging Everyone with Open Data Science IASE 2019 Satellite Conference ”Decision Making Based on Data” 13-16 Aug 2019 in Kuala

Engaging Everyone withOpen Data Science

IASE 2019 Satellite Conference ”Decision Making Based on Data” 13-16 Aug 2019 in Kuala Lumpur, Malaysia

Kimmo VehkalahtiCentre for Social Data ScienceUniversity of Helsinki, Finland

Page 2: Engaging Everyone with Open Data Science - Courses...Engaging Everyone with Open Data Science IASE 2019 Satellite Conference ”Decision Making Based on Data” 13-16 Aug 2019 in Kuala

Outline1. Introduction2. Open Data Science3. Results4. Conclusion5. References

Kimmo Vehkalahti, University of Helsinki, Finland: Engaging Everyone with Open Data ScienceIASE 2019 Satellite Conference ”Decision Making Based on Data” 13-16 Aug 2019 in Kuala Lumpur, Malaysia

Statistics (Google search)

Data Science (Google search) Open Data Science (Google search)

Page 3: Engaging Everyone with Open Data Science - Courses...Engaging Everyone with Open Data Science IASE 2019 Satellite Conference ”Decision Making Based on Data” 13-16 Aug 2019 in Kuala
Page 4: Engaging Everyone with Open Data Science - Courses...Engaging Everyone with Open Data Science IASE 2019 Satellite Conference ”Decision Making Based on Data” 13-16 Aug 2019 in Kuala
Page 5: Engaging Everyone with Open Data Science - Courses...Engaging Everyone with Open Data Science IASE 2019 Satellite Conference ”Decision Making Based on Data” 13-16 Aug 2019 in Kuala
Page 6: Engaging Everyone with Open Data Science - Courses...Engaging Everyone with Open Data Science IASE 2019 Satellite Conference ”Decision Making Based on Data” 13-16 Aug 2019 in Kuala

1. Introduction• The ongoing “Data revolution” sets more requirements for the

students and researchers on all fields of science.• One could say (without exaggerating too much) that

We should all be data scientists.• The term “data science” is a good synonym to statistics.• “Statistics” vs “Data science” is also a question of brand/image.

Kimmo Vehkalahti, University of Helsinki, Finland: Engaging Everyone with Open Data ScienceIASE 2019 Satellite Conference ”Decision Making Based on Data” 13-16 Aug 2019 in Kuala Lumpur, Malaysia

Page 7: Engaging Everyone with Open Data Science - Courses...Engaging Everyone with Open Data Science IASE 2019 Satellite Conference ”Decision Making Based on Data” 13-16 Aug 2019 in Kuala

1. IntroductionAccording to Wikipedia:• “Statistics is a branch of mathematics working with data collection,

organization, analysis, interpretation and presentation.“while• “Data science is a multi-disciplinary field that uses scientific methods,

processes, algorithms and systems to extract knowledge and insightsfrom structured and unstructured data.”

Which one of these definitions sounds more interesting? Perhapssome combination of these would better describe our field?

Kimmo Vehkalahti, University of Helsinki, Finland: Engaging Everyone with Open Data ScienceIASE 2019 Satellite Conference ”Decision Making Based on Data” 13-16 Aug 2019 in Kuala Lumpur, Malaysia

Page 8: Engaging Everyone with Open Data Science - Courses...Engaging Everyone with Open Data Science IASE 2019 Satellite Conference ”Decision Making Based on Data” 13-16 Aug 2019 in Kuala

1. Introduction

Kimmo Vehkalahti, University of Helsinki, Finland: Engaging Everyone with Open Data ScienceIASE 2019 Satellite Conference ”Decision Making Based on Data” 13-16 Aug 2019 in Kuala Lumpur, Malaysia

Page 9: Engaging Everyone with Open Data Science - Courses...Engaging Everyone with Open Data Science IASE 2019 Satellite Conference ”Decision Making Based on Data” 13-16 Aug 2019 in Kuala

1. Introduction

Kimmo Vehkalahti, University of Helsinki, Finland: Engaging Everyone with Open Data ScienceIASE 2019 Satellite Conference ”Decision Making Based on Data” 13-16 Aug 2019 in Kuala Lumpur, Malaysia

Page 10: Engaging Everyone with Open Data Science - Courses...Engaging Everyone with Open Data Science IASE 2019 Satellite Conference ”Decision Making Based on Data” 13-16 Aug 2019 in Kuala

1. Introduction• Teaching of statistics should focus more on practical data science.• Special emphasis needed on data wrangling:

• Preparing the data for the analyses• Looking at the data via simple but clever visualizations

• Important overall learning goals:• Principles and practices of open science and reproducible research• Statistical and algorithmic thinking, sharing of code and data

• State-of-the-art tools like RStudio and R Markdown freely available!• Thus: many reasons why I like to use the term Open Data Science.

Kimmo Vehkalahti, University of Helsinki, Finland: Engaging Everyone with Open Data ScienceIASE 2019 Satellite Conference ”Decision Making Based on Data” 13-16 Aug 2019 in Kuala Lumpur, Malaysia

Page 11: Engaging Everyone with Open Data Science - Courses...Engaging Everyone with Open Data Science IASE 2019 Satellite Conference ”Decision Making Based on Data” 13-16 Aug 2019 in Kuala

2. Open Data Science• New course established to respond to the serious need around:• Introduction to Open Data Science

• Primary target: Doctoral students of social sciences and humanities• Suitable for ”anyone” (master’s / bachelor’s / exchange / post docs)• So far, 100+ participants every time (organized 3 times 2017-2018)

Kimmo Vehkalahti, University of Helsinki, Finland: Engaging Everyone with Open Data ScienceIASE 2019 Satellite Conference ”Decision Making Based on Data” 13-16 Aug 2019 in Kuala Lumpur, Malaysia

Page 12: Engaging Everyone with Open Data Science - Courses...Engaging Everyone with Open Data Science IASE 2019 Satellite Conference ”Decision Making Based on Data” 13-16 Aug 2019 in Kuala

2. Open Data ScienceGeneral learning objectives of the course were stated as follows:• ”After completing this course you will understand the principles and

advantages of using open research tools with open data andunderstand the possibilities of reproducible research.”

and• “You will know how to use RStudio, R Markdown, and GitHub for

these tasks, and know how to learn more of these open softwaretools. You will also know how to apply certain statistical methods ofdata science, that is, data-driven statistics.”

Kimmo Vehkalahti, University of Helsinki, Finland: Engaging Everyone with Open Data ScienceIASE 2019 Satellite Conference ”Decision Making Based on Data” 13-16 Aug 2019 in Kuala Lumpur, Malaysia

Page 13: Engaging Everyone with Open Data Science - Courses...Engaging Everyone with Open Data Science IASE 2019 Satellite Conference ”Decision Making Based on Data” 13-16 Aug 2019 in Kuala

2. Open Data ScienceSeven weeks of study online, w/ one optional computer class / week:1. Start me up!2. Regression and

model validation3. Logistic regression4. Clustering and classification5. Dimensionality reduction techniques6. Analysis of longitudinal data

Kimmo Vehkalahti, University of Helsinki, Finland: Engaging Everyone with Open Data ScienceIASE 2019 Satellite Conference ”Decision Making Based on Data” 13-16 Aug 2019 in Kuala Lumpur, Malaysia

Far from a traditional, systematicstatistics course! A mixture of- Statistical modeling (LM, GLM etc.)- Data analysis (PCA, MCA etc.)

Page 14: Engaging Everyone with Open Data Science - Courses...Engaging Everyone with Open Data Science IASE 2019 Satellite Conference ”Decision Making Based on Data” 13-16 Aug 2019 in Kuala

2. Open Data Science• Dedicated free DataCamp course

”Helsinki Open Data Science”supports learning the R skills

• Easy, interactive way to explore andlearn the weekly R tricks to be used

• R code can then be copied to Rstudio• www.datacamp.com

Kimmo Vehkalahti, University of Helsinki, Finland: Engaging Everyone with Open Data ScienceIASE 2019 Satellite Conference ”Decision Making Based on Data” 13-16 Aug 2019 in Kuala Lumpur, Malaysia

Page 15: Engaging Everyone with Open Data Science - Courses...Engaging Everyone with Open Data Science IASE 2019 Satellite Conference ”Decision Making Based on Data” 13-16 Aug 2019 in Kuala

2. Open Data Science

Kimmo Vehkalahti, University of Helsinki, Finland: Engaging Everyone with Open Data ScienceIASE 2019 Satellite Conference ”Decision Making Based on Data” 13-16 Aug 2019 in Kuala Lumpur, Malaysia

Some views of the DataCamp platform:

Page 16: Engaging Everyone with Open Data Science - Courses...Engaging Everyone with Open Data Science IASE 2019 Satellite Conference ”Decision Making Based on Data” 13-16 Aug 2019 in Kuala

2. Open Data Science

Kimmo Vehkalahti, University of Helsinki, Finland: Engaging Everyone with Open Data ScienceIASE 2019 Satellite Conference ”Decision Making Based on Data” 13-16 Aug 2019 in Kuala Lumpur, Malaysia

Page 17: Engaging Everyone with Open Data Science - Courses...Engaging Everyone with Open Data Science IASE 2019 Satellite Conference ”Decision Making Based on Data” 13-16 Aug 2019 in Kuala

2. Open Data Science• Weekly assignments consist of data wrangling and analysis exercises.• They are practiced on DataCamp and then completed with RStudio.• All the students’ weekly reports are saved and shared on GitHub,

using ready-made templates downloaded on the first week.

• R Markdown is used and the reports are knitted into HTML files.

Kimmo Vehkalahti, University of Helsinki, Finland: Engaging Everyone with Open Data ScienceIASE 2019 Satellite Conference ”Decision Making Based on Data” 13-16 Aug 2019 in Kuala Lumpur, Malaysia

Page 18: Engaging Everyone with Open Data Science - Courses...Engaging Everyone with Open Data Science IASE 2019 Satellite Conference ”Decision Making Based on Data” 13-16 Aug 2019 in Kuala

2. Open Data Science

Kimmo Vehkalahti, University of Helsinki, Finland: Engaging Everyone with Open Data ScienceIASE 2019 Satellite Conference ”Decision Making Based on Data” 13-16 Aug 2019 in Kuala Lumpur, Malaysia

Page 19: Engaging Everyone with Open Data Science - Courses...Engaging Everyone with Open Data Science IASE 2019 Satellite Conference ”Decision Making Based on Data” 13-16 Aug 2019 in Kuala

2. Open Data Science

Kimmo Vehkalahti, University of Helsinki, Finland: Engaging Everyone with Open Data ScienceIASE 2019 Satellite Conference ”Decision Making Based on Data” 13-16 Aug 2019 in Kuala Lumpur, Malaysia

Page 20: Engaging Everyone with Open Data Science - Courses...Engaging Everyone with Open Data Science IASE 2019 Satellite Conference ”Decision Making Based on Data” 13-16 Aug 2019 in Kuala

2. Open Data Science

Kimmo Vehkalahti, University of Helsinki, Finland: Engaging Everyone with Open Data ScienceIASE 2019 Satellite Conference ”Decision Making Based on Data” 13-16 Aug 2019 in Kuala Lumpur, Malaysia

(Petteri is one of my previous & preciousTeaching Assistants and very importanttechnical developers of the IODS course.)

Page 21: Engaging Everyone with Open Data Science - Courses...Engaging Everyone with Open Data Science IASE 2019 Satellite Conference ”Decision Making Based on Data” 13-16 Aug 2019 in Kuala

2. Open Data Science• Weekly peer-reviews of 3 reports for 6 weeks.• Grading with a scale from 0 (fail) to 5 (excellent).• Teachers check the integrity of the peer-reviews.• Course grade is completely based on peer-reviews.

Kimmo Vehkalahti, University of Helsinki, Finland: Engaging Everyone with Open Data ScienceIASE 2019 Satellite Conference ”Decision Making Based on Data” 13-16 Aug 2019 in Kuala Lumpur, Malaysia

Page 22: Engaging Everyone with Open Data Science - Courses...Engaging Everyone with Open Data Science IASE 2019 Satellite Conference ”Decision Making Based on Data” 13-16 Aug 2019 in Kuala

3. Results• The course has been a HUGE success story, so we are quite happy!• Some excerpts from anonymous student feedback: (BOLD ADDED)

”I really enjoyed this course, to be honest this is the best course that Ihad in Helsinki. Combining both DataCamp and Rstudio exercise wasamazing idea, it helped me alot. Even though I have been using R sincecouple of years but during this course I learned more sophisticatedways of programming.”

Kimmo Vehkalahti, University of Helsinki, Finland: Engaging Everyone with Open Data ScienceIASE 2019 Satellite Conference ”Decision Making Based on Data” 13-16 Aug 2019 in Kuala Lumpur, Malaysia

Page 23: Engaging Everyone with Open Data Science - Courses...Engaging Everyone with Open Data Science IASE 2019 Satellite Conference ”Decision Making Based on Data” 13-16 Aug 2019 in Kuala

3. Results• More excerpts from anonymous student feedback: (BOLD ADDED)

”The course was really interesting and hands-on approach worked well.Datacamp exercises were well organised. Need for this kind of appliedstatistical (data science) courses where you’re needed to clean yourdataset and then use correct statistical methods is in high demand.You can get a feel that you’re learning something actually useful for reallife. Learning Github has been really huge benefit.”

Kimmo Vehkalahti, University of Helsinki, Finland: Engaging Everyone with Open Data ScienceIASE 2019 Satellite Conference ”Decision Making Based on Data” 13-16 Aug 2019 in Kuala Lumpur, Malaysia

Page 24: Engaging Everyone with Open Data Science - Courses...Engaging Everyone with Open Data Science IASE 2019 Satellite Conference ”Decision Making Based on Data” 13-16 Aug 2019 in Kuala

3. Results• More excerpts from anonymous student feedback: (BOLD ADDED)

”First of all I want to thank you all about this course which has been thefunniest and most interesting ever. This was my first touch to R, GitHuband Slack. I never thought that I would get this excited aboutsomething, but I did. I noticed that the R environment is an endlessworld and its not as difficult as I thought at first. I will definitelycontinue to learn codes and statistics.”

Kimmo Vehkalahti, University of Helsinki, Finland: Engaging Everyone with Open Data ScienceIASE 2019 Satellite Conference ”Decision Making Based on Data” 13-16 Aug 2019 in Kuala Lumpur, Malaysia

Page 25: Engaging Everyone with Open Data Science - Courses...Engaging Everyone with Open Data Science IASE 2019 Satellite Conference ”Decision Making Based on Data” 13-16 Aug 2019 in Kuala

4. Conclusion• There is a huge need for more (and more) data scientists.• Teaching of statistics should focus more on data science, with a

special emphasis on data wrangling.• The statistics curriculum should be updated and the term “data

science” used as a synonym to statistics.• Our new course gives an excellent example of how to engage students

to learn skills of reproducible, open data science.

Kimmo Vehkalahti, University of Helsinki, Finland: Engaging Everyone with Open Data ScienceIASE 2019 Satellite Conference ”Decision Making Based on Data” 13-16 Aug 2019 in Kuala Lumpur, Malaysia

Page 26: Engaging Everyone with Open Data Science - Courses...Engaging Everyone with Open Data Science IASE 2019 Satellite Conference ”Decision Making Based on Data” 13-16 Aug 2019 in Kuala

5. References (cited in the paper)• Greenacre, M. & Blasius, J., eds. (2006). Multiple Correspondence Analysis and Related

Methods. Boca Raton, Florida: Chapman and Hall/CRC.• Kitchin, R. (2014). The Data Revolution: Big Data, Open Data, Data Infrastructures &

Their Consequences. London: SAGE.• R Core Team (2019). R: A language and environment for statistical computing. Vienna,

Austria: R Foundation for Statistical Computing. https://www.R-project.org/• Vehkalahti, K. & Everitt, B. S. (2019). Multivariate Analysis for the Behavioral Sciences,

Second edition. Boca Raton, Florida: Chapman and Hall/CRC.https://github.com/KimmoVehkalahti/MABS

• Xie, Y., Allaire, J.J. & Grolemund, G. (2018). R Markdown: The Definitive Guide. BocaRaton, FL: Chapman and Hall/CRC.

Kimmo Vehkalahti, University of Helsinki, Finland: Engaging Everyone with Open Data ScienceIASE 2019 Satellite Conference ”Decision Making Based on Data” 13-16 Aug 2019 in Kuala Lumpur, Malaysia

Page 27: Engaging Everyone with Open Data Science - Courses...Engaging Everyone with Open Data Science IASE 2019 Satellite Conference ”Decision Making Based on Data” 13-16 Aug 2019 in Kuala

Engaging Everyone withOpen Data Science

Thank you for your attention!

Kimmo Vehkalahti, University of Helsinki, Finland: Engaging Everyone with Open Data ScienceIASE 2019 Satellite Conference ”Decision Making Based on Data” 13-16 Aug 2019 in Kuala Lumpur, Malaysia

Page 28: Engaging Everyone with Open Data Science - Courses...Engaging Everyone with Open Data Science IASE 2019 Satellite Conference ”Decision Making Based on Data” 13-16 Aug 2019 in Kuala

Multivariate Analysis for Data Scientists

Kimmo Vehkalahti

University of Helsinki, Finland

IWMS-2019: The 27th International Workshop on Matrices and Statistics

6-9 June 2019 | Shanghai, China

… … … Thanks!J Just a few slides from my invited talk in the 27th IWMS (Shanghai, China) in June 2019, entitled:

Page 29: Engaging Everyone with Open Data Science - Courses...Engaging Everyone with Open Data Science IASE 2019 Satellite Conference ”Decision Making Based on Data” 13-16 Aug 2019 in Kuala

Multivariate Analysisfor the

Behavioral SciencesSecond Edition

Kimmo Vehkalahti & Brian S. Everitt:

Chapman and Hall/CRC Press, 2019

Page 30: Engaging Everyone with Open Data Science - Courses...Engaging Everyone with Open Data Science IASE 2019 Satellite Conference ”Decision Making Based on Data” 13-16 Aug 2019 in Kuala

Table of Contents

1. Data, Measurement, and Models2. Looking at Data3. Simple Linear and Locally Weighted Regression4. Multiple Linear Regression5. Generalized Linear Models6. Applying Logistic Regression7. Survival Analysis8. Analysis of Longitudinal Data I: Graphical Displays and Summary Measure Approach9. Analysis of Longitudinal Data II: Linear Mixed Effects Models for Normal Response Variables10.Analysis of Longitudinal Data III: Non-Normal Responses11.Missing Values12.Multivariate Data and Multivariate Analysis13.Principal Components Analysis14.Multidimensional Scaling and Correspondence Analysis15.Exploratory Factor Analysis16.Confirmatory Factor Analysis and Structural Equation Models17.Cluster Analysis18.Grouped Multivariate Data

Page 31: Engaging Everyone with Open Data Science - Courses...Engaging Everyone with Open Data Science IASE 2019 Satellite Conference ”Decision Making Based on Data” 13-16 Aug 2019 in Kuala

Let us close with an example of MDS from Chapter 14:

• A view of the book (pp. 278-279):Data (dissimilarity matrix), analysis,figure, and interpretations

• A view of the same example from materialfreely available online on GitHub:Analysis and figure with R Markdown

Page 32: Engaging Everyone with Open Data Science - Courses...Engaging Everyone with Open Data Science IASE 2019 Satellite Conference ”Decision Making Based on Data” 13-16 Aug 2019 in Kuala
Page 33: Engaging Everyone with Open Data Science - Courses...Engaging Everyone with Open Data Science IASE 2019 Satellite Conference ”Decision Making Based on Data” 13-16 Aug 2019 in Kuala
Page 34: Engaging Everyone with Open Data Science - Courses...Engaging Everyone with Open Data Science IASE 2019 Satellite Conference ”Decision Making Based on Data” 13-16 Aug 2019 in Kuala

Let us close with an example of MDS from Chapter 14:

• A view of the book (pp. 278-279):Data (dissimilarity matrix), analysis,figure, and interpretations

• A view of the same example from materialfreely available online on GitHub:Analysis and figure with R Markdown

Page 35: Engaging Everyone with Open Data Science - Courses...Engaging Everyone with Open Data Science IASE 2019 Satellite Conference ”Decision Making Based on Data” 13-16 Aug 2019 in Kuala
Page 36: Engaging Everyone with Open Data Science - Courses...Engaging Everyone with Open Data Science IASE 2019 Satellite Conference ”Decision Making Based on Data” 13-16 Aug 2019 in Kuala
Page 37: Engaging Everyone with Open Data Science - Courses...Engaging Everyone with Open Data Science IASE 2019 Satellite Conference ”Decision Making Based on Data” 13-16 Aug 2019 in Kuala

Thanks!谢谢