Big Data
Workshop
Bettina Berendt
Department of Computer Science, KU Leuven, Belgium
http://people.cs.kuleuven.be/~bettina.berendt/
St. John's International School, Waterloo, Belgium
April 23rd, 2018
2
Who am I?
3
Goals and non-goals
• Goals
▫ Talk about Big Data as a critical data scientist
▫ On a background of what science is & what “critical” means in this context
▫ Involve you in being critical and constructive
• Non-goals (selection)
▫ Go into depth about privacy and data protection
– although these topics are unavoidable in the Big Data context
Big Data is ...
(from Alexandra Roche and Josefine Droste’s presentation)
5
Big Data is …
• “the growth in the volume of structured and unstructured data, the speed at which it is created and collected, and the scope of how many data points are collected”
• Potential for personalizing learning
• Inherits bias
• Surveillance
• Ethical dilemmas
• Transparency (pro and con), privacy
(Alexandra Roche & Josefine Droste)
Science and being critical are ...
7
What is science? (1)
• A systematic enterprise that builds and organizes knowledge in the form of testable explanations and predictions about the universe.
• the word "science" became increasingly associated with what is today known as the scientific method, a structured way to study the natural world.
• Contemporary science is typically subdivided into the natural sciences which study nature in the broadest sense, the social sciences which study people and societies, and the formal sciences like mathematics which study abstract concepts. […] Disciplines which use science like engineering and medicine may also be considered to be applied sciences.
• Science is related to research, and is normally organized by a university, a college, or a research institute.
(Wikipedia: “Science”)
8
(1st part of pic)
9
What is science? (2)
“Wissenschaft ist, wenn man genauer nachfragt.”
≈ Science happens when you ask again, and ask more precisely.
(author unknown to me)
10
(1st part of pic)
Big Data is ...
… something we usually encounter via statements
12
Typical Big Data statements (fictitious, but true to style)
① The average Belgian pupil now spends 3 hours a day chatting.
② Pupils who spend more than 3 hours a day chatting “like” Converse sneakers and Dunkin Donuts.
③ People who “like” Converse and Dunkin Donuts are less intelligent.
13
Typical Big Data statements (4):
From Psychometrics Centre 2013 to Cambridge Analytica 2016
14
15
Typical Big Data statements (5) (from the CEM Brochure)
• Maximise learning potential
• The CEM IBE computer-adaptive assessment provides an excellent research-informed baseline to help you predict future performance (in IB Diploma examinations for each subject)
• The CEM IBE computer-adaptive assessment measures students on three key cognitive areas which research shows are linked to later academic outcomes: maths, vocabulary, non-verbal
• Once you have students’ final IB Diploma results, you can return this data to us
• The full CEM IBE product includes additional … questionnaires aiming to understand your students’ motivations, interests and aspirations. (questions about views on cultural background, way of life, social status, …)
16
So how …
• … can we understand such statements scientifically?
• … can we criticise them scientifically?
Big Data is ...
… data
18
“Data speak for themselves.”
• “With enough data, the numbers speak for themselves.” Anderson, C. (2008).
• “Quantitative data [...] are independent of interpretation; [...] they often demand an interpretation that transcends the quantitative realm.” Moretti, F. (2007), p. 30
19
Data?
• datum = given
• “data refer to those elements that are taken [abstracted from phenomena]: extracted through observations, computations, experiments, and record keeping”, “selected from nature by the scientist in accordance with his [sic] purpose” (Kitchin, 2014)
Capta!
20
Impact of measurement methods
21
Who or what “speaks”?
Who or what “decides”?
22
Summary: Data cannot speak for themselves
• Data are never given (by nature), but taken (by a researcher or other data collector)
▫ With conscious or unconscious purposes/agendas
▫ In some context
• Data and analyses of them require interpretation
• Big Data are samples too
• All data have quality issues; in Big Data, we often do not know these
• Combining datasets can introduce biases and errors
23
Parking lot science
24
Some more examples of data biases and parking lot science
• Facebook likes, real-world likes
• Facebook self-presentation: only the good things ...
• Restrictions on search in Twitter
▫ Research focus on current and recent events?!
• “Trending topics” algorithm in Twitter based on burstiness
▫ Suppression of persistent topics?!
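Why a burstiness-based ranking can suppress persistent topics, as a minimal sketch. The score below (latest count divided by the average of earlier counts) is a hypothetical stand-in chosen for illustration, not Twitter's actual algorithm, and the counts are made up.

```python
def burstiness(counts):
    """Latest count divided by the mean of all earlier counts (hypothetical score)."""
    *history, latest = counts
    baseline = sum(history) / len(history)
    return latest / baseline

# Made-up hourly mention counts for two topics.
persistent_topic = [5000, 5000, 5000, 5000]  # large but steady
sudden_topic = [10, 10, 10, 400]             # small but spiking

scores = {
    "persistent": burstiness(persistent_topic),
    "sudden": burstiness(sudden_topic),
}
# The topic with the highest score is the one that would be shown as "trending".
trending = max(scores, key=scores.get)

print(scores)
print("trending:", trending)
```

Under this assumed score, the steady topic never "trends" although it has far more mentions, so a dataset built from trending topics would under-represent it.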
Big Data is ...
… statistics (on steroids)
26
What should you ask a statistic?
27
What should you ask this statement?
The average Belgian pupil now spends 3 hours a day chatting.
28
How to talk back to a statistic (1)
(building on Huff’s final chapter)
1. Who says so?
2. How do they know?
▫ How were data collected and analysed?
▫ In which contexts?
3. Did somebody change the subject?
▫ What are the actual data?
4. Does it make sense?
29
So …?
1. Who says so?
2. How do they know?
▫ How were data collected and analysed?
3. Did somebody change the subject?
▫ What are the actual data?
4. Does it make sense?
The average Belgian pupil now spends 3 hours a day chatting.
30
Huff’s questions in more detail
1. Who says so?
▫ What could be their conscious or unconscious biases?
▫ Do they use unqualified words (“average”: mean, median, …?)
▫ Do they use OK names? (“The survey results from scientists from the University of … show …”)
2. How do they know?
▫ Sample size, selection bias?
▫ Correlation size, significance?
▫ Baseline values?
▫ Did external factors change? E.g. frequency of reporting?
3. Did somebody change the subject? / What are the actual data?
▫ Observation or self-report?
▫ Change over time or across data sets in how basic measures are defined
▫ Correlation or causation?
4. Does it make sense?
▫ Be wary of “exact-sounding numbers” (40.13 Euros to eat per week, average family with 3.5 children)
▫ Extrapolation
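Question 1 asks what “average” means. A minimal sketch with invented numbers (the ten values are made up for illustration, not survey data) shows how one data set can support both “3 hours” and “1 hour”:

```python
from statistics import mean, median

# Hypothetical daily chatting time in hours for ten pupils (invented numbers).
hours = [0.5, 0.5, 1, 1, 1, 1, 1, 2, 10, 12]

average_as_mean = mean(hours)      # pulled upwards by two heavy users
average_as_median = median(hours)  # the pupil in the middle of the ranking

print(f"mean   = {average_as_mean:.1f} hours")
print(f"median = {average_as_median:.1f} hours")
```

Both numbers are legitimate “averages”; which one a statement uses is exactly what to ask.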
31
Empiricism and apophenia
32
Empiricism and apophenia: correlation, causation, and instrumentality
33
Correlation vs. causation
• The current scientific consensus is that the only way to properly demonstrate causation is to do an experiment.
• Many Big Data sets – especially those concerning people – are not experimental data, because they have been collected as observations in the field, in all the diverse contexts in which people operate.
• This means they can only show correlation.
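A minimal simulation of why observational data alone cannot establish causation. It assumes a hidden factor `z` that drives both `x` and `y`; all variables and numbers are hypothetical.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 5000

# Hypothetical hidden factor z drives both x and y; x never influences y.
z = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + rng.gauss(0, 0.5) for zi in z]
y = [zi + rng.gauss(0, 0.5) for zi in z]

r_observed = pearson(x, y)

# Removing z from both variables leaves only their independent noise.
x_rest = [xi - zi for xi, zi in zip(x, z)]
y_rest = [yi - zi for yi, zi in zip(y, z)]
r_without_z = pearson(x_rest, y_rest)

print(f"correlation of x and y:           {r_observed:.2f}")
print(f"after removing the hidden factor: {r_without_z:.2f}")
```

In field data the hidden factor is usually not recorded, so it cannot simply be subtracted as it is here. That is what an experiment controls for.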
34
How to talk back to a statistic (2)
1. Who says so?
2. How do they know?
▫ How were data collected and analysed?
3. Did somebody change the subject?
▫ What are the actual data?
▫ Correlation or causation?
4. Does it make sense?
35
“Correlation replaces causation”?!
(1) Good enough for business logic
36
Correlation replaces causation?!
(2) But deficient for explanation (can we really explain German history like this?)
37
Correlation replaces causation?!
(3) What about predictions that affect someone’s self-image?
38
Questions you should ask any inferential statistic (e.g., prediction models)
• How good is the model?
• There are many relevant measures of “goodness”.
• In the following, only a small selection.
39
What is the measure, and is it statistically significant?
[figure caption, from paper]
• Prediction accuracy of regression for numeric attributes and traits expressed by the Pearson correlation coefficient between predicted and actual attribute values;
• all correlations are significant at the P < 0.001 level.
• The transparent bars indicate the questionnaire’s baseline accuracy, expressed in terms of test–retest reliability.
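“Significant” and “large” are different questions. The sketch below uses hypothetical values for `r` and `n` (not taken from the paper) and the standard test statistic for a correlation coefficient, with a normal approximation for the critical value:

```python
from math import sqrt
from statistics import NormalDist

def t_statistic(r, n):
    """Test statistic for the hypothesis 'true correlation is zero'."""
    return r * sqrt((n - 2) / (1 - r ** 2))

# Hypothetical values chosen for illustration; not taken from the paper.
r, n = 0.05, 10_000

t = t_statistic(r, n)
# For large n the t distribution is close to normal; two-sided P < 0.001.
critical = NormalDist().inv_cdf(1 - 0.001 / 2)

print(f"t = {t:.2f}, critical value = {critical:.2f}")
print(f"significant at P < 0.001: {abs(t) > critical}")
print(f"share of variance explained (r squared): {r ** 2:.4f}")
```

With enough data points even a very weak correlation passes the significance test, so the size of the coefficient still has to be judged separately.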
40
But what does the correlation value itself say?
(Wikipedia: “Correlation”)
41
But what does the correlation value itself say?
(Wikipedia: “Correlation”)
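One thing the Pearson coefficient does not say is whether two variables are related in a non-linear way. A small sketch with made-up numbers, in which `y` is fully determined by `x`:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

x = list(range(-5, 6))     # -5, -4, ..., 5
y = [xi ** 2 for xi in x]  # y is completely determined by x

r = pearson(x, y)
print(f"r = {r:.2f}")
```

The coefficient only measures how well a straight line describes the relationship, so a value near zero is not evidence that nothing is going on.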
42
How is a classification model built?
43
How is a classification model built?
44
How good is the model? (= How is a classification model evaluated?)
Confusion matrix
45
How good?
Overall accuracy = (4+900)/1010 = 89.5%
Precision for “criminals” = 4/104 = 3.8%
Recall for “criminals” = 4/10 = 40%
Accuracy of model “always innocent” = 1000/1010 = 99%
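The same figures can be recomputed from the four cells of the confusion matrix. The cell counts below are inferred from the fractions above, since the matrix itself is only available as a picture; treat them as a reconstruction.

```python
# Cell counts implied by the fractions on the slide (1010 cases in total).
tp = 4    # predicted "criminal", actually criminal
fp = 100  # predicted "criminal", actually innocent (104 predicted minus 4)
fn = 6    # predicted "innocent", actually criminal (10 actual minus 4)
tn = 900  # predicted "innocent", actually innocent

total = tp + fp + fn + tn

accuracy = (tp + tn) / total
precision = tp / (tp + fp)  # of those flagged, how many are criminals?
recall = tp / (tp + fn)     # of the criminals, how many were flagged?
# Baseline: calling everyone innocent is right for every innocent person.
always_innocent_accuracy = (tn + fp) / total

print(f"accuracy          = {accuracy:.1%}")
print(f"precision         = {precision:.1%}")
print(f"recall            = {recall:.1%}")
print(f"'always innocent' = {always_innocent_accuracy:.1%}")
```

The trivial baseline beats the model on accuracy, which is why accuracy alone says little when one class is rare.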
46
How to talk back to a statistic (3)
1. Who says so?
2. How do they know?
▫ How were data collected and analysed?
▫ How good is the model?
3. Did somebody change the subject?
▫ What are the actual data? Correlation or causation?
4. Does it make sense?
47
Recap (from the CEM Brochure)
• Maximise learning potential
• The CEM IBE computer-adaptive assessment provides an excellent research-informed baseline to help you predict future performance (in IB Diploma examinations for each subject)
• The CEM IBE computer-adaptive assessment measures students on three key cognitive areas which research shows are linked to later academic outcomes: maths, vocabulary, non-verbal
• Once you have students’ final IB Diploma results, you can return this data to us
• The full CEM IBE product includes additional … questionnaires aiming to understand your students’ motivations, interests and aspirations. (questions about views on cultural background, way of life, social status, …)
48
How to talk back to a statistic (4)
1. Who says so?
2. How do they know?
▫ How were data collected and analysed?
▫ How good is the model?
3. Did somebody change the subject?
▫ What are the actual data? Correlation or causation?
4. Does it make sense?
5. What is actually being claimed?
49
Accumulation of errors
[Figure labels: “Statistical model 1”; “… and if they see this ad, they will vote for Trump”]
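A minimal sketch of how errors accumulate when the output of one statistical model feeds the next. The two-step chain, the accuracies, and the independence of errors are all assumptions made for illustration; none of them are taken from the figure.

```python
def chain_accuracy(accuracies):
    """Probability that every step in a chain is right, assuming independent errors."""
    result = 1.0
    for accuracy in accuracies:
        result *= accuracy
    return result

# Hypothetical accuracies for a two-step chain (made-up numbers):
# step 1 infers a trait from likes, step 2 predicts the reaction to an ad.
steps = [0.8, 0.7]
combined = chain_accuracy(steps)

print(f"combined accuracy = {combined:.0%}")
```

Under these assumptions the chain as a whole is less reliable than its weakest step.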
Big Data is ...
… business models
51
Recap (from the CEM Brochure)
• Maximise learning potential
• The CEM IBE computer-adaptive assessment provides an excellent research-informed baseline to help you predict future performance (in IB Diploma examinations for each subject)
• The CEM IBE computer-adaptive assessment measures students on three key cognitive areas which research shows are linked to later academic outcomes: maths, vocabulary, non-verbal
• Once you have students’ final IB Diploma results, you can return this data to us
• The full CEM IBE product includes additional … questionnaires aiming to understand your students’ motivations, interests and aspirations. (questions about views on cultural background, way of life, social status, …)
52
How to talk back to a statistic (5)
1. Who says so?
▫ What (else) are they interested in?
2. How do they know?
▫ How were data collected and analysed?
▫ How good is the model?
3. Did somebody change the subject?
▫ What are the actual data? Correlation or causation?
4. Does it make sense?
5. What is actually being claimed?
53
NB: Can I see my data?
What if it’s wrong?
• You have data access rights (and other rights) under European data protection legislation.
• But that’s another workshop …
Big Data is ...
… an understanding of the past used to justify what some decision maker wants to do in the future.
(Geoffrey Rockwell, personal communication, cited from memory)
55
Which brings us to …
• … the 2nd meaning of “critical” in science
• “Critical theory” (Habermas, Adorno, …)
▫ (social) science as a practical philosophy aiming at societal change with the goal of increasing the autonomy / self-determination of people
▫ (A view of “critical” not as widely shared as the first one)
Here:
• Is data the only answer?
• What is the question?
Let’s be practical philosophers and scientists
… and we’ll use a different example now
57
Belgium: top?
http://ec.europa.eu/eurostat/tgm/refreshTableAction.do?tab=table&plugin=1&pcode=ten00063&language=en
58
Belgium: flop?
59
One reason: Belgians don’t excel at sorting waste
60
Group work!
• Group 1: You are a Fitbit-wristband / smart-home company and want to develop predictive analytics for identifying who will have problems separating their trash properly, in order to give them helpful alerts. You may use any data you want. Prepare a pitch for your business model.
• Group 2: You are a company that wants to use Big Data, but avoid processing personal data. Develop an idea for how to best use these data. Prepare a pitch for your business model.
• Group 3: You are a civil society organisation that wants to improve the trash situation without recourse to Big Data. Prepare a pitch for your idea.
61
Note 1: Definition of “recycling rate”
Recycling rates for packaging waste (in %)
'Recycling rate' for the purposes of Article 6(1) of Directive 94/62/EC means the total quantity of recycled packaging waste, divided by the total quantity of generated packaging waste.
http://ec.europa.eu/eurostat/web/products-datasets/product?code=ten00063
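The definition of the recycling rate can be written as a small function. The function name and the quantities are hypothetical, chosen for illustration; they are not Eurostat figures.

```python
def recycling_rate(recycled_tonnes, generated_tonnes):
    """Recycled packaging waste as a percentage of generated packaging waste."""
    if generated_tonnes <= 0:
        raise ValueError("generated quantity must be positive")
    return 100 * recycled_tonnes / generated_tonnes

# Hypothetical quantities, not Eurostat figures.
rate = recycling_rate(recycled_tonnes=800, generated_tonnes=1000)
print(f"recycling rate = {rate:.1f}%")
```

The rate depends on how both quantities are measured, so the question “What are the actual data?” applies to the numerator and the denominator alike.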
62
Note 2: Recycling science
63
Some more ideas
64
Shops
65
Re-use
66
Activists
67
“Science activists”
68
Group work!
• Group 1: You are a Fitbit-wristband / smart-home company and want to develop predictive analytics for identifying who will have problems separating their trash properly, in order to give them helpful alerts. You may use any data you want. Prepare a pitch for your business model.
• Group 2: You are a company that wants to use Big Data, but avoid processing personal data. Develop an idea for how to best use these data. Prepare a pitch for your business model.
• Group 3: You are a civil society organisation that wants to improve the trash situation without recourse to Big Data. Prepare a pitch for your idea.
Thank you!
Questions? Email me!
http://people.cs.kuleuven.be/~bettina.berendt/
70
References
• Anderson, C. (2008). The end of theory: The data deluge makes the scientific method obsolete. Wired 16.07. Available at http://edge.org/3rd_culture/anderson08/anderson08_index.html
• pp. 42ff: Degeling, M. & Berendt, B. (2017). What is wrong about Robocops as consultants? A technology-centric critique of predictive policing. AI & Society. May 2017 Online First.
• pp. 8, 10: Huber, O. (). Das psychologische Experiment: Eine Einführung.
• Huff, D. (1954). How to Lie with Statistics. New York: W.W. Norton & Company, Inc.
• Kitchin, R. (2014). The Data Revolution. Big Data, Open Data, Data Infrastructures & Their Consequences. London: Sage.
• pp. 13, 37, 39: Kosinski, M., Stillwell, D., & Graepel, T. (2013). Private traits and attributes are predictable from digital records of human behavior. Proceedings of the National Academy of Sciences, 110(15), 5802–5805.
• Moretti, F. (2005). Graphs, Maps, Trees: Abstract Models for Literary History. London: Verso. p. 30 (cited from the paperback published in 2007)
• pp. 13f, 49: www.theguardian.com/commentisfree/2018/mar/20/brenda-the-civil-disobedience-penguin-on-cambridge-analytica-the-real-was-getting-caught
• pp. 31f.: From http://www.tylervigen.com/spurious-correlations
• Further sources on the slides themselves.
• My apologies for having mislaid some photo/picture URLs, and thanks to those who provide(d) them online!
Not cited, but also potentially interesting:
• Berendt, B. (2015). Big Capta, Bad Science? On two recent books on “Big Data” and its revolutionary potential. http://people.cs.kuleuven.be/~bettina.berendt/Reviews/BigData.pdf
• boyd, d. & Crawford, K. (2012). Critical questions for Big Data. Information, Communication & Society, 15:5, 662-679, DOI: 10.1080/1369118X.2012.678878.