Download - BDACA1516s2 - Lecture1
Introducing. . . What is Big Data? Methods The schedule
Big Data and Automated Content AnalysisWeek 1 – Wednesday
»Introduction«
Damian Trilling
[email protected]@damian0604
www.damiantrilling.net
Afdeling CommunicatiewetenschapUniversiteit van Amsterdam
30 March 2016
Big Data and Automated Content Analysis Damian Trilling
Introducing. . . What is Big Data? Methods The schedule
Today1 Introducing. . .
. . . the people
2 What is Big Data?DefinitionsAre we doing Big Data research?
3 MethodsWhich techniques?Which tools?Examples from last year
4 The scheduleNext meetingsPreview
Big Data and Automated Content Analysis Damian Trilling
Introducing. . . What is Big Data? Methods The schedule
. . . the people
Introducing. . .. . . the people
Big Data and Automated Content Analysis Damian Trilling
Introducing. . . What is Big Data? Methods The schedule
. . . the people
Introducing. . .Damian
dr. Damian TrillingAssistant Professor Political Communication &Journalism
• studied Communication Science in Münsterand at the VU 2003–2009
• PhD candidate @ ASCoR 2009–2012
• interested in political communication andjournalism in a changing media environmentand in innovative (digital, large-scale,computational) research methods
@damian0604 [email protected] 8th floor www.damiantrilling.net
Big Data and Automated Content Analysis Damian Trilling
Introducing. . . What is Big Data? Methods The schedule
. . . the people
Introducing. . .Theo
dr. Theo AraujoAssistant Professor of Communication ScienceUniversiteit van Amsterdam
• interested in corporate and persuasivecommunication, and on how organisations,consumers and stakeholders use social mediaacross countries and cultures. also interestedin innovative (digital, large-scale,computational) research methods
@theoaraujo [email protected]
Big Data and Automated Content Analysis Damian Trilling
Introducing. . . What is Big Data? Methods The schedule
. . . the people
Introducing. . .You
Your name?Your background?Your reason to follow this course?
Big Data and Automated Content Analysis Damian Trilling
Introducing. . . What is Big Data? Methods The schedule
Definitions
What is Big Data?
Big Data and Automated Content Analysis Damian Trilling
Introducing. . . What is Big Data? Methods The schedule
Definitions
What is Big Data?
A simple technical definition could be:Everything that needs so much computational power and/orstorage that you cannot do it on a regular computer.
Big Data and Automated Content Analysis Damian Trilling
Introducing. . . What is Big Data? Methods The schedule
Definitions
What is Big Data?
A simple technical definition could be:Everything that needs so much computational power and/orstorage that you cannot do it on a regular computer.
Big Data and Automated Content Analysis Damian Trilling
Introducing. . . What is Big Data? Methods The schedule
Definitions
What is Big Data?
Vis, 2013
• “commercial” definition (Gartner): “’Big data’ is high-volume,-velocity and -variety information assets that demandcost-effective, innovative forms of information processing forenhanced insight and decision making”
• boyd & Crawford definition:1 Technology: maximizing computation power and algorithmic
accuracy to gather, analyze, link, and compare large data sets.2 Analysis: drawing on large data sets to identify patterns in
order to make economic, social, technical, and legal claims.3 Mythology: the widespread belief that large data sets offer a
higher form of intelligence and knowledge that can generateinsights that were previously impossible, with the aura of truth,objectivity, and accuracy.
Big Data and Automated Content Analysis Damian Trilling
Introducing. . . What is Big Data? Methods The schedule
Definitions
What is Big Data?Vis, 2013
• “commercial” definition (Gartner): “’Big data’ is high-volume,-velocity and -variety information assets that demandcost-effective, innovative forms of information processing forenhanced insight and decision making”
• boyd & Crawford definition:1 Technology: maximizing computation power and algorithmic
accuracy to gather, analyze, link, and compare large data sets.2 Analysis: drawing on large data sets to identify patterns in
order to make economic, social, technical, and legal claims.3 Mythology: the widespread belief that large data sets offer a
higher form of intelligence and knowledge that can generateinsights that were previously impossible, with the aura of truth,objectivity, and accuracy.
Big Data and Automated Content Analysis Damian Trilling
Introducing. . . What is Big Data? Methods The schedule
Definitions
What is Big Data?Vis, 2013
• “commercial” definition (Gartner): “’Big data’ is high-volume,-velocity and -variety information assets that demandcost-effective, innovative forms of information processing forenhanced insight and decision making”
• boyd & Crawford definition:1 Technology: maximizing computation power and algorithmic
accuracy to gather, analyze, link, and compare large data sets.2 Analysis: drawing on large data sets to identify patterns in
order to make economic, social, technical, and legal claims.3 Mythology: the widespread belief that large data sets offer a
higher form of intelligence and knowledge that can generateinsights that were previously impossible, with the aura of truth,objectivity, and accuracy.
Big Data and Automated Content Analysis Damian Trilling
Introducing. . . What is Big Data? Methods The schedule
Definitions
What is Big Data?Vis, 2013
• “commercial” definition (Gartner): “’Big data’ is high-volume,-velocity and -variety information assets that demandcost-effective, innovative forms of information processing forenhanced insight and decision making”
• boyd & Crawford definition:1 Technology: maximizing computation power and algorithmic
accuracy to gather, analyze, link, and compare large data sets.2 Analysis: drawing on large data sets to identify patterns in
order to make economic, social, technical, and legal claims.3 Mythology: the widespread belief that large data sets offer a
higher form of intelligence and knowledge that can generateinsights that were previously impossible, with the aura of truth,objectivity, and accuracy.
Big Data and Automated Content Analysis Damian Trilling
Introducing. . . What is Big Data? Methods The schedule
Definitions
Implications & criticism
boyd & Crawford, 2012
1 Big Data changes the definition of knowledge2 Claims to objectivity and accuracy are misleading3 Bigger data are not always better data4 Taken out of context, Big Data loses its meaning5 Just because it is accessible does not make it ethical6 Limited access to Big Data creates new digital divides
Big Data and Automated Content Analysis Damian Trilling
Introducing. . . What is Big Data? Methods The schedule
Definitions
Implications & criticism
boyd & Crawford, 2012
1 Big Data changes the definition of knowledge2 Claims to objectivity and accuracy are misleading3 Bigger data are not always better data4 Taken out of context, Big Data loses its meaning5 Just because it is accessible does not make it ethical6 Limited access to Big Data creates new digital divides
Big Data and Automated Content Analysis Damian Trilling
Introducing. . . What is Big Data? Methods The schedule
Definitions
APIs, researchers and tools make Big Data
Vis, 2013Inevitable influences of:
• APIs• filtering, search strings, . . .• changing services over time• organizations that provide the data
Big Data and Automated Content Analysis Damian Trilling
Introducing. . . What is Big Data? Methods The schedule
Definitions
APIs, researchers and tools make Big Data
Vis, 2013Inevitable influences of:
• APIs• filtering, search strings, . . .• changing services over time• organizations that provide the data
Big Data and Automated Content Analysis Damian Trilling
Introducing. . . What is Big Data? Methods The schedule
Definitions
Epistemologies and paradigm shifts
Kitchin, 2014
• (Reborn) empiricism: purely inductive, correlation is enough• Data-driven science: knowledge discovery guided by theory• Computational social science and digital humanities: employBig Data research within existing epistemologies
• DH: descriptive statistics, visualizations• CSS: prediction and simulation
Big Data and Automated Content Analysis Damian Trilling
Introducing. . . What is Big Data? Methods The schedule
Definitions
Epistemologies and paradigm shifts
Kitchin, 2014
• (Reborn) empiricism: purely inductive, correlation is enough
• Data-driven science: knowledge discovery guided by theory• Computational social science and digital humanities: employBig Data research within existing epistemologies
• DH: descriptive statistics, visualizations• CSS: prediction and simulation
Big Data and Automated Content Analysis Damian Trilling
Introducing. . . What is Big Data? Methods The schedule
Definitions
Epistemologies and paradigm shifts
Kitchin, 2014
• (Reborn) empiricism: purely inductive, correlation is enough• Data-driven science: knowledge discovery guided by theory
• Computational social science and digital humanities: employBig Data research within existing epistemologies
• DH: descriptive statistics, visualizations• CSS: prediction and simulation
Big Data and Automated Content Analysis Damian Trilling
Introducing. . . What is Big Data? Methods The schedule
Definitions
Epistemologies and paradigm shifts
Kitchin, 2014
• (Reborn) empiricism: purely inductive, correlation is enough• Data-driven science: knowledge discovery guided by theory• Computational social science and digital humanities: employBig Data research within existing epistemologies
• DH: descriptive statistics, visualizations• CSS: prediction and simulation
Big Data and Automated Content Analysis Damian Trilling
Introducing. . . What is Big Data? Methods The schedule
Are we doing Big Data research?
Are we doing Big Data research in this course?
Depends on the definition
• Not if we take a definition that only focuses on computingpower and the amount of data
• But: We are using the same techniques. And they scale well.• Oh, and about that high-performance computing in the cloud:We actually do have access to that, so if someone has a reallygreat idea. . .
Big Data and Automated Content Analysis Damian Trilling
Introducing. . . What is Big Data? Methods The schedule
Are we doing Big Data research?
Are we doing Big Data research in this course?
Depends on the definition
• Not if we take a definition that only focuses on computingpower and the amount of data
• But: We are using the same techniques. And they scale well.• Oh, and about that high-performance computing in the cloud:We actually do have access to that, so if someone has a reallygreat idea. . .
Big Data and Automated Content Analysis Damian Trilling
Introducing. . . What is Big Data? Methods The schedule
Are we doing Big Data research?
Are we doing Big Data research in this course?
Depends on the definition
• Not if we take a definition that only focuses on computingpower and the amount of data
• But: We are using the same techniques. And they scale well.
• Oh, and about that high-performance computing in the cloud:We actually do have access to that, so if someone has a reallygreat idea. . .
Big Data and Automated Content Analysis Damian Trilling
Introducing. . . What is Big Data? Methods The schedule
Are we doing Big Data research?
Are we doing Big Data research in this course?
Depends on the definition
• Not if we take a definition that only focuses on computingpower and the amount of data
• But: We are using the same techniques. And they scale well.• Oh, and about that high-performance computing in the cloud:We actually do have access to that, so if someone has a reallygreat idea. . .
Big Data and Automated Content Analysis Damian Trilling
Introducing. . . What is Big Data? Methods The schedule
Methods
Big Data and Automated Content Analysis Damian Trilling
Introducing. . . What is Big Data? Methods The schedule
Which techniques?
What we will learn the next weeks
1. How to collect dataAPIs, scrapers and crawlers, feeds, databases, . . .Storage in different file formats
2. How to analyze dataSentiment analysis, automated content analysis, regularexpressions, natural language processing, cluster analysis, machinelearning, network analysis
Big Data and Automated Content Analysis Damian Trilling
Introducing. . . What is Big Data? Methods The schedule
Which techniques?
What we will learn the next weeks
1. How to collect dataAPIs, scrapers and crawlers, feeds, databases, . . .Storage in different file formats
2. How to analyze dataSentiment analysis, automated content analysis, regularexpressions, natural language processing, cluster analysis, machinelearning, network analysis
Big Data and Automated Content Analysis Damian Trilling
Introducing. . . What is Big Data? Methods The schedule
Which techniques?
What we will learn the next weeks
1. How to collect dataAPIs, scrapers and crawlers, feeds, databases, . . .Storage in different file formats
2. How to analyze dataSentiment analysis, automated content analysis, regularexpressions, natural language processing, cluster analysis, machinelearning, network analysis
Big Data and Automated Content Analysis Damian Trilling
Introducing. . . What is Big Data? Methods The schedule
Which tools?
The methods
The tools we use for thisThe programming language Python (and the huge amount ofPython modules others already wrote).
Big Data and Automated Content Analysis Damian Trilling
Introducing. . . What is Big Data? Methods The schedule
Which tools?
A language, not a program:Python
Big Data and Automated Content Analysis Damian Trilling
Introducing. . . What is Big Data? Methods The schedule
Which tools?
Python
What?
• A language, not a specific program• Huge advantage: flexibility, portability• One of the languages for data analysis. (The other one is R.)
But Python is more flexible—the original version of Dropbox was written in Python. I’d say: R fornumbers, Python for text and messy stuff. But that’s just my personal view.
Which version?We use Python 3.http://www.google.com or http://www.stackexchange.com still offer a lotof Python2-code, but that can easily be adapted. Most notable difference: InPython 2, you write print "Hi", this has changed to print ("Hi")
Big Data and Automated Content Analysis Damian Trilling
Introducing. . . What is Big Data? Methods The schedule
Which tools?
Python
What?
• A language, not a specific program• Huge advantage: flexibility, portability• One of the languages for data analysis. (The other one is R.)
But Python is more flexible—the original version of Dropbox was written in Python. I’d say: R fornumbers, Python for text and messy stuff. But that’s just my personal view.
Which version?We use Python 3.http://www.google.com or http://www.stackexchange.com still offer a lotof Python2-code, but that can easily be adapted. Most notable difference: InPython 2, you write print "Hi", this has changed to print ("Hi")
Big Data and Automated Content Analysis Damian Trilling
Introducing. . . What is Big Data? Methods The schedule
Which tools?
Python
What?
• A language, not a specific program• Huge advantage: flexibility, portability• One of the languages for data analysis. (The other one is R.)
But Python is more flexible—the original version of Dropbox was written in Python. I’d say: R fornumbers, Python for text and messy stuff. But that’s just my personal view.
Which version?We use Python 3.http://www.google.com or http://www.stackexchange.com still offer a lotof Python2-code, but that can easily be adapted. Most notable difference: InPython 2, you write print "Hi", this has changed to print ("Hi")
Big Data and Automated Content Analysis Damian Trilling
Introducing. . . What is Big Data? Methods The schedule
Which tools?
If it’s not a program, how do you work with it?
Interactive mode
• Just type python3 on the command line, and you can startentering Python commands (You can leave again by entering quit())
• Great for quick try-outs, but you cannot even save your code
An editor of your choice
• Write your program in any text editor, save it as myprog.py• and run it from the command line with ./myprog.py or
python3 myprog.py
Big Data and Automated Content Analysis Damian Trilling
Introducing. . . What is Big Data? Methods The schedule
Which tools?
If it’s not a program, how do you work with it?
Interactive mode
• Just type python3 on the command line, and you can startentering Python commands (You can leave again by entering quit())
• Great for quick try-outs, but you cannot even save your code
An editor of your choice
• Write your program in any text editor, save it as myprog.py• and run it from the command line with ./myprog.py or
python3 myprog.py
Big Data and Automated Content Analysis Damian Trilling
Introducing. . . What is Big Data? Methods The schedule
Which tools?
If it’s not a program, how do you start it?
An IDE (Integrated Development Environment)
• Provides an interface• Both quick interactive try-outs and writing larger programs• We use spyder, which looks a bit like RStudio (and to someextent like Stata)
Big Data and Automated Content Analysis Damian Trilling
Introducing. . . What is Big Data? Methods The schedule
Which tools?
Some considerations regarding the use of software inscience
Assuming that science should be transparent and reproducible
, weshoulduse tools that are
• platform-independent• free (as in beer and as in speech, gratis and libre)• which implies: open source
This ensures it can our research (a) can be reproduced by anyone,and that there is (b) no black box that no one can look inside
Big Data and Automated Content Analysis Damian Trilling
Introducing. . . What is Big Data? Methods The schedule
Which tools?
Some considerations regarding the use of software inscience
Assuming that science should be transparent and reproducible, weshoulduse tools that are
• platform-independent• free (as in beer and as in speech, gratis and libre)• which implies: open source
This ensures it can our research (a) can be reproduced by anyone,and that there is (b) no black box that no one can look inside
Big Data and Automated Content Analysis Damian Trilling
Introducing. . . What is Big Data? Methods The schedule
Which tools?
Some considerations regarding the use of software inscience
Assuming that science should be transparent and reproducible, weshoulduse tools that are
• platform-independent• free (as in beer and as in speech, gratis and libre)• which implies: open source
This ensures it can our research (a) can be reproduced by anyone,and that there is (b) no black box that no one can look inside
Big Data and Automated Content Analysis Damian Trilling
Introducing. . . What is Big Data? Methods The schedule
Which tools?
Why program your own tool?
[...] these [commercial] tools are often unsuitablefor academic purposes because of their cost, alongwith the problematic ‘black box’ nature of many ofthese tools. Vis, 2013
[...] we should resist the temptation to let theopportunities and constraints of an application orplatform determine the research question [...] Mahrt &Scharkow, 2013, p. 30
Big Data and Automated Content Analysis Damian Trilling
Introducing. . . What is Big Data? Methods The schedule
Examples from last year
Some final projects from previous courses
Questions
• Are (popular) apps for children in the Apple App Store framedin terms of education or entertainment?
• How can house prices on funda.nl be predicted?• Where are those who edit Wikipedia-entries about companiesgeographically located, related to the company’s HQ?
• What are the sentiment and topical frames presented inGoogle US search results for the top political news searchterms as indicated by Google Trends?
• How is the tone in newspaper election coverage related to thetone on Twitter?
Big Data and Automated Content Analysis Damian Trilling
Introducing. . . What is Big Data? Methods The schedule
The schedule
Big Data and Automated Content Analysis Damian Trilling
Introducing. . . What is Big Data? Methods The schedule
The schedule
Each weekIn general: A lecture (Monday) and a lab session (Wednesday).Each week one method.
ExaminationsA mid-term take-home exam in week 5 and an individual researchproject on which you work during the whole course.
Self-studyPlay around! You really have to do it to learn programming. See itas your weekly assignment ;-)
Big Data and Automated Content Analysis Damian Trilling
Introducing. . . What is Big Data? Methods The schedule
The schedule
Each weekIn general: A lecture (Monday) and a lab session (Wednesday).Each week one method.
ExaminationsA mid-term take-home exam in week 5 and an individual researchproject on which you work during the whole course.
Self-studyPlay around! You really have to do it to learn programming. See itas your weekly assignment ;-)
Big Data and Automated Content Analysis Damian Trilling
Introducing. . . What is Big Data? Methods The schedule
Next meetings
Next meetings
Big Data and Automated Content Analysis Damian Trilling
Introducing. . . What is Big Data? Methods The schedule
Next meetings
Week 2: Getting started with Python
Monday, 4–4Introduction lecture (you are encouraged to try it out at the sametime!)
Wednesday, 6–4Data analysis exercise
Preparation: Read chapters 2 and 3!And play around!
Big Data and Automated Content Analysis Damian Trilling
Introducing. . . What is Big Data? Methods The schedule
Preview
Start playing!Exercises
1. Run a program that greets you.The code for this is
1 print("Hello world")
After that, do some calculations. You can do that in a similar way:1 a=22 print(a*3)
Just play around.
Additional ressourcesCodecademy course on Pythonhttps://www.codecademy.com/learn/python
Big Data and Automated Content Analysis Damian Trilling