SoDA 501: Approaches and Issues in Big Social Data
Spring 2018

Burt L. Monroe
Office: Sparks B002 (The Databasement) or Pond 207
Course website: https://burtmonroe.github.io/SoDA501
Appointments: http://burtmonroe.youcanbook.me
Contact: burtmonroe@psu.edu, 814-867-2726 or 814-865-9215

Description

This seminar is part of the core seminar series for students in the Social Data Analytics dual-title PhD and doctoral minor. The primary objective of the seminar is interdisciplinary exposure to, engagement with, and integration of the tools, practices, language, and standards used in the collection and management of data in the component disciplines of the Social Data Analytics field.

Each of you is well on your way toward a PhD – formal certification as an “expert” – in one of the component disciplines of Social Data Analytics and has, in your coursework and research, become well versed in one or more of the many computational, informational, statistical, visual analytic, or social scientific approaches to data, and the issues faced by those approaches. Here, we are interested in trying to integrate your multidisciplinary expertise, particularly in the context of data that are social (about, or arising from, human interaction) and big or intensive (of sufficient scale, variety, or complexity to strain the informational, computational, or cognitive limits of conventional approaches to data collection, management, manipulation, or analysis).

The SoDA core seminars are organized around the metaphor of the social data stack. The social data stack consists of three fuzzily boundaried layers: the “data layer,” the “analytics layer,” and the “relevance layer” (Fig. 1).

The data layer is comprised of the processes and technologies by which human interactions are translated into data about human interactions. These are the themes emphasized in SoDA 501, “Approaches and Issues in Big Social Data,” offered in the spring semester. Some SoDA / IGERT students will take more in-depth seminars with a focus on computational and informational aspects (primarily in Information Sciences & Technology, Geography, or engineering departments) and research design aspects (primarily in social science departments or Statistics) of the data layer.

The analytics layer is comprised of the processes and technologies by which social data are translated into knowledge about society. These are the themes emphasized in SoDA 502, “Approaches and Issues in Social Data Analytics.” Some of you will take more in-depth seminars on machine / statistical learning, visual analytics, or other statistical or social scientific approaches to inference.

The relevance layer is comprised of the processes and technologies by which knowledge about society is translated into value for science or society. Within the SoDA seminars, this is addressed primarily through exposure to and participation in projects that require an interdisciplinary team science approach.

Figure 1: SoDA and the social data stack

Assignments and Grades

The latter leads us to the main pedagogical components of SoDA 501:

• Engagement in Seminar - 40%

Guest Speakers: For half of the session most weeks, we will host a guest speaker (typically a member of the Graduate Faculty in Social Data Analytics, drawn from the full range of participating disciplines) discussing an active research project or related topic that touches on one or more areas of concern in the course. For each speaker, we will have two or more of you acting as “designated respondents,” with extra responsibility for having questions for discussion with the speaker.

Readings and Seminar Discussion: The readings, discussion, and what lecturing I will do will focus on interdisciplinary integration. In part, this involves identifying those concepts that may be new to some of you in this setting – e.g., how “big” or “machine learning” or “visual analytics” approaches challenge conventional social science methodology, or how social scientific thinking challenges emerging practices and conventional wisdom in data science – and tools associated with those concepts. In part, this involves interdisciplinary arbitrage and translation – identifying common concepts and structure that may go by slightly different names in different disciplines and settings. To this end, I want you each to send me – by Wednesday, 7:00am each week, by email – lists of terms / concepts that you encountered in that week’s reading, in three categories: (1) terms/concepts that were new, but you think you now understand; (2) terms/concepts that seem to be used differently than in the context of your home discipline; and (3) terms/concepts you still find confusing.

Grading Criteria: Full 30 points if you are present every week, have made a good faith effort to provide your lists of confusing terms and concepts on time, have thoughtfully read all of the assignments, are prepared to talk about the week’s readings and themes, and consistently contribute in ways that are productive to the discussion (good questions, thoughtful responses, etc.), with all of that weighted more heavily when you are a designated respondent. If you don’t do any of that, 0 points. Sliding scale in between.

• Exercises - 20%. It is explicitly not an objective of this course to “train” you in all of the tools we will mention, much less those that we could mention, a task that would take years. In the interest of collectively “moving the ball forward” for each of you, however, we will have a small number of assigned exercises. Early in the semester, these will be done in (assigned) interdisciplinary teams. Later exercises will be individual.

• Semester Team Project - 40%. You will, in a team consisting of at least three disciplines, create, gather, and/or organize/manipulate/prepare for analytics a “big” “social” dataset. The data must be at least partly social (arise from human interactions). There must be some nontrivial computational or informational element to the project. There need not be a final analysis of the data, but there must be some basic calculation of descriptive statistics over the data and some demonstration of the validity of the data for the (or an) intended scientific purpose – e.g., representativeness (and of what), balance, randomization, measurement validity, etc.

March 1 - Deadline for approval of teams and (proposed) projects

March 15 - 25% Project Review

March 29 - 50% Project Review

April 12 - 75% Project Review

April 26 - Team Project Presentations

May 3 - White Paper and Data Replication Archive Due. Submit a 4-5 page paper that documents what was done and why, discusses problems you encountered, provides an assessment of the validity of the data for an analytic purpose (or how it might be validated), and discusses what further work might be done to make the data into a useful resource for others and/or to publish an analysis based on the data. Share your code and data with me in maximally documented and reproducible form (ideally a notebook stored on GitHub or similar).

Course Schedule 2018

January 11

• Introductions

• Syllabus (What this course is and isn’t. How this course (hopefully) works.)

• Further reference:

10RulesforData, SoftwareCarpentry Lessons

Section “General Resources for Python and R”

Recommended: Python via Anaconda (https://www.anaconda.com); R (https://www.r-project.org) & RStudio (https://www.rstudio.com); Account on ICS-ACI (https://ics.psu.edu); Git (https://git-scm.com)

• Exercise 1: Team Updates 1/18, Due 1/25. BitByBit Exercise 2.6 (a-g). Teams:

Team Francisco: Sara, Claire, Rosemary, Fangcao

Team Freelin: Brittany, Arif, So Young, Xiaoran

Team Kankane: Shipi, Steve, Lulu, Omer

January 18

• Readings (send list of confusing terms / concepts by 7:00am, Jan 17, Wednesday):

BitByBit Ch. 1 & 2

CompSocSci, Monroe-No, Monroe-5Vs

One article from a different discipline, listed in the “Multidisciplinary Perspectives” section, excluding Business-BigData

• Further reference: Section “Big Data & Social Data Analytics”

• Exercise 1 Team Updates

January 25

• Readings (send list of confusing terms / concepts by 7:00 am, Jan 24, Wednesday):

GoogleFlu, GoogleBooks, EmbeddingsBias, MachineBias, RacistBot, BDSS-Census

ResearchMethodsKB “Measurement”, Quinn-Topics

• Further reference:

Section “¡Cuidado!”

Section “Measurement, Reliability, and Validity”

• Exercise 1 Due: Teams Report

• Determine “discussion lead” dates

• Exercise 2: Due 2/8. Wikipedia / Google Trends exercise. Teams:

Team Kelling: Claire, Brittany, Arif

Team Yalcin: Omer, Lulu, Fangcao

Team Pang: Rosemary, Shipi, Sara

Team Sun: Xiaoran, Steve, So Young

February 1

• Bing Pan (RPTM), “Big Data and Forecasting in Tourism”

• Readings (send list of confusing terms / concepts by 7:00 am, Jan 31, Wednesday):

Bing Pan: “Identifying the Next Non-Stop Flying Market with a Big Data Approach,” https://doi.org/10.1016/j.tourman.2017.12.008; “Google Trends and Tourist Arrivals: Emerging Biases and Proposed Corrections,” https://doi.org/10.1016/j.tourman.2017.10.014. (See also “Forecasting Destination Weekly Hotel Occupancy with Big Data,” http://journals.sagepub.com/doi/abs/10.1177/0047287516669050; “Forecasting tourism demand with composite search index,” https://doi.org/10.1016/j.tourman.2016.07.005)

UnobtrusiveMeasures

Multivariate-R Chapter 1 (Don’t get bogged down when the math starts.); LatentVariables Chapter 1 (Skim - don’t get bogged down in the math.); NetflixPrize

ResearchMethodsKB “Sampling”; NRCReport Chapter 8, “Sampling and Massive Data”; BitByBit Chapter 3, “Asking Questions”

• Further reference:

Section “Indirect, Unobtrusive, Nonreactive Measures, Data Exhaust”

Section “Multiple Measures, Latent Variable Measurement”

Section “Sampling and Survey Design”

Section “Open data, APIs, linked data, ...”

Section “Space and Time” (readings on Time)

February 8

• Clio Andris (GEOG), “What AirBNB and Yelp can teach us about human behavior in cities”

• Readings (send list of confusing terms / concepts by 7:00 am, Feb 7, Wednesday):

Clio Andris: “Using Yelp to Find Romance in the City: A Case of Restaurants in Four Cities,” https://www.dropbox.com/s/uink0zcklwcpo5g/Yelp_Restaurants.pdf?dl=0; “Hidden Style in the City: An Analysis of Geolocated Airbnb Rental Images in Ten Major Cities,” https://www.dropbox.com/s/t7y7f6880m4ty7v/AirBNB_Analysis.pdf?dl=0

InfoRetrieval Chs. 1, 2, 6 (you may prefer the slides from their classes). My primary hope here is that you understand the “vector space model” and “cosine similarity” (from Chapter 6). My secondary hope is that you understand the basics of Boolean information retrieval (and notions like “index,” “inverted index,” and “postings”). My tertiary hope is that you are exposed to some basic concepts of text analytics / NLP (including “tokenization,” “normalization,” “stemming,” “lemmatization,” “stop words,” “tf-idf”).

FightinWords (FW was used by Jurafsky in http://firstmonday.org/ojs/index.php/fm/article/view/4944/3863 and a best-selling book, The Language of Food, for similar applications to Andris, based on Yelp reviews, and now appears in his textbook, NLP)

• Further reference:

Section “Space and Time”

Section “Web Scraping”

Section “Data Representations, Data Mappings”

Section “Feature selection, feature extraction, feature engineering, ...”

Section “Databases and data management” (esp. SQL)

Section “Language, Text, Speech, Audio”

• Exercise 2 Due: Teams Report
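
For those who want to see the InfoRetrieval ideas from the February 8 reading in miniature, here is a rough sketch. It is not taken from the book: the three toy documents, the whitespace tokenizer, and the helper names `tokenize`, `tfidf_vectors`, and `cosine` are all made up for illustration, and the weighting is one simple tf-idf variant (raw term count times log(N/df)) among several in use. It builds an inverted index with postings, then compares documents in the vector space model by cosine similarity.

```python
import math
from collections import Counter, defaultdict

# Toy corpus (hypothetical, for illustration only).
docs = {
    "d1": "the pizza was great and the service was great",
    "d2": "great pizza and friendly service",
    "d3": "the flight was delayed",
}

def tokenize(text):
    """Crude tokenization + normalization: lowercase, split on whitespace."""
    return text.lower().split()

# Inverted index: term -> set of ids of the documents containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in tokenize(text):
        index[term].add(doc_id)

# Postings lists: the same index with each set stored as a sorted list.
postings = {term: sorted(ids) for term, ids in index.items()}

def tfidf_vectors(docs):
    """Sparse tf-idf vectors, using raw term counts and idf = log(N / df)."""
    n_docs = len(docs)
    vectors = {}
    for doc_id, text in docs.items():
        counts = Counter(tokenize(text))
        vectors[doc_id] = {
            term: tf * math.log(n_docs / len(index[term]))
            for term, tf in counts.items()
        }
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

vectors = tfidf_vectors(docs)

# Boolean retrieval: documents containing both "pizza" AND "service".
both = sorted(set(postings["pizza"]) & set(postings["service"]))

sim_reviews = cosine(vectors["d1"], vectors["d2"])    # two restaurant reviews
sim_unrelated = cosine(vectors["d1"], vectors["d3"])  # review vs. flight note
print(both, sim_reviews, sim_unrelated)
```

Note that d1 and d3 share only “the” and “was,” yet their similarity is not zero, which is one way to see why “stop words” come up in the reading.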

February 15 (Postponed)

• Conrad Tucker (IE), “Cybersecurity Policies and Their Impact on Dynamic Data Driven Application Systems”

• Readings (send list of confusing terms / concepts by 7:00 am, Feb 14, Wednesday):

Conrad Tucker: “Cybersecurity Policies and Their Impact on Dynamic Data Driven Application Systems,” http://ieeexplore.ieee.org/document/8064151. See also “Generative Adversarial Networks for Increasing the Veracity of Big Data,” http://ieeexplore.ieee.org/document/8258219

February 22

• David Reitter (IST), “Computational Psycholinguistics”

• Readings (two weeks’ worth):

David Reitter: “Alignment in Web-Based Dialogue: Studies in Big Data Computational Psycholinguistics” (Based on data from the Cancer Survivors Network and Reddit), http://www.david-reitter.com/pub/reitter2017alignment.pdf

Similarity

PatternRecognition Ch. 2, “Representation”

MMDS Chapter 3, “Finding Similar Items” (to understand “minhashing” and “locality sensitive hashing,” you’ll need to understand “hashing” – Section 1.3.2, also discussed in InfoRetrieval Chapter 3)

• Further reference:

Section “Similarity, Distance, Association, ...”

Section “Derived Data Representations, ...”

Section “Record Linkage, Entity Resolution, ...”

Section “Clustering, Hashing, Compression, ...”
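
As a companion to the MMDS Chapter 3 reading for February 22, the following is a minimal sketch of the minhashing idea, under stated assumptions rather than as the book's implementation. It assumes character 3-shingles, uses a seeded BLAKE2b digest from Python's `hashlib` as a stand-in for a family of random hash functions, and the helper names (`shingles`, `hash_value`, `minhash_signature`, `estimate_jaccard`) and the two example strings are hypothetical. The fraction of signature positions on which two sets agree estimates their Jaccard similarity. Locality sensitive hashing builds on such signatures and is not shown here.

```python
import hashlib

def shingles(text, k=3):
    """Set of character k-shingles (overlapping k-grams) of a string."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    """Exact Jaccard similarity: |A intersect B| / |A union B|."""
    return len(a & b) / len(a | b)

def hash_value(item, seed):
    """Deterministic 64-bit hash of (seed, item).

    Each seed plays the role of a different hash function.
    """
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def minhash_signature(items, num_hashes=200):
    """For each hash function, keep the minimum hash value over the set."""
    return [min(hash_value(x, seed) for x in items) for seed in range(num_hashes)]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two signatures agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)

a = shingles("big social data analytics")
b = shingles("big social data analysis")

exact = jaccard(a, b)
approx = estimate_jaccard(minhash_signature(a), minhash_signature(b))
print(exact, approx)
```

The point of the signature is size: each set is reduced to a fixed-length list of integers (200 here), whatever the size of the underlying set, and the estimate gets tighter as `num_hashes` grows.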

March 1

• Reading:

BitByBit Ch. 5 (“Creating Mass Collaboration”)

Crowdsourcing “Introduction” and Ch. 1, “Concepts, Theories, and Cases of Crowdsourcing”

HumanComputation Chs. 1, 2, 5

• Further reference:

Section “Crowdsourcing, Human Computation, Citizen Science, Web Experiments”

Section “Making Up Data (smoothing, convolution, kernels, ...)”

Section “Vision, image, video”

• Semester Projects Must be Approved Before Spring Break

March 8 - Spring Break

March 15

• Daniel DellaPosta (SOC), “Networks and the Mid-20th Century American Mafia”

• Reading:

Dan DellaPosta: “Network Closure in the Mid-20th Century American Mafia” (Social Networks, 2017), https://www-sciencedirect-com.ezaccess.libraries.psu.edu/science/article/pii/S037887331630199X; “Between Clique and Corporation: Boundary-Spanning in Solidary Groups” (R&R in American Journal of Sociology; in the Box folder)

Networks Chs. 1, 2

MMDS Ch. 5, “Link Analysis”

MarkovVisually

• Further references:

Section “Networks and Graphs”

Section “Simulation, resampling, Markov Chains, ...”

DeepLearning (different kind of network)

• 25% Project Review

March 22

• Reading:

DeepLearning Ch. 2, “Linear Algebra”

Shalizi-ADA Chs. 16-18 (“Principal Components Analysis,” “Factor Models,” “Nonlinear Dimension Reduction”)

• Further reference:

Section “Dimensionality reduction, decomposition, ...”

Section “Multiple measures, latent variable measurement”

Section “Linear algebra, matrix computations”

March 29

• Naomi Altman (STAT), “Generalizing PCA”

• Reading:

Altman, “Generalizing PCA,” 2015 slides: http://personal.psu.edu/nsa1/AltmanWebpage/PCAToronto.pdf

MapReduceIntuition (7 minute video providing “divide and conquer” intuition to MapReduce)

TidySAC-Video: Hadley Wickham on the tidyverse, “split-apply-combine” with dplyr, and tidy data (1 hour video). (More detail, but perhaps less intuition, in book form: discussion of “tidying” data, Chapter 5 of TidyData-R)

MMDS Ch. 2 (Read the Chapter 2 in the new “BETA” version of the book, which also touches on Spark and Tensorflow in section 2.4, “Extensions to MapReduce”: http://i.stanford.edu/~ullman/mmdsn.html)

• Further reference:

Section “Nonlinear dimension reduction, manifold learning”

Section “Databases and data management” (esp. NoSQL)

Section “Theoretically-structured approaches to data wrangling”

Section “Parallelism, MapReduce, Split-Apply-Combine”

Section “Functional programming”

Section “Cutting and bleeding edge ...”

• 50% Project Review

• Exercise 3 (Individual): Due 4/5
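
To make the “divide and conquer” intuition behind the March 29 MapReduce and split-apply-combine readings concrete, here is a small single-machine sketch of a word count. It only imitates the three phases in plain Python. It does not use Hadoop, Spark, or dplyr, and the function names `map_phase`, `shuffle_phase`, and `reduce_phase` and the toy input lines are assumptions made for illustration.

```python
from collections import defaultdict

# Toy input: each string stands in for one chunk of a much larger file.
lines = [
    "big data big questions",
    "social data",
    "big social data",
]

def map_phase(lines):
    """Map: emit one (key, value) pair per word, independently per line."""
    for line in lines:
        for word in line.split():
            yield word, 1

def shuffle_phase(pairs):
    """Shuffle (the 'split' step): group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer=sum):
    """Reduce ('apply' + 'combine'): collapse each group to a single value."""
    return {key: reducer(values) for key, values in groups.items()}

counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)
```

Because the map step touches each line independently and the reduce step touches each key independently, both could in principle be spread across many machines. That independence is what the frameworks in the readings exploit.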

April 5

• Prasenjit Mitra (IST), “Classification of Tweets from Disaster Scenarios”

• Reading:

Prasenjit Mitra: Rudra et al., “Summarizing Situational and Topical Information During Crises,” https://arxiv.org/pdf/1610.01561.pdf; Imran, Mitra & Castillo, “Twitter as a Lifeline: Human-annotated Twitter Corpora for NLP of Crisis-related Messages,” http://mimran.me/papers/imran_prasenjit_carlos_lrec2016.pdf

NLP (Jurafsky and Martin) Chapters 15 and 16, “Vector Semantics” and “Semantics with Dense Vectors”

• Further reference:

Section “Language, text, speech, audio”

GloVe, word2vec, word2vecExplained, t-SNE

Section “Mobile devices, distributed sensors, ...”

Section “Crowdsourcing, human computation, citizen science, ...”

Section “Scaling, iteration, streaming data, online algorithms”

• Exercise 3 Due

April 12

• Reading:

BitByBit Ch. 4, “Running Experiments”; review Ch. 2 sections, “Natural experiments in observable data” examples

CausalInference Chs. 1-2

• Further reference:

Section “Experimental and observational designs for causal inference”

Section “Crowdsourcing, web experiments”

Section “Human subjects ...”

• 75% Project Review

April 19

• Reading:

BitByBit Ch. 6, “Ethics”

PSU-ORP, “Common Rule and Other Changes,” https://www.research.psu.edu/irb/commonrulechanges

DataPrivacy

Reproducibility

Refresh on MachineBias

• Further reference:

Section “Ethics and Scientific Responsibility in Big Social Data”

Section “¡Cuidado!”

April 26 - LAST CLASS MEETING

• Team Project Presentations

May 3 - Team Projects Due

Past Visiting Speakers in SoDA 501 and 502

SoDA 502 (Fall 2017)

• Chris Zorn (PLSC)

• Diane Felmlee (SOC)

• Alan MacEachren (GEOG)

• Eric Plutzer (PLSC)

• James LeBreton (PSYCH)

• Sesa Slavkovic (STAT)

• Guido Cervone (GEOG)

• Ashton Verdery (SOC)

• Maggie Niu (STAT)

• Reka Albert (PHYS)

SoDA 501 (Spring 2017)

• Tim Brick (HDFS), “Towards real-time monitoring and intervention using wearable technology”

• Aylin Caliskan (Princeton), “A Story of Discrimination and Unfairness: Bias in Word Embeddings”

• Jay Yonamine (Google - IGERT alum), “Data Science in Industry”

• Johnathan Rush (Illinois), “Geospatial Data Science Workshop”

• Rick Gilmore (PSYCH), “Toward a more reproducible and robust science of human behavior”

• Glenn Firebaugh (SOC), “Measuring Inequality and Segregation with US Census Data”

• Charles Twardy (Sotera), “Data Science for Search and Rescue”

• Anna Smith (Ohio State), “A Hierarchical Model for Network Data in a Latent Hyperbolic Space”

• Rebecca Passonneau (CSE), “Omnigraph: Rich Feature Representation for Graph Kernel Learning”

• Alex Klippel (GEOG), “Virtual Reality for Immersive Analytics”

• Murali Haran (STAT), “A Computationally Efficient Projection-based Approach for Spatial Generalized Mixed Models”

SoDA 502 (Fall 2016)

• Clio Andris (GEOG), “Integrating Social Network Data into GISystems”

• Jia Li (STAT), “Clustering under the Wasserstein Metric”

• Rachel Smith (CAS), “Stigma Networks Perceptions of Sociograms”

• Zita Oravecz (HDFS)

• Bethany Bray (Methodology Center), “Latent Class and Latent Transition Analysis”

• Dave Hunter (STAT), “Model Based Clustering of Large Networks”

• Scott Bennett (PLSC), “ABM Model of Insurgency”

• David Reitter (IST)

• Suzanna Linn (PLSC), “Methodological Issues in Automated Text Analysis: Application to News Coverage of the US Economy”

SoDA 501 (Spring 2016)

• Bruce Desmarais (PLSC), “Learning in the Sunshine: Analysis of Local Government Email Corpora”

• Timothy Brick (HDFS), “Mapping and Manipulating Facial Expression”

• Qunying Huang (USC), “Social Media: An Emerging Data Source for Human Mobility Studies”

• Lingzhou Xue (STAT), “An Introduction to High-Dimensional Graphical Models”

• Ashton Verdery (SOC), “Sampling from Network Data”

• Sarah Battersby (Tableau), “Helping People See and Understand Spatial Data”

• Alexandra Slavkovic (STAT), “Statistical Privacy with Network Data”

• Lee Giles (IST), “Machine Learning for Scholarly Big Data”

Readings and References (updated Spring 2018)

We will discuss a relatively small subset of the readings listed here, and this will vary based on topics and readings discussed by visiting speakers and you yourselves. The remainder are provided here as curated references for more in-depth investigation in both the theory and practice related to the topic (with the latter heavily weighted toward resources in Python and R).

† Material that is, at last check, made available through the Penn State library. Most journal links should work if you are logged in to a Penn State machine or through the Penn State VPN. Some article archives and most books require additional authentication through webaccess. If links are broken, start directly from a search via https://www.libraries.psu.edu. Some books require installation of e-readers like Adobe Digital Editions. Links to lynda.com can be accessed through http://lynda.psu.edu.

‡ Material that is, at last check, legally provided for free. In some cases, these are the preprint versions of published material.

§ Material for which a legal selection is, or will be, provided through the class Box folder.

Big Data & Social Data Analytics

Overviews

• ‡[CompSocSci] David Lazer, Alex Pentland, Lada Adamic, Sinan Aral, Albert-Laszlo Barabasi, Devon Brewer, Nicholas Christakis, Noshir Contractor, James Fowler, Myron Gutmann, Tony Jebara, Gary King, Michael Macy, Deb Roy, and Marshall Van Alstyne. 2009. “Computational Social Science.” Science 323(5915): 721-3 + Supp. Feb 6. http://science.sciencemag.org/content/323/5915/721.full, https://gking.harvard.edu/files/gking/files/LazPenAda09.pdf

• ‡[NRCreport] National Research Council. 2013. Frontiers in Massive Data Analysis. National Academies Press. (Free w/ registration: http://www.nap.edu/catalog.php?record_id=18374) Ch. 1, “Introduction”; Ch. 2, “Massive Data in Science, Technology, Commerce, National Defense, Telecommunications, and other Endeavors”

• ‡[MMDS] Jure Leskovec, Anand Rajaraman, and Jeff Ullman. 2014. Mining of Massive Datasets. Cambridge University Press. http://www.mmds.org (BETA version of Third Edition: http://i.stanford.edu/~ullman/mmdsn.html)

• §[BitByBit] Matthew J. Salganik. 2018 (Forthcoming). Bit by Bit: Social Research in the Digital Age. Princeton University Press. Ch. 1, “Introduction”; Ch. 2, “Observing Behavior”

• DSHandbook-Py Ch. 1, “Introduction: Becoming a Unicorn”; Ch. 2, “The Data Science Road Map”

Burt-schtick

• ‡[Monroe-5Vs] Burt L. Monroe. 2013. “The Five Vs of Big Data Political Science: Introduction to the Special Issue on Big Data in Political Science.” Political Analysis 21(V5): 1–9. https://doi.org/10.1017/S1047198700014315 (Volume, Velocity, Variety, Vinculation, Validity)

• †[Monroe-No] Burt L. Monroe, Jennifer Pan, Margaret E. Roberts, Maya Sen, and Betsy Sinclair. 2015. “No! Formal Theory, Causal Inference, and Big Data Are Not Contradictory Trends in Political Science.” PS: Political Science & Politics 48(1): 71–4. http://dx.doi.org/10.1017/S1049096514001760

• †[Quinn-Topics] Kevin M. Quinn, Burt L. Monroe, Michael Colaresi, Michael H. Crespin, and Dragomir R. Radev. 2010. “How to Analyze Political Attention with Minimal Assumptions and Costs.” American Journal of Political Science 54(1): 209–28. http://onlinelibrary.wiley.com/doi/10.1111/j.1540-5907.2009.00427.x/full (esp. topic modeling as measurement, approach to validation)

• †[FightinWords] Burt L. Monroe, Michael Colaresi, and Kevin M. Quinn. 2008. “Fightin’ Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict.” Political Analysis 16(4): 372-403. https://doi.org/10.1093/pan/mpn018 (esp. the impact of sampling variance and regularization through priors)

• ‡[BDSS-Census] Big Data Social Science PSU Team. 2012. “A Closer Look at the Kaggle Census Data.” https://burtmonroe.github.io/BDSS/KaggleCensus2012 (esp. the relevance of the social processes by which data come to exist as data)

Multidisciplinary Perspectives

• †[Business-BigData] Andrew McAfee and Erik Brynjolfsson. 2012. “Big Data: The Management Revolution.” Harvard Business Review 90(10): 61–8. October. https://hbr.org/2012/10/big-data-the-management-revolution; Thomas H. Davenport and D.J. Patil. 2012. “Data Scientist: The Sexiest Job of the 21st Century.” Harvard Business Review 90(10): 70-6. October. https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century

• †[InfoSci-BigData] C.L. Philip Chen and Chun-Yang Zhang. 2014. “Data-intensive Applications, Challenges, Techniques and Technologies: A Survey on Big Data.” Information Sciences 275: 314-47. https://doi.org/10.1016/j.ins.2014.01.015

• †[Informatics-BigData] Vasant G. Honavar. 2014. “The Promise and Potential of Big Data: A Case for Discovery Informatics.” Review of Policy Research 31(4): 326-330. https://doi.org/10.1111/ropr.12080

• ‡[Stats-BigData] Beate Franke, Jean-Francois Plante, Ribana Roscher, Annie Lee, Cathal Smyth, Armin Hatefi, Fuqi Chen, Einat Gil, Alexander Schwing, Alessandro Selvitella, Michael M. Hoffman, Roger Grosse, Dietrich Hendricks, and Nancy Reid. 2016. “Statistical Inference, Learning and Models in Big Data.” International Statistical Review 84(3): 371-89. http://onlinelibrary.wiley.com/doi/10.1111/insr.12176/full

• †[Econ-BigData] Hal R. Varian. 2014. “Big Data: New Tricks for Econometrics.” Journal of Economic Perspectives 28(2): 3-28. https://doi.org/10.1257/jep.28.2.3

• †[GeoViz-BigData] Alan MacEachren. 2017. “Leveraging Big (Geo) Data with (Geo) Visual Analytics: Place as the Next Frontier.” In Chenghu Zhou, Fenzhen Su, Francis Harvey, and Jun Xu, eds., Spatial Data Handling in the Big Data Era, pp. 139-155. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/chapter/10.1007/978-981-10-4424-3_10

• †[Soc-BigData] David Lazer and Jason Radford. 2017. “Data ex Machina: Introduction to Big Data.” Annual Review of Sociology 43: 19-39. https://doi.org/10.1146/annurev-soc-060116-053457

• §[Politics-BigData] Keith T. Poole, L. Jason Anastasopoulos, and James E. Monogan III. Forthcoming. “The ‘Big Data’ Revolution in Political Campaigning and Governance.” Oxford Bibliographies in Political Science.

¡Cuidado! Traps, Biases, Problems, Pains, Perils

• †[GoogleFlu] David Lazer, Ryan Kennedy, Gary King, and Alessandro Vespignani. 2014. “The Parable of Google Flu: Traps in Big Data Analysis.” Science (343), 14 March. http://gking.harvard.edu/files/gking/files/0314policyforumff.pdf

• ‡[MachineBias] ProPublica. “Machine Bias: Investigating Algorithmic Injustice.” Series. https://www.propublica.org/series/machine-bias. See especially:

Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. 2016. “Machine Bias.” ProPublica, May 23. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing

Julia Angwin, Madeleine Varner, and Ariana Tobin. 2017. “Facebook Enabled Advertisers to Reach ‘Jew Haters’.” https://www.propublica.org/article/facebook-enabled-advertisers-to-reach-jew-haters

• ‡[Polling2016] Doug Rivers (Nov 11, 2016). “First Thoughts on Polling Problems in the 2016 US Elections.” https://today.yougov.com/news/2016/11/11/first-thoughts-polling-problems-2016-us-elections

• †[EventData] Wei Wang, Ryan Kennedy, David Lazer, Naren Ramakrishnan. 2016. “Growing Pains for Global Monitoring of Societal Events.” Science 353:6307, pp. 1502–1503. https://doi.org/10.1126/science.aaf6758

• ‡[GoogleBooks] Eitan Adam Pechenick, Christopher M. Danforth, and Peter Sheridan Dodds. 2015. “Characterizing the Google Books Corpus: Strong Limits to Inferences of Socio-Cultural and Linguistic Evolution.” PLoS One. https://doi.org/10.1371/journal.pone.0137041

• ‡[OkCupid] Michael Zimmer. 2016. “OkCupid Study Reveals the Perils of Big-Data Science.” Wired, May 14. https://www.wired.com/2016/05/okcupid-study-reveals-perils-big-data-science

• †[CriticalQuestions] danah boyd and Kate Crawford. 2011. “Critical Questions for Big Data: Provocations for a Cultural, Technological, and Scholarly Phenomenon.” Information, Communication & Society 15(5): 662–79. http://dx.doi.org/10.1080/1369118X.2012.678878

• ‡[RacistBot] Daniel Victor. 2016. “Microsoft created a Twitter bot to learn from users. It quickly became a racist jerk.” New York Times. https://www.nytimes.com/2016/03/25/technology/microsoft-created-a-twitter-bot-to-learn-from-users-it-quickly-became-a-racist-jerk.html

• MMDS 1.2.2-1.2.3 on Bonferroni

• Also BDSS-Census

Research Design and Measurement

Overviews

• BitByBit

• ‡[ResearchMethodsKB] William M. Trochim. 2006. The Research Methods Knowledge Base. http://www.socialresearchmethods.net/kb

• [7Rules] Glenn Firebaugh. 2008. Seven Rules for Social Research. Princeton University Press.

Measurement Reliability and Validity

• ResearchMethodsKB, “Measurement”

• 7Rules, Ch. 3, “Build Reality Checks into Your Research”

• Reliability: see also FightinWords

• Validity: see also Quinn10-Topics, Monroe-5Vs

Indirect, Unobtrusive, Nonreactive Measures; Data Exhaust

• †[UnobtrusiveMeasures] Raymond M. Lee. 2015. “Unobtrusive Measures.” Oxford Bibliographies. https://doi.org/10.1093/OBO/9780199846740-0048 (Canonical cite is Eugene J. Webb. 1966. Unobtrusive Measures; or Webb, Donald T. Campbell, Richard D. Schwartz, Lee Sechrest. 1999. Unobtrusive Measures, rev. ed. Sage.)

• BitByBit, examples in Chapter 2

Multiple Measures, Latent Variable Measurement

• 7Rules, Ch. 4, “Replicate Where Possible”

• †[LatentVariables] David J. Bartholomew, Martin Knott, and Irini Moustaki. 2011. Latent Variable Models and Factor Analysis: A Unified Approach. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10483308 (esp. Ch. 1, “Basic ideas and examples”)

• †[Multivariate-R] Brian Everitt and Torsten Hothorn. 2011. An Introduction to Applied Multivariate Analysis with R. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4419-9650-3

• Shalizi-ADA, Ch. 17, “Factor Models”

• ‡[NetflixPrize] Edwin Chen. 2011. “Winning the Netflix Prize: A Summary.”

• ‡CRAN: https://cran.r-project.org/web/views/Multivariate.html, https://cran.r-project.org/web/views/Psychometrics.html, https://cran.r-project.org/web/views/Cluster.html

Sampling and Survey Design

• †[Sampling] Steven K. Thompson. 2012. Sampling, 3rd ed. https://k8es4mc2l.search.serialssolutions.com/?sid=sersol&SS_jc=TC_024492330&title=Wiley%20Desktop%20Editions%20%3A%20Sampling

• ResearchMethodsKB, “Sampling”

• NRCreport, Ch. 8, “Sampling and Massive Data”

• BitByBit, Ch. 3, “Asking Questions”

• ‡[MSE] Daniel Manrique-Vallier, Megan E. Price, and Anita Gohdes. 2013. “Multiple Systems Estimation Techniques for Estimating Casualties in Armed Conflicts.” In Seybolt et al. (eds.), Counting Civilians. Preprint: http://citeseerx.ist.psu.edu/viewdoc/download?doi=1011469939&rep=rep1&type=pdf

• †[NetworkSampling] Ted Mouw and Ashton M. Verdery. 2012. “Network Sampling with Memory: A Proposal for More Efficient Sampling from Social Networks.” Sociological Methodology 42(1): 206–56. https://dx.doi.org/10.1177%2F0081175012461248

• See also FightinWords (re: hidden heteroskedasticity in sample variance)

• ‡[HashDontSample] Mudit Uppal. 2016. “Probabilistic data structures in the Big data world (+code)” (re: “Hash, don’t sample”). https://medium.com/@muppal/probabilistic-data-structures-in-the-big-data-world-code-b9387cff0c55

• MMDS: Hash Functions (1.3.2), Sampling in streams (4.2)

• ‡CRAN: https://cran.r-project.org/web/views/OfficialStatistics.html (“Complex Survey Design,” “Small Area Estimation”)

Experimental and Observational Designs for Causal Inference

• ‡[CausalInference] Miguel A. Hernán and James M. Robins. Forthcoming (2017). Causal Inference. Chapman & Hall/CRC. Preprint: https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book Ch. 1, “A definition of causal effect”; Ch. 2, “Randomized experiments”; Ch. 3, “Observational studies.” (See Ch. 6, “Graphical representation of causal effects,” for integration with the Judea Pearl approach.)

• BitByBit, Chapter 4, “Running Experiments”; Ch. 2, natural experiments in observable data (examples)

• ResearchMethodsKB, “Design”

• Firebaugh-7Rules, Ch. 2, “Look for Differences that Make a Difference, and Report Them”; §Ch. 5, “Compare Like with Like”; Ch. 6, “Use Panel Data to Study Individual Change and Repeated Cross-Section Data to Study Social Change”

• Shalizi-ADA, Part IV, “Causal Inference”

• Monroe-No

• CRAN: https://cran.r-project.org/web/views/ExperimentalDesign.html

Technologies for Primary and Secondary Data Collection

Mobile Devices, Distributed Sensors, Wearable Sensors, Remote Sensing

• †[RealityMining] Nathan Eagle and Alex (Sandy) Pentland. 2006. “Reality Mining: Sensing Complex Social Systems.” Personal and Ubiquitous Computing 10(4): 255–68. https://doi.org/10.1007/s00779-005-0046-3

• †[QuantifiedSelf] Melanie Swan. 2013. “The Quantified Self: Fundamental Disruption in Big Data Science and Biological Discovery.” Big Data 1(2): 85–99. https://doi.org/10.1089/big.2012.0002

• †[SensorData] Charu C. Aggarwal (Ed.). 2013. Managing and Mining Sensor Data. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-6309-2 Ch. 1, “An Introduction to Sensor Data Analytics”

• †[NightLights] Thushyanthan Baskaran, Brian Min, Yogesh Uppal. 2015. “Election cycles and electricity provision: Evidence from a quasi-experiment with Indian special elections.” Journal of Public Economics 126: 64–73. https://doi.org/10.1016/j.jpubeco.2015.03.011

Crowdsourcing, Human Computation, Citizen Science, Web Experiments

• †[Crowdsourcing] Daren C. Brabham. 2013. Crowdsourcing. MIT Press. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10692208 “Introduction”; Ch. 1, “Concepts, Theories, and Cases of Crowdsourcing”

• BitByBit, Ch. 5, “Creating Mass Collaboration”

• NRCreport, Ch. 9, “Human Interaction with Data”

• †[MTurk] Krista Casler, Lydia Bickel, and Elizabeth Hackett. 2013. “Separate but Equal? A Comparison of Participants and Data Gathered via Amazon’s MTurk, Social Media, and Face-to-Face Behavioral Testing.” Computers in Human Behavior 29(6): 2156–60. http://doi.org/10.1016/j.chb.2013.05.009

• †[LabintheWild] Katharina Reinecke and Krzysztof Z. Gajos. 2015. “LabintheWild: Conducting Large-Scale Online Experiments With Uncompensated Samples.” In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing (CSCW ’15), 1364–1378. http://dx.doi.org/10.1145/2675133.2675246

• †[TweetmentEffects] Kevin Munger. 2017. “Tweetment Effects on the Tweeted: Experimentally Reducing Racist Harassment.” Political Behavior 39(3): 629–49. https://doi.org/10.1007/s11109-016-9373-5

• †[HumanComputation] Edith Law and Luis von Ahn. 2011. Human Computation. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdf/10.2200/S00371ED1V01Y201107AIM013

Open Data, File Formats, APIs, Semantic Web, Linked Data

• DSHandbook-Py, Ch. 12, “Data Encodings and File Formats”

• ‡[OpenData] Open Data Institute. “What Is Open Data?” https://theodi.org/what-is-open-data

• ‡[OpenDataHandbook] Open Knowledge International. The Open Data Handbook. http://opendatahandbook.org Includes appendix “File Formats”: http://opendatahandbook.org/guide/en/appendices/file-formats

• ‡[APIs] Brian Cooksey. 2016. An Introduction to APIs. https://zapier.com/learn/apis

• ‡[APIMarkets] RapidAPI / Mashape API marketplaces: https://docs.rapidapi.com, https://market.mashape.com; ProgrammableWeb: https://www.programmableweb.com

• †[LinkedData] Tom Heath and Christian Bizer. 2011. Linked Data: Evolving the Web into a Global Data Space. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/abs/10.2200/S00334ED1V01Y201102WBE001 Ch. 1, “Introduction”; Ch. 2, “Principles of Linked Data”

• †[SemanticWeb] Nikolaos Konstantinou and Dimitrios-Emmanuel Spanos. 2015. Materializing the Web of Linked Data. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-16074-0 Ch. 1, “Introduction: Linked Data and the Semantic Web”

• †Morgan & Claypool Synthesis Lectures on the Semantic Web: Theory and Technology. http://www.morganclaypool.com/toc/wbe111

Web Scraping

• †[TheoryDrivenScraping] Richard N. Landers, Robert C. Brusso, Katelyn J. Cavanaugh, and Andrew B. Collmus. 2016. “A Primer on Theory-Driven Web Scraping: Automatic Extraction of Big Data from the Internet for Use in Psychological Research.” Psychological Methods 21(4): 475–492. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1037/met0000081

• ‡[Scraping-Py] Al Sweigart. 2015. Automate the Boring Stuff with Python: Practical Programming for Total Beginners. Ch. 11, “Web Scraping.” https://automatetheboringstuff.com/chapter11

• †[Scraping-R] Simon Munzert, Christian Rubba, Peter Meißner, and Dominic Nyhuis. 2014. Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining. Wiley. http://onlinelibrary.wiley.com/book/10.1002/9781118834732

• ‡CRAN: https://cran.r-project.org/web/views/WebTechnologies.html

Ethics & Scientific Responsibility in Big Social Data

Human Subjects, Consent, Privacy

• BitByBit, Ch. 6 (“Ethics”)

• ‡[PSU-ORP] Penn State Office for Research Protections (under the VP for Research):

Human Subjects Research, IRB: https://www.research.psu.edu/irb

Revised Common Rule: https://www.research.psu.edu/irb/commonrulechanges

Responsible Conduct of Research: https://www.research.psu.edu/education/rcr

Research Misconduct: https://www.research.psu.edu/researchmisconduct

SARI@PSU (Scientific and Research Integrity Training): https://www.research.psu.edu/training/sari

• ‡[MenloReport] David Dittrich and Erin Kenneally (Center for Applied Internet Data Analysis). 2012. “The Menlo Report: Ethical Principles Guiding Information and Communication Technology Research” and companion report “Applying Ethical Principles ...” https://www.caida.org/publications/papers/2012/menlo_report_actual_formatted

• ‡[BigDataEthics-CBDES] Jacob Metcalf, Emily F. Keller, and danah boyd. 2016. “Perspectives on Big Data, Ethics, and Society.” Council for Big Data, Ethics, and Society. http://bdes.datasociety.net/council-output/perspectives-on-big-data-ethics-and-society

• ‡[AoIRReport] Annette Markham and Elizabeth Buchanan (Association of Internet Researchers). 2012. “Ethical Decision-Making and Internet Research: Recommendations of the AoIR Ethics Working Committee (Version 2.0).” https://aoir.org/reports/ethics2.pdf and guidelines chart: https://aoir.org/wp-content/uploads/2017/01/aoir_ethics_graphic_2016.pdf

• ‡[BigDataEthics-Wired] Sarah Zhang. 2016. “Scientists are Just as Confused about the Ethics of Big-Data Research as You.” Wired. https://www.wired.com/2016/05/scientists-just-confused-ethics-big-data-research

• †[BigDataEthics-HerschelMori] Richard Herschel and Virginia M. Miori. 2017. “Ethics & Big Data.” Technology in Society 49: 31–36. http://doi.org/10.1016/j.techsoc.2017.03.003

The Science of Data Privacy

• †[BigDataPrivacy] Terence Craig and Mary E. Ludloff. 2011. Privacy and Big Data. O’Reilly Media. http://pensu.eblib.com/patron/FullRecord.aspx?p=781814

• ‡[DataAnalysisPrivacy] John Abowd, Lorenzo Alvisi, Cynthia Dwork, Sampath Kannan, Ashwin Machanavajjhala, Jerome Reiter. 2017. “Privacy-Preserving Data Analysis for the Federal Statistical Agencies.” A Computing Community Consortium white paper. https://arxiv.org/abs/1701.00752

• †[DataPrivacy] Stephen E. Fienberg and Aleksandra B. Slavkovic. 2011. “Data Privacy and Confidentiality.” International Encyclopedia of Statistical Science, 342–5. http://doi.org/10.1007/978-3-642-04898-2_202

• ‡[NetworksPrivacy] Vishesh Karwa and Aleksandra Slavkovic. 2016. “Inference using noisy degrees: Differentially private β-model and synthetic graphs.” Annals of Statistics 44(1): 87–112. http://projecteuclid.org/euclid.aos/1449755958

• †[DataPublishingPrivacy] Raymond Chi-Wing Wong and Ada Wai-Chee Fu. 2010. Privacy-Preserving Data Publishing: An Overview. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdfplus/10.2200/S00237ED1V01Y201003DTM002

• DataMatching, Ch. 8, “Privacy Aspects of Data Matching”

Transparency, Reproducibility, and Team Science

• ‡[10RulesforData] Alyssa Goodman, Alberto Pepe, Alexander W. Blocker, Christine L. Borgman, Kyle Cranmer, Merce Crosas, Rosanne Di Stefano, Yolanda Gil, Paul Groth, Margaret Hedstrom, David W. Hogg, Vinay Kashyap, Ashish Mahabal, Aneta Siemiginowska, and Aleksandra Slavkovic. (2014). “Ten Simple Rules for the Care and Feeding of Scientific Data.” PLoS Computational Biology 10(4): e1003542. https://doi.org/10.1371/journal.pcbi.1003542 (Note esp. curated resources for reproducible research.)

• †[Transparency] E. Miguel, C. Camerer, K. Casey, J. Cohen, K. M. Esterling, A. Gerber, R. Glennerster, D. P. Green, M. Humphreys, G. Imbens, D. Laitin, T. Madon, L. Nelson, B. A. Nosek, M. Petersen, R. Sedlmayr, J. P. Simmons, U. Simonsohn, M. Van der Laan. 2014. “Promoting Transparency in Social Science Research.” Science 343(6166): 30–1. https://doi.org/10.1126/science.1245317

• †[Reproducibility] Marcus R. Munafò, Brian A. Nosek, Dorothy V. M. Bishop, Katherine S. Button, Christopher D. Chambers, Nathalie Percie du Sert, Uri Simonsohn, Eric-Jan Wagenmakers, Jennifer J. Ware, and John P. A. Ioannidis. 2017. “A Manifesto for Reproducible Science.” Nature Human Behaviour 0021 (2017). https://doi.org/10.1038/s41562-016-0021

• ‡[SoftwareCarpentry] esp. “Lessons”: http://software-carpentry.org/lessons

• DSHandbook-Py, Ch. 9, “Technical Communication and Documentation”; Ch. 15, “Software Engineering Best Practices”

• †[TeamScienceToolkit] Vogel AL, Hall KL, Fiore SM, Klein JT, Bennett LM, Gadlin H, Stokols D, Nebeling LC, Wuchty S, Patrick K, Spotts EL, Pohl C, Riley WT, Falk-Krzesinski HJ. 2013. “The Team Science Toolkit: enhancing research collaboration through online knowledge sharing.” American Journal of Preventive Medicine 45: 787–9. https://doi.org/10.1016/j.amepre.2013.09.001

• §[Databrary] Kara Hall, Robert Croyle, and Amanda Vogel. Forthcoming (2017). Advancing Social and Behavioral Health Research through Cross-disciplinary Team Science. Springer. Includes Rick O. Gilmore and Karen E. Adolph, “Open Sharing of Research Video: Breaking the Boundaries of the Research Team.” (See http://databrary.org)

• CRAN: https://cran.r-project.org/web/views/ReproducibleResearch.html

Social Bias, Fair Algorithms

• MachineBias

• ‡[EmbeddingsBias] Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. “Semantics Derived Automatically from Language Corpora Contain Human Biases.” Science. https://arxiv.org/abs/1608.07187

• ‡[Debiasing] Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, Adam Kalai. 2016. “Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings.” https://arxiv.org/abs/1607.06520

• ‡[AvoidingBias] Moritz Hardt, Eric Price, Nathan Srebro. 2016. “Equality of Opportunity in Supervised Learning.” https://arxiv.org/abs/1610.02413

• ‡[InevitableBias] Jon Kleinberg, Sendhil Mullainathan, Manish Raghavan. 2016. “Inherent Trade-Offs in the Fair Determination of Risk Scores.” https://arxiv.org/abs/1609.05807

Databases and Data Management

• NRCreport, Ch. 3, “Scaling the Infrastructure for Data Management”

• †[SQL] Jan L. Harrington. 2010. SQL Clearly Explained. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780123756978

• DSHandbook-Py, Ch. 14, “Databases”

• †[noSQL] Guy Harrison. 2015. Next Generation Databases: NoSQL, NewSQL, and Big Data. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-1329-2

• †[Cloud] Divyakant Agrawal, Sudipto Das, and Amr El Abbadi. 2012. Data Management in the Cloud: Challenges and Opportunities. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdfplus/10.2200/S00456ED1V01Y201211DTM032

• †Morgan & Claypool Synthesis Lectures on Data Management. http://www.morganclaypool.com/toc/dtm/1/1

Data Wrangling

Theoretically-structured approaches to data wrangling

• ‡[TidyData-R] Garrett Grolemund and Hadley Wickham. 2017. R for Data Science (http://r4ds.had.co.nz), esp. Ch. 12, “Tidy Data”; also Wickham. 2014. “Tidy Data.” Journal of Statistical Software 59(10). http://www.jstatsoft.org/v59/i10/paper Tools: the tidyverse, http://tidyverse.org

• ‡[DataScience-Py] Jake VanderPlas. 2016. Python Data Science Handbook (https://github.com/jakevdp/PythonDataScienceHandbook), esp. Ch. 3 on pandas: http://pandas.pydata.org

• ‡[DataCarpentry] Colin Gillespie and Robin Lovelace. 2017. Efficient R Programming. https://csgillespie.github.io/efficientR esp. Ch. 6 on “efficient data carpentry”

• §[Wrangler] Joseph M. Hellerstein, Jeffrey Heer, Tye Rattenbury, and Sean Kandel. 2017. Data Wrangling: Practical Techniques for Data Preparation. Tool: Trifacta Wrangler, http://www.trifacta.com/products/wrangler

Data wrangling practice

• DSHandbook-Py, Ch. 4, “Data Munging: String Manipulation, Regular Expressions, and Data Cleaning”

• †[DataSimplification] Jules J. Berman. 2016. Data Simplification: Taming Information with Open Source Tools. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780128037812

• †[Wrangling-Py] Jacqueline Kazil, Katharine Jarmul. 2016. Data Wrangling with Python. O’Reilly. http://proquestcombo.safaribooksonline.com.ezaccess.libraries.psu.edu/9781491948804

• †[Wrangling-R] Bradley C. Boehmke. 2016. Data Wrangling with R. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45599-0

Record Linkage, Entity Resolution, Deduplication

• †[DataMatching] Peter Christen. 2012. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-642-31164-2 (esp. Ch. 2, “The Data Matching Process”)

• DataSimplification, Chapter 5, “Identifying and Deidentifying Data”

• ‡[EntityResolution] Lise Getoor and Ashwin Machanavajjhala. 2013. “Entity Resolution for Big Data.” Tutorial, KDD. http://www.umiacs.umd.edu/~getoor/Tutorials/ER_KDD2013.pdf

• ‡[SyrianCasualties] Peter Sadosky, Anshumali Shrivastava, Megan Price, and Rebecca C. Steorts. 2015. “Blocking Methods Applied to Casualty Records from the Syrian Conflict.” https://arxiv.org/abs/1510.07714 (For more on blocking, see DataMatching, Ch. 4, “Indexing.”)

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html, “Statistical Matching and Record Linkage”

“Making up data”: Imputation, Smoothers, Kernels, Priors, Filters, Teleportation, Negative Sampling, Convolution, Augmentation, Adversarial Training

• †[Imputation] Yi Deng, Changgee Chang, Moges Seyoum Ido, and Qi Long. 2016. “Multiple Imputation for General Missing Data Patterns in the Presence of High-dimensional Data.” Scientific Reports 6(21689). https://www.nature.com/articles/srep21689

• ‡[Overimputation] Matthew Blackwell, James Honaker, and Gary King. 2017. “A Unified Approach to Measurement Error and Missing Data: Overview and Applications.” Sociological Methods & Research 46(3): 303–341. http://gking.harvard.edu/files/gking/files/measure.pdf

• Shalizi-ADA, Sect. 1.5 (Linear Smoothers), Ch. 8 (Splines), Sect. 14.4 (Kernel Density Estimates); DataScience-Python, NB 05.13, “Kernel Density Estimation”

• FightinWords (re: Bayesian priors as additional data; impact of priors on regularization)

• DeepLearning, Section 7.5 (Data Augmentation), 7.12 (Dropout), 7.13 (Adversarial Examples), Chapter 9 (Convolutional Networks)

• †[Adversarial] Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy. 2015. “Explaining and Harnessing Adversarial Examples.” https://arxiv.org/abs/1412.6572

• See also resampling and simulation methods

• See also feature engineering, preprocessing

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html, “Imputation”

(Direct) Data Representations, Data Mappings

• NRCreport, Ch. 5, “Large-Scale Data Representations”

• †[PatternRecognition] M. Narasimha Murty and V. Susheela Devi. 2011. Pattern Recognition: An Algorithmic Approach. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-0-85729-495-1 Section 2.1, “Data Structures for Pattern Representation”

• ‡[InfoRetrieval] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2009. Introduction to Information Retrieval. Cambridge University Press. http://nlp.stanford.edu/IR-book Ch. 1, 2, 6 (also note slides used in their class)

• †[Algorithms] Brian Steele, John Chandler, Swarna Reddy. 2016. Algorithms for Data Science. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45797-0 Ch. 2, “Data Mapping and Data Dictionaries”

• See also Social Data Structures

Similarity, Distance, Association, Covariance, The Kernel Trick

• PatternRecognition, Section 2.3, “Proximity Measures”

• ‡[Similarity] Brendan O’Connor. 2012. “Cosine similarity, Pearson correlation, and OLS coefficients.” https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients

• For relatively comprehensive lists, see also:

M.-J. Lesot, M. Rifqi, and H. Benhadda. 2009. “Similarity measures for binary and numerical data: a survey.” Int. J. Knowledge Engineering and Soft Data Paradigms 1(1): 63-. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10112126533&rep=rep1&type=pdf

Seung-Seok Choi, Sung-Hyuk Cha, Charles C. Tappert. 2010. “A Survey of Binary Similarity and Distance Measures.” Journal of Systemics, Cybernetics & Informatics 8(1): 43–8. http://www.iiisci.org/Journal/CV$/sci/pdfs/GS315JG.pdf

Sung-Hyuk Cha. 2007. “Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions.” International Journal of Mathematical Models and Methods in Applied Sciences 4(1): 300–7. http://csis.pace.edu/ctappert/dps/d861-12/session4-p2.pdf

Anna Huang. 2008. “Similarity Measures for Text Document Clustering.” Proceedings of the 6th New Zealand Computer Science Research Student Conference, 49–56. http://www.nzcsrsc08.canterbury.ac.nz/site/proceedings/Individual_Papers/pg049_Similarity_Measures_for_Text_Document_Clustering.pdf

• Michael I. Jordan. “The Kernel Trick” (Lecture Notes). https://people.eecs.berkeley.edu/~jordan/courses/281B-spring04/lectures/lec3.pdf

• Eric Kim. 2017. “Everything You Ever Wanted to Know about the Kernel Trick (But Were Afraid to Ask).” http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick_blog_ekim_12_20_2017.pdf

Derived Data Representations - Dimensionality Reduction, Compression, Decomposition, Embeddings

The groupings here and under the measurement / multivariate statistics section are particularly arbitrary. For example, “k-Means clustering” can be viewed as a technique for “dimensionality reduction,” “compression,” “feature extraction,” “latent variable measurement,” “unsupervised learning,” or “collaborative filtering.”
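
To make the k-means example concrete, here is a minimal sketch (not from the course materials) of k-means read as compression, i.e. vector quantization: each point is replaced by the integer index of its nearest centroid. The function names `nearest`, `kmeans`, and `quantize` and the toy data are hypothetical, and initializing from the first k points is a simplification; a real analysis would more likely use a library implementation with random or k-means++ initialization.

```python
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, cen)) for cen in centroids]
    return dists.index(min(dists))


def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm; returns a list of k centroids (tuples)."""
    centroids = list(points[:k])  # simplification: start from the first k points
    for _ in range(iters):
        # Assignment step: group points by nearest centroid.
        clusters = [[] for _ in range(k)]
        for pt in points:
            clusters[nearest(pt, centroids)].append(pt)
        # Update step: move each centroid to the mean of its members.
        for j, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids


def quantize(points, centroids):
    """Compress: keep one small integer code per point instead of its coordinates."""
    return [nearest(pt, centroids) for pt in points]


# Hypothetical toy data: two well-separated blobs in 2-D.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centroids = kmeans(points, k=2)
codes = quantize(points, centroids)             # one code per point
reconstructed = [centroids[c] for c in codes]   # lossy "decompression"
```

The same `codes` can be read as cluster labels (unsupervised learning), as a one-dimensional discrete summary of two-dimensional data (dimensionality reduction), or as a codebook encoding (compression), which is the sense in which the groupings overlap.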

Clustering, hashing, quantization, blocking, compression

• ‡[Compression] Khalid Sayood. 2012. Introduction to Data Compression, 4th ed. Morgan Kaufmann (Elsevier). http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780124157965 (e.g., coding, blocking via vector quantization)

• MMDS, Ch. 3, “Finding Similar Items” (minhashing, locality-sensitive hashing)

• Multivariate-R, Ch. 6, “Clustering”

• DataScience-Python, NB 05.11, “k-Means Clustering”; NB 05.12, “Gaussian Mixture Models”

• ‡[KMeansHashing] Kaiming He, Fang Wen, Jian Sun. 2013. “K-means Hashing: An Affinity-Preserving Quantization Method for Learning Binary Compact Codes.” CVPR. https://www.cv-foundation.org/openaccess/content_cvpr_2013/papers/He_K-Means_Hashing_An_2013_CVPR_paper.pdf

• †[Squashing] Madigan, D., Raghavan, N., Dumouchel, W., Nason, M., Posse, C., and Ridgeway, G. (2002). “Likelihood-based data squashing: A modeling approach to instance construction.” Data Mining and Knowledge Discovery 6(2): 173–190. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1023/A:1014095614948

• †[Core-sets] Piotr Indyk, Sepideh Mahabadi, Mohammad Mahdian, Vahab S. Mirrokni. 2014. “Composable core-sets for diversity and coverage maximization.” PODS ’14, 100–8. https://doi.org/10.1145/2594538.2594560

Feature selection, feature extraction, feature engineering, weighting, preprocessing

• PatternRecognition, Sections 2.6–2.7, “Feature Selection, Feature Extraction”

• DSHandbook-Py, Ch. 7, “Interlude: Feature Extraction Ideas”

• DataScience-Python, NB 05.04, “Feature Engineering”

• Features in text: NLP and InfoRetrieval re: tf-idf and similar; FightinWords

• Features in images: DataScience-Python, NB 05.14, “Image Features”

Dimensionality reduction, decomposition, factorization, change of basis, reparameterization, matrix completion, latent variables, source separation

• MMDS, Ch. 9, “Recommendation Systems”; Ch. 11, “Dimensionality Reduction”

• DSHandbook-Py, Ch. 10, “Unsupervised Learning: Clustering and Dimensionality Reduction”

• ‡[Shalizi-ADA] Cosma Rohilla Shalizi. 2017. Advanced Data Analysis from an Elementary Point of View. Ch. 16 (“Principal Components Analysis”), Ch. 17 (“Factor Models”). http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV

See also Multivariate-R, Ch. 3, “Principal Components Analysis”; Ch. 4, “Multidimensional Scaling”; Ch. 5, “Exploratory Factor Analysis”; Latent

• DeepLearning, Ch. 2, “Linear Algebra”; Ch. 13, “Linear Factor Models”

• ‡[GloVe] Jeffrey Pennington, Richard Socher, Christopher Manning. 2014. “GloVe: Global vectors for word representation.” EMNLP. https://nlp.stanford.edu/projects/glove

• †[NMF] Daniel D. Lee and H. Sebastian Seung. 1999. “Learning the parts of objects by non-negative matrix factorization.” Nature 401: 788–791. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1038/44565

• †[CUR] Michael W. Mahoney and Petros Drineas. 2009. “CUR matrix decompositions for improved data analysis.” PNAS. http://www.pnas.org/content/106/3/697.full

• ‡[ICA] Aapo Hyvärinen and Erkki Oja. 2000. “Independent Component Analysis: Algorithms and Applications.” Neural Networks 13(4-5): 411–30. https://www.cs.helsinki.fi/u/ahyvarin/papers/NN00new.pdf

• [RandomProjection] Ella Bingham and Heikki Mannila. 2001. “Random projection in dimensionality reduction: applications to image and text data.” KDD. https://doi.org/10.1145%2F502512.502546

• Compression, e.g., “Transform coding,” “Wavelets”

Nonlinear dimensionality reduction, Manifold learning

• DeepLearning, Section 5.11.3, “Manifold Learning”

• DataScience-Python, NB 05.10, “Manifold Learning” (Locally linear embedding [LLE], Isomap)

• ‡[KernelPCA] Sebastian Raschka. 2014. “Kernel tricks and nonlinear dimensionality reduction via RBF kernel PCA.” http://sebastianraschka.com/Articles/2014_kernel_pca.html

• Shalizi-ADA, Ch. 18, “Nonlinear Dimensionality Reduction” (LLE)

• ‡[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi. 2003. “Laplacian eigenmaps for dimensionality reduction and data representation.” Neural Computation 15(6): 1373–1396. http://web.cse.ohio-state.edu/~belkin.8/papers/LEM_NC_03.pdf

• DeepLearning, Ch. 14 (“Autoencoders”)

• ‡[word2vec] Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” https://arxiv.org/abs/1301.3781

• ‡[word2vecExplained] Yoav Goldberg and Omer Levy. 2014. “word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method.” https://arxiv.org/abs/1402.3722

• ‡[t-SNE] Laurens van der Maaten and Geoffrey Hinton. 2008. “Visualizing Data using t-SNE.” Journal of Machine Learning Research 9: 2579–2605. http://jmlr.csail.mit.edu/papers/volume9/vandermaaten08a/vandermaaten08a.pdf

Computation and Scaling Up

Scientific computing, computation at scale

• NRCreport, Ch. 10, “The Seven Computational Giants of Massive Data Analysis”

• DSHandbook-Py, Ch. 21, “Performance and Computer Memory”; Ch. 22, “Computer Memory and Data Structures”

• ‡[ComputationalStatistics-Python] Cliburn Chan. Computational Statistics in Python. https://people.duke.edu/~ccc14/sta-663/index.html

• CRAN Task View: High Performance Computing. https://cran.r-project.org/web/views/HighPerformanceComputing.html

Numerical computing

• [NumericalComputing] Ward Cheney and David Kincaid. 2013. Numerical Mathematics and Computing, 7th ed. Brooks/Cole, Cengage Learning.

• CRAN Task View: Numerical Mathematics. https://cran.r-project.org/web/views/NumericalMathematics.html

Optimization (e.g., MLE, gradient descent, stochastic gradient descent, EM algorithm, neural nets)

• DSHandbook-Py, Ch. 23, “Maximum Likelihood Estimation and Optimization” (gradient descent)

• DeepLearning, Ch. 4, “Numerical Computation”; Ch. 6, “Deep Feedforward Networks”; Ch. 8, “Optimization for Training Deep Models”

• CRAN Task View: Optimization and Mathematical Programming. https://cran.r-project.org/web/views/Optimization.html

Linear algebra, matrix computations

• [MatrixComputations] Gene H. Golub and Charles F. Van Loan. 2013. Matrix Computations, 4th ed. Johns Hopkins University Press.

• ‡[NetflixMatrix] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. “Matrix Factorization Techniques for Recommender Systems.” Computer, August: 42–9. https://datajobs.com/data-science-repo/Recommender-Systems-[Netflix].pdf

• ‡[BigDataPCA] Jianqing Fan, Qiang Sun, Wen-Xin Zhou, Ziwei Zhu. “Principal Component Analysis for Big Data.” http://www.princeton.edu/~ziweiz/pca.pdf

• ‡[RandomSVD] Andrew Tulloch. 2009. “Fast Randomized SVD.” https://research.fb.com/fast-randomized-svd

• ‡[FactorSGD] Rainer Gemulla, Peter J. Haas, Erik Nijkamp, and Yannis Sismanis. 2011. “Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent.” KDD. http://www.cs.utah.edu/~hari/teaching/bigdata/gemulla11dsgd.pdf

• ‡[Sparse] Max Grossman. 2015. “101 Ways to Store a Sparse Matrix.” https://medium.com/@jmaxg3/101-ways-to-store-a-sparse-matrix-c7f2bf15a229

• ‡[DontInvert] John D. Cook. 2010. “Don’t Invert that Matrix.” https://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix

Simulation-based inference, resampling, Monte Carlo methods, MCMC, Bayes, approximate inference

• ‡[MarkovVisually] Victor Powell. “Markov Chains Explained Visually.” http://setosa.io/ev/markov-chains

• DSHandbook-Py, Ch. 25, “Stochastic Modeling” (Markov chains, MCMC, HMM)

• Bayes: ComputationalStatistics-Py

• DeepLearning, Ch. 17, “Monte Carlo Methods”; Ch. 19, “Approximate Inference”

• ‡[VariationalInference] Jason Eisner. 2011. “High-Level Explanation of Variational Inference.” https://www.cs.jhu.edu/~jason/tutorials/variational.html

• CRAN Task View: Bayesian Inference. https://cran.r-project.org/web/views/Bayesian.html

Parallelism, MapReduce, Split-Apply-Combine

• ‡[MapReduceIntuition] Jigsaw Academy. 2014. “Big Data Specialist: MapReduce.” https://www.youtube.com/watch?v=TwcYQzFqg-8&feature=youtube

• MMDS, Ch. 2, “Map-Reduce and the New Software Stack” (3rd Edition discusses Spark & TensorFlow)

• Algorithms, Ch. 3, “Scalable Algorithms and Associative Statistics”; Ch. 4, “Hadoop and MapReduce”

• NRCreport, Ch. 6, “Resources, Trade-offs, and Limitations”

• TidyData-R (Split-apply-combine is the motivating principle behind the “tidyverse” approach.) See also Part III, “Program” (pipes, functions, vectors, iteration)

• ‡[TidySAC-Video] Hadley Wickham. 2017. “Data Science in the Tidyverse.” https://www.rstudio.com/resources/videos/data-science-in-the-tidyverse

• See also FactorSGD

Functional Programming

• ComputationalStatistics-Python, “Functions are first class objects” through first exercises

• DSHandbook-Py, Ch. 20, “Programming Language Concepts”

• See also Haskell

Scaling iteration, streaming data, online algorithms (Spark)

• NRCreport, Ch. 4, “Temporal Data and Real-Time Algorithms”

• MMDS, Ch. 4, “Mining Data Streams”

• ‡[BDAS] AMPLab. BDAS, The Berkeley Data Analytics Stack. https://amplab.cs.berkeley.edu/software

• †[Spark] Mohammed Guller. 2015. Big Data Analytics with Spark: A Practitioner’s Guide to Using Spark for Large-Scale Data Processing, Machine Learning, and Graph Analytics, and High-Velocity Data Stream Processing. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-0964-6

• DSHandbook-Py, Ch. 13, “Big Data”

• DeepLearning, Ch. 10, “Sequence Modeling: Recurrent and Recursive Nets”

General Resources for Python and R

• †[DSHandbook-Py] Field Cady. 2017. The Data Science Handbook (Python-based). Wiley. http://onlinelibrary.wiley.com.ezaccess.libraries.psu.edu/book/10.1002/9781119092919

• ‡[Tutorials-R] Ujjwal Karn. 2017. “A curated list of R tutorials for Data Science, NLP and Machine Learning.” https://github.com/ujjwalkarn/DataScienceR

• ‡[Tutorials-Python] Ujjwal Karn. 2017. “A curated list of Python tutorials for Data Science, NLP and Machine Learning.” https://github.com/ujjwalkarn/DataSciencePython

Cutting and Bleeding Edge of Data Science Languages

• [Scala] †Vishal Layka and David Pollak. 2015. Beginning Scala. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-0232-6 †Nicolas Patrick. 2014. Scala for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1901910 ‡Scala site: https://www.scala-lang.org/ (See also )

• [Julia] †Ivo Balbaert. 2015. Getting Started with Julia. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1973847 ‡Douglas Bates. 2013. “Julia for R Programmers.” http://www.stat.wisc.edu/~bates/JuliaForRProgrammers.pdf ‡Julia site: https://julialang.org/

• [Haskell] †Richard Bird. 2014. Thinking Functionally with Haskell. https://doi-org.ezaccess.libraries.psu.edu/10.1017/CBO9781316092415 †Hakim Cassimally. 2017. Learning Haskell Programming. https://www.lynda.com/Haskell-tutorials/Learning-Haskell-Programming/604926-2.html †James Church. 2017. Learning Haskell for Data Analysis. https://www.lynda.com/Developer-tutorials/Learning-Haskell-Data-Analysis/604234-2.html Haskell site: https://www.haskell.org/

• [Clojure] †Mark McDonnell. 2017. Quick Clojure: Essential Functional Programming. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2952-1 †Akhil Wali. 2014. Clojure for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1674848 †Arthur Ulfeldt. 2015. https://www.lynda.com/Clojure-tutorials/Up-Running-Clojure/413127-2.html ‡Clojure site: https://clojure.org/

• [TensorFlow] ‡Abadi et al. (Google). 2016. “TensorFlow: A System for Large-Scale Machine Learning.” https://arxiv.org/abs/1605.08695 ‡Udacity, “Deep Learning.” https://www.udacity.com/course/deep-learning--ud730 ‡TensorFlow site: https://www.tensorflow.org/

• [H2O] Darren Cook. 2016. Practical Machine Learning with H2O: Powerful, Scalable Techniques for Deep Learning and AI. O’Reilly. Arno Candel and Viraj Parmar. 2015. Deep Learning with H2O. ‡H2O site: https://www.h2o.ai/

Social Data Structures

Space and Time

• †[GIA] David O’Sullivan and David J. Unwin. 2010. Geographic Information Analysis, Second Edition. John Wiley & Sons. http://onlinelibrary.wiley.com/book/10.1002/9780470549094

• [Space-Time] Donna J. Peuquet. 2003. Representations of Space and Time. Guilford.

• Roger S. Bivand, Edzer Pebesma, Virgilio Gómez-Rubio. 2013. Applied Spatial Data Analysis in R. Free through library: http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-7618-4 See also Edzer Pebesma. 2016. “Handling and Analyzing Spatial, Spatiotemporal and Movement Data in R.” https://edzer.github.io/UseR2016/

• ‡[PySAL] Sergio J. Rey and Dani Arribas-Bel. 2016. “Geographic Data Science with PySAL and the pydata Stack.” http://darribas.org/gds_scipy16/

• DataScience-Python, NB 04.13, “Geographic Data with Basemap”

• Time: “Longitudinal,” intra-individual data, as common in developmental psychology. †[Longitudinal] Garrett Fitzmaurice, Marie Davidian, Geert Verbeke, and Geert Molenberghs, editors. 2009. Longitudinal Data Analysis. Chapman & Hall / CRC. http://pensu.eblib.com/patron/FullRecord.aspx?p=359998

• Time: “Time Series, Time Series Cross Section, Panel,” as common in economics, political science, and sociology. DataScience-Python, Ch. 3 re: pandas; DSHandbook-Py, Ch. 17, “Time Series Analysis”; Econometrics; GelmanHill

• Time: “Sequential, Data Streams,” as in NLP, machine learning. Spark and other “streaming data” readings.

• Space-Time: Data with continuity, connectivity, neighborhood structure. See DeepLearning, Chapter 9, re: convolution.

• CRAN Task Views: Spatial, https://cran.r-project.org/web/views/Spatial.html; Spatiotemporal, https://cran.r-project.org/web/views/SpatioTemporal.html; Time Series, https://cran.r-project.org/web/views/TimeSeries.html

Network Graphs

• ‡[Networks] David Easley and Jon Kleinberg. 2010. Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press. http://www.cs.cornell.edu/home/kleinber/networks-book/

• MMDS, Ch. 5, “Link Analysis”; Ch. 10, “Mining Social-Network Graphs”

• †[Networks-R] Douglas Luke. 2015. A User’s Guide to Network Analysis in R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-3-319-23883-8

• †[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry. 2017. Python for Graph and Network Analysis. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-53004-8

Hierarchy, Clustered Data, Aggregation, Mixed Models

• †[GelmanHill] Andrew Gelman and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press. http://pensu.eblib.com/patron/FullRecord.aspx?p=288457

• †[MixedModels] Eugene Demidenko. 2013. Mixed Models: Theory and Applications with R, 2nd ed. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10748641

Social Data Channels

Language, Text, Speech, Audio

• ‡[NLP] Daniel Jurafsky and James H. Martin. Forthcoming (2017). Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics, 3rd ed. Prentice-Hall. Preprint: https://web.stanford.edu/~jurafsky/slp3/

• InfoRetrieval

• ‡[CoreNLP] Stanford CoreNLP. https://stanfordnlp.github.io/CoreNLP/

• †[TextAsData] Grimmer and Stewart. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis. https://doi.org/10.1093/pan/mps028

• Quinn-Topics

• FightinWords

• †[NLTK-Python] Jacob Perkins. 2010. Python Text Processing with NLTK 2.0 Cookbook. Packt. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10435387 †[TextAnalytics-Python] Dipanjan Sarkar. 2016. Text Analytics with Python. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2388-8

• DSHandbook-Py, Ch. 16, “Natural Language Processing”

• Matthew J. Denny. http://www.mjdenny.com/Text_Processing_In_R.html

• CRAN: Natural Language Processing. https://cran.r-project.org/web/views/NaturalLanguageProcessing.html

• †[AudioData] Dean Knox and Christopher Lucas. 2017. “The Speaker-Affect Model: Measuring Emotion in Political Speech with Audio Data.” http://christopherlucas.org/files/PDFs/sam.pdf https://www.youtube.com/watch?v=Hs8A9dwkMzI

• †Morgan & Claypool Synthesis Lectures on Human Language Technologies. http://www.morganclaypool.com/toc/hlt/1/1

• †Morgan & Claypool Synthesis Lectures on Speech and Audio Processing. http://www.morganclaypool.com/toc/sap/1/1

Vision, Image, Video

• ‡[ComputerVision] Richard Szeliski. 2010. Computer Vision: Algorithms and Applications. Springer. http://szeliski.org/Book/

• MixedModels, Ch. 11, “Statistical Analysis of Shape”; Ch. 12, “Statistical Image Analysis”

• Audio/Video/Volumetric: See DeepLearning, Chapter 9, re: convolution

• †Morgan & Claypool Synthesis Lectures on Image, Video, and Multimedia Processing. http://www.morganclaypool.com/toc/ivm/1/1

• †Morgan & Claypool Synthesis Lectures on Computer Vision. http://www.morganclaypool.com/toc/cov/1/1

Approaches to Learning from Data (The Analytics Layer)

• NRCreport, Ch. 7, “Building Models from Massive Data”

• MMDS (data-mining)

• CausalInference

• ‡[VisualAnalytics] Daniel Keim, Jörn Kohlhammer, Geoffrey Ellis, and Florian Mansmann, editors. 2010. Mastering the Information Age: Solving Problems with Visual Analytics. The Eurographics Association, Goslar, Germany. http://www.vismaster.eu/book/ See also †Morgan & Claypool Synthesis Lectures on Visualization. http://www.morganclaypool.com/toc/vis/2/1

• ‡[Econometrics] Bruce E. Hansen. 2014. Econometrics. http://www.ssc.wisc.edu/~bhansen/econometrics/

• ‡[StatisticalLearning] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning with Applications in R. Springer. http://www-bcf.usc.edu/~gareth/ISL/index.html

• [MachineLearning] Christopher Bishop. 2006. Pattern Recognition and Machine Learning. Springer.

• †[Bayes] Peter D. Hoff. 2009. A First Course in Bayesian Statistical Methods. Springer. http://link.springer.com/book/10.1007%2F978-0-387-92407-6

• †[GraphicalModels] Søren Højsgaard, David Edwards, Steffen Lauritzen. 2012. Graphical Models with R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-2299-0

• [InformationTheory] David MacKay. 2003. Information Theory, Inference, and Learning Algorithms. Cambridge University Press.

• ‡[DeepLearning] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org/ (see also Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. “Deep Learning.” Nature 521(7553): 436–444. https://doi.org/10.1038/nature14539)

See also ‡[Keras] Keras: The Python Deep Learning Library. https://keras.io/ François Chollet. Directory of Keras tutorials: https://github.com/fchollet/keras-resources

• †[SignalProcessing] Jose Maria Giron-Sierra. 2013. Digital Signal Processing with MATLAB Examples (Volumes 1-3). Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-981-10-2534-1

Penn State Policy Statements

Academic Integrity

Academic integrity is the pursuit of scholarly activity in an open, honest, and responsible manner. Academic integrity is a basic guiding principle for all academic activity at The Pennsylvania State University, and all members of the University community are expected to act in accordance with this principle. Consistent with this expectation, the University’s Code of Conduct states that all students should act with personal integrity, respect other students’ dignity, rights, and property, and help create and maintain an environment in which all can succeed through the fruits of their efforts.

Academic integrity includes a commitment by all members of the University community not to engage in or tolerate acts of falsification, misrepresentation, or deception. Such acts of dishonesty violate the fundamental ethical principles of the University community and compromise the worth of work completed by others.

Disability Accommodation

Penn State welcomes students with disabilities into the University’s educational programs. Every Penn State campus has an office for students with disabilities. The Student Disability Resources (SDR) website provides contact information for every Penn State campus (http://equity.psu.edu/sdr/disability-coordinator). For further information, please visit the Student Disability Resources website (http://equity.psu.edu/sdr).

In order to receive consideration for reasonable accommodations, you must contact the appropriate disability services office at the campus where you are officially enrolled, participate in an intake interview, and provide documentation. See documentation guidelines at http://equity.psu.edu/sdr/guidelines. If the documentation supports your request for reasonable accommodations, your campus disability services office will provide you with an accommodation letter. Please share this letter with your instructors and discuss the accommodations with them as early as possible. You must follow this process for every semester that you request accommodations.

Psychological Services

Many students at Penn State face personal challenges or have psychological needs that may interfere with their academic progress, social development, or emotional wellbeing. The university offers a variety of confidential services to help you through difficult times, including individual and group counseling, crisis intervention, consultations, online chats, and mental health screenings. These services are provided by staff who welcome all students and embrace a philosophy respectful of clients’ cultural and religious backgrounds, and sensitive to differences in race, ability, gender identity, and sexual orientation.

• Counseling and Psychological Services at University Park (CAPS) (http://studentaffairs.psu.edu/counseling/): 814-863-0395

• Penn State Crisis Line (24 hours/7 days/week): 877-229-6400

• Crisis Text Line (24 hours/7 days/week): Text LIONS to 741741

Educational Equity

Penn State takes great pride to foster a diverse and inclusive environment for students, faculty, and staff. Consistent with University Policy AD29, students who believe they have experienced or observed a hate crime, an act of intolerance, discrimination, or harassment that occurs at Penn State are urged to report these incidents as outlined on the University’s Report Bias webpage (http://equity.psu.edu/reportbias/).

Background Knowledge

Students will have a variety of backgrounds. Prior to beginning interdisciplinary coursework to fulfill Social Data Analytics degree requirements, including SoDA 501 and 502, students are expected to have advanced (graduate) training in at least one of the component areas of Social Data Analytics and a familiarity with basic concepts in the others.

With regard to specialization, students are expected to have advanced (graduate) training in ONE of the following:

• quantitative social science methodology and a discipline of social science (as would be the case for a second-year PhD student in Political Science, Sociology, Criminology, Human Development and Family Studies, or Demography), OR

• statistics (as would be the case for a second-year PhD student in Statistics), OR

• information science or informatics (as would be the case for a second-year PhD student in Information Science and Technology, or a second-year PhD student in Geography specializing in GIScience), OR

• computer science (as would be the case for a second-year PhD student in Computer Science and Engineering).

This requirement is met as a matter of meeting home program requirements for students in the dual-title PhD, but may require additional coursework on the part of students in other programs wishing to pursue the graduate minor.

With regard to general preparation, students are expected to have ALL of the following technical knowledge:

• basic programming skills, including basic facility with R &/or Python, AND

• basic knowledge of relational databases &/or geographic information systems, AND

• basic knowledge of probability, applied statistics, &/or social science research design, AND

• basic familiarity with a substantive or theoretical area of social science (e.g., 300-level coursework in political science, sociology, criminology, human development, psychology, economics, communication, anthropology, human geography, social informatics, or similar fields).

It is not unusual for students to have one or more gaps in this preparation at time of application to the SoDA program. Students should work with Social Data Analytics advisers to develop a plan for timely remediation of any deficiencies, which generally will not require formal coursework for students whose training and interests are otherwise appropriate for pursuit of the Social Data Analytics degree. Where possible, this will be addressed at time of application to the Social Data Analytics program.

To this end, some free training materials are linked in the reference section, and there are also abundant free, high-quality, self-paced, course-style training materials on these and related subjects available through edX (http://www.edx.org), Udacity (http://www.udacity.com), Codecademy (http://codecademy.com), and Lynda (httpwwwlyndapsu).

Figure 1: SoDA and the social data stack

Assignments and Grades

The latter leads us to the main pedagogical components of SoDA 501:

• Engagement in Seminar - 40%

Guest Speakers. For half of the session most weeks, we will host a guest speaker (typically a member of the Graduate Faculty in Social Data Analytics, drawn from the full range of participating disciplines) discussing an active research project or related topic that touches on one or more areas of concern in the course. For each speaker, we will have two or more of you acting as “designated respondents,” with extra responsibility for having questions for discussion with the speaker.

Readings and Seminar Discussion. The readings, discussion, and what lecturing I will do will focus on interdisciplinary integration. In part, this involves identifying those concepts that may be new to some of you in this setting – e.g., how “big” or “machine learning” or “visual analytics” approaches challenge conventional social science methodology, or how social scientific thinking challenges emerging practices and conventional wisdom in data science – and tools associated with those concepts. In part, this involves interdisciplinary arbitrage and translation – identifying common concepts and structure that may go by slightly different names in different disciplines and settings. To this end, I want you each to send me – by Wednesday, 7:00am each week, by email – lists of terms / concepts that you encountered in that week’s reading in three categories: (1) terms/concepts that were new, but you think you now understand; (2) terms/concepts that seem to be used differently than in the context of your home discipline; and (3) terms/concepts you still find confusing.

Grading Criteria. Full 30 points if you are present every week, have made a good faith effort to provide your lists of confusing terms and concepts on time, have thoughtfully read all of the assignments, are prepared to talk about the week’s readings and themes, and consistently contribute in ways that are productive to the discussion (good questions, thoughtful responses, etc.), with all of that weighted more heavily when you are a designated respondent. If you don’t do any of that, 0 points. Sliding scale in between.

• Exercises - 20%. It is explicitly not an objective of this course to “train” you in all of the tools we will mention, much less those that we could mention, a task that would take years. In the interest of collectively “moving the ball forward” for each of you, however, we will have a small number of assigned exercises. Early in the semester these will be done in (assigned) interdisciplinary teams. Later exercises will be individual.

• Semester Team Project - 40%. You will, in a team consisting of at least three disciplines, create, gather, and/or organize/manipulate/prepare for analytics a “big” “social” dataset. The data must be at least partly social (arise from human interactions). There must be some nontrivial computational or informational element to the project. There need not be a final analysis of the data, but there must be some basic calculation of descriptive statistics over the data and some demonstration of the validity of the data for the (or an) intended scientific purpose – e.g., representativeness (and of what), balance, randomization, measurement validity, etc.

March 1 - Deadline for approval of teams and (proposed) projects.

March 15 - 25% Project Review

March 29 - 50% Project Review

April 12 - 75% Project Review

April 26 - Team Project Presentations

May 3 - White Paper and Data / Replication Archive Due. Submit a 4-5 page paper that documents what was done and why, discusses problems you encountered, provides an assessment of the validity of the data for an analytic purpose (or how it might be validated), and discusses what further work might be done to make the data into a useful resource for others and/or to publish an analysis based on the data. Share your code and data with me in maximally documented and reproducible form (ideally a notebook stored on github or similar).

Course Schedule 2018

January 11

• Introductions

• Syllabus (What this course is and isn’t. How this course (hopefully) works.)

• Further reference:

10RulesforData, SoftwareCarpentry Lessons

Section “General Resources for Python and R”

Recommended: Python via Anaconda (https://www.anaconda.com), R (https://www.r-project.org) & RStudio (https://www.rstudio.com), Account on ICS-ACI (https://ics.psu.edu), Git (https://git-scm.com)

• Exercise 1: Team Updates 1/18, Due 1/25. BitByBit Exercise 26 (a-g). Teams:

Team Francisco: Sara, Claire, Rosemary, Fangcao

Team Freelin: Brittany, Arif, So Young, Xiaoran

Team Kankane: Shipi, Steve, Lulu, Omer

January 18

• Readings (send list of confusing terms / concepts by 7:00am Jan 17, Wednesday):

BitByBit, Ch. 1 & 2

CompSocSci, Monroe-No, Monroe-5Vs

One article from a different discipline listed in the “Multidisciplinary Perspectives” section, excluding Business-BigData

• Further reference: Section “Big Data & Social Data Analytics”

• Exercise 1 Team Updates

January 25

• Readings (send list of confusing terms / concepts by 7:00 am Jan 24, Wednesday):

GoogleFlu, GoogleBooks, EmbeddingsBias, MachineBias, RacistBot, BDSS-Census

ResearchMethodsKB “Measurement”, Quinn-Topics

• Further reference:

Section “¡Cuidado!”

Section “Measurement: Reliability and Validity”

• Exercise 1 Due. Teams Report.

• Determine “discussion lead” dates

• Exercise 2: Due 2/8. Wikipedia / Google Trends exercise. Teams:

Team Kelling: Claire, Brittany, Arif

Team Yalcin: Omer, Lulu, Fangcao

Team Pang: Rosemary, Shipi, Sara

Team Sun: Xiaoran, Steve, So Young

February 1

• Bing Pan (RPTM), “Big Data and Forecasting in Tourism”

• Readings (send list of confusing terms / concepts by 7:00 am Jan 31, Wednesday):

Bing Pan: “Identifying the Next Non-Stop Flying Market with a Big Data Approach,” https://doi.org/10.1016/j.tourman.2017.12.008; “Google Trends and Tourist Arrivals: Emerging Biases and Proposed Corrections,” https://doi.org/10.1016/j.tourman.2017.10.014 (See also “Forecasting Destination Weekly Hotel Occupancy with Big Data,” http://journals.sagepub.com/doi/abs/10.1177/0047287516669050; “Forecasting tourism demand with composite search index,” https://doi.org/10.1016/j.tourman.2016.07.005)

UnobtrusiveMeasures

Multivariate-R, Chapter 1 (Don’t get bogged down when the math starts); LatentVariables, Chapter 1 (Skim - don’t get bogged down in the math); NetflixPrize

ResearchMethodsKB “Sampling”; NRCReport, Chapter 8, “Sampling and Massive Data”; BitByBit, Chapter 3, “Asking Questions”

• Further reference:

Section “Indirect, Unobtrusive, Nonreactive Measures, Data Exhaust”

Section “Multiple Measures, Latent Variable Measurement”

Section “Sampling and Survey Design”

Section “Open data, APIs, linked data, ...”

Section “Space and Time” (readings on Time)

February 8

• Clio Andris (GEOG), “What AirBNB and Yelp can teach us about human behavior in cities”

• Readings (send list of confusing terms / concepts by 7:00 am Feb 7, Wednesday):

Clio Andris: “Using Yelp to Find Romance in the City: A Case of Restaurants in Four Cities,” https://www.dropbox.com/s/uink0zcklwcpo5g/Yelp_Restaurants.pdf?dl=0; “Hidden Style in the City: An Analysis of Geolocated Airbnb Rental Images in Ten Major Cities,” https://www.dropbox.com/s/t7y7f6880m4ty7v/AirBNB_Analysis.pdf?dl=0

InfoRetrieval, Chs. 1, 2, 6 (you may prefer the slides from their classes). My primary hope here is that you understand the “vector space model” and “cosine similarity” (from Chapter 6). My secondary hope is that you understand the basics of Boolean information retrieval (and notions like “index,” “inverted index,” and “postings”). My tertiary hope is that you are exposed to some basic concepts of text analytics / NLP (including “tokenization,” “normalization,” “stemming,” “lemmatization,” “stop words,” “tf-idf”).

FightinWords (FW was used by Jurafsky in http://firstmonday.org/ojs/index.php/fm/article/view/4944/3863 and a best-selling book, The Language of Food, for similar applications to Andris, based on Yelp reviews, and now appears in his textbook, NLP)

• Further reference:

Section “Space and Time”

Section “Web Scraping”

Section “Data Representations, Data Mappings”

Section “Feature selection, feature extraction, feature engineering, ...”

Section “Databases and data management” (esp. SQL)

Section “Language, Text, Speech, Audio”

• Exercise 2 Due. Teams Report.
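
For orientation on the “vector space model,” “tf-idf,” and “cosine similarity” named in the InfoRetrieval reading for this week, here is a minimal sketch in plain Python. It is an illustration under simplifying assumptions, not the textbook's implementation: the whitespace tokenizer, the tf × log(N/df) weighting, the function names, and the three toy review snippets are all made up for this example.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()

def tf_idf_vectors(docs):
    """Return one sparse {term: weight} vector per document.

    Weight = term frequency * log(N / document frequency), one common
    tf-idf variant (an assumption; other weightings exist).
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy "reviews", invented for illustration only.
docs = [
    "great pasta and great wine",
    "great wine list and cozy tables",
    "slow service and cold pasta",
]
vecs = tf_idf_vectors(docs)
sim_01 = cosine_similarity(vecs[0], vecs[1])
sim_02 = cosine_similarity(vecs[0], vecs[2])
print(sim_01 > sim_02)  # the first two reviews share more weighted terms
```

Note that “and” appears in every document, so its idf is log(3/3) = 0 and it contributes nothing to similarity; this is the same intuition that motivates “stop words.”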

February 15 (Postponed)

• Conrad Tucker (IE), “Cybersecurity Policies and Their Impact on Dynamic Data Driven Application Systems”

• Readings (send list of confusing terms / concepts by 7:00 am Feb 14, Wednesday):

Conrad Tucker: “Cybersecurity Policies and Their Impact on Dynamic Data Driven Application Systems,” http://ieeexplore.ieee.org/document/8064151/ See also “Generative Adversarial Networks for Increasing the Veracity of Big Data,” http://ieeexplore.ieee.org/document/8258219/

February 22

• David Reitter (IST), “Computational Psycholinguistics”

• Readings (two weeks’ worth):

David Reitter: “Alignment in Web-Based Dialogue: Studies in Big Data Computational Psycholinguistics” (Based on data from the Cancer Survivors Network and Reddit), http://www.david-reitter.com/pub/reitter2017alignment.pdf

Similarity

PatternRecognition, Ch. 2, “Representation”

MMDS, Chapter 3, “Finding Similar Items” (to understand “minhashing” and “locality sensitive hashing” you’ll need to understand “hashing” – Section 1.3.2, also discussed in InfoRetrieval, Chapter 3)

• Further reference:

Section “Similarity, Distance, Association, ...”

Section “Derived Data Representations, ...”

Section “Record Linkage, Entity Resolution, ...”

Section “Clustering, Hashing, Compression, ...”
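
As a companion to the “minhashing” idea in the MMDS reading for this week, the following is a minimal sketch in plain Python. It assumes word-bigram shingles and simulates a family of hash functions by salting MD5 with a seed; the function names, the choice of 200 hash functions, and the example sentences are illustrative assumptions, not MMDS's own code. The point is that the fraction of signature positions on which two sets agree approximates their Jaccard similarity.

```python
import hashlib

def shingles(text, k=2):
    """Set of overlapping word k-grams ("shingles") from a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Exact Jaccard similarity: |A intersect B| / |A union B|."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

def seeded_hash(item, seed):
    """One member of a simulated hash family: MD5 salted with a seed."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")

def minhash_signature(items, num_hashes=200):
    """For each hash function, keep the minimum hash value over the set.

    Assumes `items` is non-empty.
    """
    return [min(seeded_hash(x, seed) for x in items)
            for seed in range(num_hashes)]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which two signatures agree."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)

# Hypothetical example sentences differing in one word.
doc_a = shingles("the quick brown fox jumps over the lazy dog")
doc_b = shingles("the quick brown fox leaps over the lazy dog")

exact = jaccard(doc_a, doc_b)  # 6 shared bigrams out of 10 distinct: 0.6
approx = estimated_jaccard(minhash_signature(doc_a), minhash_signature(doc_b))
print(exact, approx)
```

Locality sensitive hashing then builds on such signatures by hashing bands of signature rows into buckets, so that only documents sharing a bucket are compared directly; that step is omitted here.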

March 1

• Reading:

BitByBit, Ch. 5 (“Creating Mass Collaboration”)

Crowdsourcing, “Introduction” and Ch. 1, “Concepts, Theories, and Cases of Crowdsourcing”

HumanComputation, Chs. 1, 2, 5

• Further reference:

Section “Crowdsourcing, Human Computation, Citizen Science, Web Experiments”

Section “Making Up Data (smoothing, convolution, kernels, ...)”

Section “Vision, image, video”

• Semester Projects Must be Approved Before Spring Break

March 8 - Spring Break

March 15

• Daniel DellaPosta (SOC), “Networks and the Mid-20th Century American Mafia”

• Reading:

Dan DellaPosta: “Network Closure in the Mid-20th Century American Mafia” (Social Networks, 2017), https://www-sciencedirect-com.ezaccess.libraries.psu.edu/science/article/pii/S037887331630199X; “Between Clique and Corporation: Boundary-Spanning in Solidary Groups” (R&R in American Journal of Sociology, in the Box folder)

Networks, Chs. 1, 2

MMDS, Ch. 5, “Link Analysis”

MarkovVisually

• Further references:

Section “Networks and Graphs”

Section “Simulation, resampling, Markov Chains, ...”

DeepLearning (different kind of network)

• 25% Project Review

March 22

• Reading:

DeepLearning, Ch. 2, “Linear Algebra”

Shalizi-ADA, Chs. 16-18 (“Principal Components Analysis,” “Factor Models,” “Nonlinear Dimension Reduction”)

• Further reference:

Section “Dimensionality reduction, decomposition, ...”

Section “Multiple measures, latent variable measurement”

Section “Linear algebra, matrix computations”

March 29

• Naomi Altman (STAT), “Generalizing PCA”

• Reading:

Altman, “Generalizing PCA,” 2015 slides, httppersonalpsuedunsa1AltmanWebpagePCATorontopdf

MapReduceIntuition (7 minute video providing “divide and conquer” intuition to MapReduce)

TidySAC-Video: Hadley Wickham on the tidyverse, “split-apply-combine” with dplyr, and tidy data (1 hour video). (More detail, but perhaps less intuition, in book form: discussion of “tidying” data, Chapter 5 of TidyData-R)

MMDS, Ch. 2 (Read the Chapter 2 in the new “BETA” version of the book, which also touches on Spark and TensorFlow in Section 2.4, “Extensions to MapReduce”: http://i.stanford.edu/~ullman/mmdsn.html)

• Further reference:

Section “Nonlinear dimension reduction, manifold learning”

Section “Databases and data management” (esp. NoSQL)

Section “Theoretically-structured approaches to data wrangling”

Section “Parallelism, MapReduce, Split-Apply-Combine”

Section “Functional programming”

Section “Cutting and bleeding edge ...”

• 50% Project Review

• Exercise 3 (Individual): Due 4/5
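
To make the “divide and conquer” intuition behind MapReduce (and its kinship with split-apply-combine) concrete, here is a minimal single-machine sketch in plain Python. It only imitates the three phases in memory; a real framework such as Hadoop or Spark distributes them across machines. The function names and the two toy documents are assumptions made for illustration.

```python
from collections import defaultdict
from functools import reduce

def map_phase(document):
    """Map: emit a (key, value) pair for every word in one document."""
    return [(word, 1) for word in document.lower().split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: combine each key's values with an associative operation.

    Addition is associative, so partial sums computed on different
    machines could be combined in any grouping without changing the result.
    """
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in grouped.items()}

# Hypothetical input "documents"; in practice each could live on a different node.
documents = ["big social data", "big data big questions"]

mapped = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(shuffle(mapped))
print(word_counts)
```

The same shape appears in split-apply-combine: split the data by a key, apply a summary to each group, and combine the per-group results into one table.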

April 5

• Prasenjit Mitra (IST), “Classification of Tweets from Disaster Scenarios”

• Reading:

Prasenjit Mitra: Rudra et al., “Summarizing Situational and Topical Information During Crises,” https://arxiv.org/pdf/1610.01561.pdf; Imran, Mitra, & Castillo, “Twitter as a Lifeline: Human-annotated Twitter Corpora for NLP of Crisis-related Messages,” http://mimran.me/papers/imran_prasenjit_carlos_lrec2016.pdf

NLP (Jurafsky and Martin), Chapters 15 and 16, “Vector Semantics” and “Semantics with Dense Vectors”

• Further reference:

Section “Language, text, speech, audio”

GloVe, word2vec, word2vecExplained, t-SNE

Section “Mobile devices, distributed sensors, ...”

Section “Crowdsourcing, human computation, citizen science, ...”

Section “Scaling, iteration, streaming data, online algorithms”

• Exercise 3 Due

April 12

• Reading:

BitByBit, Ch. 4, “Running Experiments”; review Ch. 2 sections, “Natural experiments in observable data” examples

CausalInference, Chs. 1-2

• Further reference:

Section “Experimental and observational designs for causal inference”

Section “Crowdsourcing, web experiments”

Section “Human subjects, ...”

• 75% Project Review

April 19

• Reading:

BitByBit, Ch. 6, “Ethics”

PSU-ORP, “Common Rule and Other Changes,” https://www.research.psu.edu/irb/commonrulechanges

DataPrivacy

Reproducibility

Refresh on MachineBias

• Further reference:

Section “Ethics and Scientific Responsibility in Big Social Data”

Section “¡Cuidado!”

April 26 - LAST CLASS MEETING

• Team Project Presentations

May 3 - Team Projects Due

Past Visiting Speakers in SoDA 501 and 502

SoDA 502 (Fall 2017)

• Chris Zorn (PLSC)

• Diane Felmlee (SOC)

• Alan MacEachren (GEOG)

• Eric Plutzer (PLSC)

• James LeBreton (PSYCH)

• Sesa Slavkovic (STAT)

• Guido Cervone (GEOG)

• Ashton Verdery (SOC)

• Maggie Niu (STAT)

• Reka Albert (PHYS)

SoDA 501 (Spring 2017)

• Tim Brick (HDFS), “Towards real-time monitoring and intervention using wearable technology”

• Aylin Caliskan (Princeton), “A Story of Discrimination and Unfairness: Bias in Word Embeddings”

• Jay Yonamine (Google - IGERT alum), “Data Science in Industry”

• Johnathan Rush (Illinois), “Geospatial Data Science Workshop”

• Rick Gilmore (PSYCH), “Toward a more reproducible and robust science of human behavior”

• Glenn Firebaugh (SOC), “Measuring Inequality and Segregation with US Census Data”

• Charles Twardy (Sotera), “Data Science for Search and Rescue”

• Anna Smith (Ohio State), “A Hierarchical Model for Network Data in a Latent Hyperbolic Space”

• Rebecca Passonneau (CSE), “Omnigraph: Rich Feature Representation for Graph Kernel Learning”

• Alex Klippel (GEOG), “Virtual Reality for Immersive Analytics”

• Murali Haran (STAT), “A Computationally Efficient Projection-based Approach for Spatial Generalized Mixed Models”

SoDA 502 (Fall 2016)

• Clio Andris (GEOG), “Integrating Social Network Data into GISystems”

• Jia Li (STAT), “Clustering under the Wasserstein Metric”

• Rachel Smith (CAS), “Stigma Networks Perceptions of Sociograms”

• Zita Oravecz (HDFS)

• Bethany Bray (Methodology Center), “Latent Class and Latent Transition Analysis”

• Dave Hunter (STAT), “Model Based Clustering of Large Networks”

• Scott Bennett (PLSC), “ABM Model of Insurgency”

• David Reitter (IST)

• Suzanna Linn (PLSC), “Methodological Issues in Automated Text Analysis: Application to News Coverage of the US Economy”


SoDA 501 (Spring 2016)

• Bruce Desmarais (PLSC), “Learning in the Sunshine: Analysis of Local Government Email Corpora”

• Timothy Brick (HDFS), “Mapping and Manipulating Facial Expression”

• Qunying Huang (USC), “Social Media: An Emerging Data Source for Human Mobility Studies”

• Lingzhou Xue (STAT), “An Introduction to High-Dimensional Graphical Models”

• Ashton Verdery (SOC), “Sampling from Network Data”

• Sarah Battersby (Tableau), “Helping People See and Understand Spatial Data”

• Alexandra Slavkovic (STAT), “Statistical Privacy with Network Data”

• Lee Giles (IST), “Machine Learning for Scholarly Big Data”


Readings and References (updated Spring 2018)

We will discuss a relatively small subset of the readings listed here, and this will vary based on topics and readings discussed by visiting speakers and you yourselves. The remainder are provided here as curated references for more in-depth investigation in both the theory and practice related to the topic (with the latter heavily weighted toward resources in Python and R).

† Material that is, at last check, made available through the Penn State library. Most journal links should work if you are logged in to a Penn State machine or through the Penn State VPN. Some article archives and most books require additional authentication through webaccess. If links are broken, start directly from a search via https://www.libraries.psu.edu. Some books require installation of e-readers like Adobe Digital Editions. Links to lynda.com can be accessed through http://lynda.psu.edu.

‡ Material that is, at last check, legally provided for free. In some cases, these are the preprint versions of published material.

§ Material for which a legal selection is or will be provided through the class Box folder.

Big Data & Social Data Analytics

Overviews

• ‡[CompSocSci] David Lazer, Alex Pentland, Lada Adamic, Sinan Aral, Albert-Laszlo Barabasi, Devon Brewer, Nicholas Christakis, Noshir Contractor, James Fowler, Myron Gutmann, Tony Jebara, Gary King, Michael Macy, Deb Roy, and Marshall Van Alstyne. 2009. “Computational Social Science.” Science 323(5915): 721-3 + Supp. Feb 6. http://science.sciencemag.org/content/323/5915/721.full https://gking.harvard.edu/files/gking/files/LazPenAda09.pdf

• ‡[NRCreport] National Research Council. 2013. Frontiers in Massive Data Analysis. National Academies Press. (Free w/ registration: http://www.nap.edu/catalog.php?record_id=18374) Ch. 1, “Introduction”; Ch. 2, “Massive Data in Science, Technology, Commerce, National Defense, Telecommunications, and other Endeavors.”

• ‡[MMDS] Jure Leskovec, Anand Rajaraman, and Jeff Ullman. 2014. Mining of Massive Datasets. Cambridge University Press. http://www.mmds.org (BETA version of Third Edition: http://i.stanford.edu/~ullman/mmdsn.html)

• §[BitByBit] Matthew J. Salganik. 2018 (Forthcoming). Bit by Bit: Social Research in the Digital Age. Princeton University Press. Ch. 1, “Introduction”; Ch. 2, “Observing Behavior.”

• DSHandbook-Py, Ch. 1, “Introduction: Becoming a Unicorn”; Ch. 2, “The Data Science Road Map.”

Burt-schtick

• ‡[Monroe-5Vs] Burt L. Monroe. 2013. “The Five Vs of Big Data Political Science: Introduction to the Special Issue on Big Data in Political Science.” Political Analysis 21(V5): 1–9. https://doi.org/10.1017/S1047198700014315 (Volume, Velocity, Variety, Vinculation, Validity)

• †[Monroe-No] Burt L. Monroe, Jennifer Pan, Margaret E. Roberts, Maya Sen, and Betsy Sinclair. 2015. “No! Formal Theory, Causal Inference, and Big Data Are Not Contradictory Trends in Political Science.” PS: Political Science & Politics 48(1): 71–4. http://dx.doi.org/10.1017/S1049096514001760

• †[Quinn-Topics] Kevin M. Quinn, Burt L. Monroe, Michael Colaresi, Michael H. Crespin, and Dragomir R. Radev. 2010. “How to Analyze Political Attention with Minimal Assumptions and Costs.” American Journal of Political Science 54(1): 209–28. http://onlinelibrary.wiley.com/doi/10.1111/j.1540-5907.2009.00427.x/full (esp. topic modeling as measurement, approach to validation)

• †[FightinWords] Burt L. Monroe, Michael Colaresi, and Kevin M. Quinn. 2008. “Fightin’ Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict.” Political Analysis 16(4): 372-403. https://doi.org/10.1093/pan/mpn018 (esp. the impact of sampling variance and regularization through priors)

• ‡[BDSS-Census] Big Data Social Science PSU Team. 2012. “A Closer Look at the Kaggle Census Data.” https://burtmonroe.github.io/BDSSKaggleCensus2012 (esp. the relevance of the social processes by which data come to exist as data)

Multidisciplinary Perspectives

• †[Business-BigData] Andrew McAfee and Erik Brynjolfsson. 2012. “Big Data: The Management Revolution.” Harvard Business Review 90(10): 61–8. October. https://hbr.org/2012/10/big-data-the-management-revolution Thomas H. Davenport and D.J. Patil. 2012. “Data Scientist: The Sexiest Job of the 21st Century.” Harvard Business Review 90(10): 70-6. October. https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century

• †[InfoSci-BigData] C.L. Philip Chen and Chun-Yang Zhang. 2014. “Data-intensive Applications, Challenges, Techniques and Technologies: A Survey on Big Data.” Information Sciences 275: 314-47. https://doi.org/10.1016/j.ins.2014.01.015

• †[Informatics-BigData] Vasant G. Honavar. 2014. “The Promise and Potential of Big Data: A Case for Discovery Informatics.” Review of Policy Research 31(4): 326-330. https://doi.org/10.1111/ropr.12080

• ‡[Stats-BigData] Beate Franke, Jean-Francois Plante, Ribana Roscher, Annie Lee, Cathal Smyth, Armin Hatefi, Fuqi Chen, Einat Gil, Alexander Schwing, Alessandro Selvitella, Michael M. Hoffman, Roger Grosse, Dietrich Hendricks, and Nancy Reid. 2016. “Statistical Inference, Learning and Models in Big Data.” International Statistical Review 84(3): 371-89. http://onlinelibrary.wiley.com/doi/10.1111/insr.12176/full

• †[Econ-BigData] Hal R. Varian. 2014. “Big Data: New Tricks for Econometrics.” Journal of Economic Perspectives 28(2): 3-28. https://doi.org/10.1257/jep.28.2.3

• †[GeoViz-BigData] Alan MacEachren. 2017. “Leveraging Big (Geo) Data with (Geo) Visual Analytics: Place as the Next Frontier.” In Chenghu Zhou, Fenzhen Su, Francis Harvey, and Jun Xu, eds., Spatial Data Handling in the Big Data Era, pp. 139-155. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/chapter/10.1007/978-981-10-4424-3_10

• †[Soc-BigData] David Lazer and Jason Radford. 2017. “Data ex Machina: Introduction to Big Data.” Annual Review of Sociology 43: 19-39. https://doi.org/10.1146/annurev-soc-060116-053457

• §[Politics-BigData] Keith T. Poole, L. Jason Anastasopoulos, and James E. Monogan III. Forthcoming. “The ‘Big Data’ Revolution in Political Campaigning and Governance.” Oxford Bibliographies in Political Science.

¡Cuidado! Traps, Biases, Problems, Pains, Perils

• †[GoogleFlu] David Lazer, Ryan Kennedy, Gary King, and Alessandro Vespignani. 2014. “The Parable of Google Flu: Traps in Big Data Analysis.” Science (343), 14 March. http://gking.harvard.edu/files/gking/files/0314policyforumff.pdf

• ‡[MachineBias] ProPublica. “Machine Bias: Investigating Algorithmic Injustice.” Series. https://www.propublica.org/series/machine-bias See especially:


Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. 2016. “Machine Bias.” ProPublica, May 23. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing

Julia Angwin, Madeleine Varner, and Ariana Tobin. 2017. “Facebook Enabled Advertisers to Reach ‘Jew Haters.’” https://www.propublica.org/article/facebook-enabled-advertisers-to-reach-jew-haters

• ‡[Polling2016] Doug Rivers (Nov. 11, 2016). “First Thoughts on Polling Problems in the 2016 U.S. Elections.” https://today.yougov.com/news/2016/11/11/first-thoughts-polling-problems-2016-us-elections

• †[EventData] Wei Wang, Ryan Kennedy, David Lazer, Naren Ramakrishnan. 2016. “Growing Pains for Global Monitoring of Societal Events.” Science 353(6307), pp. 1502–1503. https://doi.org/10.1126/science.aaf6758

• ‡[GoogleBooks] Eitan Adam Pechenick, Christopher M. Danforth, and Peter Sheridan Dodds. 2015. “Characterizing the Google Books Corpus: Strong Limits to Inferences of Socio-Cultural and Linguistic Evolution.” PLoS One. https://doi.org/10.1371/journal.pone.0137041

• ‡[OkCupid] Michael Zimmer. 2016. “OkCupid Study Reveals the Perils of Big-Data Science.” Wired, May 14. https://www.wired.com/2016/05/okcupid-study-reveals-perils-big-data-science

• †[CriticalQuestions] danah boyd and Kate Crawford. 2011. “Critical Questions for Big Data: Provocations for a Cultural, Technological, and Scholarly Phenomenon.” Information, Communication & Society 15(5): 662–79. http://dx.doi.org/10.1080/1369118X.2012.678878

• ‡[RacistBot] Daniel Victor. 2016. “Microsoft created a Twitter bot to learn from users. It quickly became a racist jerk.” New York Times. https://www.nytimes.com/2016/03/25/technology/microsoft-created-a-twitter-bot-to-learn-from-users-it-quickly-became-a-racist-jerk.html

• MMDS, 1.2.2-1.2.3 on Bonferroni

• Also BDSS-Census

Research Design and Measurement

Overviews

• BitByBit

• ‡[ResearchMethodsKB] William M. Trochim. 2006. The Research Methods Knowledge Base. http://www.socialresearchmethods.net/kb

• [7Rules] Glenn Firebaugh. 2008. Seven Rules for Social Research. Princeton University Press.

Measurement Reliability and Validity

• ResearchMethodsKB, “Measurement”

• 7Rules, Ch. 3, “Build Reality Checks into Your Research”

• Reliability: see also FightinWords

• Validity: see also Quinn-Topics, Monroe-5Vs

Indirect, Unobtrusive, Nonreactive Measures, Data Exhaust


• †[UnobtrusiveMeasures] Raymond M. Lee. 2015. “Unobtrusive Measures.” Oxford Bibliographies. https://doi.org/10.1093/OBO/9780199846740-0048 (Canonical cite is Eugene J. Webb. 1966. Unobtrusive Measures, or Webb, Donald T. Campbell, Richard D. Schwartz, Lee Sechrest. 1999. Unobtrusive Measures, rev. ed. Sage.)

• BitByBit, Examples in Chapter 2

Multiple Measures, Latent Variable Measurement

• 7Rules, Ch. 4, “Replicate Where Possible”

• †[LatentVariables] David J. Bartholomew, Martin Knott, and Irini Moustaki. 2011. Latent Variable Models and Factor Analysis: A Unified Approach. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10483308 (esp. Ch. 1, “Basic ideas and examples”)

• †[Multivariate-R] Brian Everitt and Torsten Hothorn. 2011. An Introduction to Applied Multivariate Analysis with R. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4419-9650-3

• Shalizi-ADA, Ch. 17, “Factor Models”

• ‡[NetflixPrize] Edwin Chen. 2011. “Winning the Netflix Prize: A Summary.”

• ‡CRAN: https://cran.r-project.org/web/views/Multivariate.html, https://cran.r-project.org/web/views/Psychometrics.html, https://cran.r-project.org/web/views/Cluster.html

Sampling and Survey Design

• †[Sampling] Steven K. Thompson. 2012. Sampling, 3rd ed. https://sk8es4mc2l.search.serialssolutions.com/?sid=sersol&SS_jc=TC_024492330&title=Wiley%20Desktop%20Editions%20%3A%20Sampling

• ResearchMethodsKB, “Sampling”

• NRCreport, Ch. 8, “Sampling and Massive Data”

• BitByBit, Ch. 3, “Asking Questions”

• ‡[MSE] Daniel Manrique-Vallier, Megan E. Price, and Anita Gohdes. 2013. In Seybolt et al. (eds.), Counting Civilians: “Multiple Systems Estimation Techniques for Estimating Casualties in Armed Conflicts.” (Preprint: http://citeseerx.ist.psu.edu/viewdoc/download?doi=1011469939&rep=rep1&type=pdf)

• †[NetworkSampling] Ted Mouw and Ashton M. Verdery. 2012. “Network Sampling with Memory: A Proposal for More Efficient Sampling from Social Networks.” Sociological Methodology 42(1): 206–56. https://dx.doi.org/10.1177%2F0081175012461248

• See also FightinWords (re: hidden heteroskedasticity in sample variance)

• ‡[HashDontSample] Mudit Uppal. 2016. “Probabilistic data structures in the Big data world (+ code)” (re: “Hash, don’t sample”). https://medium.com/@muppal/probabilistic-data-structures-in-the-big-data-world-code-b9387cff0c55

• MMDS, Hash Functions (1.3.2), Sampling in streams (4.2)

• ‡CRAN: https://cran.r-project.org/web/views/OfficialStatistics.html (“Complex Survey Design,” “Small Area Estimation”)

Experimental and Observational Designs for Causal Inference


• ‡[CausalInference] Miguel A. Hernan, James M. Robins. Forthcoming (2017). Causal Inference. Chapman & Hall/CRC. Preprint: https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book Ch. 1, “A definition of causal effect”; Ch. 2, “Randomized experiments”; Ch. 3, “Observational studies.” (See Ch. 6, “Graphical representation of causal effects,” for integration with Judea Pearl approach.)

• BitByBit, Chapter 4, “Running Experiments”; Ch. 2, natural experiments in observable data, examples

• ResearchMethodsKB, “Design”

• Firebaugh-7Rules, Ch. 2, “Look for Differences that Make a Difference, and Report Them”; §Ch. 5, “Compare Like with Like”; Ch. 6, “Use Panel Data to Study Individual Change and Repeated Cross-Section Data to Study Social Change”

• Shalizi-ADA, Part IV, “Causal Inference”

• Monroe-No

• CRAN: https://cran.r-project.org/web/views/ExperimentalDesign.html

Technologies for Primary and Secondary Data Collection

Mobile Devices, Distributed Sensors, Wearable Sensors, Remote Sensing

• †[RealityMining] Nathan Eagle and Alex (Sandy) Pentland. 2006. “Reality Mining: Sensing Complex Social Systems.” Personal and Ubiquitous Computing 10(4): 255–68. https://doi.org/10.1007/s00779-005-0046-3

• †[QuantifiedSelf] Melanie Swan. 2013. “The Quantified Self: Fundamental Disruption in Big Data Science and Biological Discovery.” Big Data 1(2): 85–99. https://doi.org/10.1089/big.2012.0002

• †[SensorData] Charu C. Aggarwal (Ed.). 2013. Managing and Mining Sensor Data. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-6309-2 Ch. 1, “An Introduction to Sensor Data Analytics”

• †[NightLights] Thushyanthan Baskaran, Brian Min, Yogesh Uppal. 2015. “Election cycles and electricity provision: Evidence from a quasi-experiment with Indian special elections.” Journal of Public Economics 126: 64-73. https://doi.org/10.1016/j.jpubeco.2015.03.011

Crowdsourcing, Human Computation, Citizen Science, Web Experiments

• †[Crowdsourcing] Daren C. Brabham. 2013. Crowdsourcing. MIT Press. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10692208 “Introduction”; Ch. 1, “Concepts, Theories, and Cases of Crowdsourcing”

• BitByBit, Ch. 5, “Creating Mass Collaboration”

• NRCreport, Ch. 9, “Human Interaction with Data”

• †[MTurk] Krista Casler, Lydia Bickel, and Elizabeth Hackett. 2013. “Separate but Equal? A Comparison of Participants and Data Gathered via Amazon’s MTurk, Social Media, and Face-to-Face Behavioral Testing.” Computers in Human Behavior 29(6): 2156–60. http://doi.org/10.1016/j.chb.2013.05.009

• †[LabintheWild] Katharina Reinecke and Krzysztof Z. Gajos. 2015. “LabintheWild: Conducting Large-Scale Online Experiments With Uncompensated Samples.” In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing (CSCW ’15), 1364–1378. http://dx.doi.org/10.1145/2675133.2675246


• †[TweetmentEffects] Kevin Munger. 2017. “Tweetment Effects on the Tweeted: Experimentally Reducing Racist Harassment.” Political Behavior 39(3): 629–49. http://doi.org/10.1016/j.chb.2013.05.009

• †[HumanComputation] Edith Law and Luis von Ahn. 2011. Human Computation. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdf/10.2200/S00371ED1V01Y201107AIM013

Open Data, File Formats, APIs, Semantic Web, Linked Data

• DSHandbook-Py, Ch. 12, “Data Encodings and File Formats”

• ‡[OpenData] Open Data Institute. “What Is Open Data?” https://theodi.org/what-is-open-data

• ‡[OpenDataHandbook] Open Knowledge International. The Open Data Handbook. http://opendatahandbook.org Includes appendix “File Formats”: http://opendatahandbook.org/guide/en/appendices/file-formats

• ‡[APIs] Brian Cooksey. 2016. An Introduction to APIs. https://zapier.com/learn/apis

• ‡[APIMarkets] RapidAPI / mashape API marketplaces: https://docs.rapidapi.com, https://market.mashape.com; ProgrammableWeb: https://www.programmableweb.com

• †[LinkedData] Tom Heath and Christian Bizer. 2011. Linked Data: Evolving the Web into a Global Data Space. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/abs/10.2200/S00334ED1V01Y201102WBE001 Ch. 1, “Introduction”; Ch. 2, “Principles of Linked Data”

• †[SemanticWeb] Nikolaos Konstantinou and Dimitrios-Emmanuel Spanos. 2015. Materializing the Web of Linked Data. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-16074-0 Ch. 1, “Introduction: Linked Data and the Semantic Web”

• †Morgan & Claypool Synthesis Lectures on the Semantic Web: Theory and Technology. http://www.morganclaypool.com/toc/wbe/1/1

Web Scraping

• †[TheoryDrivenScraping] Richard N. Landers, Robert C. Brusso, Katelyn J. Cavanaugh, and Andrew B. Colmus. 2016. “A Primer on Theory-Driven Web Scraping: Automatic Extraction of Big Data from the Internet for Use in Psychological Research.” Psychological Methods 4: 475–492. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1037/met0000081

• ‡[Scraping-Py] Al Sweigart. 2015. Automate the Boring Stuff with Python: Practical Programming for Total Beginners. Ch. 11, Web-Scraping. https://automatetheboringstuff.com/chapter11

• †[Scraping-R] Simon Munzert, Christian Rubba, Peter Meißner, and Dominic Nyhuis. 2014. Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining. Wiley. http://onlinelibrary.wiley.com/book/10.1002/9781118834732

• ‡CRAN: https://cran.r-project.org/web/views/WebTechnologies.html

Ethics & Scientific Responsibility in Big Social Data

Human Subjects, Consent, Privacy

• BitByBit, Ch. 6 (“Ethics”)

• ‡[PSU-ORP] Penn State Office for Research Protections (under the VP for Research):

Human Subjects Research / IRB: https://www.research.psu.edu/irb


Revised Common Rule: https://www.research.psu.edu/irb/commonrulechanges

Responsible Conduct of Research: https://www.research.psu.edu/education/rcr

Research Misconduct: https://www.research.psu.edu/researchmisconduct

SARI @ PSU (Scientific and Research Integrity Training): https://www.research.psu.edu/training/sari

• ‡[MenloReport] David Dittrich and Erin Kenneally (Center for Applied Internet Data Analysis). 2012. “The Menlo Report: Ethical Principles Guiding Information and Communication Technology Research” and companion report “Applying Ethical Principles ...” https://www.caida.org/publications/papers/2012/menlo_report_actual_formatted

• ‡[BigDataEthics-CBDES] Jacob Metcalf, Emily F. Keller, and danah boyd. 2016. “Perspectives on Big Data, Ethics, and Society.” Council for Big Data, Ethics, and Society. http://bdes.datasociety.net/council-output/perspectives-on-big-data-ethics-and-society

• ‡[AoIRReport] Annette Markham and Elizabeth Buchanan (Association of Internet Researchers). 2012. “Ethical Decision-Making and Internet Research: Recommendations of the AoIR Ethics Working Committee (Version 2.0).” https://aoir.org/reports/ethics2.pdf and guidelines chart: https://aoir.org/wp-content/uploads/2017/01/aoir_ethics_graphic_2016.pdf

• ‡[BigDataEthics-Wired] Sarah Zhang. 2016. “Scientists are Just as Confused about the Ethics of Big-Data Research as You.” Wired. https://www.wired.com/2016/05/scientists-just-confused-ethics-big-data-research

• †[BigDataEthics-HerschelMori] Richard Herschel and Virginia M. Mori. 2017. “Ethics & Big Data.” Technology in Society 49: 31-36. http://doi.org/10.1016/j.techsoc.2017.03.003

The Science of Data Privacy

• †[BigDataPrivacy] Terence Craig and Mary E. Ludloff. 2011. Privacy and Big Data. O’Reilly Media. http://pensu.eblib.com/patron/FullRecord.aspx?p=781814

• ‡[DataAnalysisPrivacy] John Abowd, Lorenzo Alvisi, Cynthia Dwork, Sampath Kannan, Ashwin Machanavajjhala, Jerome Reiter. 2017. “Privacy-Preserving Data Analysis for the Federal Statistical Agencies.” A Computing Community Consortium white paper. https://arxiv.org/abs/1701.00752

• †[DataPrivacy] Stephen E. Fienberg and Aleksandra B. Slavkovic. 2011. “Data Privacy and Confidentiality.” International Encyclopedia of Statistical Science, 342–5. http://doi.org/978-3-642-04898-2_202

• ‡[NetworksPrivacy] Vishesh Karwa and Aleksandra Slavkovic. 2016. “Inference using noisy degrees: Differentially private β-model and synthetic graphs.” Annals of Statistics 44(1): 87-112. http://projecteuclid.org/euclid.aos/1449755958

• †[DataPublishingPrivacy] Raymond Chi-Wing Wong and Ada Wai-Chee Fu. 2010. Privacy-Preserving Data Publishing: An Overview. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdfplus/10.2200/S00237ED1V01Y201003DTM002

• DataMatching, Ch. 8, “Privacy Aspects of Data Matching”

Transparency, Reproducibility, and Team Science

• ‡[10RulesforData] Alyssa Goodman, Alberto Pepe, Alexander W. Blocker, Christine L. Borgman, Kyle Cranmer, Merce Crosas, Rosanne Di Stefano, Yolanda Gil, Paul Groth, Margaret Hedstrom, David W. Hogg, Vinay Kashyap, Ashish Mahabal, Aneta Siemiginowska, and Aleksandra Slavkovic (2014). “Ten Simple Rules for the Care and Feeding of Scientific Data.” PLoS Computational Biology 10(4): e1003542. https://doi.org/10.1371/journal.pcbi.1003542 (Note esp. curated resources for reproducible research.)


• †[Transparency] E. Miguel, C. Camerer, K. Casey, J. Cohen, K. M. Esterling, A. Gerber, R. Glennerster, D. P. Green, M. Humphreys, G. Imbens, D. Laitin, T. Madon, L. Nelson, B. A. Nosek, M. Petersen, R. Sedlmayr, J. P. Simmons, U. Simonsohn, M. Van der Laan. 2014. “Promoting Transparency in Social Science Research.” Science 343(6166): 30–1. https://doi.org/10.1126/science.1245317

• †[Reproducibility] Marcus R. Munafo, Brian A. Nosek, Dorothy V. M. Bishop, Katherine S. Button, Christopher D. Chambers, Nathalie Percie du Sert, Uri Simonsohn, Eric-Jan Wagenmakers, Jennifer J. Ware, and John P. A. Ioannidis. 2017. “A Manifesto for Reproducible Science.” Nature Human Behavior 0021 (2017). https://doi.org/10.1038/s41562-016-0021

• ‡[SoftwareCarpentry] esp. “Lessons”: http://software-carpentry.org/lessons

• DSHandbook-Py, Ch. 9, “Technical Communication and Documentation”; Ch. 15, “Software Engineering Best Practices”

• †[TeamScienceToolkit] Vogel AL, Hall KL, Fiore SM, Klein JT, Bennett LM, Gadlin H, Stokols D, Nebeling LC, Wuchty S, Patrick K, Spotts EL, Pohl C, Riley WT, Falk-Krzesinski HJ. 2013. “The Team Science Toolkit: enhancing research collaboration through online knowledge sharing.” American Journal of Preventive Medicine 45: 787-9. https://doi.org/10.1016/j.amepre.2013.09.001

• §[Databrary] Kara Hall, Robert Croyle, and Amanda Vogel. Forthcoming (2017). Advancing Social and Behavioral Health Research through Cross-disciplinary Team Science. Springer. Includes Rick O. Gilmore and Karen E. Adolph, “Open Sharing of Research Video: Breaking the Boundaries of the Research Team.” (See http://databrary.org)

• CRAN: https://cran.r-project.org/web/views/ReproducibleResearch.html

Social Bias, Fair Algorithms

• MachineBias

• ‡[EmbeddingsBias] Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. “Semantics Derived Automatically from Language Corpora Contain Human Biases.” Science. https://arxiv.org/abs/1608.07187

• ‡[Debiasing] Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, Adam Kalai. 2016. “Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings.” https://arxiv.org/abs/1607.06520

• ‡[AvoidingBias] Moritz Hardt, Eric Price, Nathan Srebro. 2016. “Equality of Opportunity in Supervised Learning.” https://arxiv.org/abs/1610.02413

• ‡[InevitableBias] Jon Kleinberg, Sendhil Mullainathan, Manish Raghavan. 2016. “Inherent Trade-Offs in the Fair Determination of Risk Scores.” https://arxiv.org/abs/1609.05807

Databases and Data Management

• NRCreport, Ch. 3, “Scaling the Infrastructure for Data Management”

• †[SQL] Jan L. Harrington. 2010. SQL Clearly Explained. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780123756978

• DSHandbook-Py, Ch. 14, “Databases”

• †[noSQL] Guy Harrison. 2015. Next Generation Databases: NoSQL, NewSQL, and Big Data. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-1329-2

• †[Cloud] Divyakant Agrawal, Sudipto Das, and Amr El Abbadi. 2012. Data Management in the Cloud: Challenges and Opportunities. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdfplus/10.2200/S00456ED1V01Y201211DTM032


• †Morgan & Claypool Synthesis Lectures on Data Management. http://www.morganclaypool.com/toc/dtm/1/1

Data Wrangling

Theoretically-structured approaches to data wrangling

• ‡[TidyData-R] Garrett Grolemund and Hadley Wickham. 2017. R for Data Science (http://r4ds.had.co.nz), esp. Ch. 12, “Tidy Data”; also Wickham. 2014. “Tidy Data.” Journal of Statistical Software 59(10). http://www.jstatsoft.org/v59/i10/paper Tools: the tidyverse, http://tidyverse.org

• ‡[DataScience-Py] Jake VanderPlas. 2016. Python Data Science Handbook (https://github.com/jakevdp/PythonDataScienceHandbook), esp. Ch. 3 on pandas, http://pandas.pydata.org

• ‡[DataCarpentry] Colin Gillespie and Robin Lovelace. 2017. Efficient R Programming. https://csgillespie.github.io/efficientR, esp. Ch. 6 on “efficient data carpentry”

• §[Wrangler] Joseph M. Hellerstein, Jeffrey Heer, Tye Rattenbury, and Sean Kandel. 2017. Data Wrangling: Practical Techniques for Data Preparation. Tool: Trifacta Wrangler, http://www.trifacta.com/products/wrangler

Data wrangling practice

• DSHandbook-Py, Ch. 4, “Data Munging: String Manipulation, Regular Expressions, and Data Cleaning”

• †[DataSimplification] Jules J. Berman. 2016. Data Simplification: Taming Information with Open Source Tools. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780128037812

• †[Wrangling-Py] Jacqueline Kazil, Katharine Jarmul. 2016. Data Wrangling with Python. O’Reilly. http://proquestcombo.safaribooksonline.com.ezaccess.libraries.psu.edu/9781491948804

• †[Wrangling-R] Bradley C. Boehmke. 2016. Data Wrangling with R. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45599-0

Record Linkage, Entity Resolution, Deduplication

• †[DataMatching] Peter Christen. 2012. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-642-31164-2 (esp. Ch. 2, “The Data Matching Process”)

• DataSimplification, Chapter 5, “Identifying and Deidentifying Data”

• ‡[EntityResolution] Lise Getoor and Ashwin Machanavajjhala. 2013. “Entity Resolution for Big Data.” Tutorial, KDD. http://www.umiacs.umd.edu/~getoor/Tutorials/ER_KDD2013.pdf

• ‡[SyrianCasualties] Peter Sadosky, Anshumali Shrivastava, Megan Price, and Rebecca C. Steorts. 2015. “Blocking Methods Applied to Casualty Records from the Syrian Conflict.” https://arxiv.org/abs/1510.07714 (For more on blocking, see DataMatching, Ch. 4, “Indexing.”)

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html, “Statistical Matching and Record Linkage”

“Making up data”: Imputation, Smoothers, Kernels, Priors, Filters, Teleportation, Negative Sampling, Convolution, Augmentation, Adversarial Training


• †[Imputation] Yi Deng, Changgee Chang, Moges Seyoum Ido, and Qi Long. 2016. “Multiple Imputation for General Missing Data Patterns in the Presence of High-dimensional Data.” Scientific Reports 6(21689). https://www.nature.com/articles/srep21689

• ‡[Overimputation] Matthew Blackwell, James Honaker, and Gary King. 2017. “A Unified Approach to Measurement Error and Missing Data: Overview and Applications.” Sociological Methods & Research 46(3): 303-341. http://gking.harvard.edu/files/gking/files/measure.pdf

• Shalizi-ADA, Sect. 1.5 (Linear Smoothers), Ch. 8 (Splines), Sect. 14.4 (Kernel Density Estimates); DataScience-Python, NB 05.13, “Kernel Density Estimation”

• FightinWords (re: Bayesian priors as additional data, impact of priors on regularization)

• DeepLearning, Section 7.5 (Data Augmentation), 7.12 (Dropout), 7.13 (Adversarial Examples), Chapter 9 (Convolutional Networks)

• †[Adversarial] Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy. 2015. “Explaining and Harnessing Adversarial Examples.” https://arxiv.org/abs/1412.6572

• See also resampling and simulation methods

• See also feature engineering, preprocessing

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html, “Imputation”

(Direct) Data Representations, Data Mappings

• NRCreport, Ch. 5, “Large-Scale Data Representations”

• †[PatternRecognition] M. Narasimha Murty and V. Susheela Devi. 2011. Pattern Recognition: An Algorithmic Approach. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-0-85729-495-1 Section 2.1, “Data Structures for Pattern Representation”

• ‡[InfoRetrieval] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schutze. 2009. Introduction to Information Retrieval. Cambridge University Press. http://nlp.stanford.edu/IR-book Ch. 1, 2, 6 (also note slides used in their class)

• †[Algorithms] Brian Steele, John Chandler, Swarna Reddy. 2016. Algorithms for Data Science. Wiley. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45797-0 Ch. 2, “Data Mapping and Data Dictionaries”

• See also Social Data Structures

Similarity, Distance, Association, Covariance, The Kernel Trick

• PatternRecognition, Section 2.3, “Proximity Measures”

• ‡[Similarity] Brendan O’Connor. 2012. “Cosine similarity, Pearson correlation, and OLS coefficients.” https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients

• For relatively comprehensive lists, see also:

M.-J. Lesot, M. Rifqi, and H. Benhadda. 2009. “Similarity measures for binary and numerical data: a survey.” Int. J. Knowledge Engineering and Soft Data Paradigms 1(1): 63-. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10112126533&rep=rep1&type=pdf

Seung-Seok Choi, Sung-Hyuk Cha, Charles C. Tappert. 2010. “A Survey of Binary Similarity and Distance Measures.” Journal of Systemics, Cybernetics & Informatics 8(1): 43-8. http://www.iiisci.org/Journal/CV$/sci/pdfs/GS315JG.pdf


Sung-Hyuk Cha. 2007. “Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions.” International Journal of Mathematical Models and Methods in Applied Sciences 4(1): 300-7. http://csis.pace.edu/ctappert/dps/d861-12/session4-p2.pdf

Anna Huang. 2008. “Similarity Measures for Text Document Clustering.” Proceedings of the 6th New Zealand Computer Science Research Student Conference, 49-56. http://www.nzcsrsc08.canterbury.ac.nz/site/proceedings/Individual_Papers/pg049_Similarity_Measures_for_Text_Document_Clustering.pdf

• Michael I. Jordan. “The Kernel Trick.” (Lecture Notes) https://people.eecs.berkeley.edu/~jordan/courses/281B-spring04/lectures/lec3.pdf

• Eric Kim. 2017. “Everything You Ever Wanted to Know about the Kernel Trick (But Were Afraid to Ask).” http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick_blog_ekim_12_20_2017.pdf
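
As a companion to the Similarity entry above, here is a minimal Python sketch of one relationship between two of the measures in this section's heading: Pearson correlation is cosine similarity computed on mean-centered vectors. It is not taken from any of the listed readings. The function names and the toy vectors are illustrative assumptions, and only the standard library is used.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def center(x):
    """Subtract the mean from every element."""
    mean = sum(x) / len(x)
    return [a - mean for a in x]


def pearson_correlation(x, y):
    """Pearson correlation computed directly from deviations about the mean."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Toy vectors (hypothetical data, for illustration only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 6.0]

print("cosine(x, y)           =", cosine_similarity(x, y))
print("pearson(x, y)          =", pearson_correlation(x, y))
print("cosine(centered x, y)  =", cosine_similarity(center(x), center(y)))
```

Adding a constant to every element of x leaves the Pearson value unchanged but changes the raw cosine value, which is one practical reason to choose between the two deliberately.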

Derived Data Representations - Dimensionality Reduction, Compression, Decomposition, Embeddings

The groupings here and under the measurement / multivariate statistics section are particularly arbitrary. For example, “k-Means clustering” can be viewed as a technique for “dimensionality reduction,” “compression,” “feature extraction,” “latent variable measurement,” “unsupervised learning,” “collaborative filtering”
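To make the point about k-means concrete, here is a minimal Python sketch that treats k-means as compression (vector quantization): each point is replaced by the index of its nearest centroid, so the data can be stored as a small codebook plus one integer code per point. The toy points, the function names (`sq_dist`, `nearest`, `kmeans`), the random initialization, and the fixed iteration count are all illustrative assumptions, not taken from any of the readings.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(p, centroids):
    """Index of the centroid closest to point p."""
    return min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns (codebook, codes)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumed initialization: k random data points
    for _ in range(iters):
        codes = [nearest(p, centroids) for p in points]  # assignment step
        for j in range(k):  # update step
            members = [p for p, c in zip(points, codes) if c == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(x) / len(members) for x in zip(*members))
    codes = [nearest(p, centroids) for p in points]  # final codes for final centroids
    return centroids, codes

# Hypothetical toy data: two well-separated groups of 2-D points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
codebook, codes = kmeans(points, k=2)

# "Decompression": every point is approximated by its centroid.
reconstructed = [codebook[c] for c in codes]
```

Read this way, `codes` is a one-dimensional discrete representation of the data (dimensionality reduction, a learned feature, or a latent class), while `codebook` plus `codes` is a lossy compressed copy of `points`.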

Clustering, hashing, quantization, blocking, compression

• ‡[Compression] Khalid Sayood. 2012. Introduction to Data Compression, 4th ed. Springer. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780124157965 (e.g., coding, blocking via vector quantization)

• MMDS Ch. 3, “Finding Similar Items” (minhashing, locality sensitive hashing)

• Multivariate-R Ch. 6, “Clustering”

• DataScience-Python NB 05.11, “k-Means Clustering”; NB 05.12, “Gaussian Mixture Models”

• ‡[KMeansHashing] Kaiming He, Fang Wen, Jian Sun. 2013. “K-means Hashing: An Affinity-Preserving Quantization Method for Learning Binary Compact Codes.” CVPR. https://www.cv-foundation.org/openaccess/content_cvpr_2013/papers/He_K-Means_Hashing_An_2013_CVPR_paper.pdf

• †[Squashing] Madigan, D., Raghavan, N., Dumouchel, W., Nason, M., Posse, C., and Ridgeway, G. (2002). “Likelihood-based data squashing: A modeling approach to instance construction.” Data Mining and Knowledge Discovery 6(2): 173-190. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1023/A:1014095614948

• †[Core-sets] Piotr Indyk, Sepideh Mahabadi, Mohammad Mahdian, Vahab S. Mirrokni. 2014. “Composable core-sets for diversity and coverage maximization.” PODS ’14, 100-8. https://doi.org/10.1145/2594538.2594560

Feature selection, feature extraction, feature engineering, weighting, preprocessing

• PatternRecognition, Sections 2.6-7, “Feature Selection, Feature Extraction”

• DSHandbook-Py Ch. 7, “Interlude: Feature Extraction Ideas”

• DataScience-Python NB 05.04, “Feature Engineering”

• Features in text: NLP and InfoRetrieval re: tf-idf and similar; FightinWords

• Features in images: DataScience-Python NB 05.14, “Image Features”

Dimensionality reduction, decomposition, factorization, change of basis, reparameterization, matrix completion, latent variables, source separation

• MMDS Ch. 9, “Recommendation Systems”; Ch. 11, “Dimensionality Reduction”

• DSHandbook-Py Ch. 10, “Unsupervised Learning: Clustering and Dimensionality Reduction”

• ‡[Shalizi-ADA] Cosma Rohilla Shalizi. 2017. Advanced Data Analysis from an Elementary Point of View. Ch. 16 (“Principal Components Analysis”), Ch. 17 (“Factor Models”). http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV

See also Multivariate-R Ch. 3, “Principal Components Analysis”; Ch. 4, “Multidimensional Scaling”; Ch. 5, “Exploratory Factor Analysis”; Latent

• DeepLearning Ch. 2, “Linear Algebra”; Ch. 13, “Linear Factor Models”

• ‡[GloVe] Jeffrey Pennington, Richard Socher, Christopher Manning. 2014. “GloVe: Global vectors for word representation.” EMNLP. https://nlp.stanford.edu/projects/glove

• †[NMF] Daniel D. Lee and H. Sebastian Seung. 1999. “Learning the parts of objects by non-negative matrix factorization.” Nature 401: 788-791. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1038/44565

• †[CUR] Michael W. Mahoney and Petros Drineas. 2009. “CUR matrix decompositions for improved data analysis.” PNAS. http://www.pnas.org/content/106/3/697.full

• ‡[ICA] Aapo Hyvärinen and Erkki Oja. 2000. “Independent Component Analysis: Algorithms and Applications.” Neural Networks 13(4-5): 411-30. https://www.cs.helsinki.fi/u/ahyvarin/papers/NN00new.pdf

• [RandomProjection] Ella Bingham and Heikki Mannila. 2001. “Random projection in dimensionality reduction: applications to image and text data.” KDD. https://doi.org/10.1145%2F502512.502546

• Compression, e.g., “Transform coding,” “Wavelets”

Nonlinear dimensionality reduction, Manifold learning

• DeepLearning Section 5.11.3, “Manifold Learning”

• DataScience-Python NB 05.10, “Manifold Learning” (Locally linear embedding [LLE], Isomap)

• ‡[KernelPCA] Sebastian Raschka. 2014. “Kernel tricks and nonlinear dimensionality reduction via RBF kernel PCA.” http://sebastianraschka.com/Articles/2014_kernel_pca.html

• Shalizi-ADA Ch. 18, “Nonlinear Dimensionality Reduction” (LLE)

• ‡[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi. 2003. “Laplacian eigenmaps for dimensionality reduction and data representation.” Neural Computation 15(6): 1373-1396. http://web.cse.ohio-state.edu/~belkin.8/papers/LEM_NC_03.pdf

• DeepLearning Ch. 14 (“Autoencoders”)

• ‡[word2vec] Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” https://arxiv.org/abs/1301.3781

• ‡[word2vecExplained] Yoav Goldberg and Omer Levy. 2014. “word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method.” https://arxiv.org/abs/1402.3722

• ‡[t-SNE] Laurens van der Maaten and Geoffrey Hinton. 2008. “Visualising Data using t-SNE.” Journal of Machine Learning Research 9: 2579-2605. http://jmlr.csail.mit.edu/papers/volume9/vandermaaten08a/vandermaaten08a.pdf

Computation and Scaling Up

Scientific computing, computation at scale

• NRCreport Ch. 10, “The Seven Computational Giants of Massive Data Analysis”

• DSHandbook-Py Ch. 21, “Performance and Computer Memory”; Ch. 22, “Computer Memory and Data Structures”

• ‡[ComputationalStatistics-Python] Cliburn Chan. Computational Statistics in Python. https://people.duke.edu/~ccc14/sta-663/index.html

• CRAN Task View: High Performance Computing. https://cran.r-project.org/web/views/HighPerformanceComputing.html

Numerical computing

• [NumericalComputing] Ward Cheney and David Kincaid. 2013. Numerical Mathematics and Computing, 7th ed. Brooks/Cole: Cengage Learning.

• CRAN Task View: Numerical Mathematics. https://cran.r-project.org/web/views/NumericalMathematics.html

Optimization (e.g., MLE, gradient descent, stochastic gradient descent, EM algorithm, neural nets)

• DSHandbook-Py Ch. 23, “Maximum Likelihood Estimation and Optimization” (gradient descent)

• DeepLearning Ch. 4, “Numerical Computation”; Ch. 6, “Deep Feedforward Networks”; Ch. 8, “Optimization for Training Deep Models”

• CRAN Task View: Optimization and Mathematical Programming. https://cran.r-project.org/web/views/Optimization.html

Linear algebra, matrix computations

• [MatrixComputations] Gene H. Golub and Charles F. Van Loan. 2013. Matrix Computations, 4th ed. Johns Hopkins University Press.

• ‡[NetflixMatrix] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. “Matrix Factorization Techniques for Recommender Systems.” Computer, August: 42-9. https://datajobs.com/data-science-repo/Recommender-Systems-[Netflix].pdf

• ‡[BigDataPCA] Jianqing Fan, Qiang Sun, Wen-Xin Zhou, Ziwei Zhu. “Principal Component Analysis for Big Data.” http://www.princeton.edu/~ziweiz/pca.pdf

• ‡[RandomSVD] Andrew Tulloch. 2009. “Fast Randomized SVD.” https://research.fb.com/fast-randomized-svd

• ‡[FactorSGD] Rainer Gemulla, Peter J. Haas, Erik Nijkamp, and Yannis Sismanis. 2011. “Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent.” KDD. http://www.cs.utah.edu/~hari/teaching/bigdata/gemulla11dsgd.pdf

• ‡[Sparse] Max Grossman. 2015. “101 Ways to Store a Sparse Matrix.” https://medium.com/@jmaxg3/101-ways-to-store-a-sparse-matrix-c7f2bf15a229

• ‡[DontInvert] John D. Cook. 2010. “Don’t Invert that Matrix.” https://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix

Simulation-based inference, resampling, Monte Carlo methods, MCMC, Bayes, approximate inference

• ‡[MarkovVisually] Victor Powell. “Markov Chains Explained Visually.” http://setosa.io/ev/markov-chains

• DSHandbook-Py Ch. 25, “Stochastic Modeling” (Markov chains, MCMC, HMM)

• Bayes; ComputationalStatistics-Python

• DeepLearning Ch. 17, “Monte Carlo Methods”; Ch. 19, “Approximate Inference”

• ‡[VariationalInference] Jason Eisner. 2011. “High-Level Explanation of Variational Inference.” https://www.cs.jhu.edu/~jason/tutorials/variational.html

• CRAN Task View: Bayesian Inference. https://cran.r-project.org/web/views/Bayesian.html

Parallelism, MapReduce, Split-Apply-Combine

• ‡[MapReduceIntuition] Jigsaw Academy. 2014. “Big Data Specialist: MapReduce.” https://www.youtube.com/watch?v=TwcYQzFqg-8&feature=youtu.be

• MMDS Ch. 2, “Map-Reduce and the New Software Stack” (3rd Edition discusses Spark & TensorFlow)

• Algorithms Ch. 3, “Scalable Algorithms and Associative Statistics”; Ch. 4, “Hadoop and MapReduce”

• NRCreport Ch. 6, “Resources, Trade-offs, and Limitations”

• TidyData-R (Split-apply-combine is the motivating principle behind the “tidyverse” approach). See also Part III, “Program” (pipes, functions, vectors, iteration)

• ‡[TidySAC-Video] Hadley Wickham. 2017. “Data Science in the Tidyverse.” https://www.rstudio.com/resources/videos/data-science-in-the-tidyverse

• See also: FactorSGD
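For readers who have not met split-apply-combine before, the pattern named in this list can be sketched in a few lines of standard-library Python. The helper name `split_apply_combine` and the toy `posts` records are hypothetical, chosen only for illustration; the readings above use R's tidyverse or Hadoop/Spark rather than this code.

```python
from collections import defaultdict

def split_apply_combine(rows, key, value, func):
    """Group rows by `key`, apply `func` to each group's values, return one dict."""
    groups = defaultdict(list)
    for row in rows:  # split: bucket the values by group key
        groups[row[key]].append(row[value])
    # apply func within each group, then combine the results into a single dict
    return {g: func(vals) for g, vals in groups.items()}

# Hypothetical toy data: one record per social media post.
posts = [
    {"user": "a", "likes": 3},
    {"user": "b", "likes": 5},
    {"user": "a", "likes": 1},
]
likes_per_user = split_apply_combine(posts, "user", "likes", sum)
max_per_user = split_apply_combine(posts, "user", "likes", max)
```

Because each group is processed independently, the "apply" step could in principle run on separate workers, which is the intuition that MapReduce-style systems build on.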

Functional Programming

• ComputationalStatistics-Python, “Functions are first class objects” through first exercises

• DSHandbook-Py Ch. 20, “Programming Language Concepts”

• See also: Haskell

Scaling, iteration, streaming data, online algorithms (Spark)

• NRCreport Ch. 4, “Temporal Data and Real-Time Algorithms”

• MMDS Ch. 4, “Mining Data Streams”

• ‡[BDAS] AMPLab. BDAS: The Berkeley Data Analytics Stack. https://amplab.cs.berkeley.edu/software

• †[Spark] Mohammed Guller. 2015. Big Data Analytics with Spark: A Practitioner’s Guide to Using Spark for Large-Scale Data Processing, Machine Learning, and Graph Analytics, and High-Velocity Data Stream Processing. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-0964-6

• DSHandbook-Py Ch. 13, “Big Data”

• DeepLearning Ch. 10, “Sequence Modeling: Recurrent and Recursive Nets”

General Resources for Python and R

• †[DSHandbook-Py] Field Cady. 2017. The Data Science Handbook (Python-based). Wiley. http://onlinelibrary.wiley.com.ezaccess.libraries.psu.edu/book/10.1002/9781119092919

• ‡[Tutorials-R] Ujjwal Karn. 2017. “A curated list of R tutorials for Data Science, NLP, and Machine Learning.” https://github.com/ujjwalkarn/DataScienceR

• ‡[Tutorials-Python] Ujjwal Karn. 2017. “A curated list of Python tutorials for Data Science, NLP, and Machine Learning.” https://github.com/ujjwalkarn/DataSciencePython

Cutting and Bleeding Edge of Data Science Languages

• [Scala] †Vishal Layka and David Pollak. 2015. Beginning Scala. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-0232-6; †Patrick Nicolas. 2014. Scala for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1901910; ‡Scala site: https://www.scala-lang.org (See also )

• [Julia] †Ivo Balbaert. 2015. Getting Started with Julia. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1973847; ‡Douglas Bates. 2013. “Julia for R Programmers.” http://www.stat.wisc.edu/~bates/JuliaForRProgrammers.pdf; ‡Julia site: https://julialang.org

• [Haskell] †Richard Bird. 2014. Thinking Functionally with Haskell. https://doi-org.ezaccess.libraries.psu.edu/10.1017/CBO9781316092415; †Hakim Cassimally. 2017. Learning Haskell Programming. https://www.lynda.com/Haskell-tutorials/Learning-Haskell-Programming/604926-2.html; †James Church. 2017. Learning Haskell for Data Analysis. https://www.lynda.com/Developer-tutorials/Learning-Haskell-Data-Analysis/604234-2.html; Haskell site: https://www.haskell.org

• [Clojure] †Mark McDonnell. 2017. Quick Clojure: Essential Functional Programming. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2952-1; †Akhil Wali. 2014. Clojure for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1674848; †Arthur Ulfeldt. 2015. https://www.lynda.com/Clojure-tutorials/Up-Running-Clojure/413127-2.html; ‡Clojure site: https://clojure.org

• [TensorFlow] ‡Abadi et al. (Google). 2016. “TensorFlow: A System for Large-Scale Machine Learning.” https://arxiv.org/abs/1605.08695; ‡Udacity, “Deep Learning.” https://www.udacity.com/course/deep-learning--ud730; ‡TensorFlow site: https://www.tensorflow.org

• [H2O] Darren Cook. 2016. Practical Machine Learning with H2O: Powerful, Scalable Techniques for Deep Learning and AI. O’Reilly; Arno Candel and Viraj Parmar. 2015. Deep Learning with H2O; ‡H2O site: https://www.h2o.ai

Social Data Structures

Space and Time

• †[GIA] David O’Sullivan, David J. Unwin. 2010. Geographic Information Analysis, Second Edition. John Wiley & Sons. http://onlinelibrary.wiley.com/book/10.1002/9780470549094

• [Space-Time] Donna J. Peuquet. 2003. Representations of Space and Time. Guilford.

• Roger S. Bivand, Edzer Pebesma, Virgilio Gómez-Rubio. 2013. Applied Spatial Data Analysis in R. Free through library: http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-7618-4. See also Edzer Pebesma. 2016. “Handling and Analyzing Spatial, Spatiotemporal and Movement Data in R.” https://edzer.github.io/UseR2016

• ‡[PySAL] Sergio J. Rey and Dani Arribas-Bel. 2016. “Geographic Data Science with PySAL and the pydata Stack.” http://darribas.org/gds_scipy16

• DataScience-Python NB 04.13, “Geographic Data with Basemap”

• Time: “Longitudinal,” intra-individual data, as common in developmental psychology: †[Longitudinal] Garrett Fitzmaurice, Marie Davidian, Geert Verbeke, and Geert Molenberghs, editors. 2009. Longitudinal Data Analysis. Chapman & Hall / CRC. http://pensu.eblib.com/patron/FullRecord.aspx?p=359998

• Time: “Time Series, Time Series Cross Section, Panel,” as common in economics, political science, and sociology: DataScience-Python Ch. 3 re: pandas; DSHandbook-Py Ch. 17, “Time Series Analysis”; Econometrics; GelmanHill

• Time: “Sequential, Data Streams,” as in NLP, machine learning: Spark and other “streaming data” readings

• Space-Time: Data with continuity, connectivity, neighborhood structure. See DeepLearning Chapter 9 re: convolution

• CRAN Task Views: Spatial, https://cran.r-project.org/web/views/Spatial.html; Spatiotemporal, https://cran.r-project.org/web/views/SpatioTemporal.html; Time Series, https://cran.r-project.org/web/views/TimeSeries.html

Network Graphs

• ‡[Networks] David Easley and Jon Kleinberg. 2010. Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press. http://www.cs.cornell.edu/home/kleinber/networks-book

• MMDS Ch. 5, “Link Analysis”; Ch. 10, “Mining Social-Network Graphs”

• †[Networks-R] Douglas Luke. 2015. A User’s Guide to Network Analysis in R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-3-319-23883-8

• †[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry. 2017. Python for Graph and Network Analysis. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-53004-8

Hierarchy, Clustered Data, Aggregation, Mixed Models

• †[GelmanHill] Andrew Gelman and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press. http://pensu.eblib.com/patron/FullRecord.aspx?p=288457

• †[MixedModels] Eugene Demidenko. 2013. Mixed Models: Theory and Applications with R, 2nd ed. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10748641

Social Data Channels

Language, Text, Speech, Audio

• ‡[NLP] Daniel Jurafsky and James H. Martin. Forthcoming (2017). Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics, 3rd ed. Prentice-Hall. Preprint: https://web.stanford.edu/~jurafsky/slp3

• InfoRetrieval

• ‡[CoreNLP] Stanford CoreNLP. https://stanfordnlp.github.io/CoreNLP

• †[TextAsData] Grimmer and Stewart. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis. https://doi.org/10.1093/pan/mps028

• Quinn-Topics

• FightinWords

• †[NLTK-Python] Jacob Perkins. 2010. Python Text Processing with NLTK 2.0 Cookbook. Packt. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10435387; †[TextAnalytics-Python] Dipanjan Sarkar. 2016. Text Analytics with Python. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2388-8

• DSHandbook-Py Ch. 16, “Natural Language Processing”

• Matthew J. Denny. http://www.mjdenny.com/Text_Processing_In_R.html

• CRAN: Natural Language Processing. https://cran.r-project.org/web/views/NaturalLanguageProcessing.html

• †[AudioData] Dean Knox and Christopher Lucas. 2017. “The Speaker-Affect Model: Measuring Emotion in Political Speech with Audio Data.” http://christopherlucas.org/files/PDFs/sam.pdf; https://www.youtube.com/watch?v=Hs8A9dwkMzI

• †Morgan & Claypool Synthesis Lectures on Human Language Technologies. http://www.morganclaypool.com/toc/hlt/1/1

• †Morgan & Claypool Synthesis Lectures on Speech and Audio Processing. http://www.morganclaypool.com/toc/sap/1/1

Vision, Image, Video

• ‡[ComputerVision] Richard Szeliski. 2010. Computer Vision: Algorithms and Applications. Springer. http://szeliski.org/Book

• MixedModels Ch. 11, “Statistical Analysis of Shape”; Ch. 12, “Statistical Image Analysis”

• Audio/Video/Volumetric: See DeepLearning Chapter 9 re: convolution

• †Morgan & Claypool Synthesis Lectures on Image, Video, and Multimedia Processing. http://www.morganclaypool.com/toc/ivm/1/1

• †Morgan & Claypool Synthesis Lectures on Computer Vision. http://www.morganclaypool.com/toc/cov/1/1

Approaches to Learning from Data (The Analytics Layer)

• NRCreport Ch. 7, “Building Models from Massive Data”

• MMDS (data-mining)

• CausalInference

• ‡[VisualAnalytics] Daniel Keim, Jörn Kohlhammer, Geoffrey Ellis, and Florian Mansmann, editors. 2010. Mastering the Information Age: Solving Problems with Visual Analytics. The Eurographics Association, Goslar, Germany. http://www.vismaster.eu/book. See also †Morgan & Claypool Synthesis Lectures on Visualization, http://www.morganclaypool.com/toc/vis/2/1

• ‡[Econometrics] Bruce E. Hansen. 2014. Econometrics. http://www.ssc.wisc.edu/~bhansen/econometrics

• ‡[StatisticalLearning] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning with Applications in R. Springer. http://www-bcf.usc.edu/~gareth/ISL/index.html

• [MachineLearning] Christopher Bishop. 2006. Pattern Recognition and Machine Learning. Springer.

• †[Bayes] Peter D. Hoff. 2009. A First Course in Bayesian Statistical Methods. Springer. http://link.springer.com/book/10.1007%2F978-0-387-92407-6

• †[GraphicalModels] Søren Højsgaard, David Edwards, Steffen Lauritzen. 2012. Graphical Models with R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-2299-0

• [InformationTheory] David MacKay. 2003. Information Theory, Inference, and Learning Algorithms. Cambridge University Press.

• ‡[DeepLearning] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org (see also: Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. “Deep Learning.” Nature 521(7553): 436–444. https://doi.org/10.1038/nature14539)

See also ‡[Keras] Keras: The Python Deep Learning Library. https://keras.io; François Chollet, Directory of Keras tutorials: https://github.com/fchollet/keras-resources

• †[SignalProcessing] Jose Maria Giron-Sierra. 2013. Digital Signal Processing with MATLAB Examples (Volumes 1-3). Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-981-10-2534-1

Penn State Policy Statements

Academic Integrity

Academic integrity is the pursuit of scholarly activity in an open, honest, and responsible manner. Academic integrity is a basic guiding principle for all academic activity at The Pennsylvania State University, and all members of the University community are expected to act in accordance with this principle. Consistent with this expectation, the University’s Code of Conduct states that all students should act with personal integrity; respect other students’ dignity, rights, and property; and help create and maintain an environment in which all can succeed through the fruits of their efforts.

Academic integrity includes a commitment by all members of the University community not to engage in or tolerate acts of falsification, misrepresentation, or deception. Such acts of dishonesty violate the fundamental ethical principles of the University community and compromise the worth of work completed by others.

Disability Accommodation

Penn State welcomes students with disabilities into the University’s educational programs. Every Penn State campus has an office for students with disabilities. The Student Disability Resources (SDR) website provides contact information for every Penn State campus (http://equity.psu.edu/sdr/disability-coordinator). For further information, please visit the Student Disability Resources website (http://equity.psu.edu/sdr).

In order to receive consideration for reasonable accommodations, you must contact the appropriate disability services office at the campus where you are officially enrolled, participate in an intake interview, and provide documentation. See documentation guidelines at (http://equity.psu.edu/sdr/guidelines). If the documentation supports your request for reasonable accommodations, your campus disability services office will provide you with an accommodation letter. Please share this letter with your instructors and discuss the accommodations with them as early as possible. You must follow this process for every semester that you request accommodations.

Psychological Services

Many students at Penn State face personal challenges or have psychological needs that may interfere with their academic progress, social development, or emotional wellbeing. The university offers a variety of confidential services to help you through difficult times, including individual and group counseling, crisis intervention, consultations, online chats, and mental health screenings. These services are provided by staff who welcome all students and embrace a philosophy respectful of clients’ cultural and religious backgrounds, and sensitive to differences in race, ability, gender identity, and sexual orientation.

• Counseling and Psychological Services at University Park (CAPS) (http://studentaffairs.psu.edu/counseling): 814-863-0395

• Penn State Crisis Line (24 hours / 7 days/week): 877-229-6400

• Crisis Text Line (24 hours / 7 days/week): Text LIONS to 741741

Educational Equity

Penn State takes great pride to foster a diverse and inclusive environment for students, faculty, and staff. Consistent with University Policy AD29, students who believe they have experienced or observed a hate crime, an act of intolerance, discrimination, or harassment that occurs at Penn State are urged to report these incidents as outlined on the University’s Report Bias webpage (http://equity.psu.edu/reportbias).

Background Knowledge

Students will have a variety of backgrounds. Prior to beginning interdisciplinary coursework to fulfill Social Data Analytics degree requirements, including SoDA 501 and 502, students are expected to have advanced (graduate) training in at least one of the component areas of Social Data Analytics and a familiarity with basic concepts in the others.

With regard to specialization, students are expected to have advanced (graduate) training in ONE of the following:

• quantitative social science methodology and a discipline of social science (as would be the case for a second-year PhD student in Political Science, Sociology, Criminology, Human Development and Family Studies, or Demography), OR

• statistics (as would be the case for a second-year PhD student in Statistics), OR

• information science or informatics (as would be the case for a second-year PhD student in Information Science and Technology, or a second-year PhD student in Geography specializing in GIScience), OR

• computer science (as would be the case for a second-year PhD student in Computer Science and Engineering).

This requirement is met as a matter of meeting home program requirements for students in the dual-title PhD, but may require additional coursework on the part of students in other programs wishing to pursue the graduate minor.

With regard to general preparation, students are expected to have ALL of the following technical knowledge:

• basic programming skills, including basic facility with R &/or Python, AND

• basic knowledge of relational databases &/or geographic information systems, AND

• basic knowledge of probability, applied statistics, &/or social science research design, AND

• basic familiarity with a substantive or theoretical area of social science (e.g., 300-level coursework in political science, sociology, criminology, human development, psychology, economics, communication, anthropology, human geography, social informatics, or similar fields).

It is not unusual for students to have one or more gaps in this preparation at time of application to the SoDA program. Students should work with Social Data Analytics advisers to develop a plan for timely remediation of any deficiencies, which generally will not require formal coursework for students whose training and interests are otherwise appropriate for pursuit of the Social Data Analytics degree. Where possible, this will be addressed at time of application to the Social Data Analytics program.

To this end, some free training materials are linked in the reference section, and there are also abundant free, high-quality, self-paced, course-style training materials on these and related subjects available through edX (http://www.edx.org), Udacity (http://www.udacity.com), Codecademy (http://codecademy.com), and Lynda (http://www.lynda.psu).


learning” or “visual analytics” approaches challenge conventional social science methodology, or how social scientific thinking challenges emerging practices and conventional wisdom in data science – and tools associated with those concepts. In part, this involves interdisciplinary arbitrage and translation – identifying common concepts and structure that may go by slightly different names in different disciplines and settings. To this end, I want you each to send me – by Wednesday, 7:00am each week, by email – lists of terms / concepts that you encountered in that week’s reading in three categories: (1) terms/concepts that were new but you think you now understand; (2) terms/concepts that seem to be used differently than in the context of your home discipline; and (3) terms/concepts you still find confusing.

Grading Criteria: Full 30 points if you are present every week, have made a good faith effort to provide your lists of confusing terms and concepts on time, have thoughtfully read all of the assignments, are prepared to talk about the week’s readings and themes, and consistently contribute in ways that are productive to the discussion (good questions, thoughtful responses, etc.), with all of that weighted more heavily when you are a designated respondent. If you don’t do any of that, 0 points. Sliding scale in between.

• Exercises - 20%. It is explicitly not an objective of this course to “train” you in all of the tools we will mention, much less those that we could mention, a task that would take years. In the interest of collectively “moving the ball forward” for each of you, however, we will have a small number of assigned exercises. Early in the semester, these will be done in (assigned) interdisciplinary teams. Later exercises will be individual.

• Semester Team Project - 40%. You will, in a team consisting of at least three disciplines, create, gather, and/or organize/manipulate/prepare for analytics a “big” “social” dataset. The data must be at least partly social (arise from human interactions). There must be some nontrivial computational or informational element to the project. There need not be a final analysis of the data, but there must be some basic calculation of descriptive statistics over the data, and some demonstration of the validity of the data for the (or an) intended scientific purpose – e.g., representativeness (and of what), balance / randomization, measurement validity, etc.

March 1 - Deadline for approval of teams and (proposed) projects

March 15 - 25% Project Review

March 29 - 50% Project Review

April 12 - 75% Project Review

April 26 - Team Project Presentations

May 3 - White Paper and Data / Replication Archive Due. Submit a 4-5 page paper that documents what was done and why, discusses problems you encountered, provides an assessment of the validity of the data for an analytic purpose (or how it might be validated), and discusses what further work might be done to make the data into a useful resource for others and/or to publish an analysis based on the data. Share your code and data with me in maximally documented and reproducible form (ideally a notebook stored on github or similar).

Course Schedule 2018

January 11

• Introductions

• Syllabus (What this course is and isn’t. How this course (hopefully) works.)

• Further reference:

10RulesforData; SoftwareCarpentry Lessons

Section “General Resources for Python and R”

Recommended: Python via Anaconda (https://www.anaconda.com); R (https://www.r-project.org) & RStudio (https://www.rstudio.com); Account on ICS-ACI (https://ics.psu.edu); Git (https://git-scm.com)

• Exercise 1: Team Updates 1/18, Due 1/25. BitByBit Exercise 26 (a-g). Teams:

TeamFrancisco: Sara, Claire, Rosemary, Fangcao

TeamFreelin: Brittany, Arif, So Young, Xiaoran

TeamKankane: Shipi, Steve, Lulu, Omer

January 18

• Readings (send list of confusing terms / concepts by 7:00am Jan 17, Wednesday):

BitByBit Ch. 1 & 2

CompSocSci; Monroe-No; Monroe-5Vs

One article from a different discipline listed in the “Multidisciplinary Perspectives” section, excluding Business-BigData

• Further reference: Section “Big Data & Social Data Analytics”

• Exercise 1: Team Updates

January 25

• Readings (send list of confusing terms / concepts by 7:00am Jan 24, Wednesday):

GoogleFlu; GoogleBooks; EmbeddingsBias; MachineBias; RacistBot; BDSS-Census

ResearchMethodsKB “Measurement”; Quinn-Topics

• Further reference:

Section “¡Cuidado!”

Section “Measurement, Reliability, and Validity”

• Exercise 1 Due: Teams Report

• Determine “discussion lead” dates

• Exercise 2: Due 2/8. Wikipedia / Google Trends exercise. Teams:

TeamKelling: Claire, Brittany, Arif

TeamYalcin: Omer, Lulu, Fangcao

TeamPang: Rosemary, Shipi, Sara

TeamSun: Xiaoran, Steve, So Young

February 1

• Bing Pan (RPTM), “Big Data and Forecasting in Tourism”

• Readings (send list of confusing terms / concepts by 7:00am Jan 31, Wednesday):

Bing Pan: “Identifying the Next Non-Stop Flying Market with a Big Data Approach,” https://doi.org/10.1016/j.tourman.2017.12.008; “Google Trends and Tourist Arrivals: Emerging Biases and Proposed Corrections,” https://doi.org/10.1016/j.tourman.2017.10.014. (See also “Forecasting Destination Weekly Hotel Occupancy with Big Data,” http://journals.sagepub.com/doi/abs/10.1177/0047287516669050; “Forecasting tourism demand with composite search index,” https://doi.org/10.1016/j.tourman.2016.07.005)

UnobtrusiveMeasures

Multivariate-R Chapter 1 (Don’t get bogged down when the math starts); LatentVariables Chapter 1 (Skim - don’t get bogged down in the math); NetflixPrize

ResearchMethodsKB “Sampling”; NRCreport Chapter 8, “Sampling and Massive Data”; BitByBit Chapter 3, “Asking Questions”

• Further reference:

Section “Indirect, Unobtrusive, Nonreactive Measures, Data Exhaust”

Section “Multiple Measures, Latent Variable Measurement”

Section “Sampling and Survey Design”

Section “Open data, APIs, linked data, ...”

Section “Space and Time” (readings on Time)

February 8

• Clio Andris (GEOG), “What AirBNB and Yelp can teach us about human behavior in cities”

• Readings (send list of confusing terms/concepts by 7:00 am Feb 7, Wednesday)

Clio Andris: “Using Yelp to Find Romance in the City: A Case of Restaurants in Four Cities,” https://www.dropbox.com/s/uink0zcklwcpo5g/Yelp_Restaurants.pdf?dl=0; “Hidden Style in the City: An Analysis of Geolocated Airbnb Rental Images in Ten Major Cities,” https://www.dropbox.com/s/t7y7f6880m4ty7v/AirBNB_Analysis.pdf?dl=0

InfoRetrieval Chs. 1, 2, 6 (you may prefer the slides from their classes). My primary hope here is that you understand the “vector space model” and “cosine similarity” (from Chapter 6). My secondary hope is that you understand the basics of Boolean information retrieval (and notions like “index,” “inverted index,” and “postings”). My tertiary hope is that you are exposed to some basic concepts of text analytics / NLP (including “tokenization,” “normalization,” “stemming,” “lemmatization,” “stop words,” “tf-idf”).

FightinWords (FW was used by Jurafsky in http://firstmonday.org/ojs/index.php/fm/article/view/4944/3863 and a best-selling book, The Language of Food, for similar applications to Andris, based on Yelp reviews, and now appears in his textbook, NLP)

• Further reference

Section “Space and Time”

Section “Web Scraping”

Section “Data Representations, Data Mappings”

Section “Feature selection, feature extraction, feature engineering, ...”

Section “Databases and data management” (esp. SQL)

Section “Language, Text, Speech, Audio”

• Exercise 2 Due: Teams Report

February 15 (Postponed)

• Conrad Tucker (IE), “Cybersecurity Policies and Their Impact on Dynamic Data Driven Application Systems”

• Readings (send list of confusing terms/concepts by 7:00 am Feb 14, Wednesday)

Conrad Tucker: “Cybersecurity Policies and Their Impact on Dynamic Data Driven Application Systems,” http://ieeexplore.ieee.org/document/8064151. See also “Generative Adversarial Networks for Increasing the Veracity of Big Data,” http://ieeexplore.ieee.org/document/8258219

February 22

• David Reitter (IST), “Computational Psycholinguistics”

• Readings (two weeks’ worth)

David Reitter: “Alignment in Web-Based Dialogue: Studies in Big Data Computational Psycholinguistics” (Based on data from the Cancer Survivors Network and Reddit) http://www.david-reitter.com/pub/reitter2017alignment.pdf

Similarity

PatternRecognition Ch. 2, “Representation”

MMDS Chapter 3, “Finding Similar Items” (to understand “minhashing” and “locality sensitive hashing” you’ll need to understand “hashing” – Section 1.3.2, also discussed in InfoRetrieval Chapter 3)

• Further reference

Section “Similarity, Distance, Association, ...”

Section “Derived Data Representations, ...”

Section “Record Linkage, Entity Resolution, ...”

Section “Clustering, Hashing, Compression, ...”

March 1

• Reading

BitByBit Ch. 5 (“Creating Mass Collaboration”)

Crowdsourcing “Introduction” and Ch. 1, “Concepts, Theories, and Cases of Crowdsourcing”

HumanComputation Chs. 1, 2, 5

• Further reference

Section “Crowdsourcing, Human Computation, Citizen Science, Web Experiments”

Section “Making Up Data (smoothing, convolution, kernels, ...)”

Section “Vision, image, video”

• Semester Projects Must be Approved Before Spring Break

March 8 - Spring Break

March 15

• Daniel DellaPosta (SOC), “Networks and the Mid-20th Century American Mafia”

• Reading

Dan DellaPosta: “Network Closure in the Mid-20th Century American Mafia” (Social Networks, 2017) https://www-sciencedirect-com.ezaccess.libraries.psu.edu/science/article/pii/S037887331630199X; “Between Clique and Corporation: Boundary-Spanning in Solidary Groups” (R&R in American Journal of Sociology, in the Box folder)

Networks Chs. 1, 2

MMDS Ch. 5, “Link Analysis”

MarkovVisually

• Further references

Section “Networks and Graphs”

Section “Simulation, resampling, Markov Chains, ...”

DeepLearning (different kind of network)

• 25% Project Review

March 22

• Reading

DeepLearning Ch. 2, “Linear Algebra”

Shalizi-ADA Chs. 16-18 (“Principal Components Analysis,” “Factor Models,” “Nonlinear Dimension Reduction”)

• Further reference

Section “Dimensionality reduction, decomposition, ...”

Section “Multiple measures, latent variable measurement”

Section “Linear algebra, matrix computations”

March 29

• Naomi Altman (STAT), “Generalizing PCA”

• Reading

Altman, “Generalizing PCA,” 2015 slides: httppersonalpsuedunsa1AltmanWebpagePCATorontopdf

MapReduceIntuition (7 minute video providing “divide and conquer” intuition to MapReduce)

TidySAC-Video: Hadley Wickham on the tidyverse, “split-apply-combine” with dplyr, and tidy data (1 hour video). (More detail, but perhaps less intuition, in book form: discussion of “tidying” data, Chapter 5 of TidyData-R)

MMDS Ch. 2 (Read the Chapter 2 in the new “BETA” version of the book, which also touches on Spark and Tensorflow in section 2.4, “Extensions to MapReduce”: http://i.stanford.edu/~ullman/mmdsn.html)

• Further reference

Section “Nonlinear dimension reduction, manifold learning”

Section “Databases and data management” (esp. NoSQL)

Section “Theoretically-structured approaches to data wrangling”

Section “Parallelism, MapReduce, Split-Apply-Combine”

Section “Functional programming”

Section “Cutting and bleeding edge ...”

• 50% Project Review

• Exercise 3 (Individual) Due 4/5

April 5

• Prasenjit Mitra (IST), “Classification of Tweets from Disaster Scenarios”

• Reading

Prasenjit Mitra: Rudra et al., “Summarizing Situational and Topical Information During Crises,” https://arxiv.org/pdf/1610.01561.pdf; Imran, Mitra, & Castillo, “Twitter as a Lifeline: Human-annotated Twitter Corpora for NLP of Crisis-related Messages,” http://mimran.me/papers/imran_prasenjit_carlos_lrec2016.pdf

NLP (Jurafsky and Martin) Chapters 15 and 16, “Vector Semantics” and “Semantics with Dense Vectors”

• Further reference

Section “Language, text, speech, audio”

GloVe, word2vec, word2vecExplained, t-SNE

Section “Mobile devices, distributed sensors, ...”

Section “Crowdsourcing, human computation, citizen science, ...”

Section “Scaling, iteration, streaming data, online algorithms”

• Exercise 3 Due

April 12

• Reading

BitByBit Ch. 4, “Running Experiments”; review Ch. 2 sections “Natural experiments in observable data” examples

CausalInference Chs. 1-2

• Further reference

Section “Experimental and observational designs for causal inference”

Section “Crowdsourcing, web experiments”

Section “Human subjects, ...”

• 75% Project Review

April 19

• Reading

BitByBit Ch. 6, “Ethics”

PSU-ORP “Common Rule and Other Changes,” https://www.research.psu.edu/irb/commonrulechanges

DataPrivacy

Reproducibility

Refresh on MachineBias

• Further reference

Section “Ethics and Scientific Responsibility in Big Social Data”

Section “¡Cuidado!”

April 26 - LAST CLASS MEETING

bull Team Project Presentations

May 3 - Team Projects Due

Past Visiting Speakers in SoDA 501 and 502

SoDA 502 (Fall 2017)

• Chris Zorn (PLSC)

• Diane Felmlee (SOC)

• Alan MacEachren (GEOG)

• Eric Plutzer (PLSC)

• James LeBreton (PSYCH)

• Sesa Slavkovic (STAT)

• Guido Cervone (GEOG)

• Ashton Verdery (SOC)

• Maggie Niu (STAT)

• Reka Albert (PHYS)

SoDA 501 (Spring 2017)

• Tim Brick (HDFS), “Towards real-time monitoring and intervention using wearable technology”

• Aylin Caliskan (Princeton), “A Story of Discrimination and Unfairness: Bias in Word Embeddings”

• Jay Yonamine (Google - IGERT alum), “Data Science in Industry”

• Johnathan Rush (Illinois), “Geospatial Data Science Workshop”

• Rick Gilmore (PSYCH), “Toward a more reproducible and robust science of human behavior”

• Glenn Firebaugh (SOC), “Measuring Inequality and Segregation with US Census Data”

• Charles Twardy (Sotera), “Data Science for Search and Rescue”

• Anna Smith (Ohio State), “A Hierarchical Model for Network Data in a Latent Hyperbolic Space”

• Rebecca Passonneau (CSE), “Omnigraph: Rich Feature Representation for Graph Kernel Learning”

• Alex Klippel (GEOG), “Virtual Reality for Immersive Analytics”

• Murali Haran (STAT), “A Computationally Efficient Projection-based Approach for Spatial Generalized Mixed Models”

SoDA 502 (Fall 2016)

• Clio Andris (GEOG), “Integrating Social Network Data into GISystems”

• Jia Li (STAT), “Clustering under the Wasserstein Metric”

• Rachel Smith (CAS), “Stigma Networks Perceptions of Sociograms”

• Zita Oravecz (HDFS)

• Bethany Bray (Methodology Center), “Latent Class and Latent Transition Analysis”

• Dave Hunter (STAT), “Model Based Clustering of Large Networks”

• Scott Bennett (PLSC), “ABM Model of Insurgency”

• David Reitter (IST)

• Suzanna Linn (PLSC), “Methodological Issues in Automated Text Analysis: Application to News Coverage of the US Economy”

SoDA 501 (Spring 2016)

• Bruce Desmarais (PLSC), “Learning in the Sunshine: Analysis of Local Government Email Corpora”

• Timothy Brick (HDFS), “Mapping and Manipulating Facial Expression”

• Qunying Huang (USC), “Social Media: An Emerging Data Source for Human Mobility Studies”

• Lingzhou Xue (STAT), “An Introduction to High-Dimensional Graphical Models”

• Ashton Verdery (SOC), “Sampling from Network Data”

• Sarah Battersby (Tableau), “Helping People See and Understand Spatial Data”

• Alexandra Slavkovic (STAT), “Statistical Privacy with Network Data”

• Lee Giles (IST), “Machine Learning for Scholarly Big Data”

Readings and References (updated Spring 2018)

We will discuss a relatively small subset of the readings listed here, and this will vary based on topics and readings discussed by visiting speakers and you yourselves. The remainder are provided here as curated references for more in-depth investigation in both the theory and practice related to the topic (with the latter heavily weighted toward resources in Python and R).

† Material that is, at last check, made available through the Penn State library. Most journal links should work if you are logged in to a Penn State machine or through the Penn State VPN. Some article archives and most books require additional authentication through webaccess. If links are broken, start directly from a search via https://www.libraries.psu.edu. Some books require installation of e-readers like Adobe Digital Editions. Links to lynda.com can be accessed through http://lynda.psu.edu.

‡ Material that is, at last check, legally provided for free. In some cases, these are the preprint versions of published material.

§ Material for which a legal selection is or will be provided through the class Box folder.

Big Data & Social Data Analytics

Overviews

• ‡[CompSocSci] David Lazer, Alex Pentland, Lada Adamic, Sinan Aral, Albert-Laszlo Barabasi, Devon Brewer, Nicholas Christakis, Noshir Contractor, James Fowler, Myron Gutmann, Tony Jebara, Gary King, Michael Macy, Deb Roy, and Marshall Van Alstyne. 2009. “Computational Social Science.” Science 323(5915): 721-3 + Supp. Feb 6. http://science.sciencemag.org/content/323/5915/721.full https://gking.harvard.edu/files/gking/files/LazPenAda09.pdf

• ‡[NRCreport] National Research Council. 2013. Frontiers in Massive Data Analysis. National Academies Press. (Free w/ registration: http://www.nap.edu/catalog.php?record_id=18374) Ch. 1, “Introduction”; Ch. 2, “Massive Data in Science, Technology, Commerce, National Defense, Telecommunications, and other Endeavors”

• ‡[MMDS] Jure Leskovec, Anand Rajaraman, and Jeff Ullman. 2014. Mining of Massive Datasets. Cambridge University Press. http://www.mmds.org (BETA version of Third Edition: http://i.stanford.edu/~ullman/mmdsn.html)

• §[BitByBit] Matthew J. Salganik. 2018 (Forthcoming). Bit by Bit: Social Research in the Digital Age. Princeton University Press. Ch. 1, “Introduction”; Ch. 2, “Observing Behavior”

• DSHandbook-Py Ch. 1, “Introduction: Becoming a Unicorn”; Ch. 2, “The Data Science Road Map”

Burt-schtick

• ‡[Monroe-5Vs] Burt L. Monroe. 2013. “The Five Vs of Big Data Political Science: Introduction to the Special Issue on Big Data in Political Science.” Political Analysis 21(V5): 1–9. https://doi.org/10.1017/S1047198700014315 (Volume, Velocity, Variety, Vinculation, Validity)

• †[Monroe-No] Burt L. Monroe, Jennifer Pan, Margaret E. Roberts, Maya Sen, and Betsy Sinclair. 2015. “No! Formal Theory, Causal Inference, and Big Data Are Not Contradictory Trends in Political Science.” PS: Political Science & Politics 48(1): 71–4. http://dx.doi.org/10.1017/S1049096514001760

• †[Quinn-Topics] Kevin M. Quinn, Burt L. Monroe, Michael Colaresi, Michael H. Crespin, and Dragomir R. Radev. 2010. “How to Analyze Political Attention with Minimal Assumptions and Costs.” American Journal of Political Science 54(1): 209–28. http://onlinelibrary.wiley.com/doi/10.1111/j.1540-5907.2009.00427.x/full (esp. topic modeling as measurement, approach to validation)

• †[FightinWords] Burt L. Monroe, Michael Colaresi, and Kevin M. Quinn. 2008. “Fightin’ Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict.” Political Analysis 16(4): 372-403. https://doi.org/10.1093/pan/mpn018 (esp. the impact of sampling variance and regularization through priors)

• ‡[BDSS-Census] Big Data Social Science PSU Team. 2012. “A Closer Look at the Kaggle Census Data.” https://burtmonroe.github.io/BDSSKaggleCensus2012 (esp. the relevance of the social processes by which data come to exist as data)

Multidisciplinary Perspectives

• †[Business-BigData] Andrew McAfee and Erik Brynjolfsson. 2012. “Big Data: The Management Revolution.” Harvard Business Review 90(10): 61–8. October. https://hbr.org/2012/10/big-data-the-management-revolution; Thomas H. Davenport and D.J. Patil. 2012. “Data Scientist: The Sexiest Job of the 21st Century.” Harvard Business Review 90(10): 70-6. October. https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century

• †[InfoSci-BigData] C.L. Philip Chen and Chun-Yang Zhang. 2014. “Data-intensive Applications, Challenges, Techniques and Technologies: A Survey on Big Data.” Information Sciences 275: 314-47. https://doi.org/10.1016/j.ins.2014.01.015

• †[Informatics-BigData] Vasant G. Honavar. 2014. “The Promise and Potential of Big Data: A Case for Discovery Informatics.” Review of Policy Research 31(4): 326-330. https://doi.org/10.1111/ropr.12080

• ‡[Stats-BigData] Beate Franke, Jean-Francois Ribana Roscher, Annie Lee, Cathal Smyth, Armin Hatefi, Fuqi Chen, Einat Gil, Alexander Schwing, Alessandro Selvitella, Michael M. Hoffman, Roger Grosse, Dietrich Hendricks, and Nancy Reid. 2016. “Statistical Inference, Learning and Models in Big Data.” International Statistical Review 84(3): 371-89. http://onlinelibrary.wiley.com/doi/10.1111/insr.12176/full

• †[Econ-BigData] Hal R. Varian. 2013. “Big Data: New Tricks for Econometrics.” Journal of Economic Perspectives 28(2): 3-28. https://doi.org/10.1257/jep.28.2.3

• †[GeoViz-BigData] Alan MacEachren. 2017. “Leveraging Big (Geo) Data with (Geo) Visual Analytics: Place as the Next Frontier.” In Chenghu Zhou, Fenzhen Su, Francis Harvey, and Jun Xu, eds., Spatial Data Handling in the Big Data Era, pp. 139-155. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/chapter/10.1007/978-981-10-4424-3_10

• †[Soc-BigData] David Lazer and Jason Radford. 2017. “Data ex Machina: Introduction to Big Data.” Annual Review of Sociology 43: 19-39. https://doi.org/10.1146/annurev-soc-060116-053457

• §[Politics-BigData] Keith T. Poole, L. Jason Anastasopoulos, and James E. Monogan III. Forthcoming. “The ‘Big Data’ Revolution in Political Campaigning and Governance.” Oxford Bibliographies in Political Science.

¡Cuidado! Traps, Biases, Problems, Pains, Perils

• †[GoogleFlu] David Lazer, Ryan Kennedy, Gary King, and Alessandro Vespignani. 2014. “The Parable of Google Flu: Traps in Big Data Analysis.” Science (343), 14 March. http://gking.harvard.edu/files/gking/files/0314policyforumff.pdf

• ‡[MachineBias] ProPublica. “Machine Bias: Investigating Algorithmic Injustice.” Series. https://www.propublica.org/series/machine-bias See especially:

Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. 2016. “Machine Bias.” ProPublica, May 23. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing

Julia Angwin, Madeleine Varner, and Ariana Tobin. 2017. “Facebook Enabled Advertisers to Reach Jew Haters.” https://www.propublica.org/article/facebook-enabled-advertisers-to-reach-jew-haters

• ‡[Polling2016] Doug Rivers (Nov 11, 2016). “First Thoughts on Polling Problems in the 2016 US Elections.” https://today.yougov.com/news/2016/11/11/first-thoughts-polling-problems-2016-us-elections

• †[EventData] Wei Wang, Ryan Kennedy, David Lazer, Naren Ramakrishnan. 2016. “Growing Pains for Global Monitoring of Societal Events.” Science 353(6307), pp. 1502–1503. https://doi.org/10.1126/science.aaf6758

• ‡[GoogleBooks] Eitan Adam Pechenick, Christopher M. Danforth, and Peter Sheridan Dodds. 2015. “Characterizing the Google Books Corpus: Strong Limits to Inferences of Socio-Cultural and Linguistic Evolution.” PLoS One. https://doi.org/10.1371/journal.pone.0137041

• ‡[OkCupid] Michael Zimmer. 2016. “OkCupid Study Reveals the Perils of Big-Data Science.” Wired, May 14. https://www.wired.com/2016/05/okcupid-study-reveals-perils-big-data-science

• †[CriticalQuestions] danah boyd and Kate Crawford. 2011. “Critical Questions for Big Data: Provocations for a Cultural, Technological, and Scholarly Phenomenon.” Information, Communication & Society 15(5): 662–79. http://dx.doi.org/10.1080/1369118X.2012.678878

• ‡[RacistBot] Daniel Victor. 2016. “Microsoft created a Twitter bot to learn from users. It quickly became a racist jerk.” New York Times. https://www.nytimes.com/2016/03/25/technology/microsoft-created-a-twitter-bot-to-learn-from-users-it-quickly-became-a-racist-jerk.html

• MMDS 1.2.2-1.2.3 on Bonferroni

• Also BDSS-Census

Research Design and Measurement

Overviews

• BitByBit

• ‡[ResearchMethodsKB] William M. Trochim. 2006. The Research Methods Knowledge Base. http://www.socialresearchmethods.net/kb

• [7Rules] Glenn Firebaugh. 2008. Seven Rules for Social Research. Princeton University Press.

Measurement Reliability and Validity

• ResearchMethodsKB “Measurement”

• 7Rules Ch. 3, “Build Reality Checks into Your Research”

• Reliability: see also FightinWords

• Validity: see also Quinn-Topics, Monroe-5Vs

Indirect, Unobtrusive, Nonreactive Measures; Data Exhaust

• †[UnobtrusiveMeasures] Raymond M. Lee. 2015. “Unobtrusive Measures.” Oxford Bibliographies. https://doi.org/10.1093/OBO/9780199846740-0048 (Canonical cite is Eugene J. Webb. 1966. Unobtrusive Measures, or Webb, Donald T. Campbell, Richard D. Schwartz, Lee Sechrest. 1999. Unobtrusive Measures, rev. ed. Sage.)

• BitByBit: Examples in Chapter 2

Multiple Measures, Latent Variable Measurement

• 7Rules Ch. 4, “Replicate Where Possible”

• †[LatentVariables] David J. Bartholomew, Martin Knott, and Irini Moustaki. 2011. Latent Variable Models and Factor Analysis: A Unified Approach. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10483308 (esp. Ch. 1, “Basic ideas and examples”)

• †[Multivariate-R] Brian Everitt and Torsten Hothorn. 2011. An Introduction to Applied Multivariate Analysis with R. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4419-9650-3

• Shalizi-ADA Ch. 17, “Factor Models”

• ‡[NetflixPrize] Edwin Chen. 2011. “Winning the Netflix Prize: A Summary.”

• ‡CRAN: https://cran.r-project.org/web/views/Multivariate.html, https://cran.r-project.org/web/views/Psychometrics.html, https://cran.r-project.org/web/views/Cluster.html

Sampling and Survey Design

• †[Sampling] Steven K. Thompson. 2012. Sampling, 3rd ed. https://k8es4mc2l.search.serialssolutions.com/?sid=sersol&SS_jc=TC_024492330&title=Wiley%20Desktop%20Editions%20%3A%20Sampling

• ResearchMethodsKB “Sampling”

• NRCreport Ch. 8, “Sampling and Massive Data”

• BitByBit Ch. 3, “Asking Questions”

• ‡[MSE] Daniel Manrique-Vallier, Megan E. Price, and Anita Gohdes. 2013. “Multiple Systems Estimation Techniques for Estimating Casualties in Armed Conflicts.” In Seybolt et al. (eds.), Counting Civilians. (Preprint: httpciteseerxistpsueduviewdocdownloaddoi=1011469939amprep=rep1amptype=pdf)

• †[NetworkSampling] Ted Mouw and Ashton M. Verdery. 2012. “Network Sampling with Memory: A Proposal for More Efficient Sampling from Social Networks.” Sociological Methodology 42(1): 206–56. https://dx.doi.org/10.1177%2F0081175012461248

• See also FightinWords (re: hidden heteroskedasticity in sample variance)

• ‡[HashDontSample] Mudit Uppal. 2016. “Probabilistic data structures in the Big data world (+ code).” (re: “Hash, don’t sample”) https://medium.com/@muppal/probabilistic-data-structures-in-the-big-data-world-code-b9387cff0c55

• MMDS: Hash Functions (1.3.2), Sampling in streams (4.2)

• ‡CRAN: https://cran.r-project.org/web/views/OfficialStatistics.html (“Complex Survey Design,” “Small Area Estimation”)

Experimental and Observational Designs for Causal Inference

• ‡[CausalInference] Miguel A. Hernan, James M. Robins. Forthcoming (2017). Causal Inference. Chapman & Hall/CRC. Preprint: https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book Ch. 1, “A definition of causal effect”; Ch. 2, “Randomized experiments”; Ch. 3, “Observational studies” (See Ch. 6, “Graphical representation of causal effects,” for integration with Judea Pearl approach)

• BitByBit Chapter 4, “Running Experiments”; Ch. 2, “Natural experiments in observable data” examples

• ResearchMethodsKB “Design”

• Firebaugh-7Rules Ch. 2, “Look for Differences that Make a Difference, and Report Them”; §Ch. 5, “Compare Like with Like”; Ch. 6, “Use Panel Data to Study Individual Change and Repeated Cross-Section Data to Study Social Change”

• Shalizi-ADA Part IV, “Causal Inference”

• Monroe-No

• CRAN: https://cran.r-project.org/web/views/ExperimentalDesign.html

Technologies for Primary and Secondary Data Collection

Mobile Devices, Distributed Sensors, Wearable Sensors, Remote Sensing

• †[RealityMining] Nathan Eagle and Alex (Sandy) Pentland. 2006. “Reality Mining: Sensing Complex Social Systems.” Personal and Ubiquitous Computing 10(4): 255–68. https://doi.org/10.1007/s00779-005-0046-3

• †[QuantifiedSelf] Melanie Swan. 2013. “The Quantified Self: Fundamental Disruption in Big Data Science and Biological Discovery.” Big Data 1(2): 85–99. https://doi.org/10.1089/big.2012.0002

• †[SensorData] Charu C. Aggarwal (Ed.). 2013. Managing and Mining Sensor Data. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-6309-2 Ch. 1, “An Introduction to Sensor Data Analytics”

• †[NightLights] Thushyanthan Baskaran, Brian Min, Yogesh Uppal. 2015. “Election cycles and electricity provision: Evidence from a quasi-experiment with Indian special elections.” Journal of Public Economics 126: 64-73. https://doi.org/10.1016/j.jpubeco.2015.03.011

Crowdsourcing, Human Computation, Citizen Science, Web Experiments

• †[Crowdsourcing] Daren C. Brabham. 2013. Crowdsourcing. MIT Press. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10692208 “Introduction”; Ch. 1, “Concepts, Theories, and Cases of Crowdsourcing”

• BitByBit Ch. 5, “Creating Mass Collaboration”

• NRCreport Ch. 9, “Human Interaction with Data”

• †[MTurk] Krista Casler, Lydia Bickel, and Elizabeth Hackett. 2013. “Separate but Equal? A Comparison of Participants and Data Gathered via Amazon’s MTurk, Social Media, and Face-to-Face Behavioral Testing.” Computers in Human Behavior 29(6): 2156–60. http://doi.org/10.1016/j.chb.2013.05.009

• †[LabintheWild] Katharina Reinecke and Krzysztof Z. Gajos. 2015. “LabintheWild: Conducting Large-Scale Online Experiments With Uncompensated Samples.” In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing (CSCW ’15), 1364–1378. http://dx.doi.org/10.1145/2675133.2675246

• †[TweetmentEffects] Kevin Munger. 2017. “Tweetment Effects on the Tweeted: Experimentally Reducing Racist Harassment.” Political Behavior 39(3): 629–49. http://doi.org/10.1016/j.chb.2013.05.009

• †[HumanComputation] Edith Law and Luis von Ahn. 2011. Human Computation. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdf/10.2200/S00371ED1V01Y201107AIM013

Open Data, File Formats, APIs, Semantic Web, Linked Data

• DSHandbook-Py Ch. 12, “Data Encodings and File Formats”

• ‡[OpenData] Open Data Institute. “What Is Open Data?” https://theodi.org/what-is-open-data

• ‡[OpenDataHandbook] Open Knowledge International. The Open Data Handbook. http://opendatahandbook.org Includes appendix “File Formats”: http://opendatahandbook.org/guide/en/appendices/file-formats

• ‡[APIs] Brian Cooksey. 2016. An Introduction to APIs. https://zapier.com/learn/apis

• ‡[APIMarkets] RapidAPI / mashape API marketplaces: https://docs.rapidapi.com, https://market.mashape.com; ProgrammableWeb: https://www.programmableweb.com

• †[LinkedData] Tom Heath and Christian Bizer. 2011. Linked Data: Evolving the Web into a Global Data Space. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/abs/10.2200/S00334ED1V01Y201102WBE001 Ch. 1, “Introduction”; Ch. 2, “Principles of Linked Data”

• †[SemanticWeb] Nikolaos Konstantinou and Dimitrios-Emmanuel Spanos. 2015. Materializing the Web of Linked Data. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-16074-0 Ch. 1, “Introduction: Linked Data and the Semantic Web”

• †Morgan & Claypool Synthesis Lectures on the Semantic Web: Theory and Technology. http://www.morganclaypool.com/toc/wbe/1/1

Web Scraping

• †[TheoryDrivenScraping] Richard N. Landers, Robert C. Brusso, Katelyn J. Cavanaugh, and Andrew B. Colmus. 2016. “A Primer on Theory-Driven Web Scraping: Automatic Extraction of Big Data from the Internet for Use in Psychological Research.” Psychological Methods 4: 475–492. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1037/met0000081

• ‡[Scraping-Py] Al Sweigart. 2015. Automate the Boring Stuff with Python: Practical Programming for Total Beginners. Ch. 11, “Web-Scraping.” https://automatetheboringstuff.com/chapter11

• †[Scraping-R] Simon Munzert, Christian Rubba, Peter Meißner, and Dominic Nyhuis. 2014. Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining. Wiley. http://onlinelibrary.wiley.com/book/10.1002/9781118834732

• ‡CRAN: https://cran.r-project.org/web/views/WebTechnologies.html

Ethics & Scientific Responsibility in Big Social Data

Human Subjects, Consent, Privacy

• BitByBit Ch. 6 (“Ethics”)

• ‡[PSU-ORP] Penn State Office for Research Protections (under the VP for Research)

Human Subjects Research / IRB: https://www.research.psu.edu/irb

Revised Common Rule: https://www.research.psu.edu/irb/commonrulechanges

Responsible Conduct of Research: https://www.research.psu.edu/education/rcr

Research Misconduct: https://www.research.psu.edu/researchmisconduct

SARI @ PSU (Scientific and Research Integrity Training): https://www.research.psu.edu/training/sari

• ‡[MenloReport] David Dittrich and Erin Kenneally (Center for Applied Internet Data Analysis). 2012. “The Menlo Report: Ethical Principles Guiding Information and Communication Technology Research” and companion report “Applying Ethical Principles ...” https://www.caida.org/publications/papers/2012/menlo_report_actual_formatted

• ‡[BigDataEthics-CBDES] Jacob Metcalf, Emily F. Keller, and danah boyd. 2016. “Perspectives on Big Data, Ethics, and Society.” Council for Big Data, Ethics, and Society. http://bdes.datasociety.net/council-output/perspectives-on-big-data-ethics-and-society

• ‡[AoIRReport] Annette Markham and Elizabeth Buchanan (Association of Internet Researchers). 2012. “Ethical Decision-Making and Internet Research: Recommendations of the AoIR Ethics Working Committee (Version 2.0).” https://aoir.org/reports/ethics2.pdf and guidelines chart: https://aoir.org/wp-content/uploads/2017/01/aoir_ethics_graphic_2016.pdf

• ‡[BigDataEthics-Wired] Sarah Zhang. 2016. “Scientists are Just as Confused about the Ethics of Big-Data Research as You.” Wired. https://www.wired.com/2016/05/scientists-just-confused-ethics-big-data-research

• †[BigDataEthics-HerschelMori] Richard Herschel and Virginia M. Mori. 2017. “Ethics & Big Data.” Technology in Society 49: 31-36. http://doi.org/10.1016/j.techsoc.2017.03.003

The Science of Data Privacy

• †[BigDataPrivacy] Terence Craig and Mary E. Ludloff. 2011. Privacy and Big Data. O’Reilly Media. http://pensu.eblib.com/patron/FullRecord.aspx?p=781814

• ‡[DataAnalysisPrivacy] John Abowd, Lorenzo Alvisi, Cynthia Dwork, Sampath Kannan, Ashwin Machanavajjhala, Jerome Reiter. 2017. “Privacy-Preserving Data Analysis for the Federal Statistical Agencies.” A Computing Community Consortium white paper. https://arxiv.org/abs/1701.00752

• †[DataPrivacy] Stephen E. Fienberg and Aleksandra B. Slavkovic. 2011. “Data Privacy and Confidentiality.” International Encyclopedia of Statistical Science, 342–5. http://doi.org/978-3-642-04898-2_202

• ‡[NetworksPrivacy] Vishesh Karwa and Aleksandra Slavkovic. 2016. “Inference using noisy degrees: Differentially private β-model and synthetic graphs.” Annals of Statistics 44(1): 87-112. http://projecteuclid.org/euclid.aos/1449755958

• †[DataPublishingPrivacy] Raymond Chi-Wing Wong and Ada Wai-Chee Fu. 2010. Privacy-Preserving Data Publishing: An Overview. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdfplus/10.2200/S00237ED1V01Y201003DTM002

• DataMatching Ch. 8, “Privacy Aspects of Data Matching”

Transparency, Reproducibility, and Team Science

• ‡[10RulesforData] Alyssa Goodman, Alberto Pepe, Alexander W. Blocker, Christine L. Borgman, Kyle Cranmer, Merce Crosas, Rosanne Di Stefano, Yolanda Gil, Paul Groth, Margaret Hedstrom, David W. Hogg, Vinay Kashyap, Ashish Mahabal, Aneta Siemiginowska, and Aleksandra Slavkovic (2014). “Ten Simple Rules for the Care and Feeding of Scientific Data.” PLoS Computational Biology 10(4): e1003542. https://doi.org/10.1371/journal.pcbi.1003542 (Note esp. curated resources for reproducible research)

• †[Transparency] E. Miguel, C. Camerer, K. Casey, J. Cohen, K. M. Esterling, A. Gerber, R. Glennerster, D. P. Green, M. Humphreys, G. Imbens, D. Laitin, T. Madon, L. Nelson, B. A. Nosek, M. Petersen, R. Sedlmayr, J. P. Simmons, U. Simonsohn, M. Van der Laan. 2014. “Promoting Transparency in Social Science Research.” Science 343(6166): 30–1. https://doi.org/10.1126/science.1245317

• †[Reproducibility] Marcus R. Munafo, Brian A. Nosek, Dorothy V. M. Bishop, Katherine S. Button, Christopher D. Chambers, Nathalie Percie du Sert, Uri Simonsohn, Eric-Jan Wagenmakers, Jennifer J. Ware, and John P. A. Ioannidis. 2017. “A Manifesto for Reproducible Science.” Nature Human Behavior 0021(2017). https://doi.org/10.1038/s41562-016-0021

• ‡[SoftwareCarpentry] esp. “Lessons”: http://software-carpentry.org/lessons

• DSHandbook-Py Ch. 9, “Technical Communication and Documentation”; Ch. 15, “Software Engineering Best Practices”

• †[TeamScienceToolkit] Vogel AL, Hall KL, Fiore SM, Klein JT, Bennett LM, Gadlin H, Stokols D, Nebeling LC, Wuchty S, Patrick K, Spotts EL, Pohl C, Riley WT, Falk-Krzesinski HJ. 2013. “The Team Science Toolkit: enhancing research collaboration through online knowledge sharing.” American Journal of Preventive Medicine 45: 787-9. http://10.1016/j.amepre.2013.09.001

• §[Databrary] Kara Hall, Robert Croyle, and Amanda Vogel. Forthcoming (2017). Advancing Social and Behavioral Health Research through Cross-disciplinary Team Science. Springer. Includes Rick O. Gilmore and Karen E. Adolph, “Open Sharing of Research Video: Breaking the Boundaries of the Research Team.” (See http://databrary.org)

• CRAN: https://cran.r-project.org/web/views/ReproducibleResearch.html

Social Bias, Fair Algorithms

• MachineBias

• ‡[EmbeddingsBias] Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. “Semantics Derived Automatically from Language Corpora Contain Human Biases.” Science. https://arxiv.org/abs/1608.07187

• ‡[Debiasing] Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, Adam Kalai. 2016. “Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings.” https://arxiv.org/abs/1607.06520

• ‡[AvoidingBias] Moritz Hardt, Eric Price, Nathan Srebro. 2016. “Equality of Opportunity in Supervised Learning.” https://arxiv.org/abs/1610.02413

• ‡[InevitableBias] Jon Kleinberg, Sendhil Mullainathan, Manish Raghavan. 2016. “Inherent Trade-Offs in the Fair Determination of Risk Scores.” https://arxiv.org/abs/1609.05807

Databases and Data Management

• NRCreport, Ch. 3, “Scaling the Infrastructure for Data Management”

• †[SQL] Jan L. Harrington. 2010. SQL Clearly Explained. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780123756978

• DSHandbook-Py, Ch. 14, “Databases”

• †[noSQL] Guy Harrison. 2015. Next Generation Databases: NoSQL, NewSQL, and Big Data. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-1329-2

• †[Cloud] Divyakant Agrawal, Sudipto Das, and Amr El Abbadi. 2012. Data Management in the Cloud: Challenges and Opportunities. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdfplus/10.2200/S00456ED1V01Y201211DTM032


• †Morgan & Claypool Synthesis Lectures on Data Management: http://www.morganclaypool.com/toc/dtm/1/1

Data Wrangling

Theoretically-structured approaches to data wrangling

• ‡[TidyData-R] Garrett Grolemund and Hadley Wickham. 2017. R for Data Science (http://r4ds.had.co.nz), esp. Ch. 12, “Tidy Data”; also Wickham. 2014. “Tidy Data.” Journal of Statistical Software 59(10). http://www.jstatsoft.org/v59/i10/paper. Tools: The tidyverse, http://tidyverse.org

• ‡[DataScience-Py] Jake VanderPlas. 2016. Python Data Science Handbook (https://github.com/jakevdp/PythonDataScienceHandbook), esp. Ch. 3 on pandas, http://pandas.pydata.org

• ‡[DataCarpentry] Colin Gillespie and Robin Lovelace. 2017. Efficient R Programming. https://csgillespie.github.io/efficientR, esp. Ch. 6 on “efficient data carpentry”

• §[Wrangler] Joseph M. Hellerstein, Jeffrey Heer, Tye Rattenbury, and Sean Kandel. 2017. Data Wrangling: Practical Techniques for Data Preparation. Tool: Trifacta Wrangler, http://www.trifacta.com/products/wrangler

Data wrangling practice

• DSHandbook-Py, Ch. 4, “Data Munging: String Manipulation, Regular Expressions, and Data Cleaning”

• †[DataSimplification] Jules J. Berman. 2016. Data Simplification: Taming Information with Open Source Tools. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780128037812

• †[Wrangling-Py] Jacqueline Kazil, Katharine Jarmul. 2016. Data Wrangling with Python. O’Reilly. http://proquestcombo.safaribooksonline.com.ezaccess.libraries.psu.edu/9781491948804

• †[Wrangling-R] Bradley C. Boehmke. 2016. Data Wrangling with R. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45599-0

Record Linkage, Entity Resolution, Deduplication

• †[DataMatching] Peter Christen. 2012. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-642-31164-2 (esp. Ch. 2, “The Data Matching Process”)

• DataSimplification, Chapter 5, “Identifying and Deidentifying Data”

• ‡[EntityResolution] Lise Getoor and Ashwin Machanavajjhala. 2013. “Entity Resolution for Big Data.” Tutorial, KDD. http://www.umiacs.umd.edu/~getoor/Tutorials/ER_KDD2013.pdf

• ‡[SyrianCasualties] Peter Sadosky, Anshumali Shrivastava, Megan Price, and Rebecca C. Steorts. 2015. “Blocking Methods Applied to Casualty Records from the Syrian Conflict.” https://arxiv.org/abs/1510.07714 (For more on blocking, see DataMatching, Ch. 4, “Indexing.”)

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html, “Statistical Matching and Record Linkage”

“Making up data”: Imputation, Smoothers, Kernels, Priors, Filters, Teleportation, Negative Sampling, Convolution, Augmentation, Adversarial Training


• †[Imputation] Yi Deng, Changgee Chang, Moges Seyoum Ido, and Qi Long. 2016. “Multiple Imputation for General Missing Data Patterns in the Presence of High-dimensional Data.” Scientific Reports 6(21689). https://www.nature.com/articles/srep21689

• ‡[Overimputation] Matthew Blackwell, James Honaker, and Gary King. 2017. “A Unified Approach to Measurement Error and Missing Data: Overview and Applications.” Sociological Methods & Research 46(3): 303-341. http://gking.harvard.edu/files/gking/files/measure.pdf

• Shalizi-ADA, Sect. 1.5 (Linear Smoothers), Ch. 8 (Splines), Sect. 14.4 (Kernel Density Estimates); DataScience-Python NB 05.13, “Kernel Density Estimation”

• FightinWords (re: Bayesian priors as additional data, impact of priors on regularization)

• DeepLearning, Section 7.5 (Data Augmentation), 7.12 (Dropout), 7.13 (Adversarial Examples), Chapter 9 (Convolutional Networks)

• †[Adversarial] Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy. 2015. “Explaining and Harnessing Adversarial Examples.” https://arxiv.org/abs/1412.6572

• See also resampling and simulation methods.

• See also feature engineering, preprocessing.

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html, “Imputation”
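
The FightinWords entry in this list describes Bayesian priors as additional data. A minimal sketch of that idea, assuming a binomial rate with a Beta(a, b) prior: the posterior mean equals the raw proportion computed after adding a pseudo-successes and b pseudo-failures to the observed counts. The counts and function names below are invented for illustration and do not come from any of the readings.

```python
def raw_proportion(successes, failures):
    """Maximum-likelihood estimate of a binomial rate."""
    return successes / (successes + failures)


def posterior_mean(successes, failures, a, b):
    """Posterior mean of a binomial rate under a Beta(a, b) prior."""
    return (successes + a) / (successes + failures + a + b)


# Hypothetical observed data: 3 successes, 0 failures.
s, f = 3, 0
# Hypothetical prior: Beta(2, 2), i.e. 2 pseudo-successes and 2 pseudo-failures.
a, b = 2, 2

raw = raw_proportion(s, f)                 # extreme estimate from tiny data
smoothed = posterior_mean(s, f, a, b)      # pulled toward the prior mean of 0.5
augmented = raw_proportion(s + a, f + b)   # same number, via "made-up" data

print(raw, smoothed, augmented)
```

The prior shrinks the estimate away from the boundary value, which is the sense in which priors regularize.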

(Direct) Data Representations, Data Mappings

• NRCreport, Ch. 5, “Large-Scale Data Representations”

• †[PatternRecognition] M. Narasimha Murty and V. Susheela Devi. 2011. Pattern Recognition: An Algorithmic Approach. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-0-85729-495-1 Section 2.1, “Data Structures for Pattern Representation”

• ‡[InfoRetrieval] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schutze. 2009. Introduction to Information Retrieval. Cambridge University Press. http://nlp.stanford.edu/IR-book Ch. 1, 2, 6 (also note slides used in their class)

• †[Algorithms] Brian Steele, John Chandler, Swarna Reddy. 2016. Algorithms for Data Science. Wiley. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45797-0 Ch. 2, “Data Mapping and Data Dictionaries”

• See also Social Data Structures.

Similarity, Distance, Association, Covariance, The Kernel Trick

• PatternRecognition, Section 2.3, “Proximity Measures”

• ‡[Similarity] Brendan O’Connor. 2012. “Cosine similarity, Pearson correlation, and OLS coefficients.” https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients

• For relatively comprehensive lists, see also:

M-J. Lesot, M. Rifqi, and H. Benhadda. 2009. “Similarity measures for binary and numerical data: a survey.” Int. J. Knowledge Engineering and Soft Data Paradigms 1(1): 63-. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.212.6533&rep=rep1&type=pdf

Seung-Seok Choi, Sung-Hyuk Cha, Charles C. Tappert. 2010. “A Survey of Binary Similarity and Distance Measures.” Journal of Systemics, Cybernetics & Informatics 8(1): 43-8. http://www.iiisci.org/Journal/CV$/sci/pdfs/GS315JG.pdf


Sung-Hyuk Cha. 2007. “Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions.” International Journal of Mathematical Models and Methods in Applied Sciences 4(1): 300-7. http://csis.pace.edu/ctappert/dps/d861-12/session4-p2.pdf

Anna Huang. 2008. “Similarity Measures for Text Document Clustering.” Proceedings of the 6th New Zealand Computer Science Research Student Conference, 49-56. http://www.nzcsrsc08.canterbury.ac.nz/site/proceedings/Individual_Papers/pg049_Similarity_Measures_for_Text_Document_Clustering.pdf

• Michael I. Jordan. “The Kernel Trick” (Lecture Notes). https://people.eecs.berkeley.edu/~jordan/courses/281B-spring04/lectures/lec3.pdf

• Eric Kim. 2017. “Everything You Ever Wanted to Know about the Kernel Trick (But Were Afraid to Ask).” http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick_blog_ekim_12_20_2017.pdf

Derived Data Representations - Dimensionality Reduction, Compression, Decomposition, Embeddings

The groupings here, and under the measurement / multivariate statistics section, are particularly arbitrary. For example, “k-Means clustering” can be viewed as a technique for “dimensionality reduction,” “compression,” “feature extraction,” “latent variable measurement,” “unsupervised learning,” or “collaborative filtering.”
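
The point that k-means wears several hats can be made concrete with a small sketch. It is an illustration only, not drawn from the readings: the data are synthetic, the helper names (`sq_dist`, `farthest_first`, `kmeans`) are hypothetical, and the farthest-first initialization is a simplification chosen to keep the run deterministic. The same fitted model is read three ways: as cluster labels (unsupervised learning), as a codebook plus one integer per point (compression by vector quantization), and as a vector of distances to the centroids (feature extraction).

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(p, centroids):
    """Index of the centroid closest to point p."""
    return min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))


def farthest_first(points, k):
    """Deterministic initialization: start at the first point, then
    repeatedly add the point farthest from the centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, n_iter=20):
    """Lloyd's algorithm; returns (centroids, labels)."""
    centroids = farthest_first(points, k)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [nearest(p, centroids) for p in points]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


# Synthetic data: two well-separated 2-D blobs of 50 points each.
rng = random.Random(1)
points = ([(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(50)]
          + [(rng.gauss(10, 0.5), rng.gauss(10, 0.5)) for _ in range(50)])

centroids, labels = kmeans(points, k=2)

# Reading 1, clustering: `labels` partitions the points.
# Reading 2, compression: store k centroids plus one small integer per point,
# and reconstruct each point (lossily) as its centroid.
reconstructed = [centroids[lab] for lab in labels]
# Reading 3, feature extraction: re-express each point by its squared
# distances to the k centroids.
features = [tuple(sq_dist(p, c) for c in centroids) for p in points]
```

With k smaller than the original number of dimensions, the third reading is also a (nonlinear) dimensionality reduction; here k equals the input dimension, so it is only a change of representation.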

Clustering, hashing, quantization, blocking, compression

• ‡[Compression] Khalid Sayood. 2012. Introduction to Data Compression, 4th ed. Springer. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780124157965 (e.g., coding, blocking via vector quantization)

• MMDS, Ch. 3, “Finding Similar Items” (minhashing, locality sensitive hashing)

• Multivariate-R, Ch. 6, “Clustering”

• DataScience-Python NB 05.11, “k-Means Clustering”; NB 05.12, “Gaussian Mixture Models”

• ‡[KMeansHashing] Kaiming He, Fang Wen, Jian Sun. 2013. “K-means Hashing: An Affinity-Preserving Quantization Method for Learning Binary Compact Codes.” CVPR. https://www.cv-foundation.org/openaccess/content_cvpr_2013/papers/He_K-Means_Hashing_An_2013_CVPR_paper.pdf

• †[Squashing] Madigan, D., Raghavan, N., Dumouchel, W., Nason, M., Posse, C., and Ridgeway, G. (2002). “Likelihood-based data squashing: A modeling approach to instance construction.” Data Mining and Knowledge Discovery 6(2): 173-190. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1023/A:1014095614948

• †[Core-sets] Piotr Indyk, Sepideh Mahabadi, Mohammad Mahdian, Vahab S. Mirrokni. 2014. “Composable core-sets for diversity and coverage maximization.” PODS ’14: 100-8. https://doi.org/10.1145/2594538.2594560

Feature selection, feature extraction, feature engineering, weighting, preprocessing

• PatternRecognition, Sections 2.6-7, “Feature Selection,” “Feature Extraction”

• DSHandbook-Py, Ch. 7, “Interlude: Feature Extraction Ideas”

• DataScience-Python NB 05.04, “Feature Engineering”

• Features in text: NLP and InfoRetrieval re: tfidf and similar; FightinWords

• Features in images: DataScience-Python NB 05.14, “Image Features”


Dimensionality reduction, decomposition, factorization, change of basis, reparameterization, matrix completion, latent variables, source separation

• MMDS, Ch. 9, “Recommendation Systems”; Ch. 11, “Dimensionality Reduction”

• DSHandbook-Py, Ch. 10, “Unsupervised Learning: Clustering and Dimensionality Reduction”

• ‡[Shalizi-ADA] Cosma Rohilla Shalizi. 2017. Advanced Data Analysis from an Elementary Point of View. Ch. 16 (“Principal Components Analysis”), Ch. 17 (“Factor Models”). http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV

See also Multivariate-R, Ch. 3, “Principal Components Analysis”; Ch. 4, “Multidimensional Scaling”; Ch. 5, “Exploratory Factor Analysis”; Latent

• DeepLearning, Ch. 2, “Linear Algebra”; Ch. 13, “Linear Factor Models”

• ‡[GloVe] Jeffrey Pennington, Richard Socher, Christopher Manning. 2014. “GloVe: Global vectors for word representation.” EMNLP. https://nlp.stanford.edu/projects/glove

• †[NMF] Daniel D. Lee and H. Sebastian Seung. 1999. “Learning the parts of objects by non-negative matrix factorization.” Nature 401: 788-791. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1038/44565

• †[CUR] Michael W. Mahoney and Petros Drineas. 2009. “CUR matrix decompositions for improved data analysis.” PNAS. http://www.pnas.org/content/106/3/697.full

• ‡[ICA] Aapo Hyvarinen and Erkki Oja. 2000. “Independent Component Analysis: Algorithms and Applications.” Neural Networks 13(4-5): 411-30. https://www.cs.helsinki.fi/u/ahyvarin/papers/NN00new.pdf

• [RandomProjection] Ella Bingham and Heikki Mannila. 2001. “Random projection in dimensionality reduction: applications to image and text data.” KDD. https://doi.org/10.1145%2F502512.502546

• Compression, e.g., “Transform coding,” “Wavelets”

Nonlinear dimensionality reduction, Manifold learning

• DeepLearning, Section 5.11.3, “Manifold Learning”

• DataScience-Python NB 05.10, “Manifold Learning” (Locally linear embedding [LLE], Isomap)

• ‡[KernelPCA] Sebastian Raschka. 2014. “Kernel tricks and nonlinear dimensionality reduction via RBF kernel PCA.” http://sebastianraschka.com/Articles/2014_kernel_pca.html

• Shalizi-ADA, Ch. 18, “Nonlinear Dimensionality Reduction” (LLE)

• ‡[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi. 2003. “Laplacian eigenmaps for dimensionality reduction and data representation.” Neural Computation 15(6): 1373-1396. http://web.cse.ohio-state.edu/~belkin.8/papers/LEM_NC_03.pdf

• DeepLearning, Ch. 14 (“Autoencoders”)

• ‡[word2vec] Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” https://arxiv.org/abs/1301.3781

• ‡[word2vecExplained] Yoav Goldberg and Omer Levy. 2014. “word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method.” https://arxiv.org/abs/1402.3722

• ‡[t-SNE] Laurens van der Maaten and Geoffrey Hinton. 2008. “Visualising Data using t-SNE.” Journal of Machine Learning Research 9: 2579-2605. http://jmlr.csail.mit.edu/papers/volume9/vandermaaten08a/vandermaaten08a.pdf


Computation and Scaling Up

Scientific computing, computation at scale

• NRCreport, Ch. 10, “The Seven Computational Giants of Massive Data Analysis”

• DSHandbook-Py, Ch. 21, “Performance and Computer Memory”; Ch. 22, “Computer Memory and Data Structures”

• ‡[ComputationalStatistics-Python] Cliburn Chan. Computational Statistics in Python. https://people.duke.edu/~ccc14/sta-663/index.html

• CRAN Task View: High Performance Computing, https://cran.r-project.org/web/views/HighPerformanceComputing.html

Numerical computing

• [NumericalComputing] Ward Cheney and David Kincaid. 2013. Numerical Mathematics and Computing, 7th ed. Brooks/Cole: Cengage Learning.

• CRAN Task View: Numerical Mathematics, https://cran.r-project.org/web/views/NumericalMathematics.html

Optimization (e.g., MLE, gradient descent, stochastic gradient descent, EM algorithm, neural nets)

• DSHandbook-Py, Ch. 23, “Maximum Likelihood Estimation and Optimization” (gradient descent)

• DeepLearning, Ch. 4, “Numerical Computation”; Ch. 6, “Deep Feedforward Networks”; Ch. 8, “Optimization for Training Deep Models”

• CRAN Task View: Optimization and Mathematical Programming, https://cran.r-project.org/web/views/Optimization.html

Linear algebra, matrix computations

• [MatrixComputations] Gene H. Golub and Charles F. Van Loan. 2013. Matrix Computations, 4th ed. Johns Hopkins University Press.

• ‡[NetflixMatrix] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. “Matrix Factorization Techniques for Recommender Systems.” Computer, August: 42-9. https://datajobs.com/data-science-repo/Recommender-Systems-[Netflix].pdf

• ‡[BigDataPCA] Jianqing Fan, Qiang Sun, Wen-Xin Zhou, Ziwei Zhu. “Principal Component Analysis for Big Data.” http://www.princeton.edu/~ziweiz/pca.pdf

• ‡[RandomSVD] Andrew Tulloch. 2009. “Fast Randomized SVD.” https://research.fb.com/fast-randomized-svd

• ‡[FactorSGD] Rainer Gemulla, Peter J. Haas, Erik Nijkamp, and Yannis Sismanis. 2011. “Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent.” KDD. http://www.cs.utah.edu/~hari/teaching/bigdata/gemulla11dsgd.pdf

• ‡[Sparse] Max Grossman. 2015. “101 Ways to Store a Sparse Matrix.” https://medium.com/@jmaxg3/101-ways-to-store-a-sparse-matrix-c7f2bf15a229

• ‡[DontInvert] John D. Cook. 2010. “Don’t Invert that Matrix.” https://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix

Simulation-based inference, resampling, Monte Carlo methods, MCMC, Bayes, approximate inference


• ‡[MarkovVisually] Victor Powell. “Markov Chains Explained Visually.” http://setosa.io/ev/markov-chains

• DSHandbook-Py, Ch. 25, “Stochastic Modeling” (Markov chains, MCMC, HMM)

• Bayes, ComputationalStatistics-Py

• DeepLearning, Ch. 17, “Monte Carlo Methods”; Ch. 19, “Approximate Inference”

• ‡[VariationalInference] Jason Eisner. 2011. “High-Level Explanation of Variational Inference.” https://www.cs.jhu.edu/~jason/tutorials/variational.html

• CRAN Task View: Bayesian Inference, https://cran.r-project.org/web/views/Bayesian.html

Parallelism, MapReduce, Split-Apply-Combine

• ‡[MapReduceIntuition] Jigsaw Academy. 2014. “Big Data Specialist: MapReduce.” https://www.youtube.com/watch?v=TwcYQzFqg-8&feature=youtu.be

• MMDS, Ch. 2, “Map-Reduce and the New Software Stack” (3rd Edition discusses Spark & TensorFlow)

• Algorithms, Ch. 3, “Scalable Algorithms and Associative Statistics”; Ch. 4, “Hadoop and MapReduce”

• NRCreport, Ch. 6, “Resources, Trade-offs, and Limitations”

• TidyData-R. (Split-apply-combine is the motivating principle behind the “tidyverse” approach.) See also Part III, “Program” (pipes, functions, vectors, iteration).

• ‡[TidySAC-Video] Hadley Wickham. 2017. “Data Science in the Tidyverse.” https://www.rstudio.com/resources/videos/data-science-in-the-tidyverse

• See also FactorSGD.
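
The split-apply-combine pattern named in this section's heading can be sketched in a few lines. This is a generic illustration in plain Python rather than the tidyverse or MapReduce APIs covered in the readings; the records, field names, and the helpers `split` and `apply_combine` are all made up for the example.

```python
from collections import defaultdict
from statistics import mean

# Toy records with invented values.
records = [
    {"group": "a", "value": 1.0},
    {"group": "b", "value": 10.0},
    {"group": "a", "value": 3.0},
]


def split(rows, key):
    """Split: partition the rows by the value of one field."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return groups


def apply_combine(groups, field, func):
    """Apply a summary function to each group independently, then
    combine the per-group results into a single mapping."""
    return {k: func([r[field] for r in rows]) for k, rows in groups.items()}


groups = split(records, "group")
summary = apply_combine(groups, "value", mean)
print(summary)
```

Because each group is summarized independently of the others, the apply step is the part that can be distributed across workers, which is the link to the parallelism readings in this list.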

Functional Programming

• ComputationalStatistics-Python, “Functions are first class objects” through first exercises

• DSHandbook-Py, Ch. 20, “Programming Language Concepts”

• See also Haskell.

Scaling iteration, streaming data, online algorithms (Spark)

• NRCreport, Ch. 4, “Temporal Data and Real-Time Algorithms”

• MMDS, Ch. 4, “Mining Data Streams”

• ‡[BDAS] AMPLab. BDAS: The Berkeley Data Analytics Stack. https://amplab.cs.berkeley.edu/software

• †[Spark] Mohammed Guller. 2015. Big Data Analytics with Spark: A Practitioner’s Guide to Using Spark for Large-Scale Data Processing, Machine Learning, and Graph Analytics, and High-Velocity Data Stream Processing. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-0964-6

• DSHandbook-Py, Ch. 13, “Big Data”

• DeepLearning, Ch. 10, “Sequence Modeling: Recurrent and Recursive Nets”

General Resources for Python and R


• †[DSHandbook-Py] Field Cady. 2017. The Data Science Handbook (Python-based). Wiley. http://onlinelibrary.wiley.com.ezaccess.libraries.psu.edu/book/10.1002/9781119092919

• ‡[Tutorials-R] Ujjwal Karn. 2017. “A curated list of R tutorials for Data Science, NLP and Machine Learning.” https://github.com/ujjwalkarn/DataScienceR

• ‡[Tutorials-Python] Ujjwal Karn. 2017. “A curated list of Python tutorials for Data Science, NLP and Machine Learning.” https://github.com/ujjwalkarn/DataSciencePython

Cutting and Bleeding Edge of Data Science Languages

• [Scala] †Vishal Layka and David Pollak. 2015. Beginning Scala. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-0232-6. †Nicolas Patrick. 2014. Scala for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1901910. ‡Scala site: https://www.scala-lang.org (See also )

• [Julia] †Ivo Balbaert. 2015. Getting Started with Julia. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1973847. ‡Douglas Bates. 2013. “Julia for R Programmers.” http://www.stat.wisc.edu/~bates/JuliaForRProgrammers.pdf. ‡Julia site: https://julialang.org

• [Haskell] †Richard Bird. 2014. Thinking Functionally with Haskell. https://doi-org.ezaccess.libraries.psu.edu/10.1017/CBO9781316092415. †Hakim Cassimally. 2017. Learning Haskell Programming. https://www.lynda.com/Haskell-tutorials/Learning-Haskell-Programming/604926-2.html. †James Church. 2017. Learning Haskell for Data Analysis. https://www.lynda.com/Developer-tutorials/Learning-Haskell-Data-Analysis/604234-2.html. Haskell site: https://www.haskell.org

• [Clojure] †Mark McDonnell. 2017. Quick Clojure: Essential Functional Programming. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2952-1. †Akhil Wali. 2014. Clojure for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1674848. †Arthur Ulfeldt. 2015. https://www.lynda.com/Clojure-tutorials/Up-Running-Clojure/413127-2.html. ‡Clojure site: https://clojure.org

• [TensorFlow] ‡Abadi et al. (Google). 2016. “TensorFlow: A System for Large-Scale Machine Learning.” https://arxiv.org/abs/1605.08695. ‡Udacity. “Deep Learning.” https://www.udacity.com/course/deep-learning--ud730. ‡TensorFlow site: https://www.tensorflow.org

• [H2O] Darren Cook. 2016. Practical Machine Learning with H2O: Powerful, Scalable Techniques for Deep Learning and AI. O’Reilly. Arno Candel and Viraj Parmar. 2015. Deep Learning with H2O. ‡H2O site: https://www.h2o.ai

Social Data Structures

Space and Time

• †[GIA] David O’Sullivan, David J. Unwin. 2010. Geographic Information Analysis, Second Edition. John Wiley & Sons. http://onlinelibrary.wiley.com/book/10.1002/9780470549094

• [Space-Time] Donna J. Peuquet. 2003. Representations of Space and Time. Guilford.

• Roger S. Bivand, Edzer Pebesma, Virgilio Gómez-Rubio. 2013. Applied Spatial Data Analysis in R. Free through library: http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-7618-4. See also Edzer Pebesma. 2016. “Handling and Analyzing Spatial, Spatiotemporal and Movement Data in R.” https://edzer.github.io/UseR2016

• ‡[PySAL] Sergio J. Rey and Dani Arribas-Bel. 2016. “Geographic Data Science with PySAL and the pydata Stack.” http://darribas.org/gds_scipy16


• DataScience-Python NB 04.13, “Geographic Data with Basemap”

• Time: “Longitudinal” intra-individual data, as common in developmental psychology. †[Longitudinal] Garrett Fitzmaurice, Marie Davidian, Geert Verbeke, and Geert Molenberghs, editors. 2009. Longitudinal Data Analysis. Chapman & Hall / CRC. http://pensu.eblib.com/patron/FullRecord.aspx?p=359998

• Time: “Time Series, Time Series Cross Section, Panel,” as common in economics, political science, and sociology. DataScience-Python, Ch. 3 re: pandas; DSHandbook-Py, Ch. 17, “Time Series Analysis”; Econometrics; GelmanHill

• Time: “Sequential Data, Streams,” as in NLP, machine learning. Spark and other “streaming data” readings.

• Space-Time: Data with continuity, connectivity, neighborhood structure. See DeepLearning, Chapter 9, re: convolution.

• CRAN Task Views: Spatial, https://cran.r-project.org/web/views/Spatial.html; Spatiotemporal, https://cran.r-project.org/web/views/SpatioTemporal.html; Time Series, https://cran.r-project.org/web/views/TimeSeries.html

Network Graphs

• ‡[Networks] David Easley and Jon Kleinberg. 2010. Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press. http://www.cs.cornell.edu/home/kleinber/networks-book

• MMDS, Ch. 5, “Link Analysis”; Ch. 10, “Mining Social-Network Graphs”

• †[Networks-R] Douglas Luke. 2015. A User’s Guide to Network Analysis in R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-3-319-23883-8

• †[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry. 2017. Python for Graph and Network Analysis. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-53004-8

Hierarchy, Clustered Data, Aggregation, Mixed Models

• †[GelmanHill] Andrew Gelman and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press. http://pensu.eblib.com/patron/FullRecord.aspx?p=288457

• †[MixedModels] Eugene Demidenko. 2013. Mixed Models: Theory and Applications with R, 2nd ed. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10748641

Social Data Channels

Language, Text, Speech, Audio

• ‡[NLP] Daniel Jurafsky and James H. Martin. Forthcoming (2017). Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics, 3rd ed. Prentice-Hall. Preprint: https://web.stanford.edu/~jurafsky/slp3

• InfoRetrieval

• ‡[CoreNLP] Stanford CoreNLP. https://stanfordnlp.github.io/CoreNLP


• †[TextAsData] Grimmer and Stewart. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis. https://doi.org/10.1093/pan/mps028

• Quinn-Topics

• FightinWords

• †[NLTK-Python] Jacob Perkins. 2010. Python Text Processing with NLTK 2.0 Cookbook. Packt. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10435387. †[TextAnalytics-Python] Dipanjan Sarkar. 2016. Text Analytics with Python. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2388-8

• DSHandbook-Py, Ch. 16, “Natural Language Processing”

• Matthew J. Denny. http://www.mjdenny.com/Text_Processing_In_R.html

• CRAN: Natural Language Processing, https://cran.r-project.org/web/views/NaturalLanguageProcessing.html

• †[AudioData] Dean Knox and Christopher Lucas. 2017. “The Speaker-Affect Model: Measuring Emotion in Political Speech with Audio Data.” http://christopherlucas.org/files/PDFs/sam.pdf; https://www.youtube.com/watch?v=Hs8A9dwkMzI

• †Morgan & Claypool Synthesis Lectures on Human Language Technologies: http://www.morganclaypool.com/toc/hlt/1/1

• †Morgan & Claypool Synthesis Lectures on Speech and Audio Processing: http://www.morganclaypool.com/toc/sap/1/1

Vision, Image, Video

• ‡[ComputerVision] Richard Szeliski. 2010. Computer Vision: Algorithms and Applications. Springer. http://szeliski.org/Book

• MixedModels, Ch. 11, “Statistical Analysis of Shape”; Ch. 12, “Statistical Image Analysis”

• Audio/Video/Volumetric: See DeepLearning, Chapter 9, re: convolution.

• †Morgan & Claypool Synthesis Lectures on Image, Video, and Multimedia Processing: http://www.morganclaypool.com/toc/ivm/1/1

• †Morgan & Claypool Synthesis Lectures on Computer Vision: http://www.morganclaypool.com/toc/cov/1/1

Approaches to Learning from Data (The Analytics Layer)

• NRCreport, Ch. 7, “Building Models from Massive Data”

• MMDS (data-mining)

• CausalInference

• ‡[VisualAnalytics] Daniel Keim, Jorn Kohlhammer, Geoffrey Ellis, and Florian Mansmann, editors. 2010. Mastering the Information Age: Solving Problems with Visual Analytics. The Eurographics Association, Goslar, Germany. http://www.vismaster.eu/book. See also †Morgan & Claypool Synthesis Lectures on Visualization: http://www.morganclaypool.com/toc/vis/2/1

• ‡[Econometrics] Bruce E. Hansen. 2014. Econometrics. http://www.ssc.wisc.edu/~bhansen/econometrics


• ‡[StatisticalLearning] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning with Applications in R. Springer. http://www-bcf.usc.edu/~gareth/ISL/index.html

• [MachineLearning] Christopher Bishop. 2006. Pattern Recognition and Machine Learning. Springer.

• †[Bayes] Peter D. Hoff. 2009. A First Course in Bayesian Statistical Methods. Springer. http://link.springer.com/book/10.1007%2F978-0-387-92407-6

• †[GraphicalModels] Søren Højsgaard, David Edwards, Steffen Lauritzen. 2012. Graphical Models with R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-2299-0

• [InformationTheory] David MacKay. 2003. Information Theory, Inference, and Learning Algorithms. Springer.

• ‡[DeepLearning] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org (see also Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. “Deep Learning.” Nature 521(7553): 436–444. https://doi.org/10.1038/nature14539)

See also ‡[Keras] Keras: The Python Deep Learning Library. https://keras.io. François Chollet, Directory of Keras tutorials: https://github.com/fchollet/keras-resources

• †[SignalProcessing] Jose Maria Giron-Sierra. 2013. Digital Signal Processing with MATLAB Examples (Volumes 1-3). Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-981-10-2534-1


Penn State Policy Statements

Academic Integrity

Academic integrity is the pursuit of scholarly activity in an open, honest, and responsible manner. Academic integrity is a basic guiding principle for all academic activity at The Pennsylvania State University, and all members of the University community are expected to act in accordance with this principle. Consistent with this expectation, the University’s Code of Conduct states that all students should act with personal integrity; respect other students’ dignity, rights, and property; and help create and maintain an environment in which all can succeed through the fruits of their efforts.

Academic integrity includes a commitment by all members of the University community not to engage in or tolerate acts of falsification, misrepresentation, or deception. Such acts of dishonesty violate the fundamental ethical principles of the University community and compromise the worth of work completed by others.

Disability Accommodation

Penn State welcomes students with disabilities into the University’s educational programs. Every Penn State campus has an office for students with disabilities. The Student Disability Resources (SDR) website provides contact information for every Penn State campus (http://equity.psu.edu/sdr/disability-coordinator). For further information, please visit the Student Disability Resources website (http://equity.psu.edu/sdr).

In order to receive consideration for reasonable accommodations, you must contact the appropriate disability services office at the campus where you are officially enrolled, participate in an intake interview, and provide documentation. See documentation guidelines at (http://equity.psu.edu/sdr/guidelines). If the documentation supports your request for reasonable accommodations, your campus disability services office will provide you with an accommodation letter. Please share this letter with your instructors and discuss the accommodations with them as early as possible. You must follow this process for every semester that you request accommodations.

Psychological Services

Many students at Penn State face personal challenges or have psychological needs that may interfere with their academic progress, social development, or emotional wellbeing. The university offers a variety of confidential services to help you through difficult times, including individual and group counseling, crisis intervention, consultations, online chats, and mental health screenings. These services are provided by staff who welcome all students and embrace a philosophy respectful of clients’ cultural and religious backgrounds, and sensitive to differences in race, ability, gender identity, and sexual orientation.

• Counseling and Psychological Services at University Park (CAPS) (http://studentaffairs.psu.edu/counseling): 814-863-0395

• Penn State Crisis Line (24 hours/7 days/week): 877-229-6400

• Crisis Text Line (24 hours/7 days/week): Text LIONS to 741741

Educational Equity

Penn State takes great pride to foster a diverse and inclusive environment for students, faculty, and staff. Consistent with University Policy AD29, students who believe they have experienced or observed a hate crime, an act of intolerance, discrimination, or harassment that occurs at Penn State are urged to report these incidents as outlined on the University’s Report Bias webpage (http://equity.psu.edu/reportbias).

Background Knowledge

Students will have a variety of backgrounds. Prior to beginning interdisciplinary coursework to fulfill Social Data Analytics degree requirements, including SoDA 501 and 502, students are expected to have advanced (graduate) training in at least one of the component areas of Social Data Analytics and a familiarity with basic concepts in the others.

With regard to specialization, students are expected to have advanced (graduate) training in ONE of the following:

• quantitative social science methodology and a discipline of social science (as would be the case for a second-year PhD student in Political Science, Sociology, Criminology, Human Development and Family Studies, or Demography), OR

• statistics (as would be the case for a second-year PhD student in Statistics), OR

• information science or informatics (as would be the case for a second-year PhD student in Information Science and Technology, or a second-year PhD student in Geography specializing in GIScience), OR

• computer science (as would be the case for a second-year PhD student in Computer Science and Engineering).

This requirement is met as a matter of meeting home program requirements for students in the dual-title PhD, but may require additional coursework on the part of students in other programs wishing to pursue the graduate minor.

With regard to general preparation, students are expected to have ALL of the following technical knowledge:

• basic programming skills, including basic facility with R and/or Python, AND

• basic knowledge of relational databases and/or geographic information systems, AND

• basic knowledge of probability, applied statistics, and/or social science research design, AND

• basic familiarity with a substantive or theoretical area of social science (e.g., 300-level coursework in political science, sociology, criminology, human development, psychology, economics, communication, anthropology, human geography, social informatics, or similar fields).

It is not unusual for students to have one or more gaps in this preparation at time of application to the SoDA program. Students should work with Social Data Analytics advisers to develop a plan for timely remediation of any deficiencies, which generally will not require formal coursework for students whose training and interests are otherwise appropriate for pursuit of the Social Data Analytics degree. Where possible, this will be addressed at time of application to the Social Data Analytics program.

To this end, some free training materials are linked in the reference section, and there are also abundant free, high-quality, self-paced, course-style training materials on these and related subjects available through edX (http://www.edx.org), Udacity (http://www.udacity.com), Codecademy (http://codecademy.com), and Lynda (http://lynda.psu.edu).

Course Schedule 2018

January 11

• Introductions

• Syllabus (What this course is and isn’t. How this course (hopefully) works.)

• Further reference:

    10RulesforData, SoftwareCarpentry Lessons

    Section “General Resources for Python and R”

    Recommended: Python via Anaconda (https://www.anaconda.com), R (https://www.r-project.org) & RStudio (https://www.rstudio.com), account on ICS-ACI (https://ics.psu.edu), Git (https://git-scm.com)

• Exercise 1 (Team Updates 1/18, Due 1/25): BitByBit Exercise 2.6 (a-g). Teams:

    TeamFrancisco: Sara, Claire, Rosemary, Fangcao

    TeamFreelin: Brittany, Arif, So Young, Xiaoran

    TeamKankane: Shipi, Steve, Lulu, Omer

January 18

• Readings (send list of confusing terms/concepts by 7:00 am, Jan 17, Wednesday):

    BitByBit Ch. 1 & 2

    CompSocSci, Monroe-No, Monroe-5Vs

    One article from a different discipline listed in the “Multidisciplinary Perspectives” section, excluding Business-BigData

• Further reference: Section “Big Data & Social Data Analytics”

• Exercise 1: Team Updates

January 25

• Readings (send list of confusing terms/concepts by 7:00 am, Jan 24, Wednesday):

    GoogleFlu, GoogleBooks, EmbeddingsBias, MachineBias, RacistBot, BDSS-Census

    ResearchMethodsKB “Measurement”; Quinn-Topics

• Further reference:

    Section “¡Cuidado!”

    Section “Measurement, Reliability, and Validity”

• Exercise 1 Due: Teams Report

• Determine “discussion lead” dates

• Exercise 2 (Due 2/8): Wikipedia / Google Trends exercise. Teams:

    TeamKelling: Claire, Brittany, Arif

    TeamYalcin: Omer, Lulu, Fangcao

    TeamPang: Rosemary, Shipi, Sara

    TeamSun: Xiaoran, Steve, So Young

February 1

• Bing Pan (RPTM), “Big Data and Forecasting in Tourism”

• Readings (send list of confusing terms/concepts by 7:00 am, Jan 31, Wednesday):

    Bing Pan: “Identifying the Next Non-Stop Flying Market with a Big Data Approach,” https://doi.org/10.1016/j.tourman.2017.12.008; “Google Trends and Tourist Arrivals: Emerging Biases and Proposed Corrections,” https://doi.org/10.1016/j.tourman.2017.10.014. (See also “Forecasting Destination Weekly Hotel Occupancy with Big Data,” http://journals.sagepub.com/doi/abs/10.1177/0047287516669050; “Forecasting tourism demand with composite search index,” https://doi.org/10.1016/j.tourman.2016.07.005.)

    UnobtrusiveMeasures

    Multivariate-R Chapter 1 (Don’t get bogged down when the math starts); LatentVariables Chapter 1 (Skim; don’t get bogged down in the math); NetflixPrize

    ResearchMethodsKB “Sampling”; NRCReport Chapter 8, “Sampling and Massive Data”; BitByBit Chapter 3, “Asking Questions”

• Further reference:

    Section “Indirect, Unobtrusive, Nonreactive Measures; Data Exhaust”

    Section “Multiple Measures, Latent Variable Measurement”

    Section “Sampling and Survey Design”

    Section “Open data, APIs, linked data ...”

    Section “Space and Time” (readings on Time)

February 8

• Clio Andris (GEOG), “What AirBNB and Yelp can teach us about human behavior in cities”

• Readings (send list of confusing terms/concepts by 7:00 am, Feb 7, Wednesday):

    Clio Andris: “Using Yelp to Find Romance in the City: A Case of Restaurants in Four Cities,” https://www.dropbox.com/s/uink0zcklwcpo5g/Yelp_Restaurants.pdf?dl=0; “Hidden Style in the City: An Analysis of Geolocated Airbnb Rental Images in Ten Major Cities,” https://www.dropbox.com/s/t7y7f6880m4ty7v/AirBNB_Analysis.pdf?dl=0

    InfoRetrieval Chs. 1, 2, 6 (you may prefer the slides from their classes). My primary hope here is that you understand the “vector space model” and “cosine similarity” (from Chapter 6). My secondary hope is that you understand the basics of Boolean information retrieval (and notions like “index,” “inverted index,” and “postings”). My tertiary hope is that you are exposed to some basic concepts of text analytics / NLP (including “tokenization,” “normalization,” “stemming,” “lemmatization,” “stop words,” “tf-idf”).

    FightinWords (FW was used by Jurafsky in http://firstmonday.org/ojs/index.php/fm/article/view/4944/3863 and a best-selling book, The Language of Food, for similar applications to Andris’, based on Yelp reviews, and now appears in his textbook, NLP)

• Further reference:

    Section “Space and Time”

    Section “Web Scraping”

    Section “Data Representations, Data Mappings”

    Section “Feature selection, feature extraction, feature engineering ...”

    Section “Databases and data management” (esp. SQL)

    Section “Language, Text, Speech, Audio”

• Exercise 2 Due: Teams Report
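
For anyone who wants the “vector space model,” “tf-idf,” and “cosine similarity” from the InfoRetrieval reading in concrete form, here is a minimal Python sketch. The whitespace tokenizer, the tf × log(N/df) weighting, and the three toy review snippets are illustrative assumptions rather than anything taken from the readings, which discuss several weighting variants.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document.

    Weight = raw term frequency * log(N / document frequency),
    one common tf-idf variant (an assumption for this sketch).
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy "reviews" (made up for illustration).
docs = [
    "great pasta and great wine",
    "romantic spot with great wine",
    "cheap parking near the stadium",
]
vecs = tfidf_vectors(docs)
sim_01 = cosine_similarity(vecs[0], vecs[1])  # share "great" and "wine"
sim_02 = cosine_similarity(vecs[0], vecs[2])  # share no terms
print(sim_01, sim_02)
```

Documents that share weighted terms point in similar directions and score above zero; documents with no terms in common are orthogonal and score zero.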

February 15 (Postponed)

• Conrad Tucker (IE), “Cybersecurity Policies and Their Impact on Dynamic Data Driven Application Systems”

• Readings (send list of confusing terms/concepts by 7:00 am, Feb 14, Wednesday):

    Conrad Tucker: “Cybersecurity Policies and Their Impact on Dynamic Data Driven Application Systems,” http://ieeexplore.ieee.org/document/8064151. See also “Generative Adversarial Networks for Increasing the Veracity of Big Data,” http://ieeexplore.ieee.org/document/8258219

February 22

• David Reitter (IST), “Computational Psycholinguistics”

• Readings (two weeks’ worth):

    David Reitter: “Alignment in Web-Based Dialogue: Studies in Big Data Computational Psycholinguistics” (Based on data from the Cancer Survivors Network and Reddit), http://www.david-reitter.com/pub/reitter2017alignment.pdf

    Similarity

    PatternRecognition Ch. 2, “Representation”

    MMDS Chapter 3, “Finding Similar Items” (to understand “minhashing” and “locality-sensitive hashing” you’ll need to understand “hashing” – Section 1.3.2, also discussed in InfoRetrieval Chapter 3)

• Further reference:

    Section “Similarity, Distance, Association ...”

    Section “Derived Data Representations ...”

    Section “Record Linkage, Entity Resolution ...”

    Section “Clustering, Hashing, Compression ...”
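
To make “minhashing” from MMDS Chapter 3 a little less abstract, the sketch below estimates the Jaccard similarity of two shingle sets from short signatures. It rests on stated assumptions: character shingles of length 4, 200 seeded SHA-1 hashes standing in for random permutations, and two made-up sentences. The function names are hypothetical and do not come from MMDS.

```python
import hashlib


def shingles(text, k=4):
    """Set of overlapping character k-grams ("shingles") in the text."""
    text = text.lower()
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity: |A intersect B| / |A union B|."""
    return len(a & b) / len(a | b)


def seeded_hash(item, seed):
    """Deterministic 64-bit hash of an item; each seed acts as a different hash function."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")


def minhash_signature(items, num_hashes=200):
    """For each hash function, keep the minimum hash value over the set."""
    return [min(seeded_hash(x, seed) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; this estimates the Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


# Two made-up sentences for illustration.
text_a = "the quick brown fox jumps over the lazy dog"
text_b = "the quick brown fox jumped over a lazy dog"

set_a, set_b = shingles(text_a), shingles(text_b)
true_sim = jaccard(set_a, set_b)
est_sim = estimate_jaccard(minhash_signature(set_a), minhash_signature(set_b))
print(true_sim, est_sim)
```

The point of the signature is size: each document is reduced to 200 integers no matter how long it is, and locality-sensitive hashing then bands those signatures so that only likely-similar pairs are ever compared.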

March 1

• Reading:

    BitByBit Ch. 5 (“Creating Mass Collaboration”)

    Crowdsourcing “Introduction” and Ch. 1, “Concepts, Theories, and Cases of Crowdsourcing”

    HumanComputation Chs. 1, 2, 5

• Further reference:

    Section “Crowdsourcing, Human Computation, Citizen Science, Web Experiments”

    Section “Making Up Data (smoothing, convolution, kernels ...)”

    Section “Vision, image, video”

• Semester Projects Must be Approved Before Spring Break

March 8 - Spring Break

March 15

• Daniel DellaPosta (SOC), “Networks and the Mid-20th Century American Mafia”

• Reading:

    Dan DellaPosta: “Network Closure in the Mid-20th Century American Mafia” (Social Networks, 2017), https://www-sciencedirect-com.ezaccess.libraries.psu.edu/science/article/pii/S037887331630199X; “Between Clique and Corporation: Boundary-Spanning in Solidary Groups” (R&R in American Journal of Sociology, in the Box folder)

    Networks Chs. 1, 2

    MMDS Ch. 5, “Link Analysis”

    MarkovVisually

• Further references:

    Section “Networks and Graphs”

    Section “Simulation, resampling, Markov Chains ...”

    DeepLearning (different kind of network)

• 25% Project Review

March 22

• Reading:

    DeepLearning Ch. 2, “Linear Algebra”

    Shalizi-ADA Chs. 16-18 (“Principal Components Analysis,” “Factor Models,” “Nonlinear Dimension Reduction”)

• Further reference:

    Section “Dimensionality reduction, decomposition ...”

    Section “Multiple measures, latent variable measurement”

    Section “Linear algebra, matrix computations”

March 29

• Naomi Altman (STAT), “Generalizing PCA”

• Reading:

    Altman: “Generalizing PCA,” 2015 slides, http://personal.psu.edu/nsa1/AltmanWebpage/PCAToronto.pdf

    MapReduceIntuition (7 minute video providing “divide and conquer” intuition to MapReduce)

    TidySAC-Video: Hadley Wickham on the tidyverse, “split-apply-combine” with dplyr, and tidy data (1 hour video). (More detail, but perhaps less intuition, in book form: discussion of “tidying” data, Chapter 5 of TidyData-R.)

    MMDS Ch. 2 (Read the Chapter 2 in the new “BETA” version of the book, which also touches on Spark and Tensorflow in Section 2.4, “Extensions to MapReduce”: http://i.stanford.edu/~ullman/mmdsn.html)

• Further reference:

    Section “Nonlinear dimension reduction, manifold learning”

    Section “Databases and data management” (esp. NoSQL)

    Section “Theoretically-structured approaches to data wrangling”

    Section “Parallelism, MapReduce, Split-Apply-Combine”

    Section “Functional programming”

    Section “Cutting and bleeding edge ...”

• 50% Project Review

• Exercise 3 (Individual), Due 4/5

April 5

• Prasenjit Mitra (IST), “Classification of Tweets from Disaster Scenarios”

• Reading:

    Prasenjit Mitra: Rudra et al., “Summarizing Situational and Topical Information During Crises,” https://arxiv.org/pdf/1610.01561.pdf; Imran, Mitra, & Castillo, “Twitter as a Lifeline: Human-annotated Twitter Corpora for NLP of Crisis-related Messages,” http://mimran.me/papers/imran_prasenjit_carlos_lrec2016.pdf

    NLP (Jurafsky and Martin) Chapters 15 and 16, “Vector Semantics” and “Semantics with Dense Vectors”

• Further reference:

    Section “Language, text, speech, audio”

    GloVe, word2vec, word2vecExplained, t-SNE

    Section “Mobile devices, distributed sensors ...”

    Section “Crowdsourcing, human computation, citizen science ...”

    Section “Scaling, iteration, streaming data, online algorithms”

• Exercise 3 Due

April 12

• Reading:

    BitByBit Ch. 4, “Running Experiments”; review Ch. 2 sections, “Natural experiments in observable data” examples

    CausalInference Chs. 1-2

• Further reference:

    Section “Experimental and observational designs for causal inference”

    Section “Crowdsourcing, web experiments”

    Section “Human subjects ...”

• 75% Project Review

April 19

• Reading:

    BitByBit Ch. 6, “Ethics”

    PSU-ORP “Common Rule and Other Changes,” https://www.research.psu.edu/irb/commonrulechanges

    DataPrivacy

    Reproducibility

    Refresh on MachineBias

• Further reference:

    Section “Ethics and Scientific Responsibility in Big Social Data”

    Section “¡Cuidado!”

April 26 - LAST CLASS MEETING

• Team Project Presentations

May 3 - Team Projects Due

Past Visiting Speakers in SoDA 501 and 502

SoDA 502 (Fall 2017)

• Chris Zorn (PLSC)

• Diane Felmlee (SOC)

• Alan MacEachren (GEOG)

• Eric Plutzer (PLSC)

• James LeBreton (PSYCH)

• Sesa Slavkovic (STAT)

• Guido Cervone (GEOG)

• Ashton Verdery (SOC)

• Maggie Niu (STAT)

• Reka Albert (PHYS)

SoDA 501 (Spring 2017)

• Tim Brick (HDFS), “Towards real-time monitoring and intervention using wearable technology”

• Aylin Caliskan (Princeton), “A Story of Discrimination and Unfairness: Bias in Word Embeddings”

• Jay Yonamine (Google - IGERT alum), “Data Science in Industry”

• Johnathan Rush (Illinois), “Geospatial Data Science Workshop”

• Rick Gilmore (PSYCH), “Toward a more reproducible and robust science of human behavior”

• Glenn Firebaugh (SOC), “Measuring Inequality and Segregation with US Census Data”

• Charles Twardy (Sotera), “Data Science for Search and Rescue”

• Anna Smith (Ohio State), “A Hierarchical Model for Network Data in a Latent Hyperbolic Space”

• Rebecca Passonneau (CSE), “Omnigraph: Rich Feature Representation for Graph Kernel Learning”

• Alex Klippel (GEOG), “Virtual Reality for Immersive Analytics”

• Murali Haran (STAT), “A Computationally Efficient Projection-based Approach for Spatial Generalized Mixed Models”

SoDA 502 (Fall 2016)

• Clio Andris (GEOG), “Integrating Social Network Data into GISystems”

• Jia Li (STAT), “Clustering under the Wasserstein Metric”

• Rachel Smith (CAS), “Stigma Networks Perceptions of Sociograms”

• Zita Oravecz (HDFS)

• Bethany Bray (Methodology Center), “Latent Class and Latent Transition Analysis”

• Dave Hunter (STAT), “Model Based Clustering of Large Networks”

• Scott Bennett (PLSC), “ABM Model of Insurgency”

• David Reitter (IST)

• Suzanna Linn (PLSC), “Methodological Issues in Automated Text Analysis: Application to News Coverage of the US Economy”

SoDA 501 (Spring 2016)

• Bruce Desmarais (PLSC), “Learning in the Sunshine: Analysis of Local Government Email Corpora”

• Timothy Brick (HDFS), “Mapping and Manipulating Facial Expression”

• Qunying Huang (USC), “Social Media: An Emerging Data Source for Human Mobility Studies”

• Lingzhou Xue (STAT), “An Introduction to High-Dimensional Graphical Models”

• Ashton Verdery (SOC), “Sampling from Network Data”

• Sarah Battersby (Tableau), “Helping People See and Understand Spatial Data”

• Alexandra Slavkovic (STAT), “Statistical Privacy with Network Data”

• Lee Giles (IST), “Machine Learning for Scholarly Big Data”

Readings and References (updated Spring 2018)

We will discuss a relatively small subset of the readings listed here, and this will vary based on topics and readings discussed by visiting speakers and you yourselves. The remainder are provided here as curated references for more in-depth investigation in both the theory and practice related to the topic (with the latter heavily weighted toward resources in Python and R).

† Material that is, at last check, made available through the Penn State library. Most journal links should work if you are logged in to a Penn State machine or through the Penn State VPN. Some article archives and most books require additional authentication through webaccess. If links are broken, start directly from a search via https://www.libraries.psu.edu. Some books require installation of e-readers like Adobe Digital Editions. Links to lynda.com can be accessed through http://lynda.psu.edu.

‡ Material that is, at last check, legally provided for free. In some cases, these are the preprint versions of published material.

§ Material for which a legal selection is or will be provided through the class Box folder.

Big Data & Social Data Analytics

Overviews

• ‡[CompSocSci] David Lazer, Alex Pentland, Lada Adamic, Sinan Aral, Albert-Laszlo Barabasi, Devon Brewer, Nicholas Christakis, Noshir Contractor, James Fowler, Myron Gutmann, Tony Jebara, Gary King, Michael Macy, Deb Roy, and Marshall Van Alstyne. 2009. “Computational Social Science.” Science 323(5915): 721-3 + Supp. Feb 6. http://science.sciencemag.org/content/323/5915/721.full, https://gking.harvard.edu/files/gking/files/LazPenAda09.pdf

• ‡[NRCreport] National Research Council. 2013. Frontiers in Massive Data Analysis. National Academies Press. (Free w/ registration: http://www.nap.edu/catalog.php?record_id=18374). Ch. 1 “Introduction”; Ch. 2 “Massive Data in Science, Technology, Commerce, National Defense, Telecommunications, and other Endeavors.”

• ‡[MMDS] Jure Leskovec, Anand Rajaraman, and Jeff Ullman. 2014. Mining of Massive Datasets. Cambridge University Press. http://www.mmds.org (BETA version of Third Edition: http://i.stanford.edu/~ullman/mmdsn.html)

• §[BitByBit] Matthew J. Salganik. 2018 (Forthcoming). Bit by Bit: Social Research in the Digital Age. Princeton University Press. Ch. 1 “Introduction”; Ch. 2 “Observing Behavior.”

• DSHandbook-Py, Ch. 1 “Introduction: Becoming a Unicorn”; Ch. 2 “The Data Science Road Map.”

Burt-schtick

• ‡[Monroe-5Vs] Burt L. Monroe. 2013. “The Five Vs of Big Data Political Science: Introduction to the Special Issue on Big Data in Political Science.” Political Analysis 21(V5): 1–9. https://doi.org/10.1017/S1047198700014315 (Volume, Velocity, Variety, Vinculation, Validity)

• †[Monroe-No] Burt L. Monroe, Jennifer Pan, Margaret E. Roberts, Maya Sen, and Betsy Sinclair. 2015. “No! Formal Theory, Causal Inference, and Big Data Are Not Contradictory Trends in Political Science.” PS: Political Science & Politics 48(1): 71–4. http://dx.doi.org/10.1017/S1049096514001760

• †[Quinn-Topics] Kevin M. Quinn, Burt L. Monroe, Michael Colaresi, Michael H. Crespin, and Dragomir R. Radev. 2010. “How to Analyze Political Attention with Minimal Assumptions and Costs.” American Journal of Political Science 54(1): 209–28. http://onlinelibrary.wiley.com/doi/10.1111/j.1540-5907.2009.00427.x/full (esp. topic modeling as measurement, approach to validation)

• †[FightinWords] Burt L. Monroe, Michael Colaresi, and Kevin M. Quinn. 2008. “Fightin’ Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict.” Political Analysis 16(4): 372-403. https://doi.org/10.1093/pan/mpn018 (esp. the impact of sampling variance and regularization through priors)

• ‡[BDSS-Census] Big Data Social Science PSU Team. 2012. “A Closer Look at the Kaggle Census Data.” https://burtmonroe.github.io/BDSSKaggleCensus2012 (esp. the relevance of the social processes by which data come to exist as data)

Multidisciplinary Perspectives

• †[Business-BigData] Andrew McAfee and Erik Brynjolfsson. 2012. “Big Data: The Management Revolution.” Harvard Business Review 90(10): 61–8. October. https://hbr.org/2012/10/big-data-the-management-revolution. Thomas H. Davenport and D.J. Patil. 2012. “Data Scientist: The Sexiest Job of the 21st Century.” Harvard Business Review 90(10): 70-6. October. https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century

• †[InfoSci-BigData] C.L. Philip Chen and Chun-Yang Zhang. 2014. “Data-intensive Applications, Challenges, Techniques and Technologies: A Survey on Big Data.” Information Sciences 275: 314-47. https://doi.org/10.1016/j.ins.2014.01.015

• †[Informatics-BigData] Vasant G. Honavar. 2014. “The Promise and Potential of Big Data: A Case for Discovery Informatics.” Review of Policy Research 31(4): 326-330. https://doi.org/10.1111/ropr.12080

• ‡[Stats-BigData] Beate Franke, Jean-Francois Plante, Ribana Roscher, Annie Lee, Cathal Smyth, Armin Hatefi, Fuqi Chen, Einat Gil, Alexander Schwing, Alessandro Selvitella, Michael M. Hoffman, Roger Grosse, Dietrich Hendricks, and Nancy Reid. 2016. “Statistical Inference, Learning and Models in Big Data.” International Statistical Review 84(3): 371-89. http://onlinelibrary.wiley.com/doi/10.1111/insr.12176/full

• †[Econ-BigData] Hal R. Varian. 2014. “Big Data: New Tricks for Econometrics.” Journal of Economic Perspectives 28(2): 3-28. https://doi.org/10.1257/jep.28.2.3

• †[GeoViz-BigData] Alan MacEachren. 2017. “Leveraging Big (Geo) Data with (Geo) Visual Analytics: Place as the Next Frontier.” In Chenghu Zhou, Fenzhen Su, Francis Harvey, and Jun Xu, eds., Spatial Data Handling in the Big Data Era, pp. 139-155. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/chapter/10.1007/978-981-10-4424-3_10

• †[Soc-BigData] David Lazer and Jason Radford. 2017. “Data ex Machina: Introduction to Big Data.” Annual Review of Sociology 43: 19-39. https://doi.org/10.1146/annurev-soc-060116-053457

• §[Politics-BigData] Keith T. Poole, L. Jason Anastasopoulos, and James E. Monogan III. Forthcoming. “The ‘Big Data’ Revolution in Political Campaigning and Governance.” Oxford Bibliographies in Political Science.

¡Cuidado! Traps, Biases, Problems, Pains, Perils

• †[GoogleFlu] David Lazer, Ryan Kennedy, Gary King, and Alessandro Vespignani. 2014. “The Parable of Google Flu: Traps in Big Data Analysis.” Science 343 (14 March). http://gking.harvard.edu/files/gking/files/0314policyforumff.pdf

• ‡[MachineBias] ProPublica. “Machine Bias: Investigating Algorithmic Injustice.” Series. https://www.propublica.org/series/machine-bias. See especially:

    Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. 2016. “Machine Bias.” ProPublica. May 23. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing

    Julia Angwin, Madeleine Varner, and Ariana Tobin. 2017. “Facebook Enabled Advertisers to Reach ‘Jew Haters.’” https://www.propublica.org/article/facebook-enabled-advertisers-to-reach-jew-haters

• ‡[Polling2016] Doug Rivers (Nov 11, 2016). “First Thoughts on Polling Problems in the 2016 U.S. Elections.” https://today.yougov.com/news/2016/11/11/first-thoughts-polling-problems-2016-us-elections

• †[EventData] Wei Wang, Ryan Kennedy, David Lazer, Naren Ramakrishnan. 2016. “Growing Pains for Global Monitoring of Societal Events.” Science 353(6307), pp. 1502–1503. https://doi.org/10.1126/science.aaf6758

• ‡[GoogleBooks] Eitan Adam Pechenick, Christopher M. Danforth, and Peter Sheridan Dodds. 2015. “Characterizing the Google Books Corpus: Strong Limits to Inferences of Socio-Cultural and Linguistic Evolution.” PLoS One. https://doi.org/10.1371/journal.pone.0137041

• ‡[OkCupid] Michael Zimmer. 2016. “OkCupid Study Reveals the Perils of Big-Data Science.” Wired. https://www.wired.com/2016/05/okcupid-study-reveals-perils-big-data-science. May 14.

• †[CriticalQuestions] danah boyd and Kate Crawford. 2011. “Critical Questions for Big Data: Provocations for a Cultural, Technological, and Scholarly Phenomenon.” Information, Communication & Society 15(5): 662–79. http://dx.doi.org/10.1080/1369118X.2012.678878

• ‡[RacistBot] Daniel Victor. 2016. “Microsoft created a Twitter bot to learn from users. It quickly became a racist jerk.” New York Times. https://www.nytimes.com/2016/03/25/technology/microsoft-created-a-twitter-bot-to-learn-from-users-it-quickly-became-a-racist-jerk.html

• MMDS 1.2.2-1.2.3 on Bonferroni

• Also BDSS-Census

Research Design and Measurement

Overviews

• BitByBit

• ‡[ResearchMethodsKB] William M. Trochim. 2006. The Research Methods Knowledge Base. http://www.socialresearchmethods.net/kb

• [7Rules] Glenn Firebaugh. 2008. Seven Rules for Social Research. Princeton University Press.

Measurement, Reliability, and Validity

• ResearchMethodsKB “Measurement”

• 7Rules Ch. 3, “Build Reality Checks into Your Research”

• Reliability: see also FightinWords

• Validity: see also Quinn-Topics, Monroe-5Vs

Indirect, Unobtrusive, Nonreactive Measures; Data Exhaust

• †[UnobtrusiveMeasures] Raymond M. Lee. 2015. “Unobtrusive Measures.” Oxford Bibliographies. https://doi.org/10.1093/OBO/9780199846740-0048 (Canonical cite is Eugene J. Webb, 1966, Unobtrusive Measures, or Webb, Donald T. Campbell, Richard D. Schwartz, Lee Sechrest, 1999, Unobtrusive Measures, rev. ed., Sage)

• BitByBit: Examples in Chapter 2

Multiple Measures, Latent Variable Measurement

• 7Rules Ch. 4, “Replicate Where Possible”

• †[LatentVariables] David J. Bartholomew, Martin Knott, and Irini Moustaki. 2011. Latent Variable Models and Factor Analysis: A Unified Approach. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10483308 (esp. Ch. 1, “Basic ideas and examples”)

• †[Multivariate-R] Brian Everitt and Torsten Hothorn. 2011. An Introduction to Applied Multivariate Analysis with R. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4419-9650-3

• Shalizi-ADA Ch. 17, “Factor Models”

• ‡[NetflixPrize] Edwin Chen. 2011. “Winning the Netflix Prize: A Summary.”

• ‡CRAN: https://cran.r-project.org/web/views/Multivariate.html, https://cran.r-project.org/web/views/Psychometrics.html, https://cran.r-project.org/web/views/Cluster.html

Sampling and Survey Design

• †[Sampling] Steven K. Thompson. 2012. Sampling, 3rd ed. https://k8es4mc2l.search.serialssolutions.com/?sid=sersol&SS_jc=TC_024492330&title=Wiley%20Desktop%20Editions%20%3A%20Sampling

• ResearchMethodsKB “Sampling”

• NRCreport Ch. 8, “Sampling and Massive Data”

• BitByBit Ch. 3, “Asking Questions”

• ‡[MSE] Daniel Manrique-Vallier, Megan E. Price, and Anita Gohdes. 2013. In Seybolt et al. (eds.), Counting Civilians. “Multiple Systems Estimation Techniques for Estimating Casualties in Armed Conflicts.” Preprint: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.469.939&rep=rep1&type=pdf

• †[NetworkSampling] Ted Mouw and Ashton M. Verdery. 2012. “Network Sampling with Memory: A Proposal for More Efficient Sampling from Social Networks.” Sociological Methodology 42(1): 206–56. https://dx.doi.org/10.1177%2F0081175012461248

• See also FightinWords (re: hidden heteroskedasticity in sample variance)

• ‡[HashDontSample] Mudit Uppal. 2016. “Probabilistic data structures in the Big data world (+ code).” (re: “Hash, don’t sample”) https://medium.com/@muppal/probabilistic-data-structures-in-the-big-data-world-code-b9387cff0c55

• MMDS: Hash Functions (1.3.2), Sampling in streams (4.2)

• ‡CRAN: https://cran.r-project.org/web/views/OfficialStatistics.html (“Complex Survey Design,” “Small Area Estimation”)

Experimental and Observational Designs for Causal Inference

• ‡[CausalInference] Miguel A. Hernan, James M. Robins. Forthcoming (2017). Causal Inference. Chapman & Hall/CRC. Preprint: https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book. Ch. 1 “A definition of causal effect”; Ch. 2 “Randomized experiments”; Ch. 3 “Observational studies.” (See Ch. 6, “Graphical representation of causal effects,” for integration with Judea Pearl approach.)

• BitByBit Chapter 4, “Running Experiments”; Ch. 2, “Natural experiments in observable data” examples

• ResearchMethodsKB “Design”

• 7Rules Ch. 2, “Look for Differences that Make a Difference, and Report Them”; §Ch. 5, “Compare Like with Like”; Ch. 6, “Use Panel Data to Study Individual Change and Repeated Cross-Section Data to Study Social Change”

• Shalizi-ADA Part IV, “Causal Inference”

• Monroe-No

• CRAN: https://cran.r-project.org/web/views/ExperimentalDesign.html

Technologies for Primary and Secondary Data Collection

Mobile Devices, Distributed Sensors, Wearable Sensors, Remote Sensing

• †[RealityMining] Nathan Eagle and Alex (Sandy) Pentland. 2006. “Reality Mining: Sensing Complex Social Systems.” Personal and Ubiquitous Computing 10(4): 255–68. https://doi.org/10.1007/s00779-005-0046-3

• †[QuantifiedSelf] Melanie Swan. 2013. “The Quantified Self: Fundamental Disruption in Big Data Science and Biological Discovery.” Big Data 1(2): 85–99. https://doi.org/10.1089/big.2012.0002

• †[SensorData] Charu C. Aggarwal (Ed.). 2013. Managing and Mining Sensor Data. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-6309-2 Ch. 1, “An Introduction to Sensor Data Analytics.”

• †[NightLights] Thushyanthan Baskaran, Brian Min, Yogesh Uppal. 2015. “Election cycles and electricity provision: Evidence from a quasi-experiment with Indian special elections.” Journal of Public Economics 126: 64-73. https://doi.org/10.1016/j.jpubeco.2015.03.011

Crowdsourcing, Human Computation, Citizen Science, Web Experiments

• †[Crowdsourcing] Daren C. Brabham. 2013. Crowdsourcing. MIT Press. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10692208 “Introduction”; Ch. 1, “Concepts, Theories, and Cases of Crowdsourcing.”

• BitByBit Ch. 5, “Creating Mass Collaboration”

• NRCreport Ch. 9, “Human Interaction with Data”

• †[MTurk] Krista Casler, Lydia Bickel, and Elizabeth Hackett. 2013. “Separate but Equal? A Comparison of Participants and Data Gathered via Amazon’s MTurk, Social Media, and Face-to-Face Behavioral Testing.” Computers in Human Behavior 29(6): 2156–60. http://doi.org/10.1016/j.chb.2013.05.009

• †[LabintheWild] Katharina Reinecke and Krzysztof Z. Gajos. 2015. “LabintheWild: Conducting Large-Scale Online Experiments With Uncompensated Samples.” In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing (CSCW ’15), 1364–1378. http://dx.doi.org/10.1145/2675133.2675246

• †[TweetmentEffects] Kevin Munger. 2017. “Tweetment Effects on the Tweeted: Experimentally Reducing Racist Harassment.” Political Behavior 39(3): 629–49. https://doi.org/10.1007/s11109-016-9373-5

• †[HumanComputation] Edith Law and Luis von Ahn. 2011. Human Computation. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdf/10.2200/S00371ED1V01Y201107AIM013

Open Data File Formats APIs Semantic Web Linked Data

bull DSHandbook-Py Ch 12 ldquoData Encodings and File Formatsrdquo

bull Dagger[OpenData] Open Data Institute ldquoWhat Is Open Datardquo httpstheodiorgwhat-is-open-

data

bull Dagger[OpenDataHandbook] Open Knowledge International The Open Data Handbook httpopendatahandbookorg Includes appendix ldquoFile Formatsrdquo httpopendatahandbookorgguideenappendices

file-formats

bull Dagger[APIs] Brian Cooksey 2016 An Introduction to APIs httpszapiercomlearnapis

bull Dagger[APIMarkets] RapidAPI mashape API marketplaces httpsdocsrapidapicom https

marketmashapecom ProgrammableWeb httpswwwprogrammablewebcom

• †[LinkedData] Tom Heath and Christian Bizer. 2011. Linked Data: Evolving the Web into a Global Data Space. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/abs/10.2200/S00334ED1V01Y201102WBE001 Ch. 1 “Introduction,” Ch. 2 “Principles of Linked Data.”

• †[SemanticWeb] Nikolaos Konstantinou and Dimitrios-Emmanuel Spanos. 2015. Materializing the Web of Linked Data. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-16074-0 Ch. 1 “Introduction: Linked Data and the Semantic Web.”

• †Morgan & Claypool Synthesis Lectures on the Semantic Web: Theory and Technology. http://www.morganclaypool.com/toc/wbe/1/1

Web Scraping

• †[TheoryDrivenScraping] Richard N. Landers, Robert C. Brusso, Katelyn J. Cavanaugh, and Andrew B. Collmus. 2016. “A Primer on Theory-Driven Web Scraping: Automatic Extraction of Big Data from the Internet for Use in Psychological Research.” Psychological Methods 21(4): 475–492. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1037/met0000081

• ‡[Scraping-Py] Al Sweigart. 2015. Automate the Boring Stuff with Python: Practical Programming for Total Beginners. Ch. 11, “Web Scraping.” https://automatetheboringstuff.com/chapter11

• †[Scraping-R] Simon Munzert, Christian Rubba, Peter Meißner, and Dominic Nyhuis. 2014. Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining. Wiley. http://onlinelibrary.wiley.com/book/10.1002/9781118834732

• ‡CRAN: https://cran.r-project.org/web/views/WebTechnologies.html

Ethics & Scientific Responsibility in Big Social Data

Human Subjects, Consent, Privacy

• BitByBit, Ch. 6 (“Ethics”)

• ‡[PSU-ORP] Penn State Office for Research Protections (under the VP for Research):

Human Subjects Research / IRB: https://www.research.psu.edu/irb

Revised Common Rule: https://www.research.psu.edu/irb/commonrulechanges

Responsible Conduct of Research: https://www.research.psu.edu/education/rcr

Research Misconduct: https://www.research.psu.edu/researchmisconduct

SARI@PSU (Scientific and Research Integrity Training): https://www.research.psu.edu/training/sari

• ‡[MenloReport] David Dittrich and Erin Kenneally (Center for Applied Internet Data Analysis). 2012. “The Menlo Report: Ethical Principles Guiding Information and Communication Technology Research” and companion report “Applying Ethical Principles ...” https://www.caida.org/publications/papers/2012/menlo_report_actual_formatted

• ‡[BigDataEthics-CBDES] Jacob Metcalf, Emily F. Keller, and danah boyd. 2016. “Perspectives on Big Data, Ethics, and Society.” Council for Big Data, Ethics, and Society. http://bdes.datasociety.net/council-output/perspectives-on-big-data-ethics-and-society

• ‡[AoIRReport] Annette Markham and Elizabeth Buchanan (Association of Internet Researchers). 2012. “Ethical Decision-Making and Internet Research: Recommendations of the AoIR Ethics Working Committee (Version 2.0).” https://aoir.org/reports/ethics2.pdf and guidelines chart: https://aoir.org/wp-content/uploads/2017/01/aoir_ethics_graphic_2016.pdf

• ‡[BigDataEthics-Wired] Sarah Zhang. 2016. “Scientists Are Just as Confused about the Ethics of Big-Data Research as You.” Wired. https://www.wired.com/2016/05/scientists-just-confused-ethics-big-data-research

• †[BigDataEthics-HerschelMori] Richard Herschel and Virginia M. Miori. 2017. “Ethics & Big Data.” Technology in Society 49: 31-36. http://doi.org/10.1016/j.techsoc.2017.03.003

The Science of Data Privacy

• †[BigDataPrivacy] Terence Craig and Mary E. Ludloff. 2011. Privacy and Big Data. O’Reilly Media. http://pensu.eblib.com/patron/FullRecord.aspx?p=781814

• ‡[DataAnalysisPrivacy] John Abowd, Lorenzo Alvisi, Cynthia Dwork, Sampath Kannan, Ashwin Machanavajjhala, Jerome Reiter. 2017. “Privacy-Preserving Data Analysis for the Federal Statistical Agencies.” A Computing Community Consortium white paper. https://arxiv.org/abs/1701.00752

• †[DataPrivacy] Stephen E. Fienberg and Aleksandra B. Slavkovic. 2011. “Data Privacy and Confidentiality.” International Encyclopedia of Statistical Science: 342–5. http://doi.org/10.1007/978-3-642-04898-2_202

• ‡[NetworksPrivacy] Vishesh Karwa and Aleksandra Slavkovic. 2016. “Inference using noisy degrees: Differentially private β-model and synthetic graphs.” Annals of Statistics 44(1): 87-112. http://projecteuclid.org/euclid.aos/1449755958

• †[DataPublishingPrivacy] Raymond Chi-Wing Wong and Ada Wai-Chee Fu. 2010. Privacy-Preserving Data Publishing: An Overview. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdfplus/10.2200/S00237ED1V01Y201003DTM002

• DataMatching, Ch. 8 “Privacy Aspects of Data Matching.”

Transparency, Reproducibility, and Team Science

• ‡[10RulesforData] Alyssa Goodman, Alberto Pepe, Alexander W. Blocker, Christine L. Borgman, Kyle Cranmer, Merce Crosas, Rosanne Di Stefano, Yolanda Gil, Paul Groth, Margaret Hedstrom, David W. Hogg, Vinay Kashyap, Ashish Mahabal, Aneta Siemiginowska, and Aleksandra Slavkovic. (2014). “Ten Simple Rules for the Care and Feeding of Scientific Data.” PLoS Computational Biology 10(4): e1003542. https://doi.org/10.1371/journal.pcbi.1003542 (Note esp. curated resources for reproducible research.)


• †[Transparency] E. Miguel, C. Camerer, K. Casey, J. Cohen, K. M. Esterling, A. Gerber, R. Glennerster, D. P. Green, M. Humphreys, G. Imbens, D. Laitin, T. Madon, L. Nelson, B. A. Nosek, M. Petersen, R. Sedlmayr, J. P. Simmons, U. Simonsohn, M. Van der Laan. 2014. “Promoting Transparency in Social Science Research.” Science 343(6166): 30–1. https://doi.org/10.1126/science.1245317

• †[Reproducibility] Marcus R. Munafò, Brian A. Nosek, Dorothy V. M. Bishop, Katherine S. Button, Christopher D. Chambers, Nathalie Percie du Sert, Uri Simonsohn, Eric-Jan Wagenmakers, Jennifer J. Ware, and John P. A. Ioannidis. 2017. “A Manifesto for Reproducible Science.” Nature Human Behaviour 1: 0021. https://doi.org/10.1038/s41562-016-0021

• ‡[SoftwareCarpentry] esp. “Lessons”: http://software-carpentry.org/lessons

• DSHandbook-Py, Ch. 9 “Technical Communication and Documentation,” Ch. 15 “Software Engineering Best Practices.”

• †[TeamScienceToolkit] Vogel AL, Hall KL, Fiore SM, Klein JT, Bennett LM, Gadlin H, Stokols D, Nebeling LC, Wuchty S, Patrick K, Spotts EL, Pohl C, Riley WT, Falk-Krzesinski HJ. 2013. “The Team Science Toolkit: enhancing research collaboration through online knowledge sharing.” American Journal of Preventive Medicine 45: 787-9. http://doi.org/10.1016/j.amepre.2013.09.001

• §[Databrary] Kara Hall, Robert Croyle, and Amanda Vogel. Forthcoming (2017). Advancing Social and Behavioral Health Research through Cross-disciplinary Team Science. Springer. Includes Rick O. Gilmore and Karen E. Adolph, “Open Sharing of Research Video: Breaking the Boundaries of the Research Team.” (See http://databrary.org)

• CRAN: https://cran.r-project.org/web/views/ReproducibleResearch.html

Social Bias, Fair Algorithms

• MachineBias

• ‡[EmbeddingsBias] Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. “Semantics Derived Automatically from Language Corpora Contain Human Biases.” Science. https://arxiv.org/abs/1608.07187

• ‡[Debiasing] Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, Adam Kalai. 2016. “Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings.” https://arxiv.org/abs/1607.06520

• ‡[AvoidingBias] Moritz Hardt, Eric Price, Nathan Srebro. 2016. “Equality of Opportunity in Supervised Learning.” https://arxiv.org/abs/1610.02413

• ‡[InevitableBias] Jon Kleinberg, Sendhil Mullainathan, Manish Raghavan. 2016. “Inherent Trade-Offs in the Fair Determination of Risk Scores.” https://arxiv.org/abs/1609.05807

Databases and Data Management

• NRCreport, Ch. 3 “Scaling the Infrastructure for Data Management.”

• †[SQL] Jan L. Harrington. 2010. SQL Clearly Explained. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780123756978

• DSHandbook-Py, Ch. 14 “Databases.”

• †[noSQL] Guy Harrison. 2015. Next Generation Databases: NoSQL, NewSQL, and Big Data. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-1329-2

• †[Cloud] Divyakant Agrawal, Sudipto Das, and Amr El Abbadi. 2012. Data Management in the Cloud: Challenges and Opportunities. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdfplus/10.2200/S00456ED1V01Y201211DTM032


• †Morgan & Claypool Synthesis Lectures on Data Management. http://www.morganclaypool.com/toc/dtm/1/1

Data Wrangling

Theoretically-structured approaches to data wrangling

• ‡[TidyData-R] Garrett Grolemund and Hadley Wickham. 2017. R for Data Science (http://r4ds.had.co.nz), esp. Ch. 12 “Tidy Data”; also Wickham. 2014. “Tidy Data.” Journal of Statistical Software 59(10). http://www.jstatsoft.org/v59/i10/paper Tools: the tidyverse, http://tidyverse.org

• ‡[DataScience-Py] Jake VanderPlas. 2016. Python Data Science Handbook (https://github.com/jakevdp/PythonDataScienceHandbook), esp. Ch. 3 on pandas: http://pandas.pydata.org

• ‡[DataCarpentry] Colin Gillespie and Robin Lovelace. 2017. Efficient R Programming. https://csgillespie.github.io/efficientR esp. Ch. 6 on “efficient data carpentry.”

• §[Wrangler] Joseph M. Hellerstein, Jeffrey Heer, Tye Rattenbury, and Sean Kandel. 2017. Data Wrangling: Practical Techniques for Data Preparation. Tool: Trifacta Wrangler, http://www.trifacta.com/products/wrangler

Data wrangling practice

• DSHandbook-Py, Ch. 4 “Data Munging: String Manipulation, Regular Expressions, and Data Cleaning.”

• †[DataSimplification] Jules J. Berman. 2016. Data Simplification: Taming Information with Open Source Tools. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780128037812

• †[Wrangling-Py] Jacqueline Kazil, Katharine Jarmul. 2016. Data Wrangling with Python. O’Reilly. http://proquestcombo.safaribooksonline.com.ezaccess.libraries.psu.edu/9781491948804

• †[Wrangling-R] Bradley C. Boehmke. 2016. Data Wrangling with R. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45599-0

Record Linkage, Entity Resolution, Deduplication

• †[DataMatching] Peter Christen. 2012. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-642-31164-2 (esp. Ch. 2 “The Data Matching Process”)

• DataSimplification, Chapter 5 “Identifying and Deidentifying Data.”

• ‡[EntityResolution] Lise Getoor and Ashwin Machanavajjhala. 2013. “Entity Resolution for Big Data.” Tutorial, KDD. http://www.umiacs.umd.edu/~getoor/Tutorials/ER_KDD2013.pdf

• ‡[SyrianCasualties] Peter Sadosky, Anshumali Shrivastava, Megan Price, and Rebecca C. Steorts. 2015. “Blocking Methods Applied to Casualty Records from the Syrian Conflict.” https://arxiv.org/abs/1510.07714 (For more on blocking, see DataMatching, Ch. 4 “Indexing.”)

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html “Statistical Matching and Record Linkage”

“Making up data”: Imputation, Smoothers, Kernels, Priors, Filters, Teleportation, Negative Sampling, Convolution, Augmentation, Adversarial Training


• †[Imputation] Yi Deng, Changgee Chang, Moges Seyoum Ido, and Qi Long. 2016. “Multiple Imputation for General Missing Data Patterns in the Presence of High-dimensional Data.” Scientific Reports 6(21689). https://www.nature.com/articles/srep21689

• ‡[Overimputation] Matthew Blackwell, James Honaker, and Gary King. 2017. “A Unified Approach to Measurement Error and Missing Data: Overview and Applications.” Sociological Methods & Research 46(3): 303-341. http://gking.harvard.edu/files/gking/files/measure.pdf

• Shalizi-ADA, Sect. 1.5 (Linear Smoothers), Ch. 8 (Splines), Sect. 14.4 (Kernel Density Estimates); DataScience-Python NB 05.13 “Kernel Density Estimation.”

• FightinWords (re: Bayesian priors as additional data; impact of priors on regularization)

• DeepLearning, Section 7.5 (Data Augmentation), 7.12 (Dropout), 7.13 (Adversarial Examples), Chapter 9 (Convolutional Networks)

• †[Adversarial] Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy. 2015. “Explaining and Harnessing Adversarial Examples.” https://arxiv.org/abs/1412.6572

• See also resampling and simulation methods.

• See also feature engineering, preprocessing.

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html “Imputation”

(Direct) Data Representations, Data Mappings

• NRCreport, Ch. 5 “Large-Scale Data Representations.”

• †[PatternRecognition] M. Narasimha Murty and V. Susheela Devi. 2011. Pattern Recognition: An Algorithmic Approach. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-0-85729-495-1 Section 2.1 “Data Structures for Pattern Representation.”

• ‡[InfoRetrieval] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2009. Introduction to Information Retrieval. Cambridge University Press. http://nlp.stanford.edu/IR-book Ch. 1, 2, 6 (also note slides used in their class).

• †[Algorithms] Brian Steele, John Chandler, Swarna Reddy. 2016. Algorithms for Data Science. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45797-0 Ch. 2 “Data Mapping and Data Dictionaries.”

• See also Social Data Structures.

Similarity, Distance, Association, Covariance, The Kernel Trick

• PatternRecognition, Section 2.3 “Proximity Measures.”

• ‡[Similarity] Brendan O’Connor. 2012. “Cosine similarity, Pearson correlation, and OLS coefficients.” https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients

• For relatively comprehensive lists, see also:

M.-J. Lesot, M. Rifqi, and H. Benhadda. 2009. “Similarity measures for binary and numerical data: a survey.” Int. J. Knowledge Engineering and Soft Data Paradigms 1(1): 63-. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.212.6533&rep=rep1&type=pdf

Seung-Seok Choi, Sung-Hyuk Cha, Charles C. Tappert. 2010. “A Survey of Binary Similarity and Distance Measures.” Journal of Systemics, Cybernetics & Informatics 8(1): 43-8. http://www.iiisci.org/Journal/CV$/sci/pdfs/GS315JG.pdf

Sung-Hyuk Cha. 2007. “Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions.” International Journal of Mathematical Models and Methods in Applied Sciences 4(1): 300-7. http://csis.pace.edu/ctappert/dps/d861-12/session4-p2.pdf

Anna Huang. 2008. “Similarity Measures for Text Document Clustering.” Proceedings of the 6th New Zealand Computer Science Research Student Conference: 49-56. http://www.nzcsrsc08.canterbury.ac.nz/site/proceedings/Individual_Papers/pg049_Similarity_Measures_for_Text_Document_Clustering.pdf

• Michael I. Jordan. “The Kernel Trick” (Lecture Notes). https://people.eecs.berkeley.edu/~jordan/courses/281B-spring04/lectures/lec3.pdf

• Eric Kim. 2017. “Everything You Ever Wanted to Know about the Kernel Trick (But Were Afraid to Ask).” http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick_blog_ekim_12_20_2017.pdf

Derived Data Representations - Dimensionality Reduction, Compression, Decomposition, Embeddings

The groupings here and under the measurement / multivariate statistics section are particularly arbitrary. For example, “k-Means clustering” can be viewed as a technique for “dimensionality reduction,” “compression,” “feature extraction,” “latent variable measurement,” “unsupervised learning,” or “collaborative filtering.”
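To make the “compression” reading of k-means concrete, here is a minimal sketch in plain Python. It is not taken from any of the readings: the function name `kmeans_1d`, the toy values, and the evenly spaced initialisation (used here only to keep the run deterministic) are all assumptions made for illustration. Each value is replaced by the index of its nearest centroid, so nine numbers are stored as nine small integer labels plus three centroids.

```python
def kmeans_1d(values, k, iters=20):
    """Lloyd's algorithm on a list of floats; returns (centroids, labels).

    Illustrative only: centroids start at evenly spaced order statistics
    (an assumption for determinism, not a recommended initialisation).
    """
    ordered = sorted(values)
    n = len(ordered)
    if k == 1:
        centroids = [ordered[n // 2]]
    else:
        centroids = [ordered[i * (n - 1) // (k - 1)] for i in range(k)]
    labels = [0] * n
    for _ in range(iters):
        # Assignment step: each value goes to its nearest centroid.
        labels = [min(range(k), key=lambda j: (v - centroids[j]) ** 2)
                  for v in values]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = sum(members) / len(members)
    return centroids, labels


values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9, 9.1, 8.8, 9.0]
centroids, labels = kmeans_1d(values, k=3)

# "Decompression": look each label up in the codebook of centroids.
reconstructed = [centroids[lab] for lab in labels]
max_error = max(abs(v - r) for v, r in zip(values, reconstructed))
print(labels, [round(c, 3) for c in centroids], round(max_error, 3))
```

The same labels can equally be read as a one-dimensional discrete “latent variable” or as an extracted feature, which is the sense in which the groupings in this section overlap.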

Clustering, hashing, quantization, blocking, compression

• ‡[Compression] Khalid Sayood. 2012. Introduction to Data Compression, 4th ed. Springer. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780124157965 (e.g., coding, blocking via vector quantization)

• MMDS, Ch. 3 “Finding Similar Items” (minhashing, locality-sensitive hashing)

• Multivariate-R, Ch. 6 “Clustering.”

• DataScience-Python NB 05.11 “k-Means Clustering,” NB 05.12 “Gaussian Mixture Models.”

• ‡[KMeansHashing] Kaiming He, Fang Wen, Jian Sun. 2013. “K-means Hashing: An Affinity-Preserving Quantization Method for Learning Binary Compact Codes.” CVPR. https://www.cv-foundation.org/openaccess/content_cvpr_2013/papers/He_K-Means_Hashing_An_2013_CVPR_paper.pdf

• †[Squashing] Madigan, D., Raghavan, N., Dumouchel, W., Nason, M., Posse, C., and Ridgeway, G. (2002). “Likelihood-based data squashing: A modeling approach to instance construction.” Data Mining and Knowledge Discovery 6(2): 173-190. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1023/A:1014095614948

• †[Core-sets] Piotr Indyk, Sepideh Mahabadi, Mohammad Mahdian, Vahab S. Mirrokni. 2014. “Composable core-sets for diversity and coverage maximization.” PODS ’14: 100-8. https://doi.org/10.1145/2594538.2594560

Feature selection, feature extraction, feature engineering, weighting, preprocessing

• PatternRecognition, Sections 2.6-7 “Feature Selection / Feature Extraction.”

• DSHandbook-Py, Ch. 7 “Interlude: Feature Extraction Ideas.”

• DataScience-Python NB 05.04 “Feature Engineering.”

• Features in text: NLP and InfoRetrieval re: tf-idf and similar; FightinWords.

• Features in images: DataScience-Python NB 05.14 “Image Features.”


Dimensionality reduction, decomposition, factorization, change of basis, reparameterization, matrix completion, latent variables, source separation

• MMDS, Ch. 9 “Recommendation Systems,” Ch. 11 “Dimensionality Reduction.”

• DSHandbook-Py, Ch. 10 “Unsupervised Learning: Clustering and Dimensionality Reduction.”

• ‡[Shalizi-ADA] Cosma Rohilla Shalizi. 2017. Advanced Data Analysis from an Elementary Point of View. Ch. 16 (“Principal Components Analysis”), Ch. 17 (“Factor Models”). http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV

See also Multivariate-R, Ch. 3 “Principal Components Analysis,” Ch. 4 “Multidimensional Scaling,” Ch. 5 “Exploratory Factor Analysis.” Latent

• DeepLearning, Ch. 2 “Linear Algebra,” Ch. 13 “Linear Factor Models.”

• ‡[GloVe] Jeffrey Pennington, Richard Socher, Christopher Manning. 2014. “GloVe: Global vectors for word representation.” EMNLP. https://nlp.stanford.edu/projects/glove

• †[NMF] Daniel D. Lee and H. Sebastian Seung. 1999. “Learning the parts of objects by non-negative matrix factorization.” Nature 401: 788-791. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1038/44565

• †[CUR] Michael W. Mahoney and Petros Drineas. 2009. “CUR matrix decompositions for improved data analysis.” PNAS. http://www.pnas.org/content/106/3/697.full

• ‡[ICA] Aapo Hyvärinen and Erkki Oja. 2000. “Independent Component Analysis: Algorithms and Applications.” Neural Networks 13(4-5): 411-30. https://www.cs.helsinki.fi/u/ahyvarin/papers/NN00new.pdf

• [RandomProjection] Ella Bingham and Heikki Mannila. 2001. “Random projection in dimensionality reduction: applications to image and text data.” KDD. https://doi.org/10.1145%2F502512.502546

• Compression, e.g., “Transform coding,” “Wavelets.”

Nonlinear dimensionality reduction, Manifold learning

• DeepLearning, Section 5.11.3 “Manifold Learning.”

• DataScience-Python NB 05.10 “Manifold Learning” (locally linear embedding [LLE], Isomap)

• ‡[KernelPCA] Sebastian Raschka. 2014. “Kernel tricks and nonlinear dimensionality reduction via RBF kernel PCA.” http://sebastianraschka.com/Articles/2014_kernel_pca.html

• Shalizi-ADA, Ch. 18 “Nonlinear Dimensionality Reduction” (LLE)

• ‡[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi. 2003. “Laplacian eigenmaps for dimensionality reduction and data representation.” Neural Computation 15(6): 1373–1396. http://web.cse.ohio-state.edu/~belkin.8/papers/LEM_NC_03.pdf

• DeepLearning, Ch. 14 (“Autoencoders”)

• ‡[word2vec] Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” https://arxiv.org/abs/1301.3781

• ‡[word2vecExplained] Yoav Goldberg and Omer Levy. 2014. “word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method.” https://arxiv.org/abs/1402.3722

• ‡[t-SNE] Laurens van der Maaten and Geoffrey Hinton. 2008. “Visualizing Data using t-SNE.” Journal of Machine Learning Research 9: 2579-2605. http://jmlr.csail.mit.edu/papers/volume9/vandermaaten08a/vandermaaten08a.pdf


Computation and Scaling Up

Scientific computing, computation at scale

• NRCreport, Ch. 10 “The Seven Computational Giants of Massive Data Analysis.”

• DSHandbook-Py, Ch. 21 “Performance and Computer Memory,” Ch. 22 “Computer Memory and Data Structures.”

• ‡[ComputationalStatistics-Python] Cliburn Chan. Computational Statistics in Python. https://people.duke.edu/~ccc14/sta-663/index.html

• CRAN Task View: High Performance Computing. https://cran.r-project.org/web/views/HighPerformanceComputing.html

Numerical computing

• [NumericalComputing] Ward Cheney and David Kincaid. 2013. Numerical Mathematics and Computing, 7th ed. Brooks/Cole: Cengage Learning.

• CRAN Task View: Numerical Mathematics. https://cran.r-project.org/web/views/NumericalMathematics.html

Optimization (e.g., MLE, gradient descent, stochastic gradient descent, EM algorithm, neural nets)

• DSHandbook-Py, Ch. 23 “Maximum Likelihood Estimation and Optimization” (gradient descent)

• DeepLearning, Ch. 4 “Numerical Computation,” Ch. 6 “Deep Feedforward Networks,” Ch. 8 “Optimization for Training Deep Models.”

• CRAN Task View: Optimization and Mathematical Programming. https://cran.r-project.org/web/views/Optimization.html

Linear algebra, matrix computations

• [MatrixComputations] Gene H. Golub and Charles F. Van Loan. 2013. Matrix Computations, 4th ed. Johns Hopkins University Press.

• ‡[NetflixMatrix] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. “Matrix Factorization Techniques for Recommender Systems.” Computer, August: 42-9. https://datajobs.com/data-science-repo/Recommender-Systems-[Netflix].pdf

• ‡[BigDataPCA] Jianqing Fan, Qiang Sun, Wen-Xin Zhou, Ziwei Zhu. “Principal Component Analysis for Big Data.” http://www.princeton.edu/~ziweiz/pca.pdf

• ‡[RandomSVD] Andrew Tulloch. 2009. “Fast Randomized SVD.” https://research.fb.com/fast-randomized-svd

• ‡[FactorSGD] Rainer Gemulla, Peter J. Haas, Erik Nijkamp, and Yannis Sismanis. 2011. “Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent.” KDD. http://www.cs.utah.edu/~hari/teaching/bigdata/gemulla11dsgd.pdf

• ‡[Sparse] Max Grossman. 2015. “101 Ways to Store a Sparse Matrix.” https://medium.com/@jmaxg3/101-ways-to-store-a-sparse-matrix-c7f2bf15a229

• ‡[DontInvert] John D. Cook. 2010. “Don’t Invert that Matrix.” https://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix

Simulation-based inference, resampling, Monte Carlo methods, MCMC, Bayes, approximate inference

• ‡[MarkovVisually] Victor Powell. “Markov Chains Explained Visually.” http://setosa.io/ev/markov-chains

• DSHandbook-Py, Ch. 25 “Stochastic Modeling” (Markov chains, MCMC, HMM)

• Bayes, ComputationalStatistics-Py

• DeepLearning, Ch. 17 “Monte Carlo Methods,” Ch. 19 “Approximate Inference.”

• ‡[VariationalInference] Jason Eisner. 2011. “High-Level Explanation of Variational Inference.” https://www.cs.jhu.edu/~jason/tutorials/variational.html

• CRAN Task View: Bayesian Inference. https://cran.r-project.org/web/views/Bayesian.html

Parallelism, MapReduce, Split-Apply-Combine

• ‡[MapReduceIntuition] Jigsaw Academy. 2014. “Big Data Specialist MapReduce.” https://www.youtube.com/watch?v=TwcYQzFqg-8&feature=youtu.be

• MMDS, Ch. 2 “Map-Reduce and the New Software Stack” (3rd Edition discusses Spark & TensorFlow)

• Algorithms, Ch. 3 “Scalable Algorithms and Associative Statistics,” Ch. 4 “Hadoop and MapReduce.”

• NRCreport, Ch. 6 “Resources, Trade-offs, and Limitations.”

• TidyData-R. (Split-apply-combine is the motivating principle behind the “tidyverse” approach.) See also Part III, “Program” (pipes, functions, vectors, iteration).

• ‡[TidySAC-Video] Hadley Wickham. 2017. “Data Science in the Tidyverse.” https://www.rstudio.com/resources/videos/data-science-in-the-tidyverse

• See also FactorSGD.

Functional Programming

• ComputationalStatistics-Python, “Functions are first class objects” through first exercises.

• DSHandbook-Py, Ch. 20 “Programming Language Concepts.”

• See also Haskell.

Scaling iteration, streaming data, online algorithms (Spark)

• NRCreport, Ch. 4 “Temporal Data and Real-Time Algorithms.”

• MMDS, Ch. 4 “Mining Data Streams.”

• ‡[BDAS] AMPLab. BDAS, The Berkeley Data Analytics Stack. https://amplab.cs.berkeley.edu/software

• †[Spark] Mohammed Guller. 2015. Big Data Analytics with Spark: A Practitioner’s Guide to Using Spark for Large-Scale Data Processing, Machine Learning, and Graph Analytics, and High-Velocity Data Stream Processing. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-0964-6

• DSHandbook-Py, Ch. 13 “Big Data.”

• DeepLearning, Ch. 10 “Sequence Modeling: Recurrent and Recursive Nets.”

General Resources for Python and R


• †[DSHandbook-Py] Field Cady. 2017. The Data Science Handbook (Python-based). Wiley. http://onlinelibrary.wiley.com.ezaccess.libraries.psu.edu/book/10.1002/9781119092919

• ‡[Tutorials-R] Ujjwal Karn. 2017. “A curated list of R tutorials for Data Science, NLP and Machine Learning.” https://github.com/ujjwalkarn/DataScienceR

• ‡[Tutorials-Python] Ujjwal Karn. 2017. “A curated list of Python tutorials for Data Science, NLP and Machine Learning.” https://github.com/ujjwalkarn/DataSciencePython

Cutting and Bleeding Edge of Data Science Languages

• [Scala] †Vishal Layka and David Pollak. 2015. Beginning Scala. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-0232-6 †Patrick R. Nicolas. 2014. Scala for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1901910 ‡Scala site: https://www.scala-lang.org (See also )

• [Julia] †Ivo Balbaert. 2015. Getting Started with Julia. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1973847 ‡Douglas Bates. 2013. “Julia for R Programmers.” http://www.stat.wisc.edu/~bates/JuliaForRProgrammers.pdf ‡Julia site: https://julialang.org

• [Haskell] †Richard Bird. 2014. Thinking Functionally with Haskell. https://doi-org.ezaccess.libraries.psu.edu/10.1017/CBO9781316092415 †Hakim Cassimally. 2017. Learning Haskell Programming. https://www.lynda.com/Haskell-tutorials/Learning-Haskell-Programming/604926-2.html †James Church. 2017. Learning Haskell for Data Analysis. https://www.lynda.com/Developer-tutorials/Learning-Haskell-Data-Analysis/604234-2.html Haskell site: https://www.haskell.org

• [Clojure] †Mark McDonnell. 2017. Quick Clojure: Essential Functional Programming. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2952-1 †Akhil Wali. 2014. Clojure for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1674848 †Arthur Ulfeldt. 2015. https://www.lynda.com/Clojure-tutorials/Up-Running-Clojure/413127-2.html ‡Clojure site: https://clojure.org

• [TensorFlow] ‡Abadi et al. (Google). 2016. “TensorFlow: A System for Large-Scale Machine Learning.” https://arxiv.org/abs/1605.08695 ‡Udacity, “Deep Learning.” https://www.udacity.com/course/deep-learning--ud730 ‡TensorFlow site: https://www.tensorflow.org

• [H2O] Darren Cook. 2016. Practical Machine Learning with H2O: Powerful, Scalable Techniques for Deep Learning and AI. O’Reilly. Arno Candel and Viraj Parmar. 2015. Deep Learning with H2O. ‡H2O site: https://www.h2o.ai

Social Data Structures

Space and Time

• †[GIA] David O’Sullivan, David J. Unwin. 2010. Geographic Information Analysis, Second Edition. John Wiley & Sons. http://onlinelibrary.wiley.com/book/10.1002/9780470549094

• [Space-Time] Donna J. Peuquet. 2003. Representations of Space and Time. Guilford.

• Roger S. Bivand, Edzer Pebesma, Virgilio Gómez-Rubio. 2013. Applied Spatial Data Analysis in R. Free through library: http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-7618-4 See also Edzer Pebesma. 2016. “Handling and Analyzing Spatial, Spatiotemporal and Movement Data in R.” https://edzer.github.io/UseR2016

• ‡[PySAL] Sergio J. Rey and Dani Arribas-Bel. 2016. “Geographic Data Science with PySAL and the pydata Stack.” http://darribas.org/gds_scipy16


• DataScience-Python NB 04.13 “Geographic Data with Basemap.”

• Time: “Longitudinal,” intra-individual data as common in developmental psychology. †[Longitudinal] Garrett Fitzmaurice, Marie Davidian, Geert Verbeke, and Geert Molenberghs, editors. 2009. Longitudinal Data Analysis. Chapman & Hall/CRC. http://pensu.eblib.com/patron/FullRecord.aspx?p=359998

• Time: “Time Series, Time Series Cross Section, Panel,” as common in economics, political science, and sociology. DataScience-Python, Ch. 3 re: pandas; DSHandbook-Py, Ch. 17 “Time Series Analysis”; Econometrics; GelmanHill.

• Time: “Sequential, Data Streams,” as in NLP, machine learning. Spark and other “streaming data” readings.

• Space-Time: Data with continuity, connectivity, neighborhood structure. See DeepLearning, Chapter 9 re: convolution.

• CRAN Task Views: Spatial: https://cran.r-project.org/web/views/Spatial.html Spatiotemporal: https://cran.r-project.org/web/views/SpatioTemporal.html Time Series: https://cran.r-project.org/web/views/TimeSeries.html

Network Graphs

• ‡[Networks] David Easley and Jon Kleinberg. 2010. Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press. http://www.cs.cornell.edu/home/kleinber/networks-book

• MMDS, Ch. 5 “Link Analysis,” Ch. 10 “Mining Social-Network Graphs.”

• †[Networks-R] Douglas Luke. 2015. A User’s Guide to Network Analysis in R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-3-319-23883-8

• †[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry. 2017. Python for Graph and Network Analysis. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-53004-8

Hierarchy, Clustered Data, Aggregation, Mixed Models

• †[GelmanHill] Andrew Gelman and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press. http://pensu.eblib.com/patron/FullRecord.aspx?p=288457

• †[MixedModels] Eugene Demidenko. 2013. Mixed Models: Theory and Applications with R, 2nd ed. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10748641

Social Data Channels

Language, Text, Speech, Audio

• ‡[NLP] Daniel Jurafsky and James H. Martin. Forthcoming (2017). Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics, 3rd ed. Prentice-Hall. Preprint: https://web.stanford.edu/~jurafsky/slp3

• InfoRetrieval

• ‡[CoreNLP] Stanford CoreNLP. https://stanfordnlp.github.io/CoreNLP


• †[TextAsData] Grimmer and Stewart. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis. https://doi.org/10.1093/pan/mps028

• Quinn-Topics

• FightinWords

• †[NLTK-Python] Jacob Perkins. 2010. Python Text Processing with NLTK 2.0 Cookbook. Packt. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10435387 †[TextAnalytics-Python] Dipanjan Sarkar. 2016. Text Analytics with Python. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2388-8

• DSHandbook-Py, Ch. 16 “Natural Language Processing.”

• Matthew J. Denny. http://www.mjdenny.com/Text_Processing_In_R.html

• CRAN: Natural Language Processing. https://cran.r-project.org/web/views/NaturalLanguageProcessing.html

• †[AudioData] Dean Knox and Christopher Lucas. 2017. “The Speaker-Affect Model: Measuring Emotion in Political Speech with Audio Data.” http://christopherlucas.org/files/PDFs/sam.pdf https://www.youtube.com/watch?v=Hs8A9dwkMzI

• †Morgan & Claypool Synthesis Lectures on Human Language Technologies. http://www.morganclaypool.com/toc/hlt/1/1

• †Morgan & Claypool Synthesis Lectures on Speech and Audio Processing. http://www.morganclaypool.com/toc/sap/1/1

Vision, Image, Video

• ‡[ComputerVision] Richard Szeliski. 2010. Computer Vision: Algorithms and Applications. Springer. http://szeliski.org/Book/

• MixedModels Ch. 11, “Statistical Analysis of Shape”; Ch. 12, “Statistical Image Analysis”

• Audio/Video/Volumetric: See DeepLearning Chapter 9 re: convolution

• †Morgan & Claypool Synthesis Lectures on Image, Video, and Multimedia Processing. http://www.morganclaypool.com/toc/ivm/1/1

• †Morgan & Claypool Synthesis Lectures on Computer Vision. http://www.morganclaypool.com/toc/cov/1/1

Approaches to Learning from Data (The Analytics Layer)

• NRCreport Ch. 7, “Building Models from Massive Data”

• MMDS (data-mining)

• CausalInference

• ‡[VisualAnalytics] Daniel Keim, Jörn Kohlhammer, Geoffrey Ellis, and Florian Mansmann, editors. 2010. Mastering the Information Age: Solving Problems with Visual Analytics. The Eurographics Association, Goslar, Germany. http://www.vismaster.eu/book/. See also †Morgan & Claypool Synthesis Lectures on Visualization. http://www.morganclaypool.com/toc/vis/2/1

• ‡[Econometrics] Bruce E. Hansen. 2014. Econometrics. http://www.ssc.wisc.edu/~bhansen/econometrics/


• ‡[StatisticalLearning] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning with Applications in R. Springer. http://www-bcf.usc.edu/~gareth/ISL/index.html

• [MachineLearning] Christopher Bishop. 2006. Pattern Recognition and Machine Learning. Springer.

• †[Bayes] Peter D. Hoff. 2009. A First Course in Bayesian Statistical Methods. Springer. http://link.springer.com/book/10.1007%2F978-0-387-92407-6

• †[GraphicalModels] Søren Højsgaard, David Edwards, Steffen Lauritzen. 2012. Graphical Models with R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-2299-0

• [InformationTheory] David MacKay. 2003. Information Theory, Inference, and Learning Algorithms. Cambridge University Press.

• ‡[DeepLearning] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org (see also Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. “Deep Learning.” Nature 521(7553): 436–444. https://doi.org/10.1038/nature14539)

See also ‡[Keras] Keras: The Python Deep Learning Library. https://keras.io. François Chollet, Directory of Keras tutorials: https://github.com/fchollet/keras-resources

• †[SignalProcessing] Jose Maria Giron-Sierra. 2013. Digital Signal Processing with MATLAB Examples (Volumes 1-3). Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-981-10-2534-1


Penn State Policy Statements

Academic Integrity

Academic integrity is the pursuit of scholarly activity in an open, honest, and responsible manner. Academic integrity is a basic guiding principle for all academic activity at The Pennsylvania State University, and all members of the University community are expected to act in accordance with this principle. Consistent with this expectation, the University's Code of Conduct states that all students should act with personal integrity, respect other students' dignity, rights, and property, and help create and maintain an environment in which all can succeed through the fruits of their efforts.

Academic integrity includes a commitment by all members of the University community not to engage in or tolerate acts of falsification, misrepresentation, or deception. Such acts of dishonesty violate the fundamental ethical principles of the University community and compromise the worth of work completed by others.

Disability Accommodation

Penn State welcomes students with disabilities into the University's educational programs. Every Penn State campus has an office for students with disabilities. The Student Disability Resources (SDR) website provides contact information for every Penn State campus (http://equity.psu.edu/sdr/disability-coordinator). For further information, please visit the Student Disability Resources website (http://equity.psu.edu/sdr).

In order to receive consideration for reasonable accommodations, you must contact the appropriate disability services office at the campus where you are officially enrolled, participate in an intake interview, and provide documentation. See documentation guidelines at (http://equity.psu.edu/sdr/guidelines). If the documentation supports your request for reasonable accommodations, your campus disability services office will provide you with an accommodation letter. Please share this letter with your instructors and discuss the accommodations with them as early as possible. You must follow this process for every semester that you request accommodations.

Psychological Services

Many students at Penn State face personal challenges or have psychological needs that may interfere with their academic progress, social development, or emotional wellbeing. The university offers a variety of confidential services to help you through difficult times, including individual and group counseling, crisis intervention, consultations, online chats, and mental health screenings. These services are provided by staff who welcome all students and embrace a philosophy respectful of clients' cultural and religious backgrounds, and sensitive to differences in race, ability, gender identity, and sexual orientation.

• Counseling and Psychological Services at University Park (CAPS) (http://studentaffairs.psu.edu/counseling/): 814-863-0395

• Penn State Crisis Line (24 hours/7 days/week): 877-229-6400

• Crisis Text Line (24 hours/7 days/week): Text LIONS to 741741

Educational Equity

Penn State takes great pride in fostering a diverse and inclusive environment for students, faculty, and staff. Consistent with University Policy AD29, students who believe they have experienced or observed a hate crime, an act of intolerance, discrimination, or harassment that occurs at Penn State are urged to report these incidents as outlined on the University's Report Bias webpage (http://equity.psu.edu/reportbias/).


Background Knowledge

Students will have a variety of backgrounds. Prior to beginning interdisciplinary coursework to fulfill Social Data Analytics degree requirements, including SoDA 501 and 502, students are expected to have advanced (graduate) training in at least one of the component areas of Social Data Analytics and a familiarity with basic concepts in the others.

With regard to specialization, students are expected to have advanced (graduate) training in ONE of the following:

• quantitative social science methodology and a discipline of social science (as would be the case for a second-year PhD student in Political Science, Sociology, Criminology, Human Development and Family Studies, or Demography), OR

• statistics (as would be the case for a second-year PhD student in Statistics), OR

• information science or informatics (as would be the case for a second-year PhD student in Information Science and Technology, or a second-year PhD student in Geography specializing in GIScience), OR

• computer science (as would be the case for a second-year PhD student in Computer Science and Engineering).

This requirement is met as a matter of meeting home program requirements for students in the dual-title PhD, but may require additional coursework on the part of students in other programs wishing to pursue the graduate minor.

With regard to general preparation, students are expected to have ALL of the following technical knowledge:

• basic programming skills, including basic facility with R &/or Python, AND

• basic knowledge of relational databases &/or geographic information systems, AND

• basic knowledge of probability, applied statistics, &/or social science research design, AND

• basic familiarity with a substantive or theoretical area of social science (e.g., 300-level coursework in political science, sociology, criminology, human development, psychology, economics, communication, anthropology, human geography, social informatics, or similar fields).

It is not unusual for students to have one or more gaps in this preparation at time of application to the SoDA program. Students should work with Social Data Analytics advisers to develop a plan for timely remediation of any deficiencies, which generally will not require formal coursework for students whose training and interests are otherwise appropriate for pursuit of the Social Data Analytics degree. Where possible, this will be addressed at time of application to the Social Data Analytics program.

To this end, some free training materials are linked in the reference section, and there are also abundant free, high-quality, self-paced, course-style training materials on these and related subjects available through edX (http://www.edx.org), Udacity (http://www.udacity.com), Codecademy (http://codecademy.com), and Lynda (http://www.lynda.psu).


• Determine “discussion lead” dates

• Exercise 2 Due 2/8: Wikipedia / Google Trends exercise. Teams:

TeamKelling Claire Brittany Arif

TeamYalcin Omer Lulu Fangcao

TeamPang Rosemary Shipi Sara

TeamSun Xiaoran Steve So Young

February 1

• Bing Pan (RPTM), “Big Data and Forecasting in Tourism”

• Readings (send list of confusing terms / concepts by 7:00 am Jan 31, Wednesday):

Bing Pan, “Identifying the Next Non-Stop Flying Market with a Big Data Approach,” https://doi.org/10.1016/j.tourman.2017.12.008; “Google Trends and Tourist Arrivals: Emerging Biases and Proposed Corrections,” https://doi.org/10.1016/j.tourman.2017.10.014. (See also “Forecasting Destination Weekly Hotel Occupancy with Big Data,” http://journals.sagepub.com/doi/abs/10.1177/0047287516669050; “Forecasting tourism demand with composite search index,” https://doi.org/10.1016/j.tourman.2016.07.005)

UnobtrusiveMeasures

Multivariate-R Chapter 1 (Don’t get bogged down when the math starts). LatentVariables Chapter 1 (Skim - don’t get bogged down in the math). NetflixPrize.

ResearchMethodsKB “Sampling”; NRCReport Chapter 8, “Sampling and Massive Data”; BitByBit Chapter 3, “Asking Questions”

• Further reference:

Section “Indirect, Unobtrusive, Nonreactive Measures, Data Exhaust”

Section “Multiple Measures, Latent Variable Measurement”

Section “Sampling and Survey Design”

Section “Open data, APIs, linked data, ...”

Section “Space and Time” (readings on Time)

February 8

• Clio Andris (GEOG), “What AirBNB and Yelp can teach us about human behavior in cities”

• Readings (send list of confusing terms / concepts by 7:00 am Feb 7, Wednesday):

Clio Andris, “Using Yelp to Find Romance in the City: A Case of Restaurants in Four Cities,” https://www.dropbox.com/s/uink0zcklwcpo5g/Yelp_Restaurants.pdf?dl=0; “Hidden Style in the City: An Analysis of Geolocated Airbnb Rental Images in Ten Major Cities,” https://www.dropbox.com/s/t7y7f6880m4ty7v/AirBNB_Analysis.pdf?dl=0


InfoRetrieval Chs. 1, 2, 6 (you may prefer the slides from their classes). My primary hope here is that you understand the “vector space model” and “cosine similarity” (from Chapter 6). My secondary hope is that you understand the basics of Boolean information retrieval (and notions like “index,” “inverted index,” and “postings”). My tertiary hope is that you are exposed to some basic concepts of text analytics / NLP (including “tokenization,” “normalization,” “stemming,” “lemmatization,” “stop words,” “tf-idf”).

FightinWords (FW was used by Jurafsky in http://firstmonday.org/ojs/index.php/fm/article/view/4944/3863 and a best-selling book, The Language of Food, for similar applications to Andris, based on Yelp reviews, and now appears in his textbook, NLP).

• Further reference:

Section “Space and Time”

Section “Web Scraping”

Section “Data Representations, Data Mappings”

Section “Feature selection, feature extraction, feature engineering, ...”

Section “Databases and data management” (esp. SQL)

Section “Language, Text, Speech, Audio”

• Exercise 2 Due. Teams Report.
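
For orientation on the InfoRetrieval concepts named in this week's readings, here is a minimal sketch in plain Python of tokenization with stop words, an inverted index with postings, tf-idf weighting, and cosine similarity in the vector space model. The three toy documents, the stop-word list, and all function names are made up for illustration; the tf-idf variant shown (raw term count times the natural log of N/df) is only one of several weighting schemes, and no stemming or lemmatization is attempted.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy "reviews"; not drawn from any assigned reading.
docs = {
    "d1": "the pasta was great and the service was great",
    "d2": "great pizza and terrible service",
    "d3": "the museum was quiet",
}
STOP_WORDS = {"the", "and", "was"}  # illustrative stop-word list


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [tok for tok in text.lower().split() if tok not in STOP_WORDS]


tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}

# Inverted index: each term maps to its postings (sorted ids of documents containing it).
inverted_index = defaultdict(list)
for doc_id in sorted(tokens):
    for term in sorted(set(tokens[doc_id])):
        inverted_index[term].append(doc_id)


def tfidf_vector(doc_id):
    """Sparse tf-idf vector: raw term count times log(N / document frequency)."""
    n_docs = len(docs)
    counts = Counter(tokens[doc_id])
    return {
        term: tf * math.log(n_docs / len(inverted_index[term]))
        for term, tf in counts.items()
    }


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


vectors = {doc_id: tfidf_vector(doc_id) for doc_id in docs}
print(inverted_index["service"])
print(cosine_similarity(vectors["d1"], vectors["d2"]))
print(cosine_similarity(vectors["d1"], vectors["d3"]))
```

The two restaurant reviews share weighted terms and so have positive cosine similarity, while the museum review shares none with either and scores zero.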

February 15 (Postponed)

• Conrad Tucker (IE), “Cybersecurity Policies and Their Impact on Dynamic Data Driven Application Systems”

• Readings (send list of confusing terms / concepts by 7:00 am Feb 14, Wednesday):

Conrad Tucker, “Cybersecurity Policies and Their Impact on Dynamic Data Driven Application Systems,” http://ieeexplore.ieee.org/document/8064151/. See also “Generative Adversarial Networks for Increasing the Veracity of Big Data,” http://ieeexplore.ieee.org/document/8258219/

February 22

• David Reitter (IST), “Computational Psycholinguistics”

• Readings (two weeks’ worth):

David Reitter, “Alignment in Web-Based Dialogue: Studies in Big Data Computational Psycholinguistics” (Based on data from the Cancer Survivors Network and Reddit). http://www.david-reitter.com/pub/reitter2017alignment.pdf

Similarity

PatternRecognition Ch. 2, “Representation”


MMDS Chapter 3, “Finding Similar Items” (to understand “minhashing” and “locality sensitive hashing,” you’ll need to understand “hashing” – Section 1.3.2, also discussed in InfoRetrieval Chapter 3)

• Further reference:

Section “Similarity, Distance, Association, ...”

Section “Derived Data Representations, ...”

Section “Record Linkage, Entity Resolution, ...”

Section “Clustering, Hashing, Compression, ...”
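
As a companion to the “minhashing” reading above, the following is a minimal sketch, under stated assumptions, of how a minhash signature approximates Jaccard similarity between two sets of character shingles. The example strings, the shingle length, the number of hash functions, and the linear hash family `(a*x + b) mod p` are illustrative choices, not a transcription of the MMDS presentation; `zlib.crc32` is used only to map shingles to integers reproducibly.

```python
import random
import zlib

PRIME = (1 << 61) - 1  # a Mersenne prime, used as the modulus for the hash family


def shingles(text, k=3):
    """Set of all length-k character substrings of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity |a & b| / |a | b| (defined as 1.0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def make_hash_funcs(n, seed=0):
    """n random (a, b) pairs defining hash functions x -> (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(n)]


def minhash_signature(items, hash_funcs):
    """For each hash function, keep the minimum hash value over the set."""
    ids = [zlib.crc32(item.encode("utf-8")) for item in items]
    return [min((a * x + b) % PRIME for x in ids) for a, b in hash_funcs]


def estimated_similarity(sig1, sig2):
    """Fraction of signature positions that agree; this estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)


hash_funcs = make_hash_funcs(200)
s1 = shingles("big social data are big and social")
s2 = shingles("big social data are social and big")
sig1 = minhash_signature(s1, hash_funcs)
sig2 = minhash_signature(s2, hash_funcs)
print("exact Jaccard:   ", jaccard(s1, s2))
print("minhash estimate:", estimated_similarity(sig1, sig2))
```

The signatures are much shorter than the shingle sets themselves, which is what makes the comparison cheap at scale; locality sensitive hashing then bands these signatures so that only likely-similar pairs are ever compared.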

March 1

• Reading:

BitByBit Ch. 5 (“Creating Mass Collaboration”)

Crowdsourcing “Introduction” and Ch. 1, “Concepts, Theories, and Cases of Crowdsourcing”

HumanComputation Chs. 1, 2, 5

• Further reference:

Section “Crowdsourcing, Human Computation, Citizen Science, Web Experiments”

Section “Making Up Data (smoothing, convolution, kernels, ...)”

Section “Vision, image, video”

• Semester Projects Must be Approved Before Spring Break

March 8 - Spring Break

March 15

• Daniel DellaPosta (SOC), “Networks and the Mid-20th Century American Mafia”

• Reading:

Dan DellaPosta, “Network Closure in the Mid-20th Century American Mafia” (Social Networks, 2017), https://www-sciencedirect-com.ezaccess.libraries.psu.edu/science/article/pii/S037887331630199X; “Between Clique and Corporation: Boundary-Spanning in Solidary Groups” (R&R in American Journal of Sociology; in the Box folder)

Networks Chs. 1, 2

MMDS Ch. 5, “Link Analysis”

MarkovVisually

• Further references:

Section “Networks and Graphs”


Section “Simulation, resampling, Markov Chains, ...”

DeepLearning (different kind of network)

• 25% Project Review

March 22

• Reading:

DeepLearning Ch. 2, “Linear Algebra”

Shalizi-ADA Chs. 16-18 (“Principal Components Analysis,” “Factor Models,” “Nonlinear Dimension Reduction”)

• Further reference:

Section “Dimensionality reduction, decomposition, ...”

Section “Multiple measures, latent variable measurement”

Section “Linear algebra, matrix computations”

March 29

• Naomi Altman (STAT), “Generalizing PCA”

• Reading:

Altman, “Generalizing PCA,” 2015 slides. http://personal.psu.edu/nsa1/AltmanWebpage/PCAToronto.pdf

MapReduceIntuition (7 minute video providing “divide and conquer” intuition to MapReduce)

TidySAC-Video: Hadley Wickham on the tidyverse, “split-apply-combine” with dplyr, and tidy data (1 hour video). (More detail, but perhaps less intuition, in book form: discussion of “tidying” data, Chapter 5 of TidyData-R)

MMDS Ch. 2 (Read the Chapter 2 in the new “BETA” version of the book, which also touches on Spark and Tensorflow in section 2.4, “Extensions to MapReduce”: http://i.stanford.edu/~ullman/mmdsn.html)

• Further reference:

Section “Nonlinear dimension reduction, manifold learning”

Section “Databases and data management” (esp. NoSQL)

Section “Theoretically-structured approaches to data wrangling”

Section “Parallelism, MapReduce, Split-Apply-Combine”

Section “Functional programming”

Section “Cutting and bleeding edge, ...”

• 50% Project Review

• Exercise 3 (Individual) Due 4/5
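
To make the “divide and conquer” intuition behind MapReduce (and its close cousin, split-apply-combine) concrete, here is a minimal single-machine sketch in plain Python of a word count. The function names and the two toy chunks are hypothetical, and nothing here is distributed; a real framework would run the map and reduce steps in parallel across machines and handle the shuffle itself.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word, 1) for word in chunk.lower().split()]


def shuffle(mapped_chunks):
    """Shuffle: group all emitted values by key across the mapped chunks."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine each key's list of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


# Split: the input arrives as independent chunks (toy data for illustration).
chunks = ["big data big models", "social data"]
word_counts = reduce_phase(shuffle(map_phase(chunk) for chunk in chunks))
print(word_counts)
```

Because each chunk is mapped independently and each key is reduced independently, both steps can be parallelized; the same split, apply, combine pattern underlies grouped summaries in dplyr.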


April 5

• Prasenjit Mitra (IST), “Classification of Tweets from Disaster Scenarios”

• Reading:

Prasenjit Mitra: Rudra et al., “Summarizing Situational and Topical Information During Crises,” https://arxiv.org/pdf/1610.01561.pdf; Imran, Mitra & Castillo, “Twitter as a Lifeline: Human-annotated Twitter Corpora for NLP of Crisis-related Messages,” http://mimran.me/papers/imran_prasenjit_carlos_lrec2016.pdf

NLP (Jurafsky and Martin) Chapters 15 and 16, “Vector Semantics” and “Semantics with Dense Vectors”

• Further reference:

Section “Language, text, speech, audio”

GloVe, word2vec, word2vecExplained, t-SNE

Section “Mobile devices, distributed sensors, ...”

Section “Crowdsourcing, human computation, citizen science, ...”

Section “Scaling, iteration, streaming data, online algorithms”

• Exercise 3 Due

April 12

• Reading:

BitByBit Ch. 4, “Running Experiments”; review Ch. 2 sections, “Natural experiments in observable data” examples

CausalInference Chs. 1-2

• Further reference:

Section “Experimental and observational designs for causal inference”

Section “Crowdsourcing, web experiments”

Section “Human subjects, ...”

• 75% Project Review

April 19

• Reading:

BitByBit Ch. 6, “Ethics”

PSU-ORP, “Common Rule and Other Changes.” https://www.research.psu.edu/irb/commonrulechanges

DataPrivacy


Reproducibility

Refresh on MachineBias

• Further reference:

Section “Ethics and Scientific Responsibility in Big Social Data”

Section “¡Cuidado!”

April 26 - LAST CLASS MEETING

• Team Project Presentations

May 3 - Team Projects Due


Past Visiting Speakers in SoDA 501 and 502

SoDA 502 (Fall 2017)

• Chris Zorn (PLSC)

• Diane Felmlee (SOC)

• Alan MacEachren (GEOG)

• Eric Plutzer (PLSC)

• James LeBreton (PSYCH)

• Sesa Slavkovic (STAT)

• Guido Cervone (GEOG)

• Ashton Verdery (SOC)

• Maggie Niu (STAT)

• Reka Albert (PHYS)

SoDA 501 (Spring 2017)

• Tim Brick (HDFS), “Towards real-time monitoring and intervention using wearable technology”

• Aylin Caliskan (Princeton), “A Story of Discrimination and Unfairness: Bias in Word Embeddings”

• Jay Yonamine (Google - IGERT alum), “Data Science in Industry”

• Johnathan Rush (Illinois), “Geospatial Data Science Workshop”

• Rick Gilmore (PSYCH), “Toward a more reproducible and robust science of human behavior”

• Glenn Firebaugh (SOC), “Measuring Inequality and Segregation with US Census Data”

• Charles Twardy (Sotera), “Data Science for Search and Rescue”

• Anna Smith (Ohio State), “A Hierarchical Model for Network Data in a Latent Hyperbolic Space”

• Rebecca Passonneau (CSE), “Omnigraph: Rich Feature Representation for Graph Kernel Learning”

• Alex Klippel (GEOG), “Virtual Reality for Immersive Analytics”

• Murali Haran (STAT), “A Computationally Efficient Projection-based Approach for Spatial Generalized Mixed Models”

SoDA 502 (Fall 2016)

• Clio Andris (GEOG), “Integrating Social Network Data into GISystems”

• Jia Li (STAT), “Clustering under the Wasserstein Metric”

• Rachel Smith (CAS), “Stigma, Networks, Perceptions of Sociograms”

• Zita Oravecz (HDFS)

• Bethany Bray (Methodology Center), “Latent Class and Latent Transition Analysis”

• Dave Hunter (STAT), “Model Based Clustering of Large Networks”

• Scott Bennett (PLSC), “ABM Model of Insurgency”

• David Reitter (IST)

• Suzanna Linn (PLSC), “Methodological Issues in Automated Text Analysis: Application to News Coverage of the US Economy”


SoDA 501 (Spring 2016)

• Bruce Desmarais (PLSC), “Learning in the Sunshine: Analysis of Local Government Email Corpora”

• Timothy Brick (HDFS), “Mapping and Manipulating Facial Expression”

• Qunying Huang (USC), “Social Media: An Emerging Data Source for Human Mobility Studies”

• Lingzhou Xue (STAT), “An Introduction to High-Dimensional Graphical Models”

• Ashton Verdery (SOC), “Sampling from Network Data”

• Sarah Battersby (Tableau), “Helping People See and Understand Spatial Data”

• Alexandra Slavkovic (STAT), “Statistical Privacy with Network Data”

• Lee Giles (IST), “Machine Learning for Scholarly Big Data”


Readings and References (updated Spring 2018)

We will discuss a relatively small subset of the readings listed here, and this will vary based on topics and readings discussed by visiting speakers and you yourselves. The remainder are provided here as curated references for more in-depth investigation in both the theory and practice related to the topic (with the latter heavily weighted toward resources in Python and R).

† Material that is, at last check, made available through the Penn State library. Most journal links should work if you are logged in to a Penn State machine or through the Penn State VPN. Some article archives and most books require additional authentication through webaccess. If links are broken, start directly from a search via https://www.libraries.psu.edu. Some books require installation of e-readers like Adobe Digital Editions. Links to lynda.com can be accessed through http://lynda.psu.edu.

‡ Material that is, at last check, legally provided for free. In some cases, these are the preprint versions of published material.

§ Material for which a legal selection is or will be provided through the class Box folder.

Big Data & Social Data Analytics

Overviews

• ‡[CompSocSci] David Lazer, Alex Pentland, Lada Adamic, Sinan Aral, Albert-László Barabási, Devon Brewer, Nicholas Christakis, Noshir Contractor, James Fowler, Myron Gutmann, Tony Jebara, Gary King, Michael Macy, Deb Roy, and Marshall Van Alstyne. 2009. “Computational Social Science.” Science 323(5915): 721-3 + Supp. Feb 6. http://science.sciencemag.org/content/323/5915/721.full, https://gking.harvard.edu/files/gking/files/LazPenAda09.pdf

• ‡[NRCreport] National Research Council. 2013. Frontiers in Massive Data Analysis. National Academies Press. (Free w/ registration: http://www.nap.edu/catalog.php?record_id=18374) Ch. 1, “Introduction”; Ch. 2, “Massive Data in Science, Technology, Commerce, National Defense, Telecommunications, and other Endeavors”

• ‡[MMDS] Jure Leskovec, Anand Rajaraman, and Jeff Ullman. 2014. Mining of Massive Datasets. Cambridge University Press. http://www.mmds.org (BETA version of Third Edition: http://i.stanford.edu/~ullman/mmdsn.html)

• §[BitByBit] Matthew J. Salganik. 2018 (Forthcoming). Bit by Bit: Social Research in the Digital Age. Princeton University Press. Ch. 1, “Introduction”; Ch. 2, “Observing Behavior”

• DSHandbook-Py Ch. 1, “Introduction: Becoming a Unicorn”; Ch. 2, “The Data Science Road Map”

Burt-schtick

• ‡[Monroe-5Vs] Burt L. Monroe. 2013. “The Five Vs of Big Data Political Science: Introduction to the Special Issue on Big Data in Political Science.” Political Analysis 21(V5): 1–9. https://doi.org/10.1017/S1047198700014315 (Volume, Velocity, Variety, Vinculation, Validity)

• †[Monroe-No] Burt L. Monroe, Jennifer Pan, Margaret E. Roberts, Maya Sen, and Betsy Sinclair. 2015. “No! Formal Theory, Causal Inference, and Big Data Are Not Contradictory Trends in Political Science.” PS: Political Science & Politics 48(1): 71–4. http://dx.doi.org/10.1017/S1049096514001760

• †[Quinn-Topics] Kevin M. Quinn, Burt L. Monroe, Michael Colaresi, Michael H. Crespin, and Dragomir R. Radev. 2010. “How to Analyze Political Attention with Minimal Assumptions and Costs.” American Journal of Political Science 54(1): 209–28. http://onlinelibrary.wiley.com/doi/10.1111/j.1540-5907.2009.00427.x/full (esp. topic modeling as measurement; approach to validation)

• †[FightinWords] Burt L. Monroe, Michael Colaresi, and Kevin M. Quinn. 2008. “Fightin’ Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict.” Political Analysis 16(4): 372-403. https://doi.org/10.1093/pan/mpn018 (esp. the impact of sampling variance and regularization through priors)

• ‡[BDSS-Census] Big Data Social Science PSU Team. 2012. “A Closer Look at the Kaggle Census Data.” https://burtmonroe.github.io/BDSSKaggleCensus2012 (esp. the relevance of the social processes by which data come to exist as data)

Multidisciplinary Perspectives

• †[Business-BigData] Andrew McAfee and Erik Brynjolfsson. 2012. “Big Data: The Management Revolution.” Harvard Business Review 90(10): 61–8, October. https://hbr.org/2012/10/big-data-the-management-revolution. Thomas H. Davenport and D.J. Patil. 2012. “Data Scientist: The Sexiest Job of the 21st Century.” Harvard Business Review 90(10): 70-6, October. https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century

• †[InfoSci-BigData] C.L. Philip Chen and Chun-Yang Zhang. 2014. “Data-intensive Applications, Challenges, Techniques and Technologies: A Survey on Big Data.” Information Sciences 275: 314-47. https://doi.org/10.1016/j.ins.2014.01.015

• †[Informatics-BigData] Vasant G. Honavar. 2014. “The Promise and Potential of Big Data: A Case for Discovery Informatics.” Review of Policy Research 31(4): 326-330. https://doi.org/10.1111/ropr.12080

• ‡[Stats-BigData] Beate Franke, Jean-François Plante, Ribana Roscher, Annie Lee, Cathal Smyth, Armin Hatefi, Fuqi Chen, Einat Gil, Alexander Schwing, Alessandro Selvitella, Michael M. Hoffman, Roger Grosse, Dietrich Hendricks, and Nancy Reid. 2016. “Statistical Inference, Learning and Models in Big Data.” International Statistical Review 84(3): 371-89. http://onlinelibrary.wiley.com/doi/10.1111/insr.12176/full

• †[Econ-BigData] Hal R. Varian. 2013. “Big Data: New Tricks for Econometrics.” Journal of Economic Perspectives 28(2): 3-28. https://doi.org/10.1257/jep.28.2.3

• †[GeoViz-BigData] Alan MacEachren. 2017. “Leveraging Big (Geo) Data with (Geo) Visual Analytics: Place as the Next Frontier.” In Chenghu Zhou, Fenzhen Su, Francis Harvey, and Jun Xu, eds., Spatial Data Handling in the Big Data Era, pp. 139-155. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/chapter/10.1007/978-981-10-4424-3_10

• †[Soc-BigData] David Lazer and Jason Radford. 2017. “Data ex Machina: Introduction to Big Data.” Annual Review of Sociology 43: 19-39. https://doi.org/10.1146/annurev-soc-060116-053457

• §[Politics-BigData] Keith T. Poole, L. Jason Anastasopoulos, and James E. Monogan III. Forthcoming. “The ‘Big Data’ Revolution in Political Campaigning and Governance.” Oxford Bibliographies in Political Science.

¡Cuidado! Traps, Biases, Problems, Pains, Perils

• †[GoogleFlu] David Lazer, Ryan Kennedy, Gary King, and Alessandro Vespignani. 2014. “The Parable of Google Flu: Traps in Big Data Analysis.” Science (343), 14 March. http://gking.harvard.edu/files/gking/files/0314policyforumff.pdf

• ‡[MachineBias] ProPublica. “Machine Bias: Investigating Algorithmic Injustice.” Series. https://www.propublica.org/series/machine-bias. See especially:


Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. 2016. “Machine Bias.” ProPublica, May 23. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing

Julia Angwin, Madeleine Varner, and Ariana Tobin. 2017. “Facebook Enabled Advertisers to Reach ‘Jew Haters.’” https://www.propublica.org/article/facebook-enabled-advertisers-to-reach-jew-haters

• ‡[Polling2016] Doug Rivers (Nov 11, 2016). “First Thoughts on Polling Problems in the 2016 U.S. Elections.” https://today.yougov.com/news/2016/11/11/first-thoughts-polling-problems-2016-us-elections

• †[EventData] Wei Wang, Ryan Kennedy, David Lazer, Naren Ramakrishnan. 2016. “Growing Pains for Global Monitoring of Societal Events.” Science 353(6307), pp. 1502–1503. https://doi.org/10.1126/science.aaf6758

• ‡[GoogleBooks] Eitan Adam Pechenick, Christopher M. Danforth, and Peter Sheridan Dodds. 2015. “Characterizing the Google Books Corpus: Strong Limits to Inferences of Socio-Cultural and Linguistic Evolution.” PLoS One. https://doi.org/10.1371/journal.pone.0137041

• ‡[OkCupid] Michael Zimmer. 2016. “OkCupid Study Reveals the Perils of Big-Data Science.” Wired. https://www.wired.com/2016/05/okcupid-study-reveals-perils-big-data-science. May 14.

• †[CriticalQuestions] danah boyd and Kate Crawford. 2011. “Critical Questions for Big Data: Provocations for a Cultural, Technological, and Scholarly Phenomenon.” Information, Communication & Society 15(5): 662–79. http://dx.doi.org/10.1080/1369118X.2012.678878

• ‡[RacistBot] Daniel Victor. 2016. “Microsoft created a Twitter bot to learn from users. It quickly became a racist jerk.” New York Times. https://www.nytimes.com/2016/03/25/technology/microsoft-created-a-twitter-bot-to-learn-from-users-it-quickly-became-a-racist-jerk.html

• MMDS 1.2.2-1.2.3 on Bonferroni

• Also BDSS-Census

Research Design and Measurement

Overviews

• BitByBit

• ‡[ResearchMethodsKB] William M. Trochim. 2006. The Research Methods Knowledge Base. http://www.socialresearchmethods.net/kb/

• [7Rules] Glenn Firebaugh. 2008. Seven Rules for Social Research. Princeton University Press.

Measurement, Reliability, and Validity

• ResearchMethodsKB “Measurement”

• 7Rules Ch. 3, “Build Reality Checks into Your Research”

• Reliability: see also FightinWords

• Validity: see also Quinn10-Topics, Monroe-5Vs

Indirect, Unobtrusive, Nonreactive Measures, Data Exhaust


• †[UnobtrusiveMeasures] Raymond M. Lee. 2015. “Unobtrusive Measures.” Oxford Bibliographies. https://doi.org/10.1093/OBO/9780199846740-0048 (Canonical cite is Eugene J. Webb. 1966. Unobtrusive Measures; or Webb, Donald T. Campbell, Richard D. Schwartz, Lee Sechrest. 1999. Unobtrusive Measures, rev. ed. Sage.)

• BitByBit: Examples in Chapter 2

Multiple Measures, Latent Variable Measurement

• 7Rules Ch. 4, “Replicate Where Possible”

• †[LatentVariables] David J. Bartholomew, Martin Knott, and Irini Moustaki. 2011. Latent Variable Models and Factor Analysis: A Unified Approach. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10483308 (esp. Ch. 1, “Basic ideas and examples”)

• †[Multivariate-R] Brian Everitt and Torsten Hothorn. 2011. An Introduction to Applied Multivariate Analysis with R. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4419-9650-3

• Shalizi-ADA Ch. 17, “Factor Models”

• ‡[NetflixPrize] Edwin Chen. 2011. “Winning the Netflix Prize: A Summary.”

• ‡CRAN: https://cran.r-project.org/web/views/Multivariate.html, https://cran.r-project.org/web/views/Psychometrics.html, https://cran.r-project.org/web/views/Cluster.html

Sampling and Survey Design

• †[Sampling] Steven K. Thompson. 2012. Sampling, 3rd ed. http://sk8es4mc2l.search.serialssolutions.com/?sid=sersol&SS_jc=TC_024492330&title=Wiley%20Desktop%20Editions%20%3A%20Sampling

• ResearchMethodsKB “Sampling”

• NRCreport Ch. 8, “Sampling and Massive Data”

• BitByBit Ch. 3, “Asking Questions”

• ‡[MSE] Daniel Manrique-Vallier, Megan E. Price, and Anita Gohdes. 2013. “Multiple Systems Estimation Techniques for Estimating Casualties in Armed Conflicts.” In Seybolt et al. (eds.), Counting Civilians. Preprint: http://citeseerx.ist.psu.edu/viewdoc/download?doi=1011469939&rep=rep1&type=pdf

• †[NetworkSampling] Ted Mouw and Ashton M. Verdery. 2012. “Network Sampling with Memory: A Proposal for More Efficient Sampling from Social Networks.” Sociological Methodology 42(1): 206–56. https://dx.doi.org/10.1177%2F0081175012461248

• See also FightinWords (re: hidden heteroskedasticity in sample variance)

• ‡[HashDontSample] Mudit Uppal. 2016. “Probabilistic data structures in the Big data world (+ code)” (re: “Hash, don’t sample”). https://medium.com/@muppal/probabilistic-data-structures-in-the-big-data-world-code-b9387cff0c55

• MMDS: Hash Functions (1.3.2), Sampling in streams (4.2)

• ‡CRAN: https://cran.r-project.org/web/views/OfficialStatistics.html (“Complex Survey Design,” “Small Area Estimation”)

Experimental and Observational Designs for Causal Inference


• ‡[CausalInference] Miguel A. Hernan and James M. Robins. Forthcoming (2017). Causal Inference. Chapman & Hall/CRC. (Preprint: https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book) Ch. 1, “A definition of causal effect”; Ch. 2, “Randomized experiments”; Ch. 3, “Observational studies.” (See Ch. 6, “Graphical representation of causal effects,” for integration with Judea Pearl approach.)

• BitByBit, Chapter 4, “Running Experiments”; Ch. 2, Natural experiments in observable data examples

• ResearchMethodsKB, “Design”

• Firebaugh-7Rules, Ch. 2, “Look for Differences that Make a Difference, and Report Them”; §Ch. 5, “Compare Like with Like”; Ch. 6, “Use Panel Data to Study Individual Change and Repeated Cross-Section Data to Study Social Change”

• Shalizi-ADA, Part IV, “Causal Inference”

• Monroe-No

• CRAN: https://cran.r-project.org/web/views/ExperimentalDesign.html

Technologies for Primary and Secondary Data Collection

Mobile Devices, Distributed Sensors, Wearable Sensors, Remote Sensing

• †[RealityMining] Nathan Eagle and Alex (Sandy) Pentland. 2006. “Reality Mining: Sensing Complex Social Systems.” Personal and Ubiquitous Computing 10(4): 255–68. https://doi.org/10.1007/s00779-005-0046-3

• †[QuantifiedSelf] Melanie Swan. 2013. “The Quantified Self: Fundamental Disruption in Big Data Science and Biological Discovery.” Big Data 1(2): 85–99. https://doi.org/10.1089/big.2012.0002

• †[SensorData] Charu C. Aggarwal (Ed.). 2013. Managing and Mining Sensor Data. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-6309-2 Ch. 1, “An Introduction to Sensor Data Analytics.”

• †[NightLights] Thushyanthan Baskaran, Brian Min, and Yogesh Uppal. 2015. “Election cycles and electricity provision: Evidence from a quasi-experiment with Indian special elections.” Journal of Public Economics 126: 64–73. https://doi.org/10.1016/j.jpubeco.2015.03.011

Crowdsourcing, Human Computation, Citizen Science, Web Experiments

• †[Crowdsourcing] Daren C. Brabham. 2013. Crowdsourcing. MIT Press. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10692208 “Introduction”; Ch. 1, “Concepts, Theories, and Cases of Crowdsourcing.”

• BitByBit, Ch. 5, “Creating Mass Collaboration”

• NRCreport, Ch. 9, “Human Interaction with Data”

• †[MTurk] Krista Casler, Lydia Bickel, and Elizabeth Hackett. 2013. “Separate but Equal? A Comparison of Participants and Data Gathered via Amazon’s MTurk, Social Media, and Face-to-Face Behavioral Testing.” Computers in Human Behavior 29(6): 2156–60. http://doi.org/10.1016/j.chb.2013.05.009

• †[LabintheWild] Katharina Reinecke and Krzysztof Z. Gajos. 2015. “LabintheWild: Conducting Large-Scale Online Experiments With Uncompensated Samples.” In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing (CSCW ’15): 1364–1378. http://dx.doi.org/10.1145/2675133.2675246


• †[TweetmentEffects] Kevin Munger. 2017. “Tweetment Effects on the Tweeted: Experimentally Reducing Racist Harassment.” Political Behavior 39(3): 629–49. http://doi.org/10.1016/j.chb.2013.05.009

• †[HumanComputation] Edith Law and Luis von Ahn. 2011. Human Computation. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdf/10.2200/S00371ED1V01Y201107AIM013

Open Data, File Formats, APIs, Semantic Web, Linked Data

• DSHandbook-Py, Ch. 12, “Data Encodings and File Formats”

• ‡[OpenData] Open Data Institute. “What Is Open Data?” https://theodi.org/what-is-open-data

• ‡[OpenDataHandbook] Open Knowledge International. The Open Data Handbook. http://opendatahandbook.org Includes appendix “File Formats”: http://opendatahandbook.org/guide/en/appendices/file-formats

• ‡[APIs] Brian Cooksey. 2016. An Introduction to APIs. https://zapier.com/learn/apis

• ‡[APIMarkets] RapidAPI / Mashape API marketplaces: https://docs.rapidapi.com, https://market.mashape.com; ProgrammableWeb: https://www.programmableweb.com

• †[LinkedData] Tom Heath and Christian Bizer. 2011. Linked Data: Evolving the Web into a Global Data Space. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/abs/10.2200/S00334ED1V01Y201102WBE001 Ch. 1, “Introduction”; Ch. 2, “Principles of Linked Data.”

• †[SemanticWeb] Nikolaos Konstantinou and Dimitrios-Emmanuel Spanos. 2015. Materializing the Web of Linked Data. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-16074-0 Ch. 1, “Introduction: Linked Data and the Semantic Web.”

• †Morgan & Claypool Synthesis Lectures on the Semantic Web: Theory and Technology. httpwwwmorganclaypoolcomtocwbe111

Web Scraping

• †[TheoryDrivenScraping] Richard N. Landers, Robert C. Brusso, Katelyn J. Cavanaugh, and Andrew B. Colmus. 2016. “A Primer on Theory-Driven Web Scraping: Automatic Extraction of Big Data from the Internet for Use in Psychological Research.” Psychological Methods 4: 475–492. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1037/met0000081

• ‡[Scraping-Py] Al Sweigart. 2015. Automate the Boring Stuff with Python: Practical Programming for Total Beginners. Ch. 11, Web-Scraping. https://automatetheboringstuff.com/chapter11

• †[Scraping-R] Simon Munzert, Christian Rubba, Peter Meißner, and Dominic Nyhuis. 2014. Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining. Wiley. http://onlinelibrary.wiley.com/book/10.1002/9781118834732

• ‡CRAN: https://cran.r-project.org/web/views/WebTechnologies.html

Ethics & Scientific Responsibility in Big Social Data

Human Subjects, Consent, Privacy

• BitByBit, Ch. 6 (“Ethics”)

• ‡[PSU-ORP] Penn State Office for Research Protections (under the VP for Research):

Human Subjects Research / IRB: https://www.research.psu.edu/irb


Revised Common Rule: https://www.research.psu.edu/irb/commonrulechanges

Responsible Conduct of Research: https://www.research.psu.edu/education/rcr

Research Misconduct: https://www.research.psu.edu/researchmisconduct

SARI@PSU (Scientific and Research Integrity Training): https://www.research.psu.edu/training/sari

• ‡[MenloReport] David Dittrich and Erin Kenneally (Center for Applied Internet Data Analysis). 2012. “The Menlo Report: Ethical Principles Guiding Information and Communication Technology Research” and companion report “Applying Ethical Principles …” https://www.caida.org/publications/papers/2012/menlo_report_actual_formatted

• ‡[BigDataEthics-CBDES] Jacob Metcalf, Emily F. Keller, and danah boyd. 2016. “Perspectives on Big Data, Ethics, and Society.” Council for Big Data, Ethics, and Society. http://bdes.datasociety.net/council-output/perspectives-on-big-data-ethics-and-society

• ‡[AoIRReport] Annette Markham and Elizabeth Buchanan (Association of Internet Researchers). 2012. “Ethical Decision-Making and Internet Research: Recommendations of the AoIR Ethics Working Committee (Version 2.0).” https://aoir.org/reports/ethics2.pdf and guidelines chart: https://aoir.org/wp-content/uploads/2017/01/aoir_ethics_graphic_2016.pdf

• ‡[BigDataEthics-Wired] Sarah Zhang. 2016. “Scientists are Just as Confused about the Ethics of Big-Data Research as You.” Wired. https://www.wired.com/2016/05/scientists-just-confused-ethics-big-data-research

• †[BigDataEthics-HerschelMori] Richard Herschel and Virginia M. Mori. 2017. “Ethics & Big Data.” Technology in Society 49: 31–36. http://doi.org/10.1016/j.techsoc.2017.03.003

The Science of Data Privacy

• †[BigDataPrivacy] Terence Craig and Mary E. Ludloff. 2011. Privacy and Big Data. O’Reilly Media. http://pensu.eblib.com/patron/FullRecord.aspx?p=781814

• ‡[DataAnalysisPrivacy] John Abowd, Lorenzo Alvisi, Cynthia Dwork, Sampath Kannan, Ashwin Machanavajjhala, and Jerome Reiter. 2017. “Privacy-Preserving Data Analysis for the Federal Statistical Agencies.” A Computing Community Consortium white paper. https://arxiv.org/abs/1701.00752

• †[DataPrivacy] Stephen E. Fienberg and Aleksandra B. Slavkovic. 2011. “Data Privacy and Confidentiality.” International Encyclopedia of Statistical Science: 342–5. http://doi.org/978-3-642-04898-2_202

• ‡[NetworksPrivacy] Vishesh Karwa and Aleksandra Slavkovic. 2016. “Inference using noisy degrees: Differentially private β-model and synthetic graphs.” Annals of Statistics 44(1): 87–112. http://projecteuclid.org/euclid.aos/1449755958

• †[DataPublishingPrivacy] Raymond Chi-Wing Wong and Ada Wai-Chee Fu. 2010. Privacy-Preserving Data Publishing: An Overview. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdfplus/10.2200/S00237ED1V01Y201003DTM002

• DataMatching, Ch. 8, “Privacy Aspects of Data Matching”

Transparency, Reproducibility, and Team Science

• ‡[10RulesforData] Alyssa Goodman, Alberto Pepe, Alexander W. Blocker, Christine L. Borgman, Kyle Cranmer, Merce Crosas, Rosanne Di Stefano, Yolanda Gil, Paul Groth, Margaret Hedstrom, David W. Hogg, Vinay Kashyap, Ashish Mahabal, Aneta Siemiginowska, and Aleksandra Slavkovic. (2014). “Ten Simple Rules for the Care and Feeding of Scientific Data.” PLoS Computational Biology 10(4): e1003542. https://doi.org/10.1371/journal.pcbi.1003542 (Note esp. curated resources for reproducible research.)


• †[Transparency] E. Miguel, C. Camerer, K. Casey, J. Cohen, K. M. Esterling, A. Gerber, R. Glennerster, D. P. Green, M. Humphreys, G. Imbens, D. Laitin, T. Madon, L. Nelson, B. A. Nosek, M. Petersen, R. Sedlmayr, J. P. Simmons, U. Simonsohn, and M. Van der Laan. 2014. “Promoting Transparency in Social Science Research.” Science 343(6166): 30–1. https://doi.org/10.1126/science.1245317

• †[Reproducibility] Marcus R. Munafo, Brian A. Nosek, Dorothy V. M. Bishop, Katherine S. Button, Christopher D. Chambers, Nathalie Percie du Sert, Uri Simonsohn, Eric-Jan Wagenmakers, Jennifer J. Ware, and John P. A. Ioannidis. 2017. “A Manifesto for Reproducible Science.” Nature Human Behavior 0021 (2017). https://doi.org/10.1038/s41562-016-0021

• ‡[SoftwareCarpentry] esp. “Lessons”: http://software-carpentry.org/lessons

• DSHandbook-Py, Ch. 9, “Technical Communication and Documentation”; Ch. 15, “Software Engineering Best Practices”

• †[TeamScienceToolkit] Vogel AL, Hall KL, Fiore SM, Klein JT, Bennett LM, Gadlin H, Stokols D, Nebeling LC, Wuchty S, Patrick K, Spotts EL, Pohl C, Riley WT, Falk-Krzesinski HJ. 2013. “The Team Science Toolkit: enhancing research collaboration through online knowledge sharing.” American Journal of Preventive Medicine 45: 787–9. https://doi.org/10.1016/j.amepre.2013.09.001

• §[Databrary] Kara Hall, Robert Croyle, and Amanda Vogel. Forthcoming (2017). Advancing Social and Behavioral Health Research through Cross-disciplinary Team Science. Springer. Includes Rick O. Gilmore and Karen E. Adolph, “Open Sharing of Research Video: Breaking the Boundaries of the Research Team.” (See http://databrary.org)

• CRAN: https://cran.r-project.org/web/views/ReproducibleResearch.html

Social Bias, Fair Algorithms

• MachineBias

• ‡[EmbeddingsBias] Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. “Semantics Derived Automatically from Language Corpora Contain Human Biases.” Science. https://arxiv.org/abs/1608.07187

• ‡[Debiasing] Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, and Adam Kalai. 2016. “Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings.” https://arxiv.org/abs/1607.06520

• ‡[AvoidingBias] Moritz Hardt, Eric Price, and Nathan Srebro. 2016. “Equality of Opportunity in Supervised Learning.” https://arxiv.org/abs/1610.02413

• ‡[InevitableBias] Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan. 2016. “Inherent Trade-Offs in the Fair Determination of Risk Scores.” https://arxiv.org/abs/1609.05807

Databases and Data Management

• NRCreport, Ch. 3, “Scaling the Infrastructure for Data Management”

• †[SQL] Jan L. Harrington. 2010. SQL Clearly Explained. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780123756978

• DSHandbook-Py, Ch. 14, “Databases”

• †[noSQL] Guy Harrison. 2015. Next Generation Databases: NoSQL, NewSQL, and Big Data. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-1329-2

• †[Cloud] Divyakant Agrawal, Sudipto Das, and Amr El Abbadi. 2012. Data Management in the Cloud: Challenges and Opportunities. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdfplus/10.2200/S00456ED1V01Y201211DTM032


• †Morgan & Claypool Synthesis Lectures on Data Management. http://www.morganclaypool.com/toc/dtm/1/1

Data Wrangling

Theoretically-structured approaches to data wrangling

• ‡[TidyData-R] Garrett Grolemund and Hadley Wickham. 2017. R for Data Science (http://r4ds.had.co.nz), esp. Ch. 12, “Tidy Data”; also Wickham. 2014. “Tidy Data.” Journal of Statistical Software 59(10). http://www.jstatsoft.org/v59/i10/paper Tools: The tidyverse, http://tidyverse.org

• ‡[DataScience-Py] Jake VanderPlas. 2016. Python Data Science Handbook (https://github.com/jakevdp/PythonDSHandbook-Py), esp. Ch. 3 on pandas: http://pandas.pydata.org

• ‡[DataCarpentry] Colin Gillespie and Robin Lovelace. 2017. Efficient R Programming. https://csgillespie.github.io/efficientR esp. Ch. 6 on “efficient data carpentry.”

• §[Wrangler] Joseph M. Hellerstein, Jeffrey Heer, Tye Rattenbury, and Sean Kandel. 2017. Data Wrangling: Practical Techniques for Data Preparation. Tool: Trifacta Wrangler, http://www.trifacta.com/products/wrangler

Data wrangling practice

• DSHandbook-Py, Ch. 4, “Data Munging: String Manipulation, Regular Expressions, and Data Cleaning”

• †[DataSimplification] Jules J. Berman. 2016. Data Simplification: Taming Information with Open Source Tools. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780128037812

• †[Wrangling-Py] Jacqueline Kazil and Katharine Jarmul. 2016. Data Wrangling with Python. O’Reilly. http://proquestcombo.safaribooksonline.com.ezaccess.libraries.psu.edu/9781491948804

• †[Wrangling-R] Bradley C. Boehmke. 2016. Data Wrangling with R. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45599-0

Record Linkage, Entity Resolution, Deduplication

• †[DataMatching] Peter Christen. 2012. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-642-31164-2 (esp. Ch. 2, “The Data Matching Process”)

• DataSimplification, Chapter 5, “Identifying and Deidentifying Data”

• ‡[EntityResolution] Lise Getoor and Ashwin Machanavajjhala. 2013. “Entity Resolution for Big Data.” Tutorial, KDD. http://www.umiacs.umd.edu/~getoor/Tutorials/ER_KDD2013.pdf

• ‡[SyrianCasualties] Peter Sadosky, Anshumali Shrivastava, Megan Price, and Rebecca C. Steorts. 2015. “Blocking Methods Applied to Casualty Records from the Syrian Conflict.” https://arxiv.org/abs/1510.07714 (For more on blocking, see DataMatching, Ch. 4, “Indexing.”)

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html, “Statistical Matching and Record Linkage”

“Making up data”: Imputation, Smoothers, Kernels, Priors, Filters, Teleportation, Negative Sampling, Convolution, Augmentation, Adversarial Training


• †[Imputation] Yi Deng, Changgee Chang, Moges Seyoum Ido, and Qi Long. 2016. “Multiple Imputation for General Missing Data Patterns in the Presence of High-dimensional Data.” Scientific Reports 6(21689). https://www.nature.com/articles/srep21689

• ‡[Overimputation] Matthew Blackwell, James Honaker, and Gary King. 2017. “A Unified Approach to Measurement Error and Missing Data: Overview and Applications.” Sociological Methods & Research 46(3): 303–341. http://gking.harvard.edu/files/gking/files/measure.pdf

• Shalizi-ADA, Sect. 1.5 (Linear Smoothers), Ch. 8 (Splines), Sect. 14.4 (Kernel Density Estimates); DataScience-Python, NB 05.13, “Kernel Density Estimation”

• FightinWords (re: Bayesian priors as additional data; impact of priors on regularization)

• DeepLearning, Section 7.5 (Data Augmentation), 7.12 (Dropout), 7.13 (Adversarial Examples); Chapter 9 (Convolutional Networks)

• †[Adversarial] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. 2015. “Explaining and Harnessing Adversarial Examples.” https://arxiv.org/abs/1412.6572

• See also: resampling and simulation methods

• See also: feature engineering, preprocessing

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html, “Imputation”

(Direct) Data Representations, Data Mappings

• NRCreport, Ch. 5, “Large-Scale Data Representations”

• †[PatternRecognition] M. Narasimha Murty and V. Susheela Devi. 2011. Pattern Recognition: An Algorithmic Approach. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-0-85729-495-1 Section 2.1, “Data Structures for Pattern Representation.”

• ‡[InfoRetrieval] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schutze. 2009. Introduction to Information Retrieval. Cambridge University Press. http://nlp.stanford.edu/IR-book Ch. 1, 2, 6 (also note slides used in their class).

• †[Algorithms] Brian Steele, John Chandler, and Swarna Reddy. 2016. Algorithms for Data Science. Wiley. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45797-0 Ch. 2, “Data Mapping and Data Dictionaries.”

• See also Social Data Structures.

Similarity, Distance, Association, Covariance, The Kernel Trick

• PatternRecognition, Section 2.3, “Proximity Measures”

• ‡[Similarity] Brendan O’Connor. 2012. “Cosine similarity, Pearson correlation, and OLS coefficients.” https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients

• For relatively comprehensive lists, see also:

M.-J. Lesot, M. Rifqi, and H. Benhadda. 2009. “Similarity measures for binary and numerical data: a survey.” Int. J. Knowledge Engineering and Soft Data Paradigms 1(1): 63-. httpciteseerxistpsueduviewdocdownloaddoi=10112126533amprep=rep1amptype=pdf

Seung-Seok Choi, Sung-Hyuk Cha, and Charles C. Tappert. 2010. “A Survey of Binary Similarity and Distance Measures.” Journal of Systemics, Cybernetics & Informatics 8(1): 43–8. http://www.iiisci.org/Journal/CV$/sci/pdfs/GS315JG.pdf


Sung-Hyuk Cha. 2007. “Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions.” International Journal of Mathematical Models and Methods in Applied Sciences 4(1): 300–7. http://csis.pace.edu/ctappert/dps/d861-12/session4-p2.pdf

Anna Huang. 2008. “Similarity Measures for Text Document Clustering.” Proceedings of the 6th New Zealand Computer Science Research Student Conference: 49–56. http://www.nzcsrsc08.canterbury.ac.nz/site/proceedings/Individual_Papers/pg049_Similarity_Measures_for_Text_Document_Clustering.pdf

• Michael I. Jordan. “The Kernel Trick” (Lecture Notes). https://people.eecs.berkeley.edu/~jordan/courses/281B-spring04/lectures/lec3.pdf

• Eric Kim. 2017. “Everything You Ever Wanted to Know about the Kernel Trick (But Were Afraid to Ask).” http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick_blog_ekim_12_20_2017.pdf

Derived Data Representations - Dimensionality Reduction, Compression, Decomposition, Embeddings

The groupings here and under the measurement / multivariate statistics section are particularly arbitrary. For example, “k-Means clustering” can be viewed as a technique for “dimensionality reduction,” “compression,” “feature extraction,” “latent variable measurement,” “unsupervised learning,” or “collaborative filtering.”
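As one illustration of why these labels overlap, the sketch below runs a plain Lloyd's-algorithm k-means on a handful of made-up scalar values and then treats the result as a lossy compression (vector quantization): each value is stored as a small integer code plus a shared codebook of centroids. The function name `kmeans_1d`, the quantile-style initialization, and the data are all assumptions chosen for illustration; none of them comes from the readings.

```python
def kmeans_1d(values, k, iters=20):
    """Lloyd's algorithm on scalars; returns (centroids, labels).

    Initialization is an illustrative choice: centroids start at evenly
    spaced positions in the sorted data so the result is deterministic.
    """
    ordered = sorted(values)
    n = len(ordered)
    centroids = [ordered[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    labels = [0] * n
    for _ in range(iters):
        # Assignment step: each value goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: (v - centroids[j]) ** 2)
            for v in values
        ]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:
                centroids[j] = sum(members) / len(members)
    return centroids, labels


# Hypothetical data: nine measurements that fall into three loose groups.
values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7, 9.9, 10.1, 10.0]
centroids, labels = kmeans_1d(values, k=3)

# "Clustering" view: labels are group memberships.
# "Compression" view: labels are codes, centroids are the codebook.
reconstructed = [centroids[lab] for lab in labels]
max_error = max(abs(v - r) for v, r in zip(values, reconstructed))
```

Read one way, `labels` is a clustering; read another way, it is a one-dimensional discrete representation of the data (a latent category per observation), and `reconstructed` shows how much information that reduced representation loses.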

Clustering, hashing, quantization, blocking, compression

• ‡[Compression] Khalid Sayood. 2012. Introduction to Data Compression, 4th ed. Springer. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780124157965 (e.g., coding, blocking via vector quantization)

• MMDS, Ch. 3, “Finding Similar Items” (minhashing, locality sensitive hashing)

• Multivariate-R, Ch. 6, “Clustering”

• DataScience-Python, NB 05.11, “k-Means Clustering”; NB 05.12, “Gaussian Mixture Models”

• ‡[KMeansHashing] Kaiming He, Fang Wen, and Jian Sun. 2013. “K-means Hashing: An Affinity-Preserving Quantization Method for Learning Binary Compact Codes.” CVPR. https://www.cv-foundation.org/openaccess/content_cvpr_2013/papers/He_K-Means_Hashing_An_2013_CVPR_paper.pdf

• †[Squashing] Madigan, D., Raghavan, N., Dumouchel, W., Nason, M., Posse, C., and Ridgeway, G. (2002). “Likelihood-based data squashing: A modeling approach to instance construction.” Data Mining and Knowledge Discovery 6(2): 173–190. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1023/A:1014095614948

• †[Core-sets] Piotr Indyk, Sepideh Mahabadi, Mohammad Mahdian, and Vahab S. Mirrokni. 2014. “Composable core-sets for diversity and coverage maximization.” PODS ’14: 100–8. https://doi.org/10.1145/2594538.2594560

Feature selection, feature extraction, feature engineering, weighting, preprocessing

• PatternRecognition, Sections 2.6–7, “Feature Selection,” “Feature Extraction”

• DSHandbook-Py, Ch. 7, “Interlude: Feature Extraction Ideas”

• DataScience-Python, NB 05.04, “Feature Engineering”

• Features in text: NLP and InfoRetrieval re: tf-idf and similar; FightinWords

• Features in images: DataScience-Python, NB 05.14, “Image Features”


Dimensionality reduction, decomposition, factorization, change of basis, reparameterization, matrix completion, latent variables, source separation

• MMDS, Ch. 9, “Recommendation Systems”; Ch. 11, “Dimensionality Reduction”

• DSHandbook-Py, Ch. 10, “Unsupervised Learning: Clustering and Dimensionality Reduction”

• ‡[Shalizi-ADA] Cosma Rohilla Shalizi. 2017. Advanced Data Analysis from an Elementary Point of View. Ch. 16 (“Principal Components Analysis”), Ch. 17 (“Factor Models”). http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV

See also Multivariate-R, Ch. 3, “Principal Components Analysis”; Ch. 4, “Multidimensional Scaling”; Ch. 5, “Exploratory Factor Analysis”; Latent

• DeepLearning, Ch. 2, “Linear Algebra”; Ch. 13, “Linear Factor Models”

• ‡[GloVe] Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. “GloVe: Global vectors for word representation.” EMNLP. https://nlp.stanford.edu/projects/glove

• †[NMF] Daniel D. Lee and H. Sebastian Seung. 1999. “Learning the parts of objects by non-negative matrix factorization.” Nature 401: 788–791. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1038/44565

• †[CUR] Michael W. Mahoney and Petros Drineas. 2009. “CUR matrix decompositions for improved data analysis.” PNAS. http://www.pnas.org/content/106/3/697.full

• ‡[ICA] Aapo Hyvarinen and Erkki Oja. 2000. “Independent Component Analysis: Algorithms and Applications.” Neural Networks 13(4-5): 411–30. https://www.cs.helsinki.fi/u/ahyvarin/papers/NN00new.pdf

• [RandomProjection] Ella Bingham and Heikki Mannila. 2001. “Random projection in dimensionality reduction: applications to image and text data.” KDD. https://doi.org/10.1145%2F502512.502546

• Compression, e.g., “Transform coding,” “Wavelets”

Nonlinear dimensionality reduction, Manifold learning

• DeepLearning, Section 5.11.3, “Manifold Learning”

• DataScience-Python, NB 05.10, “Manifold Learning” (Locally linear embedding [LLE], Isomap)

• ‡[KernelPCA] Sebastian Raschka. 2014. “Kernel tricks and nonlinear dimensionality reduction via RBF kernel PCA.” http://sebastianraschka.com/Articles/2014_kernel_pca.html

• Shalizi-ADA, Ch. 18, “Nonlinear Dimensionality Reduction” (LLE)

• ‡[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi. 2003. “Laplacian eigenmaps for dimensionality reduction and data representation.” Neural Computation 15(6): 1373–1396. http://web.cse.ohio-state.edu/~belkin.8/papers/LEM_NC_03.pdf

• DeepLearning, Ch. 14 (“Autoencoders”)

• ‡[word2vec] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” https://arxiv.org/abs/1301.3781

• ‡[word2vecExplained] Yoav Goldberg and Omer Levy. 2014. “word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method.” https://arxiv.org/abs/1402.3722

• ‡[t-SNE] Laurens van der Maaten and Geoffrey Hinton. 2008. “Visualising Data using t-SNE.” Journal of Machine Learning Research 9: 2579–2605. http://jmlr.csail.mit.edu/papers/volume9/vandermaaten08a/vandermaaten08a.pdf


Computation and Scaling Up

Scientific computing, computation at scale

• NRCreport, Ch. 10, “The Seven Computational Giants of Massive Data Analysis”

• DSHandbook-Py, Ch. 21, “Performance and Computer Memory”; Ch. 22, “Computer Memory and Data Structures”

• ‡[ComputationalStatistics-Python] Cliburn Chan. Computational Statistics in Python. https://people.duke.edu/~ccc14/sta-663/index.html

• CRAN Task View: High Performance Computing. https://cran.r-project.org/web/views/HighPerformanceComputing.html

Numerical computing

• [NumericalComputing] Ward Cheney and David Kincaid. 2013. Numerical Mathematics and Computing, 7th ed. Brooks/Cole: Cengage Learning.

• CRAN Task View: Numerical Mathematics. https://cran.r-project.org/web/views/NumericalMathematics.html

Optimization (e.g., MLE, gradient descent, stochastic gradient descent, EM algorithm, neural nets)

• DSHandbook-Py, Ch. 23, “Maximum Likelihood Estimation and Optimization” (gradient descent)

• DeepLearning, Ch. 4, “Numerical Computation”; Ch. 6, “Deep Feedforward Networks”; Ch. 8, “Optimization for Training Deep Models”

• CRAN Task View: Optimization and Mathematical Programming. https://cran.r-project.org/web/views/Optimization.html

Linear algebra, matrix computations

• [MatrixComputations] Gene H. Golub and Charles F. Van Loan. 2013. Matrix Computations, 4th ed. Johns Hopkins University Press.

• ‡[NetflixMatrix] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. “Matrix Factorization Techniques for Recommender Systems.” Computer, August: 42–9. https://datajobs.com/data-science-repo/Recommender-Systems-[Netflix].pdf

• ‡[BigDataPCA] Jianqing Fan, Qiang Sun, Wen-Xin Zhou, and Ziwei Zhu. “Principal Component Analysis for Big Data.” http://www.princeton.edu/~ziweiz/pca.pdf

• ‡[RandomSVD] Andrew Tulloch. 2009. “Fast Randomized SVD.” https://research.fb.com/fast-randomized-svd

• ‡[FactorSGD] Rainer Gemulla, Peter J. Haas, Erik Nijkamp, and Yannis Sismanis. 2011. “Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent.” KDD. http://www.cs.utah.edu/~hari/teaching/bigdata/gemulla11dsgd.pdf

• ‡[Sparse] Max Grossman. 2015. “101 Ways to Store a Sparse Matrix.” https://medium.com/@jmaxg3/101-ways-to-store-a-sparse-matrix-c7f2bf15a229

• ‡[DontInvert] John D. Cook. 2010. “Don’t Invert that Matrix.” https://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix

Simulation-based inference, resampling, Monte Carlo methods, MCMC, Bayes, approximate inference


• ‡[MarkovVisually] Victor Powell. “Markov Chains Explained Visually.” http://setosa.io/ev/markov-chains

• DSHandbook-Py, Ch. 25, “Stochastic Modeling” (Markov chains, MCMC, HMM)

• Bayes: ComputationalStatistics-Py

• DeepLearning, Ch. 17, “Monte Carlo Methods”; Ch. 19, “Approximate Inference”

• ‡[VariationalInference] Jason Eisner. 2011. “High-Level Explanation of Variational Inference.” https://www.cs.jhu.edu/~jason/tutorials/variational.html

• CRAN Task View: Bayesian Inference. https://cran.r-project.org/web/views/Bayesian.html

Parallelism, MapReduce, Split-Apply-Combine

• ‡[MapReduceIntuition] Jigsaw Academy. 2014. “Big Data Specialist: MapReduce.” https://www.youtube.com/watch?v=TwcYQzFqg-8&feature=youtu.be

• MMDS, Ch. 2, “Map-Reduce and the New Software Stack” (3rd Edition discusses Spark & TensorFlow)

• Algorithms, Ch. 3, “Scalable Algorithms and Associative Statistics”; Ch. 4, “Hadoop and MapReduce”

• NRCreport, Ch. 6, “Resources, Trade-offs, and Limitations”

• TidyData-R (Split-apply-combine is the motivating principle behind the “tidyverse” approach.) See also Part III, “Program” (pipes, functions, vectors, iteration).

• ‡[TidySAC-Video] Hadley Wickham. 2017. “Data Science in the Tidyverse.” https://www.rstudio.com/resources/videos/data-science-in-the-tidyverse

• See also FactorSGD.
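For readers who have not met the split-apply-combine pattern named in the TidyData-R entry above, here is a minimal sketch in plain Python rather than the tidyverse. The helper `split_apply_combine`, the field names, and the records are hypothetical and chosen only to show the three steps; in practice the same idea sits behind grouped operations in data-frame libraries and behind the map and reduce phases of MapReduce.

```python
from collections import defaultdict


def split_apply_combine(records, key, value, func):
    """Group records by `key`, apply `func` to each group's values,
    and combine the per-group results into one dictionary."""
    groups = defaultdict(list)
    for rec in records:
        # Split: route each record's value to its group.
        groups[rec[key]].append(rec[value])
    # Apply + combine: one summary per group, gathered in a dict.
    return {group: func(vals) for group, vals in groups.items()}


def mean(xs):
    return sum(xs) / len(xs)


# Hypothetical records (illustrative values only).
records = [
    {"state": "PA", "turnout": 0.5},
    {"state": "OH", "turnout": 0.25},
    {"state": "PA", "turnout": 0.75},
]

result = split_apply_combine(records, "state", "turnout", mean)
```

Because each group is summarized independently, the apply step could in principle run in parallel across groups, which is the link to the parallelism readings in this subsection.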

Functional Programming

• ComputationalStatistics-Python, “Functions are first class objects” through first exercises

• DSHandbook-Py, Ch. 20, “Programming Language Concepts”

• See also Haskell.

Scaling iteration, streaming data, online algorithms (Spark)

• NRCreport, Ch. 4, “Temporal Data and Real-Time Algorithms”

• MMDS, Ch. 4, “Mining Data Streams”

• ‡[BDAS] AMPLab. BDAS: The Berkeley Data Analytics Stack. https://amplab.cs.berkeley.edu/software

• †[Spark] Mohammed Guller. 2015. Big Data Analytics with Spark: A Practitioner’s Guide to Using Spark for Large-Scale Data Processing, Machine Learning, and Graph Analytics, and High-Velocity Data Stream Processing. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-0964-6

• DSHandbook-Py, Ch. 13, “Big Data”

• DeepLearning, Ch. 10, “Sequence Modeling: Recurrent and Recursive Nets”

General Resources for Python and R


• †[DSHandbook-Py] Field Cady. 2017. The Data Science Handbook (Python-based). Wiley. http://onlinelibrary.wiley.com.ezaccess.libraries.psu.edu/book/10.1002/9781119092919

• ‡[Tutorials-R] Ujjwal Karn. 2017. “A curated list of R tutorials for Data Science, NLP and Machine Learning.” https://github.com/ujjwalkarn/DataScienceR

• ‡[Tutorials-Python] Ujjwal Karn. 2017. “A curated list of Python tutorials for Data Science, NLP and Machine Learning.” https://github.com/ujjwalkarn/DataSciencePython

Cutting and Bleeding Edge of Data Science Languages

• [Scala] †Vishal Layka and David Pollak. 2015. Beginning Scala. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-0232-6; †Nicolas Patrick. 2014. Scala for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1901910; ‡Scala site: https://www.scala-lang.org (See also )

• [Julia] †Ivo Balbaert. 2015. Getting Started with Julia. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1973847; ‡Douglas Bates. 2013. “Julia for R Programmers.” http://www.stat.wisc.edu/~bates/JuliaForRProgrammers.pdf; ‡Julia site: https://julialang.org

• [Haskell] †Richard Bird. 2014. Thinking Functionally with Haskell. https://doi-org.ezaccess.libraries.psu.edu/10.1017/CBO9781316092415; †Hakim Cassimally. 2017. Learning Haskell Programming. https://www.lynda.com/Haskell-tutorials/Learning-Haskell-Programming/604926-2.html; †James Church. 2017. Learning Haskell for Data Analysis. https://www.lynda.com/Developer-tutorials/Learning-Haskell-Data-Analysis/604234-2.html; Haskell site: https://www.haskell.org

• [Clojure] †Mark McDonnell. 2017. Quick Clojure: Essential Functional Programming. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2952-1; †Akhil Wali. 2014. Clojure for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1674848; †Arthur Ulfeldt. 2015. https://www.lynda.com/Clojure-tutorials/Up-Running-Clojure/413127-2.html; ‡Clojure site: https://clojure.org

• [TensorFlow] ‡Abadi et al. (Google). 2016. “TensorFlow: A System for Large-Scale Machine Learning.” https://arxiv.org/abs/1605.08695; ‡Udacity, “Deep Learning.” https://www.udacity.com/course/deep-learning--ud730; ‡TensorFlow site: https://www.tensorflow.org

• [H2O] Darren Cook. 2016. Practical Machine Learning with H2O: Powerful, Scalable Techniques for Deep Learning and AI. O’Reilly; Arno Candel and Viraj Parmar. 2015. Deep Learning with H2O; ‡H2O site: https://www.h2o.ai

Social Data Structures

Space and Time

• †[GIA] David O’Sullivan and David J. Unwin. 2010. Geographic Information Analysis, Second Edition. John Wiley & Sons. http://onlinelibrary.wiley.com/book/10.1002/9780470549094

• [Space-Time] Donna J. Peuquet. 2003. Representations of Space and Time. Guilford.

• Roger S. Bivand, Edzer Pebesma, Virgilio Gómez-Rubio. 2013. Applied Spatial Data Analysis in R. Free through library: http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-7618-4 See also Edzer Pebesma. 2016. “Handling and Analyzing Spatial, Spatiotemporal and Movement Data in R.” https://edzer.github.io/UseR2016

• ‡[PySAL] Sergio J. Rey and Dani Arribas-Bel. 2016. “Geographic Data Science with PySAL and the pydata Stack.” http://darribas.org/gds_scipy16

• DataScience-Python NB 04.13, “Geographic Data with Basemap”

• Time: “Longitudinal,” intra-individual data, as common in developmental psychology. †[Longitudinal] Garrett Fitzmaurice, Marie Davidian, Geert Verbeke, and Geert Molenberghs, editors. 2009. Longitudinal Data Analysis. Chapman & Hall/CRC. http://pensu.eblib.com/patron/FullRecord.aspx?p=359998

• Time: “Time Series, Time Series Cross Section, Panel,” as common in economics, political science, and sociology. DataScience-Python Ch. 3 re: pandas; DSHandbook-Py Ch. 17, “Time Series Analysis”; Econometrics; GelmanHill

• Time: “Sequential, Data Streams,” as in NLP, machine learning. Spark and other “streaming data” readings

• Space-Time: Data with continuity, connectivity, neighborhood structure. See DeepLearning Chapter 9 re: convolution

• CRAN Task Views: Spatial: https://cran.r-project.org/web/views/Spatial.html Spatiotemporal: https://cran.r-project.org/web/views/SpatioTemporal.html Time Series: https://cran.r-project.org/web/views/TimeSeries.html

Network Graphs

• ‡[Networks] David Easley and Jon Kleinberg. 2010. Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press. http://www.cs.cornell.edu/home/kleinber/networks-book

• MMDS Ch. 5, “Link Analysis”; Ch. 10, “Mining Social-Network Graphs”

• †[Networks-R] Douglas Luke. 2015. A User’s Guide to Network Analysis in R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-3-319-23883-8

• †[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry. 2017. Python for Graph and Network Analysis. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-53004-8

Hierarchy, Clustered Data, Aggregation, Mixed Models

• †[GelmanHill] Andrew Gelman and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press. http://pensu.eblib.com/patron/FullRecord.aspx?p=288457

• †[MixedModels] Eugene Demidenko. 2013. Mixed Models: Theory and Applications with R, 2nd ed. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10748641

Social Data Channels

Language, Text, Speech, Audio

• ‡[NLP] Daniel Jurafsky and James H. Martin. Forthcoming (2017). Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics, 3rd ed. Prentice-Hall. Preprint: https://web.stanford.edu/~jurafsky/slp3

• InfoRetrieval

• ‡[CoreNLP] Stanford CoreNLP. https://stanfordnlp.github.io/CoreNLP

• †[TextAsData] Grimmer and Stewart. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis. https://doi.org/10.1093/pan/mps028

• Quinn-Topics

• FightinWords

• †[NLTK-Python] Jacob Perkins. 2010. Python Text Processing with NLTK 2.0 Cookbook. Packt. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10435387 †[TextAnalytics-Python] Dipanjan Sarkar. 2016. Text Analytics with Python. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2388-8

• DSHandbook-Py Ch. 16, “Natural Language Processing”

• Matthew J. Denny. http://www.mjdenny.com/Text_Processing_In_R.html

• CRAN Natural Language Processing: https://cran.r-project.org/web/views/NaturalLanguageProcessing.html

• †[AudioData] Dean Knox and Christopher Lucas. 2017. “The Speaker-Affect Model: Measuring Emotion in Political Speech with Audio Data.” http://christopherlucas.org/files/PDFs/sam.pdf https://www.youtube.com/watch?v=Hs8A9dwkMzI

• †Morgan & Claypool Synthesis Lectures on Human Language Technologies. http://www.morganclaypool.com/toc/hlt/1/1

• †Morgan & Claypool Synthesis Lectures on Speech and Audio Processing. http://www.morganclaypool.com/toc/sap/1/1

Vision, Image, Video

• ‡[ComputerVision] Richard Szeliski. 2010. Computer Vision: Algorithms and Applications. Springer. http://szeliski.org/Book

• MixedModels Ch. 11, “Statistical Analysis of Shape”; Ch. 12, “Statistical Image Analysis”

• Audio/Video/Volumetric: See DeepLearning Chapter 9 re: convolution

• †Morgan & Claypool Synthesis Lectures on Image, Video, and Multimedia Processing. http://www.morganclaypool.com/toc/ivm/1/1

• †Morgan & Claypool Synthesis Lectures on Computer Vision. http://www.morganclaypool.com/toc/cov/1/1

Approaches to Learning from Data (The Analytics Layer)

• NRCreport Ch. 7, “Building Models from Massive Data”

• MMDS (data-mining)

• CausalInference

• ‡[VisualAnalytics] Daniel Keim, Jörn Kohlhammer, Geoffrey Ellis, and Florian Mansmann, editors. 2010. Mastering the Information Age: Solving Problems with Visual Analytics. The Eurographics Association, Goslar, Germany. http://www.vismaster.eu/book See also †Morgan & Claypool Synthesis Lectures on Visualization. http://www.morganclaypool.com/toc/vis/2/1

• ‡[Econometrics] Bruce E. Hansen. 2014. Econometrics. http://www.ssc.wisc.edu/~bhansen/econometrics

• ‡[StatisticalLearning] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning with Applications in R. Springer. http://www-bcf.usc.edu/~gareth/ISL/index.html

• [MachineLearning] Christopher Bishop. 2006. Pattern Recognition and Machine Learning. Springer.

• †[Bayes] Peter D. Hoff. 2009. A First Course in Bayesian Statistical Methods. Springer. http://link.springer.com/book/10.1007%2F978-0-387-92407-6

• †[GraphicalModels] Søren Højsgaard, David Edwards, Steffen Lauritzen. 2012. Graphical Models with R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-2299-0

• [InformationTheory] David MacKay. 2003. Information Theory, Inference, and Learning Algorithms. Springer.

• ‡[DeepLearning] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org (see also Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. “Deep Learning.” Nature 521(7553): 436–444. https://doi.org/10.1038/nature14539)

See also ‡[Keras] Keras: The Python Deep Learning Library. https://keras.io François Chollet, Directory of Keras tutorials: https://github.com/fchollet/keras-resources

• †[SignalProcessing] Jose Maria Giron-Sierra. 2013. Digital Signal Processing with MATLAB Examples (Volumes 1-3). Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-981-10-2534-1

Penn State Policy Statements

Academic Integrity

Academic integrity is the pursuit of scholarly activity in an open, honest and responsible manner. Academic integrity is a basic guiding principle for all academic activity at The Pennsylvania State University, and all members of the University community are expected to act in accordance with this principle. Consistent with this expectation, the University’s Code of Conduct states that all students should act with personal integrity, respect other students’ dignity, rights and property, and help create and maintain an environment in which all can succeed through the fruits of their efforts.

Academic integrity includes a commitment by all members of the University community not to engage in or tolerate acts of falsification, misrepresentation or deception. Such acts of dishonesty violate the fundamental ethical principles of the University community and compromise the worth of work completed by others.

Disability Accommodation

Penn State welcomes students with disabilities into the University’s educational programs. Every Penn State campus has an office for students with disabilities. The Student Disability Resources (SDR) website provides contact information for every Penn State campus (http://equity.psu.edu/sdr/disability-coordinator). For further information, please visit the Student Disability Resources website (http://equity.psu.edu/sdr).

In order to receive consideration for reasonable accommodations, you must contact the appropriate disability services office at the campus where you are officially enrolled, participate in an intake interview, and provide documentation. See documentation guidelines at (http://equity.psu.edu/sdr/guidelines). If the documentation supports your request for reasonable accommodations, your campus disability services office will provide you with an accommodation letter. Please share this letter with your instructors and discuss the accommodations with them as early as possible. You must follow this process for every semester that you request accommodations.

Psychological Services

Many students at Penn State face personal challenges or have psychological needs that may interfere with their academic progress, social development, or emotional wellbeing. The university offers a variety of confidential services to help you through difficult times, including individual and group counseling, crisis intervention, consultations, online chats, and mental health screenings. These services are provided by staff who welcome all students and embrace a philosophy respectful of clients’ cultural and religious backgrounds, and sensitive to differences in race, ability, gender identity and sexual orientation.

• Counseling and Psychological Services at University Park (CAPS) (http://studentaffairs.psu.edu/counseling): 814-863-0395

• Penn State Crisis Line (24 hours/7 days/week): 877-229-6400

• Crisis Text Line (24 hours/7 days/week): Text LIONS to 741741

Educational Equity

Penn State takes great pride to foster a diverse and inclusive environment for students, faculty, and staff. Consistent with University Policy AD29, students who believe they have experienced or observed a hate crime, an act of intolerance, discrimination, or harassment that occurs at Penn State are urged to report these incidents as outlined on the University’s Report Bias webpage (http://equity.psu.edu/reportbias).

Background Knowledge

Students will have a variety of backgrounds. Prior to beginning interdisciplinary coursework to fulfill Social Data Analytics degree requirements, including SoDA 501 and 502, students are expected to have advanced (graduate) training in at least one of the component areas of Social Data Analytics and a familiarity with basic concepts in the others.

With regard to specialization, students are expected to have advanced (graduate) training in ONE of the following:

• quantitative social science methodology and a discipline of social science (as would be the case for a second-year PhD student in Political Science, Sociology, Criminology, Human Development and Family Studies, or Demography), OR

• statistics (as would be the case for a second-year PhD student in Statistics), OR

• information science or informatics (as would be the case for a second-year PhD student in Information Science and Technology, or a second-year PhD student in Geography specializing in GIScience), OR

• computer science (as would be the case for a second-year PhD student in Computer Science and Engineering).

This requirement is met as a matter of meeting home program requirements for students in the dual-title PhD, but may require additional coursework on the part of students in other programs wishing to pursue the graduate minor.

With regard to general preparation, students are expected to have ALL of the following technical knowledge:

• basic programming skills, including basic facility with R &/or Python, AND

• basic knowledge of relational databases &/or geographic information systems, AND

• basic knowledge of probability, applied statistics, &/or social science research design, AND

• basic familiarity with a substantive or theoretical area of social science (e.g., 300-level coursework in political science, sociology, criminology, human development, psychology, economics, communication, anthropology, human geography, social informatics, or similar fields).

It is not unusual for students to have one or more gaps in this preparation at time of application to the SoDA program. Students should work with Social Data Analytics advisers to develop a plan for timely remediation of any deficiencies, which generally will not require formal coursework for students whose training and interests are otherwise appropriate for pursuit of the Social Data Analytics degree. Where possible, this will be addressed at time of application to the Social Data Analytics program.

To this end, some free training materials are linked in the reference section, and there are also abundant free, high-quality, self-paced, course-style training materials on these and related subjects available through edX (http://www.edx.org), Udacity (http://www.udacity.com), Codecademy (http://codecademy.com), and Lynda (http://www.lynda.psu).


InfoRetrieval Chs. 1, 2, 6 (you may prefer the slides from their classes). My primary hope here is that you understand the “vector space model” and “cosine similarity” (from Chapter 6). My secondary hope is that you understand the basics of Boolean information retrieval (and notions like “index,” “inverted index,” and “postings”). My tertiary hope is that you are exposed to some basic concepts of text analytics / NLP (including “tokenization,” “normalization,” “stemming,” “lemmatization,” “stop words,” “tf-idf”).

FightinWords (FW was used by Jurafsky in http://firstmonday.org/ojs/index.php/fm/article/view/4944/3863 and a best-selling book, The Language of Food, for similar applications to Andris, based on Yelp reviews, and now appears in his textbook, NLP)

• Further reference

Section “Space and Time”

Section “Web Scraping”

Section “Data Representations, Data Mappings”

Section “Feature selection, feature extraction, feature engineering, ...”

Section “Databases and data management” (esp. SQL)

Section “Language, Text, Speech, Audio”

• Exercise 2 Due Teams Report
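
As a companion to the InfoRetrieval reading above, here is a minimal sketch of the vector space model, tf-idf weighting, and cosine similarity in plain Python. It is an illustration under stated assumptions, not the textbook's implementation: the function names (`tokenize`, `tfidf_vectors`, `cosine_similarity`) and the three toy documents are hypothetical, the tokenizer is deliberately crude (lowercasing plus a letters-only regex, with no stemming or stop-word removal), and the idf used is the simple log(N / df) variant.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Crude tokenizer/normalizer: lowercase, keep runs of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (dict: term -> weight) per document.

    Assumption: weight = raw term frequency * log(N / document frequency).
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Hypothetical toy corpus, for illustration only.
    docs = [
        "the senate debated the budget",
        "the house debated the budget bill",
        "cats chase mice",
    ]
    vecs = tfidf_vectors(docs)
    print(cosine_similarity(vecs[0], vecs[1]))
    print(cosine_similarity(vecs[0], vecs[2]))
```

Storing each document as a dictionary rather than a dense array keeps the sketch dependency-free and mirrors the sparsity of real term-document data; in practice you would likely reach for a library implementation.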

February 15 (Postponed)

• Conrad Tucker (IE), “Cybersecurity Policies and Their Impact on Dynamic Data Driven Application Systems”

• Readings (send list of confusing terms / concepts by 7:00 am Feb 14, Wednesday)

Conrad Tucker, “Cybersecurity Policies and Their Impact on Dynamic Data Driven Application Systems.” http://ieeexplore.ieee.org/document/8064151 See also “Generative Adversarial Networks for Increasing the Veracity of Big Data.” http://ieeexplore.ieee.org/document/8258219

February 22

• David Reitter (IST), “Computational Psycholinguistics”

• Readings (two weeks’ worth)

David Reitter, “Alignment in Web-Based Dialogue: Studies in Big Data Computational Psycholinguistics” (Based on data from the Cancer Survivors Network and Reddit). http://www.david-reitter.com/pub/reitter2017alignment.pdf

Similarity

PatternRecognition Ch. 2, “Representation”

MMDS Chapter 3, “Finding Similar Items” (to understand “minhashing” and “locality sensitive hashing” you’ll need to understand “hashing” – Section 1.3.2, also discussed in InfoRetrieval Chapter 3)

• Further reference

Section “Similarity, Distance, Association, ...”

Section “Derived Data Representations, ...”

Section “Record Linkage, Entity Resolution, ...”

Section “Clustering, Hashing, Compression, ...”
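
To make the “minhashing” idea from the MMDS Chapter 3 reading concrete, the following is a rough sketch, not the book's code. It estimates the Jaccard similarity of two shingle sets by comparing minhash signatures. The helper names (`shingles`, `jaccard`, `minhash_signature`, `estimate_jaccard`) are hypothetical, and seeding a single hash function (`hashlib.blake2b`) with an integer is an assumed stand-in for a family of independent hash functions. Locality sensitive hashing (banding the signatures into buckets) is not shown.

```python
import hashlib


def shingles(text, k=2):
    """Set of overlapping k-word shingles from a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _seeded_hash(item, seed):
    """One member of a hash 'family': hash the item together with a seed."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=200):
    """For each seeded hash function, keep the minimum hash over the set.

    Assumes `items` is non-empty.
    """
    return [min(_seeded_hash(x, seed) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; this estimates Jaccard."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


if __name__ == "__main__":
    # Hypothetical example sentences, for illustration only.
    a = shingles("the quick brown fox jumps over the lazy dog")
    b = shingles("the quick brown fox leaps over the lazy dog")
    print("exact:", jaccard(a, b))
    print("estimate:", estimate_jaccard(minhash_signature(a), minhash_signature(b)))
```

The point of the signature is that it has a fixed length regardless of document size, so pairs of long documents can be compared cheaply; the estimate tightens as `num_hashes` grows.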

March 1

• Reading

BitByBit Ch. 5 (“Creating Mass Collaboration”)

Crowdsourcing “Introduction” and Ch. 1, “Concepts, Theories and Cases of Crowdsourcing”

HumanComputation Chs. 1, 2, 5

• Further reference

Section “Crowdsourcing, Human Computation, Citizen Science, Web Experiments”

Section “Making Up Data (smoothing, convolution, kernels, ...)”

Section “Vision, image, video”

• Semester Projects Must be Approved Before Spring Break

March 8 - Spring Break

March 15

• Daniel DellaPosta (SOC), “Networks and the Mid-20th Century American Mafia”

• Reading

Dan DellaPosta, “Network Closure in the Mid-20th Century American Mafia” (Social Networks, 2017). https://www-sciencedirect-com.ezaccess.libraries.psu.edu/science/article/pii/S037887331630199X “Between Clique and Corporation: Boundary-Spanning in Solidary Groups” (R&R in American Journal of Sociology, in the Box folder)

Networks Chs. 1, 2

MMDS Ch. 5, “Link Analysis”

MarkovVisually

• Further references

Section “Networks and Graphs”

Section “Simulation, resampling, Markov Chains, ...”

DeepLearning (different kind of network)

• 25% Project Review

March 22

• Reading

DeepLearning Ch. 2, “Linear Algebra”

Shalizi-ADA Chs. 16-18 (“Principal Components Analysis,” “Factor Models,” “Nonlinear Dimension Reduction”)

• Further reference

Section “Dimensionality reduction, decomposition, ...”

Section “Multiple measures, latent variable measurement”

Section “Linear algebra, matrix computations”

March 29

• Naomi Altman (STAT), “Generalizing PCA”

• Reading

Altman, “Generalizing PCA,” 2015 slides. http://personal.psu.edu/nsa1/AltmanWebpage/PCAToronto.pdf

MapReduceIntuition (7 minute video providing “divide and conquer” intuition to MapReduce)

TidySAC-Video: Hadley Wickham on the tidyverse, “split-apply-combine” with dplyr, and tidy data (1 hour video). (More detail, but perhaps less intuition, in book form: discussion of “tidying” data, Chapter 5 of TidyData-R)

MMDS Ch. 2 (Read the Chapter 2 in the new “BETA” version of the book, which also touches on Spark and TensorFlow in section 2.4, “Extensions to MapReduce”: http://i.stanford.edu/~ullman/mmdsn.html)

• Further reference

Section “Nonlinear dimension reduction, manifold learning”

Section “Databases and data management” (esp. NoSQL)

Section “Theoretically-structured approaches to data wrangling”

Section “Parallelism, MapReduce, Split-Apply-Combine”

Section “Functional programming”

Section “Cutting and bleeding edge ...”

• 50% Project Review

• Exercise 3 (Individual) Due 4/5
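
The “divide and conquer” intuition behind MapReduce, and its kinship with split-apply-combine, can be sketched in a few lines of single-machine Python. This is an illustrative toy under stated assumptions, not how Hadoop or Spark are implemented: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the example documents are hypothetical, and nothing here runs in parallel. The point is only that the map step touches each document independently and the reduce step touches each key independently, which is what makes the pattern distributable.

```python
from collections import defaultdict
from functools import reduce


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in document.lower().split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key (the 'split' by key)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine each key's values independently (here, by summing)."""
    return {key: reduce(lambda a, b: a + b, values) for key, values in groups.items()}


def word_count(documents):
    """Run map over every document, shuffle by word, then reduce per word."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle(pairs))


if __name__ == "__main__":
    # Hypothetical toy input, for illustration only.
    counts = word_count(["big social data", "big data big problems"])
    for word in sorted(counts):
        print(word, counts[word])
```

In split-apply-combine terms, `shuffle` is the split, the per-key sum is the apply, and assembling the result dictionary is the combine.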


April 5

• Prasenjit Mitra (IST), “Classification of Tweets from Disaster Scenarios”

• Reading

Prasenjit Mitra: Rudra et al., “Summarizing Situational and Topical Information During Crises.” https://arxiv.org/pdf/1610.01561.pdf Imran, Mitra & Castillo, “Twitter as a Lifeline: Human-annotated Twitter Corpora for NLP of Crisis-related Messages.” http://mimran.me/papers/imran_prasenjit_carlos_lrec2016.pdf

NLP (Jurafsky and Martin) Chapters 15 and 16, “Vector Semantics” and “Semantics with Dense Vectors”

• Further reference

Section “Language, text, speech, audio”

GloVe, word2vec, word2vecExplained, t-SNE

Section “Mobile devices, distributed sensors, ...”

Section “Crowdsourcing, human computation, citizen science, ...”

Section “Scaling, iteration, streaming data, online algorithms”

• Exercise 3 Due

April 12

• Reading

BitByBit Ch. 4, “Running Experiments”; review Ch. 2 sections, “Natural experiments in observable data” examples

CausalInference Chs. 1-2

• Further reference

Section “Experimental and observational designs for causal inference”

Section “Crowdsourcing, web experiments”

Section “Human subjects ...”

• 75% Project Review

April 19

• Reading

BitByBit Ch. 6, “Ethics”

PSU-ORP, “Common Rule and Other Changes.” https://www.research.psu.edu/irb/commonrulechanges

DataPrivacy

Reproducibility

Refresh on MachineBias

• Further reference

Section “Ethics and Scientific Responsibility in Big Social Data”

Section “¡Cuidado!”

April 26 - LAST CLASS MEETING

• Team Project Presentations

May 3 - Team Projects Due


Past Visiting Speakers in SoDA 501 and 502

SoDA 502 (Fall 2017)

• Chris Zorn (PLSC)

• Diane Felmlee (SOC)

• Alan MacEachren (GEOG)

• Eric Plutzer (PLSC)

• James LeBreton (PSYCH)

• Sesa Slavkovic (STAT)

• Guido Cervone (GEOG)

• Ashton Verdery (SOC)

• Maggie Niu (STAT)

• Reka Albert (PHYS)

SoDA 501 (Spring 2017)

• Tim Brick (HDFS), “Towards real-time monitoring and intervention using wearable technology”

• Aylin Caliskan (Princeton), “A Story of Discrimination and Unfairness: Bias in Word Embeddings”

• Jay Yonamine (Google - IGERT alum), “Data Science in Industry”

• Johnathan Rush (Illinois), “Geospatial Data Science Workshop”

• Rick Gilmore (PSYCH), “Toward a more reproducible and robust science of human behavior”

• Glenn Firebaugh (SOC), “Measuring Inequality and Segregation with US Census Data”

• Charles Twardy (Sotera), “Data Science for Search and Rescue”

• Anna Smith (Ohio State), “A Hierarchical Model for Network Data in a Latent Hyperbolic Space”

• Rebecca Passonneau (CSE), “Omnigraph: Rich Feature Representation for Graph Kernel Learning”

• Alex Klippel (GEOG), “Virtual Reality for Immersive Analytics”

• Murali Haran (STAT), “A Computationally Efficient Projection-based Approach for Spatial Generalized Mixed Models”

SoDA 502 (Fall 2016)

• Clio Andris (GEOG), “Integrating Social Network Data into GISystems”

• Jia Li (STAT), “Clustering under the Wasserstein Metric”

• Rachel Smith (CAS), “Stigma, Networks, Perceptions of Sociograms”

• Zita Oravecz (HDFS)

• Bethany Bray (Methodology Center), “Latent Class and Latent Transition Analysis”

• Dave Hunter (STAT), “Model Based Clustering of Large Networks”

• Scott Bennett (PLSC), “ABM Model of Insurgency”

• David Reitter (IST)

• Suzanna Linn (PLSC), “Methodological Issues in Automated Text Analysis: Application to News Coverage of the US Economy”

SoDA 501 (Spring 2016)

• Bruce Desmarais (PLSC), “Learning in the Sunshine: Analysis of Local Government Email Corpora”

• Timothy Brick (HDFS), “Mapping and Manipulating Facial Expression”

• Qunying Huang (USC), “Social Media: An Emerging Data Source for Human Mobility Studies”

• Lingzhou Xue (STAT), “An Introduction to High-Dimensional Graphical Models”

• Ashton Verdery (SOC), “Sampling from Network Data”

• Sarah Battersby (Tableau), “Helping People See and Understand Spatial Data”

• Alexandra Slavkovic (STAT), “Statistical Privacy with Network Data”

• Lee Giles (IST), “Machine Learning for Scholarly Big Data”

Readings and References (updated Spring 2018)

We will discuss a relatively small subset of the readings listed here, and this will vary based on topics and readings discussed by visiting speakers and you yourselves. The remainder are provided here as curated references for more in-depth investigation in both the theory and practice related to the topic (with the latter heavily weighted toward resources in Python and R).

† Material that is, at last check, made available through the Penn State library. Most journal links should work if you are logged in to a Penn State machine or through the Penn State VPN. Some article archives and most books require additional authentication through webaccess. If links are broken, start directly from a search via https://www.libraries.psu.edu. Some books require installation of e-readers like Adobe Digital Editions. Links to lynda.com can be accessed through http://lynda.psu.edu.

‡ Material that is, at last check, legally provided for free. In some cases, these are the preprint versions of published material.

§ Material for which a legal selection is or will be provided through the class Box folder.

Big Data & Social Data Analytics

Overviews

• ‡[CompSocSci] David Lazer, Alex Pentland, Lada Adamic, Sinan Aral, Albert-Laszlo Barabasi, Devon Brewer, Nicholas Christakis, Noshir Contractor, James Fowler, Myron Gutmann, Tony Jebara, Gary King, Michael Macy, Deb Roy, and Marshall Van Alstyne. 2009. “Computational Social Science.” Science 323(5915): 721-3 + Supp., Feb 6. http://science.sciencemag.org/content/323/5915/721.full https://gking.harvard.edu/files/gking/files/LazPenAda09.pdf

• ‡[NRCreport] National Research Council. 2013. Frontiers in Massive Data Analysis. National Academies Press. (Free w/ registration: http://www.nap.edu/catalog.php?record_id=18374) Ch. 1, “Introduction”; Ch. 2, “Massive Data in Science, Technology, Commerce, National Defense, Telecommunications, and other Endeavors”

• ‡[MMDS] Jure Leskovec, Anand Rajaraman, and Jeff Ullman. 2014. Mining of Massive Datasets. Cambridge University Press. http://www.mmds.org (BETA version of Third Edition: http://i.stanford.edu/~ullman/mmdsn.html)

• §[BitByBit] Matthew J. Salganik. 2018 (Forthcoming). Bit by Bit: Social Research in the Digital Age. Princeton University Press. Ch. 1, “Introduction”; Ch. 2, “Observing Behavior”

• DSHandbook-Py Ch. 1, “Introduction: Becoming a Unicorn”; Ch. 2, “The Data Science Road Map”

Burt-schtick

• ‡[Monroe-5Vs] Burt L. Monroe. 2013. “The Five Vs of Big Data Political Science: Introduction to the Special Issue on Big Data in Political Science.” Political Analysis 21(V5): 1–9. https://doi.org/10.1017/S1047198700014315 (Volume, Velocity, Variety, Vinculation, Validity)

• †[Monroe-No] Burt L. Monroe, Jennifer Pan, Margaret E. Roberts, Maya Sen, and Betsy Sinclair. 2015. “No! Formal Theory, Causal Inference, and Big Data Are Not Contradictory Trends in Political Science.” PS: Political Science & Politics 48(1): 71–4. http://dx.doi.org/10.1017/S1049096514001760

• †[Quinn-Topics] Kevin M. Quinn, Burt L. Monroe, Michael Colaresi, Michael H. Crespin, and Dragomir R. Radev. 2010. “How to Analyze Political Attention with Minimal Assumptions and Costs.” American Journal of Political Science 54(1): 209–28. http://onlinelibrary.wiley.com/doi/10.1111/j.1540-5907.2009.00427.x/full (esp. topic modeling as measurement, approach to validation)

• †[FightinWords] Burt L. Monroe, Michael Colaresi, and Kevin M. Quinn. 2008. “Fightin’ Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict.” Political Analysis 16(4): 372-403. https://doi.org/10.1093/pan/mpn018 (esp. the impact of sampling variance and regularization through priors)

• ‡[BDSS-Census] Big Data Social Science PSU Team. 2012. “A Closer Look at the Kaggle Census Data.” https://burtmonroe.github.io/BDSSKaggleCensus2012 (esp. the relevance of the social processes by which data come to exist as data)

Multidisciplinary Perspectives

• †[Business-BigData] Andrew McAfee and Erik Brynjolfsson. 2012. “Big Data: The Management Revolution.” Harvard Business Review 90(10): 61–8, October. https://hbr.org/2012/10/big-data-the-management-revolution Thomas H. Davenport and D.J. Patil. 2012. “Data Scientist: The Sexiest Job of the 21st Century.” Harvard Business Review 90(10): 70-6, October. https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century

• †[InfoSci-BigData] C.L. Philip Chen and Chun-Yang Zhang. 2014. “Data-intensive Applications, Challenges, Techniques and Technologies: A Survey on Big Data.” Information Sciences 275: 314-47. https://doi.org/10.1016/j.ins.2014.01.015

• †[Informatics-BigData] Vasant G. Honavar. 2014. “The Promise and Potential of Big Data: A Case for Discovery Informatics.” Review of Policy Research 31(4): 326-330. https://doi.org/10.1111/ropr.12080

• ‡[Stats-BigData] Beate Franke, Jean-François Plante, Ribana Roscher, Annie Lee, Cathal Smyth, Armin Hatefi, Fuqi Chen, Einat Gil, Alexander Schwing, Alessandro Selvitella, Michael M. Hoffman, Roger Grosse, Dietrich Hendricks, and Nancy Reid. 2016. “Statistical Inference, Learning and Models in Big Data.” International Statistical Review 84(3): 371-89. http://onlinelibrary.wiley.com/doi/10.1111/insr.12176/full

• †[Econ-BigData] Hal R. Varian. 2013. “Big Data: New Tricks for Econometrics.” Journal of Economic Perspectives 28(2): 3-28. https://doi.org/10.1257/jep.28.2.3

• †[GeoViz-BigData] Alan MacEachren. 2017. “Leveraging Big (Geo) Data with (Geo) Visual Analytics: Place as the Next Frontier.” In Chenghu Zhou, Fenzhen Su, Francis Harvey, and Jun Xu, eds. Spatial Data Handling in the Big Data Era, pp. 139-155. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/chapter/10.1007/978-981-10-4424-3_10

• †[Soc-BigData] David Lazer and Jason Radford. 2017. “Data ex Machina: Introduction to Big Data.” Annual Review of Sociology 43: 19-39. https://doi.org/10.1146/annurev-soc-060116-053457

• §[Politics-BigData] Keith T. Poole, L. Jason Anastasopoulos, and James E. Monogan III. Forthcoming. “The ‘Big Data’ Revolution in Political Campaigning and Governance.” Oxford Bibliographies in Political Science.

¡Cuidado! Traps, Biases, Problems, Pains, Perils

• †[GoogleFlu] David Lazer, Ryan Kennedy, Gary King, and Alessandro Vespignani. 2014. “The Parable of Google Flu: Traps in Big Data Analysis.” Science (343), 14 March. http://gking.harvard.edu/files/gking/files/0314policyforumff.pdf

• ‡[MachineBias] ProPublica. “Machine Bias: Investigating Algorithmic Injustice” Series. https://www.propublica.org/series/machine-bias See especially:

Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. 2016. “Machine Bias.” ProPublica, May 23. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing

Julia Angwin, Madeleine Varner, and Ariana Tobin. 2017. “Facebook Enabled Advertisers to Reach Jew Haters.” https://www.propublica.org/article/facebook-enabled-advertisers-to-reach-jew-haters

• ‡[Polling2016] Doug Rivers (Nov 11, 2016). “First Thoughts on Polling Problems in the 2016 US Elections.” https://today.yougov.com/news/2016/11/11/first-thoughts-polling-problems-2016-us-elections

• †[EventData] Wei Wang, Ryan Kennedy, David Lazer, Naren Ramakrishnan. 2016. “Growing Pains for Global Monitoring of Societal Events.” Science 353(6307), pp. 1502–1503. https://doi.org/10.1126/science.aaf6758

• ‡[GoogleBooks] Eitan Adam Pechenick, Christopher M. Danforth, and Peter Sheridan Dodds. 2015. “Characterizing the Google Books Corpus: Strong Limits to Inferences of Socio-Cultural and Linguistic Evolution.” PLoS One. https://doi.org/10.1371/journal.pone.0137041

• ‡[OkCupid] Michael Zimmer. 2016. “OkCupid Study Reveals the Perils of Big-Data Science.” Wired, May 14. https://www.wired.com/2016/05/okcupid-study-reveals-perils-big-data-science

• †[CriticalQuestions] danah boyd and Kate Crawford. 2011. “Critical Questions for Big Data: Provocations for a Cultural, Technological, and Scholarly Phenomenon.” Information, Communication & Society 15(5): 662–79. http://dx.doi.org/10.1080/1369118X.2012.678878

• ‡[RacistBot] Daniel Victor. 2016. “Microsoft created a Twitter bot to learn from users. It quickly became a racist jerk.” New York Times. https://www.nytimes.com/2016/03/25/technology/microsoft-created-a-twitter-bot-to-learn-from-users-it-quickly-became-a-racist-jerk.html

• MMDS 1.2.2-1.2.3 on Bonferroni

• Also BDSS-Census

Research Design and Measurement

Overviews

• BitByBit

• ‡[ResearchMethodsKB] William M. Trochim. 2006. The Research Methods Knowledge Base. http://www.socialresearchmethods.net/kb

• [7Rules] Glenn Firebaugh. 2008. Seven Rules for Social Research. Princeton University Press.

Measurement Reliability and Validity

• ResearchMethodsKB, “Measurement”

• 7Rules Ch. 3, “Build Reality Checks into Your Research”

• Reliability: see also FightinWords

• Validity: see also Quinn-Topics, Monroe-5Vs

Indirect Unobtrusive Nonreactive Measures Data Exhaust

• †[UnobtrusiveMeasures] Raymond M. Lee. 2015. “Unobtrusive Measures.” Oxford Bibliographies. https://doi.org/10.1093/OBO/9780199846740-0048 (Canonical cite is Eugene J. Webb, 1966, Unobtrusive Measures, or Webb, Donald T. Campbell, Richard D. Schwartz, Lee Sechrest, 1999, Unobtrusive Measures, rev. ed., Sage.)

• BitByBit, examples in Chapter 2

Multiple Measures, Latent Variable Measurement

• 7Rules, Ch. 4, “Replicate Where Possible”

• †[LatentVariables] David J. Bartholomew, Martin Knott, and Irini Moustaki. 2011. Latent Variable Models and Factor Analysis: A Unified Approach. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10483308 (esp. Ch. 1, “Basic ideas and examples”)

• †[Multivariate-R] Brian Everitt and Torsten Hothorn. 2011. An Introduction to Applied Multivariate Analysis with R. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4419-9650-3

• Shalizi-ADA, Ch. 17, “Factor Models”

• ‡[NetflixPrize] Edwin Chen. 2011. “Winning the Netflix Prize: A Summary.”

• ‡CRAN: https://cran.r-project.org/web/views/Multivariate.html, https://cran.r-project.org/web/views/Psychometrics.html, https://cran.r-project.org/web/views/Cluster.html

Sampling and Survey Design

• †[Sampling] Steven K. Thompson. 2012. Sampling, 3rd ed. https://k8es4mc2l.search.serialssolutions.com/?sid=sersol&SS_jc=TC_024492330&title=Wiley%20Desktop%20Editions%20%3A%20Sampling

• ResearchMethodsKB, “Sampling”

• NRCreport, Ch. 8, “Sampling and Massive Data”

• BitByBit, Ch. 3, “Asking Questions”

bull Dagger[MSE] Daniel Manrique-Vallier Megan E Price and Anita Gohdes 2013 In Seybolt et al (eds)Counting Civilians ldquoMultiple Systems Estimation Techniques for Estimating Casualties in ArmedConflictsrdquo Preprint httpciteseerxistpsueduviewdocdownloaddoi=1011469939amp

rep=rep1amptype=pdf)

• †[NetworkSampling] Ted Mouw and Ashton M. Verdery. 2012. “Network Sampling with Memory: A Proposal for More Efficient Sampling from Social Networks.” Sociological Methodology 42(1): 206–56. https://dx.doi.org/10.1177%2F0081175012461248

• See also FightinWords (re: hidden heteroskedasticity in sample variance)

• ‡[HashDontSample] Mudit Uppal. 2016. “Probabilistic data structures in the Big data world (+ code)” (re: “Hash, don’t sample”). https://medium.com/@muppal/probabilistic-data-structures-in-the-big-data-world-code-b9387cff0c55

• MMDS: Hash Functions (1.3.2), Sampling in streams (4.2)

• ‡CRAN: https://cran.r-project.org/web/views/OfficialStatistics.html (“Complex Survey Design,” “Small Area Estimation”)

Experimental and Observational Designs for Causal Inference

• ‡[CausalInference] Miguel A. Hernan, James M. Robins. Forthcoming (2017). Causal Inference. Chapman & Hall/CRC. Preprint: https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book Ch. 1, “A definition of causal effect”; Ch. 2, “Randomized experiments”; Ch. 3, “Observational studies.” (See Ch. 6, “Graphical representation of causal effects,” for integration with Judea Pearl approach.)

• BitByBit, Chapter 4, “Running Experiments”; Ch. 2, natural experiments in observable data examples

• ResearchMethodsKB, “Design”

• Firebaugh-7Rules, Ch. 2, “Look for Differences that Make a Difference, and Report Them”; § Ch. 5, “Compare Like with Like”; Ch. 6, “Use Panel Data to Study Individual Change and Repeated Cross-Section Data to Study Social Change”

• Shalizi-ADA, Part IV, “Causal Inference”

• Monroe-No

• CRAN: https://cran.r-project.org/web/views/ExperimentalDesign.html

Technologies for Primary and Secondary Data Collection

Mobile Devices, Distributed Sensors, Wearable Sensors, Remote Sensing

• †[RealityMining] Nathan Eagle and Alex (Sandy) Pentland. 2006. “Reality Mining: Sensing Complex Social Systems.” Personal and Ubiquitous Computing 10(4): 255–68. https://doi.org/10.1007/s00779-005-0046-3

• †[QuantifiedSelf] Melanie Swan. 2013. “The Quantified Self: Fundamental Disruption in Big Data Science and Biological Discovery.” Big Data 1(2): 85–99. https://doi.org/10.1089/big.2012.0002

• †[SensorData] Charu C. Aggarwal (Ed.). 2013. Managing and Mining Sensor Data. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-6309-2 Ch. 1, “An Introduction to Sensor Data Analytics”

• †[NightLights] Thushyanthan Baskaran, Brian Min, Yogesh Uppal. 2015. “Election cycles and electricity provision: Evidence from a quasi-experiment with Indian special elections.” Journal of Public Economics 126: 64-73. https://doi.org/10.1016/j.jpubeco.2015.03.011

Crowdsourcing, Human Computation, Citizen Science, Web Experiments

• †[Crowdsourcing] Daren C. Brabham. 2013. Crowdsourcing. MIT Press. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10692208 “Introduction”; Ch. 1, “Concepts, Theories, and Cases of Crowdsourcing”

• BitByBit, Ch. 5, “Creating Mass Collaboration”

• NRCreport, Ch. 9, “Human Interaction with Data”

• †[MTurk] Krista Casler, Lydia Bickel, and Elizabeth Hackett. 2013. “Separate but Equal? A Comparison of Participants and Data Gathered via Amazon’s MTurk, Social Media, and Face-to-Face Behavioral Testing.” Computers in Human Behavior 29(6): 2156–60. http://doi.org/10.1016/j.chb.2013.05.009

• †[LabintheWild] Katharina Reinecke and Krzysztof Z. Gajos. 2015. “LabintheWild: Conducting Large-Scale Online Experiments With Uncompensated Samples.” In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing (CSCW ’15), 1364–1378. http://dx.doi.org/10.1145/2675133.2675246

• †[TweetmentEffects] Kevin Munger. 2017. “Tweetment Effects on the Tweeted: Experimentally Reducing Racist Harassment.” Political Behavior 39(3): 629–49. https://doi.org/10.1007/s11109-016-9373-5

• †[HumanComputation] Edith Law and Luis von Ahn. 2011. Human Computation. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdf/10.2200/S00371ED1V01Y201107AIM013

Open Data, File Formats, APIs, Semantic Web, Linked Data

• DSHandbook-Py, Ch. 12, “Data Encodings and File Formats”

• ‡[OpenData] Open Data Institute. “What Is Open Data?” https://theodi.org/what-is-open-data

• ‡[OpenDataHandbook] Open Knowledge International. The Open Data Handbook. http://opendatahandbook.org Includes appendix “File Formats”: http://opendatahandbook.org/guide/en/appendices/file-formats

• ‡[APIs] Brian Cooksey. 2016. An Introduction to APIs. https://zapier.com/learn/apis

• ‡[APIMarkets] RapidAPI / mashape API marketplaces: https://docs.rapidapi.com, https://market.mashape.com; ProgrammableWeb: https://www.programmableweb.com

• †[LinkedData] Tom Heath and Christian Bizer. 2011. Linked Data: Evolving the Web into a Global Data Space. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/abs/10.2200/S00334ED1V01Y201102WBE001 Ch. 1, “Introduction”; Ch. 2, “Principles of Linked Data”

• †[SemanticWeb] Nikolaos Konstantinou and Dimitrios-Emmanuel Spanos. 2015. Materializing the Web of Linked Data. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-16074-0 Ch. 1, “Introduction: Linked Data and the Semantic Web”

bull daggerMorgan amp Claypool Synthesis Lectures on the Semantic Web Theory and Technology httpwww

morganclaypoolcomtocwbe111

Web Scraping

• †[TheoryDrivenScraping] Richard N. Landers, Robert C. Brusso, Katelyn J. Cavanaugh, and Andrew B. Collmus. 2016. “A Primer on Theory-Driven Web Scraping: Automatic Extraction of Big Data from the Internet for Use in Psychological Research.” Psychological Methods 4, 475–492. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1037/met0000081

• ‡[Scraping-Py] Al Sweigart. 2015. Automate the Boring Stuff with Python: Practical Programming for Total Beginners. Ch. 11, “Web Scraping.” https://automatetheboringstuff.com/chapter11

• †[Scraping-R] Simon Munzert, Christian Rubba, Peter Meißner, and Dominic Nyhuis. 2014. Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining. Wiley. http://onlinelibrary.wiley.com/book/10.1002/9781118834732

• ‡CRAN: https://cran.r-project.org/web/views/WebTechnologies.html

Ethics & Scientific Responsibility in Big Social Data

Human Subjects, Consent, Privacy

• BitByBit, Ch. 6 (“Ethics”)

• ‡[PSU-ORP] Penn State Office for Research Protections (under the VP for Research)

Human Subjects Research / IRB: https://www.research.psu.edu/irb

Revised Common Rule: https://www.research.psu.edu/irb/commonrulechanges

Responsible Conduct of Research: https://www.research.psu.edu/education/rcr

Research Misconduct: https://www.research.psu.edu/researchmisconduct

SARI@PSU (Scientific and Research Integrity Training): https://www.research.psu.edu/training/sari

• ‡[MenloReport] David Dittrich and Erin Kenneally (Center for Applied Internet Data Analysis). 2012. “The Menlo Report: Ethical Principles Guiding Information and Communication Technology Research” and companion report “Applying Ethical Principles ...” https://www.caida.org/publications/papers/2012/menlo_report_actual_formatted

• ‡[BigDataEthics-CBDES] Jacob Metcalf, Emily F. Keller, and danah boyd. 2016. “Perspectives on Big Data, Ethics, and Society.” Council for Big Data, Ethics, and Society. http://bdes.datasociety.net/council-output/perspectives-on-big-data-ethics-and-society

• ‡[AoIRReport] Annette Markham and Elizabeth Buchanan (Association of Internet Researchers). 2012. “Ethical Decision-Making and Internet Research: Recommendations of the AoIR Ethics Working Committee (Version 2.0).” https://aoir.org/reports/ethics2.pdf and guidelines chart: https://aoir.org/wp-content/uploads/2017/01/aoir_ethics_graphic_2016.pdf

• ‡[BigDataEthics-Wired] Sarah Zhang. 2016. “Scientists are Just as Confused about the Ethics of Big-Data Research as You.” Wired. https://www.wired.com/2016/05/scientists-just-confused-ethics-big-data-research

• †[BigDataEthics-HerschelMori] Richard Herschel and Virginia M. Mori. 2017. “Ethics & Big Data.” Technology in Society 49: 31-36. http://doi.org/10.1016/j.techsoc.2017.03.003

The Science of Data Privacy

• †[BigDataPrivacy] Terence Craig and Mary E. Ludloff. 2011. Privacy and Big Data. O’Reilly Media. http://pensu.eblib.com/patron/FullRecord.aspx?p=781814

• ‡[DataAnalysisPrivacy] John Abowd, Lorenzo Alvisi, Cynthia Dwork, Sampath Kannan, Ashwin Machanavajjhala, Jerome Reiter. 2017. “Privacy-Preserving Data Analysis for the Federal Statistical Agencies.” A Computing Community Consortium white paper. https://arxiv.org/abs/1701.00752

• †[DataPrivacy] Stephen E. Fienberg and Aleksandra B. Slavkovic. 2011. “Data Privacy and Confidentiality.” International Encyclopedia of Statistical Science, 342–5. http://doi.org/10.1007/978-3-642-04898-2_202

• ‡[NetworksPrivacy] Vishesh Karwa and Aleksandra Slavkovic. 2016. “Inference using noisy degrees: Differentially private β-model and synthetic graphs.” Annals of Statistics 44(1): 87-112. http://projecteuclid.org/euclid.aos/1449755958

• †[DataPublishingPrivacy] Raymond Chi-Wing Wong and Ada Wai-Chee Fu. 2010. Privacy-Preserving Data Publishing: An Overview. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdfplus/10.2200/S00237ED1V01Y201003DTM002

• DataMatching, Ch. 8, “Privacy Aspects of Data Matching”

Transparency, Reproducibility, and Team Science

• ‡[10RulesforData] Alyssa Goodman, Alberto Pepe, Alexander W. Blocker, Christine L. Borgman, Kyle Cranmer, Merce Crosas, Rosanne Di Stefano, Yolanda Gil, Paul Groth, Margaret Hedstrom, David W. Hogg, Vinay Kashyap, Ashish Mahabal, Aneta Siemiginowska, and Aleksandra Slavkovic. (2014). “Ten Simple Rules for the Care and Feeding of Scientific Data.” PLoS Computational Biology 10(4): e1003542. https://doi.org/10.1371/journal.pcbi.1003542 (Note esp. curated resources for reproducible research.)

• †[Transparency] E. Miguel, C. Camerer, K. Casey, J. Cohen, K. M. Esterling, A. Gerber, R. Glennerster, D. P. Green, M. Humphreys, G. Imbens, D. Laitin, T. Madon, L. Nelson, B. A. Nosek, M. Petersen, R. Sedlmayr, J. P. Simmons, U. Simonsohn, M. Van der Laan. 2014. “Promoting Transparency in Social Science Research.” Science 343(6166): 30–1. https://doi.org/10.1126/science.1245317

• †[Reproducibility] Marcus R. Munafo, Brian A. Nosek, Dorothy V. M. Bishop, Katherine S. Button, Christopher D. Chambers, Nathalie Percie du Sert, Uri Simonsohn, Eric-Jan Wagenmakers, Jennifer J. Ware, and John P. A. Ioannidis. 2017. “A Manifesto for Reproducible Science.” Nature Human Behavior 0021 (2017). https://doi.org/10.1038/s41562-016-0021

• ‡[SoftwareCarpentry] esp. “Lessons”: http://software-carpentry.org/lessons

• DSHandbook-Py, Ch. 9, “Technical Communication and Documentation”; Ch. 15, “Software Engineering Best Practices”

• †[TeamScienceToolkit] Vogel AL, Hall KL, Fiore SM, Klein JT, Bennett LM, Gadlin H, Stokols D, Nebeling LC, Wuchty S, Patrick K, Spotts EL, Pohl C, Riley WT, Falk-Krzesinski HJ. 2013. “The Team Science Toolkit: enhancing research collaboration through online knowledge sharing.” American Journal of Preventive Medicine 45: 787-9. http://doi.org/10.1016/j.amepre.2013.09.001

• §[Databrary] Kara Hall, Robert Croyle, and Amanda Vogel. Forthcoming (2017). Advancing Social and Behavioral Health Research through Cross-disciplinary Team Science. Springer. Includes Rick O. Gilmore and Karen E. Adolph, “Open Sharing of Research Video: Breaking the Boundaries of the Research Team.” (See http://databrary.org)

• CRAN: https://cran.r-project.org/web/views/ReproducibleResearch.html

Social Bias, Fair Algorithms

• MachineBias

• ‡[EmbeddingsBias] Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. “Semantics Derived Automatically from Language Corpora Contain Human Biases.” Science. https://arxiv.org/abs/1608.07187

• ‡[Debiasing] Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, Adam Kalai. 2016. “Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings.” https://arxiv.org/abs/1607.06520

• ‡[AvoidingBias] Moritz Hardt, Eric Price, Nathan Srebro. 2016. “Equality of Opportunity in Supervised Learning.” https://arxiv.org/abs/1610.02413

• ‡[InevitableBias] Jon Kleinberg, Sendhil Mullainathan, Manish Raghavan. 2016. “Inherent Trade-Offs in the Fair Determination of Risk Scores.” https://arxiv.org/abs/1609.05807

Databases and Data Management

• NRCreport, Ch. 3, “Scaling the Infrastructure for Data Management”

• †[SQL] Jan L. Harrington. 2010. SQL Clearly Explained. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780123756978

• DSHandbook-Py, Ch. 14, “Databases”

• †[noSQL] Guy Harrison. 2015. Next Generation Databases: NoSQL, NewSQL, and Big Data. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-1329-2

• †[Cloud] Divyakant Agrawal, Sudipto Das, and Amr El Abbadi. 2012. Data Management in the Cloud: Challenges and Opportunities. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdfplus/10.2200/S00456ED1V01Y201211DTM032

• †Morgan & Claypool Synthesis Lectures on Data Management. http://www.morganclaypool.com/toc/dtm/1/1

Data Wrangling

Theoretically-structured approaches to data wrangling

• ‡[TidyData-R] Garrett Grolemund and Hadley Wickham. 2017. R for Data Science (http://r4ds.had.co.nz), esp. Ch. 12, “Tidy Data”; also Wickham. 2014. “Tidy Data.” Journal of Statistical Software 59(10). http://www.jstatsoft.org/v59/i10/paper Tools: The tidyverse, http://tidyverse.org

• ‡[DataScience-Py] Jake VanderPlas. 2016. Python Data Science Handbook (https://github.com/jakevdp/PythonDataScienceHandbook), esp. Ch. 3 on pandas: http://pandas.pydata.org

• ‡[DataCarpentry] Colin Gillespie and Robin Lovelace. 2017. Efficient R Programming. https://csgillespie.github.io/efficientR esp. Ch. 6 on “efficient data carpentry”

• §[Wrangler] Joseph M. Hellerstein, Jeffrey Heer, Tye Rattenbury, and Sean Kandel. 2017. Data Wrangling: Practical Techniques for Data Preparation. Tool: Trifacta Wrangler, http://www.trifacta.com/products/wrangler

Data wrangling practice

• DSHandbook-Py, Ch. 4, “Data Munging: String Manipulation, Regular Expressions, and Data Cleaning”

• †[DataSimplification] Jules J. Berman. 2016. Data Simplification: Taming Information with Open Source Tools. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780128037812

• †[Wrangling-Py] Jacqueline Kazil, Katharine Jarmul. 2016. Data Wrangling with Python. O’Reilly. http://proquestcombo.safaribooksonline.com.ezaccess.libraries.psu.edu/9781491948804

• †[Wrangling-R] Bradley C. Boehmke. 2016. Data Wrangling with R. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45599-0

Record Linkage, Entity Resolution, Deduplication

• †[DataMatching] Peter Christen. 2012. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-642-31164-2 (esp. Ch. 2, “The Data Matching Process”)

• DataSimplification, Chapter 5, “Identifying and Deidentifying Data”

• ‡[EntityResolution] Lise Getoor and Ashwin Machanavajjhala. 2013. “Entity Resolution for Big Data.” Tutorial, KDD. http://www.umiacs.umd.edu/~getoor/Tutorials/ER_KDD2013.pdf

• ‡[SyrianCasualties] Peter Sadosky, Anshumali Shrivastava, Megan Price, and Rebecca C. Steorts. 2015. “Blocking Methods Applied to Casualty Records from the Syrian Conflict.” https://arxiv.org/abs/1510.07714 (For more on blocking, see DataMatching, Ch. 4, “Indexing.”)

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html, “Statistical Matching and Record Linkage”

“Making up data”: Imputation, Smoothers, Kernels, Priors, Filters, Teleportation, Negative Sampling, Convolution, Augmentation, Adversarial Training

• †[Imputation] Yi Deng, Changgee Chang, Moges Seyoum Ido, and Qi Long. 2016. “Multiple Imputation for General Missing Data Patterns in the Presence of High-dimensional Data.” Scientific Reports 6(21689). https://www.nature.com/articles/srep21689

• ‡[Overimputation] Matthew Blackwell, James Honaker, and Gary King. 2017. “A Unified Approach to Measurement Error and Missing Data: Overview and Applications.” Sociological Methods & Research 46(3): 303-341. http://gking.harvard.edu/files/gking/files/measure.pdf

• Shalizi-ADA, Sect. 1.5 (Linear Smoothers), Ch. 8 (Splines), Sect. 14.4 (Kernel Density Estimates); DataScience-Python NB 05.13, “Kernel Density Estimation”

• FightinWords (re: Bayesian priors as additional data, impact of priors on regularization)

• DeepLearning, Section 7.5 (Data Augmentation), 7.12 (Dropout), 7.13 (Adversarial Examples); Chapter 9 (Convolutional Networks)

• †[Adversarial] Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy. 2015. “Explaining and Harnessing Adversarial Examples.” https://arxiv.org/abs/1412.6572

• See also resampling and simulation methods

• See also feature engineering, preprocessing

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html, “Imputation”

(Direct) Data Representations, Data Mappings

• NRCreport, Ch. 5, “Large-Scale Data Representations”

• †[PatternRecognition] M. Narasimha Murty and V. Susheela Devi. 2011. Pattern Recognition: An Algorithmic Approach. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-0-85729-495-1 Section 2.1, “Data Structures for Pattern Representation”

• ‡[InfoRetrieval] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schutze. 2009. Introduction to Information Retrieval. Cambridge University Press. http://nlp.stanford.edu/IR-book Ch. 1, 2, 6 (also note slides used in their class)

• †[Algorithms] Brian Steele, John Chandler, Swarna Reddy. 2016. Algorithms for Data Science. Wiley. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45797-0 Ch. 2, “Data Mapping and Data Dictionaries”

• See also Social Data Structures

Similarity, Distance, Association, Covariance, The Kernel Trick

• PatternRecognition, Section 2.3, “Proximity Measures”

• ‡[Similarity] Brendan O’Connor. 2012. “Cosine similarity, Pearson correlation, and OLS coefficients.” https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients

• For relatively comprehensive lists, see also:

M-J Lesot M Rifqi and H Benhadda 2009 ldquoSimilarity measures for binary and numerical data a surveyrdquoInt J Knowledge Engineering and Soft Data Paradigms 1(1) 63- httpciteseerxistpsueduviewdoc

downloaddoi=10112126533amprep=rep1amptype=pdf

Seung-Seok Choi, Sung-Hyuk Cha, Charles C. Tappert. 2010. “A Survey of Binary Similarity and Distance Measures.” Journal of Systemics, Cybernetics & Informatics 8(1): 43-8. http://www.iiisci.org/Journal/CV$/sci/pdfs/GS315JG.pdf

Sung-Hyuk Cha. 2007. “Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions.” International Journal of Mathematical Models and Methods in Applied Sciences 4(1): 300-7. http://csis.pace.edu/ctappert/dps/d861-12/session4-p2.pdf

Anna Huang. 2008. “Similarity Measures for Text Document Clustering.” Proceedings of the 6th New Zealand Computer Science Research Student Conference, 49-56. http://www.nzcsrsc08.canterbury.ac.nz/site/proceedings/Individual_Papers/pg049_Similarity_Measures_for_Text_Document_Clustering.pdf

• Michael I. Jordan. “The Kernel Trick” (Lecture Notes). https://people.eecs.berkeley.edu/~jordan/courses/281B-spring04/lectures/lec3.pdf

• Eric Kim. 2017. “Everything You Ever Wanted to Know about the Kernel Trick (But Were Afraid to Ask).” http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick_blog_ekim_12_20_2017.pdf

Derived Data Representations - Dimensionality Reduction, Compression, Decomposition, Embeddings

The groupings here and under the measurement / multivariate statistics section are particularly arbitrary. For example, “k-Means clustering” can be viewed as a technique for “dimensionality reduction,” “compression,” “feature extraction,” “latent variable measurement,” “unsupervised learning,” “collaborative filtering” ...
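To make the point about arbitrary groupings concrete, here is a minimal sketch of how one k-means fit can be read in several of these ways at once. It is an illustration under stated assumptions, not code from any of the readings: the function name `kmeans_1d`, the toy data, and the starting centroids are all hypothetical, and the implementation is plain Lloyd's algorithm on one-dimensional data using only the Python standard library.

```python
# Minimal k-means (Lloyd's algorithm) on 1-D toy data, standard library only.
# All names and values are illustrative assumptions, not from the readings.

def kmeans_1d(points, centroids, n_iter=20):
    """Run Lloyd's algorithm and return (labels, centroids)."""
    centroids = list(centroids)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: (x - centroids[j]) ** 2)
            for x in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for j in range(len(centroids)):
            members = [x for x, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = sum(members) / len(members)
    return labels, centroids


points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7, 9.0, 9.1, 8.9]
labels, codebook = kmeans_1d(points, centroids=[0.0, 5.0, 10.0])

# "Unsupervised learning" / "latent variable measurement":
# the labels are inferred, unobserved class memberships.
# "Compression" (vector quantization): keep a k-entry codebook plus one
# small integer per point, then reconstruct each point from its centroid.
reconstructed = [codebook[lab] for lab in labels]

# "Feature extraction": distances to the k centroids give each point
# a new k-dimensional representation.
features = [[abs(x - c) for c in codebook] for x in points]
```

The same fitted object (labels plus codebook) supports each reading; which heading it belongs under depends on what is done with it afterward.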

Clustering, hashing, quantization, blocking, compression

• ‡[Compression] Khalid Sayood. 2012. Introduction to Data Compression, 4th ed. Springer. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780124157965 (e.g., coding, blocking via vector quantization)

• MMDS, Ch. 3, “Finding Similar Items” (minhashing, locality sensitive hashing)

• Multivariate-R, Ch. 6, “Clustering”

• DataScience-Python NB 05.11, “k-Means Clustering”; NB 05.12, “Gaussian Mixture Models”

• ‡[KMeansHashing] Kaiming He, Fang Wen, Jian Sun. 2013. “K-means Hashing: An Affinity-Preserving Quantization Method for Learning Binary Compact Codes.” CVPR. https://www.cv-foundation.org/openaccess/content_cvpr_2013/papers/He_K-Means_Hashing_An_2013_CVPR_paper.pdf

• †[Squashing] Madigan, D., Raghavan, N., Dumouchel, W., Nason, M., Posse, C., and Ridgeway, G. (2002). “Likelihood-based data squashing: A modeling approach to instance construction.” Data Mining and Knowledge Discovery 6(2): 173-190. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1023/A:1014095614948

• †[Core-sets] Piotr Indyk, Sepideh Mahabadi, Mohammad Mahdian, Vahab S. Mirrokni. 2014. “Composable core-sets for diversity and coverage maximization.” PODS ’14, 100-8. https://doi.org/10.1145/2594538.2594560

Feature selection, feature extraction, feature engineering, weighting, preprocessing

• PatternRecognition, Sections 2.6-7, “Feature Selection,” “Feature Extraction”

• DSHandbook-Py, Ch. 7, “Interlude: Feature Extraction Ideas”

• DataScience-Python NB 05.04, “Feature Engineering”

• Features in text: NLP and InfoRetrieval re: tf-idf and similar; FightinWords

• Features in images: DataScience-Python NB 05.14, “Image Features”

Dimensionality reduction, decomposition, factorization, change of basis, reparameterization, matrix completion, latent variables, source separation

• MMDS, Ch. 9, “Recommendation Systems”; Ch. 11, “Dimensionality Reduction”

• DSHandbook-Py, Ch. 10, “Unsupervised Learning: Clustering and Dimensionality Reduction”

• ‡[Shalizi-ADA] Cosma Rohilla Shalizi. 2017. Advanced Data Analysis from an Elementary Point of View. Ch. 16 (“Principal Components Analysis”), Ch. 17 (“Factor Models”). http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV

See also Multivariate-R, Ch. 3, “Principal Components Analysis”; Ch. 4, “Multidimensional Scaling”; Ch. 5, “Exploratory Factor Analysis”; Latent

• DeepLearning, Ch. 2, “Linear Algebra”; Ch. 13, “Linear Factor Models”

• ‡[GloVe] Jeffrey Pennington, Richard Socher, Christopher Manning. 2014. “GloVe: Global vectors for word representation.” EMNLP. https://nlp.stanford.edu/projects/glove

• †[NMF] Daniel D. Lee and H. Sebastian Seung. 1999. “Learning the parts of objects by non-negative matrix factorization.” Nature 401: 788-791. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1038/44565

• †[CUR] Michael W. Mahoney and Petros Drineas. 2009. “CUR matrix decompositions for improved data analysis.” PNAS. http://www.pnas.org/content/106/3/697.full

• ‡[ICA] Aapo Hyvarinen and Erkki Oja. 2000. “Independent Component Analysis: Algorithms and Applications.” Neural Networks 13(4-5): 411-30. https://www.cs.helsinki.fi/u/ahyvarin/papers/NN00new.pdf

• [RandomProjection] Ella Bingham and Heikki Mannila. 2001. “Random projection in dimensionality reduction: applications to image and text data.” KDD. https://doi.org/10.1145%2F502512.502546

• Compression, e.g., “Transform coding,” “Wavelets”

Nonlinear dimensionality reduction, Manifold learning

• DeepLearning, Section 5.11.3, “Manifold Learning”

• DataScience-Python NB 05.10, “Manifold Learning” (Locally linear embedding [LLE], Isomap)

• ‡[KernelPCA] Sebastian Raschka. 2014. “Kernel tricks and nonlinear dimensionality reduction via RBF kernel PCA.” http://sebastianraschka.com/Articles/2014_kernel_pca.html

• Shalizi-ADA, Ch. 18, “Nonlinear Dimensionality Reduction” (LLE)

• ‡[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi. 2003. “Laplacian eigenmaps for dimensionality reduction and data representation.” Neural Computation 15(6): 1373–1396. http://web.cse.ohio-state.edu/~belkin.8/papers/LEM_NC_03.pdf

• DeepLearning, Ch. 14 (“Autoencoders”)

• ‡[word2vec] Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” https://arxiv.org/abs/1301.3781

• ‡[word2vecExplained] Yoav Goldberg and Omer Levy. 2014. “word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method.” https://arxiv.org/abs/1402.3722

• ‡[t-SNE] Laurens van der Maaten and Geoffrey Hinton. 2008. “Visualising Data using t-SNE.” Journal of Machine Learning Research 9: 2579-2605. http://jmlr.csail.mit.edu/papers/volume9/vandermaaten08a/vandermaaten08a.pdf

Computation and Scaling Up

Scientific computing, computation at scale

• NRCreport, Ch. 10, “The Seven Computational Giants of Massive Data Analysis”

• DSHandbook-Py, Ch. 21, “Performance and Computer Memory”; Ch. 22, “Computer Memory and Data Structures”

• ‡[ComputationalStatistics-Python] Cliburn Chan. Computational Statistics in Python. https://people.duke.edu/~ccc14/sta-663/index.html

• CRAN Task View, High Performance Computing: https://cran.r-project.org/web/views/HighPerformanceComputing.html

Numerical computing

• [NumericalComputing] Ward Cheney and David Kincaid. 2013. Numerical Mathematics and Computing, 7th ed. Brooks/Cole, Cengage Learning.

• CRAN Task View, Numerical Mathematics: https://cran.r-project.org/web/views/NumericalMathematics.html

Optimization (e.g., MLE, gradient descent, stochastic gradient descent, EM algorithm, neural nets)

• DSHandbook-Py, Ch. 23, “Maximum Likelihood Estimation and Optimization” (gradient descent)

• DeepLearning, Ch. 4, “Numerical Computation”; Ch. 6, “Deep Feedforward Networks”; Ch. 8, “Optimization for Training Deep Models”

• CRAN Task View, Optimization and Mathematical Programming: https://cran.r-project.org/web/views/Optimization.html

Linear algebra, matrix computations

• [MatrixComputations] Gene H. Golub and Charles F. Van Loan. 2013. Matrix Computations, 4th ed. Johns Hopkins University Press.

• ‡[NetflixMatrix] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. “Matrix Factorization Techniques for Recommender Systems.” Computer, August: 42-9. https://datajobs.com/data-science-repo/Recommender-Systems-[Netflix].pdf

• ‡[BigDataPCA] Jianqing Fan, Qiang Sun, Wen-Xin Zhou, Ziwei Zhu. “Principal Component Analysis for Big Data.” http://www.princeton.edu/~ziweiz/pca.pdf

• ‡[RandomSVD] Andrew Tulloch. 2009. “Fast Randomized SVD.” https://research.fb.com/fast-randomized-svd

• ‡[FactorSGD] Rainer Gemulla, Peter J. Haas, Erik Nijkamp, and Yannis Sismanis. 2011. “Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent.” KDD. http://www.cs.utah.edu/~hari/teaching/bigdata/gemulla11dsgd.pdf

• ‡[Sparse] Max Grossman. 2015. “101 Ways to Store a Sparse Matrix.” https://medium.com/@jmaxg3/101-ways-to-store-a-sparse-matrix-c7f2bf15a229

• ‡[DontInvert] John D. Cook. 2010. “Don’t Invert that Matrix.” https://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix

Simulation-based inference, resampling, Monte Carlo methods, MCMC, Bayes, approximate inference

• ‡[MarkovVisually] Victor Powell. “Markov Chains Explained Visually.” http://setosa.io/ev/markov-chains

• DSHandbook-Py, Ch. 25, “Stochastic Modeling” (Markov chains, MCMC, HMM)

• Bayes: ComputationalStatistics-Py

• DeepLearning, Ch. 17, “Monte Carlo Methods”; Ch. 19, “Approximate Inference”

• ‡[VariationalInference] Jason Eisner. 2011. “High-Level Explanation of Variational Inference.” https://www.cs.jhu.edu/~jason/tutorials/variational.html

• CRAN Task View, Bayesian Inference: https://cran.r-project.org/web/views/Bayesian.html

Parallelism, MapReduce, Split-Apply-Combine

• ‡[MapReduceIntuition] Jigsaw Academy. 2014. “Big Data Specialist: MapReduce.” https://www.youtube.com/watch?v=TwcYQzFqg-8&feature=youtu.be

• MMDS, Ch. 2, “Map-Reduce and the New Software Stack” (3rd Edition discusses Spark & TensorFlow)

• Algorithms, Ch. 3, “Scalable Algorithms and Associative Statistics”; Ch. 4, “Hadoop and MapReduce”

• NRCreport, Ch. 6, “Resources, Trade-offs, and Limitations”

• TidyData-R (Split-apply-combine is the motivating principle behind the “tidyverse” approach). See also Part III, “Program” (pipes, functions, vectors, iteration)

• ‡[TidySAC-Video] Hadley Wickham. 2017. “Data Science in the Tidyverse.” https://www.rstudio.com/resources/videos/data-science-in-the-tidyverse

• See also FactorSGD

Functional Programming

• ComputationalStatistics-Python, “Functions are first class objects” through first exercises

• DSHandbook-Py, Ch. 20, “Programming Language Concepts”

• See also Haskell

Scaling, iteration, streaming data, online algorithms (Spark)

• NRCreport, Ch. 4, “Temporal Data and Real-Time Algorithms”

• MMDS, Ch. 4, “Mining Data Streams”

• ‡[BDAS] AMPLab. BDAS: The Berkeley Data Analytics Stack. https://amplab.cs.berkeley.edu/software

• †[Spark] Mohammed Guller. 2015. Big Data Analytics with Spark: A Practitioner’s Guide to Using Spark for Large-Scale Data Processing, Machine Learning, and Graph Analytics, and High-Velocity Data Stream Processing. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-0964-6

• DSHandbook-Py, Ch. 13, “Big Data”

• DeepLearning, Ch. 10, “Sequence Modeling: Recurrent and Recursive Nets”

General Resources for Python and R


• †[DSHandbook-Py] Field Cady. 2017. The Data Science Handbook (Python-based). Wiley. http://onlinelibrary.wiley.com.ezaccess.libraries.psu.edu/book/10.1002/9781119092919

• ‡[Tutorials-R] Ujjwal Karn. 2017. “A curated list of R tutorials for Data Science, NLP and Machine Learning.” https://github.com/ujjwalkarn/DataScienceR

• ‡[Tutorials-Python] Ujjwal Karn. 2017. “A curated list of Python tutorials for Data Science, NLP and Machine Learning.” https://github.com/ujjwalkarn/DataSciencePython

Cutting and Bleeding Edge of Data Science Languages

• [Scala] †Vishal Layka and David Pollak. 2015. Beginning Scala. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-0232-6 †Patrick R. Nicolas. 2014. Scala for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1901910 ‡Scala site: https://www.scala-lang.org (See also )

• [Julia] †Ivo Balbaert. 2015. Getting Started with Julia. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1973847 ‡Douglas Bates. 2013. “Julia for R Programmers.” http://www.stat.wisc.edu/~bates/JuliaForRProgrammers.pdf ‡Julia site: https://julialang.org

• [Haskell] †Richard Bird. 2014. Thinking Functionally with Haskell. https://doi-org.ezaccess.libraries.psu.edu/10.1017/CBO9781316092415 †Hakim Cassimally. 2017. Learning Haskell Programming. https://www.lynda.com/Haskell-tutorials/Learning-Haskell-Programming/604926-2.html †James Church. 2017. Learning Haskell for Data Analysis. https://www.lynda.com/Developer-tutorials/Learning-Haskell-Data-Analysis/604234-2.html Haskell site: https://www.haskell.org

• [Clojure] †Mark McDonnell. 2017. Quick Clojure: Essential Functional Programming. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2952-1 †Akhil Wali. 2014. Clojure for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1674848 †Arthur Ulfeldt. 2015. https://www.lynda.com/Clojure-tutorials/Up-Running-Clojure/413127-2.html ‡Clojure site: https://clojure.org

• [TensorFlow] ‡Abadi et al. (Google). 2016. “TensorFlow: A System for Large-Scale Machine Learning.” https://arxiv.org/abs/1605.08695 ‡Udacity, “Deep Learning.” https://www.udacity.com/course/deep-learning--ud730 ‡TensorFlow site: https://www.tensorflow.org

• [H2O] Darren Cook. 2016. Practical Machine Learning with H2O: Powerful, Scalable Techniques for Deep Learning and AI. O’Reilly. Arno Candel and Viraj Parmar. 2015. Deep Learning with H2O. ‡H2O site: https://www.h2o.ai

Social Data Structures

Space and Time

• †[GIA] David O’Sullivan and David J. Unwin. 2010. Geographic Information Analysis, Second Edition. John Wiley & Sons. http://onlinelibrary.wiley.com/book/10.1002/9780470549094

• [Space-Time] Donna J. Peuquet. 2003. Representations of Space and Time. Guilford.

• Roger S. Bivand, Edzer Pebesma, and Virgilio Gómez-Rubio. 2013. Applied Spatial Data Analysis in R. Free through library: http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-7618-4 See also Edzer Pebesma. 2016. “Handling and Analyzing Spatial, Spatiotemporal and Movement Data in R.” https://edzer.github.io/UseR2016

• ‡[PySAL] Sergio J. Rey and Dani Arribas-Bel. 2016. “Geographic Data Science with PySAL and the pydata Stack.” http://darribas.org/gds_scipy16


• DataScience-Python, NB 04.13 “Geographic Data with Basemap”

• Time: “Longitudinal,” intra-individual data, as common in developmental psychology. †[Longitudinal] Garrett Fitzmaurice, Marie Davidian, Geert Verbeke, and Geert Molenberghs, editors. 2009. Longitudinal Data Analysis. Chapman & Hall / CRC. http://pensu.eblib.com/patron/FullRecord.aspx?p=359998

• Time: “Time Series, Time Series Cross Section, Panel,” as common in economics, political science, and sociology. DataScience-Python, Ch. 3, re: pandas; DSHandbook-Py, Ch. 17 “Time Series Analysis”; Econometrics; GelmanHill.

• Time: “Sequential Data, Streams,” as in NLP and machine learning. Spark and other “streaming data” readings.

• Space-Time: Data with continuity, connectivity, or neighborhood structure. See DeepLearning, Chapter 9, re: convolution.

• CRAN Task Views: Spatial (https://cran.r-project.org/web/views/Spatial.html), Spatiotemporal (https://cran.r-project.org/web/views/SpatioTemporal.html), Time Series (https://cran.r-project.org/web/views/TimeSeries.html)

Network Graphs

• ‡[Networks] David Easley and Jon Kleinberg. 2010. Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press. http://www.cs.cornell.edu/home/kleinber/networks-book

• MMDS, Ch. 5 “Link Analysis,” Ch. 10 “Mining Social-Network Graphs”

• †[Networks-R] Douglas Luke. 2015. A User’s Guide to Network Analysis in R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-3-319-23883-8

• †[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry. 2017. Python for Graph and Network Analysis. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-53004-8

Hierarchy, Clustered Data, Aggregation, Mixed Models

• †[GelmanHill] Andrew Gelman and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press. http://pensu.eblib.com/patron/FullRecord.aspx?p=288457

• †[MixedModels] Eugene Demidenko. 2013. Mixed Models: Theory and Applications with R, 2nd ed. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10748641

Social Data Channels

Language, Text, Speech, Audio

• ‡[NLP] Daniel Jurafsky and James H. Martin. Forthcoming (2017). Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics, 3rd ed. Prentice-Hall. Preprint: https://web.stanford.edu/~jurafsky/slp3

• InfoRetrieval

• ‡[CoreNLP] Stanford CoreNLP. https://stanfordnlp.github.io/CoreNLP


• †[TextAsData] Grimmer and Stewart. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis. https://doi.org/10.1093/pan/mps028

• Quinn-Topics

• FightinWords

• †[NLTK-Python] Jacob Perkins. 2010. Python Text Processing with NLTK 2.0 Cookbook. Packt. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10435387 †[TextAnalytics-Python] Dipanjan Sarkar. 2016. Text Analytics with Python. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2388-8

• DSHandbook-Py, Ch. 16 “Natural Language Processing”

• Matthew J. Denny. http://www.mjdenny.com/Text_Processing_In_R.html

• CRAN Natural Language Processing: https://cran.r-project.org/web/views/NaturalLanguageProcessing.html

• †[AudioData] Dean Knox and Christopher Lucas. 2017. “The Speaker-Affect Model: Measuring Emotion in Political Speech with Audio Data.” http://christopherlucas.org/files/PDFs/sam.pdf https://www.youtube.com/watch?v=Hs8A9dwkMzI

• †Morgan & Claypool Synthesis Lectures on Human Language Technologies. http://www.morganclaypool.com/toc/hlt/1/1

• †Morgan & Claypool Synthesis Lectures on Speech and Audio Processing. http://www.morganclaypool.com/toc/sap/1/1

Vision, Image, Video

• ‡[ComputerVision] Richard Szeliski. 2010. Computer Vision: Algorithms and Applications. Springer. http://szeliski.org/Book

• MixedModels, Ch. 11 “Statistical Analysis of Shape,” Ch. 12 “Statistical Image Analysis”

• Audio/Video/Volumetric: See DeepLearning, Chapter 9, re: convolution

• †Morgan & Claypool Synthesis Lectures on Image, Video, and Multimedia Processing. http://www.morganclaypool.com/toc/ivm/1/1

• †Morgan & Claypool Synthesis Lectures on Computer Vision. http://www.morganclaypool.com/toc/cov/1/1

Approaches to Learning from Data (The Analytics Layer)

• NRCreport, Ch. 7 “Building Models from Massive Data”

• MMDS (data-mining)

• CausalInference

• ‡[VisualAnalytics] Daniel Keim, Jörn Kohlhammer, Geoffrey Ellis, and Florian Mansmann, editors. 2010. Mastering the Information Age: Solving Problems with Visual Analytics. The Eurographics Association, Goslar, Germany. http://www.vismaster.eu/book See also †Morgan & Claypool Synthesis Lectures on Visualization. http://www.morganclaypool.com/toc/vis/2/1

• ‡[Econometrics] Bruce E. Hansen. 2014. Econometrics. http://www.ssc.wisc.edu/~bhansen/econometrics


• ‡[StatisticalLearning] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning with Applications in R. Springer. http://www-bcf.usc.edu/~gareth/ISL/index.html

• [MachineLearning] Christopher Bishop. 2006. Pattern Recognition and Machine Learning. Springer.

• †[Bayes] Peter D. Hoff. 2009. A First Course in Bayesian Statistical Methods. Springer. http://link.springer.com/book/10.1007%2F978-0-387-92407-6

• †[GraphicalModels] Søren Højsgaard, David Edwards, and Steffen Lauritzen. 2012. Graphical Models with R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-2299-0

• [InformationTheory] David MacKay. 2003. Information Theory, Inference, and Learning Algorithms. Cambridge University Press.

• ‡[DeepLearning] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org (see also Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. “Deep Learning.” Nature 521(7553): 436–444. https://doi.org/10.1038/nature14539)

See also ‡[Keras] Keras: The Python Deep Learning Library. https://keras.io François Chollet, Directory of Keras tutorials. https://github.com/fchollet/keras-resources

• †[SignalProcessing] Jose Maria Giron-Sierra. 2013. Digital Signal Processing with MATLAB Examples (Volumes 1-3). Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-981-10-2534-1


Penn State Policy Statements

Academic Integrity

Academic integrity is the pursuit of scholarly activity in an open, honest, and responsible manner. Academic integrity is a basic guiding principle for all academic activity at The Pennsylvania State University, and all members of the University community are expected to act in accordance with this principle. Consistent with this expectation, the University’s Code of Conduct states that all students should act with personal integrity, respect other students’ dignity, rights, and property, and help create and maintain an environment in which all can succeed through the fruits of their efforts.

Academic integrity includes a commitment by all members of the University community not to engage in or tolerate acts of falsification, misrepresentation, or deception. Such acts of dishonesty violate the fundamental ethical principles of the University community and compromise the worth of work completed by others.

Disability Accommodation

Penn State welcomes students with disabilities into the University’s educational programs. Every Penn State campus has an office for students with disabilities. The Student Disability Resources (SDR) website provides contact information for every Penn State campus (http://equity.psu.edu/sdr/disability-coordinator). For further information, please visit the Student Disability Resources website (http://equity.psu.edu/sdr).

In order to receive consideration for reasonable accommodations, you must contact the appropriate disability services office at the campus where you are officially enrolled, participate in an intake interview, and provide documentation. See documentation guidelines at (http://equity.psu.edu/sdr/guidelines). If the documentation supports your request for reasonable accommodations, your campus disability services office will provide you with an accommodation letter. Please share this letter with your instructors and discuss the accommodations with them as early as possible. You must follow this process for every semester that you request accommodations.

Psychological Services

Many students at Penn State face personal challenges or have psychological needs that may interfere with their academic progress, social development, or emotional wellbeing. The university offers a variety of confidential services to help you through difficult times, including individual and group counseling, crisis intervention, consultations, online chats, and mental health screenings. These services are provided by staff who welcome all students and embrace a philosophy respectful of clients’ cultural and religious backgrounds, and sensitive to differences in race, ability, gender identity, and sexual orientation.

• Counseling and Psychological Services at University Park (CAPS) (http://studentaffairs.psu.edu/counseling): 814-863-0395

• Penn State Crisis Line (24 hours/7 days/week): 877-229-6400

• Crisis Text Line (24 hours/7 days/week): Text LIONS to 741741

Educational Equity

Penn State takes great pride to foster a diverse and inclusive environment for students, faculty, and staff. Consistent with University Policy AD29, students who believe they have experienced or observed a hate crime, an act of intolerance, discrimination, or harassment that occurs at Penn State are urged to report these incidents as outlined on the University’s Report Bias webpage (http://equity.psu.edu/reportbias).


Background Knowledge

Students will have a variety of backgrounds. Prior to beginning interdisciplinary coursework to fulfill Social Data Analytics degree requirements, including SoDA 501 and 502, students are expected to have advanced (graduate) training in at least one of the component areas of Social Data Analytics, and a familiarity with basic concepts in the others.

With regard to specialization, students are expected to have advanced (graduate) training in ONE of the following:

• quantitative social science methodology and a discipline of social science (as would be the case for a second-year PhD student in Political Science, Sociology, Criminology, Human Development and Family Studies, or Demography), OR

• statistics (as would be the case for a second-year PhD student in Statistics), OR

• information science or informatics (as would be the case for a second-year PhD student in Information Science and Technology, or a second-year PhD student in Geography specializing in GIScience), OR

• computer science (as would be the case for a second-year PhD student in Computer Science and Engineering).

This requirement is met as a matter of meeting home program requirements for students in the dual-title PhD, but may require additional coursework on the part of students in other programs wishing to pursue the graduate minor.

With regard to general preparation, students are expected to have ALL of the following technical knowledge:

• basic programming skills, including basic facility with R &/or Python, AND

• basic knowledge of relational databases &/or geographic information systems, AND

• basic knowledge of probability, applied statistics, &/or social science research design, AND

• basic familiarity with a substantive or theoretical area of social science (e.g., 300-level coursework in political science, sociology, criminology, human development, psychology, economics, communication, anthropology, human geography, social informatics, or similar fields).

It is not unusual for students to have one or more gaps in this preparation at time of application to the SoDA program. Students should work with Social Data Analytics advisers to develop a plan for timely remediation of any deficiencies, which generally will not require formal coursework for students whose training and interests are otherwise appropriate for pursuit of the Social Data Analytics degree. Where possible, this will be addressed at time of application to the Social Data Analytics program.

To this end, some free training materials are linked in the reference section, and there are also abundant free, high-quality, self-paced, course-style training materials on these and related subjects available through edX (http://www.edx.org), Udacity (http://www.udacity.com), Codecademy (http://codecademy.com), and Lynda (http://lynda.psu.edu).


MMDS, Chapter 3 “Finding Similar Items” (to understand “minhashing” and “locality sensitive hashing” you’ll need to understand “hashing” – Section 1.3.2, also discussed in InfoRetrieval, Chapter 3)

• Further reference

Section “Similarity, Distance, Association, ...”

Section “Derived Data Representations ...”

Section “Record Linkage, Entity Resolution, ...”

Section “Clustering, Hashing, Compression, ...”

March 1

• Reading

BitByBit, Ch. 5 (“Creating Mass Collaboration”)

Crowdsourcing, “Introduction” and Ch. 1 “Concepts, Theories, and Cases of Crowdsourcing”

HumanComputation, Chs. 1, 2, 5

• Further reference

Section “Crowdsourcing, Human Computation, Citizen Science, Web Experiments”

Section “Making Up Data (smoothing, convolution, kernels, ...)”

Section “Vision, image, video”

• Semester Projects Must be Approved Before Spring Break

March 8 - Spring Break

March 15

• Daniel DellaPosta (SOC), “Networks and the Mid-20th Century American Mafia”

• Reading

Dan DellaPosta, “Network Closure in the Mid-20th Century American Mafia” (Social Networks, 2017), https://www-sciencedirect-com.ezaccess.libraries.psu.edu/science/article/pii/S037887331630199X and “Between Clique and Corporation: Boundary-Spanning in Solidary Groups” (R&R in American Journal of Sociology, in the Box folder)

Networks, Chs. 1, 2

MMDS, Ch. 5 “Link Analysis”

MarkovVisually

• Further references

Section “Networks and Graphs”


Section “Simulation, resampling, Markov Chains, ...”

DeepLearning (different kind of network)

• 25% Project Review

March 22

• Reading

DeepLearning, Ch. 2 “Linear Algebra”

Shalizi-ADA, Chs. 16-18 (“Principal Components Analysis,” “Factor Models,” “Nonlinear Dimension Reduction”)

• Further reference

Section “Dimensionality reduction, decomposition, ...”

Section “Multiple measures, latent variable measurement”

Section “Linear algebra, matrix computations”

March 29

• Naomi Altman (STAT), “Generalizing PCA”

• Reading

Altman, “Generalizing PCA,” 2015 slides: http://personal.psu.edu/nsa1/AltmanWebpage/PCAToronto.pdf

MapReduceIntuition (7 minute video providing “divide and conquer” intuition to MapReduce)

TidySAC-Video: Hadley Wickham on the tidyverse, “split-apply-combine” with dplyr, and tidy data (1 hour video). (More detail, but perhaps less intuition, in book form: discussion of “tidying” data, Chapter 5 of TidyData-R.)

MMDS, Ch. 2 (Read the Chapter 2 in the new “BETA” version of the book, which also touches on Spark and TensorFlow in Section 2.4, “Extensions to MapReduce”: http://i.stanford.edu/~ullman/mmdsn.html)

• Further reference

Section “Nonlinear dimension reduction, manifold learning”

Section “Databases and data management” (esp. NoSQL)

Section “Theoretically-structured approaches to data wrangling”

Section “Parallelism, MapReduce, Split-Apply-Combine”

Section “Functional programming”

Section “Cutting and bleeding edge ...”

• 50% Project Review

• Exercise 3 (Individual): Due 4/5


April 5

• Prasenjit Mitra (IST), “Classification of Tweets from Disaster Scenarios”

• Reading

Prasenjit Mitra: Rudra et al., “Summarizing Situational and Topical Information During Crises,” https://arxiv.org/pdf/1610.01561.pdf and Imran, Mitra & Castillo, “Twitter as a Lifeline: Human-annotated Twitter Corpora for NLP of Crisis-related Messages,” http://mimran.me/papers/imran_prasenjit_carlos_lrec2016.pdf

NLP (Jurafsky and Martin), Chapters 15 and 16, “Vector Semantics” and “Semantics with Dense Vectors”

• Further reference

Section “Language, text, speech, audio”

GloVe, word2vec, word2vecExplained, t-SNE

Section “Mobile devices, distributed sensors, ...”

Section “Crowdsourcing, human computation, citizen science, ...”

Section “Scaling, iteration, streaming data, online algorithms”

• Exercise 3 Due

April 12

• Reading

BitByBit, Ch. 4 “Running Experiments”; review Ch. 2 sections, “Natural experiments in observable data” examples

CausalInference, Chs. 1-2

• Further reference

Section “Experimental and observational designs for causal inference”

Section “Crowdsourcing, web experiments”

Section “Human subjects ...”

• 75% Project Review

April 19

• Reading

BitByBit, Ch. 6 “Ethics”

PSU-ORP, “Common Rule and Other Changes,” https://www.research.psu.edu/irb/commonrulechanges

DataPrivacy


Reproducibility

Refresh on MachineBias

• Further reference

Section “Ethics and Scientific Responsibility in Big Social Data”

Section “¡Cuidado!”

April 26 - LAST CLASS MEETING

• Team Project Presentations

May 3 - Team Projects Due


Past Visiting Speakers in SoDA 501 and 502

SoDA 502 (Fall 2017)

• Chris Zorn (PLSC)

• Diane Felmlee (SOC)

• Alan MacEachren (GEOG)

• Eric Plutzer (PLSC)

• James LeBreton (PSYCH)

• Sesa Slavkovic (STAT)

• Guido Cervone (GEOG)

• Ashton Verdery (SOC)

• Maggie Niu (STAT)

• Reka Albert (PHYS)

SoDA 501 (Spring 2017)

• Tim Brick (HDFS), “Towards real-time monitoring and intervention using wearable technology”

• Aylin Caliskan (Princeton), “A Story of Discrimination and Unfairness: Bias in Word Embeddings”

• Jay Yonamine (Google - IGERT alum), “Data Science in Industry”

• Johnathan Rush (Illinois), “Geospatial Data Science Workshop”

• Rick Gilmore (PSYCH), “Toward a more reproducible and robust science of human behavior”

• Glenn Firebaugh (SOC), “Measuring Inequality and Segregation with US Census Data”

• Charles Twardy (Sotera), “Data Science for Search and Rescue”

• Anna Smith (Ohio State), “A Hierarchical Model for Network Data in a Latent Hyperbolic Space”

• Rebecca Passonneau (CSE), “Omnigraph: Rich Feature Representation for Graph Kernel Learning”

• Alex Klippel (GEOG), “Virtual Reality for Immersive Analytics”

• Murali Haran (STAT), “A Computationally Efficient Projection-based Approach for Spatial Generalized Mixed Models”

SoDA 502 (Fall 2016)

• Clio Andris (GEOG), “Integrating Social Network Data into GISystems”

• Jia Li (STAT), “Clustering under the Wasserstein Metric”

• Rachel Smith (CAS), “Stigma Networks Perceptions of Sociograms”

• Zita Oravecz (HDFS)

• Bethany Bray (Methodology Center), “Latent Class and Latent Transition Analysis”

• Dave Hunter (STAT), “Model Based Clustering of Large Networks”

• Scott Bennett (PLSC), “ABM Model of Insurgency”

• David Reitter (IST)

• Suzanna Linn (PLSC), “Methodological Issues in Automated Text Analysis: Application to News Coverage of the US Economy”


SoDA 501 (Spring 2016)

• Bruce Desmarais (PLSC), “Learning in the Sunshine: Analysis of Local Government Email Corpora”

• Timothy Brick (HDFS), “Mapping and Manipulating Facial Expression”

• Qunying Huang (USC), “Social Media: An Emerging Data Source for Human Mobility Studies”

• Lingzhou Xue (STAT), “An Introduction to High-Dimensional Graphical Models”

• Ashton Verdery (SOC), “Sampling from Network Data”

• Sarah Battersby (Tableau), “Helping People See and Understand Spatial Data”

• Alexandra Slavkovic (STAT), “Statistical Privacy with Network Data”

• Lee Giles (IST), “Machine Learning for Scholarly Big Data”


Readings and References (updated Spring 2018)

We will discuss a relatively small subset of the readings listed here, and this will vary based on topics and readings discussed by visiting speakers and you yourselves. The remainder are provided here as curated references for more in-depth investigation in both the theory and practice related to the topic (with the latter heavily weighted toward resources in Python and R).

† Material that is, at last check, made available through the Penn State library. Most journal links should work if you are logged in to a Penn State machine or through the Penn State VPN. Some article archives and most books require additional authentication through webaccess. If links are broken, start directly from a search via https://www.libraries.psu.edu. Some books require installation of e-readers like Adobe Digital Editions. Links to lynda.com can be accessed through http://lynda.psu.edu.

‡ Material that is, at last check, legally provided for free. In some cases, these are the preprint versions of published material.

§ Material for which a legal selection is or will be provided through the class Box folder.

Big Data & Social Data Analytics

Overviews

• ‡[CompSocSci] David Lazer, Alex Pentland, Lada Adamic, Sinan Aral, Albert-Laszlo Barabasi, Devon Brewer, Nicholas Christakis, Noshir Contractor, James Fowler, Myron Gutmann, Tony Jebara, Gary King, Michael Macy, Deb Roy, and Marshall Van Alstyne. 2009. “Computational Social Science.” Science 323(5915): 721-3 + Supp. Feb 6. http://science.sciencemag.org/content/323/5915/721.full https://gking.harvard.edu/files/gking/files/LazPenAda09.pdf

• ‡[NRCreport] National Research Council. 2013. Frontiers in Massive Data Analysis. National Academies Press. (Free w/ registration: http://www.nap.edu/catalog.php?record_id=18374) Ch. 1 “Introduction,” Ch. 2 “Massive Data in Science, Technology, Commerce, National Defense, Telecommunications, and other Endeavors”

• ‡[MMDS] Jure Leskovec, Anand Rajaraman, and Jeff Ullman. 2014. Mining of Massive Datasets. Cambridge University Press. http://www.mmds.org (BETA version of Third Edition: http://i.stanford.edu/~ullman/mmdsn.html)

• §[BitByBit] Matthew J. Salganik. 2018 (Forthcoming). Bit by Bit: Social Research in the Digital Age. Princeton University Press. Ch. 1 “Introduction,” Ch. 2 “Observing Behavior”

• DSHandbook-Py, Ch. 1 “Introduction: Becoming a Unicorn,” Ch. 2 “The Data Science Road Map”

Burt-schtick

• ‡[Monroe-5Vs] Burt L. Monroe. 2013. “The Five Vs of Big Data Political Science: Introduction to the Special Issue on Big Data in Political Science.” Political Analysis 21(V5): 1–9. https://doi.org/10.1017/S1047198700014315 (Volume, Velocity, Variety, Vinculation, Validity)

• †[Monroe-No] Burt L. Monroe, Jennifer Pan, Margaret E. Roberts, Maya Sen, and Betsy Sinclair. 2015. “No! Formal Theory, Causal Inference, and Big Data Are Not Contradictory Trends in Political Science.” PS: Political Science & Politics 48(1): 71–4. http://dx.doi.org/10.1017/S1049096514001760

• †[Quinn-Topics] Kevin M. Quinn, Burt L. Monroe, Michael Colaresi, Michael H. Crespin, and Dragomir R. Radev. 2010. “How to Analyze Political Attention with Minimal Assumptions and Costs.” American Journal of Political Science 54(1): 209–28. http://onlinelibrary.wiley.com/doi/10.1111/j.1540-5907.2009.00427.x/full (esp. topic modeling as measurement, approach to validation)

• †[FightinWords] Burt L. Monroe, Michael Colaresi, and Kevin M. Quinn. 2008. “Fightin’ Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict.” Political Analysis 16(4): 372-403. https://doi.org/10.1093/pan/mpn018 (esp. the impact of sampling variance and regularization through priors)

• ‡[BDSS-Census] Big Data Social Science PSU Team. 2012. “A Closer Look at the Kaggle Census Data.” https://burtmonroe.github.io/BDSSKaggleCensus2012 (esp. the relevance of the social processes by which data come to exist as data)

Multidisciplinary Perspectives

• †[Business-BigData] Andrew McAfee and Erik Brynjolfsson. 2012. “Big Data: The Management Revolution.” Harvard Business Review 90(10): 61–8. October. https://hbr.org/2012/10/big-data-the-management-revolution Thomas H. Davenport and D.J. Patil. 2012. “Data Scientist: The Sexiest Job of the 21st Century.” Harvard Business Review 90(10): 70-6. October. https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century

• †[InfoSci-BigData] C.L. Philip Chen and Chun-Yang Zhang. 2014. “Data-intensive Applications, Challenges, Techniques and Technologies: A Survey on Big Data.” Information Sciences 275: 314-47. https://doi.org/10.1016/j.ins.2014.01.015

• †[Informatics-BigData] Vasant G. Honavar. 2014. “The Promise and Potential of Big Data: A Case for Discovery Informatics.” Review of Policy Research 31(4): 326-330. https://doi.org/10.1111/ropr.12080

• ‡[Stats-BigData] Beate Franke, Jean-François Plante, Ribana Roscher, Annie Lee, Cathal Smyth, Armin Hatefi, Fuqi Chen, Einat Gil, Alexander Schwing, Alessandro Selvitella, Michael M. Hoffman, Roger Grosse, Dietrich Hendricks, and Nancy Reid. 2016. “Statistical Inference, Learning and Models in Big Data.” International Statistical Review 84(3): 371-89. http://onlinelibrary.wiley.com/doi/10.1111/insr.12176/full

• †[Econ-BigData] Hal R. Varian. 2013. “Big Data: New Tricks for Econometrics.” Journal of Economic Perspectives 28(2): 3-28. https://doi.org/10.1257/jep.28.2.3

• †[GeoViz-BigData] Alan MacEachren. 2017. “Leveraging Big (Geo) Data with (Geo) Visual Analytics: Place as the Next Frontier.” In Chenghu Zhou, Fenzhen Su, Francis Harvey, and Jun Xu, eds., Spatial Data Handling in the Big Data Era, pp. 139-155. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/chapter/10.1007/978-981-10-4424-3_10

• †[Soc-BigData] David Lazer and Jason Radford. 2017. “Data ex Machina: Introduction to Big Data.” Annual Review of Sociology 43: 19-39. https://doi.org/10.1146/annurev-soc-060116-053457

• §[Politics-BigData] Keith T. Poole, L. Jason Anastasopoulos, and James E. Monogan III. Forthcoming. “The ‘Big Data’ Revolution in Political Campaigning and Governance.” Oxford Bibliographies in Political Science.

¡Cuidado! Traps, Biases, Problems, Pains, Perils

• †[GoogleFlu] David Lazer, Ryan Kennedy, Gary King, and Alessandro Vespignani. 2014. “The Parable of Google Flu: Traps in Big Data Analysis.” Science (343), 14 March. http://gking.harvard.edu/files/gking/files/0314policyforumff.pdf

• ‡[MachineBias] ProPublica. “Machine Bias: Investigating Algorithmic Injustice.” Series. https://www.propublica.org/series/machine-bias See especially:


Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. 2016. “Machine Bias.” ProPublica, May 23. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing

Julia Angwin, Madeleine Varner, and Ariana Tobin. 2017. “Facebook Enabled Advertisers to Reach Jew Haters.” https://www.propublica.org/article/facebook-enabled-advertisers-to-reach-jew-haters

• ‡[Polling2016] Doug Rivers (Nov 11, 2016). “First Thoughts on Polling Problems in the 2016 US Elections.” https://today.yougov.com/news/2016/11/11/first-thoughts-polling-problems-2016-us-elections

• †[EventData] Wei Wang, Ryan Kennedy, David Lazer, and Naren Ramakrishnan. 2016. “Growing Pains for Global Monitoring of Societal Events.” Science 353(6307), pp. 1502–1503. https://doi.org/10.1126/science.aaf6758

• ‡[GoogleBooks] Eitan Adam Pechenick, Christopher M. Danforth, and Peter Sheridan Dodds. 2015. “Characterizing the Google Books Corpus: Strong Limits to Inferences of Socio-Cultural and Linguistic Evolution.” PLoS One. https://doi.org/10.1371/journal.pone.0137041

• ‡[OkCupid] Michael Zimmer. 2016. “OkCupid Study Reveals the Perils of Big-Data Science.” Wired, May 14. https://www.wired.com/2016/05/okcupid-study-reveals-perils-big-data-science

• †[CriticalQuestions] danah boyd and Kate Crawford. 2011. “Critical Questions for Big Data: Provocations for a Cultural, Technological, and Scholarly Phenomenon.” Information, Communication & Society 15(5): 662–79. http://dx.doi.org/10.1080/1369118X.2012.678878

• ‡[RacistBot] Daniel Victor. 2016. “Microsoft created a Twitter bot to learn from users. It quickly became a racist jerk.” New York Times. https://www.nytimes.com/2016/03/25/technology/microsoft-created-a-twitter-bot-to-learn-from-users-it-quickly-became-a-racist-jerk.html

• MMDS, Sections 1.2.2-1.2.3, on Bonferroni

• Also BDSS-Census

Research Design and Measurement

Overviews

• BitByBit

• ‡[ResearchMethodsKB] William M. Trochim. 2006. The Research Methods Knowledge Base. http://www.socialresearchmethods.net/kb

• [7Rules] Glenn Firebaugh. 2008. Seven Rules for Social Research. Princeton University Press.

Measurement, Reliability, and Validity

• ResearchMethodsKB, “Measurement”

• 7Rules, Ch. 3, “Build Reality Checks into Your Research”

• Reliability: see also FightinWords

• Validity: see also Quinn10-Topics, Monroe-5Vs

Indirect, Unobtrusive, Nonreactive Measures; Data Exhaust

• †[UnobtrusiveMeasures] Raymond M. Lee. 2015. “Unobtrusive Measures.” Oxford Bibliographies. https://doi.org/10.1093/OBO/9780199846740-0048 (Canonical cite is Eugene J. Webb, 1966, Unobtrusive Measures, or Webb, Donald T. Campbell, Richard D. Schwartz, Lee Sechrest, 1999, Unobtrusive Measures, rev. ed., Sage.)

• BitByBit, examples in Chapter 2

Multiple Measures, Latent Variable Measurement

• 7Rules, Ch. 4, “Replicate Where Possible”

• †[LatentVariables] David J. Bartholomew, Martin Knott, and Irini Moustaki. 2011. Latent Variable Models and Factor Analysis: A Unified Approach. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10483308 (esp. Ch. 1, “Basic ideas and examples”)

• †[Multivariate-R] Brian Everitt and Torsten Hothorn. 2011. An Introduction to Applied Multivariate Analysis with R. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4419-9650-3

• Shalizi-ADA, Ch. 17, “Factor Models”

• ‡[NetflixPrize] Edwin Chen. 2011. “Winning the Netflix Prize: A Summary.”

• ‡CRAN: https://cran.r-project.org/web/views/Multivariate.html, https://cran.r-project.org/web/views/Psychometrics.html, https://cran.r-project.org/web/views/Cluster.html

Sampling and Survey Design

• †[Sampling] Steven K. Thompson. 2012. Sampling, 3rd ed. https://k8es4mc2l.search.serialssolutions.com/?sid=sersol&SS_jc=TC_024492330&title=Wiley%20Desktop%20Editions%20%3A%20Sampling

• ResearchMethodsKB, “Sampling”

• NRCreport, Ch. 8, “Sampling and Massive Data”

• BitByBit, Ch. 3, “Asking Questions”

• ‡[MSE] Daniel Manrique-Vallier, Megan E. Price, and Anita Gohdes. 2013. “Multiple Systems Estimation Techniques for Estimating Casualties in Armed Conflicts.” In Seybolt et al. (eds.), Counting Civilians. Preprint: http://citeseerx.ist.psu.edu/viewdoc/download?doi=1011469939&rep=rep1&type=pdf

• †[NetworkSampling] Ted Mouw and Ashton M. Verdery. 2012. “Network Sampling with Memory: A Proposal for More Efficient Sampling from Social Networks.” Sociological Methodology 42(1): 206–56. https://dx.doi.org/10.1177%2F0081175012461248

• See also FightinWords (re: hidden heteroskedasticity in sample variance)

• ‡[HashDontSample] Mudit Uppal. 2016. “Probabilistic data structures in the Big data world (+ code)” (re: “Hash, don’t sample”). https://medium.com/@muppal/probabilistic-data-structures-in-the-big-data-world-code-b9387cff0c55

• MMDS: Hash Functions (1.3.2), Sampling in streams (4.2)

• ‡CRAN: https://cran.r-project.org/web/views/OfficialStatistics.html (“Complex Survey Design,” “Small Area Estimation”)

Experimental and Observational Designs for Causal Inference

• ‡[CausalInference] Miguel A. Hernán, James M. Robins. Forthcoming (2017). Causal Inference. Chapman & Hall/CRC. Preprint: https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book. Ch. 1, “A definition of causal effect”; Ch. 2, “Randomized experiments”; Ch. 3, “Observational studies.” (See Ch. 6, “Graphical representation of causal effects,” for integration with Judea Pearl approach.)

• BitByBit, Chapter 4, “Running Experiments”; Ch. 2, natural experiments in observable data, examples

• ResearchMethodsKB, “Design”

• Firebaugh-7Rules, Ch. 2, “Look for Differences that Make a Difference, and Report Them”; §Ch. 5, “Compare Like with Like”; Ch. 6, “Use Panel Data to Study Individual Change and Repeated Cross-Section Data to Study Social Change”

• Shalizi-ADA, Part IV, “Causal Inference”

• Monroe-No

• CRAN: https://cran.r-project.org/web/views/ExperimentalDesign.html

Technologies for Primary and Secondary Data Collection

Mobile Devices, Distributed Sensors, Wearable Sensors, Remote Sensing

• †[RealityMining] Nathan Eagle and Alex (Sandy) Pentland. 2006. “Reality Mining: Sensing Complex Social Systems.” Personal and Ubiquitous Computing 10(4): 255–68. https://doi.org/10.1007/s00779-005-0046-3

• †[QuantifiedSelf] Melanie Swan. 2013. “The Quantified Self: Fundamental Disruption in Big Data Science and Biological Discovery.” Big Data 1(2): 85–99. https://doi.org/10.1089/big.2012.0002

• †[SensorData] Charu C. Aggarwal (Ed.). 2013. Managing and Mining Sensor Data. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-6309-2. Ch. 1, “An Introduction to Sensor Data Analytics”

• †[NightLights] Thushyanthan Baskaran, Brian Min, Yogesh Uppal. 2015. “Election cycles and electricity provision: Evidence from a quasi-experiment with Indian special elections.” Journal of Public Economics 126: 64–73. https://doi.org/10.1016/j.jpubeco.2015.03.011

Crowdsourcing, Human Computation, Citizen Science, Web Experiments

• †[Crowdsourcing] Daren C. Brabham. 2013. Crowdsourcing. MIT Press. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10692208. “Introduction”; Ch. 1, “Concepts, Theories, and Cases of Crowdsourcing”

• BitByBit, Ch. 5, “Creating Mass Collaboration”

• NRCreport, Ch. 9, “Human Interaction with Data”

• †[MTurk] Krista Casler, Lydia Bickel, and Elizabeth Hackett. 2013. “Separate but Equal? A Comparison of Participants and Data Gathered via Amazon’s MTurk, Social Media, and Face-to-Face Behavioral Testing.” Computers in Human Behavior 29(6): 2156–60. http://doi.org/10.1016/j.chb.2013.05.009

• †[LabintheWild] Katharina Reinecke and Krzysztof Z. Gajos. 2015. “LabintheWild: Conducting Large-Scale Online Experiments With Uncompensated Samples.” In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing (CSCW ’15), 1364–1378. http://dx.doi.org/10.1145/2675133.2675246

• †[TweetmentEffects] Kevin Munger. 2017. “Tweetment Effects on the Tweeted: Experimentally Reducing Racist Harassment.” Political Behavior 39(3): 629–49. https://doi.org/10.1007/s11109-016-9373-5

• †[HumanComputation] Edith Law and Luis von Ahn. 2011. Human Computation. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdf/10.2200/S00371ED1V01Y201107AIM013

Open Data, File Formats, APIs, Semantic Web, Linked Data

• DSHandbook-Py, Ch. 12, “Data Encodings and File Formats”

• ‡[OpenData] Open Data Institute. “What Is Open Data?” https://theodi.org/what-is-open-data

• ‡[OpenDataHandbook] Open Knowledge International. The Open Data Handbook. http://opendatahandbook.org. Includes appendix “File Formats”: http://opendatahandbook.org/guide/en/appendices/file-formats

• ‡[APIs] Brian Cooksey. 2016. An Introduction to APIs. https://zapier.com/learn/apis

• ‡[APIMarkets] RapidAPI / Mashape API marketplaces: https://docs.rapidapi.com, https://market.mashape.com; ProgrammableWeb: https://www.programmableweb.com

• †[LinkedData] Tom Heath and Christian Bizer. 2011. Linked Data: Evolving the Web into a Global Data Space. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/abs/10.2200/S00334ED1V01Y201102WBE001. Ch. 1, “Introduction”; Ch. 2, “Principles of Linked Data”

• †[SemanticWeb] Nikolaos Konstantinou and Dimitrios-Emmanuel Spanos. 2015. Materializing the Web of Linked Data. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-16074-0. Ch. 1, “Introduction: Linked Data and the Semantic Web”

• †Morgan & Claypool Synthesis Lectures on the Semantic Web: Theory and Technology. http://www.morganclaypool.com/toc/wbe/1/1

Web Scraping

• †[TheoryDrivenScraping] Richard N. Landers, Robert C. Brusso, Katelyn J. Cavanaugh, and Andrew B. Collmus. 2016. “A Primer on Theory-Driven Web Scraping: Automatic Extraction of Big Data from the Internet for Use in Psychological Research.” Psychological Methods 21(4): 475–492. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1037/met0000081

• ‡[Scraping-Py] Al Sweigart. 2015. Automate the Boring Stuff with Python: Practical Programming for Total Beginners. Ch. 11, “Web Scraping.” https://automatetheboringstuff.com/chapter11

• †[Scraping-R] Simon Munzert, Christian Rubba, Peter Meißner, and Dominic Nyhuis. 2014. Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining. Wiley. http://onlinelibrary.wiley.com/book/10.1002/9781118834732

• ‡CRAN: https://cran.r-project.org/web/views/WebTechnologies.html

Ethics & Scientific Responsibility in Big Social Data

Human Subjects, Consent, Privacy

• BitByBit, Ch. 6 (“Ethics”)

• ‡[PSU-ORP] Penn State Office for Research Protections (under the VP for Research):

Human Subjects Research (IRB): https://www.research.psu.edu/irb

Revised Common Rule: https://www.research.psu.edu/irb/commonrulechanges

Responsible Conduct of Research: https://www.research.psu.edu/education/rcr

Research Misconduct: https://www.research.psu.edu/researchmisconduct

SARI @ PSU (Scientific and Research Integrity Training): https://www.research.psu.edu/training/sari

• ‡[MenloReport] David Dittrich and Erin Kenneally (Center for Applied Internet Data Analysis). 2012. “The Menlo Report: Ethical Principles Guiding Information and Communication Technology Research” and companion report “Applying Ethical Principles ...” https://www.caida.org/publications/papers/2012/menlo_report_actual_formatted

• ‡[BigDataEthics-CBDES] Jacob Metcalf, Emily F. Keller, and danah boyd. 2016. “Perspectives on Big Data, Ethics, and Society.” Council for Big Data, Ethics, and Society. http://bdes.datasociety.net/council-output/perspectives-on-big-data-ethics-and-society

• ‡[AoIRReport] Annette Markham and Elizabeth Buchanan (Association of Internet Researchers). 2012. “Ethical Decision-Making and Internet Research: Recommendations of the AoIR Ethics Working Committee (Version 2.0).” https://aoir.org/reports/ethics2.pdf and guidelines chart: https://aoir.org/wp-content/uploads/2017/01/aoir_ethics_graphic_2016.pdf

• ‡[BigDataEthics-Wired] Sarah Zhang. 2016. “Scientists are Just as Confused about the Ethics of Big-Data Research as You.” Wired. https://www.wired.com/2016/05/scientists-just-confused-ethics-big-data-research

• †[BigDataEthics-HerschelMori] Richard Herschel and Virginia M. Miori. 2017. “Ethics & Big Data.” Technology in Society 49: 31–36. http://doi.org/10.1016/j.techsoc.2017.03.003

The Science of Data Privacy

• †[BigDataPrivacy] Terence Craig and Mary E. Ludloff. 2011. Privacy and Big Data. O’Reilly Media. http://pensu.eblib.com/patron/FullRecord.aspx?p=781814

• ‡[DataAnalysisPrivacy] John Abowd, Lorenzo Alvisi, Cynthia Dwork, Sampath Kannan, Ashwin Machanavajjhala, Jerome Reiter. 2017. “Privacy-Preserving Data Analysis for the Federal Statistical Agencies.” A Computing Community Consortium white paper. https://arxiv.org/abs/1701.00752

• †[DataPrivacy] Stephen E. Fienberg and Aleksandra B. Slavkovic. 2011. “Data Privacy and Confidentiality.” International Encyclopedia of Statistical Science, 342–5. http://doi.org/10.1007/978-3-642-04898-2_202

• ‡[NetworksPrivacy] Vishesh Karwa and Aleksandra Slavkovic. 2016. “Inference using noisy degrees: Differentially private β-model and synthetic graphs.” Annals of Statistics 44(1): 87–112. http://projecteuclid.org/euclid.aos/1449755958

• †[DataPublishingPrivacy] Raymond Chi-Wing Wong and Ada Wai-Chee Fu. 2010. Privacy-Preserving Data Publishing: An Overview. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdfplus/10.2200/S00237ED1V01Y201003DTM002

• DataMatching, Ch. 8, “Privacy Aspects of Data Matching”

Transparency, Reproducibility, and Team Science

• ‡[10RulesforData] Alyssa Goodman, Alberto Pepe, Alexander W. Blocker, Christine L. Borgman, Kyle Cranmer, Merce Crosas, Rosanne Di Stefano, Yolanda Gil, Paul Groth, Margaret Hedstrom, David W. Hogg, Vinay Kashyap, Ashish Mahabal, Aneta Siemiginowska, and Aleksandra Slavkovic (2014). “Ten Simple Rules for the Care and Feeding of Scientific Data.” PLoS Computational Biology 10(4): e1003542. https://doi.org/10.1371/journal.pcbi.1003542 (Note esp. curated resources for reproducible research.)

• †[Transparency] E. Miguel, C. Camerer, K. Casey, J. Cohen, K. M. Esterling, A. Gerber, R. Glennerster, D. P. Green, M. Humphreys, G. Imbens, D. Laitin, T. Madon, L. Nelson, B. A. Nosek, M. Petersen, R. Sedlmayr, J. P. Simmons, U. Simonsohn, M. Van der Laan. 2014. “Promoting Transparency in Social Science Research.” Science 343(6166): 30–1. https://doi.org/10.1126/science.1245317

• †[Reproducibility] Marcus R. Munafò, Brian A. Nosek, Dorothy V. M. Bishop, Katherine S. Button, Christopher D. Chambers, Nathalie Percie du Sert, Uri Simonsohn, Eric-Jan Wagenmakers, Jennifer J. Ware, and John P. A. Ioannidis. 2017. “A Manifesto for Reproducible Science.” Nature Human Behaviour 0021 (2017). https://doi.org/10.1038/s41562-016-0021

• ‡[SoftwareCarpentry] esp. “Lessons”: http://software-carpentry.org/lessons

• DSHandbook-Py, Ch. 9, “Technical Communication and Documentation”; Ch. 15, “Software Engineering Best Practices”

• †[TeamScienceToolkit] Vogel AL, Hall KL, Fiore SM, Klein JT, Bennett LM, Gadlin H, Stokols D, Nebeling LC, Wuchty S, Patrick K, Spotts EL, Pohl C, Riley WT, Falk-Krzesinski HJ. 2013. “The Team Science Toolkit: enhancing research collaboration through online knowledge sharing.” American Journal of Preventive Medicine 45: 787–9. https://doi.org/10.1016/j.amepre.2013.09.001

• §[Databrary] Kara Hall, Robert Croyle, and Amanda Vogel. Forthcoming (2017). Advancing Social and Behavioral Health Research through Cross-disciplinary Team Science. Springer. Includes Rick O. Gilmore and Karen E. Adolph, “Open Sharing of Research Video: Breaking the Boundaries of the Research Team.” (See http://databrary.org)

• CRAN: https://cran.r-project.org/web/views/ReproducibleResearch.html

Social Bias, Fair Algorithms

• MachineBias

• ‡[EmbeddingsBias] Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. “Semantics Derived Automatically from Language Corpora Contain Human Biases.” Science. https://arxiv.org/abs/1608.07187

• ‡[Debiasing] Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, Adam Kalai. 2016. “Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings.” https://arxiv.org/abs/1607.06520

• ‡[AvoidingBias] Moritz Hardt, Eric Price, Nathan Srebro. 2016. “Equality of Opportunity in Supervised Learning.” https://arxiv.org/abs/1610.02413

• ‡[InevitableBias] Jon Kleinberg, Sendhil Mullainathan, Manish Raghavan. 2016. “Inherent Trade-Offs in the Fair Determination of Risk Scores.” https://arxiv.org/abs/1609.05807

Databases and Data Management

• NRCreport, Ch. 3, “Scaling the Infrastructure for Data Management”

• †[SQL] Jan L. Harrington. 2010. SQL Clearly Explained. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780123756978

• DSHandbook-Py, Ch. 14, “Databases”

• †[noSQL] Guy Harrison. 2015. Next Generation Databases: NoSQL, NewSQL, and Big Data. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-1329-2

• †[Cloud] Divyakant Agrawal, Sudipto Das, and Amr El Abbadi. 2012. Data Management in the Cloud: Challenges and Opportunities. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdfplus/10.2200/S00456ED1V01Y201211DTM032

• †Morgan & Claypool Synthesis Lectures on Data Management. http://www.morganclaypool.com/toc/dtm/1/1

Data Wrangling

Theoretically-structured approaches to data wrangling

• ‡[TidyData-R] Garrett Grolemund and Hadley Wickham. 2017. R for Data Science (http://r4ds.had.co.nz), esp. Ch. 12, “Tidy Data”; also Wickham. 2014. “Tidy Data.” Journal of Statistical Software 59(10). http://www.jstatsoft.org/v59/i10/paper. Tools: The tidyverse, http://tidyverse.org

• ‡[DataScience-Py] Jake VanderPlas. 2016. Python Data Science Handbook (https://github.com/jakevdp/PythonDataScienceHandbook), esp. Ch. 3 on pandas, http://pandas.pydata.org

• ‡[DataCarpentry] Colin Gillespie and Robin Lovelace. 2017. Efficient R Programming. https://csgillespie.github.io/efficientR, esp. Ch. 6 on “efficient data carpentry”

• §[Wrangler] Joseph M. Hellerstein, Jeffrey Heer, Tye Rattenbury, and Sean Kandel. 2017. Data Wrangling: Practical Techniques for Data Preparation. Tool: Trifacta Wrangler, http://www.trifacta.com/products/wrangler

Data wrangling practice

• DSHandbook-Py, Ch. 4, “Data Munging: String Manipulation, Regular Expressions, and Data Cleaning”

• †[DataSimplification] Jules J. Berman. 2016. Data Simplification: Taming Information with Open Source Tools. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780128037812

• †[Wrangling-Py] Jacqueline Kazil, Katharine Jarmul. 2016. Data Wrangling with Python. O’Reilly. http://proquestcombo.safaribooksonline.com.ezaccess.libraries.psu.edu/9781491948804

• †[Wrangling-R] Bradley C. Boehmke. 2016. Data Wrangling with R. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45599-0

Record Linkage, Entity Resolution, Deduplication

• †[DataMatching] Peter Christen. 2012. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-642-31164-2 (esp. Ch. 2, “The Data Matching Process”)

• DataSimplification, Chapter 5, “Identifying and Deidentifying Data”

• ‡[EntityResolution] Lise Getoor and Ashwin Machanavajjhala. 2013. “Entity Resolution for Big Data.” Tutorial, KDD. http://www.umiacs.umd.edu/~getoor/Tutorials/ER_KDD2013.pdf

• ‡[SyrianCasualties] Peter Sadosky, Anshumali Shrivastava, Megan Price, and Rebecca C. Steorts. 2015. “Blocking Methods Applied to Casualty Records from the Syrian Conflict.” https://arxiv.org/abs/1510.07714 (For more on blocking, see DataMatching, Ch. 4, “Indexing.”)

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html, “Statistical Matching and Record Linkage”

“Making up data”: Imputation, Smoothers, Kernels, Priors, Filters, Teleportation, Negative Sampling, Convolution, Augmentation, Adversarial Training

• †[Imputation] Yi Deng, Changgee Chang, Moges Seyoum Ido, and Qi Long. 2016. “Multiple Imputation for General Missing Data Patterns in the Presence of High-dimensional Data.” Scientific Reports 6(21689). https://www.nature.com/articles/srep21689

• ‡[Overimputation] Matthew Blackwell, James Honaker, and Gary King. 2017. “A Unified Approach to Measurement Error and Missing Data: Overview and Applications.” Sociological Methods & Research 46(3): 303–341. http://gking.harvard.edu/files/gking/files/measure.pdf

• Shalizi-ADA, Sect. 1.5 (Linear Smoothers), Ch. 8 (Splines), Sect. 14.4 (Kernel Density Estimates); DataScience-Python NB 05.13, “Kernel Density Estimation”

• FightinWords (re: Bayesian priors as additional data, impact of priors on regularization)

• DeepLearning, Section 7.5 (Data Augmentation), 7.12 (Dropout), 7.13 (Adversarial Examples), Chapter 9 (Convolutional Networks)

• †[Adversarial] Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy. 2015. “Explaining and Harnessing Adversarial Examples.” https://arxiv.org/abs/1412.6572

• See also resampling and simulation methods

• See also feature engineering, preprocessing

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html, “Imputation”

(Direct) Data Representations, Data Mappings

• NRCreport, Ch. 5, “Large-Scale Data Representations”

• †[PatternRecognition] M. Narasimha Murty and V. Susheela Devi. 2011. Pattern Recognition: An Algorithmic Approach. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-0-85729-495-1. Section 2.1, “Data Structures for Pattern Representation”

• ‡[InfoRetrieval] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2009. Introduction to Information Retrieval. Cambridge University Press. http://nlp.stanford.edu/IR-book. Ch. 1, 2, 6 (also note slides used in their class)

• †[Algorithms] Brian Steele, John Chandler, Swarna Reddy. 2016. Algorithms for Data Science. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45797-0. Ch. 2, “Data Mapping and Data Dictionaries”

• See also Social Data Structures

Similarity, Distance, Association, Covariance, The Kernel Trick

• PatternRecognition, Section 2.3, “Proximity Measures”

• ‡[Similarity] Brendan O’Connor. 2012. “Cosine similarity, Pearson correlation, and OLS coefficients.” https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients

• For relatively comprehensive lists, see also:

M-J. Lesot, M. Rifqi, and H. Benhadda. 2009. “Similarity measures for binary and numerical data: a survey.” Int. J. Knowledge Engineering and Soft Data Paradigms 1(1): 63-. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10112126533&rep=rep1&type=pdf

Seung-Seok Choi, Sung-Hyuk Cha, Charles C. Tappert. 2010. “A Survey of Binary Similarity and Distance Measures.” Journal of Systemics, Cybernetics & Informatics 8(1): 43–8. http://www.iiisci.org/Journal/CV$/sci/pdfs/GS315JG.pdf

Sung-Hyuk Cha. 2007. “Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions.” International Journal of Mathematical Models and Methods in Applied Sciences 4(1): 300–7. http://csis.pace.edu/ctappert/dps/d861-12/session4-p2.pdf

Anna Huang. 2008. “Similarity Measures for Text Document Clustering.” Proceedings of the 6th New Zealand Computer Science Research Student Conference, 49–56. http://www.nzcsrsc08.canterbury.ac.nz/site/proceedings/Individual_Papers/pg049_Similarity_Measures_for_Text_Document_Clustering.pdf

• Michael I. Jordan. “The Kernel Trick” (Lecture Notes). https://people.eecs.berkeley.edu/~jordan/courses/281B-spring04/lectures/lec3.pdf

• Eric Kim. 2017. “Everything You Ever Wanted to Know about the Kernel Trick (But Were Afraid to Ask).” http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick_blog_ekim_12_20_2017.pdf

Derived Data Representations - Dimensionality Reduction, Compression, Decomposition, Embeddings

The groupings here and under the measurement / multivariate statistics section are particularly arbitrary. For example, “k-Means clustering” can be viewed as a technique for “dimensionality reduction,” “compression,” “feature extraction,” “latent variable measurement,” “unsupervised learning,” “collaborative filtering.”
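
To make the overlap between these labels concrete, the following is a minimal sketch, assuming a toy one-dimensional dataset, of k-means (Lloyd's algorithm) read as compression: each value is replaced by the index of its nearest centroid, so the data reduce to a small codebook plus one label per point. The function name kmeans_1d, the data, and the starting centroids are hypothetical and chosen only for illustration; they do not come from any of the listed readings.

```python
# Minimal 1-D k-means (Lloyd's algorithm), standard library only.
# All names and data here are illustrative assumptions.

def kmeans_1d(values, centroids, iterations=10):
    """Return (centroids, labels) after a fixed number of Lloyd iterations."""
    centroids = list(centroids)
    labels = []
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for each value.
        labels = [
            min(range(len(centroids)), key=lambda k: abs(v - centroids[k]))
            for v in values
        ]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [v for v, lab in zip(values, labels) if lab == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = sum(members) / len(members)
    return centroids, labels


values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]   # hypothetical toy data
centroids, labels = kmeans_1d(values, centroids=[0.0, 6.0])

# "Compression" view: store k centroids (the codebook) plus one small
# integer per observation, then reconstruct an approximation on demand.
reconstructed = [centroids[lab] for lab in labels]
```

Read one way, labels is a discrete latent variable (clustering, unsupervised learning); read another way, the pair (centroids, labels) is a lossy encoding of values (vector quantization). The code is the same in both readings.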

Clustering, hashing, quantization, blocking, compression

• ‡[Compression] Khalid Sayood. 2012. Introduction to Data Compression, 4th ed. Morgan Kaufmann. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780124157965 (e.g., coding, blocking via vector quantization)

• MMDS, Ch. 3, “Finding Similar Items” (minhashing, locality sensitive hashing)

• Multivariate-R, Ch. 6, “Clustering”

• DataScience-Python NB 05.11, “k-Means Clustering”; NB 05.12, “Gaussian Mixture Models”

• ‡[KMeansHashing] Kaiming He, Fang Wen, Jian Sun. 2013. “K-means Hashing: An Affinity-Preserving Quantization Method for Learning Binary Compact Codes.” CVPR. https://www.cv-foundation.org/openaccess/content_cvpr_2013/papers/He_K-Means_Hashing_An_2013_CVPR_paper.pdf

• †[Squashing] Madigan, D., Raghavan, N., Dumouchel, W., Nason, M., Posse, C., and Ridgeway, G. (2002). “Likelihood-based data squashing: A modeling approach to instance construction.” Data Mining and Knowledge Discovery 6(2): 173–190. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1023/A:1014095614948

• †[Core-sets] Piotr Indyk, Sepideh Mahabadi, Mohammad Mahdian, Vahab S. Mirrokni. 2014. “Composable core-sets for diversity and coverage maximization.” PODS ’14, 100–8. https://doi.org/10.1145/2594538.2594560

Feature selection, feature extraction, feature engineering, weighting, preprocessing

• PatternRecognition, Sections 2.6–7, “Feature Selection,” “Feature Extraction”

• DSHandbook-Py, Ch. 7, “Interlude: Feature Extraction Ideas”

• DataScience-Python NB 05.04, “Feature Engineering”

• Features in text: NLP and InfoRetrieval re: tf-idf and similar; FightinWords

• Features in images: DataScience-Python NB 05.14, “Image Features”

Dimensionality reduction, decomposition, factorization, change of basis, reparameterization, matrix completion, latent variables, source separation

• MMDS, Ch. 9, “Recommendation Systems”; Ch. 11, “Dimensionality Reduction”

• DSHandbook-Py, Ch. 10, “Unsupervised Learning: Clustering and Dimensionality Reduction”

• ‡[Shalizi-ADA] Cosma Rohilla Shalizi. 2017. Advanced Data Analysis from an Elementary Point of View. Ch. 16 (“Principal Components Analysis”), Ch. 17 (“Factor Models”). http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV

See also Multivariate-R, Ch. 3, “Principal Components Analysis”; Ch. 4, “Multidimensional Scaling”; Ch. 5, “Exploratory Factor Analysis”; Latent

• DeepLearning, Ch. 2, “Linear Algebra”; Ch. 13, “Linear Factor Models”

• ‡[GloVe] Jeffrey Pennington, Richard Socher, Christopher Manning. 2014. “GloVe: Global vectors for word representation.” EMNLP. https://nlp.stanford.edu/projects/glove

• †[NMF] Daniel D. Lee and H. Sebastian Seung. 1999. “Learning the parts of objects by non-negative matrix factorization.” Nature 401: 788–791. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1038/44565

• †[CUR] Michael W. Mahoney and Petros Drineas. 2009. “CUR matrix decompositions for improved data analysis.” PNAS. http://www.pnas.org/content/106/3/697.full

• ‡[ICA] Aapo Hyvärinen and Erkki Oja. 2000. “Independent Component Analysis: Algorithms and Applications.” Neural Networks 13(4-5): 411–30. https://www.cs.helsinki.fi/u/ahyvarin/papers/NN00new.pdf

• [RandomProjection] Ella Bingham and Heikki Mannila. 2001. “Random projection in dimensionality reduction: applications to image and text data.” KDD. https://doi.org/10.1145%2F502512.502546

• Compression, e.g., “Transform coding,” “Wavelets”

Nonlinear dimensionality reduction, Manifold learning

• DeepLearning, Section 5.11.3, “Manifold Learning”

• DataScience-Python NB 05.10, “Manifold Learning” (Locally linear embedding [LLE], Isomap)

• ‡[KernelPCA] Sebastian Raschka. 2014. “Kernel tricks and nonlinear dimensionality reduction via RBF kernel PCA.” http://sebastianraschka.com/Articles/2014_kernel_pca.html

• Shalizi-ADA, Ch. 18, “Nonlinear Dimensionality Reduction” (LLE)

• ‡[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi. 2003. “Laplacian eigenmaps for dimensionality reduction and data representation.” Neural Computation 15(6): 1373–1396. http://web.cse.ohio-state.edu/~belkin.8/papers/LEM_NC_03.pdf

• DeepLearning, Ch. 14 (“Autoencoders”)

• ‡[word2vec] Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” https://arxiv.org/abs/1301.3781

• ‡[word2vecExplained] Yoav Goldberg and Omer Levy. 2014. “word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method.” https://arxiv.org/abs/1402.3722

• ‡[t-SNE] Laurens van der Maaten and Geoffrey Hinton. 2008. “Visualizing Data using t-SNE.” Journal of Machine Learning Research 9: 2579–2605. http://jmlr.csail.mit.edu/papers/volume9/vandermaaten08a/vandermaaten08a.pdf

Computation and Scaling Up

Scientific computing, computation at scale

• NRCreport, Ch. 10, “The Seven Computational Giants of Massive Data Analysis”

• DSHandbook-Py, Ch. 21, “Performance and Computer Memory”; Ch. 22, “Computer Memory and Data Structures”

• ‡[ComputationalStatistics-Python] Cliburn Chan. Computational Statistics in Python. https://people.duke.edu/~ccc14/sta-663/index.html

• CRAN Task View, High Performance Computing: https://cran.r-project.org/web/views/HighPerformanceComputing.html

Numerical computing

• [NumericalComputing] Ward Cheney and David Kincaid. 2013. Numerical Mathematics and Computing, 7th ed. Brooks/Cole, Cengage Learning.

• CRAN Task View, Numerical Mathematics: https://cran.r-project.org/web/views/NumericalMathematics.html

Optimization (e.g., MLE, gradient descent, stochastic gradient descent, EM algorithm, neural nets)

• DSHandbook-Py, Ch. 23, “Maximum Likelihood Estimation and Optimization” (gradient descent)

• DeepLearning, Ch. 4, “Numerical Computation”; Ch. 6, “Deep Feedforward Networks”; Ch. 8, “Optimization for Training Deep Models”

• CRAN Task View, Optimization and Mathematical Programming: https://cran.r-project.org/web/views/Optimization.html

Linear algebra, matrix computations

• [MatrixComputations] Gene H. Golub and Charles F. Van Loan. 2013. Matrix Computations, 4th ed. Johns Hopkins University Press.

• ‡[NetflixMatrix] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. “Matrix Factorization Techniques for Recommender Systems.” Computer, August: 42–9. https://datajobs.com/data-science-repo/Recommender-Systems-[Netflix].pdf

• ‡[BigDataPCA] Jianqing Fan, Qiang Sun, Wen-Xin Zhou, Ziwei Zhu. “Principal Component Analysis for Big Data.” http://www.princeton.edu/~ziweiz/pca.pdf

• ‡[RandomSVD] Andrew Tulloch. 2009. “Fast Randomized SVD.” https://research.fb.com/fast-randomized-svd

• ‡[FactorSGD] Rainer Gemulla, Peter J. Haas, Erik Nijkamp, and Yannis Sismanis. 2011. “Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent.” KDD. http://www.cs.utah.edu/~hari/teaching/bigdata/gemulla11dsgd.pdf

• ‡[Sparse] Max Grossman. 2015. “101 Ways to Store a Sparse Matrix.” https://medium.com/@jmaxg3/101-ways-to-store-a-sparse-matrix-c7f2bf15a229

• ‡[DontInvert] John D. Cook. 2010. “Don’t Invert that Matrix.” https://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix

Simulation-based inference, resampling, Monte Carlo methods, MCMC, Bayes, approximate inference

• ‡[MarkovVisually] Victor Powell. “Markov Chains Explained Visually.” http://setosa.io/ev/markov-chains

• DSHandbook-Py, Ch. 25, “Stochastic Modeling” (Markov chains, MCMC, HMM)

• Bayes: ComputationalStatistics-Py

• DeepLearning, Ch. 17, “Monte Carlo Methods”; Ch. 19, “Approximate Inference”

• ‡[VariationalInference] Jason Eisner. 2011. “High-Level Explanation of Variational Inference.” https://www.cs.jhu.edu/~jason/tutorials/variational.html

• CRAN Task View, Bayesian Inference: https://cran.r-project.org/web/views/Bayesian.html

Parallelism, MapReduce, Split-Apply-Combine

• ‡[MapReduceIntuition] Jigsaw Academy. 2014. “Big Data Specialist: MapReduce.” https://www.youtube.com/watch?v=TwcYQzFqg-8&feature=youtu.be

• MMDS, Ch. 2, “Map-Reduce and the New Software Stack” (3rd Edition discusses Spark & TensorFlow)

• Algorithms, Ch. 3, “Scalable Algorithms and Associative Statistics”; Ch. 4, “Hadoop and MapReduce”

• NRCreport, Ch. 6, “Resources, Trade-offs, and Limitations”

• TidyData-R (Split-apply-combine is the motivating principle behind the “tidyverse” approach). See also Part III, “Program” (pipes, functions, vectors, iteration)

• ‡[TidySAC-Video] Hadley Wickham. 2017. “Data Science in the Tidyverse.” https://www.rstudio.com/resources/videos/data-science-in-the-tidyverse

• See also FactorSGD
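
For readers new to the split-apply-combine pattern named in this section, here is a minimal sketch in plain Python, assuming a small list of dictionaries as the data. The function name split_apply_combine and the records are hypothetical illustrations, not code from the tidyverse or from any listed reading. The same shape (group by key, then aggregate each group independently) is what makes the pattern easy to parallelize in MapReduce-style systems.

```python
from collections import defaultdict


def split_apply_combine(records, key, value, func):
    """Group records by `key`, apply `func` to each group's values, combine into a dict."""
    groups = defaultdict(list)
    for rec in records:
        # Split: route each record's value to the bucket for its key.
        groups[rec[key]].append(rec[value])
    # Apply + combine: one aggregate per group, gathered into a single result.
    return {k: func(vs) for k, vs in groups.items()}


# Hypothetical toy records.
records = [
    {"group": "a", "x": 1},
    {"group": "b", "x": 2},
    {"group": "a", "x": 3},
    {"group": "b", "x": 4},
]

totals = split_apply_combine(records, key="group", value="x", func=sum)
maxima = split_apply_combine(records, key="group", value="x", func=max)
```

Because each group is aggregated without reference to the others, the apply step could in principle run on separate workers, one per key.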

Functional Programming

• ComputationalStatistics-Python, “Functions are first class objects” through first exercises

• DSHandbook-Py, Ch. 20, “Programming Language Concepts”

• See also Haskell

Scaling iteration, streaming data, online algorithms (Spark)

• NRCreport, Ch. 4, “Temporal Data and Real-Time Algorithms”

• MMDS, Ch. 4, “Mining Data Streams”

• ‡[BDAS] AMPLab. BDAS: The Berkeley Data Analytics Stack. https://amplab.cs.berkeley.edu/software

• †[Spark] Mohammed Guller. 2015. Big Data Analytics with Spark: A Practitioner’s Guide to Using Spark for Large-Scale Data Processing, Machine Learning, and Graph Analytics, and High-Velocity Data Stream Processing. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-0964-6

• DSHandbook-Py, Ch. 13, “Big Data”

• DeepLearning, Ch. 10, “Sequence Modeling: Recurrent and Recursive Nets”

General Resources for Python and R


• †[DSHandbook-Py] Field Cady. 2017. The Data Science Handbook (Python-based). Wiley. http://onlinelibrary.wiley.com.ezaccess.libraries.psu.edu/book/10.1002/9781119092919

• ‡[Tutorials-R] Ujjwal Karn. 2017. “A curated list of R tutorials for Data Science, NLP and Machine Learning.” https://github.com/ujjwalkarn/DataScienceR

• ‡[Tutorials-Python] Ujjwal Karn. 2017. “A curated list of Python tutorials for Data Science, NLP and Machine Learning.” https://github.com/ujjwalkarn/DataSciencePython

Cutting and Bleeding Edge of Data Science Languages

• [Scala] †Vishal Layka and David Pollak. 2015. Beginning Scala. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-0232-6. †Patrick R. Nicolas. 2014. Scala for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1901910. ‡Scala site: https://www.scala-lang.org (See also )

• [Julia] †Ivo Balbaert. 2015. Getting Started with Julia. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1973847. ‡Douglas Bates. 2013. “Julia for R Programmers.” http://www.stat.wisc.edu/~bates/JuliaForRProgrammers.pdf. ‡Julia site: https://julialang.org

• [Haskell] †Richard Bird. 2014. Thinking Functionally with Haskell. https://doi-org.ezaccess.libraries.psu.edu/10.1017/CBO9781316092415. †Hakim Cassimally. 2017. Learning Haskell Programming. https://www.lynda.com/Haskell-tutorials/Learning-Haskell-Programming/604926-2.html. †James Church. 2017. Learning Haskell for Data Analysis. https://www.lynda.com/Developer-tutorials/Learning-Haskell-Data-Analysis/604234-2.html. Haskell site: https://www.haskell.org

• [Clojure] †Mark McDonnell. 2017. Quick Clojure: Essential Functional Programming. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2952-1. †Akhil Wali. 2014. Clojure for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1674848. †Arthur Ulfeldt. 2015. https://www.lynda.com/Clojure-tutorials/Up-Running-Clojure/413127-2.html. ‡Clojure site: https://clojure.org

• [TensorFlow] ‡Abadi et al. (Google). 2016. “TensorFlow: A System for Large-Scale Machine Learning.” https://arxiv.org/abs/1605.08695. ‡Udacity. “Deep Learning.” https://www.udacity.com/course/deep-learning--ud730. ‡TensorFlow site: https://www.tensorflow.org

• [H2O] Darren Cook. 2016. Practical Machine Learning with H2O: Powerful, Scalable Techniques for Deep Learning and AI. O’Reilly. Arno Candel and Viraj Parmar. 2015. Deep Learning with H2O. ‡H2O site: https://www.h2o.ai

Social Data Structures

Space and Time

• †[GIA] David O’Sullivan and David J. Unwin. 2010. Geographic Information Analysis, Second Edition. John Wiley & Sons. http://onlinelibrary.wiley.com/book/10.1002/9780470549094

• [Space-Time] Donna J. Peuquet. 2003. Representations of Space and Time. Guilford.

• Roger S. Bivand, Edzer Pebesma, Virgilio Gómez-Rubio. 2013. Applied Spatial Data Analysis in R. Free through library: http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-7618-4. See also Edzer Pebesma. 2016. “Handling and Analyzing Spatial, Spatiotemporal and Movement Data in R.” https://edzer.github.io/UseR2016/

• ‡[PySAL] Sergio J. Rey and Dani Arribas-Bel. 2016. “Geographic Data Science with PySAL and the pydata Stack.” http://darribas.org/gds_scipy16/


• DataScience-Python NB 04.13 “Geographic Data with Basemap”

• Time: “Longitudinal” intra-individual data, as common in developmental psychology. †[Longitudinal] Garrett Fitzmaurice, Marie Davidian, Geert Verbeke, and Geert Molenberghs, editors. 2009. Longitudinal Data Analysis. Chapman & Hall/CRC. http://pensu.eblib.com/patron/FullRecord.aspx?p=359998

• Time: “Time Series, Time Series Cross Section, Panel,” as common in economics, political science, and sociology. DataScience-Python Ch. 3 re: pandas; DSHandbook-Py Ch. 17 “Time Series Analysis”; Econometrics; GelmanHill

• Time: “Sequential, Data Streams,” as in NLP, machine learning. Spark and other “streaming data” readings

• Space-Time: Data with continuity, connectivity, neighborhood structure. See DeepLearning Chapter 9 re: convolution

• CRAN Task Views: Spatial, https://cran.r-project.org/web/views/Spatial.html; Spatiotemporal, https://cran.r-project.org/web/views/SpatioTemporal.html; Time Series, https://cran.r-project.org/web/views/TimeSeries.html

Network Graphs

• ‡[Networks] David Easley and Jon Kleinberg. 2010. Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press. http://www.cs.cornell.edu/home/kleinber/networks-book/

• MMDS Ch. 5 “Link Analysis,” Ch. 10 “Mining Social-Network Graphs”

• †[Networks-R] Douglas Luke. 2015. A User’s Guide to Network Analysis in R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-3-319-23883-8

• †[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry. 2017. Python for Graph and Network Analysis. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-53004-8

Hierarchy, Clustered Data, Aggregation, Mixed Models

• †[GelmanHill] Andrew Gelman and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press. http://pensu.eblib.com/patron/FullRecord.aspx?p=288457

• †[MixedModels] Eugene Demidenko. 2013. Mixed Models: Theory and Applications with R, 2nd ed. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10748641

Social Data Channels

Language, Text, Speech, Audio

• ‡[NLP] Daniel Jurafsky and James H. Martin. Forthcoming (2017). Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics, 3rd ed. Prentice-Hall. Preprint: https://web.stanford.edu/~jurafsky/slp3/

• InfoRetrieval

• ‡[CoreNLP] Stanford CoreNLP. https://stanfordnlp.github.io/CoreNLP/


• †[TextAsData] Grimmer and Stewart. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis. https://doi.org/10.1093/pan/mps028

• Quinn-Topics

• FightinWords

• †[NLTK-Python] Jacob Perkins. 2010. Python Text Processing with NLTK 2.0 Cookbook. Packt. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10435387. †[TextAnalytics-Python] Dipanjan Sarkar. 2016. Text Analytics with Python. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2388-8

• DSHandbook-Py Ch. 16 “Natural Language Processing”

• Matthew J. Denny. http://www.mjdenny.com/Text_Processing_In_R.html

• CRAN: Natural Language Processing. https://cran.r-project.org/web/views/NaturalLanguageProcessing.html

• †[AudioData] Dean Knox and Christopher Lucas. 2017. “The Speaker-Affect Model: Measuring Emotion in Political Speech with Audio Data.” http://christopherlucas.org/files/PDFs/sam.pdf; https://www.youtube.com/watch?v=Hs8A9dwkMzI

• †Morgan & Claypool Synthesis Lectures on Human Language Technologies. http://www.morganclaypool.com/toc/hlt/1/1

• †Morgan & Claypool Synthesis Lectures on Speech and Audio Processing. http://www.morganclaypool.com/toc/sap/1/1

Vision, Image, Video

• ‡[ComputerVision] Richard Szeliski. 2010. Computer Vision: Algorithms and Applications. Springer. http://szeliski.org/Book/

• MixedModels Ch. 11 “Statistical Analysis of Shape,” Ch. 12 “Statistical Image Analysis”

• Audio/Video/Volumetric: See DeepLearning Chapter 9 re: convolution

• †Morgan & Claypool Synthesis Lectures on Image, Video, and Multimedia Processing. http://www.morganclaypool.com/toc/ivm/1/1

• †Morgan & Claypool Synthesis Lectures on Computer Vision. http://www.morganclaypool.com/toc/cov/1/1

Approaches to Learning from Data (The Analytics Layer)

• NRCreport Ch. 7 “Building Models from Massive Data”

• MMDS (data-mining)

• CausalInference

• ‡[VisualAnalytics] Daniel Keim, Jörn Kohlhammer, Geoffrey Ellis, and Florian Mansmann, editors. 2010. Mastering the Information Age: Solving Problems with Visual Analytics. The Eurographics Association, Goslar, Germany. http://www.vismaster.eu/book/. See also †Morgan & Claypool Synthesis Lectures on Visualization. http://www.morganclaypool.com/toc/vis/2/1

• ‡[Econometrics] Bruce E. Hansen. 2014. Econometrics. http://www.ssc.wisc.edu/~bhansen/econometrics/


• ‡[StatisticalLearning] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning with Applications in R. Springer. http://www-bcf.usc.edu/~gareth/ISL/index.html

• [MachineLearning] Christopher Bishop. 2006. Pattern Recognition and Machine Learning. Springer.

• †[Bayes] Peter D. Hoff. 2009. A First Course in Bayesian Statistical Methods. Springer. http://link.springer.com/book/10.1007%2F978-0-387-92407-6

• †[GraphicalModels] Søren Højsgaard, David Edwards, Steffen Lauritzen. 2012. Graphical Models with R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-2299-0

• [InformationTheory] David MacKay. 2003. Information Theory, Inference, and Learning Algorithms. Cambridge University Press.

• ‡[DeepLearning] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org (see also Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. “Deep Learning.” Nature 521(7553): 436–444. https://doi.org/10.1038/nature14539)

See also ‡[Keras] Keras: The Python Deep Learning Library. https://keras.io. François Chollet, Directory of Keras tutorials: https://github.com/fchollet/keras-resources

• †[SignalProcessing] Jose Maria Giron-Sierra. 2013. Digital Signal Processing with MATLAB Examples (Volumes 1-3). Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-981-10-2534-1


Penn State Policy Statements

Academic Integrity

Academic integrity is the pursuit of scholarly activity in an open, honest and responsible manner. Academic integrity is a basic guiding principle for all academic activity at The Pennsylvania State University, and all members of the University community are expected to act in accordance with this principle. Consistent with this expectation, the University’s Code of Conduct states that all students should act with personal integrity, respect other students’ dignity, rights and property, and help create and maintain an environment in which all can succeed through the fruits of their efforts.

Academic integrity includes a commitment by all members of the University community not to engage in or tolerate acts of falsification, misrepresentation or deception. Such acts of dishonesty violate the fundamental ethical principles of the University community and compromise the worth of work completed by others.

Disability Accommodation

Penn State welcomes students with disabilities into the University’s educational programs. Every Penn State campus has an office for students with disabilities. The Student Disability Resources (SDR) website provides contact information for every Penn State campus (http://equity.psu.edu/sdr/disability-coordinator). For further information, please visit the Student Disability Resources website (http://equity.psu.edu/sdr).

In order to receive consideration for reasonable accommodations, you must contact the appropriate disability services office at the campus where you are officially enrolled, participate in an intake interview, and provide documentation. See documentation guidelines at (http://equity.psu.edu/sdr/guidelines). If the documentation supports your request for reasonable accommodations, your campus disability services office will provide you with an accommodation letter. Please share this letter with your instructors and discuss the accommodations with them as early as possible. You must follow this process for every semester that you request accommodations.

Psychological Services

Many students at Penn State face personal challenges or have psychological needs that may interfere with their academic progress, social development, or emotional wellbeing. The university offers a variety of confidential services to help you through difficult times, including individual and group counseling, crisis intervention, consultations, online chats, and mental health screenings. These services are provided by staff who welcome all students and embrace a philosophy respectful of clients’ cultural and religious backgrounds, and sensitive to differences in race, ability, gender identity and sexual orientation.

• Counseling and Psychological Services at University Park (CAPS) (http://studentaffairs.psu.edu/counseling/): 814-863-0395

• Penn State Crisis Line (24 hours/7 days/week): 877-229-6400

• Crisis Text Line (24 hours/7 days/week): Text LIONS to 741741

Educational Equity

Penn State takes great pride to foster a diverse and inclusive environment for students, faculty, and staff. Consistent with University Policy AD29, students who believe they have experienced or observed a hate crime, an act of intolerance, discrimination, or harassment that occurs at Penn State are urged to report these incidents as outlined on the University’s Report Bias webpage (http://equity.psu.edu/reportbias/).


Background Knowledge

Students will have a variety of backgrounds. Prior to beginning interdisciplinary coursework to fulfill Social Data Analytics degree requirements, including SoDA 501 and 502, students are expected to have advanced (graduate) training in at least one of the component areas of Social Data Analytics and a familiarity with basic concepts in the others.

With regard to specialization, students are expected to have advanced (graduate) training in ONE of the following:

• quantitative social science methodology and a discipline of social science (as would be the case for a second-year PhD student in Political Science, Sociology, Criminology, Human Development and Family Studies, or Demography), OR

• statistics (as would be the case for a second-year PhD student in Statistics), OR

• information science or informatics (as would be the case for a second-year PhD student in Information Science and Technology, or a second-year PhD student in Geography specializing in GIScience), OR

• computer science (as would be the case for a second-year PhD student in Computer Science and Engineering).

This requirement is met as a matter of meeting home program requirements for students in the dual-title PhD, but may require additional coursework on the part of students in other programs wishing to pursue the graduate minor.

With regard to general preparation, students are expected to have ALL of the following technical knowledge:

• basic programming skills, including basic facility with R and/or Python, AND

• basic knowledge of relational databases and/or geographic information systems, AND

• basic knowledge of probability, applied statistics, and/or social science research design, AND

• basic familiarity with a substantive or theoretical area of social science (e.g., 300-level coursework in political science, sociology, criminology, human development, psychology, economics, communication, anthropology, human geography, social informatics, or similar fields).

It is not unusual for students to have one or more gaps in this preparation at time of application to the SoDA program. Students should work with Social Data Analytics advisers to develop a plan for timely remediation of any deficiencies, which generally will not require formal coursework for students whose training and interests are otherwise appropriate for pursuit of the Social Data Analytics degree. Where possible, this will be addressed at time of application to the Social Data Analytics program.

To this end, some free training materials are linked in the reference section, and there are also abundant free, high-quality, self-paced, course-style training materials on these and related subjects available through edX (http://www.edx.org), Udacity (http://www.udacity.com), Codecademy (http://codecademy.com), and Lynda (http://www.lynda.psu).


Section “Simulation, resampling, Markov Chains ...”

DeepLearning (different kind of network)

• 25% Project Review

March 22

• Reading

DeepLearning Ch. 2 “Linear Algebra”

Shalizi-ADA Chs. 16-18 (“Principal Components Analysis,” “Factor Models,” “Nonlinear Dimension Reduction”)

• Further reference

Section “Dimensionality reduction, decomposition ...”

Section “Multiple measures, latent variable measurement”

Section “Linear algebra, matrix computations”

March 29

• Naomi Altman (STAT) “Generalizing PCA”

• Reading

Altman, “Generalizing PCA,” 2015 slides. http://personal.psu.edu/nsa1/AltmanWebpage/PCAToronto.pdf

MapReduceIntuition (7 minute video providing “divide and conquer” intuition to MapReduce)

TidySAC-Video: Hadley Wickham on the tidyverse, “split-apply-combine” with dplyr, and tidy data (1 hour video). (More detail, but perhaps less intuition, in book form: discussion of “tidying” data, Chapter 5 of TidyData-R)

MMDS Ch. 2 (Read the Chapter 2 in the new “BETA” version of the book, which also touches on Spark and TensorFlow in section 2.4 “Extensions to MapReduce.” http://i.stanford.edu/~ullman/mmdsn.html)

• Further reference

Section “Nonlinear dimension reduction, manifold learning”

Section “Databases and data management” (esp. NoSQL)

Section “Theoretically-structured approaches to data wrangling”

Section “Parallelism, MapReduce, Split-Apply-Combine”

Section “Functional programming”

Section “Cutting and bleeding edge ...”

• 50% Project Review

• Exercise 3 (Individual) Due 4/5


April 5

• Prasenjit Mitra (IST) “Classification of Tweets from Disaster Scenarios”

• Reading

Prasenjit Mitra: Rudra et al. “Summarizing Situational and Topical Information During Crises.” https://arxiv.org/pdf/1610.01561.pdf; Imran, Mitra & Castillo. “Twitter as a Lifeline: Human-annotated Twitter Corpora for NLP of Crisis-related Messages.” http://mimran.me/papers/imran_prasenjit_carlos_lrec2016.pdf

NLP (Jurafsky and Martin) Chapters 15 and 16: “Vector Semantics” and “Semantics with Dense Vectors”

• Further reference

Section “Language, text, speech, audio”

GloVe, word2vec, word2vecExplained, t-SNE

Section “Mobile devices, distributed sensors ...”

Section “Crowdsourcing, human computation, citizen science ...”

Section “Scaling, iteration, streaming data, online algorithms”

• Exercise 3 Due

April 12

• Reading

BitByBit Ch. 4 “Running Experiments”; review Ch. 2 sections, “Natural experiments in observable data” examples

CausalInference Chs. 1-2

• Further reference

Section “Experimental and observational designs for causal inference”

Section “Crowdsourcing, web experiments”

Section “Human subjects ...”

• 75% Project Review

April 19

• Reading

BitByBit Ch. 6 “Ethics”

PSU-ORP “Common Rule and Other Changes.” https://www.research.psu.edu/irb/commonrulechanges

DataPrivacy


Reproducibility

Refresh on MachineBias

• Further reference

Section “Ethics and Scientific Responsibility in Big Social Data”

Section “¡Cuidado!”

April 26 - LAST CLASS MEETING

bull Team Project Presentations

May 3 - Team Projects Due


Past Visiting Speakers in SoDA 501 and 502

SoDA 502 (Fall 2017)

• Chris Zorn (PLSC)

• Diane Felmlee (SOC)

• Alan MacEachren (GEOG)

• Eric Plutzer (PLSC)

• James LeBreton (PSYCH)

• Sesa Slavkovic (STAT)

• Guido Cervone (GEOG)

• Ashton Verdery (SOC)

• Maggie Niu (STAT)

• Reka Albert (PHYS)

SoDA 501 (Spring 2017)

• Tim Brick (HDFS) “Towards real-time monitoring and intervention using wearable technology”

• Aylin Caliskan (Princeton) “A Story of Discrimination and Unfairness: Bias in Word Embeddings”

• Jay Yonamine (Google - IGERT alum) “Data Science in Industry”

• Johnathan Rush (Illinois) “Geospatial Data Science Workshop”

• Rick Gilmore (PSYCH) “Toward a more reproducible and robust science of human behavior”

• Glenn Firebaugh (SOC) “Measuring Inequality and Segregation with US Census Data”

• Charles Twardy (Sotera) “Data Science for Search and Rescue”

• Anna Smith (Ohio State) “A Hierarchical Model for Network Data in a Latent Hyperbolic Space”

• Rebecca Passonneau (CSE) “Omnigraph: Rich Feature Representation for Graph Kernel Learning”

• Alex Klippel (GEOG) “Virtual Reality for Immersive Analytics”

• Murali Haran (STAT) “A Computationally Efficient Projection-based Approach for Spatial Generalized Mixed Models”

SoDA 502 (Fall 2016)

• Clio Andris (GEOG) “Integrating Social Network Data into GISystems”

• Jia Li (STAT) “Clustering under the Wasserstein Metric”

• Rachel Smith (CAS) “Stigma Networks Perceptions of Sociograms”

• Zita Oravecz (HDFS)

• Bethany Bray (Methodology Center) “Latent Class and Latent Transition Analysis”

• Dave Hunter (STAT) “Model Based Clustering of Large Networks”

• Scott Bennett (PLSC) “ABM Model of Insurgency”

• David Reitter (IST)

• Suzanna Linn (PLSC) “Methodological Issues in Automated Text Analysis: Application to News Coverage of the US Economy”


SoDA 501 (Spring 2016)

• Bruce Desmarais (PLSC) “Learning in the Sunshine: Analysis of Local Government Email Corpora”

• Timothy Brick (HDFS) “Mapping and Manipulating Facial Expression”

• Qunying Huang (USC) “Social Media: An Emerging Data Source for Human Mobility Studies”

• Lingzhou Xue (STAT) “An Introduction to High-Dimensional Graphical Models”

• Ashton Verdery (SOC) “Sampling from Network Data”

• Sarah Battersby (Tableau) “Helping People See and Understand Spatial Data”

• Alexandra Slavkovic (STAT) “Statistical Privacy with Network Data”

• Lee Giles (IST) “Machine Learning for Scholarly Big Data”


Readings and References (updated Spring 2018)

We will discuss a relatively small subset of the readings listed here, and this will vary based on topics and readings discussed by visiting speakers and you yourselves. The remainder are provided here as curated references for more in-depth investigation in both the theory and practice related to the topic (with the latter heavily weighted toward resources in Python and R).

† Material that is, at last check, made available through the Penn State library. Most journal links should work if you are logged in to a Penn State machine or through the Penn State VPN. Some article archives and most books require additional authentication through webaccess. If links are broken, start directly from a search via https://www.libraries.psu.edu. Some books require installation of e-readers like Adobe Digital Editions. Links to lynda.com can be accessed through http://lynda.psu.edu.

‡ Material that is, at last check, legally provided for free. In some cases, these are the preprint versions of published material.

§ Material for which a legal selection is or will be provided through the class Box folder.

Big Data & Social Data Analytics

Overviews

• ‡[CompSocSci] David Lazer, Alex Pentland, Lada Adamic, Sinan Aral, Albert-Laszlo Barabasi, Devon Brewer, Nicholas Christakis, Noshir Contractor, James Fowler, Myron Gutmann, Tony Jebara, Gary King, Michael Macy, Deb Roy, and Marshall Van Alstyne. 2009. “Computational Social Science.” Science 323(5915): 721-3 + Supp. Feb 6. http://science.sciencemag.org/content/323/5915/721.full; https://gking.harvard.edu/files/gking/files/LazPenAda09.pdf

• ‡[NRCreport] National Research Council. 2013. Frontiers in Massive Data Analysis. National Academies Press. (Free w/ registration: http://www.nap.edu/catalog.php?record_id=18374) Ch. 1 “Introduction,” Ch. 2 “Massive Data in Science, Technology, Commerce, National Defense, Telecommunications, and other Endeavors”

• ‡[MMDS] Jure Leskovec, Anand Rajaraman, and Jeff Ullman. 2014. Mining of Massive Datasets. Cambridge University Press. http://www.mmds.org (BETA version of Third Edition: http://i.stanford.edu/~ullman/mmdsn.html)

• §[BitByBit] Matthew J. Salganik. 2018 (Forthcoming). Bit by Bit: Social Research in the Digital Age. Princeton University Press. Ch. 1 “Introduction,” Ch. 2 “Observing Behavior”

• DSHandbook-Py Ch. 1 “Introduction: Becoming a Unicorn,” Ch. 2 “The Data Science Road Map”

Burt-schtick

• ‡[Monroe-5Vs] Burt L. Monroe. 2013. “The Five Vs of Big Data Political Science: Introduction to the Special Issue on Big Data in Political Science.” Political Analysis 21(V5): 1–9. https://doi.org/10.1017/S1047198700014315 (Volume, Velocity, Variety, Vinculation, Validity)

• †[Monroe-No] Burt L. Monroe, Jennifer Pan, Margaret E. Roberts, Maya Sen, and Betsy Sinclair. 2015. “No! Formal Theory, Causal Inference, and Big Data Are Not Contradictory Trends in Political Science.” PS: Political Science & Politics 48(1): 71–4. http://dx.doi.org/10.1017/S1049096514001760

• †[Quinn-Topics] Kevin M. Quinn, Burt L. Monroe, Michael Colaresi, Michael H. Crespin, and Dragomir R. Radev. 2010. “How to Analyze Political Attention with Minimal Assumptions and Costs.” American Journal of Political Science 54(1): 209–28. http://onlinelibrary.wiley.com/doi/10.1111/j.1540-5907.2009.00427.x/full (esp. topic modeling as measurement, approach to validation)

• †[FightinWords] Burt L. Monroe, Michael Colaresi, and Kevin M. Quinn. 2008. “Fightin’ Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict.” Political Analysis 16(4): 372-403. https://doi.org/10.1093/pan/mpn018 (esp. the impact of sampling variance and regularization through priors)

• ‡[BDSS-Census] Big Data Social Science PSU Team. 2012. “A Closer Look at the Kaggle Census Data.” https://burtmonroe.github.io/BDSSKaggleCensus2012 (esp. the relevance of the social processes by which data come to exist as data)

Multidisciplinary Perspectives

• †[Business-BigData] Andrew McAfee and Erik Brynjolfsson. 2012. “Big Data: The Management Revolution.” Harvard Business Review 90(10): 61–8, October. https://hbr.org/2012/10/big-data-the-management-revolution. Thomas H. Davenport and D.J. Patil. 2012. “Data Scientist: The Sexiest Job of the 21st Century.” Harvard Business Review 90(10): 70-6, October. https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century

• †[InfoSci-BigData] C.L. Philip Chen and Chun-Yang Zhang. 2014. “Data-intensive Applications, Challenges, Techniques and Technologies: A Survey on Big Data.” Information Sciences 275: 314-47. https://doi.org/10.1016/j.ins.2014.01.015

• †[Informatics-BigData] Vasant G. Honavar. 2014. “The Promise and Potential of Big Data: A Case for Discovery Informatics.” Review of Policy Research 31(4): 326-330. https://doi.org/10.1111/ropr.12080

• ‡[Stats-BigData] Beate Franke, Jean-Francois Plante, Ribana Roscher, Annie Lee, Cathal Smyth, Armin Hatefi, Fuqi Chen, Einat Gil, Alexander Schwing, Alessandro Selvitella, Michael M. Hoffman, Roger Grosse, Dietrich Hendricks, and Nancy Reid. 2016. “Statistical Inference, Learning and Models in Big Data.” International Statistical Review 84(3): 371-89. http://onlinelibrary.wiley.com/doi/10.1111/insr.12176/full

• †[Econ-BigData] Hal R. Varian. 2013. “Big Data: New Tricks for Econometrics.” Journal of Economic Perspectives 28(2): 3-28. https://doi.org/10.1257/jep.28.2.3

• †[GeoViz-BigData] Alan MacEachren. 2017. “Leveraging Big (Geo) Data with (Geo) Visual Analytics: Place as the Next Frontier.” In Chenghu Zhou, Fenzhen Su, Francis Harvey, and Jun Xu, eds., Spatial Data Handling in the Big Data Era, pp. 139-155. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/chapter/10.1007/978-981-10-4424-3_10

• †[Soc-BigData] David Lazer and Jason Radford. 2017. “Data ex Machina: Introduction to Big Data.” Annual Review of Sociology 43: 19-39. https://doi.org/10.1146/annurev-soc-060116-053457

• §[Politics-BigData] Keith T. Poole, L. Jason Anastasopoulos, and James E. Monogan III. Forthcoming. “The ‘Big Data’ Revolution in Political Campaigning and Governance.” Oxford Bibliographies in Political Science.

¡Cuidado! Traps, Biases, Problems, Pains, Perils

• †[GoogleFlu] David Lazer, Ryan Kennedy, Gary King, and Alessandro Vespignani. 2014. “The Parable of Google Flu: Traps in Big Data Analysis.” Science (343), 14 March. http://gking.harvard.edu/files/gking/files/0314policyforumff.pdf

• ‡[MachineBias] ProPublica. “Machine Bias: Investigating Algorithmic Injustice.” Series. https://www.propublica.org/series/machine-bias. See especially:


Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. 2016. “Machine Bias.” ProPublica, May 23. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing

Julia Angwin, Madeleine Varner, and Ariana Tobin. 2017. “Facebook Enabled Advertisers to Reach ‘Jew Haters.’” https://www.propublica.org/article/facebook-enabled-advertisers-to-reach-jew-haters

• ‡[Polling2016] Doug Rivers (Nov 11, 2016). “First Thoughts on Polling Problems in the 2016 US Elections.” https://today.yougov.com/news/2016/11/11/first-thoughts-polling-problems-2016-us-elections/

• †[EventData] Wei Wang, Ryan Kennedy, David Lazer, Naren Ramakrishnan. 2016. “Growing Pains for Global Monitoring of Societal Events.” Science 353(6307), pp. 1502–1503. https://doi.org/10.1126/science.aaf6758

• ‡[GoogleBooks] Eitan Adam Pechenick, Christopher M. Danforth, and Peter Sheridan Dodds. 2015. “Characterizing the Google Books Corpus: Strong Limits to Inferences of Socio-Cultural and Linguistic Evolution.” PLoS One. https://doi.org/10.1371/journal.pone.0137041

• ‡[OkCupid] Michael Zimmer. 2016. “OkCupid Study Reveals the Perils of Big-Data Science.” Wired, May 14. https://www.wired.com/2016/05/okcupid-study-reveals-perils-big-data-science/

• †[CriticalQuestions] danah boyd and Kate Crawford. 2011. “Critical Questions for Big Data: Provocations for a Cultural, Technological, and Scholarly Phenomenon.” Information, Communication & Society 15(5): 662–79. http://dx.doi.org/10.1080/1369118X.2012.678878

• ‡[RacistBot] Daniel Victor. 2016. “Microsoft created a Twitter bot to learn from users. It quickly became a racist jerk.” New York Times. https://www.nytimes.com/2016/03/25/technology/microsoft-created-a-twitter-bot-to-learn-from-users-it-quickly-became-a-racist-jerk.html

• MMDS 1.2.2-1.2.3 on Bonferroni

• Also BDSS-Census

Research Design and Measurement

Overviews

• BitByBit

• ‡[ResearchMethodsKB] William M. Trochim. 2006. The Research Methods Knowledge Base. http://www.socialresearchmethods.net/kb/

• [7Rules] Glenn Firebaugh. 2008. Seven Rules for Social Research. Princeton University Press.

Measurement, Reliability, and Validity

• ResearchMethodsKB “Measurement”

• 7Rules Ch. 3 “Build Reality Checks into Your Research”

• Reliability: see also FightinWords

• Validity: see also Quinn10-Topics, Monroe-5Vs

Indirect, Unobtrusive, Nonreactive Measures, Data Exhaust


• †[UnobtrusiveMeasures] Raymond M. Lee. 2015. “Unobtrusive Measures.” Oxford Bibliographies. https://doi.org/10.1093/OBO/9780199846740-0048 (Canonical cite is Eugene J. Webb. 1966. Unobtrusive Measures; or Webb, Donald T. Campbell, Richard D. Schwartz, Lee Sechrest. 1999. Unobtrusive Measures, rev. ed. Sage.)

• BitByBit: Examples in Chapter 2

Multiple Measures, Latent Variable Measurement

• 7Rules Ch. 4 “Replicate Where Possible”

• †[LatentVariables] David J. Bartholomew, Martin Knott, and Irini Moustaki. 2011. Latent Variable Models and Factor Analysis: A Unified Approach. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10483308 (esp. Ch. 1 “Basic ideas and examples”)

• †[Multivariate-R] Brian Everitt and Torsten Hothorn. 2011. An Introduction to Applied Multivariate Analysis with R. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4419-9650-3

• Shalizi-ADA Ch. 17 “Factor Models”

• ‡[NetflixPrize] Edwin Chen. 2011. “Winning the Netflix Prize: A Summary.”

• ‡CRAN: https://cran.r-project.org/web/views/Multivariate.html, https://cran.r-project.org/web/views/Psychometrics.html, https://cran.r-project.org/web/views/Cluster.html

Sampling and Survey Design

• †[Sampling] Steven K. Thompson. 2012. Sampling, 3rd ed. https://k8es4mc2l.search.serialssolutions.com/?sid=sersol&SS_jc=TC_024492330&title=Wiley%20Desktop%20Editions%20%3A%20Sampling

• ResearchMethodsKB, “Sampling”

• NRCreport, Ch. 8, “Sampling and Massive Data”

• BitByBit, Ch. 3, “Asking Questions”

• ‡[MSE] Daniel Manrique-Vallier, Megan E. Price, and Anita Gohdes. 2013. In Seybolt et al. (eds.), Counting Civilians: “Multiple Systems Estimation Techniques for Estimating Casualties in Armed Conflicts.” Preprint: httpciteseerxistpsueduviewdocdownloaddoi=1011469939amprep=rep1amptype=pdf

• †[NetworkSampling] Ted Mouw and Ashton M. Verdery. 2012. “Network Sampling with Memory: A Proposal for More Efficient Sampling from Social Networks.” Sociological Methodology 42(1): 206–56. https://dx.doi.org/10.1177%2F0081175012461248

• See also FightinWords (re: hidden heteroskedasticity in sample variance)

• ‡[HashDontSample] Mudit Uppal. 2016. “Probabilistic data structures in the Big data world (+ code)” (re: “Hash, don’t sample”). https://medium.com/@muppal/probabilistic-data-structures-in-the-big-data-world-code-b9387cff0c55

• MMDS: Hash Functions (1.3.2), Sampling in streams (4.2)

• ‡CRAN: https://cran.r-project.org/web/views/OfficialStatistics.html (“Complex Survey Design”, “Small Area Estimation”)

Experimental and Observational Designs for Causal Inference

• ‡[CausalInference] Miguel A. Hernán, James M. Robins. Forthcoming (2017). Causal Inference. Chapman & Hall/CRC. Preprint: https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book Ch. 1, “A definition of causal effect”; Ch. 2, “Randomized experiments”; Ch. 3, “Observational studies”. (See Ch. 6, “Graphical representation of causal effects,” for integration with Judea Pearl approach.)

• BitByBit, Chapter 4, “Running Experiments”; Ch. 2, Natural experiments in observable data examples

• ResearchMethodsKB, “Design”

• Firebaugh-7Rules, Ch. 2, “Look for Differences that Make a Difference, and Report Them”; §Ch. 5, “Compare Like with Like”; Ch. 6, “Use Panel Data to Study Individual Change and Repeated Cross-Section Data to Study Social Change”

• Shalizi-ADA, Part IV, “Causal Inference”

• Monroe-No

• CRAN: https://cran.r-project.org/web/views/ExperimentalDesign.html

Technologies for Primary and Secondary Data Collection

Mobile Devices, Distributed Sensors, Wearable Sensors, Remote Sensing

• †[RealityMining] Nathan Eagle and Alex (Sandy) Pentland. 2006. “Reality Mining: Sensing Complex Social Systems.” Personal and Ubiquitous Computing 10(4): 255–68. https://doi.org/10.1007/s00779-005-0046-3

• †[QuantifiedSelf] Melanie Swan. 2013. “The Quantified Self: Fundamental Disruption in Big Data Science and Biological Discovery.” Big Data 1(2): 85–99. https://doi.org/10.1089/big.2012.0002

• †[SensorData] Charu C. Aggarwal (Ed.). 2013. Managing and Mining Sensor Data. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-6309-2 Ch. 1, “An Introduction to Sensor Data Analytics”

• †[NightLights] Thushyanthan Baskaran, Brian Min, Yogesh Uppal. 2015. “Election cycles and electricity provision: Evidence from a quasi-experiment with Indian special elections.” Journal of Public Economics 126: 64–73. https://doi.org/10.1016/j.jpubeco.2015.03.011

Crowdsourcing, Human Computation, Citizen Science, Web Experiments

• †[Crowdsourcing] Daren C. Brabham. 2013. Crowdsourcing. MIT Press. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10692208 “Introduction”; Ch. 1, “Concepts, Theories, and Cases of Crowdsourcing”

• BitByBit, Ch. 5, “Creating Mass Collaboration”

• NRCreport, Ch. 9, “Human Interaction with Data”

• †[MTurk] Krista Casler, Lydia Bickel, and Elizabeth Hackett. 2013. “Separate but Equal? A Comparison of Participants and Data Gathered via Amazon’s MTurk, Social Media, and Face-to-Face Behavioral Testing.” Computers in Human Behavior 29(6): 2156–60. http://doi.org/10.1016/j.chb.2013.05.009

• †[LabintheWild] Katharina Reinecke and Krzysztof Z. Gajos. 2015. “LabintheWild: Conducting Large-Scale Online Experiments With Uncompensated Samples.” In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing (CSCW ’15), 1364–1378. http://dx.doi.org/10.1145/2675133.2675246

• †[TweetmentEffects] Kevin Munger. 2017. “Tweetment Effects on the Tweeted: Experimentally Reducing Racist Harassment.” Political Behavior 39(3): 629–49. https://doi.org/10.1007/s11109-016-9373-5

• †[HumanComputation] Edith Law and Luis von Ahn. 2011. Human Computation. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdf/10.2200/S00371ED1V01Y201107AIM013

Open Data, File Formats, APIs, Semantic Web, Linked Data

• DSHandbook-Py, Ch. 12, “Data Encodings and File Formats”

• ‡[OpenData] Open Data Institute. “What Is Open Data?” https://theodi.org/what-is-open-data

• ‡[OpenDataHandbook] Open Knowledge International. The Open Data Handbook. http://opendatahandbook.org Includes appendix “File Formats”: http://opendatahandbook.org/guide/en/appendices/file-formats

• ‡[APIs] Brian Cooksey. 2016. An Introduction to APIs. https://zapier.com/learn/apis

• ‡[APIMarkets] RapidAPI / mashape API marketplaces: https://docs.rapidapi.com, https://market.mashape.com; ProgrammableWeb: https://www.programmableweb.com

• †[LinkedData] Tom Heath and Christian Bizer. 2011. Linked Data: Evolving the Web into a Global Data Space. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/abs/10.2200/S00334ED1V01Y201102WBE001 Ch. 1, “Introduction”; Ch. 2, “Principles of Linked Data”

• †[SemanticWeb] Nikolaos Konstantinou and Dimitrios-Emmanuel Spanos. 2015. Materializing the Web of Linked Data. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-16074-0 Ch. 1, “Introduction: Linked Data and the Semantic Web”

• †Morgan & Claypool Synthesis Lectures on the Semantic Web: Theory and Technology. httpwwwmorganclaypoolcomtocwbe111

Web Scraping

• †[TheoryDrivenScraping] Richard N. Landers, Robert C. Brusso, Katelyn J. Cavanaugh, and Andrew B. Collmus. 2016. “A Primer on Theory-Driven Web Scraping: Automatic Extraction of Big Data from the Internet for Use in Psychological Research.” Psychological Methods 21(4): 475–492. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1037/met0000081

• ‡[Scraping-Py] Al Sweigart. 2015. Automate the Boring Stuff with Python: Practical Programming for Total Beginners. Ch. 11, “Web Scraping.” https://automatetheboringstuff.com/chapter11

• †[Scraping-R] Simon Munzert, Christian Rubba, Peter Meißner, and Dominic Nyhuis. 2014. Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining. Wiley. http://onlinelibrary.wiley.com/book/10.1002/9781118834732

• ‡CRAN: https://cran.r-project.org/web/views/WebTechnologies.html

Ethics & Scientific Responsibility in Big Social Data

Human Subjects, Consent, Privacy

• BitByBit, Ch. 6 (“Ethics”)

• ‡[PSU-ORP] Penn State Office for Research Protections (under the VP for Research)

Human Subjects Research / IRB: https://www.research.psu.edu/irb

Revised Common Rule: https://www.research.psu.edu/irb/commonrulechanges

Responsible Conduct of Research: https://www.research.psu.edu/education/rcr

Research Misconduct: https://www.research.psu.edu/researchmisconduct

SARI@PSU (Scientific and Research Integrity Training): https://www.research.psu.edu/training/sari

• ‡[MenloReport] David Dittrich and Erin Kenneally (Center for Applied Internet Data Analysis). 2012. “The Menlo Report: Ethical Principles Guiding Information and Communication Technology Research” and companion report “Applying Ethical Principles ...” https://www.caida.org/publications/papers/2012/menlo_report_actual_formatted

• ‡[BigDataEthics-CBDES] Jacob Metcalf, Emily F. Keller, and danah boyd. 2016. “Perspectives on Big Data, Ethics, and Society.” Council for Big Data, Ethics, and Society. http://bdes.datasociety.net/council-output/perspectives-on-big-data-ethics-and-society

• ‡[AoIRReport] Annette Markham and Elizabeth Buchanan (Association of Internet Researchers). 2012. “Ethical Decision-Making and Internet Research: Recommendations of the AoIR Ethics Working Committee (Version 2.0).” https://aoir.org/reports/ethics2.pdf and guidelines chart: https://aoir.org/wp-content/uploads/2017/01/aoir_ethics_graphic_2016.pdf

• ‡[BigDataEthics-Wired] Sarah Zhang. 2016. “Scientists are Just as Confused about the Ethics of Big-Data Research as You.” Wired. https://www.wired.com/2016/05/scientists-just-confused-ethics-big-data-research

• †[BigDataEthics-HerschelMori] Richard Herschel and Virginia M. Miori. 2017. “Ethics & Big Data.” Technology in Society 49: 31–36. http://doi.org/10.1016/j.techsoc.2017.03.003

The Science of Data Privacy

• †[BigDataPrivacy] Terence Craig and Mary E. Ludloff. 2011. Privacy and Big Data. O’Reilly Media. http://pensu.eblib.com/patron/FullRecord.aspx?p=781814

• ‡[DataAnalysisPrivacy] John Abowd, Lorenzo Alvisi, Cynthia Dwork, Sampath Kannan, Ashwin Machanavajjhala, Jerome Reiter. 2017. “Privacy-Preserving Data Analysis for the Federal Statistical Agencies.” A Computing Community Consortium white paper. https://arxiv.org/abs/1701.00752

• †[DataPrivacy] Stephen E. Fienberg and Aleksandra B. Slavkovic. 2011. “Data Privacy and Confidentiality.” International Encyclopedia of Statistical Science, 342–5. http://doi.org/10.1007/978-3-642-04898-2_202

• ‡[NetworksPrivacy] Vishesh Karwa and Aleksandra Slavkovic. 2016. “Inference using noisy degrees: Differentially private β-model and synthetic graphs.” Annals of Statistics 44(1): 87–112. http://projecteuclid.org/euclid.aos/1449755958

• †[DataPublishingPrivacy] Raymond Chi-Wing Wong and Ada Wai-Chee Fu. 2010. Privacy-Preserving Data Publishing: An Overview. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdfplus/10.2200/S00237ED1V01Y201003DTM002

• DataMatching, Ch. 8, “Privacy Aspects of Data Matching”

Transparency, Reproducibility, and Team Science

• ‡[10RulesforData] Alyssa Goodman, Alberto Pepe, Alexander W. Blocker, Christine L. Borgman, Kyle Cranmer, Mercè Crosas, Rosanne Di Stefano, Yolanda Gil, Paul Groth, Margaret Hedstrom, David W. Hogg, Vinay Kashyap, Ashish Mahabal, Aneta Siemiginowska, and Aleksandra Slavkovic. 2014. “Ten Simple Rules for the Care and Feeding of Scientific Data.” PLoS Computational Biology 10(4): e1003542. https://doi.org/10.1371/journal.pcbi.1003542 (Note esp. curated resources for reproducible research.)

• †[Transparency] E. Miguel, C. Camerer, K. Casey, J. Cohen, K. M. Esterling, A. Gerber, R. Glennerster, D. P. Green, M. Humphreys, G. Imbens, D. Laitin, T. Madon, L. Nelson, B. A. Nosek, M. Petersen, R. Sedlmayr, J. P. Simmons, U. Simonsohn, M. Van der Laan. 2014. “Promoting Transparency in Social Science Research.” Science 343(6166): 30–1. https://doi.org/10.1126/science.1245317

• †[Reproducibility] Marcus R. Munafò, Brian A. Nosek, Dorothy V. M. Bishop, Katherine S. Button, Christopher D. Chambers, Nathalie Percie du Sert, Uri Simonsohn, Eric-Jan Wagenmakers, Jennifer J. Ware, and John P. A. Ioannidis. 2017. “A Manifesto for Reproducible Science.” Nature Human Behaviour 1: 0021. https://doi.org/10.1038/s41562-016-0021

• ‡[SoftwareCarpentry] esp. “Lessons”: http://software-carpentry.org/lessons

• DSHandbook-Py, Ch. 9, “Technical Communication and Documentation”; Ch. 15, “Software Engineering Best Practices”

• †[TeamScienceToolkit] Vogel AL, Hall KL, Fiore SM, Klein JT, Bennett LM, Gadlin H, Stokols D, Nebeling LC, Wuchty S, Patrick K, Spotts EL, Pohl C, Riley WT, Falk-Krzesinski HJ. 2013. “The Team Science Toolkit: enhancing research collaboration through online knowledge sharing.” American Journal of Preventive Medicine 45: 787–9. https://doi.org/10.1016/j.amepre.2013.09.001

• §[Databrary] Kara Hall, Robert Croyle, and Amanda Vogel. Forthcoming (2017). Advancing Social and Behavioral Health Research through Cross-disciplinary Team Science. Springer. Includes Rick O. Gilmore and Karen E. Adolph, “Open Sharing of Research Video: Breaking the Boundaries of the Research Team.” (See http://databrary.org)

• CRAN: https://cran.r-project.org/web/views/ReproducibleResearch.html

Social Bias, Fair Algorithms

• MachineBias

• ‡[EmbeddingsBias] Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. “Semantics Derived Automatically from Language Corpora Contain Human Biases.” Science. https://arxiv.org/abs/1608.07187

• ‡[Debiasing] Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, Adam Kalai. 2016. “Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings.” https://arxiv.org/abs/1607.06520

• ‡[AvoidingBias] Moritz Hardt, Eric Price, Nathan Srebro. 2016. “Equality of Opportunity in Supervised Learning.” https://arxiv.org/abs/1610.02413

• ‡[InevitableBias] Jon Kleinberg, Sendhil Mullainathan, Manish Raghavan. 2016. “Inherent Trade-Offs in the Fair Determination of Risk Scores.” https://arxiv.org/abs/1609.05807

Databases and Data Management

• NRCreport, Ch. 3, “Scaling the Infrastructure for Data Management”

• †[SQL] Jan L. Harrington. 2010. SQL Clearly Explained. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780123756978

• DSHandbook-Py, Ch. 14, “Databases”

• †[noSQL] Guy Harrison. 2015. Next Generation Databases: NoSQL, NewSQL, and Big Data. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-1329-2

• †[Cloud] Divyakant Agrawal, Sudipto Das, and Amr El Abbadi. 2012. Data Management in the Cloud: Challenges and Opportunities. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdfplus/10.2200/S00456ED1V01Y201211DTM032

• †Morgan & Claypool Synthesis Lectures on Data Management. http://www.morganclaypool.com/toc/dtm/1/1

Data Wrangling

Theoretically-structured approaches to data wrangling

• ‡[TidyData-R] Garrett Grolemund and Hadley Wickham. 2017. R for Data Science (http://r4ds.had.co.nz), esp. Ch. 12, “Tidy Data”; also Wickham. 2014. “Tidy Data.” Journal of Statistical Software 59(10). http://www.jstatsoft.org/v59/i10/paper Tools: The tidyverse, http://tidyverse.org

• ‡[DataScience-Py] Jake VanderPlas. 2016. Python Data Science Handbook (https://github.com/jakevdp/PythonDataScienceHandbook), esp. Ch. 3 on pandas: http://pandas.pydata.org

• ‡[DataCarpentry] Colin Gillespie and Robin Lovelace. 2017. Efficient R Programming. https://csgillespie.github.io/efficientR esp. Ch. 6 on “efficient data carpentry”

• §[Wrangler] Joseph M. Hellerstein, Jeffrey Heer, Tye Rattenbury, and Sean Kandel. 2017. Data Wrangling: Practical Techniques for Data Preparation. Tool: Trifacta Wrangler, http://www.trifacta.com/products/wrangler

Data wrangling practice

• DSHandbook-Py, Ch. 4, “Data Munging: String Manipulation, Regular Expressions, and Data Cleaning”

• †[DataSimplification] Jules J. Berman. 2016. Data Simplification: Taming Information with Open Source Tools. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780128037812

• †[Wrangling-Py] Jacqueline Kazil, Katharine Jarmul. 2016. Data Wrangling with Python. O’Reilly. http://proquestcombo.safaribooksonline.com.ezaccess.libraries.psu.edu/9781491948804

• †[Wrangling-R] Bradley C. Boehmke. 2016. Data Wrangling with R. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45599-0

Record Linkage, Entity Resolution, Deduplication

• †[DataMatching] Peter Christen. 2012. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-642-31164-2 (esp. Ch. 2, “The Data Matching Process”)

• DataSimplification, Chapter 5, “Identifying and Deidentifying Data”

• ‡[EntityResolution] Lise Getoor and Ashwin Machanavajjhala. 2013. “Entity Resolution for Big Data.” Tutorial, KDD. http://www.umiacs.umd.edu/~getoor/Tutorials/ER_KDD2013.pdf

• ‡[SyrianCasualties] Peter Sadosky, Anshumali Shrivastava, Megan Price, and Rebecca C. Steorts. 2015. “Blocking Methods Applied to Casualty Records from the Syrian Conflict.” https://arxiv.org/abs/1510.07714 (For more on blocking, see DataMatching, Ch. 4, “Indexing”.)

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html, “Statistical Matching and Record Linkage”

“Making up data”: Imputation, Smoothers, Kernels, Priors, Filters, Teleportation, Negative Sampling, Convolution, Augmentation, Adversarial Training

• †[Imputation] Yi Deng, Changgee Chang, Moges Seyoum Ido, and Qi Long. 2016. “Multiple Imputation for General Missing Data Patterns in the Presence of High-dimensional Data.” Scientific Reports 6(21689). https://www.nature.com/articles/srep21689

• ‡[Overimputation] Matthew Blackwell, James Honaker, and Gary King. 2017. “A Unified Approach to Measurement Error and Missing Data: Overview and Applications.” Sociological Methods & Research 46(3): 303–341. http://gking.harvard.edu/files/gking/files/measure.pdf

• Shalizi-ADA, Sect. 1.5 (Linear Smoothers), Ch. 8 (Splines), Sect. 14.4 (Kernel Density Estimates); DataScience-Python, NB 05.13, “Kernel Density Estimation”

• FightinWords (re: Bayesian priors as additional data; impact of priors on regularization)

• DeepLearning, Section 7.5 (Data Augmentation), 7.12 (Dropout), 7.13 (Adversarial Examples), Chapter 9 (Convolutional Networks)

• †[Adversarial] Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy. 2015. “Explaining and Harnessing Adversarial Examples.” https://arxiv.org/abs/1412.6572

• See also: resampling and simulation methods

• See also: feature engineering, preprocessing

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html, “Imputation”

(Direct) Data Representations, Data Mappings

• NRCreport, Ch. 5, “Large-Scale Data Representations”

• †[PatternRecognition] M. Narasimha Murty and V. Susheela Devi. 2011. Pattern Recognition: An Algorithmic Approach. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-0-85729-495-1 Section 2.1, “Data Structures for Pattern Representation”

• ‡[InfoRetrieval] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2009. Introduction to Information Retrieval. Cambridge University Press. http://nlp.stanford.edu/IR-book Ch. 1, 2, 6 (also note slides used in their class)

• †[Algorithms] Brian Steele, John Chandler, Swarna Reddy. 2016. Algorithms for Data Science. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45797-0 Ch. 2, “Data Mapping and Data Dictionaries”

• See also: Social Data Structures

Similarity, Distance, Association, Covariance, The Kernel Trick

• PatternRecognition, Section 2.3, “Proximity Measures”

• ‡[Similarity] Brendan O’Connor. 2012. “Cosine similarity, Pearson correlation, and OLS coefficients.” https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients

• For relatively comprehensive lists, see also:

M.-J. Lesot, M. Rifqi, and H. Benhadda. 2009. “Similarity measures for binary and numerical data: a survey.” Int. J. Knowledge Engineering and Soft Data Paradigms 1(1): 63-. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.212.6533&rep=rep1&type=pdf

Seung-Seok Choi, Sung-Hyuk Cha, Charles C. Tappert. 2010. “A Survey of Binary Similarity and Distance Measures.” Journal of Systemics, Cybernetics & Informatics 8(1): 43–8. http://www.iiisci.org/Journal/CV$/sci/pdfs/GS315JG.pdf

Sung-Hyuk Cha. 2007. “Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions.” International Journal of Mathematical Models and Methods in Applied Sciences 4(1): 300–7. http://csis.pace.edu/ctappert/dps/d861-12/session4-p2.pdf

Anna Huang. 2008. “Similarity Measures for Text Document Clustering.” Proceedings of the 6th New Zealand Computer Science Research Student Conference, 49–56. http://www.nzcsrsc08.canterbury.ac.nz/site/proceedings/Individual_Papers/pg049_Similarity_Measures_for_Text_Document_Clustering.pdf

• Michael I. Jordan. “The Kernel Trick” (Lecture Notes). https://people.eecs.berkeley.edu/~jordan/courses/281B-spring04/lectures/lec3.pdf

• Eric Kim. 2017. “Everything You Ever Wanted to Know about the Kernel Trick (But Were Afraid to Ask).” http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick_blog_ekim_12_20_2017.pdf
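As a small illustration of how the measures named in this subsection relate, Pearson correlation can be computed as the cosine similarity of mean-centered vectors. The sketch below is illustrative only: the function names and the toy vectors are assumptions, not taken from any of the readings listed above.

```python
from math import sqrt

def dot(x, y):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(x, y))

def cosine_similarity(x, y):
    """Cosine of the angle between x and y."""
    return dot(x, y) / (sqrt(dot(x, x)) * sqrt(dot(y, y)))

def center(x):
    """Subtract the mean from every element."""
    m = sum(x) / len(x)
    return [a - m for a in x]

def pearson_correlation(x, y):
    """Pearson r, computed as the cosine similarity of the centered vectors."""
    return cosine_similarity(center(x), center(y))

# Toy data (hypothetical): y is an exact linear function of x, y = 2x + 1.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]

r = pearson_correlation(x, y)  # approximately 1: correlation ignores shift and scale
c = cosine_similarity(x, y)    # below 1: cosine similarity is sensitive to the shift
```

The contrast between `r` and `c` is the practical point: the two measures agree only when the vectors are already centered.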

Derived Data Representations - Dimensionality Reduction, Compression, Decomposition, Embeddings

The groupings here and under the measurement / multivariate statistics section are particularly arbitrary. For example, “k-Means clustering” can be viewed as a technique for “dimensionality reduction,” “compression,” “feature extraction,” “latent variable measurement,” “unsupervised learning,” or “collaborative filtering.”
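To make the “compression” reading of k-means concrete, here is a minimal sketch of Lloyd's algorithm in pure Python, used as vector quantization: each point is replaced by the index of its nearest centroid, so the data are stored as a small codebook plus one integer code per point. The function names, the toy data, and the choice of starting centroids are all illustrative assumptions.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)),
               key=lambda j: squared_distance(point, centroids[j]))

def kmeans(points, centroids, n_iter=20):
    """Lloyd's algorithm from the given starting centroids (no random restarts)."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(n_iter):
        labels = [nearest(p, centroids) for p in points]
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                # New centroid is the coordinate-wise mean of the cluster members.
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(old)  # leave an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels

# Hypothetical toy data: two well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.2),
          (5.0, 5.0), (5.2, 4.8), (4.8, 5.2)]

codebook, codes = kmeans(points, centroids=[points[0], points[3]])

# "Decompression": each point is approximated by its cluster's centroid.
reconstructed = [codebook[c] for c in codes]
```

Read this way, `codes` is a one-dimensional discrete representation of two-dimensional data (dimensionality reduction), `codebook` plus `codes` is a lossy encoding (compression), and the code itself can serve as a derived feature or a latent class label.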

Clustering, hashing, quantization, blocking, compression

• ‡[Compression] Khalid Sayood. 2012. Introduction to Data Compression, 4th ed. Morgan Kaufmann. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780124157965 (e.g., coding, blocking via vector quantization)

• MMDS, Ch. 3, “Finding Similar Items” (minhashing, locality-sensitive hashing)

• Multivariate-R, Ch. 6, “Clustering”

• DataScience-Python, NB 05.11, “k-Means Clustering”; NB 05.12, “Gaussian Mixture Models”

• ‡[KMeansHashing] Kaiming He, Fang Wen, Jian Sun. 2013. “K-means Hashing: An Affinity-Preserving Quantization Method for Learning Binary Compact Codes.” CVPR. https://www.cv-foundation.org/openaccess/content_cvpr_2013/papers/He_K-Means_Hashing_An_2013_CVPR_paper.pdf

• †[Squashing] Madigan, D., Raghavan, N., Dumouchel, W., Nason, M., Posse, C., and Ridgeway, G. 2002. “Likelihood-based data squashing: A modeling approach to instance construction.” Data Mining and Knowledge Discovery 6(2): 173–190. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1023/A:1014095614948

• †[Core-sets] Piotr Indyk, Sepideh Mahabadi, Mohammad Mahdian, Vahab S. Mirrokni. 2014. “Composable core-sets for diversity and coverage maximization.” PODS ’14, 100–8. https://doi.org/10.1145/2594538.2594560

Feature selection, feature extraction, feature engineering, weighting, preprocessing

• PatternRecognition, Sections 2.6–7, “Feature Selection, Feature Extraction”

• DSHandbook-Py, Ch. 7, “Interlude: Feature Extraction Ideas”

• DataScience-Python, NB 05.04, “Feature Engineering”

• Features in text: NLP and InfoRetrieval (re: tf-idf and similar), FightinWords

• Features in images: DataScience-Python, NB 05.14, “Image Features”

Dimensionality reduction, decomposition, factorization, change of basis, reparameterization, matrix completion, latent variables, source separation

• MMDS, Ch. 9, “Recommendation Systems”; Ch. 11, “Dimensionality Reduction”

• DSHandbook-Py, Ch. 10, “Unsupervised Learning: Clustering and Dimensionality Reduction”

• ‡[Shalizi-ADA] Cosma Rohilla Shalizi. 2017. Advanced Data Analysis from an Elementary Point of View. Ch. 16 (“Principal Components Analysis”), Ch. 17 (“Factor Models”). http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV

See also Multivariate-R, Ch. 3, “Principal Components Analysis”; Ch. 4, “Multidimensional Scaling”; Ch. 5, “Exploratory Factor Analysis”; Latent

• DeepLearning, Ch. 2, “Linear Algebra”; Ch. 13, “Linear Factor Models”

• ‡[GloVe] Jeffrey Pennington, Richard Socher, Christopher Manning. 2014. “GloVe: Global vectors for word representation.” EMNLP. https://nlp.stanford.edu/projects/glove

• †[NMF] Daniel D. Lee and H. Sebastian Seung. 1999. “Learning the parts of objects by non-negative matrix factorization.” Nature 401: 788–791. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1038/44565

• †[CUR] Michael W. Mahoney and Petros Drineas. 2009. “CUR matrix decompositions for improved data analysis.” PNAS. http://www.pnas.org/content/106/3/697.full

• ‡[ICA] Aapo Hyvärinen and Erkki Oja. 2000. “Independent Component Analysis: Algorithms and Applications.” Neural Networks 13(4-5): 411–30. https://www.cs.helsinki.fi/u/ahyvarin/papers/NN00new.pdf

• [RandomProjection] Ella Bingham and Heikki Mannila. 2001. “Random projection in dimensionality reduction: applications to image and text data.” KDD. https://doi.org/10.1145%2F502512.502546

• Compression, e.g., “Transform coding”, “Wavelets”

Nonlinear dimensionality reduction, Manifold learning

• DeepLearning, Section 5.11.3, “Manifold Learning”

• DataScience-Python, NB 05.10, “Manifold Learning” (Locally linear embedding [LLE], Isomap)

• ‡[KernelPCA] Sebastian Raschka. 2014. “Kernel tricks and nonlinear dimensionality reduction via RBF kernel PCA.” http://sebastianraschka.com/Articles/2014_kernel_pca.html

• Shalizi-ADA, Ch. 18, “Nonlinear Dimensionality Reduction” (LLE)

• ‡[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi. 2003. “Laplacian eigenmaps for dimensionality reduction and data representation.” Neural Computation 15(6): 1373–1396. http://web.cse.ohio-state.edu/~belkin.8/papers/LEM_NC_03.pdf

• DeepLearning, Ch. 14 (“Autoencoders”)

• ‡[word2vec] Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” https://arxiv.org/abs/1301.3781

• ‡[word2vecExplained] Yoav Goldberg and Omer Levy. 2014. “word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method.” https://arxiv.org/abs/1402.3722

• ‡[t-SNE] Laurens van der Maaten and Geoffrey Hinton. 2008. “Visualizing Data using t-SNE.” Journal of Machine Learning Research 9: 2579–2605. http://jmlr.csail.mit.edu/papers/volume9/vandermaaten08a/vandermaaten08a.pdf

Computation and Scaling Up

Scientific computing, computation at scale

• NRCreport, Ch. 10, “The Seven Computational Giants of Massive Data Analysis”

• DSHandbook-Py, Ch. 21, “Performance and Computer Memory”; Ch. 22, “Computer Memory and Data Structures”

• ‡[ComputationalStatistics-Python] Cliburn Chan. Computational Statistics in Python. https://people.duke.edu/~ccc14/sta-663/index.html

• CRAN Task View, High Performance Computing: https://cran.r-project.org/web/views/HighPerformanceComputing.html

Numerical computing

• [NumericalComputing] Ward Cheney and David Kincaid. 2013. Numerical Mathematics and Computing, 7th ed. Brooks/Cole, Cengage Learning.

• CRAN Task View, Numerical Mathematics: https://cran.r-project.org/web/views/NumericalMathematics.html

Optimization (e.g., MLE, gradient descent, stochastic gradient descent, EM algorithm, neural nets)

• DSHandbook-Py, Ch. 23, “Maximum Likelihood Estimation and Optimization” (gradient descent)

• DeepLearning, Ch. 4, “Numerical Computation”; Ch. 6, “Deep Feedforward Networks”; Ch. 8, “Optimization for Training Deep Models”

• CRAN Task View, Optimization and Mathematical Programming: https://cran.r-project.org/web/views/Optimization.html

Linear algebra, matrix computations

• [MatrixComputations] Gene H. Golub and Charles F. Van Loan. 2013. Matrix Computations, 4th ed. Johns Hopkins University Press.

• ‡[NetflixMatrix] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. “Matrix Factorization Techniques for Recommender Systems.” Computer, August, 42–9. https://datajobs.com/data-science-repo/Recommender-Systems-[Netflix].pdf

• ‡[BigDataPCA] Jianqing Fan, Qiang Sun, Wen-Xin Zhou, Ziwei Zhu. “Principal Component Analysis for Big Data.” http://www.princeton.edu/~ziweiz/pca.pdf

• ‡[RandomSVD] Andrew Tulloch. 2009. “Fast Randomized SVD.” https://research.fb.com/fast-randomized-svd

• ‡[FactorSGD] Rainer Gemulla, Peter J. Haas, Erik Nijkamp, and Yannis Sismanis. 2011. “Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent.” KDD. http://www.cs.utah.edu/~hari/teaching/bigdata/gemulla11dsgd.pdf

• ‡[Sparse] Max Grossman. 2015. “101 Ways to Store a Sparse Matrix.” https://medium.com/@jmaxg3/101-ways-to-store-a-sparse-matrix-c7f2bf15a229

• ‡[DontInvert] John D. Cook. 2010. “Don’t Invert that Matrix.” https://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix

Simulation-based inference, resampling, Monte Carlo methods, MCMC, Bayes, approximate inference

• ‡[MarkovVisually] Victor Powell. “Markov Chains Explained Visually.” http://setosa.io/ev/markov-chains

• DSHandbook-Py, Ch. 25, “Stochastic Modeling” (Markov chains, MCMC, HMM)

• Bayes: ComputationalStatistics-Py

• DeepLearning, Ch. 17, “Monte Carlo Methods”; Ch. 19, “Approximate Inference”

• ‡[VariationalInference] Jason Eisner. 2011. “High-Level Explanation of Variational Inference.” https://www.cs.jhu.edu/~jason/tutorials/variational.html

• CRAN Task View, Bayesian Inference: https://cran.r-project.org/web/views/Bayesian.html

Parallelism, MapReduce, Split-Apply-Combine

• ‡[MapReduceIntuition] Jigsaw Academy. 2014. “Big Data Specialist: MapReduce.” https://www.youtube.com/watch?v=TwcYQzFqg-8&feature=youtu.be

• MMDS, Ch. 2, “Map-Reduce and the New Software Stack” (3rd Edition discusses Spark & TensorFlow)

• Algorithms, Ch. 3, “Scalable Algorithms and Associative Statistics”; Ch. 4, “Hadoop and MapReduce”

• NRCreport, Ch. 6, “Resources, Trade-offs, and Limitations”

• TidyData-R (Split-apply-combine is the motivating principle behind the “tidyverse” approach). See also Part III, “Program” (pipes, functions, vectors, iteration).

• ‡[TidySAC-Video] Hadley Wickham. 2017. “Data Science in the Tidyverse.” https://www.rstudio.com/resources/videos/data-science-in-the-tidyverse

• See also FactorSGD
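The split-apply-combine pattern named in this subsection can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the tidyverse or any MapReduce framework: the helper name `split_apply_combine`, its parameters, and the toy records are all hypothetical.

```python
from collections import defaultdict

def split_apply_combine(rows, key, value, func):
    """Group rows by `key`, apply `func` to each group's values, return a dict."""
    groups = defaultdict(list)
    for row in rows:
        # Split: route each row's value to the group named by its key.
        groups[row[key]].append(row[value])
    # Apply `func` to each group, then combine the results into one mapping.
    return {k: func(vals) for k, vals in groups.items()}

# Hypothetical toy data: message counts per user across several records.
rows = [
    {"user": "a", "messages": 2},
    {"user": "b", "messages": 5},
    {"user": "a", "messages": 4},
    {"user": "b", "messages": 1},
    {"user": "c", "messages": 3},
]

totals = split_apply_combine(rows, "user", "messages", sum)
largest = split_apply_combine(rows, "user", "messages", max)
```

MapReduce has the same shape: the map step emits key-value pairs, the shuffle groups them by key, and the reduce step applies a function within each group. Because the groups are independent, the apply step is the part that can be run in parallel.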

Functional Programming

• ComputationalStatistics-Python, “Functions are first class objects” through first exercises

• DSHandbook-Py, Ch. 20, “Programming Language Concepts”

• See also Haskell

Scaling iteration, streaming data, online algorithms (Spark)

• NRCreport, Ch. 4, “Temporal Data and Real-Time Algorithms”

• MMDS, Ch. 4, “Mining Data Streams”

• ‡[BDAS] AMPLab. BDAS, The Berkeley Data Analytics Stack. https://amplab.cs.berkeley.edu/software

• †[Spark] Mohammed Guller. 2015. Big Data Analytics with Spark: A Practitioner’s Guide to Using Spark for Large-Scale Data Processing, Machine Learning, and Graph Analytics, and High-Velocity Data Stream Processing. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-0964-6

• DSHandbook-Py, Ch. 13, “Big Data”

• DeepLearning, Ch. 10, “Sequence Modeling: Recurrent and Recursive Nets”

General Resources for Python and R

• †[DSHandbook-Py] Field Cady. 2017. The Data Science Handbook (Python-based). Wiley. http://onlinelibrary.wiley.com.ezaccess.libraries.psu.edu/book/10.1002/9781119092919

• ‡[Tutorials-R] Ujjwal Karn. 2017. “A curated list of R tutorials for Data Science, NLP, and Machine Learning.” https://github.com/ujjwalkarn/DataScienceR

• ‡[Tutorials-Python] Ujjwal Karn. 2017. “A curated list of Python tutorials for Data Science, NLP, and Machine Learning.” https://github.com/ujjwalkarn/DataSciencePython

Cutting and Bleeding Edge of Data Science Languages

• [Scala] †Vishal Layka and David Pollak. 2015. Beginning Scala. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-0232-6 †Patrick R. Nicolas. 2014. Scala for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1901910 ‡Scala site: https://www.scala-lang.org (See also )

• [Julia] †Ivo Balbaert. 2015. Getting Started with Julia. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1973847 ‡Douglas Bates. 2013. “Julia for R Programmers.” http://www.stat.wisc.edu/~bates/JuliaForRProgrammers.pdf ‡Julia site: https://julialang.org

• [Haskell] †Richard Bird. 2014. Thinking Functionally with Haskell. https://doi-org.ezaccess.libraries.psu.edu/10.1017/CBO9781316092415 †Hakim Cassimally. 2017. Learning Haskell Programming. https://www.lynda.com/Haskell-tutorials/Learning-Haskell-Programming/604926-2.html †James Church. 2017. Learning Haskell for Data Analysis. https://www.lynda.com/Developer-tutorials/Learning-Haskell-Data-Analysis/604234-2.html Haskell site: https://www.haskell.org

• [Clojure] †Mark McDonnell. 2017. Quick Clojure: Essential Functional Programming. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2952-1 †Akhil Wali. 2014. Clojure for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1674848 †Arthur Ulfeldt. 2015. https://www.lynda.com/Clojure-tutorials/Up-Running-Clojure/413127-2.html ‡Clojure site: https://clojure.org

• [TensorFlow] ‡Abadi et al. (Google). 2016. “TensorFlow: A System for Large-Scale Machine Learning.” https://arxiv.org/abs/1605.08695 ‡Udacity, “Deep Learning.” https://www.udacity.com/course/deep-learning--ud730 ‡TensorFlow site: https://www.tensorflow.org

• [H2O] Darren Cook. 2016. Practical Machine Learning with H2O: Powerful, Scalable Techniques for Deep Learning and AI. O’Reilly. Arno Candel and Viraj Parmar. 2015. Deep Learning with H2O. ‡H2O site: https://www.h2o.ai

Social Data Structures

Space and Time

• †[GIA] David O’Sullivan, David J. Unwin. 2010. Geographic Information Analysis, Second Edition. John Wiley & Sons. http://onlinelibrary.wiley.com/book/10.1002/9780470549094

• [Space-Time] Donna J. Peuquet. 2003. Representations of Space and Time. Guilford.

• Roger S. Bivand, Edzer Pebesma, Virgilio Gómez-Rubio. 2013. Applied Spatial Data Analysis in R. Free through library: http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-7618-4 See also Edzer Pebesma. 2016. “Handling and Analyzing Spatial, Spatiotemporal and Movement Data in R.” https://edzer.github.io/UseR2016

• ‡[PySAL] Sergio J. Rey and Dani Arribas-Bel. 2016. “Geographic Data Science with PySAL and the pydata Stack.” http://darribas.org/gds_scipy16

• DataScience-Python, NB 04.13, “Geographic Data with Basemap.”

• Time: “Longitudinal,” intra-individual data, as common in developmental psychology. †[Longitudinal] Garrett Fitzmaurice, Marie Davidian, Geert Verbeke, and Geert Molenberghs, editors. 2009. Longitudinal Data Analysis. Chapman & Hall / CRC. http://pensu.eblib.com/patron/FullRecord.aspx?p=359998

• Time: “Time Series, Time Series Cross Section, Panel,” as common in economics, political science, and sociology. DataScience-Python, Ch. 3 re: pandas; DSHandbook-Py, Ch. 17, “Time Series Analysis”; Econometrics; GelmanHill.

• Time: “Sequential, Data Streams,” as in NLP, machine learning. Spark and other “streaming data” readings.

• Space-Time: Data with continuity, connectivity, neighborhood structure. See DeepLearning, Chapter 9 re: convolution.

• CRAN Task Views: Spatial, https://cran.r-project.org/web/views/Spatial.html Spatiotemporal, https://cran.r-project.org/web/views/SpatioTemporal.html Time Series, https://cran.r-project.org/web/views/TimeSeries.html

Network Graphs

• ‡[Networks] David Easley and Jon Kleinberg. 2010. Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press. http://www.cs.cornell.edu/home/kleinber/networks-book

• MMDS, Ch. 5, “Link Analysis”; Ch. 10, “Mining Social-Network Graphs.”

• †[Networks-R] Douglas Luke. 2015. A User’s Guide to Network Analysis in R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-3-319-23883-8

• †[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry. 2017. Python for Graph and Network Analysis. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-53004-8

Hierarchy, Clustered Data, Aggregation, Mixed Models

• †[GelmanHill] Andrew Gelman and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press. http://pensu.eblib.com/patron/FullRecord.aspx?p=288457

• †[MixedModels] Eugene Demidenko. 2013. Mixed Models: Theory and Applications with R, 2nd ed. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10748641

Social Data Channels

Language, Text, Speech, Audio

• ‡[NLP] Daniel Jurafsky and James H. Martin. Forthcoming (2017). Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics, 3rd ed. Prentice-Hall. Preprint: https://web.stanford.edu/~jurafsky/slp3

• InfoRetrieval

• ‡[CoreNLP] Stanford CoreNLP. https://stanfordnlp.github.io/CoreNLP

• †[TextAsData] Grimmer and Stewart. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis. https://doi.org/10.1093/pan/mps028

• Quinn-Topics

• FightinWords

• †[NLTK-Python] Jacob Perkins. 2010. Python Text Processing with NLTK 2.0 Cookbook. Packt. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10435387 †[TextAnalytics-Python] Dipanjan Sarkar. 2016. Text Analytics with Python. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2388-8

• DSHandbook-Py, Ch. 16, “Natural Language Processing.”

• Matthew J. Denny. http://www.mjdenny.com/Text_Processing_In_R.html

• CRAN, Natural Language Processing: https://cran.r-project.org/web/views/NaturalLanguageProcessing.html

• †[AudioData] Dean Knox and Christopher Lucas. 2017. “The Speaker-Affect Model: Measuring Emotion in Political Speech with Audio Data.” http://christopherlucas.org/files/PDFs/sam.pdf https://www.youtube.com/watch?v=Hs8A9dwkMzI

• †Morgan & Claypool Synthesis Lectures on Human Language Technologies. http://www.morganclaypool.com/toc/hlt/1/1

• †Morgan & Claypool Synthesis Lectures on Speech and Audio Processing. http://www.morganclaypool.com/toc/sap/1/1

Vision, Image, Video

• ‡[ComputerVision] Richard Szeliski. 2010. Computer Vision: Algorithms and Applications. Springer. http://szeliski.org/Book

• MixedModels, Ch. 11, “Statistical Analysis of Shape”; Ch. 12, “Statistical Image Analysis.”

• Audio/Video/Volumetric: See DeepLearning, Chapter 9 re: convolution.

• †Morgan & Claypool Synthesis Lectures on Image, Video, and Multimedia Processing. http://www.morganclaypool.com/toc/ivm/1/1

• †Morgan & Claypool Synthesis Lectures on Computer Vision. http://www.morganclaypool.com/toc/cov/1/1

Approaches to Learning from Data (The Analytics Layer)

• NRCreport, Ch. 7, “Building Models from Massive Data.”

• MMDS (data-mining)

• CausalInference

• ‡[VisualAnalytics] Daniel Keim, Jörn Kohlhammer, Geoffrey Ellis, and Florian Mansmann, editors. 2010. Mastering the Information Age: Solving Problems with Visual Analytics. The Eurographics Association, Goslar, Germany. http://www.vismaster.eu/book See also †Morgan & Claypool Synthesis Lectures on Visualization. http://www.morganclaypool.com/toc/vis/2/1

• ‡[Econometrics] Bruce E. Hansen. 2014. Econometrics. http://www.ssc.wisc.edu/~bhansen/econometrics

• ‡[StatisticalLearning] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning with Applications in R. Springer. http://www-bcf.usc.edu/~gareth/ISL/index.html

• [MachineLearning] Christopher Bishop. 2006. Pattern Recognition and Machine Learning. Springer.

• †[Bayes] Peter D. Hoff. 2009. A First Course in Bayesian Statistical Methods. Springer. http://link.springer.com/book/10.1007%2F978-0-387-92407-6

• †[GraphicalModels] Søren Højsgaard, David Edwards, Steffen Lauritzen. 2012. Graphical Models with R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-2299-0

• [InformationTheory] David MacKay. 2003. Information Theory, Inference, and Learning Algorithms. Cambridge University Press.

• ‡[DeepLearning] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org (see also Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. “Deep Learning.” Nature 521(7553): 436–444. https://doi.org/10.1038/nature14539)

See also ‡[Keras] Keras: The Python Deep Learning Library. https://keras.io François Chollet, Directory of Keras tutorials. https://github.com/fchollet/keras-resources

• †[SignalProcessing] Jose Maria Giron-Sierra. 2013. Digital Signal Processing with MATLAB Examples (Volumes 1-3). Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-981-10-2534-1

Penn State Policy Statements

Academic Integrity

Academic integrity is the pursuit of scholarly activity in an open, honest and responsible manner. Academic integrity is a basic guiding principle for all academic activity at The Pennsylvania State University, and all members of the University community are expected to act in accordance with this principle. Consistent with this expectation, the University’s Code of Conduct states that all students should act with personal integrity, respect other students’ dignity, rights and property, and help create and maintain an environment in which all can succeed through the fruits of their efforts.

Academic integrity includes a commitment by all members of the University community not to engage in or tolerate acts of falsification, misrepresentation or deception. Such acts of dishonesty violate the fundamental ethical principles of the University community and compromise the worth of work completed by others.

Disability Accommodation

Penn State welcomes students with disabilities into the University’s educational programs. Every Penn State campus has an office for students with disabilities. The Student Disability Resources (SDR) website provides contact information for every Penn State campus (http://equity.psu.edu/sdr/disability-coordinator). For further information, please visit the Student Disability Resources website (http://equity.psu.edu/sdr).

In order to receive consideration for reasonable accommodations, you must contact the appropriate disability services office at the campus where you are officially enrolled, participate in an intake interview, and provide documentation. See documentation guidelines at (http://equity.psu.edu/sdr/guidelines). If the documentation supports your request for reasonable accommodations, your campus disability services office will provide you with an accommodation letter. Please share this letter with your instructors and discuss the accommodations with them as early as possible. You must follow this process for every semester that you request accommodations.

Psychological Services

Many students at Penn State face personal challenges or have psychological needs that may interfere with their academic progress, social development, or emotional wellbeing. The university offers a variety of confidential services to help you through difficult times, including individual and group counseling, crisis intervention, consultations, online chats, and mental health screenings. These services are provided by staff who welcome all students and embrace a philosophy respectful of clients’ cultural and religious backgrounds, and sensitive to differences in race, ability, gender identity and sexual orientation.

• Counseling and Psychological Services at University Park (CAPS) (http://studentaffairs.psu.edu/counseling): 814-863-0395

• Penn State Crisis Line (24 hours/7 days/week): 877-229-6400

• Crisis Text Line (24 hours/7 days/week): Text LIONS to 741741

Educational Equity

Penn State takes great pride to foster a diverse and inclusive environment for students, faculty, and staff. Consistent with University Policy AD29, students who believe they have experienced or observed a hate crime, an act of intolerance, discrimination, or harassment that occurs at Penn State are urged to report these incidents as outlined on the University’s Report Bias webpage (http://equity.psu.edu/reportbias).

Background Knowledge

Students will have a variety of backgrounds. Prior to beginning interdisciplinary coursework to fulfill Social Data Analytics degree requirements, including SoDA 501 and 502, students are expected to have advanced (graduate) training in at least one of the component areas of Social Data Analytics and a familiarity with basic concepts in the others.

With regard to specialization, students are expected to have advanced (graduate) training in ONE of the following:

• quantitative social science methodology and a discipline of social science (as would be the case for a second-year PhD student in Political Science, Sociology, Criminology, Human Development and Family Studies, or Demography), OR

• statistics (as would be the case for a second-year PhD student in Statistics), OR

• information science or informatics (as would be the case for a second-year PhD student in Information Science and Technology, or a second-year PhD student in Geography specializing in GIScience), OR

• computer science (as would be the case for a second-year PhD student in Computer Science and Engineering).

This requirement is met as a matter of meeting home program requirements for students in the dual-title PhD, but may require additional coursework on the part of students in other programs wishing to pursue the graduate minor.

With regard to general preparation, students are expected to have ALL of the following technical knowledge:

• basic programming skills, including basic facility with R &/or Python, AND

• basic knowledge of relational databases &/or geographic information systems, AND

• basic knowledge of probability, applied statistics, &/or social science research design, AND

• basic familiarity with a substantive or theoretical area of social science (e.g., 300-level coursework in political science, sociology, criminology, human development, psychology, economics, communication, anthropology, human geography, social informatics, or similar fields).

It is not unusual for students to have one or more gaps in this preparation at time of application to the SoDA program. Students should work with Social Data Analytics advisers to develop a plan for timely remediation of any deficiencies, which generally will not require formal coursework for students whose training and interests are otherwise appropriate for pursuit of the Social Data Analytics degree. Where possible, this will be addressed at time of application to the Social Data Analytics program.

To this end, some free training materials are linked in the reference section, and there are also abundant free, high-quality, self-paced, course-style training materials on these and related subjects available through edX (http://www.edx.org), Udacity (http://www.udacity.com), Codecademy (http://codecademy.com), and Lynda (http://www.lynda.psu).

April 5

• Prasenjit Mitra (IST), “Classification of Tweets from Disaster Scenarios”

• Reading:

Prasenjit Mitra: Rudra et al., “Summarizing Situational and Topical Information During Crises,” https://arxiv.org/pdf/1610.01561.pdf Imran, Mitra, & Castillo, “Twitter as a Lifeline: Human-annotated Twitter Corpora for NLP of Crisis-related Messages,” http://mimran.me/papers/imran_prasenjit_carlos_lrec2016.pdf

NLP (Jurafsky and Martin), Chapters 15 and 16, “Vector Semantics” and “Semantics with Dense Vectors”

• Further reference:

Section “Language, text, speech, audio”

GloVe, word2vec, word2vecExplained, t-SNE

Section “Mobile devices, distributed sensors, ...”

Section “Crowdsourcing, human computation, citizen science, ...”

Section “Scaling, iteration, streaming data, online algorithms”

• Exercise 3 Due

April 12

• Reading:

BitByBit, Ch. 4, “Running Experiments”; review Ch. 2 sections, “Natural experiments in observable data,” examples

CausalInference, Chs. 1-2

• Further reference:

Section “Experimental and observational designs for causal inference”

Section “Crowdsourcing web experiments”

Section “Human subjects ...”

• 75% Project Review

April 19

• Reading:

BitByBit, Ch. 6, “Ethics”

PSU-ORP, “Common Rule and Other Changes,” https://www.research.psu.edu/irb/commonrulechanges

DataPrivacy

Reproducibility

Refresh on MachineBias

• Further reference:

Section “Ethics and Scientific Responsibility in Big Social Data”

Section “¡Cuidado!”

April 26 - LAST CLASS MEETING

• Team Project Presentations

May 3 - Team Projects Due

Past Visiting Speakers in SoDA 501 and 502

SoDA 502 (Fall 2017)

• Chris Zorn (PLSC)

• Diane Felmlee (SOC)

• Alan MacEachren (GEOG)

• Eric Plutzer (PLSC)

• James LeBreton (PSYCH)

• Sesa Slavkovic (STAT)

• Guido Cervone (GEOG)

• Ashton Verdery (SOC)

• Maggie Niu (STAT)

• Reka Albert (PHYS)

SoDA 501 (Spring 2017)

• Tim Brick (HDFS), “Towards real-time monitoring and intervention using wearable technology”

• Aylin Caliskan (Princeton), “A Story of Discrimination and Unfairness: Bias in Word Embeddings”

• Jay Yonamine (Google - IGERT alum), “Data Science in Industry”

• Johnathan Rush (Illinois), “Geospatial Data Science Workshop”

• Rick Gilmore (PSYCH), “Toward a more reproducible and robust science of human behavior”

• Glenn Firebaugh (SOC), “Measuring Inequality and Segregation with US Census Data”

• Charles Twardy (Sotera), “Data Science for Search and Rescue”

• Anna Smith (Ohio State), “A Hierarchical Model for Network Data in a Latent Hyperbolic Space”

• Rebecca Passonneau (CSE), “Omnigraph: Rich Feature Representation for Graph Kernel Learning”

• Alex Klippel (GEOG), “Virtual Reality for Immersive Analytics”

• Murali Haran (STAT), “A Computationally Efficient Projection-based Approach for Spatial Generalized Mixed Models”

SoDA 502 (Fall 2016)

• Clio Andris (GEOG), “Integrating Social Network Data into GISystems”

• Jia Li (STAT), “Clustering under the Wasserstein Metric”

• Rachel Smith (CAS), “Stigma Networks: Perceptions of Sociograms”

• Zita Oravecz (HDFS)

• Bethany Bray (Methodology Center), “Latent Class and Latent Transition Analysis”

• Dave Hunter (STAT), “Model Based Clustering of Large Networks”

• Scott Bennett (PLSC), “ABM Model of Insurgency”

• David Reitter (IST)

• Suzanna Linn (PLSC), “Methodological Issues in Automated Text Analysis: Application to News Coverage of the US Economy”

SoDA 501 (Spring 2016)

• Bruce Desmarais (PLSC), “Learning in the Sunshine: Analysis of Local Government Email Corpora”

• Timothy Brick (HDFS), “Mapping and Manipulating Facial Expression”

• Qunying Huang (USC), “Social Media: An Emerging Data Source for Human Mobility Studies”

• Lingzhou Xue (STAT), “An Introduction to High-Dimensional Graphical Models”

• Ashton Verdery (SOC), “Sampling from Network Data”

• Sarah Battersby (Tableau), “Helping People See and Understand Spatial Data”

• Alexandra Slavkovic (STAT), “Statistical Privacy with Network Data”

• Lee Giles (IST), “Machine Learning for Scholarly Big Data”

Readings and References (updated Spring 2018)

We will discuss a relatively small subset of the readings listed here, and this will vary based on topics and readings discussed by visiting speakers and you yourselves. The remainder are provided here as curated references for more in-depth investigation in both the theory and practice related to the topic (with the latter heavily weighted toward resources in Python and R).

† Material that is, at last check, made available through the Penn State library. Most journal links should work if you are logged in to a Penn State machine or through the Penn State VPN. Some article archives and most books require additional authentication through webaccess. If links are broken, start directly from a search via https://www.libraries.psu.edu. Some books require installation of e-readers like Adobe Digital Editions. Links to lynda.com can be accessed through http://lynda.psu.edu.

‡ Material that is, at last check, legally provided for free. In some cases, these are the preprint versions of published material.

§ Material for which a legal selection is or will be provided through the class Box folder.

Big Data & Social Data Analytics

Overviews

• ‡[CompSocSci] David Lazer, Alex Pentland, Lada Adamic, Sinan Aral, Albert-Laszlo Barabasi, Devon Brewer, Nicholas Christakis, Noshir Contractor, James Fowler, Myron Gutmann, Tony Jebara, Gary King, Michael Macy, Deb Roy, and Marshall Van Alstyne. 2009. “Computational Social Science.” Science 323(5915): 721-3 + Supp. Feb 6. http://science.sciencemag.org/content/323/5915/721.full https://gking.harvard.edu/files/gking/files/LazPenAda09.pdf

• ‡[NRCreport] National Research Council. 2013. Frontiers in Massive Data Analysis. National Academies Press. (Free w/ registration: http://www.nap.edu/catalog.php?record_id=18374). Ch. 1, “Introduction”; Ch. 2, “Massive Data in Science, Technology, Commerce, National Defense, Telecommunications, and other Endeavors.”

• ‡[MMDS] Jure Leskovec, Anand Rajaraman, and Jeff Ullman. 2014. Mining of Massive Datasets. Cambridge University Press. http://www.mmds.org (BETA version of Third Edition: http://i.stanford.edu/~ullman/mmdsn.html)

• §[BitByBit] Matthew J. Salganik. 2018 (Forthcoming). Bit by Bit: Social Research in the Digital Age. Princeton University Press. Ch. 1, “Introduction”; Ch. 2, “Observing Behavior.”

• DSHandbook-Py, Ch. 1, “Introduction: Becoming a Unicorn”; Ch. 2, “The Data Science Road Map.”

Burt-schtick

• ‡[Monroe-5Vs] Burt L. Monroe. 2013. “The Five Vs of Big Data Political Science: Introduction to the Special Issue on Big Data in Political Science.” Political Analysis 21(V5): 1–9. https://doi.org/10.1017/S1047198700014315 (Volume, Velocity, Variety, Vinculation, Validity)

• †[Monroe-No] Burt L. Monroe, Jennifer Pan, Margaret E. Roberts, Maya Sen, and Betsy Sinclair. 2015. “No! Formal Theory, Causal Inference, and Big Data Are Not Contradictory Trends in Political Science.” PS: Political Science & Politics 48(1): 71–4. http://dx.doi.org/10.1017/S1049096514001760

• †[Quinn-Topics] Kevin M. Quinn, Burt L. Monroe, Michael Colaresi, Michael H. Crespin, and Dragomir R. Radev. 2010. “How to Analyze Political Attention with Minimal Assumptions and Costs.” American Journal of Political Science 54(1): 209–28. http://onlinelibrary.wiley.com/doi/10.1111/j.1540-5907.2009.00427.x/full (esp. topic modeling as measurement, approach to validation)

• †[FightinWords] Burt L. Monroe, Michael Colaresi, and Kevin M. Quinn. 2008. “Fightin’ Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict.” Political Analysis 16(4): 372-403. https://doi.org/10.1093/pan/mpn018 (esp. the impact of sampling variance and regularization through priors)

• ‡[BDSS-Census] Big Data Social Science PSU Team. 2012. “A Closer Look at the Kaggle Census Data.” https://burtmonroe.github.io/BDSSKaggleCensus2012 (esp. the relevance of the social processes by which data come to exist as data)

Multidisciplinary Perspectives

• †[Business-BigData] Andrew McAfee and Erik Brynjolfsson. 2012. “Big Data: The Management Revolution.” Harvard Business Review 90(10): 61–8. October. https://hbr.org/2012/10/big-data-the-management-revolution Thomas H. Davenport and D.J. Patil. 2012. “Data Scientist: The Sexiest Job of the 21st Century.” Harvard Business Review 90(10): 70-6. October. https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century

• †[InfoSci-BigData] C.L. Philip Chen and Chun-Yang Zhang. 2014. “Data-intensive Applications, Challenges, Techniques and Technologies: A Survey on Big Data.” Information Sciences 275: 314-47. https://doi.org/10.1016/j.ins.2014.01.015

• †[Informatics-BigData] Vasant G. Honavar. 2014. “The Promise and Potential of Big Data: A Case for Discovery Informatics.” Review of Policy Research 31(4): 326-330. https://doi.org/10.1111/ropr.12080

• ‡[Stats-BigData] Beate Franke, Jean-François Plante, Ribana Roscher, Annie Lee, Cathal Smyth, Armin Hatefi, Fuqi Chen, Einat Gil, Alexander Schwing, Alessandro Selvitella, Michael M. Hoffman, Roger Grosse, Dietrich Hendricks, and Nancy Reid. 2016. “Statistical Inference, Learning and Models in Big Data.” International Statistical Review 84(3): 371-89. http://onlinelibrary.wiley.com/doi/10.1111/insr.12176/full

• †[Econ-BigData] Hal R. Varian. 2013. “Big Data: New Tricks for Econometrics.” Journal of Economic Perspectives 28(2): 3-28. https://doi.org/10.1257/jep.28.2.3

• †[GeoViz-BigData] Alan MacEachren. 2017. “Leveraging Big (Geo) Data with (Geo) Visual Analytics: Place as the Next Frontier.” In Chenghu Zhou, Fenzhen Su, Francis Harvey, and Jun Xu, eds., Spatial Data Handling in the Big Data Era, pp. 139-155. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/chapter/10.1007/978-981-10-4424-3_10

• †[Soc-BigData] David Lazer and Jason Radford. 2017. “Data ex Machina: Introduction to Big Data.” Annual Review of Sociology 43: 19-39. https://doi.org/10.1146/annurev-soc-060116-053457

• §[Politics-BigData] Keith T. Poole, L. Jason Anastasopoulos, and James E. Monogan III. Forthcoming. “The ‘Big Data’ Revolution in Political Campaigning and Governance.” Oxford Bibliographies in Political Science.

¡Cuidado! Traps, Biases, Problems, Pains, Perils

• †[GoogleFlu] David Lazer, Ryan Kennedy, Gary King, and Alessandro Vespignani. 2014. “The Parable of Google Flu: Traps in Big Data Analysis.” Science (343). 14 March. http://gking.harvard.edu/files/gking/files/0314policyforumff.pdf

• ‡[MachineBias] ProPublica. “Machine Bias: Investigating Algorithmic Injustice.” Series. https://www.propublica.org/series/machine-bias See especially:

Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. 2016. “Machine Bias.” ProPublica. May 23. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing

Julia Angwin, Madeleine Varner, and Ariana Tobin. 2017. “Facebook Enabled Advertisers to Reach ‘Jew Haters.’” https://www.propublica.org/article/facebook-enabled-advertisers-to-reach-jew-haters

• ‡[Polling2016] Doug Rivers. (Nov 11, 2016). “First Thoughts on Polling Problems in the 2016 US Elections.” https://today.yougov.com/news/2016/11/11/first-thoughts-polling-problems-2016-us-elections

• †[EventData] Wei Wang, Ryan Kennedy, David Lazer, Naren Ramakrishnan. 2016. “Growing Pains for Global Monitoring of Societal Events.” Science 353:6307, pp. 1502–1503. https://doi.org/10.1126/science.aaf6758

• ‡[GoogleBooks] Eitan Adam Pechenick, Christopher M. Danforth, and Peter Sheridan Dodds. 2015. “Characterizing the Google Books Corpus: Strong Limits to Inferences of Socio-Cultural and Linguistic Evolution.” PLoS One. https://doi.org/10.1371/journal.pone.0137041

• ‡[OkCupid] Michael Zimmer. 2016. “OkCupid Study Reveals the Perils of Big-Data Science.” Wired. https://www.wired.com/2016/05/okcupid-study-reveals-perils-big-data-science May 14.

• †[CriticalQuestions] danah boyd and Kate Crawford. 2011. “Critical Questions for Big Data: Provocations for a Cultural, Technological, and Scholarly Phenomenon.” Information, Communication & Society 15(5): 662–79. http://dx.doi.org/10.1080/1369118X.2012.678878

• ‡[RacistBot] Daniel Victor. 2016. “Microsoft created a Twitter bot to learn from users. It quickly became a racist jerk.” New York Times. https://www.nytimes.com/2016/03/25/technology/microsoft-created-a-twitter-bot-to-learn-from-users-it-quickly-became-a-racist-jerk.html

• MMDS 1.2.2-1.2.3 on Bonferroni.

• Also BDSS-Census.

Research Design and Measurement

Overviews

• BitByBit

• ‡[ResearchMethodsKB] William M. Trochim. 2006. The Research Methods Knowledge Base. http://www.socialresearchmethods.net/kb

• [7Rules] Glenn Firebaugh. 2008. Seven Rules for Social Research. Princeton University Press.

Measurement Reliability and Validity

• ResearchMethodsKB, “Measurement”

• 7Rules, Ch. 3, “Build Reality Checks into Your Research”

• Reliability: see also FightinWords

• Validity: see also Quinn-Topics, Monroe-5Vs

Indirect, Unobtrusive, Nonreactive Measures, Data Exhaust

• †[UnobtrusiveMeasures] Raymond M. Lee. 2015. “Unobtrusive Measures.” Oxford Bibliographies. https://doi.org/10.1093/OBO/9780199846740-0048 (Canonical cite is Eugene J. Webb. 1966. Unobtrusive Measures, or Webb, Donald T. Campbell, Richard D. Schwartz, Lee Sechrest. 1999. Unobtrusive Measures, rev. ed. Sage.)

• BitByBit, Examples in Chapter 2.

Multiple Measures, Latent Variable Measurement

• 7Rules, Ch. 4, “Replicate Where Possible”

• †[LatentVariables] David J. Bartholomew, Martin Knott, and Irini Moustaki. 2011. Latent Variable Models and Factor Analysis: A Unified Approach. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10483308 (esp. Ch. 1, “Basic ideas and examples”)

• †[Multivariate-R] Brian Everitt and Torsten Hothorn. 2011. An Introduction to Applied Multivariate Analysis with R. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4419-9650-3

• Shalizi-ADA, Ch. 17, “Factor Models”

• ‡[NetflixPrize] Edwin Chen. 2011. “Winning the Netflix Prize: A Summary.”

• ‡CRAN: https://cran.r-project.org/web/views/Multivariate.html https://cran.r-project.org/web/views/Psychometrics.html https://cran.r-project.org/web/views/Cluster.html

Sampling and Survey Design

• †[Sampling] Steven K. Thompson. 2012. Sampling, 3rd ed. http://sk8es4mc2l.search.serialssolutions.com/?sid=sersol&SS_jc=TC_024492330&title=Wiley%20Desktop%20Editions%20%3A%20Sampling

• ResearchMethodsKB, “Sampling”

• NRCreport, Ch. 8, “Sampling and Massive Data”

• BitByBit, Ch. 3, “Asking Questions”

• ‡[MSE] Daniel Manrique-Vallier, Megan E. Price, and Anita Gohdes. 2013. In Seybolt et al. (eds.), Counting Civilians. “Multiple Systems Estimation Techniques for Estimating Casualties in Armed Conflicts.” Preprint: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.469.939&rep=rep1&type=pdf

• †[NetworkSampling] Ted Mouw and Ashton M. Verdery. 2012. “Network Sampling with Memory: A Proposal for More Efficient Sampling from Social Networks.” Sociological Methodology 42(1): 206–56. https://dx.doi.org/10.1177%2F0081175012461248

• See also FightinWords (re: hidden heteroskedasticity in sample variance)

• ‡[HashDontSample] Mudit Uppal. 2016. “Probabilistic data structures in the Big data world (+ code).” (re: “Hash, don’t sample”) https://medium.com/@muppal/probabilistic-data-structures-in-the-big-data-world-code-b9387cff0c55

• MMDS: Hash Functions (1.3.2), Sampling in streams (4.2)

• ‡CRAN: https://cran.r-project.org/web/views/OfficialStatistics.html (“Complex Survey Design,” “Small Area Estimation”)

Experimental and Observational Designs for Causal Inference

• ‡[CausalInference] Miguel A. Hernán, James M. Robins. Forthcoming (2017). Causal Inference. Chapman & Hall/CRC. Preprint: https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book Chs. 1, “A definition of causal effect”; Ch. 2, “Randomized experiments”; Ch. 3, “Observational studies.” (See Ch. 6, “Graphical representation of causal effects,” for integration with Judea Pearl approach.)

• BitByBit, Chapter 4, “Running Experiments”; Ch. 2, Natural experiments in observable data examples.

• ResearchMethodsKB, “Design”

• Firebaugh-7Rules, Ch. 2, “Look for Differences that Make a Difference, and Report Them”; §Ch. 5, “Compare Like with Like”; Ch. 6, “Use Panel Data to Study Individual Change and Repeated Cross-Section Data to Study Social Change.”

• Shalizi-ADA, Part IV, “Causal Inference”

• Monroe-No

• CRAN: https://cran.r-project.org/web/views/ExperimentalDesign.html

Technologies for Primary and Secondary Data Collection

Mobile Devices, Distributed Sensors, Wearable Sensors, Remote Sensing

• †[RealityMining] Nathan Eagle and Alex (Sandy) Pentland. 2006. “Reality Mining: Sensing Complex Social Systems.” Personal and Ubiquitous Computing 10(4): 255–68. https://doi.org/10.1007/s00779-005-0046-3

• †[QuantifiedSelf] Melanie Swan. 2013. “The Quantified Self: Fundamental Disruption in Big Data Science and Biological Discovery.” Big Data 1(2): 85–99. https://doi.org/10.1089/big.2012.0002

• †[SensorData] Charu C. Aggarwal (Ed.). 2013. Managing and Mining Sensor Data. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-6309-2. Ch. 1 “An Introduction to Sensor Data Analytics.”

• †[NightLights] Thushyanthan Baskaran, Brian Min, Yogesh Uppal. 2015. “Election cycles and electricity provision: Evidence from a quasi-experiment with Indian special elections.” Journal of Public Economics 126: 64–73. https://doi.org/10.1016/j.jpubeco.2015.03.011

Crowdsourcing, Human Computation, Citizen Science, Web Experiments

• †[Crowdsourcing] Daren C. Brabham. 2013. Crowdsourcing. MIT Press. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10692208. “Introduction,” Ch. 1 “Concepts, Theories, and Cases of Crowdsourcing.”

• BitByBit, Ch. 5 “Creating Mass Collaboration”

• NRCreport, Ch. 9 “Human Interaction with Data”

• †[MTurk] Krista Casler, Lydia Bickel, and Elizabeth Hackett. 2013. “Separate but Equal? A Comparison of Participants and Data Gathered via Amazon’s MTurk, Social Media, and Face-to-Face Behavioral Testing.” Computers in Human Behavior 29(6): 2156–60. http://doi.org/10.1016/j.chb.2013.05.009

• †[LabintheWild] Katharina Reinecke and Krzysztof Z. Gajos. 2015. “LabintheWild: Conducting Large-Scale Online Experiments With Uncompensated Samples.” In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing (CSCW ’15): 1364–1378. http://dx.doi.org/10.1145/2675133.2675246

• †[TweetmentEffects] Kevin Munger. 2017. “Tweetment Effects on the Tweeted: Experimentally Reducing Racist Harassment.” Political Behavior 39(3): 629–49. https://doi.org/10.1007/s11109-016-9373-5

• †[HumanComputation] Edith Law and Luis von Ahn. 2011. Human Computation. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdf/10.2200/S00371ED1V01Y201107AIM013

Open Data, File Formats, APIs, Semantic Web, Linked Data

• DSHandbook-Py, Ch. 12 “Data Encodings and File Formats”

• ‡[OpenData] Open Data Institute. “What Is Open Data?” https://theodi.org/what-is-open-data

• ‡[OpenDataHandbook] Open Knowledge International. The Open Data Handbook. http://opendatahandbook.org. Includes appendix “File Formats”: http://opendatahandbook.org/guide/en/appendices/file-formats

• ‡[APIs] Brian Cooksey. 2016. An Introduction to APIs. https://zapier.com/learn/apis

• ‡[APIMarkets] RapidAPI / mashape API marketplaces: https://docs.rapidapi.com, https://market.mashape.com; ProgrammableWeb: https://www.programmableweb.com

• †[LinkedData] Tom Heath and Christian Bizer. 2011. Linked Data: Evolving the Web into a Global Data Space. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/abs/10.2200/S00334ED1V01Y201102WBE001. Ch. 1 “Introduction,” Ch. 2 “Principles of Linked Data.”

• †[SemanticWeb] Nikolaos Konstantinou and Dimitrios-Emmanuel Spanos. 2015. Materializing the Web of Linked Data. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-16074-0. Ch. 1 “Introduction: Linked Data and the Semantic Web.”

• †Morgan & Claypool Synthesis Lectures on the Semantic Web: Theory and Technology. http://www.morganclaypool.com/toc/wbe/1/1

Web Scraping

• †[TheoryDrivenScraping] Richard N. Landers, Robert C. Brusso, Katelyn J. Cavanaugh, and Andrew B. Collmus. 2016. “A Primer on Theory-Driven Web Scraping: Automatic Extraction of Big Data from the Internet for Use in Psychological Research.” Psychological Methods 21(4): 475–492. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1037/met0000081

• ‡[Scraping-Py] Al Sweigart. 2015. Automate the Boring Stuff with Python: Practical Programming for Total Beginners. Ch. 11, “Web Scraping.” https://automatetheboringstuff.com/chapter11

• †[Scraping-R] Simon Munzert, Christian Rubba, Peter Meißner, and Dominic Nyhuis. 2014. Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining. Wiley. http://onlinelibrary.wiley.com/book/10.1002/9781118834732

• ‡CRAN: https://cran.r-project.org/web/views/WebTechnologies.html

Ethics & Scientific Responsibility in Big Social Data

Human Subjects, Consent, Privacy

• BitByBit, Ch. 6 (“Ethics”)

• ‡[PSU-ORP] Penn State Office for Research Protections (under the VP for Research)

Human Subjects Research / IRB: https://www.research.psu.edu/irb

Revised Common Rule: https://www.research.psu.edu/irb/commonrulechanges

Responsible Conduct of Research: https://www.research.psu.edu/education/rcr

Research Misconduct: https://www.research.psu.edu/researchmisconduct

SARI@PSU (Scientific and Research Integrity Training): https://www.research.psu.edu/training/sari

• ‡[MenloReport] David Dittrich and Erin Kenneally (Center for Applied Internet Data Analysis). 2012. “The Menlo Report: Ethical Principles Guiding Information and Communication Technology Research” and companion report “Applying Ethical Principles ...” https://www.caida.org/publications/papers/2012/menlo_report_actual_formatted

• ‡[BigDataEthics-CBDES] Jacob Metcalf, Emily F. Keller, and danah boyd. 2016. “Perspectives on Big Data, Ethics, and Society.” Council for Big Data, Ethics, and Society. http://bdes.datasociety.net/council-output/perspectives-on-big-data-ethics-and-society

• ‡[AoIRReport] Annette Markham and Elizabeth Buchanan (Association of Internet Researchers). 2012. “Ethical Decision-Making and Internet Research: Recommendations of the AoIR Ethics Working Committee (Version 2.0).” https://aoir.org/reports/ethics2.pdf and guidelines chart https://aoir.org/wp-content/uploads/2017/01/aoir_ethics_graphic_2016.pdf

• ‡[BigDataEthics-Wired] Sarah Zhang. 2016. “Scientists are Just as Confused about the Ethics of Big-Data Research as You.” Wired. https://www.wired.com/2016/05/scientists-just-confused-ethics-big-data-research

• †[BigDataEthics-HerschelMori] Richard Herschel and Virginia M. Miori. 2017. “Ethics & Big Data.” Technology in Society 49: 31–36. http://doi.org/10.1016/j.techsoc.2017.03.003

The Science of Data Privacy

• †[BigDataPrivacy] Terence Craig and Mary E. Ludloff. 2011. Privacy and Big Data. O’Reilly Media. http://pensu.eblib.com/patron/FullRecord.aspx?p=781814

• ‡[DataAnalysisPrivacy] John Abowd, Lorenzo Alvisi, Cynthia Dwork, Sampath Kannan, Ashwin Machanavajjhala, Jerome Reiter. 2017. “Privacy-Preserving Data Analysis for the Federal Statistical Agencies.” A Computing Community Consortium white paper. https://arxiv.org/abs/1701.00752

• †[DataPrivacy] Stephen E. Fienberg and Aleksandra B. Slavkovic. 2011. “Data Privacy and Confidentiality.” International Encyclopedia of Statistical Science: 342–5. http://doi.org/10.1007/978-3-642-04898-2_202

• ‡[NetworksPrivacy] Vishesh Karwa and Aleksandra Slavkovic. 2016. “Inference using noisy degrees: Differentially private β-model and synthetic graphs.” Annals of Statistics 44(1): 87–112. http://projecteuclid.org/euclid.aos/1449755958

• †[DataPublishingPrivacy] Raymond Chi-Wing Wong and Ada Wai-Chee Fu. 2010. Privacy-Preserving Data Publishing: An Overview. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdfplus/10.2200/S00237ED1V01Y201003DTM002

• DataMatching, Ch. 8 “Privacy Aspects of Data Matching”

Transparency, Reproducibility, and Team Science

• ‡[10RulesforData] Alyssa Goodman, Alberto Pepe, Alexander W. Blocker, Christine L. Borgman, Kyle Cranmer, Merce Crosas, Rosanne Di Stefano, Yolanda Gil, Paul Groth, Margaret Hedstrom, David W. Hogg, Vinay Kashyap, Ashish Mahabal, Aneta Siemiginowska, and Aleksandra Slavkovic. (2014). “Ten Simple Rules for the Care and Feeding of Scientific Data.” PLoS Computational Biology 10(4): e1003542. https://doi.org/10.1371/journal.pcbi.1003542 (Note esp. curated resources for reproducible research.)

• †[Transparency] E. Miguel, C. Camerer, K. Casey, J. Cohen, K. M. Esterling, A. Gerber, R. Glennerster, D. P. Green, M. Humphreys, G. Imbens, D. Laitin, T. Madon, L. Nelson, B. A. Nosek, M. Petersen, R. Sedlmayr, J. P. Simmons, U. Simonsohn, M. Van der Laan. 2014. “Promoting Transparency in Social Science Research.” Science 343(6166): 30–1. https://doi.org/10.1126/science.1245317

• †[Reproducibility] Marcus R. Munafò, Brian A. Nosek, Dorothy V. M. Bishop, Katherine S. Button, Christopher D. Chambers, Nathalie Percie du Sert, Uri Simonsohn, Eric-Jan Wagenmakers, Jennifer J. Ware, and John P. A. Ioannidis. 2017. “A Manifesto for Reproducible Science.” Nature Human Behaviour 1: 0021 (2017). https://doi.org/10.1038/s41562-016-0021

• ‡[SoftwareCarpentry] esp. “Lessons”: http://software-carpentry.org/lessons

• DSHandbook-Py, Ch. 9 “Technical Communication and Documentation,” Ch. 15 “Software Engineering Best Practices”

• †[TeamScienceToolkit] Vogel AL, Hall KL, Fiore SM, Klein JT, Bennett LM, Gadlin H, Stokols D, Nebeling LC, Wuchty S, Patrick K, Spotts EL, Pohl C, Riley WT, Falk-Krzesinski HJ. 2013. “The Team Science Toolkit: enhancing research collaboration through online knowledge sharing.” American Journal of Preventive Medicine 45: 787–9. https://doi.org/10.1016/j.amepre.2013.09.001

• §[Databrary] Kara Hall, Robert Croyle, and Amanda Vogel. Forthcoming (2017). Advancing Social and Behavioral Health Research through Cross-disciplinary Team Science. Springer. Includes Rick O. Gilmore and Karen E. Adolph, “Open Sharing of Research Video: Breaking the Boundaries of the Research Team.” (See http://databrary.org)

• CRAN: https://cran.r-project.org/web/views/ReproducibleResearch.html

Social Bias, Fair Algorithms

• MachineBias

• ‡[EmbeddingsBias] Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. “Semantics Derived Automatically from Language Corpora Contain Human Biases.” Science. https://arxiv.org/abs/1608.07187

• ‡[Debiasing] Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, Adam Kalai. 2016. “Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings.” https://arxiv.org/abs/1607.06520

• ‡[AvoidingBias] Moritz Hardt, Eric Price, Nathan Srebro. 2016. “Equality of Opportunity in Supervised Learning.” https://arxiv.org/abs/1610.02413

• ‡[InevitableBias] Jon Kleinberg, Sendhil Mullainathan, Manish Raghavan. 2016. “Inherent Trade-Offs in the Fair Determination of Risk Scores.” https://arxiv.org/abs/1609.05807

Databases and Data Management

• NRCreport, Ch. 3 “Scaling the Infrastructure for Data Management”

• †[SQL] Jan L. Harrington. 2010. SQL Clearly Explained. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780123756978

• DSHandbook-Py, Ch. 14 “Databases”

• †[noSQL] Guy Harrison. 2015. Next Generation Databases: NoSQL, NewSQL, and Big Data. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-1329-2

• †[Cloud] Divyakant Agrawal, Sudipto Das, and Amr El Abbadi. 2012. Data Management in the Cloud: Challenges and Opportunities. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdfplus/10.2200/S00456ED1V01Y201211DTM032

• †Morgan & Claypool Synthesis Lectures on Data Management. http://www.morganclaypool.com/toc/dtm/1/1

Data Wrangling

Theoretically-structured approaches to data wrangling

• ‡[TidyData-R] Garrett Grolemund and Hadley Wickham. 2017. R for Data Science (http://r4ds.had.co.nz), esp. Ch. 12 “Tidy Data”; also Wickham. 2014. “Tidy Data.” Journal of Statistical Software 59(10). http://www.jstatsoft.org/v59/i10/paper. Tools: The tidyverse, http://tidyverse.org

• ‡[DataScience-Py] Jake VanderPlas. 2016. Python Data Science Handbook (https://github.com/jakevdp/PythonDataScienceHandbook), esp. Ch. 3 on pandas: http://pandas.pydata.org

• ‡[DataCarpentry] Colin Gillespie and Robin Lovelace. 2017. Efficient R Programming. https://csgillespie.github.io/efficientR, esp. Ch. 6 on “efficient data carpentry”

• §[Wrangler] Joseph M. Hellerstein, Jeffrey Heer, Tye Rattenbury, and Sean Kandel. 2017. Data Wrangling: Practical Techniques for Data Preparation. Tool: Trifacta Wrangler, http://www.trifacta.com/products/wrangler

Data wrangling practice

• DSHandbook-Py, Ch. 4 “Data Munging: String Manipulation, Regular Expressions, and Data Cleaning”

• †[DataSimplification] Jules J. Berman. 2016. Data Simplification: Taming Information with Open Source Tools. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780128037812

• †[Wrangling-Py] Jacqueline Kazil, Katharine Jarmul. 2016. Data Wrangling with Python. O’Reilly. http://proquestcombo.safaribooksonline.com.ezaccess.libraries.psu.edu/9781491948804

• †[Wrangling-R] Bradley C. Boehmke. 2016. Data Wrangling with R. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45599-0

Record Linkage, Entity Resolution, Deduplication

• †[DataMatching] Peter Christen. 2012. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-642-31164-2 (esp. Ch. 2 “The Data Matching Process”)

• DataSimplification, Chapter 5 “Identifying and Deidentifying Data”

• ‡[EntityResolution] Lise Getoor and Ashwin Machanavajjhala. 2013. “Entity Resolution for Big Data.” Tutorial, KDD. http://www.umiacs.umd.edu/~getoor/Tutorials/ER_KDD2013.pdf

• ‡[SyrianCasualties] Peter Sadosky, Anshumali Shrivastava, Megan Price, and Rebecca C. Steorts. 2015. “Blocking Methods Applied to Casualty Records from the Syrian Conflict.” https://arxiv.org/abs/1510.07714 (For more on blocking, see DataMatching, Ch. 4 “Indexing.”)

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html, “Statistical Matching and Record Linkage”

“Making up data”: Imputation, Smoothers, Kernels, Priors, Filters, Teleportation, Negative Sampling, Convolution, Augmentation, Adversarial Training

• †[Imputation] Yi Deng, Changgee Chang, Moges Seyoum Ido, and Qi Long. 2016. “Multiple Imputation for General Missing Data Patterns in the Presence of High-dimensional Data.” Scientific Reports 6: 21689. https://www.nature.com/articles/srep21689

• ‡[Overimputation] Matthew Blackwell, James Honaker, and Gary King. 2017. “A Unified Approach to Measurement Error and Missing Data: Overview and Applications.” Sociological Methods & Research 46(3): 303–341. http://gking.harvard.edu/files/gking/files/measure.pdf

• Shalizi-ADA, Sect. 1.5 (Linear Smoothers), Ch. 8 (Splines), Sect. 14.4 (Kernel Density Estimates); DataScience-Python, NB 05.13 “Kernel Density Estimation”

• FightinWords (re: Bayesian priors as additional data; impact of priors on regularization)

• DeepLearning, Section 7.5 (Data Augmentation), 7.12 (Dropout), 7.13 (Adversarial Examples), Chapter 9 (Convolutional Networks)

• †[Adversarial] Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy. 2015. “Explaining and Harnessing Adversarial Examples.” https://arxiv.org/abs/1412.6572

• See also: resampling and simulation methods

• See also: feature engineering, preprocessing

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html, “Imputation”

(Direct) Data Representations, Data Mappings

• NRCreport, Ch. 5 “Large-Scale Data Representations”

• †[PatternRecognition] M. Narasimha Murty and V. Susheela Devi. 2011. Pattern Recognition: An Algorithmic Approach. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-0-85729-495-1. Section 2.1 “Data Structures for Pattern Representation.”

• ‡[InfoRetrieval] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2009. Introduction to Information Retrieval. Cambridge University Press. http://nlp.stanford.edu/IR-book. Ch. 1, 2, 6 (also note slides used in their class).

• †[Algorithms] Brian Steele, John Chandler, Swarna Reddy. 2016. Algorithms for Data Science. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45797-0. Ch. 2 “Data Mapping and Data Dictionaries.”

• See also: Social Data Structures

Similarity, Distance, Association, Covariance, The Kernel Trick

• PatternRecognition, Section 2.3 “Proximity Measures”

• ‡[Similarity] Brendan O’Connor. 2012. “Cosine similarity, Pearson correlation, and OLS coefficients.” https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients

• For relatively comprehensive lists, see also:

M.-J. Lesot, M. Rifqi, and H. Benhadda. 2009. “Similarity measures for binary and numerical data: a survey.” Int. J. Knowledge Engineering and Soft Data Paradigms 1(1): 63–. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10112126533&rep=rep1&type=pdf

Seung-Seok Choi, Sung-Hyuk Cha, Charles C. Tappert. 2010. “A Survey of Binary Similarity and Distance Measures.” Journal of Systemics, Cybernetics & Informatics 8(1): 43–8. http://www.iiisci.org/Journal/CV$/sci/pdfs/GS315JG.pdf

Sung-Hyuk Cha. 2007. “Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions.” International Journal of Mathematical Models and Methods in Applied Sciences 4(1): 300–7. http://csis.pace.edu/ctappert/dps/d861-12/session4-p2.pdf

Anna Huang. 2008. “Similarity Measures for Text Document Clustering.” Proceedings of the 6th New Zealand Computer Science Research Student Conference: 49–56. http://www.nzcsrsc08.canterbury.ac.nz/site/proceedings/Individual_Papers/pg049_Similarity_Measures_for_Text_Document_Clustering.pdf

• Michael I. Jordan. “The Kernel Trick” (Lecture Notes). https://people.eecs.berkeley.edu/~jordan/courses/281B-spring04/lectures/lec3.pdf

• Eric Kim. 2017. “Everything You Ever Wanted to Know about the Kernel Trick (But Were Afraid to Ask).” http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick_blog_ekim_12_20_2017.pdf

Derived Data Representations - Dimensionality Reduction, Compression, Decomposition, Embeddings

The groupings here and under the measurement / multivariate statistics section are particularly arbitrary. For example, “k-Means clustering” can be viewed as a technique for “dimensionality reduction,” “compression,” “feature extraction,” “latent variable measurement,” “unsupervised learning,” “collaborative filtering.”
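As a rough illustration of why one technique lands under several of these headings, the sketch below runs a bare-bones k-means (Lloyd's algorithm) in plain Python and then reads the same fitted result three ways: as cluster labels, as a compressed code plus codebook, and as a new feature representation. The toy points, the function names, and the fixed iteration count are all assumptions made for this example; none of them come from the readings.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(p, centroids):
    """Index of the centroid closest to point p."""
    return min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))


def kmeans(points, k, iters=20, seed=0):
    """Minimal Lloyd's algorithm; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize from k distinct data points
    for _ in range(iters):
        # Assignment step: nearest centroid for every point.
        labels = [nearest(p, centroids) for p in points]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


# Hypothetical toy data: two well-separated groups in the plane.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centroids, labels = kmeans(points, k=2)

# Reading 1, unsupervised learning: the labels are cluster memberships.
# Reading 2, compression (vector quantization): keep one small integer per
# row plus the codebook of centroids, and reconstruct approximately.
codes = labels
reconstructed = [centroids[c] for c in codes]

# Reading 3, feature extraction: re-express each row by its distance to
# each centroid, giving k derived features per row.
features = [[sq_dist(p, c) ** 0.5 for c in centroids] for p in points]

print(codes)
print(features[0])
```

The fitted model is identical in all three readings; only what is kept afterward (labels, codebook, or distances) changes.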

Clustering, hashing, quantization, blocking, compression

• ‡[Compression] Khalid Sayood. 2012. Introduction to Data Compression, 4th ed. Morgan Kaufmann. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780124157965 (e.g., coding, blocking via vector quantization)

• MMDS, Ch. 3 “Finding Similar Items” (minhashing, locality-sensitive hashing)

• Multivariate-R, Ch. 6 “Clustering”

• DataScience-Python, NB 05.11 “k-Means Clustering,” NB 05.12 “Gaussian Mixture Models”

• ‡[KMeansHashing] Kaiming He, Fang Wen, Jian Sun. 2013. “K-means Hashing: An Affinity-Preserving Quantization Method for Learning Binary Compact Codes.” CVPR. https://www.cv-foundation.org/openaccess/content_cvpr_2013/papers/He_K-Means_Hashing_An_2013_CVPR_paper.pdf

• †[Squashing] Madigan, D., Raghavan, N., Dumouchel, W., Nason, M., Posse, C., and Ridgeway, G. (2002). “Likelihood-based data squashing: A modeling approach to instance construction.” Data Mining and Knowledge Discovery 6(2): 173–190. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1023/A:1014095614948

• †[Core-sets] Piotr Indyk, Sepideh Mahabadi, Mohammad Mahdian, Vahab S. Mirrokni. 2014. “Composable core-sets for diversity and coverage maximization.” PODS ’14: 100–8. https://doi.org/10.1145/2594538.2594560

Feature selection, feature extraction, feature engineering, weighting, preprocessing

• PatternRecognition, Sections 2.6–7 “Feature Selection,” “Feature Extraction”

• DSHandbook-Py, Ch. 7 “Interlude: Feature Extraction Ideas”

• DataScience-Python, NB 05.04 “Feature Engineering”

• Features in text: NLP and InfoRetrieval re: tf-idf and similar; FightinWords

• Features in images: DataScience-Python, NB 05.14 “Image Features”

Dimensionality reduction, decomposition, factorization, change of basis, reparameterization, matrix completion, latent variables, source separation

• MMDS, Ch. 9 “Recommendation Systems,” Ch. 11 “Dimensionality Reduction”

• DSHandbook-Py, Ch. 10 “Unsupervised Learning: Clustering and Dimensionality Reduction”

• ‡[Shalizi-ADA] Cosma Rohilla Shalizi. 2017. Advanced Data Analysis from an Elementary Point of View. Ch. 16 (“Principal Components Analysis”), Ch. 17 (“Factor Models”). http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV

See also Multivariate-R, Ch. 3 “Principal Components Analysis,” Ch. 4 “Multidimensional Scaling,” Ch. 5 “Exploratory Factor Analysis”; Latent

• DeepLearning, Ch. 2 “Linear Algebra,” Ch. 13 “Linear Factor Models”

• ‡[GloVe] Jeffrey Pennington, Richard Socher, Christopher Manning. 2014. “GloVe: Global vectors for word representation.” EMNLP. https://nlp.stanford.edu/projects/glove

• †[NMF] Daniel D. Lee and H. Sebastian Seung. 1999. “Learning the parts of objects by non-negative matrix factorization.” Nature 401: 788–791. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1038/44565

• †[CUR] Michael W. Mahoney and Petros Drineas. 2009. “CUR matrix decompositions for improved data analysis.” PNAS. http://www.pnas.org/content/106/3/697.full

• ‡[ICA] Aapo Hyvärinen and Erkki Oja. 2000. “Independent Component Analysis: Algorithms and Applications.” Neural Networks 13(4-5): 411–30. https://www.cs.helsinki.fi/u/ahyvarin/papers/NN00new.pdf

• [RandomProjection] Ella Bingham and Heikki Mannila. 2001. “Random projection in dimensionality reduction: applications to image and text data.” KDD. https://doi.org/10.1145%2F502512.502546

• Compression, e.g., “Transform coding,” “Wavelets”

Nonlinear dimensionality reduction, Manifold learning

• DeepLearning, Section 5.11.3 “Manifold Learning”

• DataScience-Python, NB 05.10 “Manifold Learning” (Locally linear embedding [LLE], Isomap)

• ‡[KernelPCA] Sebastian Raschka. 2014. “Kernel tricks and nonlinear dimensionality reduction via RBF kernel PCA.” http://sebastianraschka.com/Articles/2014_kernel_pca.html

• Shalizi-ADA, Ch. 18 “Nonlinear Dimensionality Reduction” (LLE)

• ‡[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi. 2003. “Laplacian eigenmaps for dimensionality reduction and data representation.” Neural Computation 15(6): 1373–1396. http://web.cse.ohio-state.edu/~belkin.8/papers/LEM_NC_03.pdf

• DeepLearning, Ch. 14 (“Autoencoders”)

• ‡[word2vec] Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” https://arxiv.org/abs/1301.3781

• ‡[word2vecExplained] Yoav Goldberg and Omer Levy. 2014. “word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method.” https://arxiv.org/abs/1402.3722

• ‡[t-SNE] Laurens van der Maaten and Geoffrey Hinton. 2008. “Visualizing Data using t-SNE.” Journal of Machine Learning Research 9: 2579–2605. http://jmlr.csail.mit.edu/papers/volume9/vandermaaten08a/vandermaaten08a.pdf

Computation and Scaling Up

Scientific computing, computation at scale

• NRCreport, Ch. 10 “The Seven Computational Giants of Massive Data Analysis”

• DSHandbook-Py, Ch. 21 “Performance and Computer Memory,” Ch. 22 “Computer Memory and Data Structures”

• ‡[ComputationalStatistics-Python] Cliburn Chan. Computational Statistics in Python. https://people.duke.edu/~ccc14/sta-663/index.html

• CRAN Task View: High Performance Computing. https://cran.r-project.org/web/views/HighPerformanceComputing.html

Numerical computing

• [NumericalComputing] Ward Cheney and David Kincaid. 2013. Numerical Mathematics and Computing, 7th ed. Brooks/Cole: Cengage Learning.

• CRAN Task View: Numerical Mathematics. https://cran.r-project.org/web/views/NumericalMathematics.html

Optimization (e.g., MLE, gradient descent, stochastic gradient descent, EM algorithm, neural nets)

• DSHandbook-Py, Ch. 23 “Maximum Likelihood Estimation and Optimization” (gradient descent)

• DeepLearning, Ch. 4 “Numerical Computation,” Ch. 6 “Deep Feedforward Networks,” Ch. 8 “Optimization for Training Deep Models”

• CRAN Task View: Optimization and Mathematical Programming. https://cran.r-project.org/web/views/Optimization.html

Linear algebra, matrix computations

• [MatrixComputations] Gene H. Golub and Charles F. Van Loan. 2013. Matrix Computations, 4th ed. Johns Hopkins University Press.

• ‡[NetflixMatrix] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. “Matrix Factorization Techniques for Recommender Systems.” Computer, August: 42–9. https://datajobs.com/data-science-repo/Recommender-Systems-[Netflix].pdf

• ‡[BigDataPCA] Jianqing Fan, Qiang Sun, Wen-Xin Zhou, Ziwei Zhu. “Principal Component Analysis for Big Data.” http://www.princeton.edu/~ziweiz/pca.pdf

• ‡[RandomSVD] Andrew Tulloch. 2009. “Fast Randomized SVD.” https://research.fb.com/fast-randomized-svd

• ‡[FactorSGD] Rainer Gemulla, Peter J. Haas, Erik Nijkamp, and Yannis Sismanis. 2011. “Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent.” KDD. http://www.cs.utah.edu/~hari/teaching/bigdata/gemulla11dsgd.pdf

• ‡[Sparse] Max Grossman. 2015. “101 Ways to Store a Sparse Matrix.” https://medium.com/@jmaxg3/101-ways-to-store-a-sparse-matrix-c7f2bf15a229

• ‡[DontInvert] John D. Cook. 2010. “Don’t Invert that Matrix.” https://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix

Simulation-based inference, resampling, Monte Carlo methods, MCMC, Bayes, approximate inference

• ‡[MarkovVisually] Victor Powell. “Markov Chains Explained Visually.” http://setosa.io/ev/markov-chains

• DSHandbook-Py, Ch. 25 “Stochastic Modeling” (Markov chains, MCMC, HMM)

• Bayes: ComputationalStatistics-Py

• DeepLearning, Ch. 17 “Monte Carlo Methods,” Ch. 19 “Approximate Inference”

• ‡[VariationalInference] Jason Eisner. 2011. “High-Level Explanation of Variational Inference.” https://www.cs.jhu.edu/~jason/tutorials/variational.html

• CRAN Task View: Bayesian Inference. https://cran.r-project.org/web/views/Bayesian.html

Parallelism, MapReduce, Split-Apply-Combine

• ‡[MapReduceIntuition] Jigsaw Academy. 2014. “Big Data Specialist: MapReduce.” https://www.youtube.com/watch?v=TwcYQzFqg-8&feature=youtube

• MMDS, Ch. 2 “Map-Reduce and the New Software Stack” (3rd Edition discusses Spark & TensorFlow)

• Algorithms, Ch. 3 “Scalable Algorithms and Associative Statistics,” Ch. 4 “Hadoop and MapReduce”

• NRCreport, Ch. 6 “Resources, Trade-offs, and Limitations”

• TidyData-R (Split-apply-combine is the motivating principle behind the “tidyverse” approach.) See also Part III “Program” (pipes, functions, vectors, iteration).

• ‡[TidySAC-Video] Hadley Wickham. 2017. “Data Science in the Tidyverse.” https://www.rstudio.com/resources/videos/data-science-in-the-tidyverse

• See also: FactorSGD

Functional Programming

• ComputationalStatistics-Python, “Functions are first class objects” through first exercises

• DSHandbook-Py, Ch. 20 “Programming Language Concepts”

• See also: Haskell

Scaling iteration, streaming data, online algorithms (Spark)

• NRCreport, Ch. 4 “Temporal Data and Real-Time Algorithms”

• MMDS, Ch. 4 “Mining Data Streams”

• ‡[BDAS] AMPLab. BDAS: The Berkeley Data Analytics Stack. https://amplab.cs.berkeley.edu/software

• †[Spark] Mohammed Guller. 2015. Big Data Analytics with Spark: A Practitioner’s Guide to Using Spark for Large-Scale Data Processing, Machine Learning, and Graph Analytics, and High-Velocity Data Stream Processing. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-0964-6

• DSHandbook-Py, Ch. 13 “Big Data”

• DeepLearning, Ch. 10 “Sequence Modeling: Recurrent and Recursive Nets”

General Resources for Python and R

• †[DSHandbook-Py] Field Cady. 2017. The Data Science Handbook (Python-based). Wiley. http://onlinelibrary.wiley.com.ezaccess.libraries.psu.edu/book/10.1002/9781119092919

• ‡[Tutorials-R] Ujjwal Karn. 2017. “A curated list of R tutorials for Data Science, NLP, and Machine Learning.” https://github.com/ujjwalkarn/DataScienceR

• ‡[Tutorials-Python] Ujjwal Karn. 2017. “A curated list of Python tutorials for Data Science, NLP, and Machine Learning.” https://github.com/ujjwalkarn/DataSciencePython

Cutting and Bleeding Edge of Data Science Languages

• [Scala] †Vishal Layka and David Pollak. 2015. Beginning Scala. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-0232-6. †Patrick R. Nicolas. 2014. Scala for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1901910. ‡Scala site: https://www.scala-lang.org (See also )

• [Julia] †Ivo Balbaert. 2015. Getting Started with Julia. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1973847. ‡Douglas Bates. 2013. “Julia for R Programmers.” http://www.stat.wisc.edu/~bates/JuliaForRProgrammers.pdf. ‡Julia site: https://julialang.org

• [Haskell] †Richard Bird. 2014. Thinking Functionally with Haskell. https://doi-org.ezaccess.libraries.psu.edu/10.1017/CBO9781316092415. †Hakim Cassimally. 2017. Learning Haskell Programming. https://www.lynda.com/Haskell-tutorials/Learning-Haskell-Programming/604926-2.html. †James Church. 2017. Learning Haskell for Data Analysis. https://www.lynda.com/Developer-tutorials/Learning-Haskell-Data-Analysis/604234-2.html. Haskell site: https://www.haskell.org

• [Clojure] †Mark McDonnell. 2017. Quick Clojure: Essential Functional Programming. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2952-1. †Akhil Wali. 2014. Clojure for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1674848. †Arthur Ulfeldt. 2015. https://www.lynda.com/Clojure-tutorials/Up-Running-Clojure/413127-2.html. ‡Clojure site: https://clojure.org

• [TensorFlow] ‡Abadi et al. (Google). 2016. “TensorFlow: A System for Large-Scale Machine Learning.” https://arxiv.org/abs/1605.08695. ‡Udacity, “Deep Learning.” https://www.udacity.com/course/deep-learning--ud730. ‡TensorFlow site: https://www.tensorflow.org

• [H2O] Darren Cook. 2016. Practical Machine Learning with H2O: Powerful, Scalable Techniques for Deep Learning and AI. O’Reilly. Arno Candel and Viraj Parmar. 2015. Deep Learning with H2O. ‡H2O site: https://www.h2o.ai

Social Data Structures

Space and Time

• †[GIA] David O’Sullivan, David J. Unwin. 2010. Geographic Information Analysis, Second Edition. John Wiley & Sons. http://onlinelibrary.wiley.com/book/10.1002/9780470549094

• [Space-Time] Donna J. Peuquet. 2003. Representations of Space and Time. Guilford.

• Roger S. Bivand, Edzer Pebesma, Virgilio Gómez-Rubio. 2013. Applied Spatial Data Analysis in R. Free through library: http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-7618-4. See also Edzer Pebesma. 2016. “Handling and Analyzing Spatial, Spatiotemporal, and Movement Data in R.” https://edzer.github.io/UseR2016

• ‡[PySAL] Sergio J. Rey and Dani Arribas-Bel. 2016. “Geographic Data Science with PySAL and the pydata Stack.” http://darribas.org/gds_scipy16

• DataScience-Python, NB 04.13 “Geographic Data with Basemap”

• Time: “Longitudinal,” intra-individual data as common in developmental psychology: †[Longitudinal] Garrett Fitzmaurice, Marie Davidian, Geert Verbeke, and Geert Molenberghs, editors. 2009. Longitudinal Data Analysis. Chapman & Hall/CRC. http://pensu.eblib.com/patron/FullRecord.aspx?p=359998

• Time: “Time Series, Time Series Cross Section, Panel,” as common in economics, political science, and sociology: DataScience-Python, Ch. 3 re: pandas; DSHandbook-Py, Ch. 17 “Time Series Analysis”; Econometrics; GelmanHill

• Time: “Sequential, Data Streams,” as in NLP, machine learning: Spark and other “streaming data” readings

• Space-Time: Data with continuity, connectivity, neighborhood structure. See DeepLearning, Chapter 9 re: convolution.

• CRAN Task Views: Spatial, https://cran.r-project.org/web/views/Spatial.html; Spatiotemporal, https://cran.r-project.org/web/views/SpatioTemporal.html; Time Series, https://cran.r-project.org/web/views/TimeSeries.html

Network Graphs

• ‡[Networks] David Easley and Jon Kleinberg. 2010. Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press. http://www.cs.cornell.edu/home/kleinber/networks-book

• MMDS Ch. 5 “Link Analysis,” Ch. 10 “Mining Social-Network Graphs”

• †[Networks-R] Douglas Luke. 2015. A User’s Guide to Network Analysis in R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-3-319-23883-8

• †[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry. 2017. Python for Graph and Network Analysis. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-53004-8

Hierarchy, Clustered Data, Aggregation, Mixed Models

• †[GelmanHill] Andrew Gelman and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press. http://pensu.eblib.com/patron/FullRecord.aspx?p=288457

• †[MixedModels] Eugene Demidenko. 2013. Mixed Models: Theory and Applications with R, 2nd ed. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10748641

Social Data Channels

Language, Text, Speech, Audio

• ‡[NLP] Daniel Jurafsky and James H. Martin. Forthcoming (2017). Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics, 3rd ed. Prentice-Hall. Preprint: https://web.stanford.edu/~jurafsky/slp3

• InfoRetrieval

• ‡[CoreNLP] Stanford CoreNLP. https://stanfordnlp.github.io/CoreNLP

• †[TextAsData] Grimmer and Stewart. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis. https://doi.org/10.1093/pan/mps028

• Quinn-Topics

• FightinWords

• †[NLTK-Python] Jacob Perkins. 2010. Python Text Processing with NLTK 2.0 Cookbook. Packt. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10435387 †[TextAnalytics-Python] Dipanjan Sarkar. 2016. Text Analytics with Python. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2388-8

• DSHandbook-Py Ch. 16 “Natural Language Processing”

• Matthew J. Denny. http://www.mjdenny.com/Text_Processing_In_R.html

• CRAN Natural Language Processing: https://cran.r-project.org/web/views/NaturalLanguageProcessing.html

• †[AudioData] Dean Knox and Christopher Lucas. 2017. “The Speaker-Affect Model: Measuring Emotion in Political Speech with Audio Data.” http://christopherlucas.org/files/PDFs/sam.pdf and https://www.youtube.com/watch?v=Hs8A9dwkMzI

• †Morgan & Claypool Synthesis Lectures on Human Language Technologies. http://www.morganclaypool.com/toc/hlt/1/1

• †Morgan & Claypool Synthesis Lectures on Speech and Audio Processing. http://www.morganclaypool.com/toc/sap/1/1

Vision, Image, Video

• ‡[ComputerVision] Richard Szeliski. 2010. Computer Vision: Algorithms and Applications. Springer. http://szeliski.org/Book

• MixedModels Ch. 11 “Statistical Analysis of Shape,” Ch. 12 “Statistical Image Analysis”

• Audio/Video/Volumetric: See DeepLearning, Chapter 9 re: convolution

• †Morgan & Claypool Synthesis Lectures on Image, Video, and Multimedia Processing. http://www.morganclaypool.com/toc/ivm/1/1

• †Morgan & Claypool Synthesis Lectures on Computer Vision. http://www.morganclaypool.com/toc/cov/1/1

Approaches to Learning from Data (The Analytics Layer)

• NRCreport Ch. 7 “Building Models from Massive Data”

• MMDS (data-mining)

• CausalInference

• ‡[VisualAnalytics] Daniel Keim, Jörn Kohlhammer, Geoffrey Ellis, and Florian Mansmann, editors. 2010. Mastering the Information Age: Solving Problems with Visual Analytics. The Eurographics Association, Goslar, Germany. http://www.vismaster.eu/book See also †Morgan & Claypool Synthesis Lectures on Visualization. http://www.morganclaypool.com/toc/vis/2/1

• ‡[Econometrics] Bruce E. Hansen. 2014. Econometrics. http://www.ssc.wisc.edu/~bhansen/econometrics

• ‡[StatisticalLearning] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning with Applications in R. Springer. http://www-bcf.usc.edu/~gareth/ISL/index.html

• [MachineLearning] Christopher Bishop. 2006. Pattern Recognition and Machine Learning. Springer.

• †[Bayes] Peter D. Hoff. 2009. A First Course in Bayesian Statistical Methods. Springer. http://link.springer.com/book/10.1007%2F978-0-387-92407-6

• †[GraphicalModels] Søren Højsgaard, David Edwards, Steffen Lauritzen. 2012. Graphical Models with R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-2299-0

• [InformationTheory] David MacKay. 2003. Information Theory, Inference, and Learning Algorithms. Cambridge University Press.

• ‡[DeepLearning] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org (see also Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. “Deep Learning.” Nature 521(7553): 436–444. https://doi.org/10.1038/nature14539)

See also ‡[Keras] Keras: The Python Deep Learning Library. https://keras.io François Chollet, Directory of Keras tutorials: https://github.com/fchollet/keras-resources

• †[SignalProcessing] Jose Maria Giron-Sierra. 2013. Digital Signal Processing with MATLAB Examples (Volumes 1-3). Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-981-10-2534-1

Penn State Policy Statements

Academic Integrity

Academic integrity is the pursuit of scholarly activity in an open, honest, and responsible manner. Academic integrity is a basic guiding principle for all academic activity at The Pennsylvania State University, and all members of the University community are expected to act in accordance with this principle. Consistent with this expectation, the University's Code of Conduct states that all students should act with personal integrity, respect other students' dignity, rights, and property, and help create and maintain an environment in which all can succeed through the fruits of their efforts.

Academic integrity includes a commitment by all members of the University community not to engage in or tolerate acts of falsification, misrepresentation, or deception. Such acts of dishonesty violate the fundamental ethical principles of the University community and compromise the worth of work completed by others.

Disability Accommodation

Penn State welcomes students with disabilities into the University's educational programs. Every Penn State campus has an office for students with disabilities. The Student Disability Resources (SDR) website provides contact information for every Penn State campus (http://equity.psu.edu/sdr/disability-coordinator). For further information, please visit the Student Disability Resources website (http://equity.psu.edu/sdr).

In order to receive consideration for reasonable accommodations, you must contact the appropriate disability services office at the campus where you are officially enrolled, participate in an intake interview, and provide documentation. See documentation guidelines at (http://equity.psu.edu/sdr/guidelines). If the documentation supports your request for reasonable accommodations, your campus disability services office will provide you with an accommodation letter. Please share this letter with your instructors and discuss the accommodations with them as early as possible. You must follow this process for every semester that you request accommodations.

Psychological Services

Many students at Penn State face personal challenges or have psychological needs that may interfere with their academic progress, social development, or emotional wellbeing. The university offers a variety of confidential services to help you through difficult times, including individual and group counseling, crisis intervention, consultations, online chats, and mental health screenings. These services are provided by staff who welcome all students and embrace a philosophy respectful of clients' cultural and religious backgrounds, and sensitive to differences in race, ability, gender identity, and sexual orientation.

• Counseling and Psychological Services at University Park (CAPS) (http://studentaffairs.psu.edu/counseling): 814-863-0395

• Penn State Crisis Line (24 hours/7 days/week): 877-229-6400

• Crisis Text Line (24 hours/7 days/week): Text LIONS to 741741

Educational Equity

Penn State takes great pride to foster a diverse and inclusive environment for students, faculty, and staff. Consistent with University Policy AD29, students who believe they have experienced or observed a hate crime, an act of intolerance, discrimination, or harassment that occurs at Penn State are urged to report these incidents as outlined on the University's Report Bias webpage (http://equity.psu.edu/reportbias).

Background Knowledge

Students will have a variety of backgrounds. Prior to beginning interdisciplinary coursework to fulfill Social Data Analytics degree requirements, including SoDA 501 and 502, students are expected to have advanced (graduate) training in at least one of the component areas of Social Data Analytics and a familiarity with basic concepts in the others.

With regard to specialization, students are expected to have advanced (graduate) training in ONE of the following:

• quantitative social science methodology and a discipline of social science (as would be the case for a second-year PhD student in Political Science, Sociology, Criminology, Human Development and Family Studies, or Demography), OR

• statistics (as would be the case for a second-year PhD student in Statistics), OR

• information science or informatics (as would be the case for a second-year PhD student in Information Science and Technology, or a second-year PhD student in Geography specializing in GIScience), OR

• computer science (as would be the case for a second-year PhD student in Computer Science and Engineering).

This requirement is met as a matter of meeting home program requirements for students in the dual-title PhD, but may require additional coursework on the part of students in other programs wishing to pursue the graduate minor.

With regard to general preparation, students are expected to have ALL of the following technical knowledge:

• basic programming skills, including basic facility with R and/or Python, AND

• basic knowledge of relational databases and/or geographic information systems, AND

• basic knowledge of probability, applied statistics, and/or social science research design, AND

• basic familiarity with a substantive or theoretical area of social science (e.g., 300-level coursework in political science, sociology, criminology, human development, psychology, economics, communication, anthropology, human geography, social informatics, or similar fields).

It is not unusual for students to have one or more gaps in this preparation at time of application to the SoDA program. Students should work with Social Data Analytics advisers to develop a plan for timely remediation of any deficiencies, which generally will not require formal coursework for students whose training and interests are otherwise appropriate for pursuit of the Social Data Analytics degree. Where possible, this will be addressed at time of application to the Social Data Analytics program.

To this end, some free training materials are linked in the reference section, and there are also abundant free, high-quality, self-paced, course-style training materials on these and related subjects available through edX (http://www.edx.org), Udacity (http://www.udacity.com), Codecademy (http://codecademy.com), and Lynda (http://lynda.psu.edu).

Reproducibility

Refresh on MachineBias

• Further reference:

Section “Ethics and Scientific Responsibility in Big Social Data”

Section “¡Cuidado!”

April 26 - LAST CLASS MEETING

• Team Project Presentations

May 3 - Team Projects Due

Past Visiting Speakers in SoDA 501 and 502

SoDA 502 (Fall 2017)

• Chris Zorn (PLSC)

• Diane Felmlee (SOC)

• Alan MacEachren (GEOG)

• Eric Plutzer (PLSC)

• James LeBreton (PSYCH)

• Sesa Slavkovic (STAT)

• Guido Cervone (GEOG)

• Ashton Verdery (SOC)

• Maggie Niu (STAT)

• Reka Albert (PHYS)

SoDA 501 (Spring 2017)

• Tim Brick (HDFS), “Towards real-time monitoring and intervention using wearable technology”

• Aylin Caliskan (Princeton), “A Story of Discrimination and Unfairness: Bias in Word Embeddings”

• Jay Yonamine (Google - IGERT alum), “Data Science in Industry”

• Johnathan Rush (Illinois), “Geospatial Data Science Workshop”

• Rick Gilmore (PSYCH), “Toward a more reproducible and robust science of human behavior”

• Glenn Firebaugh (SOC), “Measuring Inequality and Segregation with US Census Data”

• Charles Twardy (Sotera), “Data Science for Search and Rescue”

• Anna Smith (Ohio State), “A Hierarchical Model for Network Data in a Latent Hyperbolic Space”

• Rebecca Passonneau (CSE), “Omnigraph: Rich Feature Representation for Graph Kernel Learning”

• Alex Klippel (GEOG), “Virtual Reality for Immersive Analytics”

• Murali Haran (STAT), “A Computationally Efficient Projection-based Approach for Spatial Generalized Mixed Models”

SoDA 502 (Fall 2016)

• Clio Andris (GEOG), “Integrating Social Network Data into GISystems”

• Jia Li (STAT), “Clustering under the Wasserstein Metric”

• Rachel Smith (CAS), “Stigma Networks: Perceptions of Sociograms”

• Zita Oravecz (HDFS)

• Bethany Bray (Methodology Center), “Latent Class and Latent Transition Analysis”

• Dave Hunter (STAT), “Model Based Clustering of Large Networks”

• Scott Bennett (PLSC), “ABM Model of Insurgency”

• David Reitter (IST)

• Suzanna Linn (PLSC), “Methodological Issues in Automated Text Analysis: Application to News Coverage of the US Economy”

SoDA 501 (Spring 2016)

• Bruce Desmarais (PLSC), “Learning in the Sunshine: Analysis of Local Government Email Corpora”

• Timothy Brick (HDFS), “Mapping and Manipulating Facial Expression”

• Qunying Huang (USC), “Social Media: An Emerging Data Source for Human Mobility Studies”

• Lingzhou Xue (STAT), “An Introduction to High-Dimensional Graphical Models”

• Ashton Verdery (SOC), “Sampling from Network Data”

• Sarah Battersby (Tableau), “Helping People See and Understand Spatial Data”

• Alexandra Slavkovic (STAT), “Statistical Privacy with Network Data”

• Lee Giles (IST), “Machine Learning for Scholarly Big Data”

Readings and References (updated Spring 2018)

We will discuss a relatively small subset of the readings listed here, and this will vary based on topics and readings discussed by visiting speakers and you yourselves. The remainder are provided here as curated references for more in-depth investigation in both the theory and practice related to the topic (with the latter heavily weighted toward resources in Python and R).

† Material that is, at last check, made available through the Penn State library. Most journal links should work if you are logged in to a Penn State machine or through the Penn State VPN. Some article archives and most books require additional authentication through webaccess. If links are broken, start directly from a search via https://www.libraries.psu.edu. Some books require installation of e-readers like Adobe Digital Editions. Links to lynda.com can be accessed through http://lynda.psu.edu.

‡ Material that is, at last check, legally provided for free. In some cases, these are the preprint versions of published material.

§ Material for which a legal selection is or will be provided through the class Box folder.

Big Data & Social Data Analytics

Overviews

• ‡[CompSocSci] David Lazer, Alex Pentland, Lada Adamic, Sinan Aral, Albert-Laszlo Barabasi, Devon Brewer, Nicholas Christakis, Noshir Contractor, James Fowler, Myron Gutmann, Tony Jebara, Gary King, Michael Macy, Deb Roy, and Marshall Van Alstyne. 2009. “Computational Social Science.” Science 323(5915):721-3 + Supp. Feb 6. http://science.sciencemag.org/content/323/5915/721.full and https://gking.harvard.edu/files/gking/files/LazPenAda09.pdf

• ‡[NRCreport] National Research Council. 2013. Frontiers in Massive Data Analysis. National Academies Press. (Free w/ registration: http://www.nap.edu/catalog.php?record_id=18374) Ch. 1 “Introduction,” Ch. 2 “Massive Data in Science, Technology, Commerce, National Defense, Telecommunications, and other Endeavors”

• ‡[MMDS] Jure Leskovec, Anand Rajaraman, and Jeff Ullman. 2014. Mining of Massive Datasets. Cambridge University Press. http://www.mmds.org (BETA version of Third Edition: http://i.stanford.edu/~ullman/mmdsn.html)

• §[BitByBit] Matthew J. Salganik. 2018 (Forthcoming). Bit by Bit: Social Research in the Digital Age. Princeton University Press. Ch. 1 “Introduction,” Ch. 2 “Observing Behavior”

• DSHandbook-Py Ch. 1 “Introduction: Becoming a Unicorn,” Ch. 2 “The Data Science Road Map”

Burt-schtick

• ‡[Monroe-5Vs] Burt L. Monroe. 2013. “The Five Vs of Big Data Political Science: Introduction to the Special Issue on Big Data in Political Science.” Political Analysis 21(V5): 1–9. https://doi.org/10.1017/S1047198700014315 (Volume, Velocity, Variety, Vinculation, Validity)

• †[Monroe-No] Burt L. Monroe, Jennifer Pan, Margaret E. Roberts, Maya Sen, and Betsy Sinclair. 2015. “No! Formal Theory, Causal Inference, and Big Data Are Not Contradictory Trends in Political Science.” PS: Political Science & Politics 48(1): 71–4. http://dx.doi.org/10.1017/S1049096514001760

• †[Quinn-Topics] Kevin M. Quinn, Burt L. Monroe, Michael Colaresi, Michael H. Crespin, and Dragomir R. Radev. 2010. “How to Analyze Political Attention with Minimal Assumptions and Costs.” American Journal of Political Science 54(1): 209–28. http://onlinelibrary.wiley.com/doi/10.1111/j.1540-5907.2009.00427.x/full (esp. topic modeling as measurement, approach to validation)

• †[FightinWords] Burt L. Monroe, Michael Colaresi, and Kevin M. Quinn. 2008. “Fightin’ Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict.” Political Analysis 16(4): 372-403. https://doi.org/10.1093/pan/mpn018 (esp. the impact of sampling variance and regularization through priors)

• ‡[BDSS-Census] Big Data Social Science PSU Team. 2012. “A Closer Look at the Kaggle Census Data.” https://burtmonroe.github.io/BDSSKaggleCensus2012 (esp. the relevance of the social processes by which data come to exist as data)

Multidisciplinary Perspectives

• †[Business-BigData] Andrew McAfee and Erik Brynjolfsson. 2012. “Big Data: The Management Revolution.” Harvard Business Review 90(10): 61–8. October. https://hbr.org/2012/10/big-data-the-management-revolution Thomas H. Davenport and D.J. Patil. 2012. “Data Scientist: The Sexiest Job of the 21st Century.” Harvard Business Review 90(10):70-6. October. https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century

• †[InfoSci-BigData] C.L. Philip Chen and Chun-Yang Zhang. 2014. “Data-intensive Applications, Challenges, Techniques and Technologies: A Survey on Big Data.” Information Sciences 275: 314-47. https://doi.org/10.1016/j.ins.2014.01.015

• †[Informatics-BigData] Vasant G. Honavar. 2014. “The Promise and Potential of Big Data: A Case for Discovery Informatics.” Review of Policy Research 31(4): 326-330. https://doi.org/10.1111/ropr.12080

• ‡[Stats-BigData] Beate Franke, Jean-Francois Plante, Ribana Roscher, Annie Lee, Cathal Smyth, Armin Hatefi, Fuqi Chen, Einat Gil, Alexander Schwing, Alessandro Selvitella, Michael M. Hoffman, Roger Grosse, Dietrich Hendricks, and Nancy Reid. 2016. “Statistical Inference, Learning and Models in Big Data.” International Statistical Review 84(3): 371-89. http://onlinelibrary.wiley.com/doi/10.1111/insr.12176/full

• †[Econ-BigData] Hal R. Varian. 2013. “Big Data: New Tricks for Econometrics.” Journal of Economic Perspectives 28(2): 3-28. https://doi.org/10.1257/jep.28.2.3

• †[GeoViz-BigData] Alan MacEachren. 2017. “Leveraging Big (Geo) Data with (Geo) Visual Analytics: Place as the Next Frontier.” In Chenghu Zhou, Fenzhen Su, Francis Harvey, and Jun Xu, eds. Spatial Data Handling in the Big Data Era, pp. 139-155. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/chapter/10.1007/978-981-10-4424-3_10

• †[Soc-BigData] David Lazer and Jason Radford. 2017. “Data ex Machina: Introduction to Big Data.” Annual Review of Sociology 43: 19-39. https://doi.org/10.1146/annurev-soc-060116-053457

• §[Politics-BigData] Keith T. Poole, L. Jason Anastasopoulos, and James E. Monogan III. Forthcoming. “The ‘Big Data’ Revolution in Political Campaigning and Governance.” Oxford Bibliographies in Political Science.

¡Cuidado! Traps, Biases, Problems, Pains, Perils

• †[GoogleFlu] David Lazer, Ryan Kennedy, Gary King, and Alessandro Vespignani. 2014. “The Parable of Google Flu: Traps in Big Data Analysis.” Science (343), 14 March. http://gking.harvard.edu/files/gking/files/0314policyforumff.pdf

• ‡[MachineBias] ProPublica. “Machine Bias: Investigating Algorithmic Injustice” Series. https://www.propublica.org/series/machine-bias See especially:

Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. 2016. “Machine Bias.” ProPublica, May 23. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing

Julia Angwin, Madeleine Varner, and Ariana Tobin. 2017. “Facebook Enabled Advertisers to Reach Jew Haters.” https://www.propublica.org/article/facebook-enabled-advertisers-to-reach-jew-haters

• ‡[Polling2016] Doug Rivers (Nov 11, 2016). “First Thoughts on Polling Problems in the 2016 U.S. Elections.” https://today.yougov.com/news/2016/11/11/first-thoughts-polling-problems-2016-us-elections

• †[EventData] Wei Wang, Ryan Kennedy, David Lazer, Naren Ramakrishnan. 2016. “Growing Pains for Global Monitoring of Societal Events.” Science 353(6307), pp. 1502–1503. https://doi.org/10.1126/science.aaf6758

• ‡[GoogleBooks] Eitan Adam Pechenick, Christopher M. Danforth, and Peter Sheridan Dodds. 2015. “Characterizing the Google Books Corpus: Strong Limits to Inferences of Socio-Cultural and Linguistic Evolution.” PLoS One. https://doi.org/10.1371/journal.pone.0137041

• ‡[OkCupid] Michael Zimmer. 2016. “OkCupid Study Reveals the Perils of Big-Data Science.” Wired, May 14. https://www.wired.com/2016/05/okcupid-study-reveals-perils-big-data-science

• †[CriticalQuestions] danah boyd and Kate Crawford. 2011. “Critical Questions for Big Data: Provocations for a Cultural, Technological, and Scholarly Phenomenon.” Information, Communication & Society 15(5): 662–79. http://dx.doi.org/10.1080/1369118X.2012.678878

• ‡[RacistBot] Daniel Victor. 2016. “Microsoft created a Twitter bot to learn from users. It quickly became a racist jerk.” New York Times. https://www.nytimes.com/2016/03/25/technology/microsoft-created-a-twitter-bot-to-learn-from-users-it-quickly-became-a-racist-jerk.html

• MMDS 1.2.2-1.2.3 on Bonferroni

• Also BDSS-Census

Research Design and Measurement

Overviews

• BitByBit

• ‡[ResearchMethodsKB] William M. Trochim. 2006. The Research Methods Knowledge Base. http://www.socialresearchmethods.net/kb

• [7Rules] Glenn Firebaugh. 2008. Seven Rules for Social Research. Princeton University Press.

Measurement Reliability and Validity

• ResearchMethodsKB “Measurement”

• 7Rules Ch. 3 “Build Reality Checks into Your Research”

• Reliability: see also FightinWords

• Validity: see also Quinn-Topics, Monroe-5Vs

Indirect, Unobtrusive, Nonreactive Measures; Data Exhaust

• †[UnobtrusiveMeasures] Raymond M. Lee. 2015. “Unobtrusive Measures.” Oxford Bibliographies. https://doi.org/10.1093/OBO/9780199846740-0048 (Canonical cite is Eugene J. Webb. 1966. Unobtrusive Measures, or Webb, Donald T. Campbell, Richard D. Schwartz, Lee Sechrest. 1999. Unobtrusive Measures, rev. ed. Sage.)

• BitByBit: Examples in Chapter 2

Multiple Measures, Latent Variable Measurement

• 7Rules Ch. 4 “Replicate Where Possible”

• †[LatentVariables] David J. Bartholomew, Martin Knott, and Irini Moustaki. 2011. Latent Variable Models and Factor Analysis: A Unified Approach. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10483308 (esp. Ch. 1 “Basic ideas and examples”)

• †[Multivariate-R] Brian Everitt and Torsten Hothorn. 2011. An Introduction to Applied Multivariate Analysis with R. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4419-9650-3

• Shalizi-ADA Ch. 17 “Factor Models”

• ‡[NetflixPrize] Edwin Chen. 2011. “Winning the Netflix Prize: A Summary”

• ‡CRAN: https://cran.r-project.org/web/views/Multivariate.html, https://cran.r-project.org/web/views/Psychometrics.html, https://cran.r-project.org/web/views/Cluster.html

Sampling and Survey Design

• †[Sampling] Steven K. Thompson. 2012. Sampling, 3rd ed. http://sk8es4mc2l.search.serialssolutions.com/?sid=sersol&SS_jc=TC_024492330&title=Wiley%20Desktop%20Editions%20%3A%20Sampling

• ResearchMethodsKB “Sampling”

• NRCreport Ch. 8 “Sampling and Massive Data”

• BitByBit Ch. 3 “Asking Questions”

• ‡[MSE] Daniel Manrique-Vallier, Megan E. Price, and Anita Gohdes. 2013. In Seybolt et al. (eds.), Counting Civilians, “Multiple Systems Estimation Techniques for Estimating Casualties in Armed Conflicts.” (Preprint: http://citeseerx.ist.psu.edu/viewdoc/download?doi=1011469939&rep=rep1&type=pdf)

• †[NetworkSampling] Ted Mouw and Ashton M. Verdery. 2012. “Network Sampling with Memory: A Proposal for More Efficient Sampling from Social Networks.” Sociological Methodology 42(1):206–56. https://dx.doi.org/10.1177%2F0081175012461248

• See also FightinWords (re: hidden heteroskedasticity in sample variance)

• ‡[HashDontSample] Mudit Uppal. 2016. “Probabilistic data structures in the Big data world (+ code)” (re: “Hash, don’t sample”). https://medium.com/@muppal/probabilistic-data-structures-in-the-big-data-world-code-b9387cff0c55

• MMDS: Hash Functions (1.3.2), Sampling in streams (4.2)

• ‡CRAN: https://cran.r-project.org/web/views/OfficialStatistics.html (“Complex Survey Design,” “Small Area Estimation”)

Experimental and Observational Designs for Causal Inference

• ‡[CausalInference] Miguel A. Hernan, James M. Robins. Forthcoming (2017). Causal Inference. Chapman & Hall/CRC. Preprint: https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book Ch. 1 “A definition of causal effect,” Ch. 2 “Randomized experiments,” Ch. 3 “Observational studies.” (See Ch. 6 “Graphical representation of causal effects” for integration with Judea Pearl approach.)

• BitByBit Chapter 4 “Running Experiments”; Ch. 2, natural experiments in observable data examples

• ResearchMethodsKB “Design”

• 7Rules Ch. 2 “Look for Differences that Make a Difference, and Report Them,” §Ch. 5 “Compare Like with Like,” Ch. 6 “Use Panel Data to Study Individual Change and Repeated Cross-Section Data to Study Social Change”

• Shalizi-ADA Part IV “Causal Inference”

• Monroe-No

• CRAN: https://cran.r-project.org/web/views/ExperimentalDesign.html

Technologies for Primary and Secondary Data Collection

Mobile Devices, Distributed Sensors, Wearable Sensors, Remote Sensing

• †[RealityMining] Nathan Eagle and Alex (Sandy) Pentland. 2006. “Reality Mining: Sensing Complex Social Systems.” Personal and Ubiquitous Computing 10(4): 255–68. https://doi.org/10.1007/s00779-005-0046-3

• †[QuantifiedSelf] Melanie Swan. 2013. “The Quantified Self: Fundamental Disruption in Big Data Science and Biological Discovery.” Big Data 1(2): 85–99. https://doi.org/10.1089/big.2012.0002

• †[SensorData] Charu C. Aggarwal (Ed.). 2013. Managing and Mining Sensor Data. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-6309-2 Ch. 1 “An Introduction to Sensor Data Analytics”

• †[NightLights] Thushyanthan Baskaran, Brian Min, Yogesh Uppal. 2015. “Election cycles and electricity provision: Evidence from a quasi-experiment with Indian special elections.” Journal of Public Economics 126:64-73. https://doi.org/10.1016/j.jpubeco.2015.03.011

Crowdsourcing, Human Computation, Citizen Science, Web Experiments

• †[Crowdsourcing] Daren C. Brabham. 2013. Crowdsourcing. MIT Press. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10692208 “Introduction,” Ch. 1 “Concepts, Theories, and Cases of Crowdsourcing”

• BitByBit Ch. 5 “Creating Mass Collaboration”

• NRCreport Ch. 9 “Human Interaction with Data”

• †[MTurk] Krista Casler, Lydia Bickel, and Elizabeth Hackett. 2013. “Separate but Equal? A Comparison of Participants and Data Gathered via Amazon’s MTurk, Social Media, and Face-to-Face Behavioral Testing.” Computers in Human Behavior 29(6): 2156–60. http://doi.org/10.1016/j.chb.2013.05.009

• †[LabintheWild] Katharina Reinecke and Krzysztof Z. Gajos. 2015. “LabintheWild: Conducting Large-Scale Online Experiments With Uncompensated Samples.” In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing (CSCW ’15), 1364–1378. http://dx.doi.org/10.1145/2675133.2675246

• †[TweetmentEffects] Kevin Munger. 2017. “Tweetment Effects on the Tweeted: Experimentally Reducing Racist Harassment.” Political Behavior 39(3): 629–49. http://doi.org/10.1016/j.chb.2013.05.009

• †[HumanComputation] Edith Law and Luis von Ahn. 2011. Human Computation. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdf/10.2200/S00371ED1V01Y201107AIM013

Open Data, File Formats, APIs, Semantic Web, Linked Data

• DSHandbook-Py Ch. 12 “Data Encodings and File Formats”

• ‡[OpenData] Open Data Institute. “What Is Open Data?” https://theodi.org/what-is-open-data

• ‡[OpenDataHandbook] Open Knowledge International. The Open Data Handbook. http://opendatahandbook.org Includes appendix “File Formats”: http://opendatahandbook.org/guide/en/appendices/file-formats

• ‡[APIs] Brian Cooksey. 2016. An Introduction to APIs. https://zapier.com/learn/apis

• ‡[APIMarkets] RapidAPI / mashape API marketplaces: https://docs.rapidapi.com, https://market.mashape.com; ProgrammableWeb: https://www.programmableweb.com

• †[LinkedData] Tom Heath and Christian Bizer. 2011. Linked Data: Evolving the Web into a Global Data Space. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/abs/10.2200/S00334ED1V01Y201102WBE001 Ch. 1 “Introduction,” Ch. 2 “Principles of Linked Data”

• †[SemanticWeb] Nikolaos Konstantinou and Dimitrios-Emmanuel Spanos. 2015. Materializing the Web of Linked Data. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-16074-0 Ch. 1 “Introduction: Linked Data and the Semantic Web”

• †Morgan & Claypool Synthesis Lectures on the Semantic Web: Theory and Technology. http://www.morganclaypool.com/toc/wbe111

Web Scraping

• †[TheoryDrivenScraping] Richard N. Landers, Robert C. Brusso, Katelyn J. Cavanaugh, and Andrew B. Collmus. 2016. “A Primer on Theory-Driven Web Scraping: Automatic Extraction of Big Data from the Internet for Use in Psychological Research.” Psychological Methods 4: 475–492. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1037/met0000081

• ‡[Scraping-Py] Al Sweigart. 2015. Automate the Boring Stuff with Python: Practical Programming for Total Beginners. Ch. 11, Web-Scraping. https://automatetheboringstuff.com/chapter11

• †[Scraping-R] Simon Munzert, Christian Rubba, Peter Meißner, and Dominic Nyhuis. 2014. Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining. Wiley. http://onlinelibrary.wiley.com/book/10.1002/9781118834732

• ‡CRAN: https://cran.r-project.org/web/views/WebTechnologies.html

Ethics & Scientific Responsibility in Big Social Data

Human Subjects, Consent, Privacy

• BitByBit, Ch. 6 (“Ethics”)

• ‡[PSU-ORP] Penn State Office for Research Protections (under the VP for Research):

Human Subjects Research (IRB): https://www.research.psu.edu/irb

Revised Common Rule: https://www.research.psu.edu/irb/commonrulechanges

Responsible Conduct of Research: https://www.research.psu.edu/education/rcr

Research Misconduct: https://www.research.psu.edu/researchmisconduct

SARI@PSU (Scientific and Research Integrity Training): https://www.research.psu.edu/training/sari

• ‡[MenloReport] David Dittrich and Erin Kenneally (Center for Applied Internet Data Analysis). 2012. “The Menlo Report: Ethical Principles Guiding Information and Communication Technology Research” and companion report “Applying Ethical Principles ...” https://www.caida.org/publications/papers/2012/menlo_report_actual_formatted

• ‡[BigDataEthics-CBDES] Jacob Metcalf, Emily F. Keller, and danah boyd. 2016. “Perspectives on Big Data, Ethics, and Society.” Council for Big Data, Ethics, and Society. http://bdes.datasociety.net/council-output/perspectives-on-big-data-ethics-and-society

• ‡[AoIRReport] Annette Markham and Elizabeth Buchanan (Association of Internet Researchers). 2012. “Ethical Decision-Making and Internet Research: Recommendations of the AoIR Ethics Working Committee (Version 2.0).” https://aoir.org/reports/ethics2.pdf and guidelines chart: https://aoir.org/wp-content/uploads/2017/01/aoir_ethics_graphic_2016.pdf

• ‡[BigDataEthics-Wired] Sarah Zhang. 2016. “Scientists are Just as Confused about the Ethics of Big-Data Research as You.” Wired. https://www.wired.com/2016/05/scientists-just-confused-ethics-big-data-research

• †[BigDataEthics-HerschelMori] Richard Herschel and Virginia M. Mori. 2017. “Ethics & Big Data.” Technology in Society 49: 31–36. http://doi.org/10.1016/j.techsoc.2017.03.003

The Science of Data Privacy

• †[BigDataPrivacy] Terence Craig and Mary E. Ludloff. 2011. Privacy and Big Data. O’Reilly Media. http://pensu.eblib.com/patron/FullRecord.aspx?p=781814

• ‡[DataAnalysisPrivacy] John Abowd, Lorenzo Alvisi, Cynthia Dwork, Sampath Kannan, Ashwin Machanavajjhala, Jerome Reiter. 2017. “Privacy-Preserving Data Analysis for the Federal Statistical Agencies.” A Computing Community Consortium white paper. https://arxiv.org/abs/1701.00752

• †[DataPrivacy] Stephen E. Fienberg and Aleksandra B. Slavkovic. 2011. “Data Privacy and Confidentiality.” International Encyclopedia of Statistical Science, 342–5. http://doi.org/978-3-642-04898-2_202

• ‡[NetworksPrivacy] Vishesh Karwa and Aleksandra Slavkovic. 2016. “Inference using noisy degrees: Differentially private β-model and synthetic graphs.” Annals of Statistics 44(1): 87–112. http://projecteuclid.org/euclid.aos/1449755958

• †[DataPublishingPrivacy] Raymond Chi-Wing Wong and Ada Wai-Chee Fu. 2010. Privacy-Preserving Data Publishing: An Overview. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdfplus/10.2200/S00237ED1V01Y201003DTM002

• DataMatching, Ch. 8, “Privacy Aspects of Data Matching”

Transparency, Reproducibility, and Team Science

• ‡[10RulesforData] Alyssa Goodman, Alberto Pepe, Alexander W. Blocker, Christine L. Borgman, Kyle Cranmer, Merce Crosas, Rosanne Di Stefano, Yolanda Gil, Paul Groth, Margaret Hedstrom, David W. Hogg, Vinay Kashyap, Ashish Mahabal, Aneta Siemiginowska, and Aleksandra Slavkovic. (2014). “Ten Simple Rules for the Care and Feeding of Scientific Data.” PLoS Computational Biology 10(4): e1003542. https://doi.org/10.1371/journal.pcbi.1003542 (Note esp. curated resources for reproducible research.)

• †[Transparency] E. Miguel, C. Camerer, K. Casey, J. Cohen, K. M. Esterling, A. Gerber, R. Glennerster, D. P. Green, M. Humphreys, G. Imbens, D. Laitin, T. Madon, L. Nelson, B. A. Nosek, M. Petersen, R. Sedlmayr, J. P. Simmons, U. Simonsohn, M. Van der Laan. 2014. “Promoting Transparency in Social Science Research.” Science 343(6166): 30–1. https://doi.org/10.1126/science.1245317

• †[Reproducibility] Marcus R. Munafo, Brian A. Nosek, Dorothy V. M. Bishop, Katherine S. Button, Christopher D. Chambers, Nathalie Percie du Sert, Uri Simonsohn, Eric-Jan Wagenmakers, Jennifer J. Ware, and John P. A. Ioannidis. 2017. “A Manifesto for Reproducible Science.” Nature Human Behaviour 0021 (2017). https://doi.org/10.1038/s41562-016-0021

• ‡[SoftwareCarpentry] esp. “Lessons”: http://software-carpentry.org/lessons

• DSHandbook-Py, Ch. 9, “Technical Communication and Documentation”; Ch. 15, “Software Engineering Best Practices”

• †[TeamScienceToolkit] Vogel AL, Hall KL, Fiore SM, Klein JT, Bennett LM, Gadlin H, Stokols D, Nebeling LC, Wuchty S, Patrick K, Spotts EL, Pohl C, Riley WT, Falk-Krzesinski HJ. 2013. “The Team Science Toolkit: enhancing research collaboration through online knowledge sharing.” American Journal of Preventive Medicine 45: 787–9. http://10.1016/j.amepre.2013.09.001

• §[Databrary] Kara Hall, Robert Croyle, and Amanda Vogel. Forthcoming (2017). Advancing Social and Behavioral Health Research through Cross-disciplinary Team Science. Springer. Includes Rick O. Gilmore and Karen E. Adolph, “Open Sharing of Research Video: Breaking the Boundaries of the Research Team.” (See http://databrary.org)

• CRAN: https://cran.r-project.org/web/views/ReproducibleResearch.html

Social Bias, Fair Algorithms

• MachineBias

• ‡[EmbeddingsBias] Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. “Semantics Derived Automatically from Language Corpora Contain Human Biases.” Science. https://arxiv.org/abs/1608.07187

• ‡[Debiasing] Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, Adam Kalai. 2016. “Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings.” https://arxiv.org/abs/1607.06520

• ‡[AvoidingBias] Moritz Hardt, Eric Price, Nathan Srebro. 2016. “Equality of Opportunity in Supervised Learning.” https://arxiv.org/abs/1610.02413

• ‡[InevitableBias] Jon Kleinberg, Sendhil Mullainathan, Manish Raghavan. 2016. “Inherent Trade-Offs in the Fair Determination of Risk Scores.” https://arxiv.org/abs/1609.05807

Databases and Data Management

• NRCreport, Ch. 3, “Scaling the Infrastructure for Data Management”

• †[SQL] Jan L. Harrington. 2010. SQL Clearly Explained. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780123756978

• DSHandbook-Py, Ch. 14, “Databases”

• †[noSQL] Guy Harrison. 2015. Next Generation Databases: NoSQL, NewSQL, and Big Data. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-1329-2

• †[Cloud] Divyakant Agrawal, Sudipto Das, and Amr El Abbadi. 2012. Data Management in the Cloud: Challenges and Opportunities. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdfplus/10.2200/S00456ED1V01Y201211DTM032

• †Morgan & Claypool Synthesis Lectures on Data Management. http://www.morganclaypool.com/toc/dtm/1/1

Data Wrangling

Theoretically-structured approaches to data wrangling

• ‡[TidyData-R] Garrett Grolemund and Hadley Wickham. 2017. R for Data Science (http://r4ds.had.co.nz), esp. Ch. 12, “Tidy Data”; also Wickham. 2014. “Tidy Data.” Journal of Statistical Software 59(10). http://www.jstatsoft.org/v59/i10/paper. Tools: The tidyverse, http://tidyverse.org

• ‡[DataScience-Py] Jake VanderPlas. 2016. Python Data Science Handbook (https://github.com/jakevdp/PythonDSHandbook-Py), esp. Ch. 3 on pandas: http://pandas.pydata.org

• ‡[DataCarpentry] Colin Gillespie and Robin Lovelace. 2017. Efficient R Programming. https://csgillespie.github.io/efficientR, esp. Ch. 6 on “efficient data carpentry”

• §[Wrangler] Joseph M. Hellerstein, Jeffrey Heer, Tye Rattenbury, and Sean Kandel. 2017. Data Wrangling: Practical Techniques for Data Preparation. Tool: Trifacta Wrangler, http://www.trifacta.com/products/wrangler

Data wrangling practice

• DSHandbook-Py, Ch. 4, “Data Munging: String Manipulation, Regular Expressions, and Data Cleaning”

• †[DataSimplification] Jules J. Berman. 2016. Data Simplification: Taming Information with Open Source Tools. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780128037812

• †[Wrangling-Py] Jacqueline Kazil, Katharine Jarmul. 2016. Data Wrangling with Python. O’Reilly. http://proquestcombo.safaribooksonline.com.ezaccess.libraries.psu.edu/9781491948804

• †[Wrangling-R] Bradley C. Boehmke. 2016. Data Wrangling with R. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45599-0

Record Linkage, Entity Resolution, Deduplication

• †[DataMatching] Peter Christen. 2012. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-642-31164-2 (esp. Ch. 2, “The Data Matching Process”)

• DataSimplification, Chapter 5, “Identifying and Deidentifying Data”

• ‡[EntityResolution] Lise Getoor and Ashwin Machanavajjhala. 2013. “Entity Resolution for Big Data.” Tutorial, KDD. http://www.umiacs.umd.edu/~getoor/Tutorials/ER_KDD2013.pdf

• ‡[SyrianCasualties] Peter Sadosky, Anshumali Shrivastava, Megan Price, and Rebecca C. Steorts. 2015. “Blocking Methods Applied to Casualty Records from the Syrian Conflict.” https://arxiv.org/abs/1510.07714 (For more on blocking, see DataMatching, Ch. 4, “Indexing.”)

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html, “Statistical Matching and Record Linkage”

“Making up data”: Imputation, Smoothers, Kernels, Priors, Filters, Teleportation, Negative Sampling, Convolution, Augmentation, Adversarial Training

• †[Imputation] Yi Deng, Changgee Chang, Moges Seyoum Ido, and Qi Long. 2016. “Multiple Imputation for General Missing Data Patterns in the Presence of High-dimensional Data.” Scientific Reports 6(21689). https://www.nature.com/articles/srep21689

• ‡[Overimputation] Matthew Blackwell, James Honaker, and Gary King. 2017. “A Unified Approach to Measurement Error and Missing Data: Overview and Applications.” Sociological Methods & Research 46(3): 303–341. http://gking.harvard.edu/files/gking/files/measure.pdf

• Shalizi-ADA, Sect. 1.5 (Linear Smoothers), Ch. 8 (Splines), Sect. 14.4 (Kernel Density Estimates); DataScience-Python NB 05.13, “Kernel Density Estimation”

• FightinWords (re: Bayesian priors as additional data; impact of priors on regularization)

• DeepLearning, Section 7.5 (Data Augmentation), 7.12 (Dropout), 7.13 (Adversarial Examples); Chapter 9 (Convolutional Networks)

• †[Adversarial] Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy. 2015. “Explaining and Harnessing Adversarial Examples.” https://arxiv.org/abs/1412.6572

• See also resampling and simulation methods

• See also feature engineering, preprocessing

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html, “Imputation”

(Direct) Data Representations, Data Mappings

• NRCreport, Ch. 5, “Large-Scale Data Representations”

• †[PatternRecognition] M. Narasimha Murty and V. Susheela Devi. 2011. Pattern Recognition: An Algorithmic Approach. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-0-85729-495-1. Section 2.1, “Data Structures for Pattern Representation”

• ‡[InfoRetrieval] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2009. Introduction to Information Retrieval. Cambridge University Press. http://nlp.stanford.edu/IR-book. Ch. 1, 2, 6 (also note slides used in their class)

• †[Algorithms] Brian Steele, John Chandler, Swarna Reddy. 2016. Algorithms for Data Science. Wiley. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45797-0. Ch. 2, “Data Mapping and Data Dictionaries”

• See also Social Data Structures

Similarity, Distance, Association, Covariance, The Kernel Trick

• PatternRecognition, Section 2.3, “Proximity Measures”

• ‡[Similarity] Brendan O’Connor. 2012. “Cosine similarity, Pearson correlation, and OLS coefficients.” https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients

• For relatively comprehensive lists, see also:

M.-J. Lesot, M. Rifqi, and H. Benhadda. 2009. “Similarity measures for binary and numerical data: a survey.” Int. J. Knowledge Engineering and Soft Data Paradigms 1(1): 63-. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.212.6533&rep=rep1&type=pdf

Seung-Seok Choi, Sung-Hyuk Cha, Charles C. Tappert. 2010. “A Survey of Binary Similarity and Distance Measures.” Journal of Systemics, Cybernetics & Informatics 8(1): 43–8. http://www.iiisci.org/Journal/CV$/sci/pdfs/GS315JG.pdf

Sung-Hyuk Cha. 2007. “Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions.” International Journal of Mathematical Models and Methods in Applied Sciences 4(1): 300–7. http://csis.pace.edu/ctappert/dps/d861-12/session4-p2.pdf

Anna Huang. 2008. “Similarity Measures for Text Document Clustering.” Proceedings of the 6th New Zealand Computer Science Research Student Conference, 49–56. http://www.nzcsrsc08.canterbury.ac.nz/site/proceedings/Individual_Papers/pg049_Similarity_Measures_for_Text_Document_Clustering.pdf

• Michael I. Jordan. “The Kernel Trick” (Lecture Notes). https://people.eecs.berkeley.edu/~jordan/courses/281B-spring04/lectures/lec3.pdf

• Eric Kim. 2017. “Everything You Ever Wanted to Know about the Kernel Trick (But Were Afraid to Ask).” http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick_blog_ekim_12_20_2017.pdf

Derived Data Representations - Dimensionality Reduction, Compression, Decomposition, Embeddings

The groupings here and under the measurement / multivariate statistics section are particularly arbitrary. For example, “k-Means clustering” can be viewed as a technique for “dimensionality reduction,” “compression,” “feature extraction,” “latent variable measurement,” “unsupervised learning,” or “collaborative filtering.”
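To make the k-Means point concrete, here is a minimal sketch in plain Python. It is not taken from any of the readings: the toy data, the helper names (`sq_dist`, `nearest`, `kmeans`), and the evenly spaced initialization are all illustrative assumptions. It fits two centroids once and then reads the same fitted result as a set of latent classes, as a compression scheme, and as a set of derived features.

```python
# Toy illustration only: plain Lloyd's algorithm on hand-made 2-D points.

def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(p, centroids):
    """Index of the centroid closest to point p."""
    return min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))

def kmeans(points, k, iters=10):
    # Assumption for reproducibility: start from evenly spaced data points
    # instead of the random initialization normally used in practice.
    centroids = [points[i * len(points) // k] for i in range(k)]
    for _ in range(iters):
        labels = [nearest(p, centroids) for p in points]          # assignment step
        for j in range(k):                                         # update step
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels

points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 5.0), (5.1, 4.9)]
centroids, labels = kmeans(points, k=2)

# Unsupervised learning / latent variable view: one discrete class per point.
print("labels:", labels)

# Compression (vector quantization) view: keep a small codebook plus one
# integer code per point, and reconstruct each point as its centroid.
reconstructed = [centroids[lab] for lab in labels]
error = sum(sq_dist(p, r) for p, r in zip(points, reconstructed)) / len(points)
print("mean squared reconstruction error:", error)

# Feature extraction view: re-express each point by its distances to the
# k centroids (with k smaller than the input dimension this also reduces
# dimensionality; in this 2-D toy case it only re-encodes).
features = [[sq_dist(p, c) ** 0.5 for c in centroids] for p in points]
print("features for first point:", features[0])
```

Nothing about the fitted centroids changes between the three views; only the question asked of them does, which is why the same method can reasonably be filed under several headings.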

Clustering, hashing, quantization, blocking, compression

• ‡[Compression] Khalid Sayood. 2012. Introduction to Data Compression, 4th ed. Springer. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780124157965 (e.g., coding, blocking via vector quantization)

• MMDS, Ch. 3, “Finding Similar Items” (minhashing, locality sensitive hashing)

• Multivariate-R, Ch. 6, “Clustering”

• DataScience-Python NB 05.11, “k-Means Clustering”; NB 05.12, “Gaussian Mixture Models”

• ‡[KMeansHashing] Kaiming He, Fang Wen, Jian Sun. 2013. “K-means Hashing: An Affinity-Preserving Quantization Method for Learning Binary Compact Codes.” CVPR. https://www.cv-foundation.org/openaccess/content_cvpr_2013/papers/He_K-Means_Hashing_An_2013_CVPR_paper.pdf

• †[Squashing] Madigan, D., Raghavan, N., Dumouchel, W., Nason, M., Posse, C., and Ridgeway, G. (2002). “Likelihood-based data squashing: A modeling approach to instance construction.” Data Mining and Knowledge Discovery 6(2): 173–190. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1023/A:1014095614948

• †[Core-sets] Piotr Indyk, Sepideh Mahabadi, Mohammad Mahdian, Vahab S. Mirrokni. 2014. “Composable core-sets for diversity and coverage maximization.” PODS ’14, 100–8. https://doi.org/10.1145/2594538.2594560

Feature selection, feature extraction, feature engineering, weighting, preprocessing

• PatternRecognition, Sections 2.6–7, “Feature Selection, Feature Extraction”

• DSHandbook-Py, Ch. 7, “Interlude: Feature Extraction Ideas”

• DataScience-Python NB 05.04, “Feature Engineering”

• Features in text: NLP and InfoRetrieval re: tfidf and similar; FightinWords

• Features in images: DataScience-Python NB 05.14, “Image Features”

Dimensionality reduction, decomposition, factorization, change of basis, reparameterization, matrix completion, latent variables, source separation

• MMDS, Ch. 9, “Recommendation Systems”; Ch. 11, “Dimensionality Reduction”

• DSHandbook-Py, Ch. 10, “Unsupervised Learning: Clustering and Dimensionality Reduction”

• ‡[Shalizi-ADA] Cosma Rohilla Shalizi. 2017. Advanced Data Analysis from an Elementary Point of View. Ch. 16 (“Principal Components Analysis”), Ch. 17 (“Factor Models”). http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV

See also Multivariate-R, Ch. 3, “Principal Components Analysis”; Ch. 4, “Multidimensional Scaling”; Ch. 5, “Exploratory Factor Analysis”; Latent

• DeepLearning, Ch. 2, “Linear Algebra”; Ch. 13, “Linear Factor Models”

• ‡[GloVe] Jeffrey Pennington, Richard Socher, Christopher Manning. 2014. “GloVe: Global vectors for word representation.” EMNLP. https://nlp.stanford.edu/projects/glove

• †[NMF] Daniel D. Lee and H. Sebastian Seung. 1999. “Learning the parts of objects by non-negative matrix factorization.” Nature 401: 788–791. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1038/44565

• †[CUR] Michael W. Mahoney and Petros Drineas. 2009. “CUR matrix decompositions for improved data analysis.” PNAS. http://www.pnas.org/content/106/3/697.full

• ‡[ICA] Aapo Hyvärinen and Erkki Oja. 2000. “Independent Component Analysis: Algorithms and Applications.” Neural Networks 13(4–5): 411–30. https://www.cs.helsinki.fi/u/ahyvarin/papers/NN00new.pdf

• [RandomProjection] Ella Bingham and Heikki Mannila. 2001. “Random projection in dimensionality reduction: applications to image and text data.” KDD. https://doi.org/10.1145%2F502512.502546

• Compression, e.g., “Transform coding,” “Wavelets”

Nonlinear dimensionality reduction, Manifold learning

• DeepLearning, Section 5.11.3, “Manifold Learning”

• DataScience-Python NB 05.10, “Manifold Learning” (Locally linear embedding [LLE], Isomap)

• ‡[KernelPCA] Sebastian Raschka. 2014. “Kernel tricks and nonlinear dimensionality reduction via RBF kernel PCA.” http://sebastianraschka.com/Articles/2014_kernel_pca.html

• Shalizi-ADA, Ch. 18, “Nonlinear Dimensionality Reduction” (LLE)

• ‡[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi. 2003. “Laplacian eigenmaps for dimensionality reduction and data representation.” Neural Computation 15(6): 1373–1396. http://web.cse.ohio-state.edu/~belkin.8/papers/LEM_NC_03.pdf

• DeepLearning, Ch. 14 (“Autoencoders”)

• ‡[word2vec] Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” https://arxiv.org/abs/1301.3781

• ‡[word2vecExplained] Yoav Goldberg and Omer Levy. 2014. “word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method.” https://arxiv.org/abs/1402.3722

• ‡[t-SNE] Laurens van der Maaten and Geoffrey Hinton. 2008. “Visualising Data using t-SNE.” Journal of Machine Learning Research 9: 2579–2605. http://jmlr.csail.mit.edu/papers/volume9/vandermaaten08a/vandermaaten08a.pdf

Computation and Scaling Up

Scientific computing, computation at scale

• NRCreport, Ch. 10, “The Seven Computational Giants of Massive Data Analysis”

• DSHandbook-Py, Ch. 21, “Performance and Computer Memory”; Ch. 22, “Computer Memory and Data Structures”

• ‡[ComputationalStatistics-Python] Cliburn Chan. Computational Statistics in Python. https://people.duke.edu/~ccc14/sta-663/index.html

• CRAN Task View: High Performance Computing. https://cran.r-project.org/web/views/HighPerformanceComputing.html

Numerical computing

• [NumericalComputing] Ward Cheney and David Kincaid. 2013. Numerical Mathematics and Computing, 7th ed. Brooks/Cole, Cengage Learning.

• CRAN Task View: Numerical Mathematics. https://cran.r-project.org/web/views/NumericalMathematics.html

Optimization (e.g., MLE, gradient descent, stochastic gradient descent, EM algorithm, neural nets)

• DSHandbook-Py, Ch. 23, “Maximum Likelihood Estimation and Optimization” (gradient descent)

• DeepLearning, Ch. 4, “Numerical Computation”; Ch. 6, “Deep Feedforward Networks”; Ch. 8, “Optimization for Training Deep Models”

• CRAN Task View: Optimization and Mathematical Programming. https://cran.r-project.org/web/views/Optimization.html

Linear algebra, matrix computations

• [MatrixComputations] Gene H. Golub and Charles F. Van Loan. 2013. Matrix Computations, 4th ed. Johns Hopkins University Press.

• ‡[NetflixMatrix] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. “Matrix Factorization Techniques for Recommender Systems.” Computer, August: 42–9. https://datajobs.com/data-science-repo/Recommender-Systems-[Netflix].pdf

• ‡[BigDataPCA] Jianqing Fan, Qiang Sun, Wen-Xin Zhou, Ziwei Zhu. “Principal Component Analysis for Big Data.” http://www.princeton.edu/~ziweiz/pca.pdf

• ‡[RandomSVD] Andrew Tulloch. 2009. “Fast Randomized SVD.” https://research.fb.com/fast-randomized-svd

• ‡[FactorSGD] Rainer Gemulla, Peter J. Haas, Erik Nijkamp, and Yannis Sismanis. 2011. “Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent.” KDD. http://www.cs.utah.edu/~hari/teaching/bigdata/gemulla11dsgd.pdf

• ‡[Sparse] Max Grossman. 2015. “101 Ways to Store a Sparse Matrix.” https://medium.com/@jmaxg3/101-ways-to-store-a-sparse-matrix-c7f2bf15a229

• ‡[DontInvert] John D. Cook. 2010. “Don’t Invert that Matrix.” https://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix

Simulation-based inference, resampling, Monte Carlo methods, MCMC, Bayes, approximate inference

• ‡[MarkovVisually] Victor Powell. “Markov Chains Explained Visually.” http://setosa.io/ev/markov-chains

• DSHandbook-Py, Ch. 25, “Stochastic Modeling” (Markov chains, MCMC, HMM)

• Bayes; ComputationalStatistics-Py

• DeepLearning, Ch. 17, “Monte Carlo Methods”; Ch. 19, “Approximate Inference”

• ‡[VariationalInference] Jason Eisner. 2011. “High-Level Explanation of Variational Inference.” https://www.cs.jhu.edu/~jason/tutorials/variational.html

• CRAN Task View: Bayesian Inference. https://cran.r-project.org/web/views/Bayesian.html

Parallelism, MapReduce, Split-Apply-Combine

• ‡[MapReduceIntuition] Jigsaw Academy. 2014. “Big Data Specialist: MapReduce.” https://www.youtube.com/watch?v=TwcYQzFqg-8&feature=youtu.be

• MMDS, Ch. 2, “Map-Reduce and the New Software Stack” (3rd Edition discusses Spark & TensorFlow)

• Algorithms, Ch. 3, “Scalable Algorithms and Associative Statistics”; Ch. 4, “Hadoop and MapReduce”

• NRCreport, Ch. 6, “Resources, Trade-offs, and Limitations”

• TidyData-R (Split-apply-combine is the motivating principle behind the “tidyverse” approach). See also Part III, “Program” (pipes, functions, vectors, iteration)

• ‡[TidySAC-Video] Hadley Wickham. 2017. “Data Science in the Tidyverse.” https://www.rstudio.com/resources/videos/data-science-in-the-tidyverse

• See also FactorSGD
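As a rough illustration of the split-apply-combine idea named in the TidyData-R entry above, the following standard-library Python sketch groups records by a key, summarizes each group independently, and collects the summaries. The function name `split_apply_combine`, the `mean_score` helper, and the toy records are hypothetical and are not drawn from the readings or from the tidyverse itself.

```python
from collections import defaultdict

def split_apply_combine(rows, key, summarize):
    """Group rows by key(row), summarize each group, return {group: summary}."""
    groups = defaultdict(list)
    for row in rows:                     # split: route each row to its group
        groups[key(row)].append(row)
    # apply + combine: summarize each group independently, collect the results
    return {g: summarize(members) for g, members in groups.items()}

def mean_score(members):
    """Hypothetical per-group summary: the mean of the 'score' field."""
    return sum(r["score"] for r in members) / len(members)

# Toy records (illustrative only).
rows = [
    {"group": "a", "score": 1.0},
    {"group": "b", "score": 10.0},
    {"group": "a", "score": 3.0},
]

result = split_apply_combine(rows, key=lambda r: r["group"], summarize=mean_score)
print(result)
```

Because each group is summarized without reference to any other group, the apply step can in principle run in parallel; MapReduce has the same overall shape, with map emitting keyed records, a shuffle grouping them by key, and reduce summarizing each group.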

Functional Programming

• ComputationalStatistics-Python, “Functions are first class objects” through first exercises

• DSHandbook-Py, Ch. 20, “Programming Language Concepts”

• See also Haskell

Scaling, iteration, streaming data, online algorithms (Spark)

• NRCreport, Ch. 4, “Temporal Data and Real-Time Algorithms”

• MMDS, Ch. 4, “Mining Data Streams”

• ‡[BDAS] AMPLab. BDAS: The Berkeley Data Analytics Stack. https://amplab.cs.berkeley.edu/software

• †[Spark] Mohammed Guller. 2015. Big Data Analytics with Spark: A Practitioner’s Guide to Using Spark for Large-Scale Data Processing, Machine Learning, and Graph Analytics, and High-Velocity Data Stream Processing. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-0964-6

• DSHandbook-Py, Ch. 13, “Big Data”

• DeepLearning, Ch. 10, “Sequence Modeling: Recurrent and Recursive Nets”

General Resources for Python and R

• †[DSHandbook-Py] Field Cady. 2017. The Data Science Handbook (Python-based). Wiley. http://onlinelibrary.wiley.com.ezaccess.libraries.psu.edu/book/10.1002/9781119092919

• ‡[Tutorials-R] Ujjwal Karn. 2017. “A curated list of R tutorials for Data Science, NLP and Machine Learning.” https://github.com/ujjwalkarn/DataScienceR

• ‡[Tutorials-Python] Ujjwal Karn. 2017. “A curated list of Python tutorials for Data Science, NLP and Machine Learning.” https://github.com/ujjwalkarn/DataSciencePython

Cutting and Bleeding Edge of Data Science Languages

• [Scala] †Vishal Layka and David Pollak. 2015. Beginning Scala. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-0232-6; †Nicolas Patrick. 2014. Scala for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1901910; ‡Scala site: https://www.scala-lang.org (See also )

• [Julia] †Ivo Balbaert. 2015. Getting Started with Julia. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1973847; ‡Douglas Bates. 2013. “Julia for R Programmers.” http://www.stat.wisc.edu/~bates/JuliaForRProgrammers.pdf; ‡Julia site: https://julialang.org

• [Haskell] †Richard Bird. 2014. Thinking Functionally with Haskell. https://doi-org.ezaccess.libraries.psu.edu/10.1017/CBO9781316092415; †Hakim Cassimally. 2017. Learning Haskell Programming. https://www.lynda.com/Haskell-tutorials/Learning-Haskell-Programming/604926-2.html; †James Church. 2017. Learning Haskell for Data Analysis. https://www.lynda.com/Developer-tutorials/Learning-Haskell-Data-Analysis/604234-2.html; Haskell site: https://www.haskell.org

• [Clojure] †Mark McDonnell. 2017. Quick Clojure: Essential Functional Programming. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2952-1; †Akhil Wali. 2014. Clojure for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1674848; †Arthur Ulfeldt. 2015. https://www.lynda.com/Clojure-tutorials/Up-Running-Clojure/413127-2.html; ‡Clojure site: https://clojure.org

• [TensorFlow] ‡Abadi et al. (Google). 2016. “TensorFlow: A System for Large-Scale Machine Learning.” https://arxiv.org/abs/1605.08695; ‡Udacity, “Deep Learning”: https://www.udacity.com/course/deep-learning--ud730; ‡TensorFlow site: https://www.tensorflow.org

• [H2O] Darren Cook. 2016. Practical Machine Learning with H2O: Powerful, Scalable Techniques for Deep Learning and AI. O’Reilly; Arno Candel and Viraj Parmar. 2015. Deep Learning with H2O; ‡H2O site: https://www.h2o.ai

Social Data Structures

Space and Time

• †[GIA] David O’Sullivan, David J. Unwin. 2010. Geographic Information Analysis, Second Edition. John Wiley & Sons. http://onlinelibrary.wiley.com/book/10.1002/9780470549094

• [Space-Time] Donna J. Peuquet. 2003. Representations of Space and Time. Guilford.

• Roger S. Bivand, Edzer Pebesma, Virgilio Gómez-Rubio. 2013. Applied Spatial Data Analysis in R. Free through library: http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-7618-4. See also Edzer Pebesma. 2016. “Handling and Analyzing Spatial, Spatiotemporal, and Movement Data in R.” https://edzer.github.io/UseR2016

• ‡[PySAL] Sergio J. Rey and Dani Arribas-Bel. 2016. “Geographic Data Science with PySAL and the pydata Stack.” http://darribas.org/gds_scipy16

• DataScience-Python NB 04.13, “Geographic Data with Basemap”

• Time: “Longitudinal” intra-individual data, as common in developmental psychology. †[Longitudinal] Garrett Fitzmaurice, Marie Davidian, Geert Verbeke, and Geert Molenberghs, editors. 2009. Longitudinal Data Analysis. Chapman & Hall/CRC. http://pensu.eblib.com/patron/FullRecord.aspx?p=359998

• Time: “Time Series, Time Series Cross Section, Panel,” as common in economics, political science, and sociology. DataScience-Python Ch. 3 re: pandas; DSHandbook-Py, Ch. 17, “Time Series Analysis”; Econometrics; GelmanHill

• Time: “Sequential, Data Streams,” as in NLP, machine learning. Spark and other “streaming data” readings

• Space-Time: Data with continuity, connectivity, neighborhood structure. See DeepLearning, Chapter 9 re: convolution

• CRAN Task Views: Spatial: https://cran.r-project.org/web/views/Spatial.html; Spatiotemporal: https://cran.r-project.org/web/views/SpatioTemporal.html; Time Series: https://cran.r-project.org/web/views/TimeSeries.html

Network Graphs

• ‡[Networks] David Easley and Jon Kleinberg. 2010. Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press. http://www.cs.cornell.edu/home/kleinber/networks-book

• MMDS, Ch. 5, “Link Analysis”; Ch. 10, “Mining Social-Network Graphs”

• †[Networks-R] Douglas Luke. 2015. A User’s Guide to Network Analysis in R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-3-319-23883-8

• †[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry. 2017. Python for Graph and Network Analysis. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-53004-8

Hierarchy, Clustered Data, Aggregation, Mixed Models

• †[GelmanHill] Andrew Gelman and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press. http://pensu.eblib.com/patron/FullRecord.aspx?p=288457

• †[MixedModels] Eugene Demidenko. 2013. Mixed Models: Theory and Applications with R, 2nd ed. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10748641

Social Data Channels

Language, Text, Speech, Audio

• ‡[NLP] Daniel Jurafsky and James H. Martin. Forthcoming (2017). Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics, 3rd ed. Prentice-Hall. Preprint: https://web.stanford.edu/~jurafsky/slp3

• InfoRetrieval

• ‡[CoreNLP] Stanford CoreNLP. https://stanfordnlp.github.io/CoreNLP

• †[TextAsData] Grimmer and Stewart. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis. https://doi.org/10.1093/pan/mps028

• Quinn-Topics

• FightinWords

• †[NLTK-Python] Jacob Perkins. 2010. Python Text Processing with NLTK 2.0 Cookbook. Packt. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10435387; †[TextAnalytics-Python] Dipanjan Sarkar. 2016. Text Analytics with Python. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2388-8

• DSHandbook-Py, Ch. 16, “Natural Language Processing”

• Matthew J. Denny. http://www.mjdenny.com/Text_Processing_In_R.html

• CRAN: Natural Language Processing. https://cran.r-project.org/web/views/NaturalLanguageProcessing.html

• †[AudioData] Dean Knox and Christopher Lucas. 2017. “The Speaker-Affect Model: Measuring Emotion in Political Speech with Audio Data.” http://christopherlucas.org/files/PDFs/sam.pdf; https://www.youtube.com/watch?v=Hs8A9dwkMzI

• †Morgan & Claypool Synthesis Lectures on Human Language Technologies. http://www.morganclaypool.com/toc/hlt/1/1

• †Morgan & Claypool Synthesis Lectures on Speech and Audio Processing. http://www.morganclaypool.com/toc/sap/1/1

Vision, Image, Video

• ‡[ComputerVision] Richard Szeliski. 2010. Computer Vision: Algorithms and Applications. Springer. http://szeliski.org/Book

• MixedModels, Ch. 11, “Statistical Analysis of Shape”; Ch. 12, “Statistical Image Analysis”

• Audio/Video/Volumetric: See DeepLearning, Chapter 9 re: convolution

• †Morgan & Claypool Synthesis Lectures on Image, Video, and Multimedia Processing. http://www.morganclaypool.com/toc/ivm/1/1

• †Morgan & Claypool Synthesis Lectures on Computer Vision. http://www.morganclaypool.com/toc/cov/1/1

Approaches to Learning from Data (The Analytics Layer)

• NRCreport, Ch. 7, “Building Models from Massive Data”

• MMDS (data-mining)

• CausalInference

• ‡[VisualAnalytics] Daniel Keim, Jörn Kohlhammer, Geoffrey Ellis, and Florian Mansmann, editors. 2010. Mastering the Information Age: Solving Problems with Visual Analytics. The Eurographics Association, Goslar, Germany. http://www.vismaster.eu/book. See also †Morgan & Claypool Synthesis Lectures on Visualization: http://www.morganclaypool.com/toc/vis/2/1

• ‡[Econometrics] Bruce E. Hansen. 2014. Econometrics. http://www.ssc.wisc.edu/~bhansen/econometrics

• ‡[StatisticalLearning] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning with Applications in R. Springer. http://www-bcf.usc.edu/~gareth/ISL/index.html

• [MachineLearning] Christopher Bishop. 2006. Pattern Recognition and Machine Learning. Springer.

• †[Bayes] Peter D. Hoff. 2009. A First Course in Bayesian Statistical Methods. Springer. http://link.springer.com/book/10.1007%2F978-0-387-92407-6

• †[GraphicalModels] Søren Højsgaard, David Edwards, and Steffen Lauritzen. 2012. Graphical Models with R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-2299-0

• [InformationTheory] David MacKay. 2003. Information Theory, Inference, and Learning Algorithms. Cambridge University Press.

• ‡[DeepLearning] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org (see also Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. “Deep Learning.” Nature 521(7553): 436–444. https://doi.org/10.1038/nature14539)

See also ‡[Keras] Keras: The Python Deep Learning Library. https://keras.io; François Chollet, Directory of Keras tutorials: https://github.com/fchollet/keras-resources

• †[SignalProcessing] Jose Maria Giron-Sierra. 2013. Digital Signal Processing with MATLAB Examples (Volumes 1-3). Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-981-10-2534-1

Penn State Policy Statements

Academic Integrity

Academic integrity is the pursuit of scholarly activity in an open, honest, and responsible manner. Academic integrity is a basic guiding principle for all academic activity at The Pennsylvania State University, and all members of the University community are expected to act in accordance with this principle. Consistent with this expectation, the University’s Code of Conduct states that all students should act with personal integrity; respect other students’ dignity, rights, and property; and help create and maintain an environment in which all can succeed through the fruits of their efforts.

Academic integrity includes a commitment by all members of the University community not to engage in or tolerate acts of falsification, misrepresentation, or deception. Such acts of dishonesty violate the fundamental ethical principles of the University community and compromise the worth of work completed by others.

Disability Accommodation

Penn State welcomes students with disabilities into the University’s educational programs. Every Penn State campus has an office for students with disabilities. The Student Disability Resources (SDR) website provides contact information for every Penn State campus (http://equity.psu.edu/sdr/disability-coordinator). For further information, please visit the Student Disability Resources website (http://equity.psu.edu/sdr).

In order to receive consideration for reasonable accommodations, you must contact the appropriate disability services office at the campus where you are officially enrolled, participate in an intake interview, and provide documentation. See the documentation guidelines at http://equity.psu.edu/sdr/guidelines. If the documentation supports your request for reasonable accommodations, your campus disability services office will provide you with an accommodation letter. Please share this letter with your instructors and discuss the accommodations with them as early as possible. You must follow this process for every semester that you request accommodations.

Psychological Services

Many students at Penn State face personal challenges or have psychological needs that may interfere with their academic progress, social development, or emotional wellbeing. The university offers a variety of confidential services to help you through difficult times, including individual and group counseling, crisis intervention, consultations, online chats, and mental health screenings. These services are provided by staff who welcome all students and embrace a philosophy respectful of clients’ cultural and religious backgrounds, and sensitive to differences in race, ability, gender identity, and sexual orientation.

• Counseling and Psychological Services at University Park (CAPS) (http://studentaffairs.psu.edu/counseling/): 814-863-0395

• Penn State Crisis Line (24 hours/7 days/week): 877-229-6400

• Crisis Text Line (24 hours/7 days/week): Text LIONS to 741741

Educational Equity

Penn State takes great pride in fostering a diverse and inclusive environment for students, faculty, and staff. Consistent with University Policy AD29, students who believe they have experienced or observed a hate crime, an act of intolerance, discrimination, or harassment that occurs at Penn State are urged to report these incidents as outlined on the University’s Report Bias webpage (http://equity.psu.edu/reportbias/).

Background Knowledge

Students will have a variety of backgrounds. Prior to beginning interdisciplinary coursework to fulfill Social Data Analytics degree requirements, including SoDA 501 and 502, students are expected to have advanced (graduate) training in at least one of the component areas of Social Data Analytics, and a familiarity with basic concepts in the others.

With regard to specialization, students are expected to have advanced (graduate) training in ONE of the following:

• quantitative social science methodology and a discipline of social science (as would be the case for a second-year PhD student in Political Science, Sociology, Criminology, Human Development and Family Studies, or Demography), OR

• statistics (as would be the case for a second-year PhD student in Statistics), OR

• information science or informatics (as would be the case for a second-year PhD student in Information Science and Technology, or a second-year PhD student in Geography specializing in GIScience), OR

• computer science (as would be the case for a second-year PhD student in Computer Science and Engineering).

This requirement is met as a matter of meeting home program requirements for students in the dual-title PhD, but may require additional coursework on the part of students in other programs wishing to pursue the graduate minor.

With regard to general preparation, students are expected to have ALL of the following technical knowledge:

• basic programming skills, including basic facility with R and/or Python, AND

• basic knowledge of relational databases and/or geographic information systems, AND

• basic knowledge of probability, applied statistics, and/or social science research design, AND

• basic familiarity with a substantive or theoretical area of social science (e.g., 300-level coursework in political science, sociology, criminology, human development, psychology, economics, communication, anthropology, human geography, social informatics, or similar fields).

It is not unusual for students to have one or more gaps in this preparation at time of application to the SoDA program. Students should work with Social Data Analytics advisers to develop a plan for timely remediation of any deficiencies, which generally will not require formal coursework for students whose training and interests are otherwise appropriate for pursuit of the Social Data Analytics degree. Where possible, this will be addressed at time of application to the Social Data Analytics program.

To this end, some free training materials are linked in the reference section, and there are also abundant free, high-quality, self-paced, course-style training materials on these and related subjects available through edX (http://www.edx.org), Udacity (http://www.udacity.com), Codecademy (http://codecademy.com), and Lynda (http://lynda.psu.edu).

Past Visiting Speakers in SoDA 501 and 502

SoDA 502 (Fall 2017)

• Chris Zorn (PLSC)

• Diane Felmlee (SOC)

• Alan MacEachren (GEOG)

• Eric Plutzer (PLSC)

• James LeBreton (PSYCH)

• Sesa Slavkovic (STAT)

• Guido Cervone (GEOG)

• Ashton Verdery (SOC)

• Maggie Niu (STAT)

• Reka Albert (PHYS)

SoDA 501 (Spring 2017)

• Tim Brick (HDFS), “Towards real-time monitoring and intervention using wearable technology”

• Aylin Caliskan (Princeton), “A Story of Discrimination and Unfairness: Bias in Word Embeddings”

• Jay Yonamine (Google - IGERT alum), “Data Science in Industry”

• Johnathan Rush (Illinois), “Geospatial Data Science Workshop”

• Rick Gilmore (PSYCH), “Toward a more reproducible and robust science of human behavior”

• Glenn Firebaugh (SOC), “Measuring Inequality and Segregation with US Census Data”

• Charles Twardy (Sotera), “Data Science for Search and Rescue”

• Anna Smith (Ohio State), “A Hierarchical Model for Network Data in a Latent Hyperbolic Space”

• Rebecca Passonneau (CSE), “Omnigraph: Rich Feature Representation for Graph Kernel Learning”

• Alex Klippel (GEOG), “Virtual Reality for Immersive Analytics”

• Murali Haran (STAT), “A Computationally Efficient Projection-based Approach for Spatial Generalized Mixed Models”

SoDA 502 (Fall 2016)

• Clio Andris (GEOG), “Integrating Social Network Data into GISystems”

• Jia Li (STAT), “Clustering under the Wasserstein Metric”

• Rachel Smith (CAS), “Stigma Networks: Perceptions of Sociograms”

• Zita Oravecz (HDFS)

• Bethany Bray (Methodology Center), “Latent Class and Latent Transition Analysis”

• Dave Hunter (STAT), “Model Based Clustering of Large Networks”

• Scott Bennett (PLSC), “ABM Model of Insurgency”

• David Reitter (IST)

• Suzanna Linn (PLSC), “Methodological Issues in Automated Text Analysis: Application to News Coverage of the US Economy”

SoDA 501 (Spring 2016)

• Bruce Desmarais (PLSC), “Learning in the Sunshine: Analysis of Local Government Email Corpora”

• Timothy Brick (HDFS), “Mapping and Manipulating Facial Expression”

• Qunying Huang (USC), “Social Media: An Emerging Data Source for Human Mobility Studies”

• Lingzhou Xue (STAT), “An Introduction to High-Dimensional Graphical Models”

• Ashton Verdery (SOC), “Sampling from Network Data”

• Sarah Battersby (Tableau), “Helping People See and Understand Spatial Data”

• Alexandra Slavkovic (STAT), “Statistical Privacy with Network Data”

• Lee Giles (IST), “Machine Learning for Scholarly Big Data”

Readings and References (updated Spring 2018)

We will discuss a relatively small subset of the readings listed here, and this will vary based on topics and readings discussed by visiting speakers and you yourselves. The remainder are provided here as curated references for more in-depth investigation in both the theory and practice related to the topic (with the latter heavily weighted toward resources in Python and R).

† Material that is, at last check, made available through the Penn State library. Most journal links should work if you are logged in to a Penn State machine or through the Penn State VPN. Some article archives and most books require additional authentication through webaccess. If links are broken, start directly from a search via https://www.libraries.psu.edu. Some books require installation of e-readers like Adobe Digital Editions. Links to lynda.com can be accessed through http://lynda.psu.edu.

‡ Material that is, at last check, legally provided for free. In some cases, these are the preprint versions of published material.

§ Material for which a legal selection is, or will be, provided through the class Box folder.

Big Data & Social Data Analytics

Overviews

• ‡[CompSocSci] David Lazer, Alex Pentland, Lada Adamic, Sinan Aral, Albert-László Barabási, Devon Brewer, Nicholas Christakis, Noshir Contractor, James Fowler, Myron Gutmann, Tony Jebara, Gary King, Michael Macy, Deb Roy, and Marshall Van Alstyne. 2009. “Computational Social Science.” Science 323(5915): 721-3 + Supp. Feb 6. http://science.sciencemag.org/content/323/5915/721.full; https://gking.harvard.edu/files/gking/files/LazPenAda09.pdf

• ‡[NRCreport] National Research Council. 2013. Frontiers in Massive Data Analysis. National Academies Press. (Free w/ registration: http://www.nap.edu/catalog.php?record_id=18374) Ch. 1 “Introduction”; Ch. 2 “Massive Data in Science, Technology, Commerce, National Defense, Telecommunications, and other Endeavors”

• ‡[MMDS] Jure Leskovec, Anand Rajaraman, and Jeff Ullman. 2014. Mining of Massive Datasets. Cambridge University Press. http://www.mmds.org (BETA version of Third Edition: http://i.stanford.edu/~ullman/mmdsn.html)

• §[BitByBit] Matthew J. Salganik. 2018 (Forthcoming). Bit by Bit: Social Research in the Digital Age. Princeton University Press. Ch. 1 “Introduction”; Ch. 2 “Observing Behavior”

• DSHandbook-Py. Ch. 1 “Introduction: Becoming a Unicorn”; Ch. 2 “The Data Science Road Map”

Burt-schtick

• ‡[Monroe-5Vs] Burt L. Monroe. 2013. “The Five Vs of Big Data Political Science: Introduction to the Special Issue on Big Data in Political Science.” Political Analysis 21(V5): 1–9. https://doi.org/10.1017/S1047198700014315 (Volume, Velocity, Variety, Vinculation, Validity)

• †[Monroe-No] Burt L. Monroe, Jennifer Pan, Margaret E. Roberts, Maya Sen, and Betsy Sinclair. 2015. “No! Formal Theory, Causal Inference, and Big Data Are Not Contradictory Trends in Political Science.” PS: Political Science & Politics 48(1): 71–4. http://dx.doi.org/10.1017/S1049096514001760

• †[Quinn-Topics] Kevin M. Quinn, Burt L. Monroe, Michael Colaresi, Michael H. Crespin, and Dragomir R. Radev. 2010. “How to Analyze Political Attention with Minimal Assumptions and Costs.” American Journal of Political Science 54(1): 209–28. http://onlinelibrary.wiley.com/doi/10.1111/j.1540-5907.2009.00427.x/full (esp. topic modeling as measurement, approach to validation)

• †[FightinWords] Burt L. Monroe, Michael Colaresi, and Kevin M. Quinn. 2008. “Fightin’ Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict.” Political Analysis 16(4): 372-403. https://doi.org/10.1093/pan/mpn018 (esp. the impact of sampling variance and regularization through priors)

• ‡[BDSS-Census] Big Data Social Science PSU Team. 2012. “A Closer Look at the Kaggle Census Data.” https://burtmonroe.github.io/BDSSKaggleCensus2012 (esp. the relevance of the social processes by which data come to exist as data)

Multidisciplinary Perspectives

• †[Business-BigData] Andrew McAfee and Erik Brynjolfsson. 2012. “Big Data: The Management Revolution.” Harvard Business Review 90(10): 61–8, October. https://hbr.org/2012/10/big-data-the-management-revolution; Thomas H. Davenport and D.J. Patil. 2012. “Data Scientist: The Sexiest Job of the 21st Century.” Harvard Business Review 90(10): 70-6, October. https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century

• †[InfoSci-BigData] C.L. Philip Chen and Chun-Yang Zhang. 2014. “Data-intensive Applications, Challenges, Techniques and Technologies: A Survey on Big Data.” Information Sciences 275: 314-47. https://doi.org/10.1016/j.ins.2014.01.015

• †[Informatics-BigData] Vasant G. Honavar. 2014. “The Promise and Potential of Big Data: A Case for Discovery Informatics.” Review of Policy Research 31(4): 326-330. https://doi.org/10.1111/ropr.12080

• ‡[Stats-BigData] Beate Franke, Jean-François Plante, Ribana Roscher, Annie Lee, Cathal Smyth, Armin Hatefi, Fuqi Chen, Einat Gil, Alexander Schwing, Alessandro Selvitella, Michael M. Hoffman, Roger Grosse, Dietrich Hendricks, and Nancy Reid. 2016. “Statistical Inference, Learning and Models in Big Data.” International Statistical Review 84(3): 371-89. http://onlinelibrary.wiley.com/doi/10.1111/insr.12176/full

• †[Econ-BigData] Hal R. Varian. 2014. “Big Data: New Tricks for Econometrics.” Journal of Economic Perspectives 28(2): 3-28. https://doi.org/10.1257/jep.28.2.3

• †[GeoViz-BigData] Alan MacEachren. 2017. “Leveraging Big (Geo) Data with (Geo) Visual Analytics: Place as the Next Frontier.” In Chenghu Zhou, Fenzhen Su, Francis Harvey, and Jun Xu, eds., Spatial Data Handling in the Big Data Era, pp. 139-155. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/chapter/10.1007/978-981-10-4424-3_10

• †[Soc-BigData] David Lazer and Jason Radford. 2017. “Data ex Machina: Introduction to Big Data.” Annual Review of Sociology 43: 19-39. https://doi.org/10.1146/annurev-soc-060116-053457

• §[Politics-BigData] Keith T. Poole, L. Jason Anastasopoulos, and James E. Monogan III. Forthcoming. “The ‘Big Data’ Revolution in Political Campaigning and Governance.” Oxford Bibliographies in Political Science.

¡Cuidado! Traps, Biases, Problems, Pains, Perils

• †[GoogleFlu] David Lazer, Ryan Kennedy, Gary King, and Alessandro Vespignani. 2014. “The Parable of Google Flu: Traps in Big Data Analysis.” Science (343), 14 March. http://gking.harvard.edu/files/gking/files/0314policyforumff.pdf

• ‡[MachineBias] ProPublica. “Machine Bias: Investigating Algorithmic Injustice” Series. https://www.propublica.org/series/machine-bias See especially:

Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. 2016. “Machine Bias.” ProPublica, May 23. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing

Julia Angwin, Madeleine Varner, and Ariana Tobin. 2017. “Facebook Enabled Advertisers to Reach Jew Haters.” https://www.propublica.org/article/facebook-enabled-advertisers-to-reach-jew-haters

• ‡[Polling2016] Doug Rivers (Nov. 11, 2016). “First Thoughts on Polling Problems in the 2016 U.S. Elections.” https://today.yougov.com/news/2016/11/11/first-thoughts-polling-problems-2016-us-elections

• †[EventData] Wei Wang, Ryan Kennedy, David Lazer, and Naren Ramakrishnan. 2016. “Growing Pains for Global Monitoring of Societal Events.” Science 353(6307): 1502–1503. https://doi.org/10.1126/science.aaf6758

• ‡[GoogleBooks] Eitan Adam Pechenick, Christopher M. Danforth, and Peter Sheridan Dodds. 2015. “Characterizing the Google Books Corpus: Strong Limits to Inferences of Socio-Cultural and Linguistic Evolution.” PLoS One. https://doi.org/10.1371/journal.pone.0137041

• ‡[OkCupid] Michael Zimmer. 2016. “OkCupid Study Reveals the Perils of Big-Data Science.” Wired, May 14. https://www.wired.com/2016/05/okcupid-study-reveals-perils-big-data-science

• †[CriticalQuestions] danah boyd and Kate Crawford. 2011. “Critical Questions for Big Data: Provocations for a Cultural, Technological, and Scholarly Phenomenon.” Information, Communication & Society 15(5): 662–79. http://dx.doi.org/10.1080/1369118X.2012.678878

• ‡[RacistBot] Daniel Victor. 2016. “Microsoft created a Twitter bot to learn from users. It quickly became a racist jerk.” New York Times. https://www.nytimes.com/2016/03/25/technology/microsoft-created-a-twitter-bot-to-learn-from-users-it-quickly-became-a-racist-jerk.html

• MMDS 1.2.2-1.2.3 on Bonferroni

• Also: BDSS-Census

Research Design and Measurement

Overviews

• BitByBit

• ‡[ResearchMethodsKB] William M. Trochim. 2006. The Research Methods Knowledge Base. http://www.socialresearchmethods.net/kb/

• [7Rules] Glenn Firebaugh. 2008. Seven Rules for Social Research. Princeton University Press.

Measurement, Reliability, and Validity

• ResearchMethodsKB, “Measurement”

• 7Rules. Ch. 3 “Build Reality Checks into Your Research”

• Reliability: see also FightinWords

• Validity: see also Quinn-Topics, Monroe-5Vs

Indirect / Unobtrusive / Nonreactive Measures, Data Exhaust

• †[UnobtrusiveMeasures] Raymond M. Lee. 2015. “Unobtrusive Measures.” Oxford Bibliographies. https://doi.org/10.1093/OBO/9780199846740-0048 (Canonical cite is Eugene J. Webb. 1966. Unobtrusive Measures; or Webb, Donald T. Campbell, Richard D. Schwartz, and Lee Sechrest. 1999. Unobtrusive Measures, rev. ed. Sage.)

• BitByBit. Examples in Chapter 2

Multiple Measures, Latent Variable Measurement

• 7Rules. Ch. 4 “Replicate Where Possible”

• †[LatentVariables] David J. Bartholomew, Martin Knott, and Irini Moustaki. 2011. Latent Variable Models and Factor Analysis: A Unified Approach. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10483308 (esp. Ch. 1 “Basic ideas and examples”)

• †[Multivariate-R] Brian Everitt and Torsten Hothorn. 2011. An Introduction to Applied Multivariate Analysis with R. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4419-9650-3

• Shalizi-ADA. Ch. 17 “Factor Models”

• ‡[NetflixPrize] Edwin Chen. 2011. “Winning the Netflix Prize: A Summary.”

• ‡CRAN: https://cran.r-project.org/web/views/Multivariate.html, https://cran.r-project.org/web/views/Psychometrics.html, https://cran.r-project.org/web/views/Cluster.html

Sampling and Survey Design

• †[Sampling] Steven K. Thompson. 2012. Sampling, 3rd ed. http://sk8es4mc2l.search.serialssolutions.com/?sid=sersol&SS_jc=TC_024492330&title=Wiley%20Desktop%20Editions%20%3A%20Sampling

• ResearchMethodsKB, “Sampling”

• NRCreport. Ch. 8 “Sampling and Massive Data”

• BitByBit. Ch. 3 “Asking Questions”

• ‡[MSE] Daniel Manrique-Vallier, Megan E. Price, and Anita Gohdes. 2013. “Multiple Systems Estimation Techniques for Estimating Casualties in Armed Conflicts.” In Seybolt et al. (eds.), Counting Civilians. Preprint: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.469.939&rep=rep1&type=pdf

• †[NetworkSampling] Ted Mouw and Ashton M. Verdery. 2012. “Network Sampling with Memory: A Proposal for More Efficient Sampling from Social Networks.” Sociological Methodology 42(1): 206–56. https://dx.doi.org/10.1177%2F0081175012461248

• See also FightinWords (re: hidden heteroskedasticity in sample variance)

• ‡[HashDontSample] Mudit Uppal. 2016. “Probabilistic data structures in the Big data world (+ code)” (re: “Hash, don’t sample”). https://medium.com/@muppal/probabilistic-data-structures-in-the-big-data-world-code-b9387cff0c55

• MMDS: Hash Functions (1.3.2), Sampling in streams (4.2)

• ‡CRAN: https://cran.r-project.org/web/views/OfficialStatistics.html (“Complex Survey Design,” “Small Area Estimation”)

Experimental and Observational Designs for Causal Inference

• ‡[CausalInference] Miguel A. Hernán and James M. Robins. Forthcoming (2017). Causal Inference. Chapman & Hall/CRC. Preprint: https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/ Ch. 1 “A definition of causal effect”; Ch. 2 “Randomized experiments”; Ch. 3 “Observational studies” (See Ch. 6 “Graphical representation of causal effects” for integration with Judea Pearl approach.)

• BitByBit. Chapter 4 “Running Experiments”; Ch. 2: Natural experiments in observable data, examples

• ResearchMethodsKB, “Design”

• 7Rules. Ch. 2 “Look for Differences that Make a Difference, and Report Them”; §Ch. 5 “Compare Like with Like”; Ch. 6 “Use Panel Data to Study Individual Change and Repeated Cross-Section Data to Study Social Change”

• Shalizi-ADA. Part IV “Causal Inference”

• Monroe-No

• CRAN: https://cran.r-project.org/web/views/ExperimentalDesign.html

Technologies for Primary and Secondary Data Collection

Mobile Devices, Distributed Sensors, Wearable Sensors, Remote Sensing

• †[RealityMining] Nathan Eagle and Alex (Sandy) Pentland. 2006. “Reality Mining: Sensing Complex Social Systems.” Personal and Ubiquitous Computing 10(4): 255–68. https://doi.org/10.1007/s00779-005-0046-3

• †[QuantifiedSelf] Melanie Swan. 2013. “The Quantified Self: Fundamental Disruption in Big Data Science and Biological Discovery.” Big Data 1(2): 85–99. https://doi.org/10.1089/big.2012.0002

• †[SensorData] Charu C. Aggarwal (Ed.). 2013. Managing and Mining Sensor Data. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-6309-2 Ch. 1 “An Introduction to Sensor Data Analytics”

• †[NightLights] Thushyanthan Baskaran, Brian Min, and Yogesh Uppal. 2015. “Election cycles and electricity provision: Evidence from a quasi-experiment with Indian special elections.” Journal of Public Economics 126: 64-73. https://doi.org/10.1016/j.jpubeco.2015.03.011

Crowdsourcing, Human Computation, Citizen Science, Web Experiments

• †[Crowdsourcing] Daren C. Brabham. 2013. Crowdsourcing. MIT Press. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10692208 “Introduction”; Ch. 1 “Concepts, Theories, and Cases of Crowdsourcing”

• BitByBit. Ch. 5 “Creating Mass Collaboration”

• NRCreport. Ch. 9 “Human Interaction with Data”

• †[MTurk] Krista Casler, Lydia Bickel, and Elizabeth Hackett. 2013. “Separate but Equal? A Comparison of Participants and Data Gathered via Amazon’s MTurk, Social Media, and Face-to-Face Behavioral Testing.” Computers in Human Behavior 29(6): 2156–60. http://doi.org/10.1016/j.chb.2013.05.009

• †[LabintheWild] Katharina Reinecke and Krzysztof Z. Gajos. 2015. “LabintheWild: Conducting Large-Scale Online Experiments With Uncompensated Samples.” In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing (CSCW ’15), 1364–1378. http://dx.doi.org/10.1145/2675133.2675246

• †[TweetmentEffects] Kevin Munger. 2017. “Tweetment Effects on the Tweeted: Experimentally Reducing Racist Harassment.” Political Behavior 39(3): 629–49. https://doi.org/10.1007/s11109-016-9373-5

• †[HumanComputation] Edith Law and Luis von Ahn. 2011. Human Computation. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdf/10.2200/S00371ED1V01Y201107AIM013

Open Data, File Formats, APIs, Semantic Web, Linked Data

• DSHandbook-Py. Ch. 12 “Data Encodings and File Formats”

• ‡[OpenData] Open Data Institute. “What Is Open Data?” https://theodi.org/what-is-open-data

• ‡[OpenDataHandbook] Open Knowledge International. The Open Data Handbook. http://opendatahandbook.org Includes appendix “File Formats”: http://opendatahandbook.org/guide/en/appendices/file-formats

• ‡[APIs] Brian Cooksey. 2016. An Introduction to APIs. https://zapier.com/learn/apis

• ‡[APIMarkets] RapidAPI / mashape API marketplaces: https://docs.rapidapi.com, https://market.mashape.com; ProgrammableWeb: https://www.programmableweb.com

• †[LinkedData] Tom Heath and Christian Bizer. 2011. Linked Data: Evolving the Web into a Global Data Space. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/abs/10.2200/S00334ED1V01Y201102WBE001 Ch. 1 “Introduction”; Ch. 2 “Principles of Linked Data”

• †[SemanticWeb] Nikolaos Konstantinou and Dimitrios-Emmanuel Spanos. 2015. Materializing the Web of Linked Data. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-16074-0 Ch. 1 “Introduction: Linked Data and the Semantic Web”

• †Morgan & Claypool Synthesis Lectures on the Semantic Web: Theory and Technology. http://www.morganclaypool.com/toc/wbe/1/1

Web Scraping

• †[TheoryDrivenScraping] Richard N. Landers, Robert C. Brusso, Katelyn J. Cavanaugh, and Andrew B. Collmus. 2016. “A Primer on Theory-Driven Web Scraping: Automatic Extraction of Big Data from the Internet for Use in Psychological Research.” Psychological Methods 21(4): 475–492. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1037/met0000081

• ‡[Scraping-Py] Al Sweigart. 2015. Automate the Boring Stuff with Python: Practical Programming for Total Beginners. Ch. 11 “Web Scraping.” https://automatetheboringstuff.com/chapter11

• †[Scraping-R] Simon Munzert, Christian Rubba, Peter Meißner, and Dominic Nyhuis. 2014. Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining. Wiley. http://onlinelibrary.wiley.com/book/10.1002/9781118834732

• ‡CRAN: https://cran.r-project.org/web/views/WebTechnologies.html

Ethics & Scientific Responsibility in Big Social Data

Human Subjects, Consent, Privacy

• BitByBit. Ch. 6 (“Ethics”)

• ‡[PSU-ORP] Penn State Office for Research Protections (under the VP for Research):

Human Subjects Research (IRB): https://www.research.psu.edu/irb

Revised Common Rule: https://www.research.psu.edu/irb/commonrulechanges

Responsible Conduct of Research: https://www.research.psu.edu/education/rcr

Research Misconduct: https://www.research.psu.edu/researchmisconduct

SARI @ PSU (Scientific and Research Integrity Training): https://www.research.psu.edu/training/sari

• ‡[MenloReport] David Dittrich and Erin Kenneally (Center for Applied Internet Data Analysis). 2012. “The Menlo Report: Ethical Principles Guiding Information and Communication Technology Research” and companion report “Applying Ethical Principles ...” https://www.caida.org/publications/papers/2012/menlo_report_actual_formatted

• ‡[BigDataEthics-CBDES] Jacob Metcalf, Emily F. Keller, and danah boyd. 2016. “Perspectives on Big Data, Ethics, and Society.” Council for Big Data, Ethics, and Society. http://bdes.datasociety.net/council-output/perspectives-on-big-data-ethics-and-society

• ‡[AoIRReport] Annette Markham and Elizabeth Buchanan (Association of Internet Researchers). 2012. “Ethical Decision-Making and Internet Research: Recommendations of the AoIR Ethics Working Committee (Version 2.0).” https://aoir.org/reports/ethics2.pdf and guidelines chart: https://aoir.org/wp-content/uploads/2017/01/aoir_ethics_graphic_2016.pdf

• ‡[BigDataEthics-Wired] Sarah Zhang. 2016. “Scientists are Just as Confused about the Ethics of Big-Data Research as You.” Wired. https://www.wired.com/2016/05/scientists-just-confused-ethics-big-data-research

• †[BigDataEthics-HerschelMori] Richard Herschel and Virginia M. Miori. 2017. “Ethics & Big Data.” Technology in Society 49: 31-36. http://doi.org/10.1016/j.techsoc.2017.03.003

The Science of Data Privacy

• †[BigDataPrivacy] Terence Craig and Mary E. Ludloff. 2011. Privacy and Big Data. O’Reilly Media. http://pensu.eblib.com/patron/FullRecord.aspx?p=781814

• ‡[DataAnalysisPrivacy] John Abowd, Lorenzo Alvisi, Cynthia Dwork, Sampath Kannan, Ashwin Machanavajjhala, and Jerome Reiter. 2017. “Privacy-Preserving Data Analysis for the Federal Statistical Agencies.” A Computing Community Consortium white paper. https://arxiv.org/abs/1701.00752

• †[DataPrivacy] Stephen E. Fienberg and Aleksandra B. Slavkovic. 2011. “Data Privacy and Confidentiality.” International Encyclopedia of Statistical Science, 342–5. http://doi.org/10.1007/978-3-642-04898-2_202

• ‡[NetworksPrivacy] Vishesh Karwa and Aleksandra Slavkovic. 2016. “Inference using noisy degrees: Differentially private β-model and synthetic graphs.” Annals of Statistics 44(1): 87-112. http://projecteuclid.org/euclid.aos/1449755958

• †[DataPublishingPrivacy] Raymond Chi-Wing Wong and Ada Wai-Chee Fu. 2010. Privacy-Preserving Data Publishing: An Overview. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdfplus/10.2200/S00237ED1V01Y201003DTM002

• DataMatching. Ch. 8 “Privacy Aspects of Data Matching”

Transparency, Reproducibility, and Team Science

• ‡[10RulesforData] Alyssa Goodman, Alberto Pepe, Alexander W. Blocker, Christine L. Borgman, Kyle Cranmer, Merce Crosas, Rosanne Di Stefano, Yolanda Gil, Paul Groth, Margaret Hedstrom, David W. Hogg, Vinay Kashyap, Ashish Mahabal, Aneta Siemiginowska, and Aleksandra Slavkovic. 2014. “Ten Simple Rules for the Care and Feeding of Scientific Data.” PLoS Computational Biology 10(4): e1003542. https://doi.org/10.1371/journal.pcbi.1003542 (Note esp. curated resources for reproducible research)

• †[Transparency] E. Miguel, C. Camerer, K. Casey, J. Cohen, K. M. Esterling, A. Gerber, R. Glennerster, D. P. Green, M. Humphreys, G. Imbens, D. Laitin, T. Madon, L. Nelson, B. A. Nosek, M. Petersen, R. Sedlmayr, J. P. Simmons, U. Simonsohn, and M. Van der Laan. 2014. “Promoting Transparency in Social Science Research.” Science 343(6166): 30–1. https://doi.org/10.1126/science.1245317

• †[Reproducibility] Marcus R. Munafò, Brian A. Nosek, Dorothy V. M. Bishop, Katherine S. Button, Christopher D. Chambers, Nathalie Percie du Sert, Uri Simonsohn, Eric-Jan Wagenmakers, Jennifer J. Ware, and John P. A. Ioannidis. 2017. “A Manifesto for Reproducible Science.” Nature Human Behaviour 0021 (2017). https://doi.org/10.1038/s41562-016-0021

• ‡[SoftwareCarpentry] esp. “Lessons”: http://software-carpentry.org/lessons

• DSHandbook-Py. Ch. 9 “Technical Communication and Documentation”; Ch. 15 “Software Engineering Best Practices”

bull dagger[TeamScienceToolkit] Vogel AL Hall KL Fiore SM Klein JT Bennett LM Gadlin H Stokols DNebeling LC Wuchty S Patrick K Spotts EL Pohl C Riley WT Falk-Krzesinski HJ 2013 ldquoTheTeam Science Toolkit enhancing research collaboration through online knowledge sharingrdquo AmericanJournal of Preventive Medicine 45 787-9 http101016jamepre201309001

bull sect[Databrary] Kara Hall Robert Croyle and Amanda Vogel Forthcoming (2017) Advancing Socialand Behavioral Health Research through Cross-disciplinary Team Science Springer Includes RickO Gilmore and Karen E Adolph ldquoOpen Sharing of Research Video Breaking the Boundaries of theResearch Teamrdquo (See httpdatabraryorg)

bull CRAN httpscranr-projectorgwebviewsReproducibleResearchhtml

Social Bias, Fair Algorithms

• MachineBias

• ‡[EmbeddingsBias] Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. “Semantics Derived Automatically from Language Corpora Contain Human Biases.” Science. https://arxiv.org/abs/1608.07187

• ‡[Debiasing] Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, Adam Kalai. 2016. “Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings.” https://arxiv.org/abs/1607.06520

• ‡[AvoidingBias] Moritz Hardt, Eric Price, Nathan Srebro. 2016. “Equality of Opportunity in Supervised Learning.” https://arxiv.org/abs/1610.02413

• ‡[InevitableBias] Jon Kleinberg, Sendhil Mullainathan, Manish Raghavan. 2016. “Inherent Trade-Offs in the Fair Determination of Risk Scores.” https://arxiv.org/abs/1609.05807

Databases and Data Management

• NRCreport, Ch. 3, “Scaling the Infrastructure for Data Management.”

• †[SQL] Jan L. Harrington. 2010. SQL Clearly Explained. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780123756978

• DSHandbook-Py, Ch. 14, “Databases.”

• †[noSQL] Guy Harrison. 2015. Next Generation Databases: NoSQL, NewSQL, and Big Data. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-1329-2

• †[Cloud] Divyakant Agrawal, Sudipto Das, and Amr El Abbadi. 2012. Data Management in the Cloud: Challenges and Opportunities. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdfplus/10.2200/S00456ED1V01Y201211DTM032

• †Morgan & Claypool Synthesis Lectures on Data Management: http://www.morganclaypool.com/toc/dtm/1/1

Data Wrangling

Theoretically-structured approaches to data wrangling

• ‡[TidyData-R] Garrett Grolemund and Hadley Wickham. 2017. R for Data Science (http://r4ds.had.co.nz), esp. Ch. 12, “Tidy Data”; also Wickham. 2014. “Tidy Data.” Journal of Statistical Software 59(10). http://www.jstatsoft.org/v59/i10/paper Tools: The tidyverse, http://tidyverse.org

• ‡[DataScience-Py] Jake VanderPlas. 2016. Python Data Science Handbook (https://github.com/jakevdp/PythonDataScienceHandbook), esp. Ch. 3 on pandas, http://pandas.pydata.org

• ‡[DataCarpentry] Colin Gillespie and Robin Lovelace. 2017. Efficient R Programming. https://csgillespie.github.io/efficientR, esp. Ch. 6 on “efficient data carpentry.”

• §[Wrangler] Joseph M. Hellerstein, Jeffrey Heer, Tye Rattenbury, and Sean Kandel. 2017. Data Wrangling: Practical Techniques for Data Preparation. Tool: Trifacta Wrangler, http://www.trifacta.com/products/wrangler

Data wrangling practice

• DSHandbook-Py, Ch. 4, “Data Munging: String Manipulation, Regular Expressions, and Data Cleaning.”

• †[DataSimplification] Jules J. Berman. 2016. Data Simplification: Taming Information with Open Source Tools. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780128037812

• †[Wrangling-Py] Jacqueline Kazil, Katharine Jarmul. 2016. Data Wrangling with Python. O’Reilly. http://proquestcombo.safaribooksonline.com.ezaccess.libraries.psu.edu/9781491948804

• †[Wrangling-R] Bradley C. Boehmke. 2016. Data Wrangling with R. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45599-0

Record Linkage, Entity Resolution, Deduplication

• †[DataMatching] Peter Christen. 2012. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-642-31164-2 (esp. Ch. 2, “The Data Matching Process”)

• DataSimplification, Chapter 5, “Identifying and Deidentifying Data.”

• ‡[EntityResolution] Lise Getoor and Ashwin Machanavajjhala. 2013. “Entity Resolution for Big Data.” Tutorial, KDD. http://www.umiacs.umd.edu/~getoor/Tutorials/ER_KDD2013.pdf

• ‡[SyrianCasualties] Peter Sadosky, Anshumali Shrivastava, Megan Price, and Rebecca C. Steorts. 2015. “Blocking Methods Applied to Casualty Records from the Syrian Conflict.” https://arxiv.org/abs/1510.07714 (For more on blocking, see DataMatching, Ch. 4, “Indexing.”)

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html, “Statistical Matching and Record Linkage.”

“Making up data”: Imputation, Smoothers, Kernels, Priors, Filters, Teleportation, Negative Sampling, Convolution, Augmentation, Adversarial Training

• †[Imputation] Yi Deng, Changgee Chang, Moges Seyoum Ido, and Qi Long. 2016. “Multiple Imputation for General Missing Data Patterns in the Presence of High-dimensional Data.” Scientific Reports 6(21689). https://www.nature.com/articles/srep21689

• ‡[Overimputation] Matthew Blackwell, James Honaker, and Gary King. 2017. “A Unified Approach to Measurement Error and Missing Data: Overview and Applications.” Sociological Methods & Research 46(3): 303-341. http://gking.harvard.edu/files/gking/files/measure.pdf

• Shalizi-ADA, Sect. 1.5 (Linear Smoothers), Ch. 8 (Splines), Sect. 14.4 (Kernel Density Estimates); DataScience-Python, NB 05.13, “Kernel Density Estimation.”

• FightinWords (re: Bayesian priors as additional data; impact of priors on regularization)

• DeepLearning, Section 7.5 (Data Augmentation), 7.12 (Dropout), 7.13 (Adversarial Examples), Chapter 9 (Convolutional Networks)

• †[Adversarial] Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy. 2015. “Explaining and Harnessing Adversarial Examples.” https://arxiv.org/abs/1412.6572

• See also resampling and simulation methods.

• See also feature engineering, preprocessing.

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html, “Imputation.”

(Direct) Data Representations, Data Mappings

• NRCreport, Ch. 5, “Large-Scale Data Representations.”

• †[PatternRecognition] M. Narasimha Murty and V. Susheela Devi. 2011. Pattern Recognition: An Algorithmic Approach. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-0-85729-495-1 Section 2.1, “Data Structures for Pattern Representation.”

• ‡[InfoRetrieval] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2009. Introduction to Information Retrieval. Cambridge University Press. http://nlp.stanford.edu/IR-book Ch. 1, 2, 6 (also note slides used in their class).

• †[Algorithms] Brian Steele, John Chandler, Swarna Reddy. 2016. Algorithms for Data Science. Wiley. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45797-0 Ch. 2, “Data Mapping and Data Dictionaries.”

• See also Social Data Structures.

Similarity, Distance, Association, Covariance, The Kernel Trick

• PatternRecognition, Section 2.3, “Proximity Measures.”

• ‡[Similarity] Brendan O’Connor. 2012. “Cosine similarity, Pearson correlation, and OLS coefficients.” https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients

• For relatively comprehensive lists, see also:

M-J. Lesot, M. Rifqi, and H. Benhadda. 2009. “Similarity measures for binary and numerical data: a survey.” Int. J. Knowledge Engineering and Soft Data Paradigms 1(1): 63-. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.212.6533&rep=rep1&type=pdf

Seung-Seok Choi, Sung-Hyuk Cha, Charles C. Tappert. 2010. “A Survey of Binary Similarity and Distance Measures.” Journal of Systemics, Cybernetics & Informatics 8(1): 43-8. http://www.iiisci.org/Journal/CV$/sci/pdfs/GS315JG.pdf

Sung-Hyuk Cha. 2007. “Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions.” International Journal of Mathematical Models and Methods in Applied Sciences 4(1): 300-7. http://csis.pace.edu/ctappert/dps/d861-12/session4-p2.pdf

Anna Huang. 2008. “Similarity Measures for Text Document Clustering.” Proceedings of the 6th New Zealand Computer Science Research Student Conference, 49-56. http://www.nzcsrsc08.canterbury.ac.nz/site/proceedings/Individual_Papers/pg049_Similarity_Measures_for_Text_Document_Clustering.pdf

• Michael I. Jordan. “The Kernel Trick” (Lecture Notes). https://people.eecs.berkeley.edu/~jordan/courses/281B-spring04/lectures/lec3.pdf

• Eric Kim. 2017. “Everything You Ever Wanted to Know about the Kernel Trick (But Were Afraid to Ask).” http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick_blog_ekim_12_20_2017.pdf

Derived Data Representations - Dimensionality Reduction, Compression, Decomposition, Embeddings

The groupings here and under the measurement / multivariate statistics section are particularly arbitrary. For example, “k-Means clustering” can be viewed as a technique for “dimensionality reduction,” “compression,” “feature extraction,” “latent variable measurement,” “unsupervised learning,” or “collaborative filtering.”
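To make the overlap between these labels concrete, the following is a minimal sketch (not taken from any of the readings) in which one k-means fit is read two ways: as clustering, and as lossy compression by vector quantization. The function names `assign` and `kmeans_1d`, the deterministic initialization, and the toy data are all assumptions chosen for illustration; real analyses would use a library implementation.

```python
from typing import List, Tuple


def assign(values: List[float], centroids: List[float]) -> List[int]:
    """Return, for each value, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: (v - centroids[j]) ** 2)
        for v in values
    ]


def kmeans_1d(values: List[float], k: int, max_iters: int = 50) -> Tuple[List[float], List[int]]:
    """Lloyd's algorithm on scalars; returns (codebook, codes)."""
    if not 1 <= k <= len(values):
        raise ValueError("k must be between 1 and len(values)")
    ordered = sorted(values)
    n = len(ordered)
    # Hypothetical deterministic start: k evenly spaced order statistics.
    centroids = [float(ordered[(2 * i + 1) * n // (2 * k)]) for i in range(k)]
    codes = assign(values, centroids)
    for _ in range(max_iters):
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [v for v, c in zip(values, codes) if c == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = sum(members) / len(members)
        # Assignment step: relabel every value with its nearest centroid.
        new_codes = assign(values, centroids)
        if new_codes == codes:  # converged: labels no longer change
            break
        codes = new_codes
    return centroids, codes


# Toy data (made up): six values forming two obvious groups.
data = [1, 2, 3, 11, 12, 13]
codebook, codes = kmeans_1d(data, k=2)

# "Clustering" view: the codes are group labels.
# "Compression" view: store the 2-entry codebook plus one small integer per
# value, then reconstruct each value (lossily) from its centroid.
reconstruction = [codebook[c] for c in codes]
print(codebook, codes, reconstruction)
```

The same two return values support both readings: `codes` alone is a cluster assignment (or a discrete latent variable), while `codebook` together with `codes` is a compressed encoding of the data.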

Clustering, hashing, quantization, blocking, compression

• ‡[Compression] Khalid Sayood. 2012. Introduction to Data Compression, 4th ed. Springer. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780124157965 (e.g., coding, blocking via vector quantization)

• MMDS, Ch. 3, “Finding Similar Items” (minhashing, locality sensitive hashing)

• Multivariate-R, Ch. 6, “Clustering.”

• DataScience-Python, NB 05.11, “k-Means Clustering”; NB 05.12, “Gaussian Mixture Models.”

• ‡[KMeansHashing] Kaiming He, Fang Wen, Jian Sun. 2013. “K-means Hashing: An Affinity-Preserving Quantization Method for Learning Binary Compact Codes.” CVPR. https://www.cv-foundation.org/openaccess/content_cvpr_2013/papers/He_K-Means_Hashing_An_2013_CVPR_paper.pdf

• †[Squashing] Madigan, D., Raghavan, N., Dumouchel, W., Nason, M., Posse, C., and Ridgeway, G. (2002). “Likelihood-based data squashing: A modeling approach to instance construction.” Data Mining and Knowledge Discovery 6(2): 173-190. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1023/A:1014095614948

• †[Core-sets] Piotr Indyk, Sepideh Mahabadi, Mohammad Mahdian, Vahab S. Mirrokni. 2014. “Composable core-sets for diversity and coverage maximization.” PODS ’14, 100-8. https://doi.org/10.1145/2594538.2594560

Feature selection, feature extraction, feature engineering, weighting, preprocessing

• PatternRecognition, Sections 2.6-7, “Feature Selection,” “Feature Extraction.”

• DSHandbook-Py, Ch. 7, “Interlude: Feature Extraction Ideas.”

• DataScience-Python, NB 05.04, “Feature Engineering.”

• Features in text: NLP and InfoRetrieval re: tf-idf and similar; FightinWords

• Features in images: DataScience-Python, NB 05.14, “Image Features.”

Dimensionality reduction, decomposition, factorization, change of basis, reparameterization, matrix completion, latent variables, source separation

• MMDS, Ch. 9, “Recommendation Systems”; Ch. 11, “Dimensionality Reduction.”

• DSHandbook-Py, Ch. 10, “Unsupervised Learning: Clustering and Dimensionality Reduction.”

• ‡[Shalizi-ADA] Cosma Rohilla Shalizi. 2017. Advanced Data Analysis from an Elementary Point of View. Ch. 16 (“Principal Components Analysis”), Ch. 17 (“Factor Models”). http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV

See also Multivariate-R, Ch. 3, “Principal Components Analysis”; Ch. 4, “Multidimensional Scaling”; Ch. 5, “Exploratory Factor Analysis”; Latent

• DeepLearning, Ch. 2, “Linear Algebra”; Ch. 13, “Linear Factor Models.”

• ‡[GloVe] Jeffrey Pennington, Richard Socher, Christopher Manning. 2014. “GloVe: Global vectors for word representation.” EMNLP. https://nlp.stanford.edu/projects/glove

• †[NMF] Daniel D. Lee and H. Sebastian Seung. 1999. “Learning the parts of objects by non-negative matrix factorization.” Nature 401: 788-791. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1038/44565

• †[CUR] Michael W. Mahoney and Petros Drineas. 2009. “CUR matrix decompositions for improved data analysis.” PNAS. http://www.pnas.org/content/106/3/697.full

• ‡[ICA] Aapo Hyvärinen and Erkki Oja. 2000. “Independent Component Analysis: Algorithms and Applications.” Neural Networks 13(4-5): 411-30. https://www.cs.helsinki.fi/u/ahyvarin/papers/NN00new.pdf

• [RandomProjection] Ella Bingham and Heikki Mannila. 2001. “Random projection in dimensionality reduction: applications to image and text data.” KDD. https://doi.org/10.1145%2F502512.502546

• Compression, e.g., “Transform coding,” “Wavelets.”

Nonlinear dimensionality reduction, Manifold learning

• DeepLearning, Section 5.11.3, “Manifold Learning.”

• DataScience-Python, NB 05.10, “Manifold Learning” (Locally linear embedding [LLE], Isomap)

• ‡[KernelPCA] Sebastian Raschka. 2014. “Kernel tricks and nonlinear dimensionality reduction via RBF kernel PCA.” http://sebastianraschka.com/Articles/2014_kernel_pca.html

• Shalizi-ADA, Ch. 18, “Nonlinear Dimensionality Reduction” (LLE)

• ‡[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi. 2003. “Laplacian eigenmaps for dimensionality reduction and data representation.” Neural Computation 15(6): 1373–1396. http://web.cse.ohio-state.edu/~belkin.8/papers/LEM_NC_03.pdf

• DeepLearning, Ch. 14 (“Autoencoders”)

• ‡[word2vec] Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” https://arxiv.org/abs/1301.3781

• ‡[word2vecExplained] Yoav Goldberg and Omer Levy. 2014. “word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method.” https://arxiv.org/abs/1402.3722

• ‡[t-SNE] Laurens van der Maaten and Geoffrey Hinton. 2008. “Visualising Data using t-SNE.” Journal of Machine Learning Research 9: 2579-2605. http://jmlr.csail.mit.edu/papers/volume9/vandermaaten08a/vandermaaten08a.pdf

Computation and Scaling Up

Scientific computing, computation at scale

• NRCreport, Ch. 10, “The Seven Computational Giants of Massive Data Analysis.”

• DSHandbook-Py, Ch. 21, “Performance and Computer Memory”; Ch. 22, “Computer Memory and Data Structures.”

• ‡[ComputationalStatistics-Python] Cliburn Chan. Computational Statistics in Python. https://people.duke.edu/~ccc14/sta-663/index.html

• CRAN Task View: High Performance Computing, https://cran.r-project.org/web/views/HighPerformanceComputing.html

Numerical computing

• [NumericalComputing] Ward Cheney and David Kincaid. 2013. Numerical Mathematics and Computing, 7th ed. Brooks/Cole: Cengage Learning.

• CRAN Task View: Numerical Mathematics, https://cran.r-project.org/web/views/NumericalMathematics.html

Optimization (e.g., MLE, gradient descent, stochastic gradient descent, EM algorithm, neural nets)

• DSHandbook-Py, Ch. 23, “Maximum Likelihood Estimation and Optimization” (gradient descent)

• DeepLearning, Ch. 4, “Numerical Computation”; Ch. 6, “Deep Feedforward Networks”; Ch. 8, “Optimization for Training Deep Models.”

• CRAN Task View: Optimization and Mathematical Programming, https://cran.r-project.org/web/views/Optimization.html

Linear algebra, matrix computations

• [MatrixComputations] Gene H. Golub and Charles F. Van Loan. 2013. Matrix Computations, 4th ed. Johns Hopkins University Press.

• ‡[NetflixMatrix] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. “Matrix Factorization Techniques for Recommender Systems.” Computer, August: 42-9. https://datajobs.com/data-science-repo/Recommender-Systems-[Netflix].pdf

• ‡[BigDataPCA] Jianqing Fan, Qiang Sun, Wen-Xin Zhou, Ziwei Zhu. “Principal Component Analysis for Big Data.” http://www.princeton.edu/~ziweiz/pca.pdf

• ‡[RandomSVD] Andrew Tulloch. 2009. “Fast Randomized SVD.” https://research.fb.com/fast-randomized-svd

• ‡[FactorSGD] Rainer Gemulla, Peter J. Haas, Erik Nijkamp, and Yannis Sismanis. 2011. “Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent.” KDD. http://www.cs.utah.edu/~hari/teaching/bigdata/gemulla11dsgd.pdf

• ‡[Sparse] Max Grossman. 2015. “101 Ways to Store a Sparse Matrix.” https://medium.com/@jmaxg3/101-ways-to-store-a-sparse-matrix-c7f2bf15a229

• ‡[DontInvert] John D. Cook. 2010. “Don’t Invert that Matrix.” https://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix

Simulation-based inference, resampling, Monte Carlo methods, MCMC, Bayes, approximate inference

• ‡[MarkovVisually] Victor Powell. “Markov Chains Explained Visually.” http://setosa.io/ev/markov-chains

• DSHandbook-Py, Ch. 25, “Stochastic Modeling” (Markov chains, MCMC, HMM)

• Bayes; ComputationalStatistics-Py

• DeepLearning, Ch. 17, “Monte Carlo Methods”; Ch. 19, “Approximate Inference.”

• ‡[VariationalInference] Jason Eisner. 2011. “High-Level Explanation of Variational Inference.” https://www.cs.jhu.edu/~jason/tutorials/variational.html

• CRAN Task View: Bayesian Inference, https://cran.r-project.org/web/views/Bayesian.html

Parallelism, MapReduce, Split-Apply-Combine

• ‡[MapReduceIntuition] Jigsaw Academy. 2014. “Big Data Specialist: MapReduce.” https://www.youtube.com/watch?v=TwcYQzFqg-8&feature=youtu.be

• MMDS, Ch. 2, “Map-Reduce and the New Software Stack” (3rd Edition discusses Spark & TensorFlow)

• Algorithms, Ch. 3, “Scalable Algorithms and Associative Statistics”; Ch. 4, “Hadoop and MapReduce.”

• NRCreport, Ch. 6, “Resources, Trade-offs, and Limitations.”

• TidyData-R (Split-apply-combine is the motivating principle behind the “tidyverse” approach). See also Part III, “Program” (pipes, functions, vectors, iteration).

• ‡[TidySAC-Video] Hadley Wickham. 2017. “Data Science in the Tidyverse.” https://www.rstudio.com/resources/videos/data-science-in-the-tidyverse

• See also FactorSGD.
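As a rough illustration of the split-apply-combine pattern named in this section's heading, the sketch below uses only the Python standard library. The helper `split_apply_combine`, the field names, and the turnout figures are hypothetical and made up for the example; in practice one would likely reach for dplyr's `group_by`/`summarise` or pandas' `groupby`.

```python
from collections import defaultdict
from statistics import mean
from typing import Any, Callable, Dict, Iterable, List, Mapping


def split_apply_combine(
    rows: Iterable[Mapping[str, Any]],
    key: str,
    value: str,
    func: Callable[[List[Any]], Any],
) -> Dict[Any, Any]:
    """Group rows by `key`, apply `func` to each group's `value`s, collect results."""
    groups: Dict[Any, List[Any]] = defaultdict(list)
    for row in rows:
        # Split: route each row's value to the bucket for its key.
        groups[row[key]].append(row[value])
    # Apply `func` to each bucket independently, then combine into one dict.
    return {group: func(values) for group, values in groups.items()}


# Made-up records for illustration only.
rows = [
    {"state": "PA", "turnout": 60},
    {"state": "OH", "turnout": 50},
    {"state": "PA", "turnout": 70},
]
mean_turnout = split_apply_combine(rows, key="state", value="turnout", func=mean)
print(mean_turnout)
```

Because each group is processed independently in the apply step, the per-group work could be farmed out to separate workers, which is the intuition that connects this pattern to MapReduce.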

Functional Programming

• ComputationalStatistics-Python, “Functions are first class objects” through first exercises.

• DSHandbook-Py, Ch. 20, “Programming Language Concepts.”

• See also Haskell.

Scaling, iteration, streaming data, online algorithms (Spark)

• NRCreport, Ch. 4, “Temporal Data and Real-Time Algorithms.”

• MMDS, Ch. 4, “Mining Data Streams.”

• ‡[BDAS] AMPLab. BDAS: The Berkeley Data Analytics Stack. https://amplab.cs.berkeley.edu/software

• †[Spark] Mohammed Guller. 2015. Big Data Analytics with Spark: A Practitioner’s Guide to Using Spark for Large-Scale Data Processing, Machine Learning, and Graph Analytics, and High-Velocity Data Stream Processing. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-0964-6

• DSHandbook-Py, Ch. 13, “Big Data.”

• DeepLearning, Ch. 10, “Sequence Modeling: Recurrent and Recursive Nets.”

General Resources for Python and R

• †[DSHandbook-Py] Field Cady. 2017. The Data Science Handbook (Python-based). Wiley. http://onlinelibrary.wiley.com.ezaccess.libraries.psu.edu/book/10.1002/9781119092919

• ‡[Tutorials-R] Ujjwal Karn. 2017. “A curated list of R tutorials for Data Science, NLP and Machine Learning.” https://github.com/ujjwalkarn/DataScienceR

• ‡[Tutorials-Python] Ujjwal Karn. 2017. “A curated list of Python tutorials for Data Science, NLP and Machine Learning.” https://github.com/ujjwalkarn/DataSciencePython

Cutting and Bleeding Edge of Data Science Languages

• [Scala] †Vishal Layka and David Pollak. 2015. Beginning Scala. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-0232-6 †Nicolas Patrick. 2014. Scala for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1901910 ‡Scala site: https://www.scala-lang.org (See also )

• [Julia] †Ivo Balbaert. 2015. Getting Started with Julia. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1973847 ‡Douglas Bates. 2013. “Julia for R Programmers.” http://www.stat.wisc.edu/~bates/JuliaForRProgrammers.pdf ‡Julia site: https://julialang.org

• [Haskell] †Richard Bird. 2014. Thinking Functionally with Haskell. https://doi-org.ezaccess.libraries.psu.edu/10.1017/CBO9781316092415 †Hakim Cassimally. 2017. Learning Haskell Programming. https://www.lynda.com/Haskell-tutorials/Learning-Haskell-Programming/604926-2.html †James Church. 2017. Learning Haskell for Data Analysis. https://www.lynda.com/Developer-tutorials/Learning-Haskell-Data-Analysis/604234-2.html Haskell site: https://www.haskell.org

• [Clojure] †Mark McDonnell. 2017. Quick Clojure: Essential Functional Programming. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2952-1 †Akhil Wali. 2014. Clojure for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1674848 †Arthur Ulfeldt. 2015. https://www.lynda.com/Clojure-tutorials/Up-Running-Clojure/413127-2.html ‡Clojure site: https://clojure.org

• [TensorFlow] ‡Abadi et al. (Google). 2016. “TensorFlow: A System for Large-Scale Machine Learning.” https://arxiv.org/abs/1605.08695 ‡Udacity. “Deep Learning.” https://www.udacity.com/course/deep-learning--ud730 ‡TensorFlow site: https://www.tensorflow.org

• [H2O] Darren Cook. 2016. Practical Machine Learning with H2O: Powerful, Scalable Techniques for Deep Learning and AI. O’Reilly. Arno Candel and Viraj Parmar. 2015. Deep Learning with H2O. ‡H2O site: https://www.h2o.ai

Social Data Structures

Space and Time

• †[GIA] David O’Sullivan, David J. Unwin. 2010. Geographic Information Analysis, Second Edition. John Wiley & Sons. http://onlinelibrary.wiley.com/book/10.1002/9780470549094

• [Space-Time] Donna J. Peuquet. 2003. Representations of Space and Time. Guilford.

• Roger S. Bivand, Edzer Pebesma, Virgilio Gómez-Rubio. 2013. Applied Spatial Data Analysis in R. Free through library: http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-7618-4 See also Edzer Pebesma. 2016. “Handling and Analyzing Spatial, Spatiotemporal and Movement Data in R.” https://edzer.github.io/UseR2016

• ‡[PySAL] Sergio J. Rey and Dani Arribas-Bel. 2016. “Geographic Data Science with PySAL and the pydata Stack.” http://darribas.org/gds_scipy16

• DataScience-Python, NB 04.13, “Geographic Data with Basemap.”

• Time: “Longitudinal,” intra-individual data, as common in developmental psychology. †[Longitudinal] Garrett Fitzmaurice, Marie Davidian, Geert Verbeke, and Geert Molenberghs, editors. 2009. Longitudinal Data Analysis. Chapman & Hall / CRC. http://pensu.eblib.com/patron/FullRecord.aspx?p=359998

• Time: “Time Series, Time Series Cross Section, Panel,” as common in economics, political science, and sociology. DataScience-Python, Ch. 3 re: pandas; DSHandbook-Py, Ch. 17, “Time Series Analysis”; Econometrics; GelmanHill.

• Time: “Sequential, Data Streams,” as in NLP, machine learning. Spark and other “streaming data” readings.

• Space-Time: Data with continuity, connectivity, neighborhood structure. See DeepLearning, Chapter 9, re: convolution.

• CRAN Task Views: Spatial, https://cran.r-project.org/web/views/Spatial.html; Spatiotemporal, https://cran.r-project.org/web/views/SpatioTemporal.html; Time Series, https://cran.r-project.org/web/views/TimeSeries.html

Network Graphs

• ‡[Networks] David Easley and Jon Kleinberg. 2010. Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press. http://www.cs.cornell.edu/home/kleinber/networks-book

• MMDS, Ch. 5, “Link Analysis”; Ch. 10, “Mining Social-Network Graphs.”

• †[Networks-R] Douglas Luke. 2015. A User’s Guide to Network Analysis in R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-3-319-23883-8

• †[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry. 2017. Python for Graph and Network Analysis. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-53004-8

Hierarchy, Clustered Data, Aggregation, Mixed Models

• †[GelmanHill] Andrew Gelman and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press. http://pensu.eblib.com/patron/FullRecord.aspx?p=288457

• †[MixedModels] Eugene Demidenko. 2013. Mixed Models: Theory and Applications with R, 2nd ed. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10748641

Social Data Channels

Language, Text, Speech, Audio

• ‡[NLP] Daniel Jurafsky and James H. Martin. Forthcoming (2017). Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics, 3rd ed. Prentice-Hall. Preprint: https://web.stanford.edu/~jurafsky/slp3

• InfoRetrieval

• ‡[CoreNLP] Stanford CoreNLP. https://stanfordnlp.github.io/CoreNLP

• †[TextAsData] Grimmer and Stewart. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis. https://doi.org/10.1093/pan/mps028

• Quinn-Topics

• FightinWords

• †[NLTK-Python] Jacob Perkins. 2010. Python Text Processing with NLTK 2.0 Cookbook. Packt. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10435387

• †[TextAnalytics-Python] Dipanjan Sarkar. 2016. Text Analytics with Python. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2388-8

• DSHandbook-Py, Ch. 16, “Natural Language Processing.”

• Matthew J. Denny. http://www.mjdenny.com/Text_Processing_In_R.html

• CRAN: Natural Language Processing, https://cran.r-project.org/web/views/NaturalLanguageProcessing.html

• †[AudioData] Dean Knox and Christopher Lucas. 2017. “The Speaker-Affect Model: Measuring Emotion in Political Speech with Audio Data.” http://christopherlucas.org/files/PDFs/sam.pdf https://www.youtube.com/watch?v=Hs8A9dwkMzI

• †Morgan & Claypool Synthesis Lectures on Human Language Technologies: http://www.morganclaypool.com/toc/hlt/1/1

• †Morgan & Claypool Synthesis Lectures on Speech and Audio Processing: http://www.morganclaypool.com/toc/sap/1/1

Vision, Image, Video

• ‡[ComputerVision] Richard Szeliski. 2010. Computer Vision: Algorithms and Applications. Springer. http://szeliski.org/Book

• MixedModels, Ch. 11, “Statistical Analysis of Shape”; Ch. 12, “Statistical Image Analysis.”

• Audio/Video/Volumetric: See DeepLearning, Chapter 9, re: convolution.

• †Morgan & Claypool Synthesis Lectures on Image, Video, and Multimedia Processing: http://www.morganclaypool.com/toc/ivm/1/1

• †Morgan & Claypool Synthesis Lectures on Computer Vision: http://www.morganclaypool.com/toc/cov/1/1

Approaches to Learning from Data (The Analytics Layer)

• NRCreport, Ch. 7, “Building Models from Massive Data.”

• MMDS (data-mining)

• CausalInference

• ‡[VisualAnalytics] Daniel Keim, Jörn Kohlhammer, Geoffrey Ellis, and Florian Mansmann, editors. 2010. Mastering the Information Age: Solving Problems with Visual Analytics. The Eurographics Association, Goslar, Germany. http://www.vismaster.eu/book See also †Morgan & Claypool Synthesis Lectures on Visualization: http://www.morganclaypool.com/toc/vis/2/1

• ‡[Econometrics] Bruce E. Hansen. 2014. Econometrics. http://www.ssc.wisc.edu/~bhansen/econometrics

• ‡[StatisticalLearning] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning with Applications in R. Springer. http://www-bcf.usc.edu/~gareth/ISL/index.html

• [MachineLearning] Christopher Bishop. 2006. Pattern Recognition and Machine Learning. Springer.

• †[Bayes] Peter D. Hoff. 2009. A First Course in Bayesian Statistical Methods. Springer. http://link.springer.com/book/10.1007%2F978-0-387-92407-6

• †[GraphicalModels] Søren Højsgaard, David Edwards, Steffen Lauritzen. 2012. Graphical Models with R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-2299-0

• [InformationTheory] David MacKay. 2003. Information Theory, Inference, and Learning Algorithms. Springer.

• ‡[DeepLearning] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org (see also Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. “Deep Learning.” Nature 521(7553): 436–444. https://doi.org/10.1038/nature14539)

See also ‡[Keras] Keras: The Python Deep Learning Library, https://keras.io François Chollet, Directory of Keras tutorials: https://github.com/fchollet/keras-resources

• †[SignalProcessing] Jose Maria Giron-Sierra. 2013. Digital Signal Processing with MATLAB Examples (Volumes 1-3). Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-981-10-2534-1

Penn State Policy Statements

Academic Integrity

Academic integrity is the pursuit of scholarly activity in an open, honest, and responsible manner. Academic integrity is a basic guiding principle for all academic activity at The Pennsylvania State University, and all members of the University community are expected to act in accordance with this principle. Consistent with this expectation, the University’s Code of Conduct states that all students should act with personal integrity; respect other students’ dignity, rights, and property; and help create and maintain an environment in which all can succeed through the fruits of their efforts.

Academic integrity includes a commitment by all members of the University community not to engage in or tolerate acts of falsification, misrepresentation, or deception. Such acts of dishonesty violate the fundamental ethical principles of the University community and compromise the worth of work completed by others.

Disability Accommodation

Penn State welcomes students with disabilities into the University’s educational programs. Every Penn State campus has an office for students with disabilities. The Student Disability Resources (SDR) website provides contact information for every Penn State campus (http://equity.psu.edu/sdr/disability-coordinator). For further information, please visit the Student Disability Resources website (http://equity.psu.edu/sdr).

In order to receive consideration for reasonable accommodations, you must contact the appropriate disability services office at the campus where you are officially enrolled, participate in an intake interview, and provide documentation. See documentation guidelines at (http://equity.psu.edu/sdr/guidelines). If the documentation supports your request for reasonable accommodations, your campus disability services office will provide you with an accommodation letter. Please share this letter with your instructors and discuss the accommodations with them as early as possible. You must follow this process for every semester that you request accommodations.

Psychological Services

Many students at Penn State face personal challenges or have psychological needs that may interfere with their academic progress, social development, or emotional wellbeing. The university offers a variety of confidential services to help you through difficult times, including individual and group counseling, crisis intervention, consultations, online chats, and mental health screenings. These services are provided by staff who welcome all students and embrace a philosophy respectful of clients’ cultural and religious backgrounds, and sensitive to differences in race, ability, gender identity, and sexual orientation.

• Counseling and Psychological Services at University Park (CAPS) (http://studentaffairs.psu.edu/counseling): 814-863-0395

• Penn State Crisis Line (24 hours/7 days/week): 877-229-6400

• Crisis Text Line (24 hours/7 days/week): Text LIONS to 741741

Educational Equity

Penn State takes great pride to foster a diverse and inclusive environment for students, faculty, and staff. Consistent with University Policy AD29, students who believe they have experienced or observed a hate crime, an act of intolerance, discrimination, or harassment that occurs at Penn State are urged to report these incidents as outlined on the University’s Report Bias webpage (http://equity.psu.edu/reportbias).

Background Knowledge

Students will have a variety of backgrounds. Prior to beginning interdisciplinary coursework to fulfill Social Data Analytics degree requirements, including SoDA 501 and 502, students are expected to have advanced (graduate) training in at least one of the component areas of Social Data Analytics and a familiarity with basic concepts in the others.

With regard to specialization, students are expected to have advanced (graduate) training in ONE of the following:

• quantitative social science methodology and a discipline of social science (as would be the case for a second-year PhD student in Political Science, Sociology, Criminology, Human Development and Family Studies, or Demography), OR

• statistics (as would be the case for a second-year PhD student in Statistics), OR

• information science or informatics (as would be the case for a second-year PhD student in Information Science and Technology, or a second-year PhD student in Geography specializing in GIScience), OR

• computer science (as would be the case for a second-year PhD student in Computer Science and Engineering).

This requirement is met as a matter of meeting home program requirements for students in the dual-title PhD, but may require additional coursework on the part of students in other programs wishing to pursue the graduate minor.

With regard to general preparation, students are expected to have ALL of the following technical knowledge:

• basic programming skills, including basic facility with R &/or Python, AND

• basic knowledge of relational databases &/or geographic information systems, AND

• basic knowledge of probability, applied statistics, &/or social science research design, AND

• basic familiarity with a substantive or theoretical area of social science (e.g., 300-level coursework in political science, sociology, criminology, human development, psychology, economics, communication, anthropology, human geography, social informatics, or similar fields).

It is not unusual for students to have one or more gaps in this preparation at time of application to the SoDA program. Students should work with Social Data Analytics advisers to develop a plan for timely remediation of any deficiencies, which generally will not require formal coursework for students whose training and interests are otherwise appropriate for pursuit of the Social Data Analytics degree. Where possible, this will be addressed at time of application to the Social Data Analytics program.

To this end, some free training materials are linked in the reference section, and there are also abundant free, high-quality, self-paced, course-style training materials on these and related subjects available through edX (http://www.edx.org), Udacity (http://www.udacity.com), Codecademy (http://codecademy.com), and Lynda (http://www.lynda.psu).

SoDA 501 (Spring 2016)

• Bruce Desmarais (PLSC), “Learning in the Sunshine: Analysis of Local Government Email Corpora”

• Timothy Brick (HDFS), “Mapping and Manipulating Facial Expression”

• Qunying Huang (USC), “Social Media: An Emerging Data Source for Human Mobility Studies”

• Lingzhou Xue (STAT), “An Introduction to High-Dimensional Graphical Models”

• Ashton Verdery (SOC), “Sampling from Network Data”

• Sarah Battersby (Tableau), “Helping People See and Understand Spatial Data”

• Alexandra Slavkovic (STAT), “Statistical Privacy with Network Data”

• Lee Giles (IST), “Machine Learning for Scholarly Big Data”

Readings and References (updated Spring 2018)

We will discuss a relatively small subset of the readings listed here, and this will vary based on topics and readings discussed by visiting speakers and you yourselves. The remainder are provided here as curated references for more in-depth investigation in both the theory and practice related to the topic (with the latter heavily weighted toward resources in Python and R).

† Material that is, at last check, made available through the Penn State library. Most journal links should work if you are logged in to a Penn State machine or through the Penn State VPN. Some article archives and most books require additional authentication through webaccess. If links are broken, start directly from a search via https://www.libraries.psu.edu. Some books require installation of e-readers like Adobe Digital Editions. Links to lynda.com can be accessed through http://lynda.psu.edu.

‡ Material that is, at last check, legally provided for free. In some cases, these are the preprint versions of published material.

§ Material for which a legal selection is or will be provided through the class Box folder.

Big Data & Social Data Analytics

Overviews

• ‡[CompSocSci] David Lazer, Alex Pentland, Lada Adamic, Sinan Aral, Albert-Laszlo Barabasi, Devon Brewer, Nicholas Christakis, Noshir Contractor, James Fowler, Myron Gutmann, Tony Jebara, Gary King, Michael Macy, Deb Roy, and Marshall Van Alstyne. 2009. “Computational Social Science.” Science 323(5915): 721-3 + Supp. Feb 6. http://science.sciencemag.org/content/323/5915/721.full, https://gking.harvard.edu/files/gking/files/LazPenAda09.pdf

• ‡[NRCreport] National Research Council. 2013. Frontiers in Massive Data Analysis. National Academies Press. (Free w/ registration: http://www.nap.edu/catalog.php?record_id=18374). Ch. 1, “Introduction”; Ch. 2, “Massive Data in Science, Technology, Commerce, National Defense, Telecommunications, and other Endeavors.”

• ‡[MMDS] Jure Leskovec, Anand Rajaraman, and Jeff Ullman. 2014. Mining of Massive Datasets. Cambridge University Press. http://www.mmds.org (BETA version of Third Edition: http://i.stanford.edu/~ullman/mmdsn.html)

• §[BitByBit] Matthew J. Salganik. 2018 (Forthcoming). Bit by Bit: Social Research in the Digital Age. Princeton University Press. Ch. 1, “Introduction”; Ch. 2, “Observing Behavior.”

• DSHandbook-Py, Ch. 1, “Introduction: Becoming a Unicorn”; Ch. 2, “The Data Science Road Map.”

Burt-schtick

• ‡[Monroe-5Vs] Burt L. Monroe. 2013. “The Five Vs of Big Data Political Science: Introduction to the Special Issue on Big Data in Political Science.” Political Analysis 21(V5): 1–9. https://doi.org/10.1017/S1047198700014315 (Volume, Velocity, Variety, Vinculation, Validity)

• †[Monroe-No] Burt L. Monroe, Jennifer Pan, Margaret E. Roberts, Maya Sen, and Betsy Sinclair. 2015. “No! Formal Theory, Causal Inference, and Big Data Are Not Contradictory Trends in Political Science.” PS: Political Science & Politics 48(1): 71–4. http://dx.doi.org/10.1017/S1049096514001760

• †[Quinn-Topics] Kevin M. Quinn, Burt L. Monroe, Michael Colaresi, Michael H. Crespin, and Dragomir R. Radev. 2010. “How to Analyze Political Attention with Minimal Assumptions and Costs.” American Journal of Political Science 54(1): 209–28. http://onlinelibrary.wiley.com/doi/10.1111/j.1540-5907.2009.00427.x/full (esp. topic modeling as measurement, approach to validation)

• †[FightinWords] Burt L. Monroe, Michael Colaresi, and Kevin M. Quinn. 2008. “Fightin’ Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict.” Political Analysis 16(4): 372-403. https://doi.org/10.1093/pan/mpn018 (esp. the impact of sampling variance and regularization through priors)

• ‡[BDSS-Census] Big Data Social Science PSU Team. 2012. “A Closer Look at the Kaggle Census Data.” https://burtmonroe.github.io/BDSSKaggleCensus2012 (esp. the relevance of the social processes by which data come to exist as data)

Multidisciplinary Perspectives

• †[Business-BigData] Andrew McAfee and Erik Brynjolfsson. 2012. “Big Data: The Management Revolution.” Harvard Business Review 90(10): 61–8. October. https://hbr.org/2012/10/big-data-the-management-revolution; Thomas H. Davenport and D.J. Patil. 2012. “Data Scientist: The Sexiest Job of the 21st Century.” Harvard Business Review 90(10): 70-6. October. https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century

• †[InfoSci-BigData] C.L. Philip Chen and Chun-Yang Zhang. 2014. “Data-intensive Applications, Challenges, Techniques and Technologies: A Survey on Big Data.” Information Sciences 275: 314-47. https://doi.org/10.1016/j.ins.2014.01.015

• †[Informatics-BigData] Vasant G. Honavar. 2014. “The Promise and Potential of Big Data: A Case for Discovery Informatics.” Review of Policy Research 31(4): 326-330. https://doi.org/10.1111/ropr.12080

• ‡[Stats-BigData] Beate Franke, Jean-Francois Plante, Ribana Roscher, Annie Lee, Cathal Smyth, Armin Hatefi, Fuqi Chen, Einat Gil, Alexander Schwing, Alessandro Selvitella, Michael M. Hoffman, Roger Grosse, Dietrich Hendricks, and Nancy Reid. 2016. “Statistical Inference, Learning and Models in Big Data.” International Statistical Review 84(3): 371-89. http://onlinelibrary.wiley.com/doi/10.1111/insr.12176/full

• †[Econ-BigData] Hal R. Varian. 2014. “Big Data: New Tricks for Econometrics.” Journal of Economic Perspectives 28(2): 3-28. https://doi.org/10.1257/jep.28.2.3

• †[GeoViz-BigData] Alan MacEachren. 2017. “Leveraging Big (Geo) Data with (Geo) Visual Analytics: Place as the Next Frontier.” In Chenghu Zhou, Fenzhen Su, Francis Harvey, and Jun Xu, eds., Spatial Data Handling in the Big Data Era, pp. 139-155. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/chapter/10.1007/978-981-10-4424-3_10

• †[Soc-BigData] David Lazer and Jason Radford. 2017. “Data ex Machina: Introduction to Big Data.” Annual Review of Sociology 43: 19-39. https://doi.org/10.1146/annurev-soc-060116-053457

• §[Politics-BigData] Keith T. Poole, L. Jason Anastasopoulos, and James E. Monogan III. Forthcoming. “The ‘Big Data’ Revolution in Political Campaigning and Governance.” Oxford Bibliographies in Political Science.

¡Cuidado! Traps, Biases, Problems, Pains, Perils

• †[GoogleFlu] David Lazer, Ryan Kennedy, Gary King, and Alessandro Vespignani. 2014. “The Parable of Google Flu: Traps in Big Data Analysis.” Science (343). 14 March. http://gking.harvard.edu/files/gking/files/0314policyforumff.pdf

• ‡[MachineBias] ProPublica. “Machine Bias: Investigating Algorithmic Injustice.” Series. https://www.propublica.org/series/machine-bias. See especially:

Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. 2016. “Machine Bias.” ProPublica. May 23. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing

Julia Angwin, Madeleine Varner, and Ariana Tobin. 2017. “Facebook Enabled Advertisers to Reach ‘Jew Haters.’” https://www.propublica.org/article/facebook-enabled-advertisers-to-reach-jew-haters

• ‡[Polling2016] Doug Rivers (Nov 11, 2016). “First Thoughts on Polling Problems in the 2016 U.S. Elections.” https://today.yougov.com/news/2016/11/11/first-thoughts-polling-problems-2016-us-elections

• †[EventData] Wei Wang, Ryan Kennedy, David Lazer, Naren Ramakrishnan. 2016. “Growing Pains for Global Monitoring of Societal Events.” Science 353(6307), pp. 1502–1503. https://doi.org/10.1126/science.aaf6758

• ‡[GoogleBooks] Eitan Adam Pechenick, Christopher M. Danforth, and Peter Sheridan Dodds. 2015. “Characterizing the Google Books Corpus: Strong Limits to Inferences of Socio-Cultural and Linguistic Evolution.” PLoS One. https://doi.org/10.1371/journal.pone.0137041

• ‡[OkCupid] Michael Zimmer. 2016. “OkCupid Study Reveals the Perils of Big-Data Science.” Wired. https://www.wired.com/2016/05/okcupid-study-reveals-perils-big-data-science/. May 14.

• †[CriticalQuestions] danah boyd and Kate Crawford. 2012. “Critical Questions for Big Data: Provocations for a Cultural, Technological, and Scholarly Phenomenon.” Information, Communication & Society 15(5): 662–79. http://dx.doi.org/10.1080/1369118X.2012.678878

• ‡[RacistBot] Daniel Victor. 2016. “Microsoft created a Twitter bot to learn from users. It quickly became a racist jerk.” New York Times. https://www.nytimes.com/2016/03/25/technology/microsoft-created-a-twitter-bot-to-learn-from-users-it-quickly-became-a-racist-jerk.html

• MMDS, 1.2.2-1.2.3, on Bonferroni.

• Also BDSS-Census.

Research Design and Measurement

Overviews

• BitByBit.

• ‡[ResearchMethodsKB] William M. Trochim. 2006. The Research Methods Knowledge Base. http://www.socialresearchmethods.net/kb/

• [7Rules] Glenn Firebaugh. 2008. Seven Rules for Social Research. Princeton University Press.

Measurement Reliability and Validity

• ResearchMethodsKB, “Measurement.”

• 7Rules, Ch. 3, “Build Reality Checks into Your Research.”

• Reliability: see also FightinWords.

• Validity: see also Quinn-Topics, Monroe-5Vs.

Indirect, Unobtrusive, Nonreactive Measures; Data Exhaust

• †[UnobtrusiveMeasures] Raymond M. Lee. 2015. “Unobtrusive Measures.” Oxford Bibliographies. https://doi.org/10.1093/OBO/9780199846740-0048 (Canonical cite is Eugene J. Webb. 1966. Unobtrusive Measures, or Webb, Donald T. Campbell, Richard D. Schwartz, Lee Sechrest. 1999. Unobtrusive Measures, rev. ed. Sage.)

• BitByBit, examples in Chapter 2.

Multiple Measures, Latent Variable Measurement

• 7Rules, Ch. 4, “Replicate Where Possible.”

• †[LatentVariables] David J. Bartholomew, Martin Knott, and Irini Moustaki. 2011. Latent Variable Models and Factor Analysis: A Unified Approach. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10483308 (esp. Ch. 1, “Basic ideas and examples”)

• †[Multivariate-R] Brian Everitt and Torsten Hothorn. 2011. An Introduction to Applied Multivariate Analysis with R. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4419-9650-3

• Shalizi-ADA, Ch. 17, “Factor Models.”

• ‡[NetflixPrize] Edwin Chen. 2011. “Winning the Netflix Prize: A Summary.”

• ‡CRAN: https://cran.r-project.org/web/views/Multivariate.html, https://cran.r-project.org/web/views/Psychometrics.html, https://cran.r-project.org/web/views/Cluster.html

Sampling and Survey Design

• †[Sampling] Steven K. Thompson. 2012. Sampling, 3rd ed. http://sk8es4mc2l.search.serialssolutions.com/?sid=sersol&SS_jc=TC_024492330&title=Wiley%20Desktop%20Editions%3A%20Sampling

• ResearchMethodsKB, “Sampling.”

• NRCreport, Ch. 8, “Sampling and Massive Data.”

• BitByBit, Ch. 3, “Asking Questions.”

• ‡[MSE] Daniel Manrique-Vallier, Megan E. Price, and Anita Gohdes. 2013. “Multiple Systems Estimation Techniques for Estimating Casualties in Armed Conflicts.” In Seybolt et al. (eds.), Counting Civilians. Preprint: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.469.939&rep=rep1&type=pdf

• †[NetworkSampling] Ted Mouw and Ashton M. Verdery. 2012. “Network Sampling with Memory: A Proposal for More Efficient Sampling from Social Networks.” Sociological Methodology 42(1): 206–56. https://dx.doi.org/10.1177%2F0081175012461248

• See also FightinWords (re: hidden heteroskedasticity in sample variance).

• ‡[HashDontSample] Mudit Uppal. 2016. “Probabilistic data structures in the Big data world (+ code)” (re: “Hash, don’t sample”). https://medium.com/@muppal/probabilistic-data-structures-in-the-big-data-world-code-b9387cff0c55

• MMDS: Hash Functions (1.3.2), Sampling in streams (4.2).

• ‡CRAN: https://cran.r-project.org/web/views/OfficialStatistics.html (“Complex Survey Design,” “Small Area Estimation”)

Experimental and Observational Designs for Causal Inference

• ‡[CausalInference] Miguel A. Hernan, James M. Robins. Forthcoming (2017). Causal Inference. Chapman & Hall/CRC. Preprint: https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/. Ch. 1, “A definition of causal effect”; Ch. 2, “Randomized experiments”; Ch. 3, “Observational studies.” (See Ch. 6, “Graphical representation of causal effects,” for integration with Judea Pearl approach.)

• BitByBit, Chapter 4, “Running Experiments”; Ch. 2, natural experiments in observable data examples.

• ResearchMethodsKB, “Design.”

• 7Rules, Ch. 2, “Look for Differences that Make a Difference, and Report Them”; §Ch. 5, “Compare Like with Like”; Ch. 6, “Use Panel Data to Study Individual Change and Repeated Cross-Section Data to Study Social Change.”

• Shalizi-ADA, Part IV, “Causal Inference.”

• Monroe-No.

• CRAN: https://cran.r-project.org/web/views/ExperimentalDesign.html

Technologies for Primary and Secondary Data Collection

Mobile Devices, Distributed Sensors, Wearable Sensors, Remote Sensing

• †[RealityMining] Nathan Eagle and Alex (Sandy) Pentland. 2006. “Reality Mining: Sensing Complex Social Systems.” Personal and Ubiquitous Computing 10(4): 255–68. https://doi.org/10.1007/s00779-005-0046-3

• †[QuantifiedSelf] Melanie Swan. 2013. “The Quantified Self: Fundamental Disruption in Big Data Science and Biological Discovery.” Big Data 1(2): 85–99. https://doi.org/10.1089/big.2012.0002

• †[SensorData] Charu C. Aggarwal (Ed.). 2013. Managing and Mining Sensor Data. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-6309-2. Ch. 1, “An Introduction to Sensor Data Analytics.”

• †[NightLights] Thushyanthan Baskaran, Brian Min, Yogesh Uppal. 2015. “Election cycles and electricity provision: Evidence from a quasi-experiment with Indian special elections.” Journal of Public Economics 126: 64-73. https://doi.org/10.1016/j.jpubeco.2015.03.011

Crowdsourcing, Human Computation, Citizen Science, Web Experiments

• †[Crowdsourcing] Daren C. Brabham. 2013. Crowdsourcing. MIT Press. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10692208. “Introduction”; Ch. 1, “Concepts, Theories, and Cases of Crowdsourcing.”

• BitByBit, Ch. 5, “Creating Mass Collaboration.”

• NRCreport, Ch. 9, “Human Interaction with Data.”

• †[MTurk] Krista Casler, Lydia Bickel, and Elizabeth Hackett. 2013. “Separate but Equal? A Comparison of Participants and Data Gathered via Amazon’s MTurk, Social Media, and Face-to-Face Behavioral Testing.” Computers in Human Behavior 29(6): 2156–60. http://doi.org/10.1016/j.chb.2013.05.009

• †[LabintheWild] Katharina Reinecke and Krzysztof Z. Gajos. 2015. “LabintheWild: Conducting Large-Scale Online Experiments With Uncompensated Samples.” In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing (CSCW ’15), 1364–1378. http://dx.doi.org/10.1145/2675133.2675246

• †[TweetmentEffects] Kevin Munger. 2017. “Tweetment Effects on the Tweeted: Experimentally Reducing Racist Harassment.” Political Behavior 39(3): 629–49. https://doi.org/10.1007/s11109-016-9373-5

• †[HumanComputation] Edith Law and Luis von Ahn. 2011. Human Computation. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdf/10.2200/S00371ED1V01Y201107AIM013

Open Data, File Formats, APIs, Semantic Web, Linked Data

• DSHandbook-Py, Ch. 12, “Data Encodings and File Formats.”

• ‡[OpenData] Open Data Institute. “What Is Open Data?” https://theodi.org/what-is-open-data

• ‡[OpenDataHandbook] Open Knowledge International. The Open Data Handbook. http://opendatahandbook.org. Includes appendix “File Formats”: http://opendatahandbook.org/guide/en/appendices/file-formats/

• ‡[APIs] Brian Cooksey. 2016. An Introduction to APIs. https://zapier.com/learn/apis/

• ‡[APIMarkets] RapidAPI / mashape API marketplaces: https://docs.rapidapi.com, https://market.mashape.com; ProgrammableWeb: https://www.programmableweb.com

• †[LinkedData] Tom Heath and Christian Bizer. 2011. Linked Data: Evolving the Web into a Global Data Space. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/abs/10.2200/S00334ED1V01Y201102WBE001. Ch. 1, “Introduction”; Ch. 2, “Principles of Linked Data.”

• †[SemanticWeb] Nikolaos Konstantinou and Dimitrios-Emmanuel Spanos. 2015. Materializing the Web of Linked Data. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-16074-0. Ch. 1, “Introduction: Linked Data and the Semantic Web.”

• †Morgan & Claypool Synthesis Lectures on the Semantic Web: Theory and Technology. http://www.morganclaypool.com/toc/wbe/1/1

Web Scraping

• †[TheoryDrivenScraping] Richard N. Landers, Robert C. Brusso, Katelyn J. Cavanaugh, and Andrew B. Collmus. 2016. “A Primer on Theory-Driven Web Scraping: Automatic Extraction of Big Data from the Internet for Use in Psychological Research.” Psychological Methods 21(4): 475–492. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1037/met0000081

• ‡[Scraping-Py] Al Sweigart. 2015. Automate the Boring Stuff with Python: Practical Programming for Total Beginners. Ch. 11, “Web Scraping.” https://automatetheboringstuff.com/chapter11/

• †[Scraping-R] Simon Munzert, Christian Rubba, Peter Meißner, and Dominic Nyhuis. 2014. Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining. Wiley. http://onlinelibrary.wiley.com/book/10.1002/9781118834732

• ‡CRAN: https://cran.r-project.org/web/views/WebTechnologies.html

Ethics & Scientific Responsibility in Big Social Data

Human Subjects, Consent, Privacy

• BitByBit, Ch. 6 (“Ethics”).

• ‡[PSU-ORP] Penn State Office for Research Protections (under the VP for Research):

Human Subjects Research / IRB: https://www.research.psu.edu/irb

Revised Common Rule: https://www.research.psu.edu/irb/commonrulechanges

Responsible Conduct of Research: https://www.research.psu.edu/education/rcr

Research Misconduct: https://www.research.psu.edu/researchmisconduct

SARI @ PSU (Scientific and Research Integrity Training): https://www.research.psu.edu/training/sari

• ‡[MenloReport] David Dittrich and Erin Kenneally (Center for Applied Internet Data Analysis). 2012. “The Menlo Report: Ethical Principles Guiding Information and Communication Technology Research” and companion report “Applying Ethical Principles ...” https://www.caida.org/publications/papers/2012/menlo_report_actual_formatted/

• ‡[BigDataEthics-CBDES] Jacob Metcalf, Emily F. Keller, and danah boyd. 2016. “Perspectives on Big Data, Ethics, and Society.” Council for Big Data, Ethics, and Society. http://bdes.datasociety.net/council-output/perspectives-on-big-data-ethics-and-society/

• ‡[AoIRReport] Annette Markham and Elizabeth Buchanan (Association of Internet Researchers). 2012. “Ethical Decision-Making and Internet Research: Recommendations of the AoIR Ethics Working Committee (Version 2.0).” https://aoir.org/reports/ethics2.pdf and guidelines chart https://aoir.org/wp-content/uploads/2017/01/aoir_ethics_graphic_2016.pdf

• ‡[BigDataEthics-Wired] Sarah Zhang. 2016. “Scientists are Just as Confused about the Ethics of Big-Data Research as You.” Wired. https://www.wired.com/2016/05/scientists-just-confused-ethics-big-data-research/

• †[BigDataEthics-HerschelMori] Richard Herschel and Virginia M. Mori. 2017. “Ethics & Big Data.” Technology in Society 49: 31-36. http://doi.org/10.1016/j.techsoc.2017.03.003

The Science of Data Privacy

• †[BigDataPrivacy] Terence Craig and Mary E. Ludloff. 2011. Privacy and Big Data. O’Reilly Media. http://pensu.eblib.com/patron/FullRecord.aspx?p=781814

• ‡[DataAnalysisPrivacy] John Abowd, Lorenzo Alvisi, Cynthia Dwork, Sampath Kannan, Ashwin Machanavajjhala, Jerome Reiter. 2017. “Privacy-Preserving Data Analysis for the Federal Statistical Agencies.” A Computing Community Consortium white paper. https://arxiv.org/abs/1701.00752

• †[DataPrivacy] Stephen E. Fienberg and Aleksandra B. Slavkovic. 2011. “Data Privacy and Confidentiality.” International Encyclopedia of Statistical Science: 342–5. http://doi.org/10.1007/978-3-642-04898-2_202

• ‡[NetworksPrivacy] Vishesh Karwa and Aleksandra Slavkovic. 2016. “Inference using noisy degrees: Differentially private β-model and synthetic graphs.” Annals of Statistics 44(1): 87-112. http://projecteuclid.org/euclid.aos/1449755958

• †[DataPublishingPrivacy] Raymond Chi-Wing Wong and Ada Wai-Chee Fu. 2010. Privacy-Preserving Data Publishing: An Overview. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdfplus/10.2200/S00237ED1V01Y201003DTM002

• DataMatching, Ch. 8, “Privacy Aspects of Data Matching.”

Transparency, Reproducibility, and Team Science

• ‡[10RulesforData] Alyssa Goodman, Alberto Pepe, Alexander W. Blocker, Christine L. Borgman, Kyle Cranmer, Merce Crosas, Rosanne Di Stefano, Yolanda Gil, Paul Groth, Margaret Hedstrom, David W. Hogg, Vinay Kashyap, Ashish Mahabal, Aneta Siemiginowska, and Aleksandra Slavkovic (2014). “Ten Simple Rules for the Care and Feeding of Scientific Data.” PLoS Computational Biology 10(4): e1003542. https://doi.org/10.1371/journal.pcbi.1003542 (Note esp. curated resources for reproducible research.)

• †[Transparency] E. Miguel, C. Camerer, K. Casey, J. Cohen, K. M. Esterling, A. Gerber, R. Glennerster, D. P. Green, M. Humphreys, G. Imbens, D. Laitin, T. Madon, L. Nelson, B. A. Nosek, M. Petersen, R. Sedlmayr, J. P. Simmons, U. Simonsohn, M. Van der Laan. 2014. “Promoting Transparency in Social Science Research.” Science 343(6166): 30–1. https://doi.org/10.1126/science.1245317

• †[Reproducibility] Marcus R. Munafo, Brian A. Nosek, Dorothy V. M. Bishop, Katherine S. Button, Christopher D. Chambers, Nathalie Percie du Sert, Uri Simonsohn, Eric-Jan Wagenmakers, Jennifer J. Ware, and John P. A. Ioannidis. 2017. “A Manifesto for Reproducible Science.” Nature Human Behavior 0021 (2017). https://doi.org/10.1038/s41562-016-0021

• ‡[SoftwareCarpentry] esp. “Lessons”: http://software-carpentry.org/lessons/

• DSHandbook-Py, Ch. 9, “Technical Communication and Documentation”; Ch. 15, “Software Engineering Best Practices.”

• †[TeamScienceToolkit] Vogel AL, Hall KL, Fiore SM, Klein JT, Bennett LM, Gadlin H, Stokols D, Nebeling LC, Wuchty S, Patrick K, Spotts EL, Pohl C, Riley WT, Falk-Krzesinski HJ. 2013. “The Team Science Toolkit: enhancing research collaboration through online knowledge sharing.” American Journal of Preventive Medicine 45: 787-9. https://doi.org/10.1016/j.amepre.2013.09.001

• §[Databrary] Kara Hall, Robert Croyle, and Amanda Vogel. Forthcoming (2017). Advancing Social and Behavioral Health Research through Cross-disciplinary Team Science. Springer. Includes Rick O. Gilmore and Karen E. Adolph, “Open Sharing of Research Video: Breaking the Boundaries of the Research Team.” (See http://databrary.org.)

• CRAN: https://cran.r-project.org/web/views/ReproducibleResearch.html

Social Bias, Fair Algorithms

• MachineBias.

• ‡[EmbeddingsBias] Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. “Semantics Derived Automatically from Language Corpora Contain Human Biases.” Science. https://arxiv.org/abs/1608.07187

• ‡[Debiasing] Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, Adam Kalai. 2016. “Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings.” https://arxiv.org/abs/1607.06520

• ‡[AvoidingBias] Moritz Hardt, Eric Price, Nathan Srebro. 2016. “Equality of Opportunity in Supervised Learning.” https://arxiv.org/abs/1610.02413

• ‡[InevitableBias] Jon Kleinberg, Sendhil Mullainathan, Manish Raghavan. 2016. “Inherent Trade-Offs in the Fair Determination of Risk Scores.” https://arxiv.org/abs/1609.05807

Databases and Data Management

• NRCreport, Ch. 3, “Scaling the Infrastructure for Data Management.”

• †[SQL] Jan L. Harrington. 2010. SQL Clearly Explained. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780123756978

• DSHandbook-Py, Ch. 14, “Databases.”

• †[noSQL] Guy Harrison. 2015. Next Generation Databases: NoSQL, NewSQL, and Big Data. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-1329-2

• †[Cloud] Divyakant Agrawal, Sudipto Das, and Amr El Abbadi. 2012. Data Management in the Cloud: Challenges and Opportunities. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdfplus/10.2200/S00456ED1V01Y201211DTM032

• †Morgan & Claypool Synthesis Lectures on Data Management. http://www.morganclaypool.com/toc/dtm/1/1

Data Wrangling

Theoretically-structured approaches to data wrangling

• ‡[TidyData-R] Garrett Grolemund and Hadley Wickham. 2017. R for Data Science (http://r4ds.had.co.nz), esp. Ch. 12, “Tidy Data”; also Wickham. 2014. “Tidy Data.” Journal of Statistical Software 59(10). http://www.jstatsoft.org/v59/i10/paper. Tools: The tidyverse, http://tidyverse.org

• ‡[DataScience-Py] Jake VanderPlas. 2016. Python Data Science Handbook (https://github.com/jakevdp/PythonDataScienceHandbook), esp. Ch. 3 on pandas, http://pandas.pydata.org

• ‡[DataCarpentry] Colin Gillespie and Robin Lovelace. 2017. Efficient R Programming. https://csgillespie.github.io/efficientR/, esp. Ch. 6 on “efficient data carpentry.”

• §[Wrangler] Joseph M. Hellerstein, Jeffrey Heer, Tye Rattenbury, and Sean Kandel. 2017. Data Wrangling: Practical Techniques for Data Preparation. Tool: Trifacta Wrangler, http://www.trifacta.com/products/wrangler/

Data wrangling practice

• DSHandbook-Py, Ch. 4, “Data Munging: String Manipulation, Regular Expressions, and Data Cleaning.”

• †[DataSimplification] Jules J. Berman. 2016. Data Simplification: Taming Information with Open Source Tools. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780128037812

• †[Wrangling-Py] Jacqueline Kazil, Katharine Jarmul. 2016. Data Wrangling with Python. O’Reilly. http://proquestcombo.safaribooksonline.com.ezaccess.libraries.psu.edu/9781491948804

• †[Wrangling-R] Bradley C. Boehmke. 2016. Data Wrangling with R. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45599-0

Record Linkage, Entity Resolution, Deduplication

• †[DataMatching] Peter Christen. 2012. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-642-31164-2 (esp. Ch. 2, “The Data Matching Process”)

• DataSimplification, Chapter 5, “Identifying and Deidentifying Data.”

• ‡[EntityResolution] Lise Getoor and Ashwin Machanavajjhala. 2013. “Entity Resolution for Big Data.” Tutorial, KDD. http://www.umiacs.umd.edu/~getoor/Tutorials/ER_KDD2013.pdf

• ‡[SyrianCasualties] Peter Sadosky, Anshumali Shrivastava, Megan Price, and Rebecca C. Steorts. 2015. “Blocking Methods Applied to Casualty Records from the Syrian Conflict.” https://arxiv.org/abs/1510.07714 (For more on blocking, see DataMatching, Ch. 4, “Indexing.”)

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html, “Statistical Matching and Record Linkage.”

“Making up data”: Imputation, Smoothers, Kernels, Priors, Filters, Teleportation, Negative Sampling, Convolution, Augmentation, Adversarial Training

• †[Imputation] Yi Deng, Changgee Chang, Moges Seyoum Ido, and Qi Long. 2016. “Multiple Imputation for General Missing Data Patterns in the Presence of High-dimensional Data.” Scientific Reports 6(21689). https://www.nature.com/articles/srep21689

• ‡[Overimputation] Matthew Blackwell, James Honaker, and Gary King. 2017. “A Unified Approach to Measurement Error and Missing Data: Overview and Applications.” Sociological Methods & Research 46(3): 303-341. http://gking.harvard.edu/files/gking/files/measure.pdf

• Shalizi-ADA, Sect. 1.5 (Linear Smoothers), Ch. 8 (Splines), Sect. 14.4 (Kernel Density Estimates); DataScience-Py, NB 05.13, “Kernel Density Estimation.”

• FightinWords (re: Bayesian priors as additional data, impact of priors on regularization).

• DeepLearning, Section 7.5 (Data Augmentation), 7.12 (Dropout), 7.13 (Adversarial Examples), Chapter 9 (Convolutional Networks).

• †[Adversarial] Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy. 2015. “Explaining and Harnessing Adversarial Examples.” https://arxiv.org/abs/1412.6572

• See also resampling and simulation methods.

• See also feature engineering, preprocessing.

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html, “Imputation.”

(Direct) Data Representations, Data Mappings

bull NRCreport Ch 5 ldquoLarge-Scale Data Representationsrdquo

bull dagger[PatternRecognition] M Narasimha Murty and V Susheela Devi 2011 Pattern RecognitionAn Algorithmic Approach Springer httpslink-springer-comezaccesslibrariespsuedu

book1010072F978-0-85729-495-1 Section 21 ldquoData Structures for Pattern Representationrdquo

bull Dagger[InfoRetrieval] Christopher D Manning Prabhakar Raghavan and Hinrich Schutze 2009 Intro-duction to Information Retrieval Cambridge University Press httpnlpstanfordeduIR-bookCh 1 2 6 (also note slides used in their class)

bull dagger[Algorithms] Brian Steele John Chandler Swarna Reddy 2016 Algorithms for Data ScienceWiley httplinkspringercomezaccesslibrariespsuedubook1010072F978-3-319-

45797-0 Ch 2 ldquoData Mapping and Data Dictionariesrdquo

bull See also Social Data Structures

Similarity, Distance, Association, Covariance, The Kernel Trick

• PatternRecognition: Section 2.3, “Proximity Measures”

• ‡[Similarity] Brendan O’Connor. 2012. “Cosine similarity, Pearson correlation, and OLS coefficients.” https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients/

• For relatively comprehensive lists, see also:

M.-J. Lesot, M. Rifqi, and H. Benhadda. 2009. “Similarity measures for binary and numerical data: a survey.” Int. J. Knowledge Engineering and Soft Data Paradigms 1(1): 63-. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10112126533&rep=rep1&type=pdf

Seung-Seok Choi, Sung-Hyuk Cha, Charles C. Tappert. 2010. “A Survey of Binary Similarity and Distance Measures.” Journal of Systemics, Cybernetics & Informatics 8(1): 43-8. http://www.iiisci.org/Journal/CV$/sci/pdfs/GS315JG.pdf

Sung-Hyuk Cha. 2007. “Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions.” International Journal of Mathematical Models and Methods in Applied Sciences 4(1): 300-7. http://csis.pace.edu/ctappert/dps/d861-12/session4-p2.pdf

Anna Huang. 2008. “Similarity Measures for Text Document Clustering.” Proceedings of the 6th New Zealand Computer Science Research Student Conference, 49-56. http://www.nzcsrsc08.canterbury.ac.nz/site/proceedings/Individual_Papers/pg049_Similarity_Measures_for_Text_Document_Clustering.pdf

• Michael I. Jordan. “The Kernel Trick” (Lecture Notes). https://people.eecs.berkeley.edu/~jordan/courses/281B-spring04/lectures/lec3.pdf

• Eric Kim. 2017. “Everything You Ever Wanted to Know about the Kernel Trick (But Were Afraid to Ask).” http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick_blog_ekim_12_20_2017.pdf
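The relationship named in the Similarity entry above can be checked numerically: Pearson correlation is cosine similarity applied to mean-centered vectors. A minimal sketch using only the standard library follows; the vectors are made-up illustrative values and the function names are not taken from any of the readings.

```python
import math


def cosine(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def center(x):
    """Subtract the mean from every element."""
    m = sum(x) / len(x)
    return [a - m for a in x]


def pearson(x, y):
    """Pearson correlation, computed as the cosine of the centered vectors."""
    return cosine(center(x), center(y))


x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 5.0, 9.0]
shifted = [b + 100.0 for b in y]  # same shape as y, different location

# Cosine similarity changes when a constant is added to y;
# Pearson correlation does not, because centering removes the shift.
cos_change = abs(cosine(x, y) - cosine(x, shifted))
pearson_change = abs(pearson(x, y) - pearson(x, shifted))
```

Which of the two is appropriate depends on whether the origin is meaningful for the data at hand, for example raw term counts versus scores on an arbitrary scale.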

Derived Data Representations - Dimensionality Reduction, Compression, Decomposition, Embeddings

The groupings here and under the measurement / multivariate statistics section are particularly arbitrary. For example, “k-Means clustering” can be viewed as a technique for “dimensionality reduction,” “compression,” “feature extraction,” “latent variable measurement,” “unsupervised learning,” or “collaborative filtering.”
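To make the point about k-Means concrete, the sketch below runs a bare-bones Lloyd's algorithm in plain Python and then reads the same output in three of the ways listed above. It is an illustration under simplifying assumptions: the toy points are invented, the initial centers are chosen by a naive deterministic rule, and all function names are hypothetical.

```python
def sqdist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(p, centers):
    """Index of the center closest to point p."""
    return min(range(len(centers)), key=lambda j: sqdist(p, centers[j]))


def kmeans(points, k, iters=20):
    """Lloyd's algorithm with a deliberately naive, deterministic start."""
    # Assumption: evenly spaced points serve as initial centers; practical
    # code would use random restarts or a k-means++ style initialization.
    centers = [points[i * len(points) // k] for i in range(k)]
    for _ in range(iters):
        labels = [nearest(p, centers) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, [nearest(p, centers) for p in points]


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = kmeans(points, k=2)

# Clustering / unsupervised learning: the labels are the result.
# Compression (vector quantization): store the small codebook `centers`
# plus one integer per point, and decode each point as its center.
reconstructed = [centers[lab] for lab in labels]

# Feature extraction: re-describe each point by its distance to each center.
features = [[sqdist(p, c) ** 0.5 for c in centers] for p in points]
```

The algorithm is identical in all three readings; only the part of the output that is kept, and the question it is taken to answer, changes.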

Clustering, hashing, quantization, blocking, compression

• ‡[Compression] Khalid Sayood. 2012. Introduction to Data Compression, 4th ed. Springer. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780124157965 (e.g., coding, blocking via vector quantization)

• MMDS: Ch. 3, “Finding Similar Items” (minhashing, locality sensitive hashing)

• Multivariate-R: Ch. 6, “Clustering”

• DataScience-Python: NB 05.11, “k-Means Clustering”; NB 05.12, “Gaussian Mixture Models”

• ‡[KMeansHashing] Kaiming He, Fang Wen, Jian Sun. 2013. “K-means Hashing: An Affinity-Preserving Quantization Method for Learning Binary Compact Codes.” CVPR. https://www.cv-foundation.org/openaccess/content_cvpr_2013/papers/He_K-Means_Hashing_An_2013_CVPR_paper.pdf

• †[Squashing] Madigan, D., Raghavan, N., Dumouchel, W., Nason, M., Posse, C., and Ridgeway, G. (2002). “Likelihood-based data squashing: A modeling approach to instance construction.” Data Mining and Knowledge Discovery 6(2): 173-190. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1023/A:1014095614948

• †[Core-sets] Piotr Indyk, Sepideh Mahabadi, Mohammad Mahdian, Vahab S. Mirrokni. 2014. “Composable core-sets for diversity and coverage maximization.” PODS ’14, 100-8. https://doi.org/10.1145/2594538.2594560

Feature selection, feature extraction, feature engineering, weighting, preprocessing

• PatternRecognition: Sections 2.6-7, “Feature Selection, Feature Extraction”

• DSHandbook-Py: Ch. 7, “Interlude: Feature Extraction Ideas”

• DataScience-Python: NB 05.04, “Feature Engineering”

• Features in text: NLP and InfoRetrieval re: tfidf and similar; FightinWords

• Features in images: DataScience-Python NB 05.14, “Image Features”

Dimensionality reduction, decomposition, factorization, change of basis, reparameterization, matrix completion, latent variables, source separation

• MMDS: Ch. 9, “Recommendation Systems”; Ch. 11, “Dimensionality Reduction”

• DSHandbook-Py: Ch. 10, “Unsupervised Learning: Clustering and Dimensionality Reduction”

• ‡[Shalizi-ADA] Cosma Rohilla Shalizi. 2017. Advanced Data Analysis from an Elementary Point of View. Ch. 16 (“Principal Components Analysis”), Ch. 17 (“Factor Models”). http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/

See also Multivariate-R: Ch. 3, “Principal Components Analysis”; Ch. 4, “Multidimensional Scaling”; Ch. 5, “Exploratory Factor Analysis”; Latent

• DeepLearning: Ch. 2, “Linear Algebra”; Ch. 13, “Linear Factor Models”

• ‡[GloVe] Jeffrey Pennington, Richard Socher, Christopher Manning. 2014. “GloVe: Global vectors for word representation.” EMNLP. https://nlp.stanford.edu/projects/glove/

• †[NMF] Daniel D. Lee and H. Sebastian Seung. 1999. “Learning the parts of objects by non-negative matrix factorization.” Nature 401: 788-791. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1038/44565

• †[CUR] Michael W. Mahoney and Petros Drineas. 2009. “CUR matrix decompositions for improved data analysis.” PNAS. http://www.pnas.org/content/106/3/697.full

• ‡[ICA] Aapo Hyvarinen and Erkki Oja. 2000. “Independent Component Analysis: Algorithms and Applications.” Neural Networks 13(4-5): 411-30. https://www.cs.helsinki.fi/u/ahyvarin/papers/NN00new.pdf

• [RandomProjection] Ella Bingham and Heikki Mannila. 2001. “Random projection in dimensionality reduction: applications to image and text data.” KDD. https://doi.org/10.1145%2F502512.502546

• Compression, e.g., “Transform coding,” “Wavelets”

Nonlinear dimensionality reduction, Manifold learning

• DeepLearning: Section 5.11.3, “Manifold Learning”

• DataScience-Python: NB 05.10, “Manifold Learning” (Locally linear embedding [LLE], Isomap)

• ‡[KernelPCA] Sebastian Raschka. 2014. “Kernel tricks and nonlinear dimensionality reduction via RBF kernel PCA.” http://sebastianraschka.com/Articles/2014_kernel_pca.html

• Shalizi-ADA: Ch. 18, “Nonlinear Dimensionality Reduction” (LLE)

• ‡[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi. 2003. “Laplacian eigenmaps for dimensionality reduction and data representation.” Neural Computation 15(6): 1373–1396. http://web.cse.ohio-state.edu/~belkin.8/papers/LEM_NC_03.pdf

• DeepLearning: Ch. 14 (“Autoencoders”)

• ‡[word2vec] Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” https://arxiv.org/abs/1301.3781

• ‡[word2vecExplained] Yoav Goldberg and Omer Levy. 2014. “word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method.” https://arxiv.org/abs/1402.3722

• ‡[t-SNE] Laurens van der Maaten and Geoffrey Hinton. 2008. “Visualizing Data using t-SNE.” Journal of Machine Learning Research 9: 2579-2605. http://jmlr.csail.mit.edu/papers/volume9/vandermaaten08a/vandermaaten08a.pdf

Computation and Scaling Up

Scientific computing, computation at scale

• NRCreport: Ch. 10, “The Seven Computational Giants of Massive Data Analysis”

• DSHandbook-Py: Ch. 21, “Performance and Computer Memory”; Ch. 22, “Computer Memory and Data Structures”

• ‡[ComputationalStatistics-Python] Cliburn Chan. Computational Statistics in Python. https://people.duke.edu/~ccc14/sta-663/index.html

• CRAN Task View: High Performance Computing. https://cran.r-project.org/web/views/HighPerformanceComputing.html

Numerical computing

• [NumericalComputing] Ward Cheney and David Kincaid. 2013. Numerical Mathematics and Computing, 7th ed. Brooks/Cole: Cengage Learning.

• CRAN Task View: Numerical Mathematics. https://cran.r-project.org/web/views/NumericalMathematics.html

Optimization (e.g., MLE, gradient descent, stochastic gradient descent, EM algorithm, neural nets)

• DSHandbook-Py: Ch. 23, “Maximum Likelihood Estimation and Optimization” (gradient descent)

• DeepLearning: Ch. 4, “Numerical Computation”; Ch. 6, “Deep Feedforward Networks”; Ch. 8, “Optimization for Training Deep Models”

• CRAN Task View: Optimization and Mathematical Programming. https://cran.r-project.org/web/views/Optimization.html

Linear algebra, matrix computations

• [MatrixComputations] Gene H. Golub and Charles F. Van Loan. 2013. Matrix Computations, 4th ed. Johns Hopkins University Press.

• ‡[NetflixMatrix] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. “Matrix Factorization Techniques for Recommender Systems.” Computer, August: 42-9. https://datajobs.com/data-science-repo/Recommender-Systems-[Netflix].pdf

• ‡[BigDataPCA] Jianqing Fan, Qiang Sun, Wen-Xin Zhou, Ziwei Zhu. “Principal Component Analysis for Big Data.” http://www.princeton.edu/~ziweiz/pca.pdf

• ‡[RandomSVD] Andrew Tulloch. 2009. “Fast Randomized SVD.” https://research.fb.com/fast-randomized-svd/

• ‡[FactorSGD] Rainer Gemulla, Peter J. Haas, Erik Nijkamp, and Yannis Sismanis. 2011. “Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent.” KDD. http://www.cs.utah.edu/~hari/teaching/bigdata/gemulla11dsgd.pdf

• ‡[Sparse] Max Grossman. 2015. “101 Ways to Store a Sparse Matrix.” https://medium.com/@jmaxg3/101-ways-to-store-a-sparse-matrix-c7f2bf15a229

• ‡[DontInvert] John D. Cook. 2010. “Don’t Invert that Matrix.” https://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix/

Simulation-based inference, resampling, Monte Carlo methods, MCMC, Bayes, approximate inference

• ‡[MarkovVisually] Victor Powell. “Markov Chains Explained Visually.” http://setosa.io/ev/markov-chains/

• DSHandbook-Py: Ch. 25, “Stochastic Modeling” (Markov chains, MCMC, HMM)

• Bayes, ComputationalStatistics-Python

• DeepLearning: Ch. 17, “Monte Carlo Methods”; Ch. 19, “Approximate Inference”

• ‡[VariationalInference] Jason Eisner. 2011. “High-Level Explanation of Variational Inference.” https://www.cs.jhu.edu/~jason/tutorials/variational.html

• CRAN Task View: Bayesian Inference. https://cran.r-project.org/web/views/Bayesian.html

Parallelism, MapReduce, Split-Apply-Combine

• ‡[MapReduceIntuition] Jigsaw Academy. 2014. “Big Data Specialist: MapReduce.” https://www.youtube.com/watch?v=TwcYQzFqg-8&feature=youtube

• MMDS: Ch. 2, “Map-Reduce and the New Software Stack” (3rd Edition discusses Spark & TensorFlow)

• Algorithms: Ch. 3, “Scalable Algorithms and Associative Statistics”; Ch. 4, “Hadoop and MapReduce”

• NRCreport: Ch. 6, “Resources, Trade-offs, and Limitations”

• TidyData-R (Split-apply-combine is the motivating principle behind the “tidyverse” approach). See also Part III, “Program” (pipes, functions, vectors, iteration).

• ‡[TidySAC-Video] Hadley Wickham. 2017. “Data Science in the Tidyverse.” https://www.rstudio.com/resources/videos/data-science-in-the-tidyverse/

• See also FactorSGD.
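As a rough illustration of the split-apply-combine and MapReduce patterns covered by the readings above, the sketch below computes group means two ways in plain Python. The records and function names are hypothetical. The second version relies on the fact that (sum, count) pairs can be merged in any order, which is one way to read the phrase “associative statistics.”

```python
from collections import defaultdict
from functools import reduce

# Hypothetical (group, value) records.
records = [("a", 2.0), ("b", 4.0), ("a", 6.0), ("b", 8.0), ("a", 1.0)]


def split(rows):
    """Split: collect the values belonging to each group key."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return groups


def mean(values):
    """Apply: the per-group summary."""
    return sum(values) / len(values)


# Combine: gather the per-group results into one dictionary.
sac_means = {key: mean(vals) for key, vals in split(records).items()}


def map_chunk(rows):
    """Map: summarize one chunk of rows as {key: (sum, count)}."""
    out = {}
    for key, value in rows:
        s, n = out.get(key, (0.0, 0))
        out[key] = (s + value, n + 1)
    return out


def merge(left, right):
    """Reduce: merge two partial summaries; the order of merging is irrelevant."""
    merged = dict(left)
    for key, (s, n) in right.items():
        s0, n0 = merged.get(key, (0.0, 0))
        merged[key] = (s0 + s, n0 + n)
    return merged


# Pretend the data live in two chunks, possibly on different machines.
chunks = [records[:2], records[2:]]
totals = reduce(merge, map(map_chunk, chunks))
mr_means = {key: s / n for key, (s, n) in totals.items()}
```

A mean cannot be merged directly across chunks, but its sufficient summary (sum, count) can, which is why the map step emits pairs rather than means.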

Functional Programming

• ComputationalStatistics-Python: “Functions are first class objects” through first exercises

• DSHandbook-Py: Ch. 20, “Programming Language Concepts”

• See also Haskell.

Scaling, iteration, streaming data, online algorithms (Spark)

• NRCreport: Ch. 4, “Temporal Data and Real-Time Algorithms”

• MMDS: Ch. 4, “Mining Data Streams”

• ‡[BDAS] AMPLab. BDAS: The Berkeley Data Analytics Stack. https://amplab.cs.berkeley.edu/software/

• †[Spark] Mohammed Guller. 2015. Big Data Analytics with Spark: A Practitioner’s Guide to Using Spark for Large-Scale Data Processing, Machine Learning, and Graph Analytics, and High-Velocity Data Stream Processing. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-0964-6

• DSHandbook-Py: Ch. 13, “Big Data”

• DeepLearning: Ch. 10, “Sequence Modeling: Recurrent and Recursive Nets”

General Resources for Python and R

• †[DSHandbook-Py] Field Cady. 2017. The Data Science Handbook (Python-based). Wiley. http://onlinelibrary.wiley.com.ezaccess.libraries.psu.edu/book/10.1002/9781119092919

• ‡[Tutorials-R] Ujjwal Karn. 2017. “A curated list of R tutorials for Data Science, NLP, and Machine Learning.” https://github.com/ujjwalkarn/DataScienceR

• ‡[Tutorials-Python] Ujjwal Karn. 2017. “A curated list of Python tutorials for Data Science, NLP, and Machine Learning.” https://github.com/ujjwalkarn/DataSciencePython

Cutting and Bleeding Edge of Data Science Languages

• [Scala] †Vishal Layka and David Pollak. 2015. Beginning Scala. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-0232-6 †Patrick R. Nicolas. 2014. Scala for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1901910 ‡Scala site: https://www.scala-lang.org/ (See also )

• [Julia] †Ivo Balbaert. 2015. Getting Started with Julia. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1973847 ‡Douglas Bates. 2013. “Julia for R Programmers.” http://www.stat.wisc.edu/~bates/JuliaForRProgrammers.pdf ‡Julia site: https://julialang.org/

• [Haskell] †Richard Bird. 2014. Thinking Functionally with Haskell. https://doi-org.ezaccess.libraries.psu.edu/10.1017/CBO9781316092415 †Hakim Cassimally. 2017. Learning Haskell Programming. https://www.lynda.com/Haskell-tutorials/Learning-Haskell-Programming/604926-2.html †James Church. 2017. Learning Haskell for Data Analysis. https://www.lynda.com/Developer-tutorials/Learning-Haskell-Data-Analysis/604234-2.html Haskell site: https://www.haskell.org/

• [Clojure] †Mark McDonnell. 2017. Quick Clojure: Essential Functional Programming. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2952-1 †Akhil Wali. 2014. Clojure for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1674848 †Arthur Ulfeldt. 2015. https://www.lynda.com/Clojure-tutorials/Up-Running-Clojure/413127-2.html ‡Clojure site: https://clojure.org/

• [TensorFlow] ‡Abadi et al. (Google). 2016. “TensorFlow: A System for Large-Scale Machine Learning.” https://arxiv.org/abs/1605.08695 ‡Udacity, “Deep Learning.” https://www.udacity.com/course/deep-learning--ud730 ‡TensorFlow site: https://www.tensorflow.org/

• [H2O] Darren Cook. 2016. Practical Machine Learning with H2O: Powerful, Scalable Techniques for Deep Learning and AI. O’Reilly. Arno Candel and Viraj Parmar. 2015. Deep Learning with H2O. ‡H2O site: https://www.h2o.ai/

Social Data Structures

Space and Time

• †[GIA] David O’Sullivan, David J. Unwin. 2010. Geographic Information Analysis, Second Edition. John Wiley & Sons. http://onlinelibrary.wiley.com/book/10.1002/9780470549094

• [Space-Time] Donna J. Peuquet. 2003. Representations of Space and Time. Guilford.

• Roger S. Bivand, Edzer Pebesma, Virgilio Gómez-Rubio. 2013. Applied Spatial Data Analysis in R. Free through library: http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-7618-4 See also Edzer Pebesma. 2016. “Handling and Analyzing Spatial, Spatiotemporal, and Movement Data in R.” https://edzer.github.io/UseR2016/

• ‡[PySAL] Sergio J. Rey and Dani Arribas-Bel. 2016. “Geographic Data Science with PySAL and the pydata Stack.” http://darribas.org/gds_scipy16/

• DataScience-Python: NB 04.13, “Geographic Data with Basemap”

• Time: “Longitudinal,” intra-individual data, as common in developmental psychology. †[Longitudinal] Garrett Fitzmaurice, Marie Davidian, Geert Verbeke, and Geert Molenberghs, editors. 2009. Longitudinal Data Analysis. Chapman & Hall / CRC. http://pensu.eblib.com/patron/FullRecord.aspx?p=359998

• Time: “Time Series, Time Series Cross Section, Panel,” as common in economics, political science, and sociology. DataScience-Python Ch. 3 re: pandas; DSHandbook-Py Ch. 17, “Time Series Analysis”; Econometrics; GelmanHill.

• Time: “Sequential Data / Streams,” as in NLP, machine learning. Spark and other “streaming data” readings.

• Space-Time: Data with continuity, connectivity, neighborhood structure. See DeepLearning Chapter 9 re: convolution.

• CRAN Task Views: Spatial, https://cran.r-project.org/web/views/Spatial.html; Spatiotemporal, https://cran.r-project.org/web/views/SpatioTemporal.html; Time Series, https://cran.r-project.org/web/views/TimeSeries.html

Network Graphs

• ‡[Networks] David Easley and Jon Kleinberg. 2010. Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press. http://www.cs.cornell.edu/home/kleinber/networks-book/

• MMDS: Ch. 5, “Link Analysis”; Ch. 10, “Mining Social-Network Graphs”

• †[Networks-R] Douglas Luke. 2015. A User’s Guide to Network Analysis in R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-3-319-23883-8

• †[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry. 2017. Python for Graph and Network Analysis. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-53004-8

Hierarchy, Clustered Data, Aggregation, Mixed Models

• †[GelmanHill] Andrew Gelman and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press. http://pensu.eblib.com/patron/FullRecord.aspx?p=288457

• †[MixedModels] Eugene Demidenko. 2013. Mixed Models: Theory and Applications with R, 2nd ed. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10748641

Social Data Channels

Language, Text, Speech, Audio

• ‡[NLP] Daniel Jurafsky and James H. Martin. Forthcoming (2017). Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics, 3rd ed. Prentice-Hall. Preprint: https://web.stanford.edu/~jurafsky/slp3/

• InfoRetrieval

• ‡[CoreNLP] Stanford CoreNLP. https://stanfordnlp.github.io/CoreNLP/

• †[TextAsData] Grimmer and Stewart. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis. https://doi.org/10.1093/pan/mps028

• Quinn-Topics

• FightinWords

• †[NLTK-Python] Jacob Perkins. 2010. Python Text Processing with NLTK 2.0 Cookbook. Packt. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10435387 †[TextAnalytics-Python] Dipanjan Sarkar. 2016. Text Analytics with Python. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2388-8

• DSHandbook-Py: Ch. 16, “Natural Language Processing”

• Matthew J. Denny. http://www.mjdenny.com/Text_Processing_In_R.html

• CRAN: Natural Language Processing. https://cran.r-project.org/web/views/NaturalLanguageProcessing.html

• †[AudioData] Dean Knox and Christopher Lucas. 2017. “The Speaker-Affect Model: Measuring Emotion in Political Speech with Audio Data.” http://christopherlucas.org/files/PDFs/sam.pdf https://www.youtube.com/watch?v=Hs8A9dwkMzI

• †Morgan & Claypool Synthesis Lectures on Human Language Technologies. http://www.morganclaypool.com/toc/hlt/1/1

• †Morgan & Claypool Synthesis Lectures on Speech and Audio Processing. http://www.morganclaypool.com/toc/sap/1/1

Vision, Image, Video

• ‡[ComputerVision] Richard Szeliski. 2010. Computer Vision: Algorithms and Applications. Springer. http://szeliski.org/Book/

• MixedModels: Ch. 11, “Statistical Analysis of Shape”; Ch. 12, “Statistical Image Analysis”

• Audio/Video/Volumetric: See DeepLearning Chapter 9 re: convolution.

• †Morgan & Claypool Synthesis Lectures on Image, Video, and Multimedia Processing. http://www.morganclaypool.com/toc/ivm/1/1

• †Morgan & Claypool Synthesis Lectures on Computer Vision. http://www.morganclaypool.com/toc/cov/1/1

Approaches to Learning from Data (The Analytics Layer)

• NRCreport: Ch. 7, “Building Models from Massive Data”

• MMDS (data-mining)

• CausalInference

• ‡[VisualAnalytics] Daniel Keim, Jorn Kohlhammer, Geoffrey Ellis, and Florian Mansmann, editors. 2010. Mastering the Information Age: Solving Problems with Visual Analytics. The Eurographics Association, Goslar, Germany. http://www.vismaster.eu/book/ See also †Morgan & Claypool Synthesis Lectures on Visualization. http://www.morganclaypool.com/toc/vis/2/1

• ‡[Econometrics] Bruce E. Hansen. 2014. Econometrics. http://www.ssc.wisc.edu/~bhansen/econometrics/

• ‡[StatisticalLearning] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning with Applications in R. Springer. http://www-bcf.usc.edu/~gareth/ISL/index.html

• [MachineLearning] Christopher Bishop. 2006. Pattern Recognition and Machine Learning. Springer.

• †[Bayes] Peter D. Hoff. 2009. A First Course in Bayesian Statistical Methods. Springer. http://link.springer.com/book/10.1007%2F978-0-387-92407-6

• †[GraphicalModels] Søren Højsgaard, David Edwards, Steffen Lauritzen. 2012. Graphical Models with R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-2299-0

• [InformationTheory] David MacKay. 2003. Information Theory, Inference, and Learning Algorithms. Cambridge University Press.

• ‡[DeepLearning] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org/ (see also Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. “Deep Learning.” Nature 521(7553): 436–444. https://doi.org/10.1038/nature14539)

See also ‡[Keras] Keras: The Python Deep Learning Library. https://keras.io/ François Chollet, Directory of Keras tutorials: https://github.com/fchollet/keras-resources

• †[SignalProcessing] Jose Maria Giron-Sierra. 2013. Digital Signal Processing with MATLAB Examples (Volumes 1-3). Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-981-10-2534-1

Penn State Policy Statements

Academic Integrity

Academic integrity is the pursuit of scholarly activity in an open, honest, and responsible manner. Academic integrity is a basic guiding principle for all academic activity at The Pennsylvania State University, and all members of the University community are expected to act in accordance with this principle. Consistent with this expectation, the University’s Code of Conduct states that all students should act with personal integrity; respect other students’ dignity, rights, and property; and help create and maintain an environment in which all can succeed through the fruits of their efforts.

Academic integrity includes a commitment by all members of the University community not to engage in or tolerate acts of falsification, misrepresentation, or deception. Such acts of dishonesty violate the fundamental ethical principles of the University community and compromise the worth of work completed by others.

Disability Accommodation

Penn State welcomes students with disabilities into the University’s educational programs. Every Penn State campus has an office for students with disabilities. The Student Disability Resources (SDR) website provides contact information for every Penn State campus (http://equity.psu.edu/sdr/disability-coordinator). For further information, please visit the Student Disability Resources website (http://equity.psu.edu/sdr).

In order to receive consideration for reasonable accommodations, you must contact the appropriate disability services office at the campus where you are officially enrolled, participate in an intake interview, and provide documentation. See documentation guidelines at http://equity.psu.edu/sdr/guidelines. If the documentation supports your request for reasonable accommodations, your campus disability services office will provide you with an accommodation letter. Please share this letter with your instructors and discuss the accommodations with them as early as possible. You must follow this process for every semester that you request accommodations.

Psychological Services

Many students at Penn State face personal challenges or have psychological needs that may interfere with their academic progress, social development, or emotional wellbeing. The university offers a variety of confidential services to help you through difficult times, including individual and group counseling, crisis intervention, consultations, online chats, and mental health screenings. These services are provided by staff who welcome all students and embrace a philosophy respectful of clients’ cultural and religious backgrounds, and sensitive to differences in race, ability, gender identity, and sexual orientation.

• Counseling and Psychological Services at University Park (CAPS) (http://studentaffairs.psu.edu/counseling/): 814-863-0395

• Penn State Crisis Line (24 hours / 7 days/week): 877-229-6400

• Crisis Text Line (24 hours / 7 days/week): Text LIONS to 741741

Educational Equity

Penn State takes great pride to foster a diverse and inclusive environment for students, faculty, and staff. Consistent with University Policy AD29, students who believe they have experienced or observed a hate crime, an act of intolerance, discrimination, or harassment that occurs at Penn State are urged to report these incidents as outlined on the University’s Report Bias webpage (http://equity.psu.edu/reportbias/).

Background Knowledge

Students will have a variety of backgrounds. Prior to beginning interdisciplinary coursework to fulfill Social Data Analytics degree requirements, including SoDA 501 and 502, students are expected to have advanced (graduate) training in at least one of the component areas of Social Data Analytics and a familiarity with basic concepts in the others.

With regard to specialization, students are expected to have advanced (graduate) training in ONE of the following:

• quantitative social science methodology and a discipline of social science (as would be the case for a second-year PhD student in Political Science, Sociology, Criminology, Human Development and Family Studies, or Demography), OR

• statistics (as would be the case for a second-year PhD student in Statistics), OR

• information science or informatics (as would be the case for a second-year PhD student in Information Science and Technology, or a second-year PhD student in Geography specializing in GIScience), OR

• computer science (as would be the case for a second-year PhD student in Computer Science and Engineering).

This requirement is met as a matter of meeting home program requirements for students in the dual-title PhD, but may require additional coursework on the part of students in other programs wishing to pursue the graduate minor.

With regard to general preparation, students are expected to have ALL of the following technical knowledge:

• basic programming skills, including basic facility with R and/or Python, AND

• basic knowledge of relational databases and/or geographic information systems, AND

• basic knowledge of probability, applied statistics, and/or social science research design, AND

• basic familiarity with a substantive or theoretical area of social science (e.g., 300-level coursework in political science, sociology, criminology, human development, psychology, economics, communication, anthropology, human geography, social informatics, or similar fields).

It is not unusual for students to have one or more gaps in this preparation at time of application to the SoDA program. Students should work with Social Data Analytics advisers to develop a plan for timely remediation of any deficiencies, which generally will not require formal coursework for students whose training and interests are otherwise appropriate for pursuit of the Social Data Analytics degree. Where possible, this will be addressed at time of application to the Social Data Analytics program.

To this end, some free training materials are linked in the reference section, and there are also abundant free, high-quality, self-paced, course-style training materials on these and related subjects available through edX (http://www.edx.org), Udacity (http://www.udacity.com), Codecademy (http://codecademy.com), and Lynda (http://www.lynda.psu).

Readings and References (updated Spring 2018)

We will discuss a relatively small subset of the readings listed here, and this will vary based on topics and readings discussed by visiting speakers and you yourselves. The remainder are provided here as curated references for more in-depth investigation in both the theory and practice related to the topic (with the latter heavily weighted toward resources in Python and R).

† Material that is, at last check, made available through the Penn State library. Most journal links should work if you are logged in to a Penn State machine or through the Penn State VPN. Some article archives and most books require additional authentication through webaccess. If links are broken, start directly from a search via https://www.libraries.psu.edu. Some books require installation of e-readers like Adobe Digital Editions. Links to lynda.com can be accessed through http://lynda.psu.edu.

‡ Material that is, at last check, legally provided for free. In some cases, these are the preprint versions of published material.

§ Material for which a legal selection is or will be provided through the class Box folder.

Big Data & Social Data Analytics

Overviews

• ‡[CompSocSci] David Lazer, Alex Pentland, Lada Adamic, Sinan Aral, Albert-Laszlo Barabasi, Devon Brewer, Nicholas Christakis, Noshir Contractor, James Fowler, Myron Gutmann, Tony Jebara, Gary King, Michael Macy, Deb Roy, and Marshall Van Alstyne. 2009. “Computational Social Science.” Science 323(5915): 721-3 + Supp. Feb 6. http://science.sciencemag.org/content/323/5915/721.full https://gking.harvard.edu/files/gking/files/LazPenAda09.pdf

• ‡[NRCreport] National Research Council. 2013. Frontiers in Massive Data Analysis. National Academies Press. (Free w/ registration: http://www.nap.edu/catalog.php?record_id=18374) Ch. 1, “Introduction”; Ch. 2, “Massive Data in Science, Technology, Commerce, National Defense, Telecommunications, and other Endeavors”

• ‡[MMDS] Jure Leskovec, Anand Rajaraman, and Jeff Ullman. 2014. Mining of Massive Datasets. Cambridge University Press. http://www.mmds.org/ (BETA version of Third Edition: http://i.stanford.edu/~ullman/mmdsn.html)

• §[BitByBit] Matthew J. Salganik. 2018 (Forthcoming). Bit by Bit: Social Research in the Digital Age. Princeton University Press. Ch. 1, “Introduction”; Ch. 2, “Observing Behavior”

• DSHandbook-Py: Ch. 1, “Introduction: Becoming a Unicorn”; Ch. 2, “The Data Science Road Map”

Burt-schtick

bull Dagger[Monroe-5Vs] Burt L Monroe 2013 ldquoThe Five Vs of Big Data Political Science Introductionto the Special Issue on Big Data in Political Sciencerdquo Political Analysis 21(V5) 1ndash9 https

doiorg101017S1047198700014315 (Volume Velocity Variety Vinculation Validity)

bull dagger[Monroe-No] Burt L Monroe Jennifer Pan Margaret E Roberts Maya Sen and Betsy Sin-clair 2015 ldquoNo Formal Theory Causal Inference and Big Data Are Not Contradictory Trendsin Political Sciencerdquo PS Political Science amp Politics 48(1) 71ndash4 httpdxdoiorg101017

S1049096514001760

bull dagger[Quinn-Topics] Kevin M Quinn Burt L Monroe Michael Colaresi Michael H Crespin andDragomir R Radev 2010 ldquoHow to Analyze Political Attention with Minimal Assumptions andCostsrdquo American Journal of Political Science 54(1) 209ndash28 httponlinelibrarywileycom

13

doi101111j1540-5907200900427xfull (esp topic modeling as measurement approach tovalidation)

bull dagger[FightinWords] Burt L Monroe Michael Colaresi and Kevin M Quinn 2008 ldquoFightinrsquo WordsLexical Feature Selection and Evaluation for Identifying the Content of Political Conflictrdquo PoliticalAnalysis 16(4) 372-403 httpsdoiorg101093panmpn018 (esp the impact of samplingvariance and regularization through priors)

• ‡[BDSS-Census] Big Data Social Science PSU Team. 2012. “A Closer Look at the Kaggle Census Data.” https://burtmonroe.github.io/BDSSKaggleCensus2012 (esp. the relevance of the social processes by which data come to exist as data)

Multidisciplinary Perspectives

• †[Business-BigData] Andrew McAfee and Erik Brynjolfsson. 2012. “Big Data: The Management Revolution.” Harvard Business Review 90(10): 61–8. October. https://hbr.org/2012/10/big-data-the-management-revolution; Thomas H. Davenport and D.J. Patil. 2012. “Data Scientist: The Sexiest Job of the 21st Century.” Harvard Business Review 90(10): 70-6. October. https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century

• †[InfoSci-BigData] C.L. Philip Chen and Chun-Yang Zhang. 2014. “Data-intensive Applications, Challenges, Techniques and Technologies: A Survey on Big Data.” Information Sciences 275: 314-47. https://doi.org/10.1016/j.ins.2014.01.015

• †[Informatics-BigData] Vasant G. Honavar. 2014. “The Promise and Potential of Big Data: A Case for Discovery Informatics.” Review of Policy Research 31(4): 326-330. https://doi.org/10.1111/ropr.12080

• ‡[Stats-BigData] Beate Franke, Jean-François Plante, Ribana Roscher, Annie Lee, Cathal Smyth, Armin Hatefi, Fuqi Chen, Einat Gil, Alexander Schwing, Alessandro Selvitella, Michael M. Hoffman, Roger Grosse, Dietrich Hendricks, and Nancy Reid. 2016. “Statistical Inference, Learning and Models in Big Data.” International Statistical Review 84(3): 371-89. http://onlinelibrary.wiley.com/doi/10.1111/insr.12176/full

• †[Econ-BigData] Hal R. Varian. 2014. “Big Data: New Tricks for Econometrics.” Journal of Economic Perspectives 28(2): 3-28. https://doi.org/10.1257/jep.28.2.3

• †[GeoViz-BigData] Alan MacEachren. 2017. “Leveraging Big (Geo) Data with (Geo) Visual Analytics: Place as the Next Frontier.” In Chenghu Zhou, Fenzhen Su, Francis Harvey, and Jun Xu (eds.), Spatial Data Handling in the Big Data Era, pp. 139-155. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/chapter/10.1007/978-981-10-4424-3_10

• †[Soc-BigData] David Lazer and Jason Radford. 2017. “Data ex Machina: Introduction to Big Data.” Annual Review of Sociology 43: 19-39. https://doi.org/10.1146/annurev-soc-060116-053457

• §[Politics-BigData] Keith T. Poole, L. Jason Anastasopoulos, and James E. Monogan III. Forthcoming. “The ‘Big Data’ Revolution in Political Campaigning and Governance.” Oxford Bibliographies in Political Science.

¡Cuidado! Traps, Biases, Problems, Pains, Perils

• †[GoogleFlu] David Lazer, Ryan Kennedy, Gary King, and Alessandro Vespignani. 2014. “The Parable of Google Flu: Traps in Big Data Analysis.” Science 343 (14 March). http://gking.harvard.edu/files/gking/files/0314policyforumff.pdf

• ‡[MachineBias] ProPublica. “Machine Bias: Investigating Algorithmic Injustice.” Series. https://www.propublica.org/series/machine-bias. See especially:

Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. 2016. “Machine Bias.” ProPublica, May 23. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing

Julia Angwin, Madeleine Varner, and Ariana Tobin. 2017. “Facebook Enabled Advertisers to Reach ‘Jew Haters.’” https://www.propublica.org/article/facebook-enabled-advertisers-to-reach-jew-haters

• ‡[Polling2016] Doug Rivers (Nov. 11, 2016). “First Thoughts on Polling Problems in the 2016 U.S. Elections.” https://today.yougov.com/news/2016/11/11/first-thoughts-polling-problems-2016-us-elections

• †[EventData] Wei Wang, Ryan Kennedy, David Lazer, Naren Ramakrishnan. 2016. “Growing Pains for Global Monitoring of Societal Events.” Science 353(6307), pp. 1502–1503. https://doi.org/10.1126/science.aaf6758

• ‡[GoogleBooks] Eitan Adam Pechenick, Christopher M. Danforth, and Peter Sheridan Dodds. 2015. “Characterizing the Google Books Corpus: Strong Limits to Inferences of Socio-Cultural and Linguistic Evolution.” PLoS One. https://doi.org/10.1371/journal.pone.0137041

• ‡[OkCupid] Michael Zimmer. 2016. “OkCupid Study Reveals the Perils of Big-Data Science.” Wired, May 14. https://www.wired.com/2016/05/okcupid-study-reveals-perils-big-data-science

• †[CriticalQuestions] danah boyd and Kate Crawford. 2012. “Critical Questions for Big Data: Provocations for a Cultural, Technological, and Scholarly Phenomenon.” Information, Communication & Society 15(5): 662–79. http://dx.doi.org/10.1080/1369118X.2012.678878

• ‡[RacistBot] Daniel Victor. 2016. “Microsoft created a Twitter bot to learn from users. It quickly became a racist jerk.” New York Times. https://www.nytimes.com/2016/03/25/technology/microsoft-created-a-twitter-bot-to-learn-from-users-it-quickly-became-a-racist-jerk.html

• MMDS, 1.2.2-1.2.3, on Bonferroni

• Also BDSS-Census

Research Design and Measurement

Overviews

• BitByBit

• ‡[ResearchMethodsKB] William M. Trochim. 2006. The Research Methods Knowledge Base. http://www.socialresearchmethods.net/kb

• [7Rules] Glenn Firebaugh. 2008. Seven Rules for Social Research. Princeton University Press.

Measurement Reliability and Validity

• ResearchMethodsKB, “Measurement”

• 7Rules, Ch. 3, “Build Reality Checks into Your Research”

• Reliability: see also FightinWords

• Validity: see also Quinn-Topics, Monroe-5Vs

Indirect, Unobtrusive, Nonreactive Measures; Data Exhaust

• †[UnobtrusiveMeasures] Raymond M. Lee. 2015. “Unobtrusive Measures.” Oxford Bibliographies. https://doi.org/10.1093/OBO/9780199846740-0048 (Canonical cite is Eugene J. Webb, 1966, Unobtrusive Measures, or Webb, Donald T. Campbell, Richard D. Schwartz, Lee Sechrest, 1999, Unobtrusive Measures, rev. ed., Sage.)

• BitByBit, examples in Chapter 2

Multiple Measures Latent Variable Measurement

• 7Rules, Ch. 4, “Replicate Where Possible”

• †[LatentVariables] David J. Bartholomew, Martin Knott, and Irini Moustaki. 2011. Latent Variable Models and Factor Analysis: A Unified Approach. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10483308 (esp. Ch. 1, “Basic ideas and examples”)

• †[Multivariate-R] Brian Everitt and Torsten Hothorn. 2011. An Introduction to Applied Multivariate Analysis with R. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4419-9650-3

• Shalizi-ADA, Ch. 17, “Factor Models”

• ‡[NetflixPrize] Edwin Chen. 2011. “Winning the Netflix Prize: A Summary.”

• ‡CRAN: https://cran.r-project.org/web/views/Multivariate.html, https://cran.r-project.org/web/views/Psychometrics.html, https://cran.r-project.org/web/views/Cluster.html

Sampling and Survey Design

• †[Sampling] Steven K. Thompson. 2012. Sampling, 3rd ed. https://k8es4mc2l.search.serialssolutions.com/?sid=sersol&SS_jc=TC_024492330&title=Wiley%20Desktop%20Editions%20%3A%20Sampling

• ResearchMethodsKB, “Sampling”

• NRCreport, Ch. 8, “Sampling and Massive Data”

• BitByBit, Ch. 3, “Asking Questions”

• ‡[MSE] Daniel Manrique-Vallier, Megan E. Price, and Anita Gohdes. 2013. “Multiple Systems Estimation Techniques for Estimating Casualties in Armed Conflicts.” In Seybolt et al. (eds.), Counting Civilians. Preprint: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.469.939&rep=rep1&type=pdf

• †[NetworkSampling] Ted Mouw and Ashton M. Verdery. 2012. “Network Sampling with Memory: A Proposal for More Efficient Sampling from Social Networks.” Sociological Methodology 42(1): 206–56. https://dx.doi.org/10.1177%2F0081175012461248

• See also FightinWords (re: hidden heteroskedasticity in sample variance)

• ‡[HashDontSample] Mudit Uppal. 2016. “Probabilistic data structures in the Big data world (+ code).” (re: “Hash, don’t sample”) https://medium.com/@muppal/probabilistic-data-structures-in-the-big-data-world-code-b9387cff0c55

• MMDS: Hash Functions (1.3.2), Sampling in streams (4.2)

• ‡CRAN: https://cran.r-project.org/web/views/OfficialStatistics.html (“Complex Survey Design,” “Small Area Estimation”)

Experimental and Observational Designs for Causal Inference


• ‡[CausalInference] Miguel A. Hernán, James M. Robins. Forthcoming (2017). Causal Inference. Chapman & Hall/CRC. Preprint: https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book. Ch. 1, “A definition of causal effect”; Ch. 2, “Randomized experiments”; Ch. 3, “Observational studies.” (See Ch. 6, “Graphical representation of causal effects,” for integration with Judea Pearl approach.)

• BitByBit, Chapter 4, “Running Experiments”; Ch. 2, natural experiments in observable data, examples

• ResearchMethodsKB, “Design”

• 7Rules, Ch. 2, “Look for Differences that Make a Difference, and Report Them”; §Ch. 5, “Compare Like with Like”; Ch. 6, “Use Panel Data to Study Individual Change and Repeated Cross-Section Data to Study Social Change”

• Shalizi-ADA, Part IV, “Causal Inference”

• Monroe-No

• CRAN: https://cran.r-project.org/web/views/ExperimentalDesign.html

Technologies for Primary and Secondary Data Collection

Mobile Devices, Distributed Sensors, Wearable Sensors, Remote Sensing

• †[RealityMining] Nathan Eagle and Alex (Sandy) Pentland. 2006. “Reality Mining: Sensing Complex Social Systems.” Personal and Ubiquitous Computing 10(4): 255–68. https://doi.org/10.1007/s00779-005-0046-3

• †[QuantifiedSelf] Melanie Swan. 2013. “The Quantified Self: Fundamental Disruption in Big Data Science and Biological Discovery.” Big Data 1(2): 85–99. https://doi.org/10.1089/big.2012.0002

• †[SensorData] Charu C. Aggarwal (Ed.). 2013. Managing and Mining Sensor Data. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-6309-2 Ch. 1, “An Introduction to Sensor Data Analytics”

• †[NightLights] Thushyanthan Baskaran, Brian Min, Yogesh Uppal. 2015. “Election cycles and electricity provision: Evidence from a quasi-experiment with Indian special elections.” Journal of Public Economics 126: 64-73. https://doi.org/10.1016/j.jpubeco.2015.03.011

Crowdsourcing, Human Computation, Citizen Science, Web Experiments

• †[Crowdsourcing] Daren C. Brabham. 2013. Crowdsourcing. MIT Press. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10692208 “Introduction”; Ch. 1, “Concepts, Theories, and Cases of Crowdsourcing”

• BitByBit, Ch. 5, “Creating Mass Collaboration”

• NRCreport, Ch. 9, “Human Interaction with Data”

• †[MTurk] Krista Casler, Lydia Bickel, and Elizabeth Hackett. 2013. “Separate but Equal? A Comparison of Participants and Data Gathered via Amazon’s MTurk, Social Media, and Face-to-Face Behavioral Testing.” Computers in Human Behavior 29(6): 2156–60. http://doi.org/10.1016/j.chb.2013.05.009

• †[LabintheWild] Katharina Reinecke and Krzysztof Z. Gajos. 2015. “LabintheWild: Conducting Large-Scale Online Experiments With Uncompensated Samples.” In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing (CSCW ’15), 1364–1378. http://dx.doi.org/10.1145/2675133.2675246


• †[TweetmentEffects] Kevin Munger. 2017. “Tweetment Effects on the Tweeted: Experimentally Reducing Racist Harassment.” Political Behavior 39(3): 629–49. https://doi.org/10.1007/s11109-016-9373-5

• †[HumanComputation] Edith Law and Luis von Ahn. 2011. Human Computation. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdf/10.2200/S00371ED1V01Y201107AIM013

Open Data, File Formats, APIs, Semantic Web, Linked Data

• DSHandbook-Py, Ch. 12, “Data Encodings and File Formats”

• ‡[OpenData] Open Data Institute. “What Is Open Data?” https://theodi.org/what-is-open-data

• ‡[OpenDataHandbook] Open Knowledge International. The Open Data Handbook. http://opendatahandbook.org. Includes appendix “File Formats”: http://opendatahandbook.org/guide/en/appendices/file-formats

• ‡[APIs] Brian Cooksey. 2016. An Introduction to APIs. https://zapier.com/learn/apis

• ‡[APIMarkets] RapidAPI / mashape API marketplaces: https://docs.rapidapi.com, https://market.mashape.com; ProgrammableWeb: https://www.programmableweb.com

• †[LinkedData] Tom Heath and Christian Bizer. 2011. Linked Data: Evolving the Web into a Global Data Space. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/abs/10.2200/S00334ED1V01Y201102WBE001 Ch. 1, “Introduction”; Ch. 2, “Principles of Linked Data”

• †[SemanticWeb] Nikolaos Konstantinou and Dimitrios-Emmanuel Spanos. 2015. Materializing the Web of Linked Data. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-16074-0 Ch. 1, “Introduction: Linked Data and the Semantic Web”

• †Morgan & Claypool Synthesis Lectures on the Semantic Web: Theory and Technology. http://www.morganclaypool.com/toc/wbe/1/1

Web Scraping

• †[TheoryDrivenScraping] Richard N. Landers, Robert C. Brusso, Katelyn J. Cavanaugh, and Andrew B. Collmus. 2016. “A Primer on Theory-Driven Web Scraping: Automatic Extraction of Big Data from the Internet for Use in Psychological Research.” Psychological Methods 21(4): 475–492. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1037/met0000081

• ‡[Scraping-Py] Al Sweigart. 2015. Automate the Boring Stuff with Python: Practical Programming for Total Beginners. Ch. 11, “Web Scraping.” https://automatetheboringstuff.com/chapter11

• †[Scraping-R] Simon Munzert, Christian Rubba, Peter Meißner, and Dominic Nyhuis. 2014. Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining. Wiley. http://onlinelibrary.wiley.com/book/10.1002/9781118834732

• ‡CRAN: https://cran.r-project.org/web/views/WebTechnologies.html

Ethics & Scientific Responsibility in Big Social Data

Human Subjects, Consent, Privacy

• BitByBit, Ch. 6 (“Ethics”)

• ‡[PSU-ORP] Penn State Office for Research Protections (under the VP for Research):

Human Subjects Research / IRB: https://www.research.psu.edu/irb

Revised Common Rule: https://www.research.psu.edu/irb/commonrulechanges

Responsible Conduct of Research: https://www.research.psu.edu/education/rcr

Research Misconduct: https://www.research.psu.edu/researchmisconduct

SARI @ PSU (Scientific and Research Integrity Training): https://www.research.psu.edu/training/sari

• ‡[MenloReport] David Dittrich and Erin Kenneally (Center for Applied Internet Data Analysis). 2012. “The Menlo Report: Ethical Principles Guiding Information and Communication Technology Research” and companion report “Applying Ethical Principles ...” https://www.caida.org/publications/papers/2012/menlo_report_actual_formatted

• ‡[BigDataEthics-CBDES] Jacob Metcalf, Emily F. Keller, and danah boyd. 2016. “Perspectives on Big Data, Ethics, and Society.” Council for Big Data, Ethics, and Society. http://bdes.datasociety.net/council-output/perspectives-on-big-data-ethics-and-society

• ‡[AoIRReport] Annette Markham and Elizabeth Buchanan (Association of Internet Researchers). 2012. “Ethical Decision-Making and Internet Research: Recommendations of the AoIR Ethics Working Committee (Version 2.0).” https://aoir.org/reports/ethics2.pdf and guidelines chart: https://aoir.org/wp-content/uploads/2017/01/aoir_ethics_graphic_2016.pdf

• ‡[BigDataEthics-Wired] Sarah Zhang. 2016. “Scientists are Just as Confused about the Ethics of Big-Data Research as You.” Wired. https://www.wired.com/2016/05/scientists-just-confused-ethics-big-data-research

• †[BigDataEthics-HerschelMori] Richard Herschel and Virginia M. Mori. 2017. “Ethics & Big Data.” Technology in Society 49: 31-36. http://doi.org/10.1016/j.techsoc.2017.03.003

The Science of Data Privacy

• †[BigDataPrivacy] Terence Craig and Mary E. Ludloff. 2011. Privacy and Big Data. O’Reilly Media. http://pensu.eblib.com/patron/FullRecord.aspx?p=781814

• ‡[DataAnalysisPrivacy] John Abowd, Lorenzo Alvisi, Cynthia Dwork, Sampath Kannan, Ashwin Machanavajjhala, Jerome Reiter. 2017. “Privacy-Preserving Data Analysis for the Federal Statistical Agencies.” A Computing Community Consortium white paper. https://arxiv.org/abs/1701.00752

• †[DataPrivacy] Stephen E. Fienberg and Aleksandra B. Slavkovic. 2011. “Data Privacy and Confidentiality.” International Encyclopedia of Statistical Science, 342–5. http://doi.org/10.1007/978-3-642-04898-2_202

• ‡[NetworksPrivacy] Vishesh Karwa and Aleksandra Slavkovic. 2016. “Inference using noisy degrees: Differentially private β-model and synthetic graphs.” Annals of Statistics 44(1): 87-112. http://projecteuclid.org/euclid.aos/1449755958

• †[DataPublishingPrivacy] Raymond Chi-Wing Wong and Ada Wai-Chee Fu. 2010. Privacy-Preserving Data Publishing: An Overview. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdfplus/10.2200/S00237ED1V01Y201003DTM002

• DataMatching, Ch. 8, “Privacy Aspects of Data Matching”

Transparency, Reproducibility, and Team Science

• ‡[10RulesforData] Alyssa Goodman, Alberto Pepe, Alexander W. Blocker, Christine L. Borgman, Kyle Cranmer, Merce Crosas, Rosanne Di Stefano, Yolanda Gil, Paul Groth, Margaret Hedstrom, David W. Hogg, Vinay Kashyap, Ashish Mahabal, Aneta Siemiginowska, and Aleksandra Slavkovic (2014). “Ten Simple Rules for the Care and Feeding of Scientific Data.” PLoS Computational Biology 10(4): e1003542. https://doi.org/10.1371/journal.pcbi.1003542 (Note esp. curated resources for reproducible research.)


• †[Transparency] E. Miguel, C. Camerer, K. Casey, J. Cohen, K. M. Esterling, A. Gerber, R. Glennerster, D. P. Green, M. Humphreys, G. Imbens, D. Laitin, T. Madon, L. Nelson, B. A. Nosek, M. Petersen, R. Sedlmayr, J. P. Simmons, U. Simonsohn, M. Van der Laan. 2014. “Promoting Transparency in Social Science Research.” Science 343(6166): 30–1. https://doi.org/10.1126/science.1245317

• †[Reproducibility] Marcus R. Munafò, Brian A. Nosek, Dorothy V. M. Bishop, Katherine S. Button, Christopher D. Chambers, Nathalie Percie du Sert, Uri Simonsohn, Eric-Jan Wagenmakers, Jennifer J. Ware, and John P. A. Ioannidis. 2017. “A Manifesto for Reproducible Science.” Nature Human Behavior 0021 (2017). https://doi.org/10.1038/s41562-016-0021

• ‡[SoftwareCarpentry] esp. “Lessons”: http://software-carpentry.org/lessons

• DSHandbook-Py, Ch. 9, “Technical Communication and Documentation”; Ch. 15, “Software Engineering Best Practices”

• †[TeamScienceToolkit] Vogel AL, Hall KL, Fiore SM, Klein JT, Bennett LM, Gadlin H, Stokols D, Nebeling LC, Wuchty S, Patrick K, Spotts EL, Pohl C, Riley WT, Falk-Krzesinski HJ. 2013. “The Team Science Toolkit: enhancing research collaboration through online knowledge sharing.” American Journal of Preventive Medicine 45: 787-9. http://doi.org/10.1016/j.amepre.2013.09.001

• §[Databrary] Kara Hall, Robert Croyle, and Amanda Vogel. Forthcoming (2017). Advancing Social and Behavioral Health Research through Cross-disciplinary Team Science. Springer. Includes Rick O. Gilmore and Karen E. Adolph, “Open Sharing of Research Video: Breaking the Boundaries of the Research Team.” (See http://databrary.org)

• CRAN: https://cran.r-project.org/web/views/ReproducibleResearch.html

Social Bias, Fair Algorithms

• MachineBias

• ‡[EmbeddingsBias] Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. “Semantics Derived Automatically from Language Corpora Contain Human Biases.” Science. https://arxiv.org/abs/1608.07187

• ‡[Debiasing] Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, Adam Kalai. 2016. “Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings.” https://arxiv.org/abs/1607.06520

• ‡[AvoidingBias] Moritz Hardt, Eric Price, Nathan Srebro. 2016. “Equality of Opportunity in Supervised Learning.” https://arxiv.org/abs/1610.02413

• ‡[InevitableBias] Jon Kleinberg, Sendhil Mullainathan, Manish Raghavan. 2016. “Inherent Trade-Offs in the Fair Determination of Risk Scores.” https://arxiv.org/abs/1609.05807

Databases and Data Management

• NRCreport, Ch. 3, “Scaling the Infrastructure for Data Management”

• †[SQL] Jan L. Harrington. 2010. SQL Clearly Explained. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780123756978

• DSHandbook-Py, Ch. 14, “Databases”

• †[noSQL] Guy Harrison. 2015. Next Generation Databases: NoSQL, NewSQL, and Big Data. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-1329-2

• †[Cloud] Divyakant Agrawal, Sudipto Das, and Amr El Abbadi. 2012. Data Management in the Cloud: Challenges and Opportunities. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdfplus/10.2200/S00456ED1V01Y201211DTM032


• †Morgan & Claypool Synthesis Lectures on Data Management. http://www.morganclaypool.com/toc/dtm/1/1

Data Wrangling

Theoretically-structured approaches to data wrangling

• ‡[TidyData-R] Garrett Grolemund and Hadley Wickham. 2017. R for Data Science (http://r4ds.had.co.nz), esp. Ch. 12, “Tidy Data”; also Wickham. 2014. “Tidy Data.” Journal of Statistical Software 59(10). http://www.jstatsoft.org/v59/i10/paper. Tools: the tidyverse, http://tidyverse.org

• ‡[DataScience-Py] Jake VanderPlas. 2016. Python Data Science Handbook (https://github.com/jakevdp/PythonDataScienceHandbook), esp. Ch. 3 on pandas, http://pandas.pydata.org

• ‡[DataCarpentry] Colin Gillespie and Robin Lovelace. 2017. Efficient R Programming. https://csgillespie.github.io/efficientR, esp. Ch. 6 on “efficient data carpentry”

• §[Wrangler] Joseph M. Hellerstein, Jeffrey Heer, Tye Rattenbury, and Sean Kandel. 2017. Data Wrangling: Practical Techniques for Data Preparation. Tool: Trifacta Wrangler, http://www.trifacta.com/products/wrangler

Data wrangling practice

• DSHandbook-Py, Ch. 4, “Data Munging: String Manipulation, Regular Expressions, and Data Cleaning”

• †[DataSimplification] Jules J. Berman. 2016. Data Simplification: Taming Information with Open Source Tools. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780128037812

• †[Wrangling-Py] Jacqueline Kazil, Katharine Jarmul. 2016. Data Wrangling with Python. O’Reilly. http://proquestcombo.safaribooksonline.com.ezaccess.libraries.psu.edu/9781491948804

• †[Wrangling-R] Bradley C. Boehmke. 2016. Data Wrangling with R. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45599-0

Record Linkage, Entity Resolution, Deduplication

• †[DataMatching] Peter Christen. 2012. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-642-31164-2 (esp. Ch. 2, “The Data Matching Process”)

• DataSimplification, Chapter 5, “Identifying and Deidentifying Data”

• ‡[EntityResolution] Lise Getoor and Ashwin Machanavajjhala. 2013. “Entity Resolution for Big Data.” Tutorial, KDD. http://www.umiacs.umd.edu/~getoor/Tutorials/ER_KDD2013.pdf

• ‡[SyrianCasualties] Peter Sadosky, Anshumali Shrivastava, Megan Price, and Rebecca C. Steorts. 2015. “Blocking Methods Applied to Casualty Records from the Syrian Conflict.” https://arxiv.org/abs/1510.07714 (For more on blocking, see DataMatching, Ch. 4, “Indexing.”)

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html, “Statistical Matching and Record Linkage”

“Making up data”: Imputation, Smoothers, Kernels, Priors, Filters, Teleportation, Negative Sampling, Convolution, Augmentation, Adversarial Training


• †[Imputation] Yi Deng, Changgee Chang, Moges Seyoum Ido, and Qi Long. 2016. “Multiple Imputation for General Missing Data Patterns in the Presence of High-dimensional Data.” Scientific Reports 6(21689). https://www.nature.com/articles/srep21689

• ‡[Overimputation] Matthew Blackwell, James Honaker, and Gary King. 2017. “A Unified Approach to Measurement Error and Missing Data: Overview and Applications.” Sociological Methods & Research 46(3): 303-341. http://gking.harvard.edu/files/gking/files/measure.pdf

• Shalizi-ADA, Sect. 1.5 (Linear Smoothers), Ch. 8 (Splines), Sect. 14.4 (Kernel Density Estimates); DataScience-Python, NB 05.13, “Kernel Density Estimation”

• FightinWords (re: Bayesian priors as additional data; impact of priors on regularization)

• DeepLearning, Section 7.5 (Data Augmentation), 7.12 (Dropout), 7.13 (Adversarial Examples), Chapter 9 (Convolutional Networks)

• †[Adversarial] Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy. 2015. “Explaining and Harnessing Adversarial Examples.” https://arxiv.org/abs/1412.6572

• See also resampling and simulation methods

• See also feature engineering, preprocessing

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html, “Imputation”

(Direct) Data Representations, Data Mappings

• NRCreport, Ch. 5, “Large-Scale Data Representations”

• †[PatternRecognition] M. Narasimha Murty and V. Susheela Devi. 2011. Pattern Recognition: An Algorithmic Approach. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-0-85729-495-1 Section 2.1, “Data Structures for Pattern Representation”

• ‡[InfoRetrieval] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2009. Introduction to Information Retrieval. Cambridge University Press. http://nlp.stanford.edu/IR-book Ch. 1, 2, 6 (also note slides used in their class)

• †[Algorithms] Brian Steele, John Chandler, Swarna Reddy. 2016. Algorithms for Data Science. Wiley. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45797-0 Ch. 2, “Data Mapping and Data Dictionaries”

• See also Social Data Structures

Similarity, Distance, Association, Covariance, The Kernel Trick

• PatternRecognition, Section 2.3, “Proximity Measures”

• ‡[Similarity] Brendan O’Connor. 2012. “Cosine similarity, Pearson correlation, and OLS coefficients.” https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients

• For relatively comprehensive lists, see also:

M.-J. Lesot, M. Rifqi, and H. Benhadda. 2009. “Similarity measures for binary and numerical data: a survey.” Int. J. Knowledge Engineering and Soft Data Paradigms 1(1): 63-. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.212.6533&rep=rep1&type=pdf

Seung-Seok Choi, Sung-Hyuk Cha, Charles C. Tappert. 2010. “A Survey of Binary Similarity and Distance Measures.” Journal of Systemics, Cybernetics & Informatics 8(1): 43-8. http://www.iiisci.org/Journal/CV$/sci/pdfs/GS315JG.pdf


Sung-Hyuk Cha. 2007. “Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions.” International Journal of Mathematical Models and Methods in Applied Sciences 4(1): 300-7. http://csis.pace.edu/ctappert/dps/d861-12/session4-p2.pdf

Anna Huang. 2008. “Similarity Measures for Text Document Clustering.” Proceedings of the 6th New Zealand Computer Science Research Student Conference, 49-56. http://www.nzcsrsc08.canterbury.ac.nz/site/proceedings/Individual_Papers/pg049_Similarity_Measures_for_Text_Document_Clustering.pdf

• Michael I. Jordan. “The Kernel Trick.” (Lecture Notes) https://people.eecs.berkeley.edu/~jordan/courses/281B-spring04/lectures/lec3.pdf

• Eric Kim. 2017. “Everything You Ever Wanted to Know about the Kernel Trick (But Were Afraid to Ask).” http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick_blog_ekim_12_20_2017.pdf

Derived Data Representations - Dimensionality Reduction, Compression, Decomposition, Embeddings

The groupings here and under the measurement / multivariate statistics section are particularly arbitrary. For example, “k-Means clustering” can be viewed as a technique for “dimensionality reduction,” “compression,” “feature extraction,” “latent variable measurement,” “unsupervised learning,” or “collaborative filtering.”
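To make that point concrete, here is a minimal sketch (not taken from any of the readings) in which a single k-means fit is read two ways: as compression, where each row is replaced by the index of its nearest centroid, and as feature extraction, where each row is re-expressed as its distances to the k centroids. The `kmeans` helper, the synthetic three-blob data, and all parameter values are illustrative assumptions; it uses plain Lloyd iterations in NumPy rather than any particular library's implementation.

```python
import numpy as np


def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm (illustrative helper); returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialise the centroids at k distinct data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # an empty cluster keeps its previous centroid
                centroids[j] = members.mean(axis=0)
    # Final assignment against the final centroids.
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return centroids, d2.argmin(axis=1)


# Synthetic data (assumption): three separated blobs in 10 dimensions.
rng = np.random.default_rng(1)
true_centers = rng.normal(scale=10.0, size=(3, 10))
X = np.vstack([c + rng.normal(size=(100, 10)) for c in true_centers])

centroids, labels = kmeans(X, k=3)

# Reading 1, compression / vector quantization:
# each 10-dimensional row is stored as one integer code plus a shared codebook.
X_hat = centroids[labels]
mse_codebook = ((X - X_hat) ** 2).mean()
mse_grand_mean = ((X - X.mean(axis=0)) ** 2).mean()  # the k = 1 baseline

# Reading 2, feature extraction / dimensionality reduction:
# each row becomes its k distances to the centroids (10 columns -> 3 columns).
features = np.sqrt(((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2))

print(f"reconstruction MSE with 3 codes: {mse_codebook:.2f}")
print(f"reconstruction MSE with grand mean only: {mse_grand_mean:.2f}")
print("derived feature matrix shape:", features.shape)
```

The same fitted object could equally be described as a latent class assignment (the `labels`) or as the output of an unsupervised learner, which is the sense in which the headings below overlap.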

Clustering, hashing, quantization, blocking, compression

• ‡[Compression] Khalid Sayood. 2012. Introduction to Data Compression, 4th ed. Springer. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780124157965 (e.g., coding, blocking via vector quantization)

• MMDS, Ch. 3, “Finding Similar Items” (minhashing, locality sensitive hashing)

• Multivariate-R, Ch. 6, “Clustering”

• DataScience-Python, NB 05.11, “k-Means Clustering”; NB 05.12, “Gaussian Mixture Models”

• ‡[KMeansHashing] Kaiming He, Fang Wen, Jian Sun. 2013. “K-means Hashing: An Affinity-Preserving Quantization Method for Learning Binary Compact Codes.” CVPR. https://www.cv-foundation.org/openaccess/content_cvpr_2013/papers/He_K-Means_Hashing_An_2013_CVPR_paper.pdf

• †[Squashing] Madigan, D., Raghavan, N., Dumouchel, W., Nason, M., Posse, C., and Ridgeway, G. (2002). “Likelihood-based data squashing: A modeling approach to instance construction.” Data Mining and Knowledge Discovery 6(2): 173-190. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1023/A:1014095614948

• †[Core-sets] Piotr Indyk, Sepideh Mahabadi, Mohammad Mahdian, Vahab S. Mirrokni. 2014. “Composable core-sets for diversity and coverage maximization.” PODS ’14, 100-8. https://doi.org/10.1145/2594538.2594560

Feature selection, feature extraction, feature engineering, weighting, preprocessing

• PatternRecognition, Sections 2.6-7, “Feature Selection,” “Feature Extraction”

• DSHandbook-Py, Ch. 7, “Interlude: Feature Extraction Ideas”

• DataScience-Python, NB 05.04, “Feature Engineering”

• Features in text: NLP and InfoRetrieval re: tf-idf and similar; FightinWords

• Features in images: DataScience-Python, NB 05.14, “Image Features”


Dimensionality reduction, decomposition, factorization, change of basis, reparameterization, matrix completion, latent variables, source separation

• MMDS, Ch. 9, “Recommendation Systems”; Ch. 11, “Dimensionality Reduction”

• DSHandbook-Py, Ch. 10, “Unsupervised Learning: Clustering and Dimensionality Reduction”

• ‡[Shalizi-ADA] Cosma Rohilla Shalizi. 2017. Advanced Data Analysis from an Elementary Point of View. Ch. 16 (“Principal Components Analysis”), Ch. 17 (“Factor Models”). http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV

See also Multivariate-R, Ch. 3, “Principal Components Analysis”; Ch. 4, “Multidimensional Scaling”; Ch. 5, “Exploratory Factor Analysis”; Latent

• DeepLearning, Ch. 2, “Linear Algebra”; Ch. 13, “Linear Factor Models”

• ‡[GloVe] Jeffrey Pennington, Richard Socher, Christopher Manning. 2014. “GloVe: Global vectors for word representation.” EMNLP. https://nlp.stanford.edu/projects/glove

• †[NMF] Daniel D. Lee and H. Sebastian Seung. 1999. “Learning the parts of objects by non-negative matrix factorization.” Nature 401: 788-791. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1038/44565

• †[CUR] Michael W. Mahoney and Petros Drineas. 2009. “CUR matrix decompositions for improved data analysis.” PNAS. http://www.pnas.org/content/106/3/697.full

• ‡[ICA] Aapo Hyvärinen and Erkki Oja. 2000. “Independent Component Analysis: Algorithms and Applications.” Neural Networks 13(4-5): 411-30. https://www.cs.helsinki.fi/u/ahyvarin/papers/NN00new.pdf

• [RandomProjection] Ella Bingham and Heikki Mannila. 2001. “Random projection in dimensionality reduction: applications to image and text data.” KDD. https://doi.org/10.1145%2F502512.502546

• Compression, e.g., “Transform coding,” “Wavelets”

Nonlinear dimensionality reduction, Manifold learning

• DeepLearning, Section 5.11.3, “Manifold Learning”

• DataScience-Python, NB 05.10, “Manifold Learning” (Locally linear embedding [LLE], Isomap)

• ‡[KernelPCA] Sebastian Raschka. 2014. “Kernel tricks and nonlinear dimensionality reduction via RBF kernel PCA.” http://sebastianraschka.com/Articles/2014_kernel_pca.html

• Shalizi-ADA, Ch. 18, “Nonlinear Dimensionality Reduction” (LLE)

• ‡[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi. 2003. “Laplacian eigenmaps for dimensionality reduction and data representation.” Neural Computation 15(6): 1373–1396. http://web.cse.ohio-state.edu/~belkin.8/papers/LEM_NC_03.pdf

• DeepLearning, Ch. 14 (“Autoencoders”)

• ‡[word2vec] Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” https://arxiv.org/abs/1301.3781

• ‡[word2vecExplained] Yoav Goldberg and Omer Levy. 2014. “word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method.” https://arxiv.org/abs/1402.3722

• ‡[t-SNE] Laurens van der Maaten and Geoffrey Hinton. 2008. “Visualising Data using t-SNE.” Journal of Machine Learning Research 9: 2579-2605. http://jmlr.csail.mit.edu/papers/volume9/vandermaaten08a/vandermaaten08a.pdf


Computation and Scaling Up

Scientific computing, computation at scale

• NRCreport, Ch. 10, “The Seven Computational Giants of Massive Data Analysis”

• DSHandbook-Py, Ch. 21 “Performance and Computer Memory”, Ch. 22 “Computer Memory and Data Structures”

• ‡[ComputationalStatistics-Python] Cliburn Chan. Computational Statistics in Python. https://people.duke.edu/~ccc14/sta-663/index.html

• CRAN Task View: High Performance Computing. https://cran.r-project.org/web/views/HighPerformanceComputing.html

Numerical computing

• [NumericalComputing] Ward Cheney and David Kincaid. 2013. Numerical Mathematics and Computing, 7th ed. Brooks/Cole, Cengage Learning.

• CRAN Task View: Numerical Mathematics. https://cran.r-project.org/web/views/NumericalMathematics.html

Optimization (e.g., MLE, gradient descent, stochastic gradient descent, EM algorithm, neural nets)

• DSHandbook-Py, Ch. 23, “Maximum Likelihood Estimation and Optimization” (gradient descent)

• DeepLearning, Ch. 4 “Numerical Computation”, Ch. 6 “Deep Feedforward Networks”, Ch. 8 “Optimization for Training Deep Models”

• CRAN Task View: Optimization and Mathematical Programming. https://cran.r-project.org/web/views/Optimization.html

Linear algebra, matrix computations

• [MatrixComputations] Gene H. Golub and Charles F. Van Loan. 2013. Matrix Computations, 4th ed. Johns Hopkins University Press.

• ‡[NetflixMatrix] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. “Matrix Factorization Techniques for Recommender Systems.” Computer, August: 42-9. https://datajobs.com/data-science-repo/Recommender-Systems-[Netflix].pdf

• ‡[BigDataPCA] Jianqing Fan, Qiang Sun, Wen-Xin Zhou, Ziwei Zhu. “Principal Component Analysis for Big Data.” http://www.princeton.edu/~ziweiz/pca.pdf

• ‡[RandomSVD] Andrew Tulloch. 2009. “Fast Randomized SVD.” https://research.fb.com/fast-randomized-svd

• ‡[FactorSGD] Rainer Gemulla, Peter J. Haas, Erik Nijkamp, and Yannis Sismanis. 2011. “Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent.” KDD. http://www.cs.utah.edu/~hari/teaching/bigdata/gemulla11dsgd.pdf

• ‡[Sparse] Max Grossman. 2015. “101 Ways to Store a Sparse Matrix.” https://medium.com/@jmaxg3/101-ways-to-store-a-sparse-matrix-c7f2bf15a229

• ‡[DontInvert] John D. Cook. 2010. “Don’t Invert that Matrix.” https://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix

Simulation-based inference, resampling, Monte Carlo methods, MCMC, Bayes, approximate inference


• ‡[MarkovVisually] Victor Powell. “Markov Chains Explained Visually.” http://setosa.io/ev/markov-chains

• DSHandbook-Py, Ch. 25, “Stochastic Modeling” (Markov chains, MCMC, HMM)

• Bayes, ComputationalStatistics-Py

• DeepLearning, Ch. 17 “Monte Carlo Methods”, Ch. 19 “Approximate Inference”

• ‡[VariationalInference] Jason Eisner. 2011. “High-Level Explanation of Variational Inference.” https://www.cs.jhu.edu/~jason/tutorials/variational.html

• CRAN Task View: Bayesian Inference. https://cran.r-project.org/web/views/Bayesian.html

Parallelism, MapReduce, Split-Apply-Combine

• ‡[MapReduceIntuition] Jigsaw Academy. 2014. “Big Data Specialist: MapReduce.” https://www.youtube.com/watch?v=TwcYQzFqg-8&feature=youtu.be

• MMDS, Ch. 2, “Map-Reduce and the New Software Stack” (3rd Edition discusses Spark & TensorFlow)

• Algorithms, Ch. 3 “Scalable Algorithms and Associative Statistics”, Ch. 4 “Hadoop and MapReduce”

• NRCreport, Ch. 6, “Resources, Trade-offs, and Limitations”

• TidyData-R (Split-apply-combine is the motivating principle behind the “tidyverse” approach.) See also Part III, “Program” (pipes, functions, vectors, iteration)

• ‡[TidySAC-Video] Hadley Wickham. 2017. “Data Science in the Tidyverse.” https://www.rstudio.com/resources/videos/data-science-in-the-tidyverse

• See also FactorSGD
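
For orientation before the readings above, here is a minimal sketch of the split-apply-combine pattern in plain Python. It is not drawn from any of the listed sources; the function name `split_apply_combine`, the field names `state` and `posts`, and the toy records are all hypothetical. The same three steps correspond loosely to the map, shuffle, and reduce phases discussed in the MapReduce readings.

```python
from collections import defaultdict
from statistics import mean


def split_apply_combine(records, key, value, apply=mean):
    """Group records by `key`, apply a summary function to each group's
    `value` entries, and combine the results into one dictionary.
    (Hypothetical helper for illustration only.)"""
    groups = defaultdict(list)
    # Split: route each record's value to the group named by its key.
    for rec in records:
        groups[rec[key]].append(rec[value])
    # Apply + combine: summarize each group, collect into a single result.
    return {k: apply(v) for k, v in groups.items()}


# Toy data (invented): number of posts per user, tagged by state.
records = [
    {"state": "PA", "posts": 10},
    {"state": "PA", "posts": 20},
    {"state": "OH", "posts": 30},
]

result = split_apply_combine(records, "state", "posts")
totals = split_apply_combine(records, "state", "posts", apply=sum)
```

Because each group is summarized independently, the apply step could in principle be distributed across workers, which is the connection to the parallelism readings in this section.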

Functional Programming

• ComputationalStatistics-Python, “Functions are first class objects” through first exercises

• DSHandbook-Py, Ch. 20, “Programming Language Concepts”

• See also Haskell

Scaling, iteration, streaming data, online algorithms (Spark)

• NRCreport, Ch. 4, “Temporal Data and Real-Time Algorithms”

• MMDS, Ch. 4, “Mining Data Streams”

• ‡[BDAS] AMPLab. BDAS, The Berkeley Data Analytics Stack. https://amplab.cs.berkeley.edu/software

• †[Spark] Mohammed Guller. 2015. Big Data Analytics with Spark: A Practitioner’s Guide to Using Spark for Large-Scale Data Processing, Machine Learning, and Graph Analytics, and High-Velocity Data Stream Processing. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-0964-6

• DSHandbook-Py, Ch. 13, “Big Data”

• DeepLearning, Ch. 10, “Sequence Modeling: Recurrent and Recursive Nets”

General Resources for Python and R


• †[DSHandbook-Py] Field Cady. 2017. The Data Science Handbook (Python-based). Wiley. http://onlinelibrary.wiley.com.ezaccess.libraries.psu.edu/book/10.1002/9781119092919

• ‡[Tutorials-R] Ujjwal Karn. 2017. “A curated list of R tutorials for Data Science, NLP and Machine Learning.” https://github.com/ujjwalkarn/DataScienceR

• ‡[Tutorials-Python] Ujjwal Karn. 2017. “A curated list of Python tutorials for Data Science, NLP and Machine Learning.” https://github.com/ujjwalkarn/DataSciencePython

Cutting and Bleeding Edge of Data Science Languages

• [Scala] †Vishal Layka and David Pollak. 2015. Beginning Scala. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-0232-6. †Nicolas Patrick. 2014. Scala for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1901910. ‡Scala site: https://www.scala-lang.org (See also )

• [Julia] †Ivo Balbaert. 2015. Getting Started with Julia. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1973847. ‡Douglas Bates. 2013. “Julia for R Programmers.” http://www.stat.wisc.edu/~bates/JuliaForRProgrammers.pdf. ‡Julia site: https://julialang.org

• [Haskell] †Richard Bird. 2014. Thinking Functionally with Haskell. https://doi-org.ezaccess.libraries.psu.edu/10.1017/CBO9781316092415. †Hakim Cassimally. 2017. Learning Haskell Programming. https://www.lynda.com/Haskell-tutorials/Learning-Haskell-Programming/604926-2.html. †James Church. 2017. Learning Haskell for Data Analysis. https://www.lynda.com/Developer-tutorials/Learning-Haskell-Data-Analysis/604234-2.html. Haskell site: https://www.haskell.org

• [Clojure] †Mark McDonnell. 2017. Quick Clojure: Essential Functional Programming. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2952-1. †Akhil Wali. 2014. Clojure for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1674848. †Arthur Ulfeldt. 2015. https://www.lynda.com/Clojure-tutorials/Up-Running-Clojure/413127-2.html. ‡Clojure site: https://clojure.org

• [TensorFlow] ‡Abadi et al. (Google). 2016. “TensorFlow: A System for Large-Scale Machine Learning.” https://arxiv.org/abs/1605.08695. ‡Udacity, “Deep Learning.” https://www.udacity.com/course/deep-learning--ud730. ‡TensorFlow site: https://www.tensorflow.org

• [H2O] Darren Cook. 2016. Practical Machine Learning with H2O: Powerful, Scalable Techniques for Deep Learning and AI. O’Reilly. Arno Candel and Viraj Parmar. 2015. Deep Learning with H2O. ‡H2O site: https://www.h2o.ai

Social Data Structures

Space and Time

• †[GIA] David O’Sullivan, David J. Unwin. 2010. Geographic Information Analysis, Second Edition. John Wiley & Sons. http://onlinelibrary.wiley.com/book/10.1002/9780470549094

• [Space-Time] Donna J. Peuquet. 2003. Representations of Space and Time. Guilford.

• Roger S. Bivand, Edzer Pebesma, Virgilio Gómez-Rubio. 2013. Applied Spatial Data Analysis in R. Free through library: http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-7618-4. See also Edzer Pebesma. 2016. “Handling and Analyzing Spatial, Spatiotemporal and Movement Data in R.” https://edzer.github.io/UseR2016

• ‡[PySAL] Sergio J. Rey and Dani Arribas-Bel. 2016. “Geographic Data Science with PySAL and the pydata Stack.” http://darribas.org/gds_scipy16


• DataScience-Python, NB 04.13, “Geographic Data with Basemap”

• Time: “Longitudinal” intra-individual data, as common in developmental psychology. †[Longitudinal] Garrett Fitzmaurice, Marie Davidian, Geert Verbeke, and Geert Molenberghs, editors. 2009. Longitudinal Data Analysis. Chapman & Hall/CRC. http://pensu.eblib.com/patron/FullRecord.aspx?p=359998

• Time: “Time Series, Time Series Cross Section, Panel” as common in economics, political science, and sociology. DataScience-Python, Ch. 3 re: pandas; DSHandbook-Py, Ch. 17, “Time Series Analysis”; Econometrics; GelmanHill

• Time: “Sequential, Data Streams” as in NLP, machine learning. Spark and other “streaming data” readings

• Space-Time: Data with continuity, connectivity, neighborhood structure. See DeepLearning, Chapter 9, re: convolution

• CRAN Task Views: Spatial, https://cran.r-project.org/web/views/Spatial.html; Spatiotemporal, https://cran.r-project.org/web/views/SpatioTemporal.html; Time Series, https://cran.r-project.org/web/views/TimeSeries.html

Network Graphs

• ‡[Networks] David Easley and Jon Kleinberg. 2010. Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press. http://www.cs.cornell.edu/home/kleinber/networks-book

• MMDS, Ch. 5 “Link Analysis”, Ch. 10 “Mining Social-Network Graphs”

• †[Networks-R] Douglas Luke. 2015. A User’s Guide to Network Analysis in R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-3-319-23883-8

• †[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry. 2017. Python for Graph and Network Analysis. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-53004-8

Hierarchy, Clustered Data, Aggregation, Mixed Models

• †[GelmanHill] Andrew Gelman and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press. http://pensu.eblib.com/patron/FullRecord.aspx?p=288457

• †[MixedModels] Eugene Demidenko. 2013. Mixed Models: Theory and Applications with R, 2nd ed. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10748641

Social Data Channels

Language, Text, Speech, Audio

• ‡[NLP] Daniel Jurafsky and James H. Martin. Forthcoming (2017). Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics, 3rd ed. Prentice-Hall. Preprint: https://web.stanford.edu/~jurafsky/slp3

• InfoRetrieval

• ‡[CoreNLP] Stanford CoreNLP. https://stanfordnlp.github.io/CoreNLP


• †[TextAsData] Grimmer and Stewart. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis. https://doi.org/10.1093/pan/mps028

• Quinn-Topics

• FightinWords

• †[NLTK-Python] Jacob Perkins. 2010. Python Text Processing with NLTK 2.0 Cookbook. Packt. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10435387. †[TextAnalytics-Python] Dipanjan Sarkar. 2016. Text Analytics with Python. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2388-8

• DSHandbook-Py, Ch. 16, “Natural Language Processing”

• Matthew J. Denny. http://www.mjdenny.com/Text_Processing_In_R.html

• CRAN: Natural Language Processing. https://cran.r-project.org/web/views/NaturalLanguageProcessing.html

• †[AudioData] Dean Knox and Christopher Lucas. 2017. “The Speaker-Affect Model: Measuring Emotion in Political Speech with Audio Data.” http://christopherlucas.org/files/PDFs/sam.pdf; https://www.youtube.com/watch?v=Hs8A9dwkMzI

• †Morgan & Claypool Synthesis Lectures on Human Language Technologies. http://www.morganclaypool.com/toc/hlt/1/1

• †Morgan & Claypool Synthesis Lectures on Speech and Audio Processing. http://www.morganclaypool.com/toc/sap/1/1

Vision, Image, Video

• ‡[ComputerVision] Richard Szeliski. 2010. Computer Vision: Algorithms and Applications. Springer. http://szeliski.org/Book

• MixedModels, Ch. 11 “Statistical Analysis of Shape”, Ch. 12 “Statistical Image Analysis”

• Audio/Video/Volumetric: See DeepLearning, Chapter 9, re: convolution

• †Morgan & Claypool Synthesis Lectures on Image, Video, and Multimedia Processing. http://www.morganclaypool.com/toc/ivm/1/1

• †Morgan & Claypool Synthesis Lectures on Computer Vision. http://www.morganclaypool.com/toc/cov/1/1

Approaches to Learning from Data (The Analytics Layer)

• NRCreport, Ch. 7, “Building Models from Massive Data”

• MMDS (data-mining)

• CausalInference

• ‡[VisualAnalytics] Daniel Keim, Jörn Kohlhammer, Geoffrey Ellis, and Florian Mansmann, editors. 2010. Mastering the Information Age: Solving Problems with Visual Analytics. The Eurographics Association, Goslar, Germany. http://www.vismaster.eu/book. See also †Morgan & Claypool Synthesis Lectures on Visualization. http://www.morganclaypool.com/toc/vis/2/1

• ‡[Econometrics] Bruce E. Hansen. 2014. Econometrics. http://www.ssc.wisc.edu/~bhansen/econometrics


• ‡[StatisticalLearning] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning with Applications in R. Springer. http://www-bcf.usc.edu/~gareth/ISL/index.html

• [MachineLearning] Christopher Bishop. 2006. Pattern Recognition and Machine Learning. Springer.

• †[Bayes] Peter D. Hoff. 2009. A First Course in Bayesian Statistical Methods. Springer. http://link.springer.com/book/10.1007%2F978-0-387-92407-6

• †[GraphicalModels] Søren Højsgaard, David Edwards, Steffen Lauritzen. 2012. Graphical Models with R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-2299-0

• [InformationTheory] David MacKay. 2003. Information Theory, Inference, and Learning Algorithms. Cambridge University Press.

• ‡[DeepLearning] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org (see also Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. “Deep Learning.” Nature 521(7553): 436–444. https://doi.org/10.1038/nature14539)

See also ‡[Keras] Keras: The Python Deep Learning Library. https://keras.io. François Chollet, Directory of Keras tutorials: https://github.com/fchollet/keras-resources

• †[SignalProcessing] Jose Maria Giron-Sierra. 2013. Digital Signal Processing with MATLAB Examples (Volumes 1-3). Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-981-10-2534-1


Penn State Policy Statements

Academic Integrity

Academic integrity is the pursuit of scholarly activity in an open, honest, and responsible manner. Academic integrity is a basic guiding principle for all academic activity at The Pennsylvania State University, and all members of the University community are expected to act in accordance with this principle. Consistent with this expectation, the University’s Code of Conduct states that all students should act with personal integrity; respect other students’ dignity, rights, and property; and help create and maintain an environment in which all can succeed through the fruits of their efforts.

Academic integrity includes a commitment by all members of the University community not to engage in or tolerate acts of falsification, misrepresentation, or deception. Such acts of dishonesty violate the fundamental ethical principles of the University community and compromise the worth of work completed by others.

Disability Accommodation

Penn State welcomes students with disabilities into the University’s educational programs. Every Penn State campus has an office for students with disabilities. The Student Disability Resources (SDR) website provides contact information for every Penn State campus (http://equity.psu.edu/sdr/disability-coordinator). For further information, please visit the Student Disability Resources website (http://equity.psu.edu/sdr).

In order to receive consideration for reasonable accommodations, you must contact the appropriate disability services office at the campus where you are officially enrolled, participate in an intake interview, and provide documentation. See documentation guidelines at http://equity.psu.edu/sdr/guidelines. If the documentation supports your request for reasonable accommodations, your campus disability services office will provide you with an accommodation letter. Please share this letter with your instructors and discuss the accommodations with them as early as possible. You must follow this process for every semester that you request accommodations.

Psychological Services

Many students at Penn State face personal challenges or have psychological needs that may interfere with their academic progress, social development, or emotional wellbeing. The university offers a variety of confidential services to help you through difficult times, including individual and group counseling, crisis intervention, consultations, online chats, and mental health screenings. These services are provided by staff who welcome all students and embrace a philosophy respectful of clients’ cultural and religious backgrounds, and sensitive to differences in race, ability, gender identity, and sexual orientation.

• Counseling and Psychological Services at University Park (CAPS) (http://studentaffairs.psu.edu/counseling): 814-863-0395

• Penn State Crisis Line (24 hours/7 days/week): 877-229-6400

• Crisis Text Line (24 hours/7 days/week): Text LIONS to 741741

Educational Equity

Penn State takes great pride to foster a diverse and inclusive environment for students, faculty, and staff. Consistent with University Policy AD29, students who believe they have experienced or observed a hate crime, an act of intolerance, discrimination, or harassment that occurs at Penn State are urged to report these incidents as outlined on the University’s Report Bias webpage (http://equity.psu.edu/reportbias).


Background Knowledge

Students will have a variety of backgrounds. Prior to beginning interdisciplinary coursework to fulfill Social Data Analytics degree requirements, including SoDA 501 and 502, students are expected to have advanced (graduate) training in at least one of the component areas of Social Data Analytics and a familiarity with basic concepts in the others.

With regard to specialization, students are expected to have advanced (graduate) training in ONE of the following:

• quantitative social science methodology and a discipline of social science (as would be the case for a second-year PhD student in Political Science, Sociology, Criminology, Human Development and Family Studies, or Demography), OR

• statistics (as would be the case for a second-year PhD student in Statistics), OR

• information science or informatics (as would be the case for a second-year PhD student in Information Science and Technology, or a second-year PhD student in Geography specializing in GIScience), OR

• computer science (as would be the case for a second-year PhD student in Computer Science and Engineering).

This requirement is met as a matter of meeting home program requirements for students in the dual-title PhD, but may require additional coursework on the part of students in other programs wishing to pursue the graduate minor.

With regard to general preparation, students are expected to have ALL of the following technical knowledge:

• basic programming skills, including basic facility with R &/or Python, AND

• basic knowledge of relational databases &/or geographic information systems, AND

• basic knowledge of probability, applied statistics, &/or social science research design, AND

• basic familiarity with a substantive or theoretical area of social science (e.g., 300-level coursework in political science, sociology, criminology, human development, psychology, economics, communication, anthropology, human geography, social informatics, or similar fields).

It is not unusual for students to have one or more gaps in this preparation at time of application to the SoDA program. Students should work with Social Data Analytics advisers to develop a plan for timely remediation of any deficiencies, which generally will not require formal coursework for students whose training and interests are otherwise appropriate for pursuit of the Social Data Analytics degree. Where possible, this will be addressed at time of application to the Social Data Analytics program.

To this end, some free training materials are linked in the reference section, and there are also abundant free, high-quality, self-paced, course-style training materials on these and related subjects available through edX (http://www.edx.org), Udacity (http://www.udacity.com), Codecademy (http://codecademy.com), and Lynda (http://www.lynda.psu).


doi/10.1111/j.1540-5907.2009.00427.x/full (esp. topic modeling as measurement, approach to validation)

• †[FightinWords] Burt L. Monroe, Michael Colaresi, and Kevin M. Quinn. 2008. “Fightin’ Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict.” Political Analysis 16(4): 372-403. https://doi.org/10.1093/pan/mpn018 (esp. the impact of sampling variance and regularization through priors)

• ‡[BDSS-Census] Big Data Social Science PSU Team. 2012. “A Closer Look at the Kaggle Census Data.” https://burtmonroe.github.io/BDSSKaggleCensus2012 (esp. the relevance of the social processes by which data come to exist as data)

Multidisciplinary Perspectives

• †[Business-BigData] Andrew McAfee and Erik Brynjolfsson. 2012. “Big Data: The Management Revolution.” Harvard Business Review 90(10): 61–8, October. https://hbr.org/2012/10/big-data-the-management-revolution. Thomas H. Davenport and D.J. Patil. 2012. “Data Scientist: The Sexiest Job of the 21st Century.” Harvard Business Review 90(10): 70-6, October. https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century

• †[InfoSci-BigData] C.L. Philip Chen and Chun-Yang Zhang. 2014. “Data-intensive Applications, Challenges, Techniques and Technologies: A Survey on Big Data.” Information Sciences 275: 314-47. https://doi.org/10.1016/j.ins.2014.01.015

• †[Informatics-BigData] Vasant G. Honavar. 2014. “The Promise and Potential of Big Data: A Case for Discovery Informatics.” Review of Policy Research 31(4): 326-330. https://doi.org/10.1111/ropr.12080

• ‡[Stats-BigData] Beate Franke, Jean-François Plante, Ribana Roscher, Annie Lee, Cathal Smyth, Armin Hatefi, Fuqi Chen, Einat Gil, Alexander Schwing, Alessandro Selvitella, Michael M. Hoffman, Roger Grosse, Dietrich Hendricks, and Nancy Reid. 2016. “Statistical Inference, Learning and Models in Big Data.” International Statistical Review 84(3): 371-89. http://onlinelibrary.wiley.com/doi/10.1111/insr.12176/full

• †[Econ-BigData] Hal R. Varian. 2013. “Big Data: New Tricks for Econometrics.” Journal of Economic Perspectives 28(2): 3-28. https://doi.org/10.1257/jep.28.2.3

• †[GeoViz-BigData] Alan MacEachren. 2017. “Leveraging Big (Geo) Data with (Geo) Visual Analytics: Place as the Next Frontier.” In Chenghu Zhou, Fenzhen Su, Francis Harvey, and Jun Xu, eds., Spatial Data Handling in the Big Data Era, pp. 139-155. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/chapter/10.1007/978-981-10-4424-3_10

• †[Soc-BigData] David Lazer and Jason Radford. 2017. “Data ex Machina: Introduction to Big Data.” Annual Review of Sociology 43: 19-39. https://doi.org/10.1146/annurev-soc-060116-053457

• §[Politics-BigData] Keith T. Poole, L. Jason Anastasopoulos, and James E. Monogan III. Forthcoming. “The ‘Big Data’ Revolution in Political Campaigning and Governance.” Oxford Bibliographies in Political Science.

¡Cuidado! Traps, Biases, Problems, Pains, Perils

• †[GoogleFlu] David Lazer, Ryan Kennedy, Gary King, and Alessandro Vespignani. 2014. “The Parable of Google Flu: Traps in Big Data Analysis.” Science (343), 14 March. http://gking.harvard.edu/files/gking/files/0314policyforumff.pdf

• ‡[MachineBias] ProPublica. “Machine Bias: Investigating Algorithmic Injustice.” Series. https://www.propublica.org/series/machine-bias. See especially:


Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. 2016. “Machine Bias.” ProPublica, May 23. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing

Julia Angwin, Madeleine Varner, and Ariana Tobin. 2017. “Facebook Enabled Advertisers to Reach ‘Jew Haters’.” https://www.propublica.org/article/facebook-enabled-advertisers-to-reach-jew-haters

• ‡[Polling2016] Doug Rivers (Nov. 11, 2016). “First Thoughts on Polling Problems in the 2016 U.S. Elections.” https://today.yougov.com/news/2016/11/11/first-thoughts-polling-problems-2016-us-elections

• †[EventData] Wei Wang, Ryan Kennedy, David Lazer, Naren Ramakrishnan. 2016. “Growing Pains for Global Monitoring of Societal Events.” Science 353:6307, pp. 1502–1503. https://doi.org/10.1126/science.aaf6758

• ‡[GoogleBooks] Eitan Adam Pechenick, Christopher M. Danforth, and Peter Sheridan Dodds. 2015. “Characterizing the Google Books Corpus: Strong Limits to Inferences of Socio-Cultural and Linguistic Evolution.” PLoS One. https://doi.org/10.1371/journal.pone.0137041

• ‡[OkCupid] Michael Zimmer. 2016. “OkCupid Study Reveals the Perils of Big-Data Science.” Wired, May 14. https://www.wired.com/2016/05/okcupid-study-reveals-perils-big-data-science

• †[CriticalQuestions] danah boyd and Kate Crawford. 2011. “Critical Questions for Big Data: Provocations for a Cultural, Technological, and Scholarly Phenomenon.” Information, Communication & Society 15(5): 662–79. http://dx.doi.org/10.1080/1369118X.2012.678878

• ‡[RacistBot] Daniel Victor. 2016. “Microsoft created a Twitter bot to learn from users. It quickly became a racist jerk.” New York Times. https://www.nytimes.com/2016/03/25/technology/microsoft-created-a-twitter-bot-to-learn-from-users-it-quickly-became-a-racist-jerk.html

• MMDS 1.2.2-1.2.3 on Bonferroni

• Also BDSS-Census

Research Design and Measurement

Overviews

• BitByBit

• ‡[ResearchMethodsKB] William M. Trochim. 2006. The Research Methods Knowledge Base. http://www.socialresearchmethods.net/kb

• [7Rules] Glenn Firebaugh. 2008. Seven Rules for Social Research. Princeton University Press.

Measurement Reliability and Validity

• ResearchMethodsKB, “Measurement”

• 7Rules, Ch. 3, “Build Reality Checks into Your Research”

• Reliability: see also FightinWords

• Validity: see also Quinn10-Topics, Monroe-5Vs

Indirect, Unobtrusive, Nonreactive Measures, Data Exhaust


• †[UnobtrusiveMeasures] Raymond M. Lee. 2015. “Unobtrusive Measures.” Oxford Bibliographies. https://doi.org/10.1093/OBO/9780199846740-0048 (Canonical cite is Eugene J. Webb, 1966, Unobtrusive Measures, or Webb, Donald T. Campbell, Richard D. Schwartz, Lee Sechrest, 1999, Unobtrusive Measures, rev. ed., Sage.)

• BitByBit, Examples in Chapter 2

Multiple Measures, Latent Variable Measurement

• 7Rules, Ch. 4, “Replicate Where Possible”

• †[LatentVariables] David J. Bartholomew, Martin Knott, and Irini Moustaki. 2011. Latent Variable Models and Factor Analysis: A Unified Approach. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10483308 (esp. Ch. 1, “Basic ideas and examples”)

• †[Multivariate-R] Brian Everitt and Torsten Hothorn. 2011. An Introduction to Applied Multivariate Analysis with R. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4419-9650-3

• Shalizi-ADA, Ch. 17, “Factor Models”

• ‡[NetflixPrize] Edwin Chen. 2011. “Winning the Netflix Prize: A Summary.”

• ‡CRAN: https://cran.r-project.org/web/views/Multivariate.html, https://cran.r-project.org/web/views/Psychometrics.html, https://cran.r-project.org/web/views/Cluster.html

Sampling and Survey Design

• †[Sampling] Steven K. Thompson. 2012. Sampling, 3rd ed. http://sk8es4mc2l.search.serialssolutions.com/?sid=sersol&SS_jc=TC_024492330&title=Wiley%20Desktop%20Editions%20%3A%20Sampling

• ResearchMethodsKB, “Sampling”

• NRCreport, Ch. 8, “Sampling and Massive Data”

• BitByBit, Ch. 3, “Asking Questions”

• ‡[MSE] Daniel Manrique-Vallier, Megan E. Price, and Anita Gohdes. 2013. In Seybolt et al. (eds.), Counting Civilians: “Multiple Systems Estimation Techniques for Estimating Casualties in Armed Conflicts.” Preprint: http://citeseerx.ist.psu.edu/viewdoc/download?doi=1011469939&rep=rep1&type=pdf

• †[NetworkSampling] Ted Mouw and Ashton M. Verdery. 2012. “Network Sampling with Memory: A Proposal for More Efficient Sampling from Social Networks.” Sociological Methodology 42(1): 206–56. https://dx.doi.org/10.1177%2F0081175012461248

• See also FightinWords (re: hidden heteroskedasticity in sample variance)

• ‡[HashDontSample] Mudit Uppal. 2016. “Probabilistic data structures in the Big data world (+ code)” (re: “Hash, don’t sample”). https://medium.com/@muppal/probabilistic-data-structures-in-the-big-data-world-code-b9387cff0c55

• MMDS, Hash Functions (1.3.2), Sampling in streams (4.2)

• ‡CRAN: https://cran.r-project.org/web/views/OfficialStatistics.html (“Complex Survey Design”, “Small Area Estimation”)

Experimental and Observational Designs for Causal Inference


• ‡[CausalInference] Miguel A. Hernán, James M. Robins. Forthcoming (2017). Causal Inference. Chapman & Hall/CRC. Preprint: https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book. Ch. 1 “A definition of causal effect”, Ch. 2 “Randomized experiments”, Ch. 3 “Observational studies”. (See Ch. 6, “Graphical representation of causal effects”, for integration with Judea Pearl approach.)

• BitByBit, Chapter 4, “Running Experiments”; Ch. 2, Natural experiments in observable data examples

• ResearchMethodsKB, “Design”

• Firebaugh-7Rules, Ch. 2 “Look for Differences that Make a Difference, and Report Them”, §Ch. 5 “Compare Like with Like”, Ch. 6 “Use Panel Data to Study Individual Change and Repeated Cross-Section Data to Study Social Change”

• Shalizi-ADA, Part IV, “Causal Inference”

• Monroe-No

• CRAN: https://cran.r-project.org/web/views/ExperimentalDesign.html

Technologies for Primary and Secondary Data Collection

Mobile Devices, Distributed Sensors, Wearable Sensors, Remote Sensing

• †[RealityMining] Nathan Eagle and Alex (Sandy) Pentland. 2006. “Reality Mining: Sensing Complex Social Systems.” Personal and Ubiquitous Computing 10(4): 255–68. https://doi.org/10.1007/s00779-005-0046-3

• †[QuantifiedSelf] Melanie Swan. 2013. “The Quantified Self: Fundamental Disruption in Big Data Science and Biological Discovery.” Big Data 1(2): 85–99. https://doi.org/10.1089/big.2012.0002

• †[SensorData] Charu C. Aggarwal (Ed.). 2013. Managing and Mining Sensor Data. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-6309-2. Ch. 1, “An Introduction to Sensor Data Analytics”

• †[NightLights] Thushyanthan Baskaran, Brian Min, Yogesh Uppal. 2015. “Election cycles and electricity provision: Evidence from a quasi-experiment with Indian special elections.” Journal of Public Economics 126: 64-73. https://doi.org/10.1016/j.jpubeco.2015.03.011

Crowdsourcing, Human Computation, Citizen Science, Web Experiments

• †[Crowdsourcing] Daren C. Brabham. 2013. Crowdsourcing. MIT Press. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10692208. “Introduction”, Ch. 1 “Concepts, Theories, and Cases of Crowdsourcing”

• BitByBit, Ch. 5, “Creating Mass Collaboration”

• NRCreport, Ch. 9, “Human Interaction with Data”

• †[MTurk] Krista Casler, Lydia Bickel, and Elizabeth Hackett. 2013. “Separate but Equal? A Comparison of Participants and Data Gathered via Amazon’s MTurk, Social Media, and Face-to-Face Behavioral Testing.” Computers in Human Behavior 29(6): 2156–60. http://doi.org/10.1016/j.chb.2013.05.009

• †[LabintheWild] Katharina Reinecke and Krzysztof Z. Gajos. 2015. “LabintheWild: Conducting Large-Scale Online Experiments With Uncompensated Samples.” In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing (CSCW ’15): 1364–1378. http://dx.doi.org/10.1145/2675133.2675246


• †[TweetmentEffects] Kevin Munger. 2017. “Tweetment Effects on the Tweeted: Experimentally Reducing Racist Harassment.” Political Behavior 39(3): 629–49. https://doi.org/10.1007/s11109-016-9373-5

• †[HumanComputation] Edith Law and Luis von Ahn. 2011. Human Computation. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdf/10.2200/S00371ED1V01Y201107AIM013

Open Data, File Formats, APIs, Semantic Web, Linked Data

• DSHandbook-Py, Ch. 12 “Data Encodings and File Formats”

• ‡[OpenData] Open Data Institute. “What Is Open Data?” https://theodi.org/what-is-open-data

• ‡[OpenDataHandbook] Open Knowledge International. The Open Data Handbook. http://opendatahandbook.org Includes appendix “File Formats”: http://opendatahandbook.org/guide/en/appendices/file-formats

• ‡[APIs] Brian Cooksey. 2016. An Introduction to APIs. https://zapier.com/learn/apis

• ‡[APIMarkets] RapidAPI / mashape API marketplaces: https://docs.rapidapi.com and https://market.mashape.com; ProgrammableWeb: https://www.programmableweb.com

• †[LinkedData] Tom Heath and Christian Bizer. 2011. Linked Data: Evolving the Web into a Global Data Space. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/abs/10.2200/S00334ED1V01Y201102WBE001 Ch. 1 “Introduction,” Ch. 2 “Principles of Linked Data”

• †[SemanticWeb] Nikolaos Konstantinou and Dimitrios-Emmanuel Spanos. 2015. Materializing the Web of Linked Data. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-16074-0 Ch. 1 “Introduction: Linked Data and the Semantic Web”

• †Morgan & Claypool Synthesis Lectures on the Semantic Web: Theory and Technology. http://www.morganclaypool.com/toc/wbe/1/1

Web Scraping

• †[TheoryDrivenScraping] Richard N. Landers, Robert C. Brusso, Katelyn J. Cavanaugh, and Andrew B. Collmus. 2016. “A Primer on Theory-Driven Web Scraping: Automatic Extraction of Big Data from the Internet for Use in Psychological Research.” Psychological Methods 21(4): 475–492. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1037/met0000081

• ‡[Scraping-Py] Al Sweigart. 2015. Automate the Boring Stuff with Python: Practical Programming for Total Beginners. Ch. 11 “Web Scraping.” https://automatetheboringstuff.com/chapter11

• †[Scraping-R] Simon Munzert, Christian Rubba, Peter Meißner, and Dominic Nyhuis. 2014. Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining. Wiley. http://onlinelibrary.wiley.com/book/10.1002/9781118834732

• ‡CRAN: https://cran.r-project.org/web/views/WebTechnologies.html

Ethics & Scientific Responsibility in Big Social Data

Human Subjects, Consent, Privacy

• BitByBit, Ch. 6 (“Ethics”)

• ‡[PSU-ORP] Penn State Office for Research Protections (under the VP for Research):

  ◦ Human Subjects Research / IRB: https://www.research.psu.edu/irb


  ◦ Revised Common Rule: https://www.research.psu.edu/irb/commonrulechanges

  ◦ Responsible Conduct of Research: https://www.research.psu.edu/education/rcr

  ◦ Research Misconduct: https://www.research.psu.edu/researchmisconduct

  ◦ SARI@PSU (Scientific and Research Integrity Training): https://www.research.psu.edu/training/sari

• ‡[MenloReport] David Dittrich and Erin Kenneally (Center for Applied Internet Data Analysis). 2012. “The Menlo Report: Ethical Principles Guiding Information and Communication Technology Research” and companion report “Applying Ethical Principles ...” https://www.caida.org/publications/papers/2012/menlo_report_actual_formatted

• ‡[BigDataEthics-CBDES] Jacob Metcalf, Emily F. Keller, and danah boyd. 2016. “Perspectives on Big Data, Ethics, and Society.” Council for Big Data, Ethics, and Society. http://bdes.datasociety.net/council-output/perspectives-on-big-data-ethics-and-society

• ‡[AoIRReport] Annette Markham and Elizabeth Buchanan (Association of Internet Researchers). 2012. “Ethical Decision-Making and Internet Research: Recommendations of the AoIR Ethics Working Committee (Version 2.0).” https://aoir.org/reports/ethics2.pdf and guidelines chart https://aoir.org/wp-content/uploads/2017/01/aoir_ethics_graphic_2016.pdf

• ‡[BigDataEthics-Wired] Sarah Zhang. 2016. “Scientists are Just as Confused about the Ethics of Big-Data Research as You.” Wired. https://www.wired.com/2016/05/scientists-just-confused-ethics-big-data-research

• †[BigDataEthics-HerschelMori] Richard Herschel and Virginia M. Miori. 2017. “Ethics & Big Data.” Technology in Society 49: 31–36. http://doi.org/10.1016/j.techsoc.2017.03.003

The Science of Data Privacy

• †[BigDataPrivacy] Terence Craig and Mary E. Ludloff. 2011. Privacy and Big Data. O’Reilly Media. http://pensu.eblib.com/patron/FullRecord.aspx?p=781814

• ‡[DataAnalysisPrivacy] John Abowd, Lorenzo Alvisi, Cynthia Dwork, Sampath Kannan, Ashwin Machanavajjhala, Jerome Reiter. 2017. “Privacy-Preserving Data Analysis for the Federal Statistical Agencies.” A Computing Community Consortium white paper. https://arxiv.org/abs/1701.00752

• †[DataPrivacy] Stephen E. Fienberg and Aleksandra B. Slavkovic. 2011. “Data Privacy and Confidentiality.” International Encyclopedia of Statistical Science, 342–5. https://doi.org/10.1007/978-3-642-04898-2_202

• ‡[NetworksPrivacy] Vishesh Karwa and Aleksandra Slavkovic. 2016. “Inference using noisy degrees: Differentially private β-model and synthetic graphs.” Annals of Statistics 44(1): 87–112. http://projecteuclid.org/euclid.aos/1449755958

• †[DataPublishingPrivacy] Raymond Chi-Wing Wong and Ada Wai-Chee Fu. 2010. Privacy-Preserving Data Publishing: An Overview. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdfplus/10.2200/S00237ED1V01Y201003DTM002

• DataMatching, Ch. 8 “Privacy Aspects of Data Matching”

Transparency, Reproducibility, and Team Science

• ‡[10RulesforData] Alyssa Goodman, Alberto Pepe, Alexander W. Blocker, Christine L. Borgman, Kyle Cranmer, Merce Crosas, Rosanne Di Stefano, Yolanda Gil, Paul Groth, Margaret Hedstrom, David W. Hogg, Vinay Kashyap, Ashish Mahabal, Aneta Siemiginowska, and Aleksandra Slavkovic. 2014. “Ten Simple Rules for the Care and Feeding of Scientific Data.” PLoS Computational Biology 10(4): e1003542. https://doi.org/10.1371/journal.pcbi.1003542 (Note esp. curated resources for reproducible research.)


• †[Transparency] E. Miguel, C. Camerer, K. Casey, J. Cohen, K. M. Esterling, A. Gerber, R. Glennerster, D. P. Green, M. Humphreys, G. Imbens, D. Laitin, T. Madon, L. Nelson, B. A. Nosek, M. Petersen, R. Sedlmayr, J. P. Simmons, U. Simonsohn, M. Van der Laan. 2014. “Promoting Transparency in Social Science Research.” Science 343(6166): 30–1. https://doi.org/10.1126/science.1245317

• †[Reproducibility] Marcus R. Munafò, Brian A. Nosek, Dorothy V. M. Bishop, Katherine S. Button, Christopher D. Chambers, Nathalie Percie du Sert, Uri Simonsohn, Eric-Jan Wagenmakers, Jennifer J. Ware, and John P. A. Ioannidis. 2017. “A Manifesto for Reproducible Science.” Nature Human Behaviour 1: 0021 (2017). https://doi.org/10.1038/s41562-016-0021

• ‡[SoftwareCarpentry] esp. “Lessons”: http://software-carpentry.org/lessons

• DSHandbook-Py, Ch. 9 “Technical Communication and Documentation,” Ch. 15 “Software Engineering Best Practices”

• †[TeamScienceToolkit] Vogel AL, Hall KL, Fiore SM, Klein JT, Bennett LM, Gadlin H, Stokols D, Nebeling LC, Wuchty S, Patrick K, Spotts EL, Pohl C, Riley WT, Falk-Krzesinski HJ. 2013. “The Team Science Toolkit: enhancing research collaboration through online knowledge sharing.” American Journal of Preventive Medicine 45: 787–9. https://doi.org/10.1016/j.amepre.2013.09.001

• §[Databrary] Kara Hall, Robert Croyle, and Amanda Vogel. Forthcoming (2017). Advancing Social and Behavioral Health Research through Cross-disciplinary Team Science. Springer. Includes Rick O. Gilmore and Karen E. Adolph, “Open Sharing of Research Video: Breaking the Boundaries of the Research Team.” (See http://databrary.org)

• CRAN: https://cran.r-project.org/web/views/ReproducibleResearch.html

Social Bias, Fair Algorithms

• MachineBias

• ‡[EmbeddingsBias] Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. “Semantics Derived Automatically from Language Corpora Contain Human Biases.” Science. https://arxiv.org/abs/1608.07187

• ‡[Debiasing] Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, Adam Kalai. 2016. “Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings.” https://arxiv.org/abs/1607.06520

• ‡[AvoidingBias] Moritz Hardt, Eric Price, Nathan Srebro. 2016. “Equality of Opportunity in Supervised Learning.” https://arxiv.org/abs/1610.02413

• ‡[InevitableBias] Jon Kleinberg, Sendhil Mullainathan, Manish Raghavan. 2016. “Inherent Trade-Offs in the Fair Determination of Risk Scores.” https://arxiv.org/abs/1609.05807

Databases and Data Management

• NRCreport, Ch. 3 “Scaling the Infrastructure for Data Management”

• †[SQL] Jan L. Harrington. 2010. SQL Clearly Explained. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780123756978

• DSHandbook-Py, Ch. 14 “Databases”

• †[noSQL] Guy Harrison. 2015. Next Generation Databases: NoSQL, NewSQL, and Big Data. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-1329-2

• †[Cloud] Divyakant Agrawal, Sudipto Das, and Amr El Abbadi. 2012. Data Management in the Cloud: Challenges and Opportunities. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdfplus/10.2200/S00456ED1V01Y201211DTM032


• †Morgan & Claypool Synthesis Lectures on Data Management. http://www.morganclaypool.com/toc/dtm/1/1

Data Wrangling

Theoretically-structured approaches to data wrangling

• ‡[TidyData-R] Garrett Grolemund and Hadley Wickham. 2017. R for Data Science (http://r4ds.had.co.nz), esp. Ch. 12 “Tidy Data”; also Wickham. 2014. “Tidy Data.” Journal of Statistical Software 59(10). http://www.jstatsoft.org/v59/i10/paper Tools: The tidyverse, http://tidyverse.org

• ‡[DataScience-Py] Jake VanderPlas. 2016. Python Data Science Handbook (https://github.com/jakevdp/PythonDataScienceHandbook), esp. Ch. 3 on pandas: http://pandas.pydata.org

• ‡[DataCarpentry] Colin Gillespie and Robin Lovelace. 2017. Efficient R Programming. https://csgillespie.github.io/efficientR esp. Ch. 6 on “efficient data carpentry”

• §[Wrangler] Joseph M. Hellerstein, Jeffrey Heer, Tye Rattenbury, and Sean Kandel. 2017. Data Wrangling: Practical Techniques for Data Preparation. Tool: Trifacta Wrangler, http://www.trifacta.com/products/wrangler

Data wrangling practice

• DSHandbook-Py, Ch. 4 “Data Munging: String Manipulation, Regular Expressions, and Data Cleaning”

• †[DataSimplification] Jules J. Berman. 2016. Data Simplification: Taming Information with Open Source Tools. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780128037812

• †[Wrangling-Py] Jacqueline Kazil, Katharine Jarmul. 2016. Data Wrangling with Python. O’Reilly. http://proquestcombo.safaribooksonline.com.ezaccess.libraries.psu.edu/9781491948804

• †[Wrangling-R] Bradley C. Boehmke. 2016. Data Wrangling with R. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45599-0

Record Linkage, Entity Resolution, Deduplication

• †[DataMatching] Peter Christen. 2012. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-642-31164-2 (esp. Ch. 2 “The Data Matching Process”)

• DataSimplification, Chapter 5 “Identifying and Deidentifying Data”

• ‡[EntityResolution] Lise Getoor and Ashwin Machanavajjhala. 2013. “Entity Resolution for Big Data.” Tutorial, KDD. http://www.umiacs.umd.edu/~getoor/Tutorials/ER_KDD2013.pdf

• ‡[SyrianCasualties] Peter Sadosky, Anshumali Shrivastava, Megan Price, and Rebecca C. Steorts. 2015. “Blocking Methods Applied to Casualty Records from the Syrian Conflict.” https://arxiv.org/abs/1510.07714 (For more on blocking, see DataMatching, Ch. 4 “Indexing.”)

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html “Statistical Matching and Record Linkage”
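As a rough orientation to the "blocking" idea that several of the record linkage readings above discuss, the following minimal sketch compares only records that share a cheap key rather than all pairs. The toy records and the key (first surname letter plus birth year) are hypothetical illustrations, not taken from any of the listed readings.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy records; names and fields are illustrative only.
records = [
    {"id": 1, "surname": "Haddad", "year": 1980},
    {"id": 2, "surname": "Hadad", "year": 1980},
    {"id": 3, "surname": "Khoury", "year": 1975},
    {"id": 4, "surname": "Khouri", "year": 1975},
    {"id": 5, "surname": "Nasser", "year": 1990},
]


def blocking_key(record):
    """Assumed blocking key: first surname letter plus birth year."""
    return (record["surname"][0].upper(), record["year"])


def candidate_pairs(recs):
    """Return id pairs that share a blocking key, instead of all pairs."""
    blocks = defaultdict(list)
    for rec in recs:
        blocks[blocking_key(rec)].append(rec["id"])
    pairs = []
    for ids in blocks.values():
        pairs.extend(combinations(sorted(ids), 2))
    return sorted(pairs)


all_pairs = len(records) * (len(records) - 1) // 2
pairs = candidate_pairs(records)
print(f"{len(pairs)} candidate pairs instead of {all_pairs}")
```

Only the candidate pairs would then be passed to a more expensive comparison step; a poorly chosen key can miss true matches, which is the trade-off the blocking literature addresses.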

“Making up data”: Imputation, Smoothers, Kernels, Priors, Filters, Teleportation, Negative Sampling, Convolution, Augmentation, Adversarial Training


• †[Imputation] Yi Deng, Changgee Chang, Moges Seyoum Ido, and Qi Long. 2016. “Multiple Imputation for General Missing Data Patterns in the Presence of High-dimensional Data.” Scientific Reports 6: 21689. https://www.nature.com/articles/srep21689

• ‡[Overimputation] Matthew Blackwell, James Honaker, and Gary King. 2017. “A Unified Approach to Measurement Error and Missing Data: Overview and Applications.” Sociological Methods & Research 46(3): 303–341. http://gking.harvard.edu/files/gking/files/measure.pdf

• Shalizi-ADA, Sect. 1.5 (Linear Smoothers), Ch. 8 (Splines), Sect. 14.4 (Kernel Density Estimates); DataScience-Python, NB 05.13 “Kernel Density Estimation”

• FightinWords (re: Bayesian priors as additional data; impact of priors on regularization)

• DeepLearning, Section 7.5 (Data Augmentation), 7.12 (Dropout), 7.13 (Adversarial Examples), Chapter 9 (Convolutional Networks)

• †[Adversarial] Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy. 2015. “Explaining and Harnessing Adversarial Examples.” https://arxiv.org/abs/1412.6572

• See also resampling and simulation methods

• See also feature engineering, preprocessing

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html “Imputation”
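One way to read the note above about Bayesian priors acting as additional data is through pseudo-counts: a Dirichlet prior over category proportions behaves like extra observations added to each category. The sketch below is a generic illustration with hypothetical counts and a hypothetical uniform prior, not the specific model of any listed reading.

```python
def smoothed_proportions(counts, prior_counts):
    """Posterior-mean proportions under a Dirichlet prior.

    Adding prior_counts to the observed counts is equivalent to pretending
    that many extra observations were seen in each category.
    """
    total = sum(counts.values()) + sum(prior_counts.values())
    return {k: (counts[k] + prior_counts[k]) / total for k in counts}


# Hypothetical data: category "b" was never observed.
observed = {"a": 3, "b": 0, "c": 1}
prior = {"a": 1, "b": 1, "c": 1}  # assumed uniform prior, one pseudo-count each

raw = {k: v / sum(observed.values()) for k, v in observed.items()}
smoothed = smoothed_proportions(observed, prior)

print("raw:     ", raw)
print("smoothed:", smoothed)
```

The unobserved category gets a zero raw proportion but a small positive smoothed one, which is the regularizing effect of the "made up" prior data.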

(Direct) Data Representations Data Mappings

• NRCreport, Ch. 5 “Large-Scale Data Representations”

• †[PatternRecognition] M. Narasimha Murty and V. Susheela Devi. 2011. Pattern Recognition: An Algorithmic Approach. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-0-85729-495-1 Section 2.1 “Data Structures for Pattern Representation”

• ‡[InfoRetrieval] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2009. Introduction to Information Retrieval. Cambridge University Press. http://nlp.stanford.edu/IR-book Ch. 1, 2, 6 (also note slides used in their class)

• †[Algorithms] Brian Steele, John Chandler, Swarna Reddy. 2016. Algorithms for Data Science. Wiley. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45797-0 Ch. 2 “Data Mapping and Data Dictionaries”

• See also Social Data Structures

Similarity, Distance, Association, Covariance, The Kernel Trick

• PatternRecognition, Section 2.3 “Proximity Measures”

• ‡[Similarity] Brendan O’Connor. 2012. “Cosine similarity, Pearson correlation, and OLS coefficients.” https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients

• For relatively comprehensive lists, see also:

  ◦ M.-J. Lesot, M. Rifqi, and H. Benhadda. 2009. “Similarity measures for binary and numerical data: a survey.” Int. J. Knowledge Engineering and Soft Data Paradigms 1(1): 63–. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.212.6533&rep=rep1&type=pdf

  ◦ Seung-Seok Choi, Sung-Hyuk Cha, Charles C. Tappert. 2010. “A Survey of Binary Similarity and Distance Measures.” Journal of Systemics, Cybernetics & Informatics 8(1): 43–8. http://www.iiisci.org/Journal/CV$/sci/pdfs/GS315JG.pdf


  ◦ Sung-Hyuk Cha. 2007. “Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions.” International Journal of Mathematical Models and Methods in Applied Sciences 4(1): 300–7. http://csis.pace.edu/ctappert/dps/d861-12/session4-p2.pdf

  ◦ Anna Huang. 2008. “Similarity Measures for Text Document Clustering.” Proceedings of the 6th New Zealand Computer Science Research Student Conference, 49–56. http://www.nzcsrsc08.canterbury.ac.nz/site/proceedings/Individual_Papers/pg049_Similarity_Measures_for_Text_Document_Clustering.pdf

• Michael I. Jordan. “The Kernel Trick” (Lecture Notes). https://people.eecs.berkeley.edu/~jordan/courses/281B-spring04/lectures/lec3.pdf

• Eric Kim. 2017. “Everything You Ever Wanted to Know about the Kernel Trick (But Were Afraid to Ask).” http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick_blog_ekim_12_20_2017.pdf
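The relationship named in the [Similarity] entry above can be checked numerically: Pearson correlation is the cosine similarity of mean-centered vectors. The sketch below uses only the standard library; the two example vectors are hypothetical.

```python
import math


def cosine(x, y):
    """Cosine similarity: dot product divided by the product of norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def center(x):
    """Subtract the mean from every element."""
    mean = sum(x) / len(x)
    return [a - mean for a in x]


def pearson(x, y):
    """Pearson correlation as the cosine similarity of centered vectors."""
    return cosine(center(x), center(y))


# Hypothetical example vectors.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 1.0, 4.0, 3.0]

print("cosine: ", cosine(x, y))
print("pearson:", pearson(x, y))
# Shifting x changes the cosine similarity but not the correlation.
print("pearson after shifting x:", pearson([a + 10 for a in x], y))
```

Centering is what makes correlation invariant to location shifts, whereas raw cosine similarity is only invariant to rescaling.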

Derived Data Representations - Dimensionality Reduction, Compression, Decomposition, Embeddings

The groupings here and under the measurement/multivariate statistics section are particularly arbitrary. For example, “k-Means clustering” can be viewed as a technique for “dimensionality reduction,” “compression,” “feature extraction,” “latent variable measurement,” “unsupervised learning,” or “collaborative filtering.”
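To make the point about overlapping labels concrete, the minimal sketch below runs a one-dimensional k-means (Lloyd's algorithm) and then reads the result two ways: the labels are a clustering, and replacing each value by its centroid is a lossy compression (vector quantization). The data and starting centroids are hypothetical, and this plain-Python version is only illustrative, not a substitute for a library implementation.

```python
def nearest(value, centroids):
    """Index of the centroid closest to value."""
    return min(range(len(centroids)), key=lambda j: abs(value - centroids[j]))


def kmeans_1d(values, init_centroids, iterations=20):
    """Lloyd's algorithm in one dimension with fixed starting centroids."""
    centroids = list(init_centroids)
    for _ in range(iterations):
        labels = [nearest(v, centroids) for v in values]
        for j in range(len(centroids)):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = sum(members) / len(members)
    return centroids, [nearest(v, centroids) for v in values]


# Hypothetical data with three obvious groups.
values = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.1, 8.9]
centroids, labels = kmeans_1d(values, init_centroids=[0.0, 5.0, 10.0])

# "Clustering" view: one label per observation.
print("labels:", labels)

# "Compression" view: store 3 centroids plus 9 small integers
# instead of 9 floats, at the cost of some reconstruction error.
compressed = [centroids[lab] for lab in labels]
error = sum((v - c) ** 2 for v, c in zip(values, compressed))
print("reconstruction error:", error)
```

The same fitted object serves as a cluster assignment, a codebook, or a one-feature summary of each observation, depending on which output is kept.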

Clustering, hashing, quantization, blocking, compression

• ‡[Compression] Khalid Sayood. 2012. Introduction to Data Compression, 4th ed. Springer. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780124157965 (e.g., coding, blocking via vector quantization)

• MMDS, Ch. 3 “Finding Similar Items” (minhashing, locality sensitive hashing)

• Multivariate-R, Ch. 6 “Clustering”

• DataScience-Python, NB 05.11 “k-Means Clustering,” NB 05.12 “Gaussian Mixture Models”

• ‡[KMeansHashing] Kaiming He, Fang Wen, Jian Sun. 2013. “K-means Hashing: An Affinity-Preserving Quantization Method for Learning Binary Compact Codes.” CVPR. https://www.cv-foundation.org/openaccess/content_cvpr_2013/papers/He_K-Means_Hashing_An_2013_CVPR_paper.pdf

• †[Squashing] Madigan, D., Raghavan, N., Dumouchel, W., Nason, M., Posse, C., and Ridgeway, G. 2002. “Likelihood-based data squashing: A modeling approach to instance construction.” Data Mining and Knowledge Discovery 6(2): 173–190. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1023/A:1014095614948

• †[Core-sets] Piotr Indyk, Sepideh Mahabadi, Mohammad Mahdian, Vahab S. Mirrokni. 2014. “Composable core-sets for diversity and coverage maximization.” PODS ’14, 100–8. https://doi.org/10.1145/2594538.2594560

Feature selection, feature extraction, feature engineering, weighting, preprocessing

• PatternRecognition, Sections 2.6–7 “Feature Selection, Feature Extraction”

• DSHandbook-Py, Ch. 7 “Interlude: Feature Extraction Ideas”

• DataScience-Python, NB 05.04 “Feature Engineering”

• Features in text: NLP and InfoRetrieval re: tf.idf and similar; FightinWords

• Features in images: DataScience-Python, NB 05.14 “Image Features”


Dimensionality reduction, decomposition, factorization, change of basis, reparameterization, matrix completion, latent variables, source separation

• MMDS, Ch. 9 “Recommendation Systems,” Ch. 11 “Dimensionality Reduction”

• DSHandbook-Py, Ch. 10 “Unsupervised Learning: Clustering and Dimensionality Reduction”

• ‡[Shalizi-ADA] Cosma Rohilla Shalizi. 2017. Advanced Data Analysis from an Elementary Point of View. Ch. 16 (“Principal Components Analysis”), Ch. 17 (“Factor Models”). http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV

See also Multivariate-R, Ch. 3 “Principal Components Analysis,” Ch. 4 “Multidimensional Scaling,” Ch. 5 “Exploratory Factor Analysis,” Latent

• DeepLearning, Ch. 2 “Linear Algebra,” Ch. 13 “Linear Factor Models”

• ‡[GloVe] Jeffrey Pennington, Richard Socher, Christopher Manning. 2014. “GloVe: Global vectors for word representation.” EMNLP. https://nlp.stanford.edu/projects/glove

• †[NMF] Daniel D. Lee and H. Sebastian Seung. 1999. “Learning the parts of objects by non-negative matrix factorization.” Nature 401: 788–791. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1038/44565

• †[CUR] Michael W. Mahoney and Petros Drineas. 2009. “CUR matrix decompositions for improved data analysis.” PNAS. http://www.pnas.org/content/106/3/697.full

• ‡[ICA] Aapo Hyvärinen and Erkki Oja. 2000. “Independent Component Analysis: Algorithms and Applications.” Neural Networks 13(4-5): 411–30. https://www.cs.helsinki.fi/u/ahyvarin/papers/NN00new.pdf

• [RandomProjection] Ella Bingham and Heikki Mannila. 2001. “Random projection in dimensionality reduction: applications to image and text data.” KDD. https://doi.org/10.1145%2F502512.502546

• Compression, e.g., “Transform coding,” “Wavelets”

Nonlinear dimensionality reduction, Manifold learning

• DeepLearning, Section 5.11.3 “Manifold Learning”

• DataScience-Python, NB 05.10 “Manifold Learning” (Locally linear embedding [LLE], Isomap)

• ‡[KernelPCA] Sebastian Raschka. 2014. “Kernel tricks and nonlinear dimensionality reduction via RBF kernel PCA.” http://sebastianraschka.com/Articles/2014_kernel_pca.html

• Shalizi-ADA, Ch. 18 “Nonlinear Dimensionality Reduction” (LLE)

• ‡[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi. 2003. “Laplacian eigenmaps for dimensionality reduction and data representation.” Neural Computation 15(6): 1373–1396. http://web.cse.ohio-state.edu/~belkin.8/papers/LEM_NC_03.pdf

• DeepLearning, Ch. 14 (“Autoencoders”)

• ‡[word2vec] Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” https://arxiv.org/abs/1301.3781

• ‡[word2vecExplained] Yoav Goldberg and Omer Levy. 2014. “word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method.” https://arxiv.org/abs/1402.3722

• ‡[t-SNE] Laurens van der Maaten and Geoffrey Hinton. 2008. “Visualizing Data using t-SNE.” Journal of Machine Learning Research 9: 2579–2605. http://jmlr.csail.mit.edu/papers/volume9/vandermaaten08a/vandermaaten08a.pdf


Computation and Scaling Up

Scientific computing, computation at scale

• NRCreport, Ch. 10 “The Seven Computational Giants of Massive Data Analysis”

• DSHandbook-Py, Ch. 21 “Performance and Computer Memory,” Ch. 22 “Computer Memory and Data Structures”

• ‡[ComputationalStatistics-Python] Cliburn Chan. Computational Statistics in Python. https://people.duke.edu/~ccc14/sta-663/index.html

• CRAN Task View: High Performance Computing. https://cran.r-project.org/web/views/HighPerformanceComputing.html

Numerical computing

• [NumericalComputing] Ward Cheney and David Kincaid. 2013. Numerical Mathematics and Computing, 7th ed. Brooks/Cole, Cengage Learning.

• CRAN Task View: Numerical Mathematics. https://cran.r-project.org/web/views/NumericalMathematics.html

Optimization (e.g., MLE, gradient descent, stochastic gradient descent, EM algorithm, neural nets)

• DSHandbook-Py, Ch. 23 “Maximum Likelihood Estimation and Optimization” (gradient descent)

• DeepLearning, Ch. 4 “Numerical Computation,” Ch. 6 “Deep Feedforward Networks,” Ch. 8 “Optimization for Training Deep Models”

• CRAN Task View: Optimization and Mathematical Programming. https://cran.r-project.org/web/views/Optimization.html

Linear algebra, matrix computations

• [MatrixComputations] Gene H. Golub and Charles F. Van Loan. 2013. Matrix Computations, 4th ed. Johns Hopkins University Press.

• ‡[NetflixMatrix] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. “Matrix Factorization Techniques for Recommender Systems.” Computer, August: 42–9. https://datajobs.com/data-science-repo/Recommender-Systems-[Netflix].pdf

• ‡[BigDataPCA] Jianqing Fan, Qiang Sun, Wen-Xin Zhou, Ziwei Zhu. “Principal Component Analysis for Big Data.” http://www.princeton.edu/~ziweiz/pca.pdf

• ‡[RandomSVD] Andrew Tulloch. 2009. “Fast Randomized SVD.” https://research.fb.com/fast-randomized-svd

• ‡[FactorSGD] Rainer Gemulla, Peter J. Haas, Erik Nijkamp, and Yannis Sismanis. 2011. “Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent.” KDD. http://www.cs.utah.edu/~hari/teaching/bigdata/gemulla11dsgd.pdf

• ‡[Sparse] Max Grossman. 2015. “101 Ways to Store a Sparse Matrix.” https://medium.com/@jmaxg3/101-ways-to-store-a-sparse-matrix-c7f2bf15a229

• ‡[DontInvert] John D. Cook. 2010. “Don’t Invert that Matrix.” https://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix

Simulation-based inference, resampling, Monte Carlo methods, MCMC, Bayes, approximate inference


• ‡[MarkovVisually] Victor Powell. “Markov Chains Explained Visually.” http://setosa.io/ev/markov-chains

• DSHandbook-Py, Ch. 25 “Stochastic Modeling” (Markov chains, MCMC, HMM)

• Bayes ComputationalStatistics-Py

• DeepLearning, Ch. 17 “Monte Carlo Methods,” Ch. 19 “Approximate Inference”

• ‡[VariationalInference] Jason Eisner. 2011. “High-Level Explanation of Variational Inference.” https://www.cs.jhu.edu/~jason/tutorials/variational.html

• CRAN Task View: Bayesian Inference. https://cran.r-project.org/web/views/Bayesian.html

Parallelism, MapReduce, Split-Apply-Combine

• ‡[MapReduceIntuition] Jigsaw Academy. 2014. “Big Data Specialist: MapReduce.” https://www.youtube.com/watch?v=TwcYQzFqg-8&feature=youtu.be

• MMDS, Ch. 2 “Map-Reduce and the New Software Stack” (3rd Edition discusses Spark & TensorFlow)

• Algorithms, Ch. 3 “Scalable Algorithms and Associative Statistics,” Ch. 4 “Hadoop and MapReduce”

• NRCreport, Ch. 6 “Resources, Trade-offs, and Limitations”

• TidyData-R (Split-apply-combine is the motivating principle behind the “tidyverse” approach.) See also Part III “Program” (pipes, functions, vectors, iteration).

• ‡[TidySAC-Video] Hadley Wickham. 2017. “Data Science in the Tidyverse.” https://www.rstudio.com/resources/videos/data-science-in-the-tidyverse

• See also FactorSGD
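For readers new to the pattern shared by MapReduce and split-apply-combine, the following single-process sketch counts words in three hypothetical documents using explicit map, shuffle (group by key), and reduce steps. It illustrates the logic only and makes no claim about how Hadoop, Spark, or the tidyverse implement it.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical input "documents".
documents = ["big social data", "social data analytics", "big data"]


def map_phase(doc):
    """Map: emit a (key, value) pair for every word in one document."""
    return [(word, 1) for word in doc.split()]


def shuffle(pairs):
    """Shuffle / split: group all values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce / apply-combine: collapse each group to a single value."""
    return {key: reduce(lambda a, b: a + b, values) for key, values in groups.items()}


# Each map call is independent, so in a real system the documents
# could be processed on different machines before the shuffle.
mapped = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(shuffle(mapped))
print(word_counts)
```

Because addition is associative, the reduce step could itself be split across workers and merged afterwards, which is the property that makes this pattern scale.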

Functional Programming

• ComputationalStatistics-Python, “Functions are first class objects” through first exercises

• DSHandbook-Py, Ch. 20 “Programming Language Concepts”

• See also Haskell

Scaling iteration, streaming data, online algorithms (Spark)

• NRCreport, Ch. 4 “Temporal Data and Real-Time Algorithms”

• MMDS, Ch. 4 “Mining Data Streams”

• ‡[BDAS] AMPLab. BDAS: The Berkeley Data Analytics Stack. https://amplab.cs.berkeley.edu/software

• †[Spark] Mohammed Guller. 2015. Big Data Analytics with Spark: A Practitioner’s Guide to Using Spark for Large-Scale Data Processing, Machine Learning, and Graph Analytics, and High-Velocity Data Stream Processing. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-0964-6

• DSHandbook-Py, Ch. 13 “Big Data”

• DeepLearning, Ch. 10 “Sequence Modeling: Recurrent and Recursive Nets”

General Resources for Python and R


• †[DSHandbook-Py] Field Cady. 2017. The Data Science Handbook (Python-based). Wiley. http://onlinelibrary.wiley.com.ezaccess.libraries.psu.edu/book/10.1002/9781119092919

• ‡[Tutorials-R] Ujjwal Karn. 2017. “A curated list of R tutorials for Data Science, NLP and Machine Learning.” https://github.com/ujjwalkarn/DataScienceR

• ‡[Tutorials-Python] Ujjwal Karn. 2017. “A curated list of Python tutorials for Data Science, NLP and Machine Learning.” https://github.com/ujjwalkarn/DataSciencePython

Cutting and Bleeding Edge of Data Science Languages

• [Scala] †Vishal Layka and David Pollak. 2015. Beginning Scala. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-0232-6 †Patrick R. Nicolas. 2014. Scala for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1901910 ‡Scala site: https://www.scala-lang.org (See also )

• [Julia] †Ivo Balbaert. 2015. Getting Started with Julia. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1973847 ‡Douglas Bates. 2013. “Julia for R Programmers.” http://www.stat.wisc.edu/~bates/JuliaForRProgrammers.pdf ‡Julia site: https://julialang.org

• [Haskell] †Richard Bird. 2014. Thinking Functionally with Haskell. https://doi-org.ezaccess.libraries.psu.edu/10.1017/CBO9781316092415 †Hakim Cassimally. 2017. Learning Haskell Programming. https://www.lynda.com/Haskell-tutorials/Learning-Haskell-Programming/604926-2.html †James Church. 2017. Learning Haskell for Data Analysis. https://www.lynda.com/Developer-tutorials/Learning-Haskell-Data-Analysis/604234-2.html Haskell site: https://www.haskell.org

• [Clojure] †Mark McDonnell. 2017. Quick Clojure: Essential Functional Programming. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2952-1 †Akhil Wali. 2014. Clojure for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1674848 †Arthur Ulfeldt. 2015. https://www.lynda.com/Clojure-tutorials/Up-Running-Clojure/413127-2.html ‡Clojure site: https://clojure.org

• [TensorFlow] ‡Abadi et al. (Google). 2016. “TensorFlow: A System for Large-Scale Machine Learning.” https://arxiv.org/abs/1605.08695 ‡Udacity, “Deep Learning.” https://www.udacity.com/course/deep-learning--ud730 ‡TensorFlow site: https://www.tensorflow.org

• [H2O] Darren Cook. 2016. Practical Machine Learning with H2O: Powerful, Scalable Techniques for Deep Learning and AI. O’Reilly. Arno Candel and Viraj Parmar. 2015. Deep Learning with H2O. ‡H2O site: https://www.h2o.ai

Social Data Structures

Space and Time

• †[GIA] David O’Sullivan, David J. Unwin. 2010. Geographic Information Analysis, Second Edition. John Wiley & Sons. http://onlinelibrary.wiley.com/book/10.1002/9780470549094

• [Space-Time] Donna J. Peuquet. 2003. Representations of Space and Time. Guilford.

• Roger S. Bivand, Edzer Pebesma, Virgilio Gómez-Rubio. 2013. Applied Spatial Data Analysis in R. Free through library: http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-7618-4 See also Edzer Pebesma. 2016. “Handling and Analyzing Spatial, Spatiotemporal and Movement Data in R.” https://edzer.github.io/UseR2016

• ‡[PySAL] Sergio J. Rey and Dani Arribas-Bel. 2016. “Geographic Data Science with PySAL and the pydata Stack.” http://darribas.org/gds_scipy16


• DataScience-Python, NB 04.13 “Geographic Data with Basemap”

• Time: “Longitudinal” intra-individual data, as common in developmental psychology. †[Longitudinal] Garrett Fitzmaurice, Marie Davidian, Geert Verbeke, and Geert Molenberghs, editors. 2009. Longitudinal Data Analysis. Chapman & Hall / CRC. http://pensu.eblib.com/patron/FullRecord.aspx?p=359998

• Time: “Time Series, Time Series Cross Section, Panel,” as common in economics, political science, and sociology. DataScience-Python, Ch. 3 re: pandas; DSHandbook-Py, Ch. 17 “Time Series Analysis”; Econometrics; GelmanHill

• Time: “Sequential, Data Streams,” as in NLP, machine learning. Spark and other “streaming data” readings.

• Space-Time: Data with continuity, connectivity, neighborhood structure. See DeepLearning, Chapter 9 re: convolution.

• CRAN Task Views: Spatial, https://cran.r-project.org/web/views/Spatial.html; Spatiotemporal, https://cran.r-project.org/web/views/SpatioTemporal.html; Time Series, https://cran.r-project.org/web/views/TimeSeries.html

Network Graphs

• ‡[Networks] David Easley and Jon Kleinberg. 2010. Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press. http://www.cs.cornell.edu/home/kleinber/networks-book

• MMDS, Ch. 5 “Link Analysis,” Ch. 10 “Mining Social-Network Graphs”

• †[Networks-R] Douglas Luke. 2015. A User’s Guide to Network Analysis in R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-3-319-23883-8

• †[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry. 2017. Python for Graph and Network Analysis. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-53004-8

Hierarchy, Clustered Data, Aggregation, Mixed Models

• †[GelmanHill] Andrew Gelman and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press. http://pensu.eblib.com/patron/FullRecord.aspx?p=288457

• †[MixedModels] Eugene Demidenko. 2013. Mixed Models: Theory and Applications with R, 2nd ed. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10748641

Social Data Channels

Language, Text, Speech, Audio

• ‡[NLP] Daniel Jurafsky and James H. Martin. Forthcoming (2017). Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics, 3rd ed. Prentice-Hall. Preprint: https://web.stanford.edu/~jurafsky/slp3

• InfoRetrieval

• ‡[CoreNLP] Stanford CoreNLP. https://stanfordnlp.github.io/CoreNLP

• †[TextAsData] Grimmer and Stewart. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis. https://doi.org/10.1093/pan/mps028

• Quinn-Topics

• FightinWords

• †[NLTK-Python] Jacob Perkins. 2010. Python Text Processing with NLTK 2.0 Cookbook. Packt. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10435387

• †[TextAnalytics-Python] Dipanjan Sarkar. 2016. Text Analytics with Python. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2388-8

• DSHandbook-Py Ch. 16, “Natural Language Processing”

• Matthew J. Denny. http://www.mjdenny.com/Text_Processing_In_R.html

• CRAN Natural Language Processing: https://cran.r-project.org/web/views/NaturalLanguageProcessing.html

• †[AudioData] Dean Knox and Christopher Lucas. 2017. “The Speaker-Affect Model: Measuring Emotion in Political Speech with Audio Data.” http://christopherlucas.org/files/PDFs/sam.pdf; https://www.youtube.com/watch?v=Hs8A9dwkMzI

• †Morgan & Claypool Synthesis Lectures on Human Language Technologies. http://www.morganclaypool.com/toc/hlt/1/1

• †Morgan & Claypool Synthesis Lectures on Speech and Audio Processing. http://www.morganclaypool.com/toc/sap/1/1

Vision, Image, Video

• ‡[ComputerVision] Richard Szeliski. 2010. Computer Vision: Algorithms and Applications. Springer. http://szeliski.org/Book

• MixedModels Ch. 11, “Statistical Analysis of Shape”; Ch. 12, “Statistical Image Analysis”

• Audio/Video/Volumetric: See DeepLearning, Chapter 9, re: convolution

• †Morgan & Claypool Synthesis Lectures on Image, Video, and Multimedia Processing. http://www.morganclaypool.com/toc/ivm/1/1

• †Morgan & Claypool Synthesis Lectures on Computer Vision. http://www.morganclaypool.com/toc/cov/1/1

Approaches to Learning from Data (The Analytics Layer)

• NRCreport Ch. 7, “Building Models from Massive Data”

• MMDS (data-mining)

• CausalInference

• ‡[VisualAnalytics] Daniel Keim, Jörn Kohlhammer, Geoffrey Ellis, and Florian Mansmann, editors. 2010. Mastering the Information Age: Solving Problems with Visual Analytics. The Eurographics Association, Goslar, Germany. http://www.vismaster.eu/book. See also †Morgan & Claypool Synthesis Lectures on Visualization. http://www.morganclaypool.com/toc/vis/2/1

• ‡[Econometrics] Bruce E. Hansen. 2014. Econometrics. http://www.ssc.wisc.edu/~bhansen/econometrics

• ‡[StatisticalLearning] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning with Applications in R. Springer. http://www-bcf.usc.edu/~gareth/ISL/index.html

• [MachineLearning] Christopher Bishop. 2006. Pattern Recognition and Machine Learning. Springer.

• †[Bayes] Peter D. Hoff. 2009. A First Course in Bayesian Statistical Methods. Springer. http://link.springer.com/book/10.1007%2F978-0-387-92407-6

• †[GraphicalModels] Søren Højsgaard, David Edwards, Steffen Lauritzen. 2012. Graphical Models with R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-2299-0

• [InformationTheory] David MacKay. 2003. Information Theory, Inference, and Learning Algorithms. Springer.

• ‡[DeepLearning] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org (see also Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. “Deep Learning.” Nature 521(7553): 436–444. https://doi.org/10.1038/nature14539)

See also ‡[Keras] Keras: The Python Deep Learning Library. https://keras.io; François Chollet, Directory of Keras tutorials: https://github.com/fchollet/keras-resources

• †[SignalProcessing] Jose Maria Giron-Sierra. 2013. Digital Signal Processing with MATLAB Examples (Volumes 1-3). Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-981-10-2534-1

Penn State Policy Statements

Academic Integrity

Academic integrity is the pursuit of scholarly activity in an open, honest, and responsible manner. Academic integrity is a basic guiding principle for all academic activity at The Pennsylvania State University, and all members of the University community are expected to act in accordance with this principle. Consistent with this expectation, the University’s Code of Conduct states that all students should act with personal integrity; respect other students’ dignity, rights, and property; and help create and maintain an environment in which all can succeed through the fruits of their efforts.

Academic integrity includes a commitment by all members of the University community not to engage in or tolerate acts of falsification, misrepresentation, or deception. Such acts of dishonesty violate the fundamental ethical principles of the University community and compromise the worth of work completed by others.

Disability Accommodation

Penn State welcomes students with disabilities into the University’s educational programs. Every Penn State campus has an office for students with disabilities. The Student Disability Resources (SDR) website provides contact information for every Penn State campus (http://equity.psu.edu/sdr/disability-coordinator). For further information, please visit the Student Disability Resources website (http://equity.psu.edu/sdr).

In order to receive consideration for reasonable accommodations, you must contact the appropriate disability services office at the campus where you are officially enrolled, participate in an intake interview, and provide documentation. See documentation guidelines at http://equity.psu.edu/sdr/guidelines. If the documentation supports your request for reasonable accommodations, your campus disability services office will provide you with an accommodation letter. Please share this letter with your instructors and discuss the accommodations with them as early as possible. You must follow this process for every semester that you request accommodations.

Psychological Services

Many students at Penn State face personal challenges or have psychological needs that may interfere with their academic progress, social development, or emotional wellbeing. The university offers a variety of confidential services to help you through difficult times, including individual and group counseling, crisis intervention, consultations, online chats, and mental health screenings. These services are provided by staff who welcome all students and embrace a philosophy respectful of clients’ cultural and religious backgrounds, and sensitive to differences in race, ability, gender identity, and sexual orientation.

• Counseling and Psychological Services at University Park (CAPS) (http://studentaffairs.psu.edu/counseling): 814-863-0395

• Penn State Crisis Line (24 hours/7 days/week): 877-229-6400

• Crisis Text Line (24 hours/7 days/week): Text LIONS to 741741

Educational Equity

Penn State takes great pride in fostering a diverse and inclusive environment for students, faculty, and staff. Consistent with University Policy AD29, students who believe they have experienced or observed a hate crime, an act of intolerance, discrimination, or harassment that occurs at Penn State are urged to report these incidents as outlined on the University’s Report Bias webpage (http://equity.psu.edu/reportbias).

Background Knowledge

Students will have a variety of backgrounds. Prior to beginning interdisciplinary coursework to fulfill Social Data Analytics degree requirements, including SoDA 501 and 502, students are expected to have advanced (graduate) training in at least one of the component areas of Social Data Analytics and a familiarity with basic concepts in the others.

With regard to specialization, students are expected to have advanced (graduate) training in ONE of the following:

• quantitative social science methodology and a discipline of social science (as would be the case for a second-year PhD student in Political Science, Sociology, Criminology, Human Development and Family Studies, or Demography), OR

• statistics (as would be the case for a second-year PhD student in Statistics), OR

• information science or informatics (as would be the case for a second-year PhD student in Information Science and Technology or a second-year PhD student in Geography specializing in GIScience), OR

• computer science (as would be the case for a second-year PhD student in Computer Science and Engineering).

This requirement is met as a matter of meeting home program requirements for students in the dual-title PhD, but may require additional coursework on the part of students in other programs wishing to pursue the graduate minor.

With regard to general preparation, students are expected to have ALL of the following technical knowledge:

• basic programming skills, including basic facility with R &/or Python, AND

• basic knowledge of relational databases &/or geographic information systems, AND

• basic knowledge of probability, applied statistics, &/or social science research design, AND

• basic familiarity with a substantive or theoretical area of social science (e.g., 300-level coursework in political science, sociology, criminology, human development, psychology, economics, communication, anthropology, human geography, social informatics, or similar fields).

It is not unusual for students to have one or more gaps in this preparation at time of application to the SoDA program. Students should work with Social Data Analytics advisers to develop a plan for timely remediation of any deficiencies, which generally will not require formal coursework for students whose training and interests are otherwise appropriate for pursuit of the Social Data Analytics degree. Where possible, this will be addressed at time of application to the Social Data Analytics program.

To this end, some free training materials are linked in the reference section, and there are also abundant free, high-quality, self-paced, course-style training materials on these and related subjects available through edX (http://www.edx.org), Udacity (http://www.udacity.com), Codecademy (http://codecademy.com), and Lynda (http://www.lynda.psu).

Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. 2016. “Machine Bias.” ProPublica, May 23. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing

Julia Angwin, Madeleine Varner, and Ariana Tobin. 2017. “Facebook Enabled Advertisers to Reach Jew Haters.” https://www.propublica.org/article/facebook-enabled-advertisers-to-reach-jew-haters

• ‡[Polling2016] Doug Rivers (Nov. 11, 2016). “First Thoughts on Polling Problems in the 2016 US Elections.” https://today.yougov.com/news/2016/11/11/first-thoughts-polling-problems-2016-us-elections

• †[EventData] Wei Wang, Ryan Kennedy, David Lazer, Naren Ramakrishnan. 2016. “Growing Pains for Global Monitoring of Societal Events.” Science 353(6307), pp. 1502–1503. https://doi.org/10.1126/science.aaf6758

• ‡[GoogleBooks] Eitan Adam Pechenick, Christopher M. Danforth, and Peter Sheridan Dodds. 2015. “Characterizing the Google Books Corpus: Strong Limits to Inferences of Socio-Cultural and Linguistic Evolution.” PLoS One. https://doi.org/10.1371/journal.pone.0137041

• ‡[OkCupid] Michael Zimmer. 2016. “OkCupid Study Reveals the Perils of Big-Data Science.” Wired, May 14. https://www.wired.com/2016/05/okcupid-study-reveals-perils-big-data-science

• †[CriticalQuestions] danah boyd and Kate Crawford. 2011. “Critical Questions for Big Data: Provocations for a Cultural, Technological, and Scholarly Phenomenon.” Information, Communication & Society 15(5): 662–79. http://dx.doi.org/10.1080/1369118X.2012.678878

• ‡[RacistBot] Daniel Victor. 2016. “Microsoft created a Twitter bot to learn from users. It quickly became a racist jerk.” New York Times. https://www.nytimes.com/2016/03/25/technology/microsoft-created-a-twitter-bot-to-learn-from-users-it-quickly-became-a-racist-jerk.html

• MMDS 1.2.2-1.2.3 on Bonferroni

• Also BDSS-Census

Research Design and Measurement

Overviews

• BitByBit

• ‡[ResearchMethodsKB] William M. Trochim. 2006. The Research Methods Knowledge Base. http://www.socialresearchmethods.net/kb

• [7Rules] Glenn Firebaugh. 2008. Seven Rules for Social Research. Princeton University Press.

Measurement Reliability and Validity

• ResearchMethodsKB, “Measurement”

• 7Rules Ch. 3, “Build Reality Checks into Your Research”

• Reliability: see also FightinWords

• Validity: see also Quinn10-Topics, Monroe-5Vs

Indirect, Unobtrusive, Nonreactive Measures, Data Exhaust

• †[UnobtrusiveMeasures] Raymond M. Lee. 2015. “Unobtrusive Measures.” Oxford Bibliographies. https://doi.org/10.1093/OBO/9780199846740-0048 (Canonical cite is Eugene J. Webb. 1966. Unobtrusive Measures; or Webb, Donald T. Campbell, Richard D. Schwartz, Lee Sechrest. 1999. Unobtrusive Measures, rev. ed. Sage.)

• BitByBit: Examples in Chapter 2

Multiple Measures, Latent Variable Measurement

• 7Rules Ch. 4, “Replicate Where Possible”

• †[LatentVariables] David J. Bartholomew, Martin Knott, and Irini Moustaki. 2011. Latent Variable Models and Factor Analysis: A Unified Approach. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10483308 (esp. Ch. 1, “Basic ideas and examples”)

• †[Multivariate-R] Brian Everitt and Torsten Hothorn. 2011. An Introduction to Applied Multivariate Analysis with R. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4419-9650-3

• Shalizi-ADA Ch. 17, “Factor Models”

• ‡[NetflixPrize] Edwin Chen. 2011. “Winning the Netflix Prize: A Summary.”

• ‡CRAN: https://cran.r-project.org/web/views/Multivariate.html, https://cran.r-project.org/web/views/Psychometrics.html, https://cran.r-project.org/web/views/Cluster.html

Sampling and Survey Design

• †[Sampling] Steven K. Thompson. 2012. Sampling, 3rd ed. http://sk8es4mc2l.search.serialssolutions.com/?sid=sersol&SS_jc=TC_024492330&title=Wiley%20Desktop%20Editions%20%3A%20Sampling

• ResearchMethodsKB, “Sampling”

• NRCreport Ch. 8, “Sampling and Massive Data”

• BitByBit Ch. 3, “Asking Questions”

• ‡[MSE] Daniel Manrique-Vallier, Megan E. Price, and Anita Gohdes. 2013. In Seybolt et al. (eds.), Counting Civilians: “Multiple Systems Estimation Techniques for Estimating Casualties in Armed Conflicts.” Preprint: http://citeseerx.ist.psu.edu/viewdoc/download?doi=1011469939&rep=rep1&type=pdf

• †[NetworkSampling] Ted Mouw and Ashton M. Verdery. 2012. “Network Sampling with Memory: A Proposal for More Efficient Sampling from Social Networks.” Sociological Methodology 42(1): 206–56. https://dx.doi.org/10.1177%2F0081175012461248

• See also FightinWords (re: hidden heteroskedasticity in sample variance)

• ‡[HashDontSample] Mudit Uppal. 2016. “Probabilistic data structures in the Big data world (+ code)” (re: “Hash, don’t sample”). https://medium.com/@muppal/probabilistic-data-structures-in-the-big-data-world-code-b9387cff0c55

• MMDS: Hash Functions (1.3.2); Sampling in streams (4.2)

• ‡CRAN: https://cran.r-project.org/web/views/OfficialStatistics.html (“Complex Survey Design,” “Small Area Estimation”)

Experimental and Observational Designs for Causal Inference

• ‡[CausalInference] Miguel A. Hernán, James M. Robins. Forthcoming (2017). Causal Inference. Chapman & Hall/CRC. Preprint: https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book. Ch. 1, “A definition of causal effect”; Ch. 2, “Randomized experiments”; Ch. 3, “Observational studies.” (See Ch. 6, “Graphical representation of causal effects,” for integration with Judea Pearl approach.)

• BitByBit Chapter 4, “Running Experiments”; Ch. 2: natural experiments in observable data, examples

• ResearchMethodsKB, “Design”

• Firebaugh-7Rules Ch. 2, “Look for Differences that Make a Difference, and Report Them”; §Ch. 5, “Compare Like with Like”; Ch. 6, “Use Panel Data to Study Individual Change and Repeated Cross-Section Data to Study Social Change”

• Shalizi-ADA Part IV, “Causal Inference”

• Monroe-No

• CRAN: https://cran.r-project.org/web/views/ExperimentalDesign.html

Technologies for Primary and Secondary Data Collection

Mobile Devices, Distributed Sensors, Wearable Sensors, Remote Sensing

• †[RealityMining] Nathan Eagle and Alex (Sandy) Pentland. 2006. “Reality Mining: Sensing Complex Social Systems.” Personal and Ubiquitous Computing 10(4): 255–68. https://doi.org/10.1007/s00779-005-0046-3

• †[QuantifiedSelf] Melanie Swan. 2013. “The Quantified Self: Fundamental Disruption in Big Data Science and Biological Discovery.” Big Data 1(2): 85–99. https://doi.org/10.1089/big.2012.0002

• †[SensorData] Charu C. Aggarwal (Ed.). 2013. Managing and Mining Sensor Data. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-6309-2 Ch. 1, “An Introduction to Sensor Data Analytics”

• †[NightLights] Thushyanthan Baskaran, Brian Min, Yogesh Uppal. 2015. “Election cycles and electricity provision: Evidence from a quasi-experiment with Indian special elections.” Journal of Public Economics 126: 64-73. https://doi.org/10.1016/j.jpubeco.2015.03.011

Crowdsourcing, Human Computation, Citizen Science, Web Experiments

• †[Crowdsourcing] Daren C. Brabham. 2013. Crowdsourcing. MIT Press. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10692208 “Introduction”; Ch. 1, “Concepts, Theories, and Cases of Crowdsourcing”

• BitByBit Ch. 5, “Creating Mass Collaboration”

• NRCreport Ch. 9, “Human Interaction with Data”

• †[MTurk] Krista Casler, Lydia Bickel, and Elizabeth Hackett. 2013. “Separate but Equal? A Comparison of Participants and Data Gathered via Amazon’s MTurk, Social Media, and Face-to-Face Behavioral Testing.” Computers in Human Behavior 29(6): 2156–60. http://doi.org/10.1016/j.chb.2013.05.009

• †[LabintheWild] Katharina Reinecke and Krzysztof Z. Gajos. 2015. “LabintheWild: Conducting Large-Scale Online Experiments With Uncompensated Samples.” In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing (CSCW ’15), 1364–1378. http://dx.doi.org/10.1145/2675133.2675246

• †[TweetmentEffects] Kevin Munger. 2017. “Tweetment Effects on the Tweeted: Experimentally Reducing Racist Harassment.” Political Behavior 39(3): 629–49. http://doi.org/10.1016/j.chb.2013.05.009

• †[HumanComputation] Edith Law and Luis von Ahn. 2011. Human Computation. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdf/10.2200/S00371ED1V01Y201107AIM013

Open Data, File Formats, APIs, Semantic Web, Linked Data

• DSHandbook-Py Ch. 12, “Data Encodings and File Formats”

• ‡[OpenData] Open Data Institute. “What Is Open Data?” https://theodi.org/what-is-open-data

• ‡[OpenDataHandbook] Open Knowledge International. The Open Data Handbook. http://opendatahandbook.org. Includes appendix “File Formats”: http://opendatahandbook.org/guide/en/appendices/file-formats

• ‡[APIs] Brian Cooksey. 2016. An Introduction to APIs. https://zapier.com/learn/apis

• ‡[APIMarkets] RapidAPI / Mashape API marketplaces: https://docs.rapidapi.com, https://market.mashape.com; ProgrammableWeb: https://www.programmableweb.com

• †[LinkedData] Tom Heath and Christian Bizer. 2011. Linked Data: Evolving the Web into a Global Data Space. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/abs/10.2200/S00334ED1V01Y201102WBE001 Ch. 1, “Introduction”; Ch. 2, “Principles of Linked Data”

• †[SemanticWeb] Nikolaos Konstantinou and Dimitrios-Emmanuel Spanos. 2015. Materializing the Web of Linked Data. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-16074-0 Ch. 1, “Introduction: Linked Data and the Semantic Web”

• †Morgan & Claypool Synthesis Lectures on the Semantic Web: Theory and Technology. http://www.morganclaypool.com/toc/wbe/1/1/1

Web Scraping

• †[TheoryDrivenScraping] Richard N. Landers, Robert C. Brusso, Katelyn J. Cavanaugh, and Andrew B. Collmus. 2016. “A Primer on Theory-Driven Web Scraping: Automatic Extraction of Big Data from the Internet for Use in Psychological Research.” Psychological Methods 4: 475–492. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1037/met0000081

• ‡[Scraping-Py] Al Sweigart. 2015. Automate the Boring Stuff with Python: Practical Programming for Total Beginners. Ch. 11, “Web Scraping.” https://automatetheboringstuff.com/chapter11

• †[Scraping-R] Simon Munzert, Christian Rubba, Peter Meißner, and Dominic Nyhuis. 2014. Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining. Wiley. http://onlinelibrary.wiley.com/book/10.1002/9781118834732

• ‡CRAN: https://cran.r-project.org/web/views/WebTechnologies.html

Ethics & Scientific Responsibility in Big Social Data

Human Subjects, Consent, Privacy

• BitByBit Ch. 6 (“Ethics”)

• ‡[PSU-ORP] Penn State Office for Research Protections (under the VP for Research):

Human Subjects Research / IRB: https://www.research.psu.edu/irb

Revised Common Rule: https://www.research.psu.edu/irb/commonrulechanges

Responsible Conduct of Research: https://www.research.psu.edu/education/rcr

Research Misconduct: https://www.research.psu.edu/researchmisconduct

SARI @ PSU (Scientific and Research Integrity Training): https://www.research.psu.edu/training/sari

• ‡[MenloReport] David Dittrich and Erin Kenneally (Center for Applied Internet Data Analysis). 2012. “The Menlo Report: Ethical Principles Guiding Information and Communication Technology Research” and companion report “Applying Ethical Principles ...” https://www.caida.org/publications/papers/2012/menlo_report_actual_formatted

• ‡[BigDataEthics-CBDES] Jacob Metcalf, Emily F. Keller, and danah boyd. 2016. “Perspectives on Big Data, Ethics, and Society.” Council for Big Data, Ethics, and Society. http://bdes.datasociety.net/council-output/perspectives-on-big-data-ethics-and-society

• ‡[AoIRReport] Annette Markham and Elizabeth Buchanan (Association of Internet Researchers). 2012. “Ethical Decision-Making and Internet Research: Recommendations of the AoIR Ethics Working Committee (Version 2.0).” https://aoir.org/reports/ethics2.pdf and guidelines chart: https://aoir.org/wp-content/uploads/2017/01/aoir_ethics_graphic_2016.pdf

• ‡[BigDataEthics-Wired] Sarah Zhang. 2016. “Scientists are Just as Confused about the Ethics of Big-Data Research as You.” Wired. https://www.wired.com/2016/05/scientists-just-confused-ethics-big-data-research

• †[BigDataEthics-HerschelMori] Richard Herschel and Virginia M. Mori. 2017. “Ethics & Big Data.” Technology in Society 49: 31-36. http://doi.org/10.1016/j.techsoc.2017.03.003

The Science of Data Privacy

• †[BigDataPrivacy] Terence Craig and Mary E. Ludloff. 2011. Privacy and Big Data. O’Reilly Media. http://pensu.eblib.com/patron/FullRecord.aspx?p=781814

• ‡[DataAnalysisPrivacy] John Abowd, Lorenzo Alvisi, Cynthia Dwork, Sampath Kannan, Ashwin Machanavajjhala, Jerome Reiter. 2017. “Privacy-Preserving Data Analysis for the Federal Statistical Agencies.” A Computing Community Consortium white paper. https://arxiv.org/abs/1701.00752

• †[DataPrivacy] Stephen E. Fienberg and Aleksandra B. Slavkovic. 2011. “Data Privacy and Confidentiality.” International Encyclopedia of Statistical Science: 342–5. http://doi.org/978-3-642-04898-2_202

• ‡[NetworksPrivacy] Vishesh Karwa and Aleksandra Slavkovic. 2016. “Inference using noisy degrees: Differentially private β-model and synthetic graphs.” Annals of Statistics 44(1): 87-112. http://projecteuclid.org/euclid.aos/1449755958

• †[DataPublishingPrivacy] Raymond Chi-Wing Wong and Ada Wai-Chee Fu. 2010. Privacy-Preserving Data Publishing: An Overview. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdfplus/10.2200/S00237ED1V01Y201003DTM002

• DataMatching Ch. 8, “Privacy Aspects of Data Matching”

Transparency, Reproducibility, and Team Science

• ‡[10RulesforData] Alyssa Goodman, Alberto Pepe, Alexander W. Blocker, Christine L. Borgman, Kyle Cranmer, Merce Crosas, Rosanne Di Stefano, Yolanda Gil, Paul Groth, Margaret Hedstrom, David W. Hogg, Vinay Kashyap, Ashish Mahabal, Aneta Siemiginowska, and Aleksandra Slavkovic (2014). “Ten Simple Rules for the Care and Feeding of Scientific Data.” PLoS Computational Biology 10(4): e1003542. https://doi.org/10.1371/journal.pcbi.1003542 (Note esp. curated resources for reproducible research.)

• †[Transparency] E. Miguel, C. Camerer, K. Casey, J. Cohen, K. M. Esterling, A. Gerber, R. Glennerster, D. P. Green, M. Humphreys, G. Imbens, D. Laitin, T. Madon, L. Nelson, B. A. Nosek, M. Petersen, R. Sedlmayr, J. P. Simmons, U. Simonsohn, M. Van der Laan. 2014. “Promoting Transparency in Social Science Research.” Science 343(6166): 30–1. https://doi.org/10.1126/science.1245317

• †[Reproducibility] Marcus R. Munafo, Brian A. Nosek, Dorothy V. M. Bishop, Katherine S. Button, Christopher D. Chambers, Nathalie Percie du Sert, Uri Simonsohn, Eric-Jan Wagenmakers, Jennifer J. Ware, and John P. A. Ioannidis. 2017. “A Manifesto for Reproducible Science.” Nature Human Behavior 0021 (2017). https://doi.org/10.1038/s41562-016-0021

• ‡[SoftwareCarpentry] esp. “Lessons”: http://software-carpentry.org/lessons

• DSHandbook-Py Ch. 9, “Technical Communication and Documentation”; Ch. 15, “Software Engineering Best Practices”

• †[TeamScienceToolkit] Vogel AL, Hall KL, Fiore SM, Klein JT, Bennett LM, Gadlin H, Stokols D, Nebeling LC, Wuchty S, Patrick K, Spotts EL, Pohl C, Riley WT, Falk-Krzesinski HJ. 2013. “The Team Science Toolkit: enhancing research collaboration through online knowledge sharing.” American Journal of Preventive Medicine 45: 787-9. https://doi.org/10.1016/j.amepre.2013.09.001

• §[Databrary] Kara Hall, Robert Croyle, and Amanda Vogel. Forthcoming (2017). Advancing Social and Behavioral Health Research through Cross-disciplinary Team Science. Springer. Includes Rick O. Gilmore and Karen E. Adolph, “Open Sharing of Research Video: Breaking the Boundaries of the Research Team.” (See http://databrary.org)

• CRAN: https://cran.r-project.org/web/views/ReproducibleResearch.html

Social Bias, Fair Algorithms

• MachineBias

• ‡[EmbeddingsBias] Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. “Semantics Derived Automatically from Language Corpora Contain Human Biases.” Science. https://arxiv.org/abs/1608.07187

• ‡[Debiasing] Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, Adam Kalai. 2016. “Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings.” https://arxiv.org/abs/1607.06520

• ‡[AvoidingBias] Moritz Hardt, Eric Price, Nathan Srebro. 2016. “Equality of Opportunity in Supervised Learning.” https://arxiv.org/abs/1610.02413

• ‡[InevitableBias] Jon Kleinberg, Sendhil Mullainathan, Manish Raghavan. 2016. “Inherent Trade-Offs in the Fair Determination of Risk Scores.” https://arxiv.org/abs/1609.05807

Databases and Data Management

• NRCreport Ch. 3, “Scaling the Infrastructure for Data Management”

• †[SQL] Jan L. Harrington. 2010. SQL Clearly Explained. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780123756978

• DSHandbook-Py Ch. 14, “Databases”

• †[noSQL] Guy Harrison. 2015. Next Generation Databases: NoSQL, NewSQL, and Big Data. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-1329-2

• †[Cloud] Divyakant Agrawal, Sudipto Das, and Amr El Abbadi. 2012. Data Management in the Cloud: Challenges and Opportunities. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdfplus/10.2200/S00456ED1V01Y201211DTM032

• †Morgan & Claypool Synthesis Lectures on Data Management. http://www.morganclaypool.com/toc/dtm/1/1

Data Wrangling

Theoretically-structured approaches to data wrangling

• ‡[TidyData-R] Garrett Grolemund and Hadley Wickham. 2017. R for Data Science (http://r4ds.had.co.nz), esp. Ch. 12, “Tidy Data”; also Wickham. 2014. “Tidy Data.” Journal of Statistical Software 59(10). http://www.jstatsoft.org/v59/i10/paper. Tools: the tidyverse, http://tidyverse.org

• ‡[DataScience-Py] Jake VanderPlas. 2016. Python Data Science Handbook (https://github.com/jakevdp/PythonDataScienceHandbook), esp. Ch. 3 on pandas: http://pandas.pydata.org

• ‡[DataCarpentry] Colin Gillespie and Robin Lovelace. 2017. Efficient R Programming. https://csgillespie.github.io/efficientR, esp. Ch. 6 on “efficient data carpentry”

• §[Wrangler] Joseph M. Hellerstein, Jeffrey Heer, Tye Rattenbury, and Sean Kandel. 2017. Data Wrangling: Practical Techniques for Data Preparation. Tool: Trifacta Wrangler, http://www.trifacta.com/products/wrangler

Data wrangling practice

• DSHandbook-Py Ch. 4, “Data Munging: String Manipulation, Regular Expressions, and Data Cleaning”

• †[DataSimplification] Jules J. Berman. 2016. Data Simplification: Taming Information with Open Source Tools. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780128037812

• †[Wrangling-Py] Jacqueline Kazil, Katharine Jarmul. 2016. Data Wrangling with Python. O’Reilly. http://proquestcombo.safaribooksonline.com.ezaccess.libraries.psu.edu/9781491948804

• †[Wrangling-R] Bradley C. Boehmke. 2016. Data Wrangling with R. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45599-0

Record Linkage, Entity Resolution, Deduplication

• †[DataMatching] Peter Christen. 2012. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-642-31164-2 (esp. Ch. 2, “The Data Matching Process”)

• DataSimplification Chapter 5, “Identifying and Deidentifying Data”

• ‡[EntityResolution] Lise Getoor and Ashwin Machanavajjhala. 2013. “Entity Resolution for Big Data.” Tutorial, KDD. http://www.umiacs.umd.edu/~getoor/Tutorials/ER_KDD2013.pdf

• ‡[SyrianCasualties] Peter Sadosky, Anshumali Shrivastava, Megan Price, and Rebecca C. Steorts. 2015. “Blocking Methods Applied to Casualty Records from the Syrian Conflict.” https://arxiv.org/abs/1510.07714 (For more on blocking, see DataMatching Ch. 4, “Indexing.”)

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html, “Statistical Matching and Record Linkage”

“Making up data”: Imputation, Smoothers, Kernels, Priors, Filters, Teleportation, Negative Sampling, Convolution, Augmentation, Adversarial Training

• †[Imputation] Yi Deng, Changgee Chang, Moges Seyoum Ido, and Qi Long. 2016. “Multiple Imputation for General Missing Data Patterns in the Presence of High-dimensional Data.” Scientific Reports 6(21689). https://www.nature.com/articles/srep21689

• ‡[Overimputation] Matthew Blackwell, James Honaker, and Gary King. 2017. “A Unified Approach to Measurement Error and Missing Data: Overview and Applications.” Sociological Methods & Research 46(3): 303-341. http://gking.harvard.edu/files/gking/files/measure.pdf

• Shalizi-ADA Sect. 1.5 (Linear Smoothers), Ch. 8 (Splines), Sect. 14.4 (Kernel Density Estimates); DataScience-Python NB 05.13, “Kernel Density Estimation”

• FightinWords (re: Bayesian priors as additional data; impact of priors on regularization)

• DeepLearning Section 7.5 (Data Augmentation), 7.12 (Dropout), 7.13 (Adversarial Examples); Chapter 9 (Convolutional Networks)

• †[Adversarial] Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy. 2015. “Explaining and Harnessing Adversarial Examples.” https://arxiv.org/abs/1412.6572

• See also: resampling and simulation methods

• See also: feature engineering, preprocessing

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html, “Imputation”

(Direct) Data Representations, Data Mappings

• NRCreport Ch. 5, “Large-Scale Data Representations”

• †[PatternRecognition] M. Narasimha Murty and V. Susheela Devi. 2011. Pattern Recognition: An Algorithmic Approach. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-0-85729-495-1 Section 2.1, “Data Structures for Pattern Representation”

• ‡[InfoRetrieval] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2009. Introduction to Information Retrieval. Cambridge University Press. http://nlp.stanford.edu/IR-book Ch. 1, 2, 6 (also note slides used in their class)

• †[Algorithms] Brian Steele, John Chandler, Swarna Reddy. 2016. Algorithms for Data Science. Wiley. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45797-0 Ch. 2, “Data Mapping and Data Dictionaries”

• See also: Social Data Structures

Similarity, Distance, Association, Covariance, The Kernel Trick

• PatternRecognition, Section 2.3, “Proximity Measures.”

• ‡[Similarity] Brendan O’Connor. 2012. “Cosine similarity, Pearson correlation, and OLS coefficients.” https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients/

• For relatively comprehensive lists, see also:

M.-J. Lesot, M. Rifqi, and H. Benhadda. 2009. “Similarity measures for binary and numerical data: a survey.” Int. J. Knowledge Engineering and Soft Data Paradigms 1(1): 63-. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.212.6533&rep=rep1&type=pdf

Seung-Seok Choi, Sung-Hyuk Cha, Charles C. Tappert. 2010. “A Survey of Binary Similarity and Distance Measures.” Journal of Systemics, Cybernetics & Informatics 8(1): 43-8. http://www.iiisci.org/Journal/CV$/sci/pdfs/GS315JG.pdf


Sung-Hyuk Cha. 2007. “Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions.” International Journal of Mathematical Models and Methods in Applied Sciences 4(1): 300-7. http://csis.pace.edu/ctappert/dps/d861-12/session4-p2.pdf

Anna Huang. 2008. “Similarity Measures for Text Document Clustering.” Proceedings of the 6th New Zealand Computer Science Research Student Conference: 49-56. http://www.nzcsrsc08.canterbury.ac.nz/site/proceedings/Individual_Papers/pg049_Similarity_Measures_for_Text_Document_Clustering.pdf

• Michael I. Jordan. “The Kernel Trick” (Lecture Notes). https://people.eecs.berkeley.edu/~jordan/courses/281B-spring04/lectures/lec3.pdf

• Eric Kim. 2017. “Everything You Ever Wanted to Know about the Kernel Trick (But Were Afraid to Ask).” http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick_blog_ekim_12_20_2017.pdf

Derived Data Representations - Dimensionality Reduction, Compression, Decomposition, Embeddings

The groupings here and under the measurement / multivariate statistics section are particularly arbitrary. For example, “k-Means clustering” can be viewed as a technique for “dimensionality reduction,” “compression,” “feature extraction,” “latent variable measurement,” “unsupervised learning,” or “collaborative filtering.”
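As a minimal sketch of the point above, the following plain-Python example runs Lloyd's algorithm (the standard k-means iteration) on synthetic two-cluster data and then reads the same output three ways: as cluster labels, as a compressed encoding (each point stored as one code index plus a shared codebook), and as a derived feature set (distance to each centroid). The function names, the synthetic data, and the choice of k = 2 are illustrative assumptions, not taken from any of the readings.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(p, centers):
    """Index of the center closest to point p."""
    return min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # initialize at k data points
    for _ in range(iters):
        labels = [nearest(p, centers) for p in points]       # assignment step
        for j in range(k):                                   # update step
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(xs) / len(members) for xs in zip(*members))
    labels = [nearest(p, centers) for p in points]           # final assignment
    return centers, labels

# Synthetic data (an assumption for illustration): two well-separated blobs.
rng = random.Random(1)
blob_a = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(50)]
blob_b = [(rng.gauss(10, 0.5), rng.gauss(10, 0.5)) for _ in range(50)]
points = blob_a + blob_b

centers, labels = kmeans(points, k=2)

# "Clustering": labels partition the points.
# "Compression" (vector quantization): keep only the codebook plus one index per point.
reconstructed = [centers[lab] for lab in labels]
mse = sum(sq_dist(p, r) for p, r in zip(points, reconstructed)) / len(points)

# "Feature extraction": replace each point by its distances to the k centroids.
features = [[sq_dist(p, c) ** 0.5 for c in centers] for p in points]
```

Here `mse` is the reconstruction error of the compressed representation; raising k shrinks it at the cost of a larger codebook, which is the usual rate/distortion trade-off in vector quantization.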

Clustering, hashing, quantization, blocking, compression

• ‡[Compression] Khalid Sayood. 2012. Introduction to Data Compression, 4th ed. Springer. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780124157965 (e.g., coding, blocking via vector quantization)

• MMDS, Ch. 3, “Finding Similar Items” (minhashing, locality-sensitive hashing)

• Multivariate-R, Ch. 6, “Clustering.”

• DataScience-Python, NB 05.11, “k-Means Clustering”; NB 05.12, “Gaussian Mixture Models.”

• ‡[KMeansHashing] Kaiming He, Fang Wen, Jian Sun. 2013. “K-means Hashing: An Affinity-Preserving Quantization Method for Learning Binary Compact Codes.” CVPR. https://www.cv-foundation.org/openaccess/content_cvpr_2013/papers/He_K-Means_Hashing_An_2013_CVPR_paper.pdf

• †[Squashing] Madigan, D., Raghavan, N., Dumouchel, W., Nason, M., Posse, C., and Ridgeway, G. (2002). “Likelihood-based data squashing: A modeling approach to instance construction.” Data Mining and Knowledge Discovery 6(2): 173-190. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1023/A:1014095614948

• †[Core-sets] Piotr Indyk, Sepideh Mahabadi, Mohammad Mahdian, Vahab S. Mirrokni. 2014. “Composable core-sets for diversity and coverage maximization.” PODS ’14: 100-8. https://doi.org/10.1145/2594538.2594560

Feature selection, feature extraction, feature engineering, weighting, preprocessing

• PatternRecognition, Sections 2.6-7, “Feature Selection, Feature Extraction.”

• DSHandbook-Py, Ch. 7, “Interlude: Feature Extraction Ideas.”

• DataScience-Python, NB 05.04, “Feature Engineering.”

• Features in text: NLP and InfoRetrieval re: tf-idf and similar; FightinWords

• Features in images: DataScience-Python, NB 05.14, “Image Features.”


Dimensionality reduction, decomposition, factorization, change of basis, reparameterization, matrix completion, latent variables, source separation

• MMDS, Ch. 9, “Recommendation Systems”; Ch. 11, “Dimensionality Reduction.”

• DSHandbook-Py, Ch. 10, “Unsupervised Learning: Clustering and Dimensionality Reduction.”

• ‡[Shalizi-ADA] Cosma Rohilla Shalizi. 2017. Advanced Data Analysis from an Elementary Point of View. Ch. 16 (“Principal Components Analysis”), Ch. 17 (“Factor Models”). http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/

See also Multivariate-R, Ch. 3, “Principal Components Analysis”; Ch. 4, “Multidimensional Scaling”; Ch. 5, “Exploratory Factor Analysis.” Latent

• DeepLearning, Ch. 2, “Linear Algebra”; Ch. 13, “Linear Factor Models.”

• ‡[GloVe] Jeffrey Pennington, Richard Socher, Christopher Manning. 2014. “GloVe: Global vectors for word representation.” EMNLP. https://nlp.stanford.edu/projects/glove/

• †[NMF] Daniel D. Lee and H. Sebastian Seung. 1999. “Learning the parts of objects by non-negative matrix factorization.” Nature 401: 788-791. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1038/44565

• †[CUR] Michael W. Mahoney and Petros Drineas. 2009. “CUR matrix decompositions for improved data analysis.” PNAS. http://www.pnas.org/content/106/3/697.full

• ‡[ICA] Aapo Hyvärinen and Erkki Oja. 2000. “Independent Component Analysis: Algorithms and Applications.” Neural Networks 13(4-5): 411-30. https://www.cs.helsinki.fi/u/ahyvarin/papers/NN00new.pdf

• [RandomProjection] Ella Bingham and Heikki Mannila. 2001. “Random projection in dimensionality reduction: applications to image and text data.” KDD. https://doi.org/10.1145%2F502512.502546

• Compression, e.g., “Transform coding,” “Wavelets.”

Nonlinear dimensionality reduction, Manifold learning

• DeepLearning, Section 5.11.3, “Manifold Learning.”

• DataScience-Python, NB 05.10, “Manifold Learning” (Locally linear embedding [LLE], Isomap)

• ‡[KernelPCA] Sebastian Raschka. 2014. “Kernel tricks and nonlinear dimensionality reduction via RBF kernel PCA.” http://sebastianraschka.com/Articles/2014_kernel_pca.html

• Shalizi-ADA, Ch. 18, “Nonlinear Dimensionality Reduction” (LLE)

• ‡[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi. 2003. “Laplacian eigenmaps for dimensionality reduction and data representation.” Neural Computation 15(6): 1373–1396. http://web.cse.ohio-state.edu/~belkin.8/papers/LEM_NC_03.pdf

• DeepLearning, Ch. 14 (“Autoencoders”)

• ‡[word2vec] Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” https://arxiv.org/abs/1301.3781

• ‡[word2vecExplained] Yoav Goldberg and Omer Levy. 2014. “word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method.” https://arxiv.org/abs/1402.3722

• ‡[t-SNE] Laurens van der Maaten and Geoffrey Hinton. 2008. “Visualizing Data using t-SNE.” Journal of Machine Learning Research 9: 2579-2605. http://jmlr.csail.mit.edu/papers/volume9/vandermaaten08a/vandermaaten08a.pdf


Computation and Scaling Up

Scientific computing, computation at scale

• NRCreport, Ch. 10, “The Seven Computational Giants of Massive Data Analysis.”

• DSHandbook-Py, Ch. 21, “Performance and Computer Memory”; Ch. 22, “Computer Memory and Data Structures.”

• ‡[ComputationalStatistics-Python] Cliburn Chan. Computational Statistics in Python. https://people.duke.edu/~ccc14/sta-663/index.html

• CRAN Task View: High Performance Computing. https://cran.r-project.org/web/views/HighPerformanceComputing.html

Numerical computing

• [NumericalComputing] Ward Cheney and David Kincaid. 2013. Numerical Mathematics and Computing, 7th ed. Brooks/Cole: Cengage Learning.

• CRAN Task View: Numerical Mathematics. https://cran.r-project.org/web/views/NumericalMathematics.html

Optimization (e.g., MLE, gradient descent, stochastic gradient descent, EM algorithm, neural nets)

• DSHandbook-Py, Ch. 23, “Maximum Likelihood Estimation and Optimization” (gradient descent)

• DeepLearning, Ch. 4, “Numerical Computation”; Ch. 6, “Deep Feedforward Networks”; Ch. 8, “Optimization for Training Deep Models.”

• CRAN Task View: Optimization and Mathematical Programming. https://cran.r-project.org/web/views/Optimization.html

Linear algebra, matrix computations

• [MatrixComputations] Gene H. Golub and Charles F. Van Loan. 2013. Matrix Computations, 4th ed. Johns Hopkins University Press.

• ‡[NetflixMatrix] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. “Matrix Factorization Techniques for Recommender Systems.” Computer, August: 42-9. https://datajobs.com/data-science-repo/Recommender-Systems-[Netflix].pdf

• ‡[BigDataPCA] Jianqing Fan, Qiang Sun, Wen-Xin Zhou, Ziwei Zhu. “Principal Component Analysis for Big Data.” http://www.princeton.edu/~ziweiz/pca.pdf

• ‡[RandomSVD] Andrew Tulloch. 2009. “Fast Randomized SVD.” https://research.fb.com/fast-randomized-svd/

• ‡[FactorSGD] Rainer Gemulla, Peter J. Haas, Erik Nijkamp, and Yannis Sismanis. 2011. “Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent.” KDD. http://www.cs.utah.edu/~hari/teaching/bigdata/gemulla11dsgd.pdf

• ‡[Sparse] Max Grossman. 2015. “101 Ways to Store a Sparse Matrix.” https://medium.com/@jmaxg3/101-ways-to-store-a-sparse-matrix-c7f2bf15a229

• ‡[DontInvert] John D. Cook. 2010. “Don’t Invert that Matrix.” https://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix/

Simulation-based inference, resampling, Monte Carlo methods, MCMC, Bayes, approximate inference


• ‡[MarkovVisually] Victor Powell. “Markov Chains Explained Visually.” http://setosa.io/ev/markov-chains/

• DSHandbook-Py, Ch. 25, “Stochastic Modeling” (Markov chains, MCMC, HMM)

• Bayes; ComputationalStatistics-Python

• DeepLearning, Ch. 17, “Monte Carlo Methods”; Ch. 19, “Approximate Inference.”

• ‡[VariationalInference] Jason Eisner. 2011. “High-Level Explanation of Variational Inference.” https://www.cs.jhu.edu/~jason/tutorials/variational.html

• CRAN Task View: Bayesian Inference. https://cran.r-project.org/web/views/Bayesian.html

Parallelism, MapReduce, Split-Apply-Combine

• ‡[MapReduceIntuition] Jigsaw Academy. 2014. “Big Data Specialist: MapReduce.” https://www.youtube.com/watch?v=TwcYQzFqg-8&feature=youtu.be

• MMDS, Ch. 2, “Map-Reduce and the New Software Stack” (3rd Edition discusses Spark & TensorFlow)

• Algorithms, Ch. 3, “Scalable Algorithms and Associative Statistics”; Ch. 4, “Hadoop and MapReduce.”

• NRCreport, Ch. 6, “Resources, Trade-offs, and Limitations.”

• TidyData-R (Split-apply-combine is the motivating principle behind the “tidyverse” approach). See also Part III, “Program” (pipes, functions, vectors, iteration).

• ‡[TidySAC-Video] Hadley Wickham. 2017. “Data Science in the Tidyverse.” https://www.rstudio.com/resources/videos/data-science-in-the-tidyverse/

• See also: FactorSGD
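The split-apply-combine pattern named in this section can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper `split_apply_combine`, the `mean` function, and the toy records are all hypothetical and do not come from the tidyverse or from any of the readings.

```python
from collections import defaultdict

def split_apply_combine(rows, key, value, func):
    """Group rows by `key`, apply `func` to each group's values, collect results."""
    groups = defaultdict(list)
    for row in rows:                          # split: route each row to its group
        groups[row[key]].append(row[value])
    # apply + combine: one summary per group, gathered into a single dict
    return {group: func(values) for group, values in groups.items()}

def mean(values):
    return sum(values) / len(values)

# Hypothetical toy records for illustration only.
rows = [
    {"state": "PA", "tweets": 4},
    {"state": "NY", "tweets": 2},
    {"state": "PA", "tweets": 6},
    {"state": "NY", "tweets": 4},
]

result = split_apply_combine(rows, "state", "tweets", mean)
print(result)  # {'PA': 5.0, 'NY': 3.0}
```

Because each group is summarized independently of the others, the apply step could in principle run on separate workers, which is the loose connection to the MapReduce readings listed above.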

Functional Programming

• ComputationalStatistics-Python, “Functions are first class objects” through first exercises

• DSHandbook-Py, Ch. 20, “Programming Language Concepts.”

• See also: Haskell

Scaling, iteration, streaming data, online algorithms (Spark)

• NRCreport, Ch. 4, “Temporal Data and Real-Time Algorithms.”

• MMDS, Ch. 4, “Mining Data Streams.”

• ‡[BDAS] AMPLab. BDAS: The Berkeley Data Analytics Stack. https://amplab.cs.berkeley.edu/software/

• †[Spark] Mohammed Guller. 2015. Big Data Analytics with Spark: A Practitioner’s Guide to Using Spark for Large-Scale Data Processing, Machine Learning, and Graph Analytics, and High-Velocity Data Stream Processing. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-0964-6

• DSHandbook-Py, Ch. 13, “Big Data.”

• DeepLearning, Ch. 10, “Sequence Modeling: Recurrent and Recursive Nets.”

General Resources for Python and R


• †[DSHandbook-Py] Field Cady. 2017. The Data Science Handbook (Python-based). Wiley. http://onlinelibrary.wiley.com.ezaccess.libraries.psu.edu/book/10.1002/9781119092919

• ‡[Tutorials-R] Ujjwal Karn. 2017. “A curated list of R tutorials for Data Science, NLP, and Machine Learning.” https://github.com/ujjwalkarn/DataScienceR

• ‡[Tutorials-Python] Ujjwal Karn. 2017. “A curated list of Python tutorials for Data Science, NLP, and Machine Learning.” https://github.com/ujjwalkarn/DataSciencePython

Cutting and Bleeding Edge of Data Science Languages

• [Scala] †Vishal Layka and David Pollak. 2015. Beginning Scala. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-0232-6 †Nicolas Patrick. 2014. Scala for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1901910 ‡Scala site: https://www.scala-lang.org (See also )

• [Julia] †Ivo Balbaert. 2015. Getting Started with Julia. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1973847 ‡Douglas Bates. 2013. “Julia for R Programmers.” http://www.stat.wisc.edu/~bates/JuliaForRProgrammers.pdf ‡Julia site: https://julialang.org

• [Haskell] †Richard Bird. 2014. Thinking Functionally with Haskell. https://doi-org.ezaccess.libraries.psu.edu/10.1017/CBO9781316092415 †Hakim Cassimally. 2017. Learning Haskell Programming. https://www.lynda.com/Haskell-tutorials/Learning-Haskell-Programming/604926-2.html †James Church. 2017. Learning Haskell for Data Analysis. https://www.lynda.com/Developer-tutorials/Learning-Haskell-Data-Analysis/604234-2.html Haskell site: https://www.haskell.org

• [Clojure] †Mark McDonnell. 2017. Quick Clojure: Essential Functional Programming. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2952-1 †Akhil Wali. 2014. Clojure for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1674848 †Arthur Ulfeldt. 2015. https://www.lynda.com/Clojure-tutorials/Up-Running-Clojure/413127-2.html ‡Clojure site: https://clojure.org

• [TensorFlow] ‡Abadi et al. (Google). 2016. “TensorFlow: A System for Large-Scale Machine Learning.” https://arxiv.org/abs/1605.08695 ‡Udacity, “Deep Learning.” https://www.udacity.com/course/deep-learning--ud730 ‡TensorFlow site: https://www.tensorflow.org

• [H2O] Darren Cook. 2016. Practical Machine Learning with H2O: Powerful, Scalable Techniques for Deep Learning and AI. O’Reilly. Arno Candel and Viraj Parmar. 2015. Deep Learning with H2O. ‡H2O site: https://www.h2o.ai

Social Data Structures

Space and Time

• †[GIA] David O’Sullivan, David J. Unwin. 2010. Geographic Information Analysis, Second Edition. John Wiley & Sons. http://onlinelibrary.wiley.com/book/10.1002/9780470549094

• [Space-Time] Donna J. Peuquet. 2003. Representations of Space and Time. Guilford.

• Roger S. Bivand, Edzer Pebesma, Virgilio Gómez-Rubio. 2013. Applied Spatial Data Analysis in R. Free through library: http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-7618-4 See also Edzer Pebesma. 2016. “Handling and Analyzing Spatial, Spatiotemporal, and Movement Data in R.” https://edzer.github.io/UseR2016/

• ‡[PySAL] Sergio J. Rey and Dani Arribas-Bel. 2016. “Geographic Data Science with PySAL and the pydata Stack.” http://darribas.org/gds_scipy16/


• DataScience-Python, NB 04.13, “Geographic Data with Basemap.”

• Time: “Longitudinal,” intra-individual data, as common in developmental psychology. †[Longitudinal] Garrett Fitzmaurice, Marie Davidian, Geert Verbeke, and Geert Molenberghs, editors. 2009. Longitudinal Data Analysis. Chapman & Hall / CRC. http://pensu.eblib.com/patron/FullRecord.aspx?p=359998

• Time: “Time Series, Time Series Cross Section, Panel,” as common in economics, political science, and sociology. DataScience-Python, Ch. 3 re: pandas; DSHandbook-Py, Ch. 17, “Time Series Analysis”; Econometrics; GelmanHill.

• Time: “Sequential, Data Streams,” as in NLP, machine learning. Spark and other “streaming data” readings.

• Space-Time: Data with continuity, connectivity, neighborhood structure. See DeepLearning, Chapter 9 re: convolution.

• CRAN Task Views: Spatial, https://cran.r-project.org/web/views/Spatial.html; Spatiotemporal, https://cran.r-project.org/web/views/SpatioTemporal.html; Time Series, https://cran.r-project.org/web/views/TimeSeries.html

Network Graphs

• ‡[Networks] David Easley and Jon Kleinberg. 2010. Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press. http://www.cs.cornell.edu/home/kleinber/networks-book/

• MMDS, Ch. 5, “Link Analysis”; Ch. 10, “Mining Social-Network Graphs.”

• †[Networks-R] Douglas Luke. 2015. A User’s Guide to Network Analysis in R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-3-319-23883-8

• †[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry. 2017. Python for Graph and Network Analysis. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-53004-8

Hierarchy, Clustered Data, Aggregation, Mixed Models

• †[GelmanHill] Andrew Gelman and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press. http://pensu.eblib.com/patron/FullRecord.aspx?p=288457

• †[MixedModels] Eugene Demidenko. 2013. Mixed Models: Theory and Applications with R, 2nd ed. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10748641

Social Data Channels

Language, Text, Speech, Audio

• ‡[NLP] Daniel Jurafsky and James H. Martin. Forthcoming (2017). Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics, 3rd ed. Prentice-Hall. Preprint: https://web.stanford.edu/~jurafsky/slp3/

• InfoRetrieval

• ‡[CoreNLP] Stanford CoreNLP. https://stanfordnlp.github.io/CoreNLP/


• †[TextAsData] Grimmer and Stewart. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis. https://doi.org/10.1093/pan/mps028

• Quinn-Topics

• FightinWords

• †[NLTK-Python] Jacob Perkins. 2010. Python Text Processing with NLTK 2.0 Cookbook. Packt. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10435387 †[TextAnalytics-Python] Dipanjan Sarkar. 2016. Text Analytics with Python. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2388-8

• DSHandbook-Py, Ch. 16, “Natural Language Processing.”

• Matthew J. Denny. http://www.mjdenny.com/Text_Processing_In_R.html

• CRAN: Natural Language Processing. https://cran.r-project.org/web/views/NaturalLanguageProcessing.html

• †[AudioData] Dean Knox and Christopher Lucas. 2017. “The Speaker-Affect Model: Measuring Emotion in Political Speech with Audio Data.” http://christopherlucas.org/files/PDFs/sam.pdf https://www.youtube.com/watch?v=Hs8A9dwkMzI

• †Morgan & Claypool Synthesis Lectures on Human Language Technologies. http://www.morganclaypool.com/toc/hlt/1/1

• †Morgan & Claypool Synthesis Lectures on Speech and Audio Processing. http://www.morganclaypool.com/toc/sap/1/1

Vision, Image, Video

• ‡[ComputerVision] Richard Szeliski. 2010. Computer Vision: Algorithms and Applications. Springer. http://szeliski.org/Book/

• MixedModels, Ch. 11, “Statistical Analysis of Shape”; Ch. 12, “Statistical Image Analysis.”

• Audio/Video/Volumetric: See DeepLearning, Chapter 9 re: convolution.

• †Morgan & Claypool Synthesis Lectures on Image, Video, and Multimedia Processing. http://www.morganclaypool.com/toc/ivm/1/1

• †Morgan & Claypool Synthesis Lectures on Computer Vision. http://www.morganclaypool.com/toc/cov/1/1

Approaches to Learning from Data (The Analytics Layer)

• NRCreport, Ch. 7, “Building Models from Massive Data.”

• MMDS (data-mining)

• CausalInference

• ‡[VisualAnalytics] Daniel Keim, Jörn Kohlhammer, Geoffrey Ellis, and Florian Mansmann, editors. 2010. Mastering the Information Age: Solving Problems with Visual Analytics. The Eurographics Association, Goslar, Germany. http://www.vismaster.eu/book/ See also †Morgan & Claypool Synthesis Lectures on Visualization. http://www.morganclaypool.com/toc/vis/2/1

• ‡[Econometrics] Bruce E. Hansen. 2014. Econometrics. http://www.ssc.wisc.edu/~bhansen/econometrics/


• ‡[StatisticalLearning] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning with Applications in R. Springer. http://www-bcf.usc.edu/~gareth/ISL/index.html

• [MachineLearning] Christopher Bishop. 2006. Pattern Recognition and Machine Learning. Springer.

• †[Bayes] Peter D. Hoff. 2009. A First Course in Bayesian Statistical Methods. Springer. http://link.springer.com/book/10.1007%2F978-0-387-92407-6

• †[GraphicalModels] Søren Højsgaard, David Edwards, Steffen Lauritzen. 2012. Graphical Models with R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-2299-0

• [InformationTheory] David MacKay. 2003. Information Theory, Inference, and Learning Algorithms. Springer.

• ‡[DeepLearning] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org (see also Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. “Deep Learning.” Nature 521(7553): 436–444. https://doi.org/10.1038/nature14539)

See also ‡[Keras] Keras: The Python Deep Learning Library. https://keras.io François Chollet, Directory of Keras tutorials: https://github.com/fchollet/keras-resources

• †[SignalProcessing] Jose Maria Giron-Sierra. 2013. Digital Signal Processing with MATLAB Examples (Volumes 1-3). Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-981-10-2534-1


Penn State Policy Statements

Academic Integrity

Academic integrity is the pursuit of scholarly activity in an open, honest, and responsible manner. Academic integrity is a basic guiding principle for all academic activity at The Pennsylvania State University, and all members of the University community are expected to act in accordance with this principle. Consistent with this expectation, the University’s Code of Conduct states that all students should act with personal integrity; respect other students’ dignity, rights, and property; and help create and maintain an environment in which all can succeed through the fruits of their efforts.

Academic integrity includes a commitment by all members of the University community not to engage in or tolerate acts of falsification, misrepresentation, or deception. Such acts of dishonesty violate the fundamental ethical principles of the University community and compromise the worth of work completed by others.

Disability Accommodation

Penn State welcomes students with disabilities into the University’s educational programs. Every Penn State campus has an office for students with disabilities. The Student Disability Resources (SDR) website provides contact information for every Penn State campus (http://equity.psu.edu/sdr/disability-coordinator). For further information, please visit the Student Disability Resources website (http://equity.psu.edu/sdr).

In order to receive consideration for reasonable accommodations, you must contact the appropriate disability services office at the campus where you are officially enrolled, participate in an intake interview, and provide documentation. See documentation guidelines at (http://equity.psu.edu/sdr/guidelines). If the documentation supports your request for reasonable accommodations, your campus disability services office will provide you with an accommodation letter. Please share this letter with your instructors and discuss the accommodations with them as early as possible. You must follow this process for every semester that you request accommodations.

Psychological Services

Many students at Penn State face personal challenges or have psychological needs that may interfere with their academic progress, social development, or emotional wellbeing. The university offers a variety of confidential services to help you through difficult times, including individual and group counseling, crisis intervention, consultations, online chats, and mental health screenings. These services are provided by staff who welcome all students and embrace a philosophy respectful of clients’ cultural and religious backgrounds, and sensitive to differences in race, ability, gender identity, and sexual orientation.

• Counseling and Psychological Services at University Park (CAPS) (http://studentaffairs.psu.edu/counseling/): 814-863-0395

• Penn State Crisis Line (24 hours / 7 days/week): 877-229-6400

• Crisis Text Line (24 hours / 7 days/week): Text LIONS to 741741

Educational Equity

Penn State takes great pride to foster a diverse and inclusive environment for students, faculty, and staff. Consistent with University Policy AD29, students who believe they have experienced or observed a hate crime, an act of intolerance, discrimination, or harassment that occurs at Penn State are urged to report these incidents as outlined on the University’s Report Bias webpage (http://equity.psu.edu/reportbias/).


Background Knowledge

Students will have a variety of backgrounds. Prior to beginning interdisciplinary coursework to fulfill Social Data Analytics degree requirements, including SoDA 501 and 502, students are expected to have advanced (graduate) training in at least one of the component areas of Social Data Analytics and a familiarity with basic concepts in the others.

With regard to specialization, students are expected to have advanced (graduate) training in ONE of the following:

• quantitative social science methodology and a discipline of social science (as would be the case for a second-year PhD student in Political Science, Sociology, Criminology, Human Development and Family Studies, or Demography), OR

• statistics (as would be the case for a second-year PhD student in Statistics), OR

• information science or informatics (as would be the case for a second-year PhD student in Information Science and Technology, or a second-year PhD student in Geography specializing in GIScience), OR

• computer science (as would be the case for a second-year PhD student in Computer Science and Engineering).

This requirement is met as a matter of meeting home program requirements for students in the dual-title PhD, but may require additional coursework on the part of students in other programs wishing to pursue the graduate minor.

With regard to general preparation, students are expected to have ALL of the following technical knowledge:

• basic programming skills, including basic facility with R &/or Python, AND

• basic knowledge of relational databases &/or geographic information systems, AND

• basic knowledge of probability, applied statistics, &/or social science research design, AND

• basic familiarity with a substantive or theoretical area of social science (e.g., 300-level coursework in political science, sociology, criminology, human development, psychology, economics, communication, anthropology, human geography, social informatics, or similar fields).

It is not unusual for students to have one or more gaps in this preparation at time of application to the SoDA program. Students should work with Social Data Analytics advisers to develop a plan for timely remediation of any deficiencies, which generally will not require formal coursework for students whose training and interests are otherwise appropriate for pursuit of the Social Data Analytics degree. Where possible, this will be addressed at time of application to the Social Data Analytics program.

To this end, some free training materials are linked in the reference section, and there are also abundant free, high-quality, self-paced, course-style training materials on these and related subjects available through edX (http://www.edx.org), Udacity (http://www.udacity.com), Codecademy (http://codecademy.com), and Lynda (http://www.lynda.psu).


• †[UnobtrusiveMeasures] Raymond M. Lee. 2015. “Unobtrusive Measures.” Oxford Bibliographies. https://doi.org/10.1093/OBO/9780199846740-0048 (Canonical cite is Eugene J. Webb, 1966, Unobtrusive Measures; or Webb, Donald T. Campbell, Richard D. Schwartz, Lee Sechrest, 1999, Unobtrusive Measures, rev. ed., Sage.)

• BitByBit, Examples in Chapter 2

Multiple Measures, Latent Variable Measurement

• 7Rules, Ch. 4, “Replicate Where Possible.”

• †[LatentVariables] David J. Bartholomew, Martin Knott, and Irini Moustaki. 2011. Latent Variable Models and Factor Analysis: A Unified Approach. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10483308 (esp. Ch. 1, “Basic ideas and examples”)

• †[Multivariate-R] Brian Everitt and Torsten Hothorn. 2011. An Introduction to Applied Multivariate Analysis with R. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4419-9650-3

• Shalizi-ADA, Ch. 17, “Factor Models.”

• ‡[NetflixPrize] Edwin Chen. 2011. “Winning the Netflix Prize: A Summary.”

• ‡CRAN: https://cran.r-project.org/web/views/Multivariate.html, https://cran.r-project.org/web/views/Psychometrics.html, https://cran.r-project.org/web/views/Cluster.html

Sampling and Survey Design

• †[Sampling] Steven K. Thompson. 2012. Sampling, 3rd ed. http://sk8es4mc2l.search.serialssolutions.com/?sid=sersol&SS_jc=TC_024492330&title=Wiley%20Desktop%20Editions%20%3A%20Sampling

• ResearchMethodsKB, “Sampling.”

• NRCreport, Ch. 8, “Sampling and Massive Data.”

• BitByBit, Ch. 3, “Asking Questions.”

• ‡[MSE] Daniel Manrique-Vallier, Megan E. Price, and Anita Gohdes. 2013. In Seybolt et al. (eds.), Counting Civilians. “Multiple Systems Estimation Techniques for Estimating Casualties in Armed Conflicts.” Preprint: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.469.939&rep=rep1&type=pdf

• †[NetworkSampling] Ted Mouw and Ashton M. Verdery. 2012. “Network Sampling with Memory: A Proposal for More Efficient Sampling from Social Networks.” Sociological Methodology 42(1): 206–56. https://dx.doi.org/10.1177%2F0081175012461248

• See also FightinWords (re: hidden heteroskedasticity in sample variance)

• ‡[HashDontSample] Mudit Uppal. 2016. “Probabilistic data structures in the Big data world (+ code)” (re: “Hash, don’t sample”). https://medium.com/@muppal/probabilistic-data-structures-in-the-big-data-world-code-b9387cff0c55

• MMDS, Hash Functions (1.3.2), Sampling in streams (4.2)

• ‡CRAN: https://cran.r-project.org/web/views/OfficialStatistics.html (“Complex Survey Design,” “Small Area Estimation”)

Experimental and Observational Designs for Causal Inference


• ‡[CausalInference] Miguel A. Hernán, James M. Robins. Forthcoming (2017). Causal Inference. Chapman & Hall/CRC. Preprint: https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/ Ch. 1, “A definition of causal effect”; Ch. 2, “Randomized experiments”; Ch. 3, “Observational studies.” (See Ch. 6, “Graphical representation of causal effects,” for integration with Judea Pearl approach.)

• BitByBit, Chapter 4, “Running Experiments”; Ch. 2, Natural experiments in observable data, examples

• ResearchMethodsKB, “Design.”

• Firebaugh-7Rules, Ch. 2, “Look for Differences that Make a Difference, and Report Them”; §Ch. 5, “Compare Like with Like”; Ch. 6, “Use Panel Data to Study Individual Change and Repeated Cross-Section Data to Study Social Change.”

• Shalizi-ADA, Part IV, “Causal Inference.”

• Monroe-No

• CRAN: https://cran.r-project.org/web/views/ExperimentalDesign.html

Technologies for Primary and Secondary Data Collection

Mobile Devices, Distributed Sensors, Wearable Sensors, Remote Sensing

• †[RealityMining] Nathan Eagle and Alex (Sandy) Pentland. 2006. “Reality Mining: Sensing Complex Social Systems.” Personal and Ubiquitous Computing 10(4): 255–68. https://doi.org/10.1007/s00779-005-0046-3

• †[QuantifiedSelf] Melanie Swan. 2013. “The Quantified Self: Fundamental Disruption in Big Data Science and Biological Discovery.” Big Data 1(2): 85–99. https://doi.org/10.1089/big.2012.0002

• †[SensorData] Charu C. Aggarwal (Ed.). 2013. Managing and Mining Sensor Data. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-6309-2. Ch. 1, “An Introduction to Sensor Data Analytics”

• †[NightLights] Thushyanthan Baskaran, Brian Min, Yogesh Uppal. 2015. “Election cycles and electricity provision: Evidence from a quasi-experiment with Indian special elections.” Journal of Public Economics 126: 64–73. https://doi.org/10.1016/j.jpubeco.2015.03.011

Crowdsourcing, Human Computation, Citizen Science, Web Experiments

• †[Crowdsourcing] Daren C. Brabham. 2013. Crowdsourcing. MIT Press. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10692208. “Introduction”; Ch. 1, “Concepts, Theories, and Cases of Crowdsourcing”

• BitByBit, Ch. 5, “Creating Mass Collaboration”

• NRCreport, Ch. 9, “Human Interaction with Data”

• †[MTurk] Krista Casler, Lydia Bickel, and Elizabeth Hackett. 2013. “Separate but Equal? A Comparison of Participants and Data Gathered via Amazon’s MTurk, Social Media, and Face-to-Face Behavioral Testing.” Computers in Human Behavior 29(6): 2156–60. http://doi.org/10.1016/j.chb.2013.05.009

• †[LabintheWild] Katharina Reinecke and Krzysztof Z. Gajos. 2015. “LabintheWild: Conducting Large-Scale Online Experiments With Uncompensated Samples.” In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing (CSCW ’15): 1364–1378. http://dx.doi.org/10.1145/2675133.2675246


• †[TweetmentEffects] Kevin Munger. 2017. “Tweetment Effects on the Tweeted: Experimentally Reducing Racist Harassment.” Political Behavior 39(3): 629–49. http://doi.org/10.1016/j.chb.2013.05.009

• †[HumanComputation] Edith Law and Luis von Ahn. 2011. Human Computation. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdf/10.2200/S00371ED1V01Y201107AIM013

Open Data, File Formats, APIs, Semantic Web, Linked Data

• DSHandbook-Py, Ch. 12, “Data Encodings and File Formats”

• ‡[OpenData] Open Data Institute. “What Is Open Data?” https://theodi.org/what-is-open-data

• ‡[OpenDataHandbook] Open Knowledge International. The Open Data Handbook. http://opendatahandbook.org/. Includes appendix “File Formats”: http://opendatahandbook.org/guide/en/appendices/file-formats/

• ‡[APIs] Brian Cooksey. 2016. An Introduction to APIs. https://zapier.com/learn/apis/

• ‡[APIMarkets] RapidAPI / mashape API marketplaces: https://docs.rapidapi.com/, https://market.mashape.com/. ProgrammableWeb: https://www.programmableweb.com/

• †[LinkedData] Tom Heath and Christian Bizer. 2011. Linked Data: Evolving the Web into a Global Data Space. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/abs/10.2200/S00334ED1V01Y201102WBE001. Ch. 1, “Introduction”; Ch. 2, “Principles of Linked Data”

• †[SemanticWeb] Nikolaos Konstantinou and Dimitrios-Emmanuel Spanos. 2015. Materializing the Web of Linked Data. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-16074-0. Ch. 1, “Introduction: Linked Data and the Semantic Web”

• †Morgan & Claypool Synthesis Lectures on the Semantic Web: Theory and Technology. http://www.morganclaypool.com/toc/wbe/1/1

Web Scraping

• †[TheoryDrivenScraping] Richard N. Landers, Robert C. Brusso, Katelyn J. Cavanaugh, and Andrew B. Colmus. 2016. “A Primer on Theory-Driven Web Scraping: Automatic Extraction of Big Data from the Internet for Use in Psychological Research.” Psychological Methods 4: 475–492. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1037/met0000081

• ‡[Scraping-Py] Al Sweigart. 2015. Automate the Boring Stuff with Python: Practical Programming for Total Beginners. Ch. 11, Web-Scraping. https://automatetheboringstuff.com/chapter11/

• †[Scraping-R] Simon Munzert, Christian Rubba, Peter Meißner, and Dominic Nyhuis. 2014. Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining. Wiley. http://onlinelibrary.wiley.com/book/10.1002/9781118834732

• ‡CRAN: https://cran.r-project.org/web/views/WebTechnologies.html

Ethics & Scientific Responsibility in Big Social Data

Human Subjects, Consent, Privacy

• BitByBit, Ch. 6 (“Ethics”)

• ‡[PSU-ORP] Penn State Office for Research Protections (under the VP for Research)

Human Subjects Research / IRB: https://www.research.psu.edu/irb


Revised Common Rule: https://www.research.psu.edu/irb/commonrulechanges

Responsible Conduct of Research: https://www.research.psu.edu/education/rcr

Research Misconduct: https://www.research.psu.edu/researchmisconduct

SARI@PSU (Scientific and Research Integrity Training): https://www.research.psu.edu/training/sari

• ‡[MenloReport] David Dittrich and Erin Kenneally (Center for Applied Internet Data Analysis). 2012. “The Menlo Report: Ethical Principles Guiding Information and Communication Technology Research” and companion report “Applying Ethical Principles ...” https://www.caida.org/publications/papers/2012/menlo_report_actual_formatted/

• ‡[BigDataEthics-CBDES] Jacob Metcalf, Emily F. Keller, and danah boyd. 2016. “Perspectives on Big Data, Ethics, and Society.” Council for Big Data, Ethics, and Society. http://bdes.datasociety.net/council-output/perspectives-on-big-data-ethics-and-society/

• ‡[AoIRReport] Annette Markham and Elizabeth Buchanan (Association of Internet Researchers). 2012. “Ethical Decision-Making and Internet Research: Recommendations of the AoIR Ethics Working Committee (Version 2.0).” https://aoir.org/reports/ethics2.pdf and guidelines chart: https://aoir.org/wp-content/uploads/2017/01/aoir_ethics_graphic_2016.pdf

• ‡[BigDataEthics-Wired] Sarah Zhang. 2016. “Scientists are Just as Confused about the Ethics of Big-Data Research as You.” Wired. https://www.wired.com/2016/05/scientists-just-confused-ethics-big-data-research/

• †[BigDataEthics-HerschelMori] Richard Herschel and Virginia M. Mori. 2017. “Ethics & Big Data.” Technology in Society 49: 31–36. http://doi.org/10.1016/j.techsoc.2017.03.003

The Science of Data Privacy

• †[BigDataPrivacy] Terence Craig and Mary E. Ludloff. 2011. Privacy and Big Data. O’Reilly Media. http://pensu.eblib.com/patron/FullRecord.aspx?p=781814

• ‡[DataAnalysisPrivacy] John Abowd, Lorenzo Alvisi, Cynthia Dwork, Sampath Kannan, Ashwin Machanavajjhala, Jerome Reiter. 2017. “Privacy-Preserving Data Analysis for the Federal Statistical Agencies.” A Computing Community Consortium white paper. https://arxiv.org/abs/1701.00752

• †[DataPrivacy] Stephen E. Fienberg and Aleksandra B. Slavkovic. 2011. “Data Privacy and Confidentiality.” International Encyclopedia of Statistical Science: 342–5. http://doi.org/10.1007/978-3-642-04898-2_202

• ‡[NetworksPrivacy] Vishesh Karwa and Aleksandra Slavkovic. 2016. “Inference using noisy degrees: Differentially private β-model and synthetic graphs.” Annals of Statistics 44(1): 87–112. http://projecteuclid.org/euclid.aos/1449755958

• †[DataPublishingPrivacy] Raymond Chi-Wing Wong and Ada Wai-Chee Fu. 2010. Privacy-Preserving Data Publishing: An Overview. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdfplus/10.2200/S00237ED1V01Y201003DTM002

• DataMatching, Ch. 8, “Privacy Aspects of Data Matching”

Transparency, Reproducibility, and Team Science

• ‡[10RulesforData] Alyssa Goodman, Alberto Pepe, Alexander W. Blocker, Christine L. Borgman, Kyle Cranmer, Merce Crosas, Rosanne Di Stefano, Yolanda Gil, Paul Groth, Margaret Hedstrom, David W. Hogg, Vinay Kashyap, Ashish Mahabal, Aneta Siemiginowska, and Aleksandra Slavkovic (2014). “Ten Simple Rules for the Care and Feeding of Scientific Data.” PLoS Computational Biology 10(4): e1003542. https://doi.org/10.1371/journal.pcbi.1003542 (Note esp. curated resources for reproducible research.)


• †[Transparency] E. Miguel, C. Camerer, K. Casey, J. Cohen, K. M. Esterling, A. Gerber, R. Glennerster, D. P. Green, M. Humphreys, G. Imbens, D. Laitin, T. Madon, L. Nelson, B. A. Nosek, M. Petersen, R. Sedlmayr, J. P. Simmons, U. Simonsohn, M. Van der Laan. 2014. “Promoting Transparency in Social Science Research.” Science 343(6166): 30–1. https://doi.org/10.1126/science.1245317

• †[Reproducibility] Marcus R. Munafo, Brian A. Nosek, Dorothy V. M. Bishop, Katherine S. Button, Christopher D. Chambers, Nathalie Percie du Sert, Uri Simonsohn, Eric-Jan Wagenmakers, Jennifer J. Ware, and John P. A. Ioannidis. 2017. “A Manifesto for Reproducible Science.” Nature Human Behavior 0021 (2017). https://doi.org/10.1038/s41562-016-0021

• ‡[SoftwareCarpentry] esp. “Lessons”: http://software-carpentry.org/lessons/

• DSHandbook-Py, Ch. 9, “Technical Communication and Documentation”; Ch. 15, “Software Engineering Best Practices”

• †[TeamScienceToolkit] Vogel AL, Hall KL, Fiore SM, Klein JT, Bennett LM, Gadlin H, Stokols D, Nebeling LC, Wuchty S, Patrick K, Spotts EL, Pohl C, Riley WT, Falk-Krzesinski HJ. 2013. “The Team Science Toolkit: enhancing research collaboration through online knowledge sharing.” American Journal of Preventive Medicine 45: 787–9. https://doi.org/10.1016/j.amepre.2013.09.001

• §[Databrary] Kara Hall, Robert Croyle, and Amanda Vogel. Forthcoming (2017). Advancing Social and Behavioral Health Research through Cross-disciplinary Team Science. Springer. Includes Rick O. Gilmore and Karen E. Adolph, “Open Sharing of Research Video: Breaking the Boundaries of the Research Team.” (See http://databrary.org.)

• CRAN: https://cran.r-project.org/web/views/ReproducibleResearch.html

Social Bias, Fair Algorithms

• MachineBias

• ‡[EmbeddingsBias] Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. “Semantics Derived Automatically from Language Corpora Contain Human Biases.” Science. https://arxiv.org/abs/1608.07187

• ‡[Debiasing] Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, Adam Kalai. 2016. “Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings.” https://arxiv.org/abs/1607.06520

• ‡[AvoidingBias] Moritz Hardt, Eric Price, Nathan Srebro. 2016. “Equality of Opportunity in Supervised Learning.” https://arxiv.org/abs/1610.02413

• ‡[InevitableBias] Jon Kleinberg, Sendhil Mullainathan, Manish Raghavan. 2016. “Inherent Trade-Offs in the Fair Determination of Risk Scores.” https://arxiv.org/abs/1609.05807

Databases and Data Management

• NRCreport, Ch. 3, “Scaling the Infrastructure for Data Management”

• †[SQL] Jan L. Harrington. 2010. SQL Clearly Explained. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780123756978

• DSHandbook-Py, Ch. 14, “Databases”

• †[noSQL] Guy Harrison. 2015. Next Generation Databases: NoSQL, NewSQL, and Big Data. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-1329-2

• †[Cloud] Divyakant Agrawal, Sudipto Das, and Amr El Abbadi. 2012. Data Management in the Cloud: Challenges and Opportunities. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdfplus/10.2200/S00456ED1V01Y201211DTM032


• †Morgan & Claypool Synthesis Lectures on Data Management. http://www.morganclaypool.com/toc/dtm/1/1

Data Wrangling

Theoretically-structured approaches to data wrangling

• ‡[TidyData-R] Garrett Grolemund and Hadley Wickham. 2017. R for Data Science (http://r4ds.had.co.nz/), esp. Ch. 12, “Tidy Data”; also Wickham. 2014. “Tidy Data.” Journal of Statistical Software 59(10). http://www.jstatsoft.org/v59/i10/paper. Tools: The tidyverse, http://tidyverse.org/

• ‡[DataScience-Py] Jake VanderPlas. 2016. Python Data Science Handbook (https://github.com/jakevdp/PythonDataScienceHandbook), esp. Ch. 3 on pandas: http://pandas.pydata.org/

• ‡[DataCarpentry] Colin Gillespie and Robin Lovelace. 2017. Efficient R Programming. https://csgillespie.github.io/efficientR/, esp. Ch. 6 on “efficient data carpentry”

• §[Wrangler] Joseph M. Hellerstein, Jeffrey Heer, Tye Rattenbury, and Sean Kandel. 2017. Data Wrangling: Practical Techniques for Data Preparation. Tool: Trifacta Wrangler, http://www.trifacta.com/products/wrangler/

Data wrangling practice

• DSHandbook-Py, Ch. 4, “Data Munging: String Manipulation, Regular Expressions, and Data Cleaning”

• †[DataSimplification] Jules J. Berman. 2016. Data Simplification: Taming Information with Open Source Tools. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780128037812

• †[Wrangling-Py] Jacqueline Kazil, Katharine Jarmul. 2016. Data Wrangling with Python. O’Reilly. http://proquestcombo.safaribooksonline.com.ezaccess.libraries.psu.edu/9781491948804

• †[Wrangling-R] Bradley C. Boehmke. 2016. Data Wrangling with R. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45599-0

Record Linkage, Entity Resolution, Deduplication

• †[DataMatching] Peter Christen. 2012. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-642-31164-2 (esp. Ch. 2, “The Data Matching Process”)

• DataSimplification, Chapter 5, “Identifying and Deidentifying Data”

• ‡[EntityResolution] Lise Getoor and Ashwin Machanavajjhala. 2013. “Entity Resolution for Big Data.” Tutorial, KDD. http://www.umiacs.umd.edu/~getoor/Tutorials/ER_KDD2013.pdf

• ‡[SyrianCasualties] Peter Sadosky, Anshumali Shrivastava, Megan Price, and Rebecca C. Steorts. 2015. “Blocking Methods Applied to Casualty Records from the Syrian Conflict.” https://arxiv.org/abs/1510.07714 (For more on blocking, see DataMatching, Ch. 4, “Indexing.”)

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html, “Statistical Matching and Record Linkage”

“Making up data”: Imputation, Smoothers, Kernels, Priors, Filters, Teleportation, Negative Sampling, Convolution, Augmentation, Adversarial Training


• †[Imputation] Yi Deng, Changgee Chang, Moges Seyoum Ido, and Qi Long. 2016. “Multiple Imputation for General Missing Data Patterns in the Presence of High-dimensional Data.” Scientific Reports 6(21689). https://www.nature.com/articles/srep21689

• ‡[Overimputation] Matthew Blackwell, James Honaker, and Gary King. 2017. “A Unified Approach to Measurement Error and Missing Data: Overview and Applications.” Sociological Methods & Research 46(3): 303–341. http://gking.harvard.edu/files/gking/files/measure.pdf

• Shalizi-ADA, Sect. 1.5 (Linear Smoothers), Ch. 8 (Splines), Sect. 14.4 (Kernel Density Estimates); DataScience-Python, NB 05.13, “Kernel Density Estimation”

• FightinWords (re: Bayesian priors as additional data, impact of priors on regularization)

• DeepLearning, Section 7.5 (Data Augmentation), 7.12 (Dropout), 7.13 (Adversarial Examples), Chapter 9 (Convolutional Networks)

• †[Adversarial] Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy. 2015. “Explaining and Harnessing Adversarial Examples.” https://arxiv.org/abs/1412.6572

• See also: resampling and simulation methods

• See also: feature engineering, preprocessing

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html, “Imputation”

(Direct) Data Representations, Data Mappings

• NRCreport, Ch. 5, “Large-Scale Data Representations”

• †[PatternRecognition] M. Narasimha Murty and V. Susheela Devi. 2011. Pattern Recognition: An Algorithmic Approach. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-0-85729-495-1. Section 2.1, “Data Structures for Pattern Representation”

• ‡[InfoRetrieval] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schutze. 2009. Introduction to Information Retrieval. Cambridge University Press. http://nlp.stanford.edu/IR-book/. Ch. 1, 2, 6 (also note slides used in their class).

• †[Algorithms] Brian Steele, John Chandler, Swarna Reddy. 2016. Algorithms for Data Science. Wiley. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45797-0. Ch. 2, “Data Mapping and Data Dictionaries”

• See also: Social Data Structures

Similarity, Distance, Association, Covariance, The Kernel Trick

• PatternRecognition, Section 2.3, “Proximity Measures”

• ‡[Similarity] Brendan O’Connor. 2012. “Cosine similarity, Pearson correlation, and OLS coefficients.” https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients/

• For relatively comprehensive lists, see also:

M.-J. Lesot, M. Rifqi, and H. Benhadda. 2009. “Similarity measures for binary and numerical data: a survey.” Int. J. Knowledge Engineering and Soft Data Paradigms 1(1): 63–. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10112126533&rep=rep1&type=pdf

Seung-Seok Choi, Sung-Hyuk Cha, Charles C. Tappert. 2010. “A Survey of Binary Similarity and Distance Measures.” Journal of Systemics, Cybernetics & Informatics 8(1): 43–8. http://www.iiisci.org/Journal/CV$/sci/pdfs/GS315JG.pdf


Sung-Hyuk Cha. 2007. “Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions.” International Journal of Mathematical Models and Methods in Applied Sciences 4(1): 300–7. http://csis.pace.edu/ctappert/dps/d861-12/session4-p2.pdf

Anna Huang. 2008. “Similarity Measures for Text Document Clustering.” Proceedings of the 6th New Zealand Computer Science Research Student Conference: 49–56. http://www.nzcsrsc08.canterbury.ac.nz/site/proceedings/Individual_Papers/pg049_Similarity_Measures_for_Text_Document_Clustering.pdf

• Michael I. Jordan. “The Kernel Trick” (Lecture Notes). https://people.eecs.berkeley.edu/~jordan/courses/281B-spring04/lectures/lec3.pdf

• Eric Kim. 2017. “Everything You Ever Wanted to Know about the Kernel Trick (But Were Afraid to Ask).” http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick_blog_ekim_12_20_2017.pdf

Derived Data Representations - Dimensionality Reduction, Compression, Decomposition, Embeddings

The groupings here and under the measurement / multivariate statistics section are particularly arbitrary. For example, “k-Means clustering” can be viewed as a technique for “dimensionality reduction,” “compression,” “feature extraction,” “latent variable measurement,” “unsupervised learning,” “collaborative filtering.”
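One way to see the “compression” reading of k-means: each observation is replaced by the index of its nearest centroid, so the data are stored as a small codebook plus one integer per point (vector quantization). The sketch below is illustrative only. It assumes one-dimensional toy data, a hypothetical helper named kmeans_1d, and a simple deterministic initialization; it is not drawn from any of the readings.

```python
def kmeans_1d(values, k, iters=20):
    """Lloyd's algorithm on scalars; returns (centroids, labels)."""
    ordered = sorted(values)
    n = len(ordered)
    # Deterministic initialization (an assumption of this sketch):
    # evenly spaced order statistics of the data.
    centroids = [ordered[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        # Assignment step: index of the nearest centroid for each value.
        labels = [
            min(range(k), key=lambda j: (v - centroids[j]) ** 2)
            for v in values
        ]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:
                centroids[j] = sum(members) / len(members)
    return centroids, labels


# Toy data with three obvious groups.
values = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.1, 8.9]
centroids, labels = kmeans_1d(values, k=3)

# "Compressed" form: a k-entry codebook plus one small integer per point.
reconstructed = [centroids[lab] for lab in labels]
max_error = max(abs(v - r) for v, r in zip(values, reconstructed))
print(centroids, labels, max_error)
```

The same labels can equally be read as a discrete latent variable or as a one-dimensional summary of each point, which is why the topic could sit under several of the headings in this section.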

Clustering, hashing, quantization, blocking, compression

• ‡[Compression] Khalid Sayood. 2012. Introduction to Data Compression, 4th ed. Springer. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780124157965 (e.g., coding, blocking via vector quantization)

• MMDS, Ch. 3, “Finding Similar Items” (minhashing, locality sensitive hashing)

• Multivariate-R, Ch. 6, “Clustering”

• DataScience-Python, NB 05.11, “k-Means Clustering”; NB 05.12, “Gaussian Mixture Models”

• ‡[KMeansHashing] Kaiming He, Fang Wen, Jian Sun. 2013. “K-means Hashing: An Affinity-Preserving Quantization Method for Learning Binary Compact Codes.” CVPR. https://www.cv-foundation.org/openaccess/content_cvpr_2013/papers/He_K-Means_Hashing_An_2013_CVPR_paper.pdf

• †[Squashing] Madigan, D., Raghavan, N., Dumouchel, W., Nason, M., Posse, C., and Ridgeway, G. (2002). “Likelihood-based data squashing: A modeling approach to instance construction.” Data Mining and Knowledge Discovery 6(2): 173–190. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1023/A:1014095614948

• †[Core-sets] Piotr Indyk, Sepideh Mahabadi, Mohammad Mahdian, Vahab S. Mirrokni. 2014. “Composable core-sets for diversity and coverage maximization.” PODS ’14: 100–8. https://doi.org/10.1145/2594538.2594560

Feature selection, feature extraction, feature engineering, weighting, preprocessing

• PatternRecognition, Sections 2.6–7, “Feature Selection,” “Feature Extraction”

• DSHandbook-Py, Ch. 7, “Interlude: Feature Extraction Ideas”

• DataScience-Python, NB 05.04, “Feature Engineering”

• Features in text: NLP and InfoRetrieval re: tfidf and similar; FightinWords

• Features in images: DataScience-Python, NB 05.14, “Image Features”


Dimensionality reduction, decomposition, factorization, change of basis, reparameterization, matrix completion, latent variables, source separation

• MMDS, Ch. 9, “Recommendation Systems”; Ch. 11, “Dimensionality Reduction”

• DSHandbook-Py, Ch. 10, “Unsupervised Learning: Clustering and Dimensionality Reduction”

• ‡[Shalizi-ADA] Cosma Rohilla Shalizi. 2017. Advanced Data Analysis from an Elementary Point of View. Ch. 16 (“Principal Components Analysis”), Ch. 17 (“Factor Models”). http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/

See also Multivariate-R, Ch. 3, “Principal Components Analysis”; Ch. 4, “Multidimensional Scaling”; Ch. 5, “Exploratory Factor Analysis”; Latent

• DeepLearning, Ch. 2, “Linear Algebra”; Ch. 13, “Linear Factor Models”

• ‡[GloVe] Jeffrey Pennington, Richard Socher, Christopher Manning. 2014. “GloVe: Global vectors for word representation.” EMNLP. https://nlp.stanford.edu/projects/glove/

• †[NMF] Daniel D. Lee and H. Sebastian Seung. 1999. “Learning the parts of objects by non-negative matrix factorization.” Nature 401: 788–791. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1038/44565

• †[CUR] Michael W. Mahoney and Petros Drineas. 2009. “CUR matrix decompositions for improved data analysis.” PNAS. http://www.pnas.org/content/106/3/697.full

• ‡[ICA] Aapo Hyvarinen and Erkki Oja. 2000. “Independent Component Analysis: Algorithms and Applications.” Neural Networks 13(4-5): 411–30. https://www.cs.helsinki.fi/u/ahyvarin/papers/NN00new.pdf

• [RandomProjection] Ella Bingham and Heikki Mannila. 2001. “Random projection in dimensionality reduction: applications to image and text data.” KDD. https://doi.org/10.1145%2F502512.502546

• Compression, e.g., “Transform coding,” “Wavelets”

Nonlinear dimensionality reduction, Manifold learning

• DeepLearning, Section 5.11.3, “Manifold Learning”

• DataScience-Python, NB 05.10, “Manifold Learning” (Locally linear embedding [LLE], Isomap)

• ‡[KernelPCA] Sebastian Raschka. 2014. “Kernel tricks and nonlinear dimensionality reduction via RBF kernel PCA.” http://sebastianraschka.com/Articles/2014_kernel_pca.html

• Shalizi-ADA, Ch. 18, “Nonlinear Dimensionality Reduction” (LLE)

• ‡[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi. 2003. “Laplacian eigenmaps for dimensionality reduction and data representation.” Neural Computation 15(6): 1373–1396. http://web.cse.ohio-state.edu/~belkin.8/papers/LEM_NC_03.pdf

• DeepLearning, Ch. 14 (“Autoencoders”)

• ‡[word2vec] Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” https://arxiv.org/abs/1301.3781

• ‡[word2vecExplained] Yoav Goldberg and Omer Levy. 2014. “word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method.” https://arxiv.org/abs/1402.3722

• ‡[t-SNE] Laurens van der Maaten and Geoffrey Hinton. 2008. “Visualising Data using t-SNE.” Journal of Machine Learning Research 9: 2579–2605. http://jmlr.csail.mit.edu/papers/volume9/vandermaaten08a/vandermaaten08a.pdf


Computation and Scaling Up

Scientific computing, computation at scale

• NRCreport, Ch. 10, “The Seven Computational Giants of Massive Data Analysis”

• DSHandbook-Py, Ch. 21, “Performance and Computer Memory”; Ch. 22, “Computer Memory and Data Structures”

• ‡[ComputationalStatistics-Python] Cliburn Chan. Computational Statistics in Python. https://people.duke.edu/~ccc14/sta-663/index.html

• CRAN Task View: High Performance Computing. https://cran.r-project.org/web/views/HighPerformanceComputing.html

Numerical computing

• [NumericalComputing] Ward Cheney and David Kincaid. 2013. Numerical Mathematics and Computing, 7th ed. Brooks/Cole, Cengage Learning.

• CRAN Task View: Numerical Mathematics. https://cran.r-project.org/web/views/NumericalMathematics.html

Optimization (e.g., MLE, gradient descent, stochastic gradient descent, EM algorithm, neural nets)

• DSHandbook-Py, Ch. 23, “Maximum Likelihood Estimation and Optimization” (gradient descent)

• DeepLearning, Ch. 4, “Numerical Computation”; Ch. 6, “Deep Feedforward Networks”; Ch. 8, “Optimization for Training Deep Models”

• CRAN Task View: Optimization and Mathematical Programming. https://cran.r-project.org/web/views/Optimization.html

Linear algebra, matrix computations

• [MatrixComputations] Gene H. Golub and Charles F. Van Loan. 2013. Matrix Computations, 4th ed. Johns Hopkins University Press.

• ‡[NetflixMatrix] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. “Matrix Factorization Techniques for Recommender Systems.” Computer, August: 42–9. https://datajobs.com/data-science-repo/Recommender-Systems-[Netflix].pdf

• ‡[BigDataPCA] Jianqing Fan, Qiang Sun, Wen-Xin Zhou, Ziwei Zhu. “Principal Component Analysis for Big Data.” http://www.princeton.edu/~ziweiz/pca.pdf

• ‡[RandomSVD] Andrew Tulloch. 2009. “Fast Randomized SVD.” https://research.fb.com/fast-randomized-svd/

• ‡[FactorSGD] Rainer Gemulla, Peter J. Haas, Erik Nijkamp, and Yannis Sismanis. 2011. “Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent.” KDD. http://www.cs.utah.edu/~hari/teaching/bigdata/gemulla11dsgd.pdf

• ‡[Sparse] Max Grossman. 2015. “101 Ways to Store a Sparse Matrix.” https://medium.com/@jmaxg3/101-ways-to-store-a-sparse-matrix-c7f2bf15a229

• ‡[DontInvert] John D. Cook. 2010. “Don’t Invert that Matrix.” https://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix/

Simulation-based inference, resampling, Monte Carlo methods, MCMC, Bayes, approximate inference


• ‡[MarkovVisually] Victor Powell. “Markov Chains Explained Visually.” http://setosa.io/ev/markov-chains/

• DSHandbook-Py, Ch. 25, “Stochastic Modeling” (Markov chains, MCMC, HMM)

• Bayes: ComputationalStatistics-Py

• DeepLearning, Ch. 17, “Monte Carlo Methods”; Ch. 19, “Approximate Inference”

• ‡[VariationalInference] Jason Eisner. 2011. “High-Level Explanation of Variational Inference.” https://www.cs.jhu.edu/~jason/tutorials/variational.html

• CRAN Task View: Bayesian Inference. https://cran.r-project.org/web/views/Bayesian.html

Parallelism, MapReduce, Split-Apply-Combine

• ‡[MapReduceIntuition] Jigsaw Academy. 2014. “Big Data Specialist: MapReduce.” https://www.youtube.com/watch?v=TwcYQzFqg-8&feature=youtu.be

• MMDS, Ch. 2, “Map-Reduce and the New Software Stack” (3rd Edition discusses Spark & TensorFlow)

• Algorithms, Ch. 3, “Scalable Algorithms and Associative Statistics”; Ch. 4, “Hadoop and MapReduce”

• NRCreport, Ch. 6, “Resources, Trade-offs, and Limitations”

• TidyData-R (Split-apply-combine is the motivating principle behind the “tidyverse” approach). See also Part III, “Program” (pipes, functions, vectors, iteration).

• ‡[TidySAC-Video] Hadley Wickham. 2017. “Data Science in the Tidyverse.” https://www.rstudio.com/resources/videos/data-science-in-the-tidyverse/

• See also: FactorSGD

Functional Programming

• ComputationalStatistics-Python, “Functions are first class objects” through first exercises

• DSHandbook-Py, Ch. 20, “Programming Language Concepts”

• See also: Haskell

Scaling iteration, streaming data, online algorithms (Spark)

• NRCreport, Ch. 4, “Temporal Data and Real-Time Algorithms”

• MMDS, Ch. 4, “Mining Data Streams”

• ‡[BDAS] AMPLab. BDAS, The Berkeley Data Analytics Stack. https://amplab.cs.berkeley.edu/software/

• †[Spark] Mohammed Guller. 2015. Big Data Analytics with Spark: A Practitioner’s Guide to Using Spark for Large-Scale Data Processing, Machine Learning, and Graph Analytics, and High-Velocity Data Stream Processing. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-0964-6

• DSHandbook-Py, Ch. 13, “Big Data”

• DeepLearning, Ch. 10, “Sequence Modeling: Recurrent and Recursive Nets”

General Resources for Python and R


• †[DSHandbook-Py] Field Cady. 2017. The Data Science Handbook (Python-based). Wiley. http://onlinelibrary.wiley.com.ezaccess.libraries.psu.edu/book/10.1002/9781119092919

• ‡[Tutorials-R] Ujjwal Karn. 2017. “A curated list of R tutorials for Data Science, NLP, and Machine Learning.” https://github.com/ujjwalkarn/DataScienceR

• ‡[Tutorials-Python] Ujjwal Karn. 2017. “A curated list of Python tutorials for Data Science, NLP, and Machine Learning.” https://github.com/ujjwalkarn/DataSciencePython

Cutting and Bleeding Edge of Data Science Languages

• [Scala] †Vishal Layka and David Pollak. 2015. Beginning Scala. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-0232-6. †Nicolas Patrick. 2014. Scala for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1901910. ‡Scala site: https://www.scala-lang.org/. (See also )

• [Julia] †Ivo Balbaert. 2015. Getting Started with Julia. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1973847. ‡Douglas Bates. 2013. “Julia for R Programmers.” http://www.stat.wisc.edu/~bates/JuliaForRProgrammers.pdf. ‡Julia site: https://julialang.org/

• [Haskell] †Richard Bird. 2014. Thinking Functionally with Haskell. https://doi-org.ezaccess.libraries.psu.edu/10.1017/CBO9781316092415. †Hakim Cassimally. 2017. Learning Haskell Programming. https://www.lynda.com/Haskell-tutorials/Learning-Haskell-Programming/604926-2.html. †James Church. 2017. Learning Haskell for Data Analysis. https://www.lynda.com/Developer-tutorials/Learning-Haskell-Data-Analysis/604234-2.html. Haskell site: https://www.haskell.org/

• [Clojure] †Mark McDonnell. 2017. Quick Clojure: Essential Functional Programming. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2952-1. †Akhill Wali. 2014. Clojure for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1674848. †Arthur Ulfeldt. 2015. https://www.lynda.com/Clojure-tutorials/Up-Running-Clojure/413127-2.html. ‡Clojure site: https://clojure.org/

• [TensorFlow] ‡Abadi et al. (Google). 2016. “TensorFlow: A System for Large-Scale Machine Learning.” https://arxiv.org/abs/1605.08695. ‡Udacity, “Deep Learning.” https://www.udacity.com/course/deep-learning--ud730. ‡TensorFlow site: https://www.tensorflow.org/

• [H2O] Darren Cook. 2016. Practical Machine Learning with H2O: Powerful, Scalable Techniques for Deep Learning and AI. O’Reilly. Arno Candel and Viraj Parmar. 2015. Deep Learning with H2O. ‡H2O site: https://www.h2o.ai/

Social Data Structures

Space and Time

• †[GIA] David O’Sullivan, David J. Unwin. 2010. Geographic Information Analysis, Second Edition. John Wiley & Sons. http://onlinelibrary.wiley.com/book/10.1002/9780470549094

• [Space-Time] Donna J. Peuquet. 2003. Representations of Space and Time. Guilford.

• Roger S. Bivand, Edzer Pebesma, Virgilio Gómez-Rubio. 2013. Applied Spatial Data Analysis in R. Free through library: http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-7618-4. See also Edzer Pebesma. 2016. “Handling and Analyzing Spatial, Spatiotemporal, and Movement Data in R.” https://edzer.github.io/UseR2016/

• ‡[PySAL] Sergio J. Rey and Dani Arribas-Bel. 2016. “Geographic Data Science with PySAL and the pydata Stack.” http://darribas.org/gds_scipy16/


• DataScience-Python, NB 04.13, “Geographic Data with Basemap”

• Time, “Longitudinal”: intra-individual data, as common in developmental psychology. †[Longitudinal] Garrett Fitzmaurice, Marie Davidian, Geert Verbeke, and Geert Molenberghs, editors. 2009. Longitudinal Data Analysis. Chapman & Hall / CRC. http://pensu.eblib.com/patron/FullRecord.aspx?p=359998

• Time, “Time Series, Time Series Cross Section, Panel”: as common in economics, political science, and sociology. DataScience-Python, Ch. 3 re: pandas; DSHandbook-Py, Ch. 17, “Time Series Analysis”; Econometrics; GelmanHill

• Time, “Sequential, Data Streams”: as in NLP, machine learning. Spark and other “streaming data” readings

• Space-Time: Data with continuity, connectivity, neighborhood structure. See DeepLearning, Chapter 9 re: convolution

bull CRAN Task Views Spatial httpscranr-projectorgwebviewsSpatialhtml Spatiotempo-ral httpscranr-projectorgwebviewsSpatioTemporalhtml Time Series httpscran

r-projectorgwebviewsTimeSerieshtml

Network Graphs

• ‡[Networks] David Easley and Jon Kleinberg. 2010. Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press. http://www.cs.cornell.edu/home/kleinber/networks-book/

• MMDS Ch. 5 “Link Analysis,” Ch. 10 “Mining Social-Network Graphs”

• †[Networks-R] Douglas Luke. 2015. A User’s Guide to Network Analysis in R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-3-319-23883-8

• †[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry. 2017. Python for Graph and Network Analysis. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-53004-8

Hierarchy, Clustered Data, Aggregation, Mixed Models

• †[GelmanHill] Andrew Gelman and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press. http://pensu.eblib.com/patron/FullRecord.aspx?p=288457

• †[MixedModels] Eugene Demidenko. 2013. Mixed Models: Theory and Applications with R, 2nd ed. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10748641

Social Data Channels

Language, Text, Speech, Audio

• ‡[NLP] Daniel Jurafsky and James H. Martin. Forthcoming (2017). Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics, 3rd ed. Prentice-Hall. Preprint: https://web.stanford.edu/~jurafsky/slp3/

• InfoRetrieval

• ‡[CoreNLP] Stanford CoreNLP: https://stanfordnlp.github.io/CoreNLP/

• †[TextAsData] Grimmer and Stewart. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis. https://doi.org/10.1093/pan/mps028

• Quinn-Topics

• FightinWords

• †[NLTK-Python] Jacob Perkins. 2010. Python Text Processing with NLTK 2.0 Cookbook. Packt. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10435387 †[TextAnalytics-Python] Dipanjan Sarkar. 2016. Text Analytics with Python. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2388-8

• DSHandbook-Py Ch. 16 “Natural Language Processing”

• Matthew J. Denny: http://www.mjdenny.com/Text_Processing_In_R.html

• CRAN Natural Language Processing: https://cran.r-project.org/web/views/NaturalLanguageProcessing.html

• †[AudioData] Dean Knox and Christopher Lucas. 2017. “The Speaker-Affect Model: Measuring Emotion in Political Speech with Audio Data.” http://christopherlucas.org/files/PDFs/sam.pdf https://www.youtube.com/watch?v=Hs8A9dwkMzI

• †Morgan & Claypool Synthesis Lectures on Human Language Technologies: http://www.morganclaypool.com/toc/hlt/1/1

• †Morgan & Claypool Synthesis Lectures on Speech and Audio Processing: http://www.morganclaypool.com/toc/sap/1/1

Vision, Image, Video

• ‡[ComputerVision] Richard Szeliski. 2010. Computer Vision: Algorithms and Applications. Springer. http://szeliski.org/Book/

• MixedModels Ch. 11 “Statistical Analysis of Shape,” Ch. 12 “Statistical Image Analysis”

• Audio/Video/Volumetric: See DeepLearning Chapter 9 re: convolution

• †Morgan & Claypool Synthesis Lectures on Image, Video, and Multimedia Processing: http://www.morganclaypool.com/toc/ivm/1/1

• †Morgan & Claypool Synthesis Lectures on Computer Vision: http://www.morganclaypool.com/toc/cov/1/1

Approaches to Learning from Data (The Analytics Layer)

• NRCreport Ch. 7 “Building Models from Massive Data”

• MMDS (data-mining)

• CausalInference

• ‡[VisualAnalytics] Daniel Keim, Jörn Kohlhammer, Geoffrey Ellis, and Florian Mansmann, editors. 2010. Mastering the Information Age: Solving Problems with Visual Analytics. The Eurographics Association, Goslar, Germany. http://www.vismaster.eu/book/ See also †Morgan & Claypool Synthesis Lectures on Visualization: http://www.morganclaypool.com/toc/vis/2/1

• ‡[Econometrics] Bruce E. Hansen. 2014. Econometrics. http://www.ssc.wisc.edu/~bhansen/econometrics/

• ‡[StatisticalLearning] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning with Applications in R. Springer. http://www-bcf.usc.edu/~gareth/ISL/index.html

• [MachineLearning] Christopher Bishop. 2006. Pattern Recognition and Machine Learning. Springer.

• †[Bayes] Peter D. Hoff. 2009. A First Course in Bayesian Statistical Methods. Springer. http://link.springer.com/book/10.1007%2F978-0-387-92407-6

• †[GraphicalModels] Søren Højsgaard, David Edwards, Steffen Lauritzen. 2012. Graphical Models with R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-2299-0

• [InformationTheory] David MacKay. 2003. Information Theory, Inference, and Learning Algorithms. Cambridge University Press.

• ‡[DeepLearning] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org (see also Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. “Deep Learning.” Nature 521(7553): 436–444. https://doi.org/10.1038/nature14539)

See also ‡[Keras] Keras: The Python Deep Learning Library, https://keras.io; François Chollet, Directory of Keras tutorials: https://github.com/fchollet/keras-resources

• †[SignalProcessing] Jose Maria Giron-Sierra. 2013. Digital Signal Processing with MATLAB Examples (Volumes 1-3). Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-981-10-2534-1

Penn State Policy Statements

Academic Integrity

Academic integrity is the pursuit of scholarly activity in an open, honest, and responsible manner. Academic integrity is a basic guiding principle for all academic activity at The Pennsylvania State University, and all members of the University community are expected to act in accordance with this principle. Consistent with this expectation, the University’s Code of Conduct states that all students should act with personal integrity, respect other students’ dignity, rights, and property, and help create and maintain an environment in which all can succeed through the fruits of their efforts.

Academic integrity includes a commitment by all members of the University community not to engage in or tolerate acts of falsification, misrepresentation, or deception. Such acts of dishonesty violate the fundamental ethical principles of the University community and compromise the worth of work completed by others.

Disability Accommodation

Penn State welcomes students with disabilities into the University’s educational programs. Every Penn State campus has an office for students with disabilities. The Student Disability Resources (SDR) website provides contact information for every Penn State campus (http://equity.psu.edu/sdr/disability-coordinator). For further information, please visit the Student Disability Resources website (http://equity.psu.edu/sdr/).

In order to receive consideration for reasonable accommodations, you must contact the appropriate disability services office at the campus where you are officially enrolled, participate in an intake interview, and provide documentation. See documentation guidelines at (http://equity.psu.edu/sdr/guidelines). If the documentation supports your request for reasonable accommodations, your campus disability services office will provide you with an accommodation letter. Please share this letter with your instructors and discuss the accommodations with them as early as possible. You must follow this process for every semester that you request accommodations.

Psychological Services

Many students at Penn State face personal challenges or have psychological needs that may interfere with their academic progress, social development, or emotional wellbeing. The university offers a variety of confidential services to help you through difficult times, including individual and group counseling, crisis intervention, consultations, online chats, and mental health screenings. These services are provided by staff who welcome all students and embrace a philosophy respectful of clients’ cultural and religious backgrounds, and sensitive to differences in race, ability, gender identity, and sexual orientation.

• Counseling and Psychological Services at University Park (CAPS) (http://studentaffairs.psu.edu/counseling/): 814-863-0395

• Penn State Crisis Line (24 hours / 7 days/week): 877-229-6400

• Crisis Text Line (24 hours / 7 days/week): Text LIONS to 741741

Educational Equity

Penn State takes great pride to foster a diverse and inclusive environment for students, faculty, and staff. Consistent with University Policy AD29, students who believe they have experienced or observed a hate crime, an act of intolerance, discrimination, or harassment that occurs at Penn State are urged to report these incidents as outlined on the University’s Report Bias webpage (http://equity.psu.edu/reportbias/).

Background Knowledge

Students will have a variety of backgrounds. Prior to beginning interdisciplinary coursework to fulfill Social Data Analytics degree requirements, including SoDA 501 and 502, students are expected to have advanced (graduate) training in at least one of the component areas of Social Data Analytics, and a familiarity with basic concepts in the others.

With regard to specialization, students are expected to have advanced (graduate) training in ONE of the following:

• quantitative social science methodology and a discipline of social science (as would be the case for a second-year PhD student in Political Science, Sociology, Criminology, Human Development and Family Studies, or Demography), OR

• statistics (as would be the case for a second-year PhD student in Statistics), OR

• information science or informatics (as would be the case for a second-year PhD student in Information Science and Technology, or a second-year PhD student in Geography specializing in GIScience), OR

• computer science (as would be the case for a second-year PhD student in Computer Science and Engineering).

This requirement is met as a matter of meeting home program requirements for students in the dual-title PhD, but may require additional coursework on the part of students in other programs wishing to pursue the graduate minor.

With regard to general preparation, students are expected to have ALL of the following technical knowledge:

• basic programming skills, including basic facility with R and/or Python, AND

• basic knowledge of relational databases and/or geographic information systems, AND

• basic knowledge of probability, applied statistics, and/or social science research design, AND

• basic familiarity with a substantive or theoretical area of social science (e.g., 300-level coursework in political science, sociology, criminology, human development, psychology, economics, communication, anthropology, human geography, social informatics, or similar fields).

It is not unusual for students to have one or more gaps in this preparation at time of application to the SoDA program. Students should work with Social Data Analytics advisers to develop a plan for timely remediation of any deficiencies, which generally will not require formal coursework for students whose training and interests are otherwise appropriate for pursuit of the Social Data Analytics degree. Where possible, this will be addressed at time of application to the Social Data Analytics program.

To this end, some free training materials are linked in the reference section, and there are also abundant free, high-quality, self-paced, course-style training materials on these and related subjects available through edX (http://www.edx.org), Udacity (http://www.udacity.com), Codecademy (http://codecademy.com), and Lynda (http://www.lynda.psu).

• ‡[CausalInference] Miguel A. Hernán, James M. Robins. Forthcoming (2017). Causal Inference. Chapman & Hall/CRC. Preprint: https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/ Ch. 1 “A definition of causal effect,” Ch. 2 “Randomized experiments,” Ch. 3 “Observational studies.” (See Ch. 6 “Graphical representation of causal effects” for integration with Judea Pearl approach.)

• BitByBit Chapter 4 “Running Experiments”; Ch. 2: Natural experiments in observable data examples

• ResearchMethodsKB “Design”

• Firebaugh-7Rules Ch. 2 “Look for Differences that Make a Difference, and Report Them,” §Ch. 5 “Compare Like with Like,” Ch. 6 “Use Panel Data to Study Individual Change and Repeated Cross-Section Data to Study Social Change”

• Shalizi-ADA Part IV “Causal Inference”

• Monroe-No

• CRAN: https://cran.r-project.org/web/views/ExperimentalDesign.html

Technologies for Primary and Secondary Data Collection

Mobile Devices, Distributed Sensors, Wearable Sensors, Remote Sensing

• †[RealityMining] Nathan Eagle and Alex (Sandy) Pentland. 2006. “Reality Mining: Sensing Complex Social Systems.” Personal and Ubiquitous Computing 10(4): 255–68. https://doi.org/10.1007/s00779-005-0046-3

• †[QuantifiedSelf] Melanie Swan. 2013. “The Quantified Self: Fundamental Disruption in Big Data Science and Biological Discovery.” Big Data 1(2): 85–99. https://doi.org/10.1089/big.2012.0002

• †[SensorData] Charu C. Aggarwal (Ed.). 2013. Managing and Mining Sensor Data. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-6309-2 Ch. 1 “An Introduction to Sensor Data Analytics”

• †[NightLights] Thushyanthan Baskaran, Brian Min, Yogesh Uppal. 2015. “Election cycles and electricity provision: Evidence from a quasi-experiment with Indian special elections.” Journal of Public Economics 126: 64–73. https://doi.org/10.1016/j.jpubeco.2015.03.011

Crowdsourcing, Human Computation, Citizen Science, Web Experiments

• †[Crowdsourcing] Daren C. Brabham. 2013. Crowdsourcing. MIT Press. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10692208 “Introduction,” Ch. 1 “Concepts, Theories, and Cases of Crowdsourcing”

• BitByBit Ch. 5 “Creating Mass Collaboration”

• NRCreport Ch. 9 “Human Interaction with Data”

• †[MTurk] Krista Casler, Lydia Bickel, and Elizabeth Hackett. 2013. “Separate but Equal? A Comparison of Participants and Data Gathered via Amazon’s MTurk, Social Media, and Face-to-Face Behavioral Testing.” Computers in Human Behavior 29(6): 2156–60. http://doi.org/10.1016/j.chb.2013.05.009

• †[LabintheWild] Katharina Reinecke and Krzysztof Z. Gajos. 2015. “LabintheWild: Conducting Large-Scale Online Experiments With Uncompensated Samples.” In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing (CSCW ’15), 1364–1378. http://dx.doi.org/10.1145/2675133.2675246

• †[TweetmentEffects] Kevin Munger. 2017. “Tweetment Effects on the Tweeted: Experimentally Reducing Racist Harassment.” Political Behavior 39(3): 629–49. http://doi.org/10.1016/j.chb.2013.05.009

• †[HumanComputation] Edith Law and Luis von Ahn. 2011. Human Computation. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdf/10.2200/S00371ED1V01Y201107AIM013

Open Data, File Formats, APIs, Semantic Web, Linked Data

• DSHandbook-Py Ch. 12 “Data Encodings and File Formats”

• ‡[OpenData] Open Data Institute. “What Is Open Data?” https://theodi.org/what-is-open-data

• ‡[OpenDataHandbook] Open Knowledge International. The Open Data Handbook. http://opendatahandbook.org Includes appendix “File Formats”: http://opendatahandbook.org/guide/en/appendices/file-formats/

• ‡[APIs] Brian Cooksey. 2016. An Introduction to APIs. https://zapier.com/learn/apis/

• ‡[APIMarkets] RapidAPI / mashape API marketplaces: https://docs.rapidapi.com, https://market.mashape.com; ProgrammableWeb: https://www.programmableweb.com

• †[LinkedData] Tom Heath and Christian Bizer. 2011. Linked Data: Evolving the Web into a Global Data Space. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/abs/10.2200/S00334ED1V01Y201102WBE001 Ch. 1 “Introduction,” Ch. 2 “Principles of Linked Data”

• †[SemanticWeb] Nikolaos Konstantinou and Dimitrios-Emmanuel Spanos. 2015. Materializing the Web of Linked Data. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-16074-0 Ch. 1 “Introduction: Linked Data and the Semantic Web”

• †Morgan & Claypool Synthesis Lectures on the Semantic Web: Theory and Technology: http://www.morganclaypool.com/toc/wbe/1/1/1

Web Scraping

• †[TheoryDrivenScraping] Richard N. Landers, Robert C. Brusso, Katelyn J. Cavanaugh, and Andrew B. Collmus. 2016. “A Primer on Theory-Driven Web Scraping: Automatic Extraction of Big Data from the Internet for Use in Psychological Research.” Psychological Methods 4: 475–492. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1037/met0000081

• ‡[Scraping-Py] Al Sweigart. 2015. Automate the Boring Stuff with Python: Practical Programming for Total Beginners. Ch. 11, Web-Scraping: https://automatetheboringstuff.com/chapter11/

• †[Scraping-R] Simon Munzert, Christian Rubba, Peter Meißner, and Dominic Nyhuis. 2014. Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining. Wiley. http://onlinelibrary.wiley.com/book/10.1002/9781118834732

• ‡CRAN: https://cran.r-project.org/web/views/WebTechnologies.html

Ethics & Scientific Responsibility in Big Social Data

Human Subjects, Consent, Privacy

• BitByBit Ch. 6 (“Ethics”)

• ‡[PSU-ORP] Penn State Office for Research Protections (under the VP for Research)

Human Subjects Research / IRB: https://www.research.psu.edu/irb

Revised Common Rule: https://www.research.psu.edu/irb/commonrulechanges

Responsible Conduct of Research: https://www.research.psu.edu/education/rcr

Research Misconduct: https://www.research.psu.edu/researchmisconduct

SARI@PSU (Scientific and Research Integrity Training): https://www.research.psu.edu/training/sari

• ‡[MenloReport] David Dittrich and Erin Kenneally (Center for Applied Internet Data Analysis). 2012. “The Menlo Report: Ethical Principles Guiding Information and Communication Technology Research” and companion report “Applying Ethical Principles ...” https://www.caida.org/publications/papers/2012/menlo_report_actual_formatted/

• ‡[BigDataEthics-CBDES] Jacob Metcalf, Emily F. Keller, and danah boyd. 2016. “Perspectives on Big Data, Ethics, and Society.” Council for Big Data, Ethics, and Society. http://bdes.datasociety.net/council-output/perspectives-on-big-data-ethics-and-society/

• ‡[AoIRReport] Annette Markham and Elizabeth Buchanan (Association of Internet Researchers). 2012. “Ethical Decision-Making and Internet Research: Recommendations of the AoIR Ethics Working Committee (Version 2.0).” https://aoir.org/reports/ethics2.pdf and guidelines chart: https://aoir.org/wp-content/uploads/2017/01/aoir_ethics_graphic_2016.pdf

• ‡[BigDataEthics-Wired] Sarah Zhang. 2016. “Scientists are Just as Confused about the Ethics of Big-Data Research as You.” Wired. https://www.wired.com/2016/05/scientists-just-confused-ethics-big-data-research/

• †[BigDataEthics-HerschelMori] Richard Herschel and Virginia M. Mori. 2017. “Ethics & Big Data.” Technology in Society 49: 31–36. http://doi.org/10.1016/j.techsoc.2017.03.003

The Science of Data Privacy

• †[BigDataPrivacy] Terence Craig and Mary E. Ludloff. 2011. Privacy and Big Data. O’Reilly Media. http://pensu.eblib.com/patron/FullRecord.aspx?p=781814

• ‡[DataAnalysisPrivacy] John Abowd, Lorenzo Alvisi, Cynthia Dwork, Sampath Kannan, Ashwin Machanavajjhala, Jerome Reiter. 2017. “Privacy-Preserving Data Analysis for the Federal Statistical Agencies.” A Computing Community Consortium white paper. https://arxiv.org/abs/1701.00752

• †[DataPrivacy] Stephen E. Fienberg and Aleksandra B. Slavkovic. 2011. “Data Privacy and Confidentiality.” International Encyclopedia of Statistical Science: 342–5. http://doi.org/10.1007/978-3-642-04898-2_202

• ‡[NetworksPrivacy] Vishesh Karwa and Aleksandra Slavkovic. 2016. “Inference using noisy degrees: Differentially private β-model and synthetic graphs.” Annals of Statistics 44(1): 87–112. http://projecteuclid.org/euclid.aos/1449755958

• †[DataPublishingPrivacy] Raymond Chi-Wing Wong and Ada Wai-Chee Fu. 2010. Privacy-Preserving Data Publishing: An Overview. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdfplus/10.2200/S00237ED1V01Y201003DTM002

• DataMatching Ch. 8 “Privacy Aspects of Data Matching”

Transparency, Reproducibility, and Team Science

• ‡[10RulesforData] Alyssa Goodman, Alberto Pepe, Alexander W. Blocker, Christine L. Borgman, Kyle Cranmer, Merce Crosas, Rosanne Di Stefano, Yolanda Gil, Paul Groth, Margaret Hedstrom, David W. Hogg, Vinay Kashyap, Ashish Mahabal, Aneta Siemiginowska, and Aleksandra Slavkovic. (2014). “Ten Simple Rules for the Care and Feeding of Scientific Data.” PLoS Computational Biology 10(4): e1003542. https://doi.org/10.1371/journal.pcbi.1003542 (Note esp. curated resources for reproducible research.)

• †[Transparency] E. Miguel, C. Camerer, K. Casey, J. Cohen, K. M. Esterling, A. Gerber, R. Glennerster, D. P. Green, M. Humphreys, G. Imbens, D. Laitin, T. Madon, L. Nelson, B. A. Nosek, M. Petersen, R. Sedlmayr, J. P. Simmons, U. Simonsohn, M. Van der Laan. 2014. “Promoting Transparency in Social Science Research.” Science 343(6166): 30–1. https://doi.org/10.1126/science.1245317

• †[Reproducibility] Marcus R. Munafò, Brian A. Nosek, Dorothy V. M. Bishop, Katherine S. Button, Christopher D. Chambers, Nathalie Percie du Sert, Uri Simonsohn, Eric-Jan Wagenmakers, Jennifer J. Ware, and John P. A. Ioannidis. 2017. “A Manifesto for Reproducible Science.” Nature Human Behaviour 0021 (2017). https://doi.org/10.1038/s41562-016-0021

• ‡[SoftwareCarpentry] esp. “Lessons”: http://software-carpentry.org/lessons/

• DSHandbook-Py Ch. 9 “Technical Communication and Documentation,” Ch. 15 “Software Engineering Best Practices”

• †[TeamScienceToolkit] Vogel AL, Hall KL, Fiore SM, Klein JT, Bennett LM, Gadlin H, Stokols D, Nebeling LC, Wuchty S, Patrick K, Spotts EL, Pohl C, Riley WT, Falk-Krzesinski HJ. 2013. “The Team Science Toolkit: enhancing research collaboration through online knowledge sharing.” American Journal of Preventive Medicine 45: 787–9. http://doi.org/10.1016/j.amepre.2013.09.001

• §[Databrary] Kara Hall, Robert Croyle, and Amanda Vogel. Forthcoming (2017). Advancing Social and Behavioral Health Research through Cross-disciplinary Team Science. Springer. Includes Rick O. Gilmore and Karen E. Adolph, “Open Sharing of Research Video: Breaking the Boundaries of the Research Team.” (See http://databrary.org)

• CRAN: https://cran.r-project.org/web/views/ReproducibleResearch.html

Social Bias, Fair Algorithms

• MachineBias

• ‡[EmbeddingsBias] Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. “Semantics Derived Automatically from Language Corpora Contain Human Biases.” Science. https://arxiv.org/abs/1608.07187

• ‡[Debiasing] Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, Adam Kalai. 2016. “Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings.” https://arxiv.org/abs/1607.06520

• ‡[AvoidingBias] Moritz Hardt, Eric Price, Nathan Srebro. 2016. “Equality of Opportunity in Supervised Learning.” https://arxiv.org/abs/1610.02413

• ‡[InevitableBias] Jon Kleinberg, Sendhil Mullainathan, Manish Raghavan. 2016. “Inherent Trade-Offs in the Fair Determination of Risk Scores.” https://arxiv.org/abs/1609.05807

Databases and Data Management

• NRCreport Ch. 3 “Scaling the Infrastructure for Data Management”

• †[SQL] Jan L. Harrington. 2010. SQL Clearly Explained. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780123756978

• DSHandbook-Py Ch. 14 “Databases”

• †[noSQL] Guy Harrison. 2015. Next Generation Databases: NoSQL, NewSQL, and Big Data. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-1329-2

• †[Cloud] Divyakant Agrawal, Sudipto Das, and Amr El Abbadi. 2012. Data Management in the Cloud: Challenges and Opportunities. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdfplus/10.2200/S00456ED1V01Y201211DTM032

• †Morgan & Claypool Synthesis Lectures on Data Management: http://www.morganclaypool.com/toc/dtm/1/1

Data Wrangling

Theoretically-structured approaches to data wrangling

• ‡[TidyData-R] Garrett Grolemund and Hadley Wickham. 2017. R for Data Science (http://r4ds.had.co.nz), esp. Ch. 12 “Tidy Data”; also Wickham. 2014. “Tidy Data.” Journal of Statistical Software 59(10). http://www.jstatsoft.org/v59/i10/paper Tools: The tidyverse, http://tidyverse.org

• ‡[DataScience-Py] Jake VanderPlas. 2016. Python Data Science Handbook (https://github.com/jakevdp/PythonDataScienceHandbook), esp. Ch. 3 on pandas: http://pandas.pydata.org

• ‡[DataCarpentry] Colin Gillespie and Robin Lovelace. 2017. Efficient R Programming. https://csgillespie.github.io/efficientR/ esp. Ch. 6 on “efficient data carpentry”

• §[Wrangler] Joseph M. Hellerstein, Jeffrey Heer, Tye Rattenbury, and Sean Kandel. 2017. Data Wrangling: Practical Techniques for Data Preparation. Tool: Trifacta Wrangler, http://www.trifacta.com/products/wrangler/

Data wrangling practice

• DSHandbook-Py Ch. 4 “Data Munging: String Manipulation, Regular Expressions, and Data Cleaning”

• †[DataSimplification] Jules J. Berman. 2016. Data Simplification: Taming Information with Open Source Tools. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780128037812

• †[Wrangling-Py] Jacqueline Kazil, Katharine Jarmul. 2016. Data Wrangling with Python. O’Reilly. http://proquestcombo.safaribooksonline.com.ezaccess.libraries.psu.edu/9781491948804

• †[Wrangling-R] Bradley C. Boehmke. 2016. Data Wrangling with R. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45599-0

Record Linkage, Entity Resolution, Deduplication

• †[DataMatching] Peter Christen. 2012. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-642-31164-2 (esp. Ch. 2 “The Data Matching Process”)

• DataSimplification Chapter 5 “Identifying and Deidentifying Data”

• ‡[EntityResolution] Lise Getoor and Ashwin Machanavajjhala. 2013. “Entity Resolution for Big Data.” Tutorial, KDD. http://www.umiacs.umd.edu/~getoor/Tutorials/ER_KDD2013.pdf

• ‡[SyrianCasualties] Peter Sadosky, Anshumali Shrivastava, Megan Price, and Rebecca C. Steorts. 2015. “Blocking Methods Applied to Casualty Records from the Syrian Conflict.” https://arxiv.org/abs/1510.07714 (For more on blocking, see DataMatching Ch. 4 “Indexing.”)

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html “Statistical Matching and Record Linkage”

“Making up data”: Imputation, Smoothers, Kernels, Priors, Filters, Teleportation, Negative Sampling, Convolution, Augmentation, Adversarial Training

• †[Imputation] Yi Deng, Changgee Chang, Moges Seyoum Ido, and Qi Long. 2016. “Multiple Imputation for General Missing Data Patterns in the Presence of High-dimensional Data.” Scientific Reports 6(21689). https://www.nature.com/articles/srep21689

• ‡[Overimputation] Matthew Blackwell, James Honaker, and Gary King. 2017. “A Unified Approach to Measurement Error and Missing Data: Overview and Applications.” Sociological Methods & Research 46(3): 303–341. http://gking.harvard.edu/files/gking/files/measure.pdf

• Shalizi-ADA Sect. 1.5 (Linear Smoothers), Ch. 8 (Splines), Sect. 14.4 (Kernel Density Estimates); DataScience-Python NB 05.13 “Kernel Density Estimation”

• FightinWords (re: Bayesian priors as additional data, impact of priors on regularization)

• DeepLearning Section 7.5 (Data Augmentation), 7.12 (Dropout), 7.13 (Adversarial Examples), Chapter 9 (Convolutional Networks)

• †[Adversarial] Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy. 2015. “Explaining and Harnessing Adversarial Examples.” https://arxiv.org/abs/1412.6572

• See also resampling and simulation methods

• See also feature engineering, preprocessing

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html “Imputation”

(Direct) Data Representations, Data Mappings

• NRCreport Ch. 5 “Large-Scale Data Representations”

• †[PatternRecognition] M. Narasimha Murty and V. Susheela Devi. 2011. Pattern Recognition: An Algorithmic Approach. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-0-85729-495-1 Section 2.1 “Data Structures for Pattern Representation”

• ‡[InfoRetrieval] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2009. Introduction to Information Retrieval. Cambridge University Press. http://nlp.stanford.edu/IR-book/ Ch. 1, 2, 6 (also note slides used in their class)

• †[Algorithms] Brian Steele, John Chandler, Swarna Reddy. 2016. Algorithms for Data Science. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45797-0 Ch. 2 “Data Mapping and Data Dictionaries”

• See also Social Data Structures

Similarity, Distance, Association, Covariance, The Kernel Trick

• PatternRecognition Section 2.3 “Proximity Measures”

• ‡[Similarity] Brendan O’Connor. 2012. “Cosine similarity, Pearson correlation, and OLS coefficients.” https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients/

• For relatively comprehensive lists, see also:

M.-J. Lesot, M. Rifqi, and H. Benhadda. 2009. “Similarity measures for binary and numerical data: a survey.” Int. J. Knowledge Engineering and Soft Data Paradigms 1(1): 63-. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.212.6533&rep=rep1&type=pdf

Seung-Seok Choi, Sung-Hyuk Cha, Charles C. Tappert. 2010. “A Survey of Binary Similarity and Distance Measures.” Journal of Systemics, Cybernetics & Informatics 8(1): 43–8. http://www.iiisci.org/Journal/CV$/sci/pdfs/GS315JG.pdf

Sung-Hyuk Cha. 2007. “Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions.” International Journal of Mathematical Models and Methods in Applied Sciences 4(1): 300–7. http://csis.pace.edu/ctappert/dps/d861-12/session4-p2.pdf

Anna Huang. 2008. “Similarity Measures for Text Document Clustering.” Proceedings of the 6th New Zealand Computer Science Research Student Conference: 49–56. http://www.nzcsrsc08.canterbury.ac.nz/site/proceedings/Individual_Papers/pg049_Similarity_Measures_for_Text_Document_Clustering.pdf

• Michael I. Jordan. “The Kernel Trick” (Lecture Notes). https://people.eecs.berkeley.edu/~jordan/courses/281B-spring04/lectures/lec3.pdf

• Eric Kim. 2017. “Everything You Ever Wanted to Know about the Kernel Trick (But Were Afraid to Ask).” http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick_blog_ekim_12_20_2017.pdf

Derived Data Representations - Dimensionality Reduction, Compression, Decomposition, Embeddings

The groupings here and under the measurement / multivariate statistics section are particularly arbitrary. For example, “k-Means clustering” can be viewed as a technique for “dimensionality reduction,” “compression,” “feature extraction,” “latent variable measurement,” “unsupervised learning,” or “collaborative filtering.”
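As a rough illustration of why k-means resists a single label, the sketch below runs a plain Lloyd's-algorithm k-means on a toy data set and then reads the same fitted result three ways: as cluster labels (unsupervised learning / a latent class), as a codebook plus integer codes (compression by vector quantization), and as distances to centroids (derived features). It is a minimal standard-library sketch, not taken from any of the listed readings; the data points and the helper names `nearest` and `kmeans` are made up for the example.

```python
import random

def nearest(p, centroids):
    """Index of the centroid closest to point p (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's algorithm on a list of equal-length tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # start from k distinct data points
    for _ in range(iters):
        # Assignment step: nearest centroid for every point.
        labels = [nearest(p, centroids) for p in points]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:                        # keep an empty cluster's centroid as is
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels

# Hypothetical toy data: two well-separated groups in the plane.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
          (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]

centroids, labels = kmeans(points, k=2)

# View 1, unsupervised learning / latent class: one label per observation.
print("labels:", labels)

# View 2, compression: store k centroids (the codebook) plus one small
# integer per point, and reconstruct each point as its centroid.
reconstructed = [centroids[lab] for lab in labels]
print("reconstruction of first point:", reconstructed[0])

# View 3, feature extraction: replace raw coordinates with squared
# distances to each centroid.
features = [[sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            for p in points]
print("features of first point:", features[0])
```

The fitted object is identical in all three views; only the use made of it changes, which is the sense in which the groupings in this section are arbitrary.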

Clustering, hashing, quantization, blocking, compression

• ‡[Compression] Khalid Sayood. 2012. Introduction to Data Compression, 4th ed. Springer. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780124157965 (e.g., coding, blocking via vector quantization)

• MMDS Ch. 3 “Finding Similar Items” (minhashing, locality sensitive hashing)

• Multivariate-R Ch. 6 “Clustering”

• DataScience-Python NB 05.11 “k-Means Clustering,” NB 05.12 “Gaussian Mixture Models”

• ‡[KMeansHashing] Kaiming He, Fang Wen, and Jian Sun. 2013. “K-means Hashing: An Affinity-Preserving Quantization Method for Learning Binary Compact Codes.” CVPR. https://www.cv-foundation.org/openaccess/content_cvpr_2013/papers/He_K-Means_Hashing_An_2013_CVPR_paper.pdf

• †[Squashing] Madigan, D., Raghavan, N., DuMouchel, W., Nason, M., Posse, C., and Ridgeway, G. (2002). “Likelihood-based data squashing: A modeling approach to instance construction.” Data Mining and Knowledge Discovery 6(2): 173-190. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1023/A:1014095614948

• †[Core-sets] Piotr Indyk, Sepideh Mahabadi, Mohammad Mahdian, and Vahab S. Mirrokni. 2014. “Composable core-sets for diversity and coverage maximization.” PODS ’14: 100-8. https://doi.org/10.1145/2594538.2594560

Feature selection, feature extraction, feature engineering, weighting, preprocessing

• PatternRecognition Sections 26-7 “Feature Selection,” “Feature Extraction”

• DSHandbook-Py Ch. 7 “Interlude: Feature Extraction Ideas”

• DataScience-Python NB 05.04 “Feature Engineering”

• Features in text: NLP and InfoRetrieval re: tf-idf and similar; FightinWords

• Features in images: DataScience-Python NB 05.14 “Image Features”

Dimensionality reduction, decomposition, factorization, change of basis, reparameterization, matrix completion, latent variables, source separation

• MMDS Ch. 9 “Recommendation Systems,” Ch. 11 “Dimensionality Reduction”

• DSHandbook-Py Ch. 10 “Unsupervised Learning: Clustering and Dimensionality Reduction”

• ‡[Shalizi-ADA] Cosma Rohilla Shalizi. 2017. Advanced Data Analysis from an Elementary Point of View. Ch. 16 (“Principal Components Analysis”), Ch. 17 (“Factor Models”). http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV

See also Multivariate-R Ch. 3 “Principal Components Analysis,” Ch. 4 “Multidimensional Scaling,” Ch. 5 “Exploratory Factor Analysis.” Latent

• DeepLearning Ch. 2 “Linear Algebra,” Ch. 13 “Linear Factor Models”

• ‡[GloVe] Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. “GloVe: Global vectors for word representation.” EMNLP. https://nlp.stanford.edu/projects/glove

• †[NMF] Daniel D. Lee and H. Sebastian Seung. 1999. “Learning the parts of objects by non-negative matrix factorization.” Nature 401: 788-791. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1038/44565

• †[CUR] Michael W. Mahoney and Petros Drineas. 2009. “CUR matrix decompositions for improved data analysis.” PNAS. http://www.pnas.org/content/106/3/697.full

• ‡[ICA] Aapo Hyvärinen and Erkki Oja. 2000. “Independent Component Analysis: Algorithms and Applications.” Neural Networks 13(4-5): 411-30. https://www.cs.helsinki.fi/u/ahyvarin/papers/NN00new.pdf

• [RandomProjection] Ella Bingham and Heikki Mannila. 2001. “Random projection in dimensionality reduction: applications to image and text data.” KDD. https://doi.org/10.1145%2F502512.502546

• Compression, e.g., “Transform coding,” “Wavelets”

Nonlinear dimensionality reduction, manifold learning

• DeepLearning Section 5.11.3 “Manifold Learning”

• DataScience-Python NB 05.10 “Manifold Learning” (Locally linear embedding [LLE], Isomap)

• ‡[KernelPCA] Sebastian Raschka. 2014. “Kernel tricks and nonlinear dimensionality reduction via RBF kernel PCA.” http://sebastianraschka.com/Articles/2014_kernel_pca.html

• Shalizi-ADA Ch. 18 “Nonlinear Dimensionality Reduction” (LLE)

• ‡[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi. 2003. “Laplacian eigenmaps for dimensionality reduction and data representation.” Neural Computation 15(6): 1373–1396. http://web.cse.ohio-state.edu/~belkin.8/papers/LEM_NC_03.pdf

• DeepLearning Ch. 14 (“Autoencoders”)

• ‡[word2vec] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” https://arxiv.org/abs/1301.3781

• ‡[word2vecExplained] Yoav Goldberg and Omer Levy. 2014. “word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method.” https://arxiv.org/abs/1402.3722

• ‡[t-SNE] Laurens van der Maaten and Geoffrey Hinton. 2008. “Visualizing Data using t-SNE.” Journal of Machine Learning Research 9: 2579-2605. http://jmlr.csail.mit.edu/papers/volume9/vandermaaten08a/vandermaaten08a.pdf

Computation and Scaling Up

Scientific computing, computation at scale

• NRCreport Ch. 10 “The Seven Computational Giants of Massive Data Analysis”

• DSHandbook-Py Ch. 21 “Performance and Computer Memory,” Ch. 22 “Computer Memory and Data Structures”

• ‡[ComputationalStatistics-Python] Cliburn Chan. Computational Statistics in Python. https://people.duke.edu/~ccc14/sta-663/index.html

• CRAN Task View: High Performance Computing. https://cran.r-project.org/web/views/HighPerformanceComputing.html

Numerical computing

• [NumericalComputing] Ward Cheney and David Kincaid. 2013. Numerical Mathematics and Computing, 7th ed. Brooks/Cole: Cengage Learning.

• CRAN Task View: Numerical Mathematics. https://cran.r-project.org/web/views/NumericalMathematics.html

Optimization (e.g., MLE, gradient descent, stochastic gradient descent, EM algorithm, neural nets)

• DSHandbook-Py Ch. 23 “Maximum Likelihood Estimation and Optimization” (gradient descent)

• DeepLearning Ch. 4 “Numerical Computation,” Ch. 6 “Deep Feedforward Networks,” Ch. 8 “Optimization for Training Deep Models”

• CRAN Task View: Optimization and Mathematical Programming. https://cran.r-project.org/web/views/Optimization.html

Linear algebra, matrix computations

• [MatrixComputations] Gene H. Golub and Charles F. Van Loan. 2013. Matrix Computations, 4th ed. Johns Hopkins University Press.

• ‡[NetflixMatrix] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. “Matrix Factorization Techniques for Recommender Systems.” Computer, August: 42-9. https://datajobs.com/data-science-repo/Recommender-Systems-[Netflix].pdf

• ‡[BigDataPCA] Jianqing Fan, Qiang Sun, Wen-Xin Zhou, and Ziwei Zhu. “Principal Component Analysis for Big Data.” http://www.princeton.edu/~ziweiz/pca.pdf

• ‡[RandomSVD] Andrew Tulloch. 2009. “Fast Randomized SVD.” https://research.fb.com/fast-randomized-svd

• ‡[FactorSGD] Rainer Gemulla, Peter J. Haas, Erik Nijkamp, and Yannis Sismanis. 2011. “Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent.” KDD. http://www.cs.utah.edu/~hari/teaching/bigdata/gemulla11dsgd.pdf

• ‡[Sparse] Max Grossman. 2015. “101 Ways to Store a Sparse Matrix.” https://medium.com/@jmaxg3/101-ways-to-store-a-sparse-matrix-c7f2bf15a229

• ‡[DontInvert] John D. Cook. 2010. “Don’t Invert that Matrix.” https://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix

Simulation-based inference, resampling, Monte Carlo methods, MCMC, Bayes, approximate inference

• ‡[MarkovVisually] Victor Powell. “Markov Chains Explained Visually.” http://setosa.io/ev/markov-chains

• DSHandbook-Py Ch. 25 “Stochastic Modeling” (Markov chains, MCMC, HMM)

• Bayes; ComputationalStatistics-Py

• DeepLearning Ch. 17 “Monte Carlo Methods,” Ch. 19 “Approximate Inference”

• ‡[VariationalInference] Jason Eisner. 2011. “High-Level Explanation of Variational Inference.” https://www.cs.jhu.edu/~jason/tutorials/variational.html

• CRAN Task View: Bayesian Inference. https://cran.r-project.org/web/views/Bayesian.html

Parallelism, MapReduce, Split-Apply-Combine

• ‡[MapReduceIntuition] Jigsaw Academy. 2014. “Big Data Specialist: MapReduce.” https://www.youtube.com/watch?v=TwcYQzFqg-8&feature=youtu.be

• MMDS Ch. 2 “Map-Reduce and the New Software Stack” (3rd Edition discusses Spark & TensorFlow)

• Algorithms Ch. 3 “Scalable Algorithms and Associative Statistics,” Ch. 4 “Hadoop and MapReduce”

• NRCreport Ch. 6 “Resources, Trade-offs, and Limitations”

• TidyData-R (Split-apply-combine is the motivating principle behind the “tidyverse” approach). See also Part III “Program” (pipes, functions, vectors, iteration).

• ‡[TidySAC-Video] Hadley Wickham. 2017. “Data Science in the Tidyverse.” https://www.rstudio.com/resources/videos/data-science-in-the-tidyverse

• See also FactorSGD
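As a rough illustration of the split-apply-combine pattern named in this section, the following Python sketch groups records by a key, applies a summary function to each group, and combines the results into one dictionary. The helper name `split_apply_combine` and the toy `tweets` records are hypothetical; the readings above use R's tidyverse or MapReduce frameworks rather than this code.

```python
from collections import defaultdict


def split_apply_combine(records, key, apply):
    """Group records by key(record), summarize each group, return a dict."""
    groups = defaultdict(list)
    for rec in records:               # split: route each record to its group
        groups[key(rec)].append(rec)
    # apply + combine: one summary per group, collected into a single result
    return {k: apply(g) for k, g in groups.items()}


# Hypothetical toy data: retweet counts for posts by three users.
tweets = [
    {"user": "a", "retweets": 3},
    {"user": "b", "retweets": 10},
    {"user": "a", "retweets": 5},
    {"user": "c", "retweets": 0},
    {"user": "c", "retweets": 2},
    {"user": "c", "retweets": 4},
]

mean_retweets = split_apply_combine(
    tweets,
    key=lambda r: r["user"],
    apply=lambda g: sum(r["retweets"] for r in g) / len(g),
)
```

Because each group is summarized independently, the apply step is the part that could be distributed across workers, which is the loose correspondence with the map, shuffle, and reduce phases discussed in the MapReduce readings.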

Functional Programming

• ComputationalStatistics-Python “Functions are first class objects” through first exercises

• DSHandbook-Py Ch. 20 “Programming Language Concepts”

• See also Haskell

Scaling, iteration, streaming data, online algorithms (Spark)

• NRCreport Ch. 4 “Temporal Data and Real-Time Algorithms”

• MMDS Ch. 4 “Mining Data Streams”

• ‡[BDAS] AMPLab. BDAS: The Berkeley Data Analytics Stack. https://amplab.cs.berkeley.edu/software

• †[Spark] Mohammed Guller. 2015. Big Data Analytics with Spark: A Practitioner’s Guide to Using Spark for Large-Scale Data Processing, Machine Learning, and Graph Analytics, and High-Velocity Data Stream Processing. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-0964-6

• DSHandbook-Py Ch. 13 “Big Data”

• DeepLearning Ch. 10 “Sequence Modeling: Recurrent and Recursive Nets”

General Resources for Python and R

• †[DSHandbook-Py] Field Cady. 2017. The Data Science Handbook (Python-based). Wiley. http://onlinelibrary.wiley.com.ezaccess.libraries.psu.edu/book/10.1002/9781119092919

• ‡[Tutorials-R] Ujjwal Karn. 2017. “A curated list of R tutorials for Data Science, NLP and Machine Learning.” https://github.com/ujjwalkarn/DataScienceR

• ‡[Tutorials-Python] Ujjwal Karn. 2017. “A curated list of Python tutorials for Data Science, NLP and Machine Learning.” https://github.com/ujjwalkarn/DataSciencePython

Cutting and Bleeding Edge of Data Science Languages

• [Scala] †Vishal Layka and David Pollak. 2015. Beginning Scala. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-0232-6; †Patrick R. Nicolas. 2014. Scala for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1901910; ‡Scala site: https://www.scala-lang.org (See also )

• [Julia] †Ivo Balbaert. 2015. Getting Started with Julia. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1973847; ‡Douglas Bates. 2013. “Julia for R Programmers.” http://www.stat.wisc.edu/~bates/JuliaForRProgrammers.pdf; ‡Julia site: https://julialang.org

• [Haskell] †Richard Bird. 2014. Thinking Functionally with Haskell. https://doi-org.ezaccess.libraries.psu.edu/10.1017/CBO9781316092415; †Hakim Cassimally. 2017. Learning Haskell Programming. https://www.lynda.com/Haskell-tutorials/Learning-Haskell-Programming/604926-2.html; †James Church. 2017. Learning Haskell for Data Analysis. https://www.lynda.com/Developer-tutorials/Learning-Haskell-Data-Analysis/604234-2.html; Haskell site: https://www.haskell.org

• [Clojure] †Mark McDonnell. 2017. Quick Clojure: Essential Functional Programming. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2952-1; †Akhil Wali. 2014. Clojure for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1674848; †Arthur Ulfeldt. 2015. https://www.lynda.com/Clojure-tutorials/Up-Running-Clojure/413127-2.html; ‡Clojure site: https://clojure.org

• [TensorFlow] ‡Abadi et al. (Google). 2016. “TensorFlow: A System for Large-Scale Machine Learning.” https://arxiv.org/abs/1605.08695; ‡Udacity. “Deep Learning.” https://www.udacity.com/course/deep-learning--ud730; ‡TensorFlow site: https://www.tensorflow.org

• [H2O] Darren Cook. 2016. Practical Machine Learning with H2O: Powerful, Scalable Techniques for Deep Learning and AI. O’Reilly; Arno Candel and Viraj Parmar. 2015. Deep Learning with H2O; ‡H2O site: https://www.h2o.ai

Social Data Structures

Space and Time

• †[GIA] David O’Sullivan and David J. Unwin. 2010. Geographic Information Analysis, Second Edition. John Wiley & Sons. http://onlinelibrary.wiley.com/book/10.1002/9780470549094

• [Space-Time] Donna J. Peuquet. 2003. Representations of Space and Time. Guilford.

• Roger S. Bivand, Edzer Pebesma, and Virgilio Gómez-Rubio. 2013. Applied Spatial Data Analysis in R. Free through library: http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-7618-4. See also Edzer Pebesma. 2016. “Handling and Analyzing Spatial, Spatiotemporal and Movement Data in R.” https://edzer.github.io/UseR2016

• ‡[PySAL] Sergio J. Rey and Dani Arribas-Bel. 2016. “Geographic Data Science with PySAL and the pydata Stack.” http://darribas.org/gds_scipy16

• DataScience-Python NB 04.13 “Geographic Data with Basemap”

• Time: “Longitudinal” intra-individual data, as common in developmental psychology. †[Longitudinal] Garrett Fitzmaurice, Marie Davidian, Geert Verbeke, and Geert Molenberghs, editors. 2009. Longitudinal Data Analysis. Chapman & Hall / CRC. http://pensu.eblib.com/patron/FullRecord.aspx?p=359998

• Time: “Time Series, Time Series Cross Section, Panel,” as common in economics, political science, and sociology. DataScience-Python Ch. 3 re: pandas; DSHandbook-Py Ch. 17 “Time Series Analysis”; Econometrics; GelmanHill

• Time: “Sequential, Data Streams,” as in NLP, machine learning. Spark and other “streaming data” readings

• Space-Time: Data with continuity, connectivity, neighborhood structure. See DeepLearning Chapter 9 re: convolution

• CRAN Task Views: Spatial: https://cran.r-project.org/web/views/Spatial.html; Spatiotemporal: https://cran.r-project.org/web/views/SpatioTemporal.html; Time Series: https://cran.r-project.org/web/views/TimeSeries.html

Network Graphs

• ‡[Networks] David Easley and Jon Kleinberg. 2010. Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press. http://www.cs.cornell.edu/home/kleinber/networks-book

• MMDS Ch. 5 “Link Analysis,” Ch. 10 “Mining Social-Network Graphs”

• †[Networks-R] Douglas Luke. 2015. A User’s Guide to Network Analysis in R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-3-319-23883-8

• †[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry. 2017. Python for Graph and Network Analysis. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-53004-8

Hierarchy, Clustered Data, Aggregation, Mixed Models

• †[GelmanHill] Andrew Gelman and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press. http://pensu.eblib.com/patron/FullRecord.aspx?p=288457

• †[MixedModels] Eugene Demidenko. 2013. Mixed Models: Theory and Applications with R, 2nd ed. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10748641

Social Data Channels

Language, Text, Speech, Audio

• ‡[NLP] Daniel Jurafsky and James H. Martin. Forthcoming (2017). Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics, 3rd ed. Prentice-Hall. Preprint: https://web.stanford.edu/~jurafsky/slp3

• InfoRetrieval

• ‡[CoreNLP] Stanford CoreNLP. https://stanfordnlp.github.io/CoreNLP

• †[TextAsData] Grimmer and Stewart. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis. https://doi.org/10.1093/pan/mps028

• Quinn-Topics

• FightinWords

• †[NLTK-Python] Jacob Perkins. 2010. Python Text Processing with NLTK 2.0 Cookbook. Packt. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10435387; †[TextAnalytics-Python] Dipanjan Sarkar. 2016. Text Analytics with Python. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2388-8

• DSHandbook-Py Ch. 16 “Natural Language Processing”

• Matthew J. Denny. http://www.mjdenny.com/Text_Processing_In_R.html

• CRAN: Natural Language Processing. https://cran.r-project.org/web/views/NaturalLanguageProcessing.html

• †[AudioData] Dean Knox and Christopher Lucas. 2017. “The Speaker-Affect Model: Measuring Emotion in Political Speech with Audio Data.” http://christopherlucas.org/files/PDFs/sam.pdf; https://www.youtube.com/watch?v=Hs8A9dwkMzI

• †Morgan & Claypool Synthesis Lectures on Human Language Technologies. http://www.morganclaypool.com/toc/hlt/1/1

• †Morgan & Claypool Synthesis Lectures on Speech and Audio Processing. http://www.morganclaypool.com/toc/sap/1/1

Vision, Image, Video

• ‡[ComputerVision] Richard Szeliski. 2010. Computer Vision: Algorithms and Applications. Springer. http://szeliski.org/Book

• MixedModels Ch. 11 “Statistical Analysis of Shape,” Ch. 12 “Statistical Image Analysis”

• Audio/Video/Volumetric: See DeepLearning Chapter 9 re: convolution

• †Morgan & Claypool Synthesis Lectures on Image, Video, and Multimedia Processing. http://www.morganclaypool.com/toc/ivm/1/1

• †Morgan & Claypool Synthesis Lectures on Computer Vision. http://www.morganclaypool.com/toc/cov/1/1

Approaches to Learning from Data (The Analytics Layer)

• NRCreport Ch. 7 “Building Models from Massive Data”

• MMDS (data-mining)

• CausalInference

• ‡[VisualAnalytics] Daniel Keim, Jörn Kohlhammer, Geoffrey Ellis, and Florian Mansmann, editors. 2010. Mastering the Information Age: Solving Problems with Visual Analytics. The Eurographics Association, Goslar, Germany. http://www.vismaster.eu/book. See also †Morgan & Claypool Synthesis Lectures on Visualization. http://www.morganclaypool.com/toc/vis/2/1

• ‡[Econometrics] Bruce E. Hansen. 2014. Econometrics. http://www.ssc.wisc.edu/~bhansen/econometrics

• ‡[StatisticalLearning] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning with Applications in R. Springer. http://www-bcf.usc.edu/~gareth/ISL/index.html

• [MachineLearning] Christopher Bishop. 2006. Pattern Recognition and Machine Learning. Springer.

• †[Bayes] Peter D. Hoff. 2009. A First Course in Bayesian Statistical Methods. Springer. http://link.springer.com/book/10.1007%2F978-0-387-92407-6

• †[GraphicalModels] Søren Højsgaard, David Edwards, and Steffen Lauritzen. 2012. Graphical Models with R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-2299-0

• [InformationTheory] David MacKay. 2003. Information Theory, Inference, and Learning Algorithms. Springer.

• ‡[DeepLearning] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org (see also Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. “Deep Learning.” Nature 521(7553): 436–444. https://doi.org/10.1038/nature14539)

See also ‡[Keras] Keras: The Python Deep Learning Library. https://keras.io; François Chollet. Directory of Keras tutorials. https://github.com/fchollet/keras-resources

• †[SignalProcessing] Jose Maria Giron-Sierra. 2013. Digital Signal Processing with MATLAB Examples (Volumes 1-3). Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-981-10-2534-1

Penn State Policy Statements

Academic Integrity

Academic integrity is the pursuit of scholarly activity in an open, honest, and responsible manner. Academic integrity is a basic guiding principle for all academic activity at The Pennsylvania State University, and all members of the University community are expected to act in accordance with this principle. Consistent with this expectation, the University’s Code of Conduct states that all students should act with personal integrity; respect other students’ dignity, rights, and property; and help create and maintain an environment in which all can succeed through the fruits of their efforts.

Academic integrity includes a commitment by all members of the University community not to engage in or tolerate acts of falsification, misrepresentation, or deception. Such acts of dishonesty violate the fundamental ethical principles of the University community and compromise the worth of work completed by others.

Disability Accommodation

Penn State welcomes students with disabilities into the University’s educational programs. Every Penn State campus has an office for students with disabilities. The Student Disability Resources (SDR) website provides contact information for every Penn State campus (http://equity.psu.edu/sdr/disability-coordinator). For further information, please visit the Student Disability Resources website (http://equity.psu.edu/sdr).

In order to receive consideration for reasonable accommodations, you must contact the appropriate disability services office at the campus where you are officially enrolled, participate in an intake interview, and provide documentation. See documentation guidelines at http://equity.psu.edu/sdr/guidelines. If the documentation supports your request for reasonable accommodations, your campus disability services office will provide you with an accommodation letter. Please share this letter with your instructors and discuss the accommodations with them as early as possible. You must follow this process for every semester that you request accommodations.

Psychological Services

Many students at Penn State face personal challenges or have psychological needs that may interfere with their academic progress, social development, or emotional wellbeing. The university offers a variety of confidential services to help you through difficult times, including individual and group counseling, crisis intervention, consultations, online chats, and mental health screenings. These services are provided by staff who welcome all students and embrace a philosophy respectful of clients’ cultural and religious backgrounds, and sensitive to differences in race, ability, gender identity, and sexual orientation.

• Counseling and Psychological Services at University Park (CAPS) (http://studentaffairs.psu.edu/counseling): 814-863-0395

• Penn State Crisis Line (24 hours / 7 days/week): 877-229-6400

• Crisis Text Line (24 hours / 7 days/week): Text LIONS to 741741

Educational Equity

Penn State takes great pride to foster a diverse and inclusive environment for students, faculty, and staff. Consistent with University Policy AD29, students who believe they have experienced or observed a hate crime, an act of intolerance, discrimination, or harassment that occurs at Penn State are urged to report these incidents as outlined on the University’s Report Bias webpage (http://equity.psu.edu/reportbias).

Background Knowledge

Students will have a variety of backgrounds. Prior to beginning interdisciplinary coursework to fulfill Social Data Analytics degree requirements, including SoDA 501 and 502, students are expected to have advanced (graduate) training in at least one of the component areas of Social Data Analytics and a familiarity with basic concepts in the others.

With regard to specialization, students are expected to have advanced (graduate) training in ONE of the following:

• quantitative social science methodology and a discipline of social science (as would be the case for a second-year PhD student in Political Science, Sociology, Criminology, Human Development and Family Studies, or Demography), OR

• statistics (as would be the case for a second-year PhD student in Statistics), OR

• information science or informatics (as would be the case for a second-year PhD student in Information Science and Technology, or a second-year PhD student in Geography specializing in GIScience), OR

• computer science (as would be the case for a second-year PhD student in Computer Science and Engineering).

This requirement is met as a matter of meeting home program requirements for students in the dual-title PhD, but may require additional coursework on the part of students in other programs wishing to pursue the graduate minor.

With regard to general preparation, students are expected to have ALL of the following technical knowledge:

• basic programming skills, including basic facility with R and/or Python, AND

• basic knowledge of relational databases and/or geographic information systems, AND

• basic knowledge of probability, applied statistics, and/or social science research design, AND

• basic familiarity with a substantive or theoretical area of social science (e.g., 300-level coursework in political science, sociology, criminology, human development, psychology, economics, communication, anthropology, human geography, social informatics, or similar fields).

It is not unusual for students to have one or more gaps in this preparation at time of application to the SoDA program. Students should work with Social Data Analytics advisers to develop a plan for timely remediation of any deficiencies, which generally will not require formal coursework for students whose training and interests are otherwise appropriate for pursuit of the Social Data Analytics degree. Where possible, this will be addressed at time of application to the Social Data Analytics program.

To this end, some free training materials are linked in the reference section, and there are also abundant free, high-quality, self-paced, course-style training materials on these and related subjects available through edX (http://www.edx.org), Udacity (http://www.udacity.com), Codecademy (http://codecademy.com), and Lynda (http://www.lynda.psu).

• †[TweetmentEffects] Kevin Munger. 2017. “Tweetment Effects on the Tweeted: Experimentally Reducing Racist Harassment.” Political Behavior 39(3): 629–49. http://doi.org/10.1016/j.chb.2013.05.009

• †[HumanComputation] Edith Law and Luis von Ahn. 2011. Human Computation. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdf/10.2200/S00371ED1V01Y201107AIM013

Open Data, File Formats, APIs, Semantic Web, Linked Data

• DSHandbook-Py Ch. 12 “Data Encodings and File Formats”

• ‡[OpenData] Open Data Institute. “What Is Open Data?” https://theodi.org/what-is-open-data

• ‡[OpenDataHandbook] Open Knowledge International. The Open Data Handbook. http://opendatahandbook.org. Includes appendix “File Formats”: http://opendatahandbook.org/guide/en/appendices/file-formats

• ‡[APIs] Brian Cooksey. 2016. An Introduction to APIs. https://zapier.com/learn/apis

• ‡[APIMarkets] RapidAPI / mashape API marketplaces: https://docs.rapidapi.com, https://market.mashape.com; ProgrammableWeb: https://www.programmableweb.com

• †[LinkedData] Tom Heath and Christian Bizer. 2011. Linked Data: Evolving the Web into a Global Data Space. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/abs/10.2200/S00334ED1V01Y201102WBE001. Ch. 1 “Introduction,” Ch. 2 “Principles of Linked Data”

• †[SemanticWeb] Nikolaos Konstantinou and Dimitrios-Emmanuel Spanos. 2015. Materializing the Web of Linked Data. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-16074-0. Ch. 1 “Introduction: Linked Data and the Semantic Web”

• †Morgan & Claypool Synthesis Lectures on the Semantic Web: Theory and Technology. http://www.morganclaypool.com/toc/wbe/111

Web Scraping

• †[TheoryDrivenScraping] Richard N. Landers, Robert C. Brusso, Katelyn J. Cavanaugh, and Andrew B. Colmus. 2016. “A Primer on Theory-Driven Web Scraping: Automatic Extraction of Big Data from the Internet for Use in Psychological Research.” Psychological Methods 4: 475–492. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1037/met0000081

• ‡[Scraping-Py] Al Sweigart. 2015. Automate the Boring Stuff with Python: Practical Programming for Total Beginners. Ch. 11, “Web Scraping.” https://automatetheboringstuff.com/chapter11

• †[Scraping-R] Simon Munzert, Christian Rubba, Peter Meißner, and Dominic Nyhuis. 2014. Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining. Wiley. http://onlinelibrary.wiley.com/book/10.1002/9781118834732

• ‡CRAN: https://cran.r-project.org/web/views/WebTechnologies.html

Ethics & Scientific Responsibility in Big Social Data

Human Subjects, Consent, Privacy

• BitByBit Ch. 6 (“Ethics”)

• ‡[PSU-ORP] Penn State Office for Research Protections (under the VP for Research)

Human Subjects Research / IRB: https://www.research.psu.edu/irb

Revised Common Rule: https://www.research.psu.edu/irb/commonrulechanges

Responsible Conduct of Research: https://www.research.psu.edu/education/rcr

Research Misconduct: https://www.research.psu.edu/researchmisconduct

SARI @ PSU (Scientific and Research Integrity Training): https://www.research.psu.edu/training/sari

• ‡[MenloReport] David Dittrich and Erin Kenneally (Center for Applied Internet Data Analysis). 2012. “The Menlo Report: Ethical Principles Guiding Information and Communication Technology Research” and companion report “Applying Ethical Principles ...” https://www.caida.org/publications/papers/2012/menlo_report_actual_formatted

• ‡[BigDataEthics-CBDES] Jacob Metcalf, Emily F. Keller, and danah boyd. 2016. “Perspectives on Big Data, Ethics, and Society.” Council for Big Data, Ethics, and Society. http://bdes.datasociety.net/council-output/perspectives-on-big-data-ethics-and-society

• ‡[AoIRReport] Annette Markham and Elizabeth Buchanan (Association of Internet Researchers). 2012. “Ethical Decision-Making and Internet Research: Recommendations of the AoIR Ethics Working Committee (Version 2.0).” https://aoir.org/reports/ethics2.pdf and guidelines chart: https://aoir.org/wp-content/uploads/2017/01/aoir_ethics_graphic_2016.pdf

• ‡[BigDataEthics-Wired] Sarah Zhang. 2016. “Scientists are Just as Confused about the Ethics of Big-Data Research as You.” Wired. https://www.wired.com/2016/05/scientists-just-confused-ethics-big-data-research

• †[BigDataEthics-HerschelMori] Richard Herschel and Virginia M. Mori. 2017. “Ethics & Big Data.” Technology in Society 49: 31-36. http://doi.org/10.1016/j.techsoc.2017.03.003

The Science of Data Privacy

• †[BigDataPrivacy] Terence Craig and Mary E. Ludloff. 2011. Privacy and Big Data. O’Reilly Media. http://pensu.eblib.com/patron/FullRecord.aspx?p=781814

• ‡[DataAnalysisPrivacy] John Abowd, Lorenzo Alvisi, Cynthia Dwork, Sampath Kannan, Ashwin Machanavajjhala, and Jerome Reiter. 2017. “Privacy-Preserving Data Analysis for the Federal Statistical Agencies.” A Computing Community Consortium white paper. https://arxiv.org/abs/1701.00752

• †[DataPrivacy] Stephen E. Fienberg and Aleksandra B. Slavkovic. 2011. “Data Privacy and Confidentiality.” International Encyclopedia of Statistical Science: 342–5. http://doi.org/978-3-642-04898-2_202

• ‡[NetworksPrivacy] Vishesh Karwa and Aleksandra Slavkovic. 2016. “Inference using noisy degrees: Differentially private β-model and synthetic graphs.” Annals of Statistics 44(1): 87-112. http://projecteuclid.org/euclid.aos/1449755958

• †[DataPublishingPrivacy] Raymond Chi-Wing Wong and Ada Wai-Chee Fu. 2010. Privacy-Preserving Data Publishing: An Overview. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdfplus/10.2200/S00237ED1V01Y201003DTM002

• DataMatching Ch. 8 “Privacy Aspects of Data Matching”

Transparency, Reproducibility, and Team Science

• ‡[10RulesforData] Alyssa Goodman, Alberto Pepe, Alexander W. Blocker, Christine L. Borgman, Kyle Cranmer, Merce Crosas, Rosanne Di Stefano, Yolanda Gil, Paul Groth, Margaret Hedstrom, David W. Hogg, Vinay Kashyap, Ashish Mahabal, Aneta Siemiginowska, and Aleksandra Slavkovic. (2014). “Ten Simple Rules for the Care and Feeding of Scientific Data.” PLoS Computational Biology 10(4): e1003542. https://doi.org/10.1371/journal.pcbi.1003542 (Note esp. curated resources for reproducible research.)

• †[Transparency] E. Miguel, C. Camerer, K. Casey, J. Cohen, K. M. Esterling, A. Gerber, R. Glennerster, D. P. Green, M. Humphreys, G. Imbens, D. Laitin, T. Madon, L. Nelson, B. A. Nosek, M. Petersen, R. Sedlmayr, J. P. Simmons, U. Simonsohn, and M. Van der Laan. 2014. “Promoting Transparency in Social Science Research.” Science 343(6166): 30–1. https://doi.org/10.1126/science.1245317

• †[Reproducibility] Marcus R. Munafò, Brian A. Nosek, Dorothy V. M. Bishop, Katherine S. Button, Christopher D. Chambers, Nathalie Percie du Sert, Uri Simonsohn, Eric-Jan Wagenmakers, Jennifer J. Ware, and John P. A. Ioannidis. 2017. “A Manifesto for Reproducible Science.” Nature Human Behaviour 0021 (2017). https://doi.org/10.1038/s41562-016-0021

• ‡[SoftwareCarpentry] esp. “Lessons”: http://software-carpentry.org/lessons

• DSHandbook-Py Ch. 9 “Technical Communication and Documentation,” Ch. 15 “Software Engineering Best Practices”

• †[TeamScienceToolkit] Vogel AL, Hall KL, Fiore SM, Klein JT, Bennett LM, Gadlin H, Stokols D, Nebeling LC, Wuchty S, Patrick K, Spotts EL, Pohl C, Riley WT, Falk-Krzesinski HJ. 2013. “The Team Science Toolkit: enhancing research collaboration through online knowledge sharing.” American Journal of Preventive Medicine 45: 787-9. http://doi.org/10.1016/j.amepre.2013.09.001

• §[Databrary] Kara Hall, Robert Croyle, and Amanda Vogel. Forthcoming (2017). Advancing Social and Behavioral Health Research through Cross-disciplinary Team Science. Springer. Includes Rick O. Gilmore and Karen E. Adolph, “Open Sharing of Research Video: Breaking the Boundaries of the Research Team.” (See http://databrary.org)

• CRAN: https://cran.r-project.org/web/views/ReproducibleResearch.html

Social Bias, Fair Algorithms

• MachineBias

• ‡[EmbeddingsBias] Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. “Semantics Derived Automatically from Language Corpora Contain Human Biases.” Science. https://arxiv.org/abs/1608.07187

• ‡[Debiasing] Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, Adam Kalai. 2016. “Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings.” https://arxiv.org/abs/1607.06520

• ‡[AvoidingBias] Moritz Hardt, Eric Price, Nathan Srebro. 2016. “Equality of Opportunity in Supervised Learning.” https://arxiv.org/abs/1610.02413

• ‡[InevitableBias] Jon Kleinberg, Sendhil Mullainathan, Manish Raghavan. 2016. “Inherent Trade-Offs in the Fair Determination of Risk Scores.” https://arxiv.org/abs/1609.05807

Databases and Data Management

• NRCreport, Ch. 3, “Scaling the Infrastructure for Data Management”

• †[SQL] Jan L. Harrington. 2010. SQL Clearly Explained. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780123756978

• DSHandbook-Py, Ch. 14, “Databases”

• †[noSQL] Guy Harrison. 2015. Next Generation Databases: NoSQL, NewSQL, and Big Data. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-1329-2

• †[Cloud] Divyakant Agrawal, Sudipto Das, and Amr El Abbadi. 2012. Data Management in the Cloud: Challenges and Opportunities. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdfplus/10.2200/S00456ED1V01Y201211DTM032

• †Morgan & Claypool Synthesis Lectures on Data Management: http://www.morganclaypool.com/toc/dtm/1/1

Data Wrangling

Theoretically-structured approaches to data wrangling

• ‡[TidyData-R] Garrett Grolemund and Hadley Wickham. 2017. R for Data Science (http://r4ds.had.co.nz), esp. Ch. 12, “Tidy Data”; also Wickham. 2014. “Tidy Data.” Journal of Statistical Software 59(10). http://www.jstatsoft.org/v59/i10/paper. Tools: The tidyverse, http://tidyverse.org

• ‡[DataScience-Py] Jake VanderPlas. 2016. Python Data Science Handbook (https://github.com/jakevdp/PythonDataScienceHandbook), esp. Ch. 3 on pandas, http://pandas.pydata.org

• ‡[DataCarpentry] Colin Gillespie and Robin Lovelace. 2017. Efficient R Programming. https://csgillespie.github.io/efficientR, esp. Ch. 6 on “efficient data carpentry”

• §[Wrangler] Joseph M. Hellerstein, Jeffrey Heer, Tye Rattenbury, and Sean Kandel. 2017. Data Wrangling: Practical Techniques for Data Preparation. Tool: Trifacta Wrangler, http://www.trifacta.com/products/wrangler

Data wrangling practice

• DSHandbook-Py, Ch. 4, “Data Munging: String Manipulation, Regular Expressions, and Data Cleaning”

• †[DataSimplification] Jules J. Berman. 2016. Data Simplification: Taming Information with Open Source Tools. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780128037812

• †[Wrangling-Py] Jacqueline Kazil, Katharine Jarmul. 2016. Data Wrangling with Python. O’Reilly. http://proquestcombo.safaribooksonline.com.ezaccess.libraries.psu.edu/9781491948804

• †[Wrangling-R] Bradley C. Boehmke. 2016. Data Wrangling with R. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45599-0

Record Linkage, Entity Resolution, Deduplication

• †[DataMatching] Peter Christen. 2012. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-642-31164-2 (esp. Ch. 2, “The Data Matching Process”)

• DataSimplification, Chapter 5, “Identifying and Deidentifying Data”

• ‡[EntityResolution] Lise Getoor and Ashwin Machanavajjhala. 2013. “Entity Resolution for Big Data.” Tutorial, KDD. http://www.umiacs.umd.edu/~getoor/Tutorials/ER_KDD2013.pdf

• ‡[SyrianCasualties] Peter Sadosky, Anshumali Shrivastava, Megan Price, and Rebecca C. Steorts. 2015. “Blocking Methods Applied to Casualty Records from the Syrian Conflict.” https://arxiv.org/abs/1510.07714 (For more on blocking, see DataMatching, Ch. 4, “Indexing.”)

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html, “Statistical Matching and Record Linkage”

“Making up data”: Imputation, Smoothers, Kernels, Priors, Filters, Teleportation, Negative Sampling, Convolution, Augmentation, Adversarial Training

• †[Imputation] Yi Deng, Changgee Chang, Moges Seyoum Ido, and Qi Long. 2016. “Multiple Imputation for General Missing Data Patterns in the Presence of High-dimensional Data.” Scientific Reports 6(21689). https://www.nature.com/articles/srep21689

• ‡[Overimputation] Matthew Blackwell, James Honaker, and Gary King. 2017. “A Unified Approach to Measurement Error and Missing Data: Overview and Applications.” Sociological Methods & Research 46(3): 303-341. http://gking.harvard.edu/files/gking/files/measure.pdf

• Shalizi-ADA, Sect. 1.5 (Linear Smoothers), Ch. 8 (Splines), Sect. 14.4 (Kernel Density Estimates); DataScience-Python, NB 05.13, “Kernel Density Estimation”

• FightinWords (re: Bayesian priors as additional data; impact of priors on regularization)

• DeepLearning, Section 7.5 (Data Augmentation), 7.12 (Dropout), 7.13 (Adversarial Examples), Chapter 9 (Convolutional Networks)

• †[Adversarial] Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy. 2015. “Explaining and Harnessing Adversarial Examples.” https://arxiv.org/abs/1412.6572

• See also resampling and simulation methods

• See also feature engineering, preprocessing

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html, “Imputation”
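
The FightinWords entry above describes Bayesian priors as additional data. A minimal sketch of that idea, assuming a Beta prior on a binomial proportion: the prior behaves like pseudo-observations added to the observed counts, and a stronger prior pulls (regularizes) the estimate further toward the prior mean. The function name and the toy counts are illustrative assumptions, not taken from any of the readings.

```python
def posterior_mean(successes, trials, prior_successes=1.0, prior_failures=1.0):
    """Posterior mean of a proportion under a Beta(prior_successes, prior_failures) prior.

    The prior enters exactly as if we had already observed
    `prior_successes` successes and `prior_failures` failures.
    """
    return (successes + prior_successes) / (trials + prior_successes + prior_failures)


# Hypothetical toy counts: 3 successes in 4 trials.
raw = 3 / 4                                  # unregularized estimate
weak = posterior_mean(3, 4)                  # (3 + 1) / (4 + 2)
strong = posterior_mean(3, 4, 10.0, 10.0)    # (3 + 10) / (4 + 20)

print(raw, weak, strong)
```

With only four real observations, twenty pseudo-observations dominate, which is the sense in which the prior both "makes up data" and regularizes.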

(Direct) Data Representations, Data Mappings

• NRCreport, Ch. 5, “Large-Scale Data Representations”

• †[PatternRecognition] M. Narasimha Murty and V. Susheela Devi. 2011. Pattern Recognition: An Algorithmic Approach. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-0-85729-495-1 Section 2.1, “Data Structures for Pattern Representation”

• ‡[InfoRetrieval] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2009. Introduction to Information Retrieval. Cambridge University Press. http://nlp.stanford.edu/IR-book Ch. 1, 2, 6 (also note slides used in their class)

• †[Algorithms] Brian Steele, John Chandler, Swarna Reddy. 2016. Algorithms for Data Science. Wiley. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45797-0 Ch. 2, “Data Mapping and Data Dictionaries”

• See also Social Data Structures

Similarity, Distance, Association, Covariance, The Kernel Trick

• PatternRecognition, Section 2.3, “Proximity Measures”

• ‡[Similarity] Brendan O’Connor. 2012. “Cosine similarity, Pearson correlation, and OLS coefficients.” https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients

• For relatively comprehensive lists, see also:

M-J. Lesot, M. Rifqi, and H. Benhadda. 2009. “Similarity measures for binary and numerical data: a survey.” Int. J. Knowledge Engineering and Soft Data Paradigms 1(1): 63-. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.212.6533&rep=rep1&type=pdf

Seung-Seok Choi, Sung-Hyuk Cha, Charles C. Tappert. 2010. “A Survey of Binary Similarity and Distance Measures.” Journal of Systemics, Cybernetics & Informatics 8(1): 43-8. http://www.iiisci.org/Journal/CV$/sci/pdfs/GS315JG.pdf

Sung-Hyuk Cha. 2007. “Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions.” International Journal of Mathematical Models and Methods in Applied Sciences 4(1): 300-7. http://csis.pace.edu/ctappert/dps/d861-12/session4-p2.pdf

Anna Huang. 2008. “Similarity Measures for Text Document Clustering.” Proceedings of the 6th New Zealand Computer Science Research Student Conference, 49-56. http://www.nzcsrsc08.canterbury.ac.nz/site/proceedings/Individual_Papers/pg049_Similarity_Measures_for_Text_Document_Clustering.pdf

• Michael I. Jordan. “The Kernel Trick” (Lecture Notes). https://people.eecs.berkeley.edu/~jordan/courses/281B-spring04/lectures/lec3.pdf

• Eric Kim. 2017. “Everything You Ever Wanted to Know about the Kernel Trick (But Were Afraid to Ask).” http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick_blog_ekim_12_20_2017.pdf

Derived Data Representations - Dimensionality Reduction, Compression, Decomposition, Embeddings

The groupings here and under the measurement / multivariate statistics section are particularly arbitrary. For example, “k-Means clustering” can be viewed as a technique for “dimensionality reduction,” “compression,” “feature extraction,” “latent variable measurement,” “unsupervised learning,” or “collaborative filtering.”
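
To make the point about k-means concrete, here is a minimal sketch, assuming plain Lloyd's algorithm on made-up two-dimensional points with a deterministic (evenly spaced) initialization chosen only for illustration. The same fitted model is then read two ways: as compression (each point is replaced by a small integer code plus a shared codebook of centroids) and as feature extraction (each point is re-described by its distances to the centroids). The function and variable names are hypothetical.

```python
import math


def kmeans(points, k, iters=10):
    """Plain Lloyd's algorithm; initial centroids are evenly spaced input points."""
    n = len(points)
    centroids = [list(points[i * n // k]) for i in range(k)]
    labels = [0] * n
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


# Made-up toy data: two well-separated groups.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.2, 4.9), (4.9, 5.1)]
centroids, labels = kmeans(points, k=2)

# View 1, compression / quantization: store one code per point plus the codebook.
reconstructed = [centroids[lab] for lab in labels]

# View 2, feature extraction: describe each point by its distance to each centroid.
features = [[math.dist(p, c) for c in centroids] for p in points]

print(labels)
print(centroids)
```

Nothing about the fitted object changes between the two views; only the question asked of it does, which is why the groupings in this section overlap.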

Clustering, hashing, quantization, blocking, compression

• ‡[Compression] Khalid Sayood. 2012. Introduction to Data Compression, 4th ed. Springer. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780124157965 (e.g., coding, blocking via vector quantization)

• MMDS, Ch. 3, “Finding Similar Items” (minhashing, locality sensitive hashing)

• Multivariate-R, Ch. 6, “Clustering”

• DataScience-Python, NB 05.11, “k-Means Clustering”; NB 05.12, “Gaussian Mixture Models”

• ‡[KMeansHashing] Kaiming He, Fang Wen, Jian Sun. 2013. “K-means Hashing: An Affinity-Preserving Quantization Method for Learning Binary Compact Codes.” CVPR. https://www.cv-foundation.org/openaccess/content_cvpr_2013/papers/He_K-Means_Hashing_An_2013_CVPR_paper.pdf

• †[Squashing] Madigan, D., Raghavan, N., Dumouchel, W., Nason, M., Posse, C., and Ridgeway, G. (2002). “Likelihood-based data squashing: A modeling approach to instance construction.” Data Mining and Knowledge Discovery 6(2): 173-190. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1023/A:1014095614948

• †[Core-sets] Piotr Indyk, Sepideh Mahabadi, Mohammad Mahdian, Vahab S. Mirrokni. 2014. “Composable core-sets for diversity and coverage maximization.” PODS ’14, 100-8. https://doi.org/10.1145/2594538.2594560

Feature selection, feature extraction, feature engineering, weighting, preprocessing

• PatternRecognition, Sections 2.6-7, “Feature Selection,” “Feature Extraction”

• DSHandbook-Py, Ch. 7, “Interlude: Feature Extraction Ideas”

• DataScience-Python, NB 05.04, “Feature Engineering”

• Features in text: NLP and InfoRetrieval re: tf-idf and similar; FightinWords

• Features in images: DataScience-Python, NB 05.14, “Image Features”

Dimensionality reduction, decomposition, factorization, change of basis, reparameterization, matrix completion, latent variables, source separation

• MMDS, Ch. 9, “Recommendation Systems”; Ch. 11, “Dimensionality Reduction”

• DSHandbook-Py, Ch. 10, “Unsupervised Learning: Clustering and Dimensionality Reduction”

• ‡[Shalizi-ADA] Cosma Rohilla Shalizi. 2017. Advanced Data Analysis from an Elementary Point of View. Ch. 16 (“Principal Components Analysis”), Ch. 17 (“Factor Models”). http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV

See also Multivariate-R, Ch. 3, “Principal Components Analysis”; Ch. 4, “Multidimensional Scaling”; Ch. 5, “Exploratory Factor Analysis”; Latent

• DeepLearning, Ch. 2, “Linear Algebra”; Ch. 13, “Linear Factor Models”

• ‡[GloVe] Jeffrey Pennington, Richard Socher, Christopher Manning. 2014. “GloVe: Global vectors for word representation.” EMNLP. https://nlp.stanford.edu/projects/glove

• †[NMF] Daniel D. Lee and H. Sebastian Seung. 1999. “Learning the parts of objects by non-negative matrix factorization.” Nature 401: 788-791. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1038/44565

• †[CUR] Michael W. Mahoney and Petros Drineas. 2009. “CUR matrix decompositions for improved data analysis.” PNAS. http://www.pnas.org/content/106/3/697.full

• ‡[ICA] Aapo Hyvärinen and Erkki Oja. 2000. “Independent Component Analysis: Algorithms and Applications.” Neural Networks 13(4-5): 411-30. https://www.cs.helsinki.fi/u/ahyvarin/papers/NN00new.pdf

• [RandomProjection] Ella Bingham and Heikki Mannila. 2001. “Random projection in dimensionality reduction: applications to image and text data.” KDD. https://doi.org/10.1145%2F502512.502546

• Compression, e.g., “Transform coding,” “Wavelets”

Nonlinear dimensionality reduction, Manifold learning

• DeepLearning, Section 5.11.3, “Manifold Learning”

• DataScience-Python, NB 05.10, “Manifold Learning” (Locally linear embedding [LLE], Isomap)

• ‡[KernelPCA] Sebastian Raschka. 2014. “Kernel tricks and nonlinear dimensionality reduction via RBF kernel PCA.” http://sebastianraschka.com/Articles/2014_kernel_pca.html

• Shalizi-ADA, Ch. 18, “Nonlinear Dimensionality Reduction” (LLE)

• ‡[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi. 2003. “Laplacian eigenmaps for dimensionality reduction and data representation.” Neural Computation 15(6): 1373–1396. http://web.cse.ohio-state.edu/~belkin.8/papers/LEM_NC_03.pdf

• DeepLearning, Ch. 14 (“Autoencoders”)

• ‡[word2vec] Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” https://arxiv.org/abs/1301.3781

• ‡[word2vecExplained] Yoav Goldberg and Omer Levy. 2014. “word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method.” https://arxiv.org/abs/1402.3722

• ‡[t-SNE] Laurens van der Maaten and Geoffrey Hinton. 2008. “Visualizing Data using t-SNE.” Journal of Machine Learning Research 9: 2579-2605. http://jmlr.csail.mit.edu/papers/volume9/vandermaaten08a/vandermaaten08a.pdf

Computation and Scaling Up

Scientific computing, computation at scale

• NRCreport, Ch. 10, “The Seven Computational Giants of Massive Data Analysis”

• DSHandbook-Py, Ch. 21, “Performance and Computer Memory”; Ch. 22, “Computer Memory and Data Structures”

• ‡[ComputationalStatistics-Python] Cliburn Chan. Computational Statistics in Python. https://people.duke.edu/~ccc14/sta-663/index.html

• CRAN Task View: High Performance Computing. https://cran.r-project.org/web/views/HighPerformanceComputing.html

Numerical computing

• [NumericalComputing] Ward Cheney and David Kincaid. 2013. Numerical Mathematics and Computing, 7th ed. Brooks/Cole, Cengage Learning.

• CRAN Task View: Numerical Mathematics. https://cran.r-project.org/web/views/NumericalMathematics.html

Optimization (e.g., MLE, gradient descent, stochastic gradient descent, EM algorithm, neural nets)

• DSHandbook-Py, Ch. 23, “Maximum Likelihood Estimation and Optimization” (gradient descent)

• DeepLearning, Ch. 4, “Numerical Computation”; Ch. 6, “Deep Feedforward Networks”; Ch. 8, “Optimization for Training Deep Models”

• CRAN Task View: Optimization and Mathematical Programming. https://cran.r-project.org/web/views/Optimization.html

Linear algebra, matrix computations

• [MatrixComputations] Gene H. Golub and Charles F. Van Loan. 2013. Matrix Computations, 4th ed. Johns Hopkins University Press.

• ‡[NetflixMatrix] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. “Matrix Factorization Techniques for Recommender Systems.” Computer, August: 42-9. https://datajobs.com/data-science-repo/Recommender-Systems-[Netflix].pdf

• ‡[BigDataPCA] Jianqing Fan, Qiang Sun, Wen-Xin Zhou, Ziwei Zhu. “Principal Component Analysis for Big Data.” http://www.princeton.edu/~ziweiz/pca.pdf

• ‡[RandomSVD] Andrew Tulloch. 2009. “Fast Randomized SVD.” https://research.fb.com/fast-randomized-svd

• ‡[FactorSGD] Rainer Gemulla, Peter J. Haas, Erik Nijkamp, and Yannis Sismanis. 2011. “Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent.” KDD. http://www.cs.utah.edu/~hari/teaching/bigdata/gemulla11dsgd.pdf

• ‡[Sparse] Max Grossman. 2015. “101 Ways to Store a Sparse Matrix.” https://medium.com/@jmaxg3/101-ways-to-store-a-sparse-matrix-c7f2bf15a229

• ‡[DontInvert] John D. Cook. 2010. “Don’t Invert that Matrix.” https://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix

Simulation-based inference, resampling, Monte Carlo methods, MCMC, Bayes, approximate inference

• ‡[MarkovVisually] Victor Powell. “Markov Chains Explained Visually.” http://setosa.io/ev/markov-chains

• DSHandbook-Py, Ch. 25, “Stochastic Modeling” (Markov chains, MCMC, HMM)

• Bayes; ComputationalStatistics-Py

• DeepLearning, Ch. 17, “Monte Carlo Methods”; Ch. 19, “Approximate Inference”

• ‡[VariationalInference] Jason Eisner. 2011. “High-Level Explanation of Variational Inference.” https://www.cs.jhu.edu/~jason/tutorials/variational.html

• CRAN Task View: Bayesian Inference. https://cran.r-project.org/web/views/Bayesian.html

Parallelism, MapReduce, Split-Apply-Combine

• ‡[MapReduceIntuition] Jigsaw Academy. 2014. “Big Data Specialist: MapReduce.” https://www.youtube.com/watch?v=TwcYQzFqg-8&feature=youtu.be

• MMDS, Ch. 2, “Map-Reduce and the New Software Stack” (3rd Edition discusses Spark & TensorFlow)

• Algorithms, Ch. 3, “Scalable Algorithms and Associative Statistics”; Ch. 4, “Hadoop and MapReduce”

• NRCreport, Ch. 6, “Resources, Trade-offs, and Limitations”

• TidyData-R (Split-apply-combine is the motivating principle behind the “tidyverse” approach). See also Part III, “Program” (pipes, functions, vectors, iteration)

• ‡[TidySAC-Video] Hadley Wickham. 2017. “Data Science in the Tidyverse.” https://www.rstudio.com/resources/videos/data-science-in-the-tidyverse

• See also FactorSGD
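
The split-apply-combine pattern named in the TidyData-R entry above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the helper name `split_apply_combine`, the field names, and the turnout figures are all made up for the example and do not come from any of the readings.

```python
from collections import defaultdict
from statistics import mean


def split_apply_combine(records, key, value, func):
    """Group records by `key`, apply `func` to each group's values, collect results."""
    groups = defaultdict(list)
    for record in records:                 # split: route each record to its group
        groups[record[key]].append(record[value])
    # apply `func` within each group, then combine into one result per group
    return {group: func(values) for group, values in groups.items()}


# Made-up toy records.
records = [
    {"state": "PA", "turnout": 0.61},
    {"state": "PA", "turnout": 0.57},
    {"state": "OH", "turnout": 0.64},
    {"state": "OH", "turnout": 0.60},
    {"state": "NY", "turnout": 0.55},
]

mean_turnout = split_apply_combine(records, "state", "turnout", mean)
print(mean_turnout)
```

Because each group is processed independently, the apply step is the part that can be distributed across workers, which is the link to MapReduce: the split corresponds to the map/shuffle by key and the per-group function to the reduce.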

Functional Programming

• ComputationalStatistics-Python, “Functions are first class objects” through first exercises

• DSHandbook-Py, Ch. 20, “Programming Language Concepts”

• See also Haskell

Scaling iteration, streaming data, online algorithms (Spark)

• NRCreport, Ch. 4, “Temporal Data and Real-Time Algorithms”

• MMDS, Ch. 4, “Mining Data Streams”

• ‡[BDAS] AMPLab. BDAS: The Berkeley Data Analytics Stack. https://amplab.cs.berkeley.edu/software

• †[Spark] Mohammed Guller. 2015. Big Data Analytics with Spark: A Practitioner’s Guide to Using Spark for Large-Scale Data Processing, Machine Learning, and Graph Analytics, and High-Velocity Data Stream Processing. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-0964-6

• DSHandbook-Py, Ch. 13, “Big Data”

• DeepLearning, Ch. 10, “Sequence Modeling: Recurrent and Recursive Nets”

General Resources for Python and R

• †[DSHandbook-Py] Field Cady. 2017. The Data Science Handbook (Python-based). Wiley. http://onlinelibrary.wiley.com.ezaccess.libraries.psu.edu/book/10.1002/9781119092919

• ‡[Tutorials-R] Ujjwal Karn. 2017. “A curated list of R tutorials for Data Science, NLP, and Machine Learning.” https://github.com/ujjwalkarn/DataScienceR

• ‡[Tutorials-Python] Ujjwal Karn. 2017. “A curated list of Python tutorials for Data Science, NLP, and Machine Learning.” https://github.com/ujjwalkarn/DataSciencePython

Cutting and Bleeding Edge of Data Science Languages

• [Scala] †Vishal Layka and David Pollak. 2015. Beginning Scala. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-0232-6 †Patrick R. Nicolas. 2014. Scala for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1901910 ‡Scala site: https://www.scala-lang.org (See also )

• [Julia] †Ivo Balbaert. 2015. Getting Started with Julia. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1973847 ‡Douglas Bates. 2013. “Julia for R Programmers.” http://www.stat.wisc.edu/~bates/JuliaForRProgrammers.pdf ‡Julia site: https://julialang.org

• [Haskell] †Richard Bird. 2014. Thinking Functionally with Haskell. https://doi-org.ezaccess.libraries.psu.edu/10.1017/CBO9781316092415 †Hakim Cassimally. 2017. Learning Haskell Programming. https://www.lynda.com/Haskell-tutorials/Learning-Haskell-Programming/604926-2.html †James Church. 2017. Learning Haskell for Data Analysis. https://www.lynda.com/Developer-tutorials/Learning-Haskell-Data-Analysis/604234-2.html Haskell site: https://www.haskell.org

• [Clojure] †Mark McDonnell. 2017. Quick Clojure: Essential Functional Programming. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2952-1 †Akhil Wali. 2014. Clojure for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1674848 †Arthur Ulfeldt. 2015. https://www.lynda.com/Clojure-tutorials/Up-Running-Clojure/413127-2.html ‡Clojure site: https://clojure.org

• [TensorFlow] ‡Abadi et al. (Google). 2016. “TensorFlow: A System for Large-Scale Machine Learning.” https://arxiv.org/abs/1605.08695 ‡Udacity, “Deep Learning.” https://www.udacity.com/course/deep-learning--ud730 ‡TensorFlow site: https://www.tensorflow.org

• [H2O] Darren Cook. 2016. Practical Machine Learning with H2O: Powerful, Scalable Techniques for Deep Learning and AI. O’Reilly. Arno Candel and Viraj Parmar. 2015. Deep Learning with H2O. ‡H2O site: https://www.h2o.ai

Social Data Structures

Space and Time

• †[GIA] David O’Sullivan, David J. Unwin. 2010. Geographic Information Analysis, Second Edition. John Wiley & Sons. http://onlinelibrary.wiley.com/book/10.1002/9780470549094

• [Space-Time] Donna J. Peuquet. 2003. Representations of Space and Time. Guilford.

• Roger S. Bivand, Edzer Pebesma, Virgilio Gómez-Rubio. 2013. Applied Spatial Data Analysis in R. Free through library: http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-7618-4 See also Edzer Pebesma. 2016. “Handling and Analyzing Spatial, Spatiotemporal, and Movement Data in R.” https://edzer.github.io/UseR2016

• ‡[PySAL] Sergio J. Rey and Dani Arribas-Bel. 2016. “Geographic Data Science with PySAL and the pydata Stack.” http://darribas.org/gds_scipy16

• DataScience-Python, NB 04.13, “Geographic Data with Basemap”

• Time: “Longitudinal,” intra-individual data, as common in developmental psychology. †[Longitudinal] Garrett Fitzmaurice, Marie Davidian, Geert Verbeke, and Geert Molenberghs, editors. 2009. Longitudinal Data Analysis. Chapman & Hall / CRC. http://pensu.eblib.com/patron/FullRecord.aspx?p=359998

• Time: “Time Series, Time Series Cross Section, Panel,” as common in economics, political science, and sociology. DataScience-Python, Ch. 3 re: pandas; DSHandbook-Py, Ch. 17, “Time Series Analysis”; Econometrics; GelmanHill

• Time: “Sequential, Data Streams,” as in NLP, machine learning. Spark and other “streaming data” readings

• Space-Time: Data with continuity, connectivity, neighborhood structure. See DeepLearning, Chapter 9 re: convolution

• CRAN Task Views: Spatial, https://cran.r-project.org/web/views/Spatial.html; Spatiotemporal, https://cran.r-project.org/web/views/SpatioTemporal.html; Time Series, https://cran.r-project.org/web/views/TimeSeries.html

Network, Graphs

• ‡[Networks] David Easley and Jon Kleinberg. 2010. Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press. http://www.cs.cornell.edu/home/kleinber/networks-book

• MMDS, Ch. 5, “Link Analysis”; Ch. 10, “Mining Social-Network Graphs”

• †[Networks-R] Douglas Luke. 2015. A User’s Guide to Network Analysis in R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-3-319-23883-8

• †[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry. 2017. Python for Graph and Network Analysis. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-53004-8

Hierarchy, Clustered Data, Aggregation, Mixed Models

• †[GelmanHill] Andrew Gelman and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press. http://pensu.eblib.com/patron/FullRecord.aspx?p=288457

• †[MixedModels] Eugene Demidenko. 2013. Mixed Models: Theory and Applications with R, 2nd ed. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10748641

Social Data Channels

Language, Text, Speech, Audio

• ‡[NLP] Daniel Jurafsky and James H. Martin. Forthcoming (2017). Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics, 3rd ed. Prentice-Hall. Preprint: https://web.stanford.edu/~jurafsky/slp3

• InfoRetrieval

• ‡[CoreNLP] Stanford CoreNLP. https://stanfordnlp.github.io/CoreNLP

• †[TextAsData] Grimmer and Stewart. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis. https://doi.org/10.1093/pan/mps028

• Quinn-Topics

• FightinWords

• †[NLTK-Python] Jacob Perkins. 2010. Python Text Processing with NLTK 2.0 Cookbook. Packt. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10435387

• †[TextAnalytics-Python] Dipanjan Sarkar. 2016. Text Analytics with Python. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2388-8

• DSHandbook-Py, Ch. 16, “Natural Language Processing”

• Matthew J. Denny. http://www.mjdenny.com/Text_Processing_In_R.html

• CRAN: Natural Language Processing. https://cran.r-project.org/web/views/NaturalLanguageProcessing.html

• †[AudioData] Dean Knox and Christopher Lucas. 2017. “The Speaker-Affect Model: Measuring Emotion in Political Speech with Audio Data.” http://christopherlucas.org/files/PDFs/sam.pdf; https://www.youtube.com/watch?v=Hs8A9dwkMzI

• †Morgan & Claypool Synthesis Lectures on Human Language Technologies: http://www.morganclaypool.com/toc/hlt/1/1

• †Morgan & Claypool Synthesis Lectures on Speech and Audio Processing: http://www.morganclaypool.com/toc/sap/1/1

Vision, Image, Video

• ‡[ComputerVision] Richard Szeliski. 2010. Computer Vision: Algorithms and Applications. Springer. http://szeliski.org/Book

• MixedModels, Ch. 11, “Statistical Analysis of Shape”; Ch. 12, “Statistical Image Analysis”

• Audio/Video/Volumetric: See DeepLearning, Chapter 9 re: convolution

• †Morgan & Claypool Synthesis Lectures on Image, Video, and Multimedia Processing: http://www.morganclaypool.com/toc/ivm/1/1

• †Morgan & Claypool Synthesis Lectures on Computer Vision: http://www.morganclaypool.com/toc/cov/1/1

Approaches to Learning from Data (The Analytics Layer)

• NRCreport, Ch. 7, “Building Models from Massive Data”

• MMDS (data-mining)

• CausalInference

• ‡[VisualAnalytics] Daniel Keim, Jörn Kohlhammer, Geoffrey Ellis, and Florian Mansmann, editors. 2010. Mastering the Information Age: Solving Problems with Visual Analytics. The Eurographics Association, Goslar, Germany. http://www.vismaster.eu/book See also †Morgan & Claypool Synthesis Lectures on Visualization: http://www.morganclaypool.com/toc/vis/2/1

• ‡[Econometrics] Bruce E. Hansen. 2014. Econometrics. http://www.ssc.wisc.edu/~bhansen/econometrics

• ‡[StatisticalLearning] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning with Applications in R. Springer. http://www-bcf.usc.edu/~gareth/ISL/index.html

• [MachineLearning] Christopher Bishop. 2006. Pattern Recognition and Machine Learning. Springer.

• †[Bayes] Peter D. Hoff. 2009. A First Course in Bayesian Statistical Methods. Springer. http://link.springer.com/book/10.1007%2F978-0-387-92407-6

• †[GraphicalModels] Søren Højsgaard, David Edwards, Steffen Lauritzen. 2012. Graphical Models with R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-2299-0

• [InformationTheory] David MacKay. 2003. Information Theory, Inference, and Learning Algorithms. Cambridge University Press.

• ‡[DeepLearning] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org (see also Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. “Deep Learning.” Nature 521(7553): 436–444. https://doi.org/10.1038/nature14539)

See also ‡[Keras] Keras: The Python Deep Learning Library. https://keras.io François Chollet, Directory of Keras tutorials: https://github.com/fchollet/keras-resources

• †[SignalProcessing] Jose Maria Giron-Sierra. 2013. Digital Signal Processing with MATLAB Examples (Volumes 1-3). Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-981-10-2534-1

Penn State Policy Statements

Academic Integrity

Academic integrity is the pursuit of scholarly activity in an open, honest, and responsible manner. Academic integrity is a basic guiding principle for all academic activity at The Pennsylvania State University, and all members of the University community are expected to act in accordance with this principle. Consistent with this expectation, the University’s Code of Conduct states that all students should act with personal integrity; respect other students’ dignity, rights, and property; and help create and maintain an environment in which all can succeed through the fruits of their efforts.

Academic integrity includes a commitment by all members of the University community not to engage in or tolerate acts of falsification, misrepresentation, or deception. Such acts of dishonesty violate the fundamental ethical principles of the University community and compromise the worth of work completed by others.

Disability Accommodation

Penn State welcomes students with disabilities into the University’s educational programs. Every Penn State campus has an office for students with disabilities. The Student Disability Resources (SDR) website provides contact information for every Penn State campus (http://equity.psu.edu/sdr/disability-coordinator). For further information, please visit the Student Disability Resources website (http://equity.psu.edu/sdr).

In order to receive consideration for reasonable accommodations, you must contact the appropriate disability services office at the campus where you are officially enrolled, participate in an intake interview, and provide documentation. See the documentation guidelines at http://equity.psu.edu/sdr/guidelines. If the documentation supports your request for reasonable accommodations, your campus disability services office will provide you with an accommodation letter. Please share this letter with your instructors and discuss the accommodations with them as early as possible. You must follow this process for every semester that you request accommodations.

Psychological Services

Many students at Penn State face personal challenges or have psychological needs that may interfere with their academic progress, social development, or emotional wellbeing. The university offers a variety of confidential services to help you through difficult times, including individual and group counseling, crisis intervention, consultations, online chats, and mental health screenings. These services are provided by staff who welcome all students and embrace a philosophy respectful of clients’ cultural and religious backgrounds, and sensitive to differences in race, ability, gender identity, and sexual orientation.

• Counseling and Psychological Services at University Park (CAPS) (http://studentaffairs.psu.edu/counseling): 814-863-0395

• Penn State Crisis Line (24 hours / 7 days / week): 877-229-6400

• Crisis Text Line (24 hours / 7 days / week): Text LIONS to 741741

Educational Equity

Penn State takes great pride in fostering a diverse and inclusive environment for students, faculty, and staff. Consistent with University Policy AD29, students who believe they have experienced or observed a hate crime, an act of intolerance, discrimination, or harassment that occurs at Penn State are urged to report these incidents as outlined on the University’s Report Bias webpage (http://equity.psu.edu/reportbias).

Background Knowledge

Students will have a variety of backgrounds. Prior to beginning interdisciplinary coursework to fulfill Social Data Analytics degree requirements, including SoDA 501 and 502, students are expected to have advanced (graduate) training in at least one of the component areas of Social Data Analytics and a familiarity with basic concepts in the others.

With regard to specialization, students are expected to have advanced (graduate) training in ONE of the following:

• quantitative social science methodology and a discipline of social science (as would be the case for a second-year PhD student in Political Science, Sociology, Criminology, Human Development and Family Studies, or Demography), OR

• statistics (as would be the case for a second-year PhD student in Statistics), OR

• information science or informatics (as would be the case for a second-year PhD student in Information Science and Technology, or a second-year PhD student in Geography specializing in GIScience), OR

• computer science (as would be the case for a second-year PhD student in Computer Science and Engineering).

This requirement is met as a matter of meeting home program requirements for students in the dual-title PhD, but may require additional coursework on the part of students in other programs wishing to pursue the graduate minor.

With regard to general preparation, students are expected to have ALL of the following technical knowledge:

• basic programming skills, including basic facility with R and/or Python, AND

• basic knowledge of relational databases and/or geographic information systems, AND

• basic knowledge of probability, applied statistics, and/or social science research design, AND

• basic familiarity with a substantive or theoretical area of social science (e.g., 300-level coursework in political science, sociology, criminology, human development, psychology, economics, communication, anthropology, human geography, social informatics, or similar fields).

It is not unusual for students to have one or more gaps in this preparation at time of application to the SoDA program. Students should work with Social Data Analytics advisers to develop a plan for timely remediation of any deficiencies, which generally will not require formal coursework for students whose training and interests are otherwise appropriate for pursuit of the Social Data Analytics degree. Where possible, this will be addressed at time of application to the Social Data Analytics program.

To this end, some free training materials are linked in the reference section, and there are also abundant free, high-quality, self-paced, course-style training materials on these and related subjects available through edX (http://www.edx.org), Udacity (http://www.udacity.com), Codecademy (http://codecademy.com), and Lynda (http://www.lynda.psu).

Revised Common Rule: https://www.research.psu.edu/irb/commonrulechanges

Responsible Conduct of Research: https://www.research.psu.edu/education/rcr

Research Misconduct: https://www.research.psu.edu/researchmisconduct

SARI@PSU (Scientific and Research Integrity Training): https://www.research.psu.edu/training/sari

• ‡[MenloReport] David Dittrich and Erin Kenneally (Center for Applied Internet Data Analysis). 2012. “The Menlo Report: Ethical Principles Guiding Information and Communication Technology Research” and companion report “Applying Ethical Principles ...” https://www.caida.org/publications/papers/2012/menlo_report_actual_formatted

• ‡[BigDataEthics-CBDES] Jacob Metcalf, Emily F. Keller, and danah boyd. 2016. “Perspectives on Big Data, Ethics, and Society.” Council for Big Data, Ethics, and Society. http://bdes.datasociety.net/council-output/perspectives-on-big-data-ethics-and-society

• ‡[AoIRReport] Annette Markham and Elizabeth Buchanan (Association of Internet Researchers). 2012. “Ethical Decision-Making and Internet Research: Recommendations of the AoIR Ethics Working Committee (Version 2.0).” https://aoir.org/reports/ethics2.pdf and guidelines chart https://aoir.org/wp-content/uploads/2017/01/aoir_ethics_graphic_2016.pdf

• ‡[BigDataEthics-Wired] Sarah Zhang. 2016. “Scientists are Just as Confused about the Ethics of Big-Data Research as You.” Wired. https://www.wired.com/2016/05/scientists-just-confused-ethics-big-data-research

• †[BigDataEthics-HerschelMori] Richard Herschel and Virginia M. Mori. 2017. “Ethics & Big Data.” Technology in Society 49: 31-36. http://doi.org/10.1016/j.techsoc.2017.03.003

The Science of Data Privacy

• †[BigDataPrivacy] Terence Craig and Mary E. Ludloff. 2011. Privacy and Big Data. O’Reilly Media. http://pensu.eblib.com/patron/FullRecord.aspx?p=781814

• ‡[DataAnalysisPrivacy] John Abowd, Lorenzo Alvisi, Cynthia Dwork, Sampath Kannan, Ashwin Machanavajjhala, Jerome Reiter. 2017. “Privacy-Preserving Data Analysis for the Federal Statistical Agencies.” A Computing Community Consortium white paper. https://arxiv.org/abs/1701.00752

• †[DataPrivacy] Stephen E. Fienberg and Aleksandra B. Slavkovic. 2011. “Data Privacy and Confidentiality.” International Encyclopedia of Statistical Science: 342–5. http://doi.org/978-3-642-04898-2_202

• ‡[NetworksPrivacy] Vishesh Karwa and Aleksandra Slavkovic. 2016. “Inference using noisy degrees: Differentially private β-model and synthetic graphs.” Annals of Statistics 44(1): 87-112. http://projecteuclid.org/euclid.aos/1449755958

• †[DataPublishingPrivacy] Raymond Chi-Wing Wong and Ada Wai-Chee Fu. 2010. Privacy-Preserving Data Publishing: An Overview. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdfplus/10.2200/S00237ED1V01Y201003DTM002

• DataMatching, Ch. 8 “Privacy Aspects of Data Matching”

Transparency, Reproducibility, and Team Science

• ‡[10RulesforData] Alyssa Goodman, Alberto Pepe, Alexander W. Blocker, Christine L. Borgman, Kyle Cranmer, Merce Crosas, Rosanne Di Stefano, Yolanda Gil, Paul Groth, Margaret Hedstrom, David W. Hogg, Vinay Kashyap, Ashish Mahabal, Aneta Siemiginowska, and Aleksandra Slavkovic. 2014. “Ten Simple Rules for the Care and Feeding of Scientific Data.” PLoS Computational Biology 10(4): e1003542. https://doi.org/10.1371/journal.pcbi.1003542 (Note esp. curated resources for reproducible research.)

• †[Transparency] E. Miguel, C. Camerer, K. Casey, J. Cohen, K. M. Esterling, A. Gerber, R. Glennerster, D. P. Green, M. Humphreys, G. Imbens, D. Laitin, T. Madon, L. Nelson, B. A. Nosek, M. Petersen, R. Sedlmayr, J. P. Simmons, U. Simonsohn, M. Van der Laan. 2014. “Promoting Transparency in Social Science Research.” Science 343(6166): 30–1. https://doi.org/10.1126/science.1245317

• †[Reproducibility] Marcus R. Munafo, Brian A. Nosek, Dorothy V. M. Bishop, Katherine S. Button, Christopher D. Chambers, Nathalie Percie du Sert, Uri Simonsohn, Eric-Jan Wagenmakers, Jennifer J. Ware, and John P. A. Ioannidis. 2017. “A Manifesto for Reproducible Science.” Nature Human Behavior 0021 (2017). https://doi.org/10.1038/s41562-016-0021

• ‡[SoftwareCarpentry] esp. “Lessons”: http://software-carpentry.org/lessons

• DSHandbook-Py, Ch. 9 “Technical Communication and Documentation,” Ch. 15 “Software Engineering Best Practices”

• †[TeamScienceToolkit] Vogel AL, Hall KL, Fiore SM, Klein JT, Bennett LM, Gadlin H, Stokols D, Nebeling LC, Wuchty S, Patrick K, Spotts EL, Pohl C, Riley WT, Falk-Krzesinski HJ. 2013. “The Team Science Toolkit: enhancing research collaboration through online knowledge sharing.” American Journal of Preventive Medicine 45: 787-9. doi:10.1016/j.amepre.2013.09.001

• §[Databrary] Kara Hall, Robert Croyle, and Amanda Vogel. Forthcoming (2017). Advancing Social and Behavioral Health Research through Cross-disciplinary Team Science. Springer. Includes Rick O. Gilmore and Karen E. Adolph, “Open Sharing of Research Video: Breaking the Boundaries of the Research Team.” (See http://databrary.org)

• CRAN: https://cran.r-project.org/web/views/ReproducibleResearch.html

Social Bias, Fair Algorithms

• MachineBias

• ‡[EmbeddingsBias] Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. “Semantics Derived Automatically from Language Corpora Contain Human Biases.” Science. https://arxiv.org/abs/1608.07187

• ‡[Debiasing] Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, Adam Kalai. 2016. “Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings.” https://arxiv.org/abs/1607.06520

• ‡[AvoidingBias] Moritz Hardt, Eric Price, Nathan Srebro. 2016. “Equality of Opportunity in Supervised Learning.” https://arxiv.org/abs/1610.02413

• ‡[InevitableBias] Jon Kleinberg, Sendhil Mullainathan, Manish Raghavan. 2016. “Inherent Trade-Offs in the Fair Determination of Risk Scores.” https://arxiv.org/abs/1609.05807

Databases and Data Management

• NRCreport, Ch. 3 “Scaling the Infrastructure for Data Management”

• †[SQL] Jan L. Harrington. 2010. SQL Clearly Explained. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780123756978

• DSHandbook-Py, Ch. 14 “Databases”

• †[noSQL] Guy Harrison. 2015. Next Generation Databases: NoSQL, NewSQL, and Big Data. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-1329-2

• †[Cloud] Divyakant Agrawal, Sudipto Das, and Amr El Abbadi. 2012. Data Management in the Cloud: Challenges and Opportunities. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdfplus/10.2200/S00456ED1V01Y201211DTM032

• †Morgan & Claypool Synthesis Lectures on Data Management: http://www.morganclaypool.com/toc/dtm/1/1

Data Wrangling

Theoretically-structured approaches to data wrangling

• ‡[TidyData-R] Garrett Grolemund and Hadley Wickham. 2017. R for Data Science (http://r4ds.had.co.nz), esp. Ch. 12 “Tidy Data”; also Wickham. 2014. “Tidy Data.” Journal of Statistical Software 59(10). http://www.jstatsoft.org/v59/i10/paper. Tools: The tidyverse, http://tidyverse.org

• ‡[DataScience-Py] Jake VanderPlas. 2016. Python Data Science Handbook (https://github.com/jakevdp/PythonDataScienceHandbook), esp. Ch. 3 on pandas, http://pandas.pydata.org

• ‡[DataCarpentry] Colin Gillespie and Robin Lovelace. 2017. Efficient R Programming. https://csgillespie.github.io/efficientR, esp. Ch. 6 on “efficient data carpentry”

• §[Wrangler] Joseph M. Hellerstein, Jeffrey Heer, Tye Rattenbury, and Sean Kandel. 2017. Data Wrangling: Practical Techniques for Data Preparation. Tool: Trifacta Wrangler, http://www.trifacta.com/products/wrangler

Data wrangling practice

• DSHandbook-Py, Ch. 4 “Data Munging: String Manipulation, Regular Expressions, and Data Cleaning”

• †[DataSimplification] Jules J. Berman. 2016. Data Simplification: Taming Information with Open Source Tools. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780128037812

• †[Wrangling-Py] Jacqueline Kazil, Katharine Jarmul. 2016. Data Wrangling with Python. O’Reilly. http://proquestcombo.safaribooksonline.com.ezaccess.libraries.psu.edu/9781491948804

• †[Wrangling-R] Bradley C. Boehmke. 2016. Data Wrangling with R. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45599-0

Record Linkage, Entity Resolution, Deduplication

• †[DataMatching] Peter Christen. 2012. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-642-31164-2 (esp. Ch. 2 “The Data Matching Process”)

• DataSimplification, Chapter 5 “Identifying and Deidentifying Data”

• ‡[EntityResolution] Lise Getoor and Ashwin Machanavajjhala. 2013. “Entity Resolution for Big Data.” Tutorial, KDD. http://www.umiacs.umd.edu/~getoor/Tutorials/ER_KDD2013.pdf

• ‡[SyrianCasualties] Peter Sadosky, Anshumali Shrivastava, Megan Price, and Rebecca C. Steorts. 2015. “Blocking Methods Applied to Casualty Records from the Syrian Conflict.” https://arxiv.org/abs/1510.07714 (For more on blocking, see DataMatching, Ch. 4 “Indexing.”)

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html, “Statistical Matching and Record Linkage”

“Making up data”: Imputation, Smoothers, Kernels, Priors, Filters, Teleportation, Negative Sampling, Convolution, Augmentation, Adversarial Training

• †[Imputation] Yi Deng, Changgee Chang, Moges Seyoum Ido, and Qi Long. 2016. “Multiple Imputation for General Missing Data Patterns in the Presence of High-dimensional Data.” Scientific Reports 6(21689). https://www.nature.com/articles/srep21689

• ‡[Overimputation] Matthew Blackwell, James Honaker, and Gary King. 2017. “A Unified Approach to Measurement Error and Missing Data: Overview and Applications.” Sociological Methods & Research 46(3): 303-341. http://gking.harvard.edu/files/gking/files/measure.pdf

• Shalizi-ADA, Sect. 1.5 (Linear Smoothers), Ch. 8 (Splines), Sect. 14.4 (Kernel Density Estimates); DataScience-Python NB 05.13 “Kernel Density Estimation”

• FightinWords (re: Bayesian priors as additional data; impact of priors on regularization)

• DeepLearning, Section 7.5 (Data Augmentation), 7.12 (Dropout), 7.13 (Adversarial Examples), Chapter 9 (Convolutional Networks)

• †[Adversarial] Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy. 2015. “Explaining and Harnessing Adversarial Examples.” https://arxiv.org/abs/1412.6572

• See also resampling and simulation methods.

• See also feature engineering, preprocessing.

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html, “Imputation”

(Direct) Data Representations, Data Mappings

• NRCreport, Ch. 5 “Large-Scale Data Representations”

• †[PatternRecognition] M. Narasimha Murty and V. Susheela Devi. 2011. Pattern Recognition: An Algorithmic Approach. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-0-85729-495-1 Section 2.1 “Data Structures for Pattern Representation”

• ‡[InfoRetrieval] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2009. Introduction to Information Retrieval. Cambridge University Press. http://nlp.stanford.edu/IR-book, Ch. 1, 2, 6 (also note slides used in their class)

• †[Algorithms] Brian Steele, John Chandler, Swarna Reddy. 2016. Algorithms for Data Science. Wiley. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45797-0 Ch. 2 “Data Mapping and Data Dictionaries”

• See also Social Data Structures.

Similarity, Distance, Association, Covariance, The Kernel Trick

• PatternRecognition, Section 2.3 “Proximity Measures”

• ‡[Similarity] Brendan O’Connor. 2012. “Cosine similarity, Pearson correlation, and OLS coefficients.” https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients

• For relatively comprehensive lists, see also:

M.-J. Lesot, M. Rifqi, and H. Benhadda. 2009. “Similarity measures for binary and numerical data: a survey.” Int. J. Knowledge Engineering and Soft Data Paradigms 1(1): 63-. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.212.6533&rep=rep1&type=pdf

Seung-Seok Choi, Sung-Hyuk Cha, Charles C. Tappert. 2010. “A Survey of Binary Similarity and Distance Measures.” Journal of Systemics, Cybernetics & Informatics 8(1): 43-8. http://www.iiisci.org/Journal/CV$/sci/pdfs/GS315JG.pdf

Sung-Hyuk Cha. 2007. “Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions.” International Journal of Mathematical Models and Methods in Applied Sciences 4(1): 300-7. http://csis.pace.edu/ctappert/dps/d861-12/session4-p2.pdf

Anna Huang. 2008. “Similarity Measures for Text Document Clustering.” Proceedings of the 6th New Zealand Computer Science Research Student Conference: 49-56. http://www.nzcsrsc08.canterbury.ac.nz/site/proceedings/Individual_Papers/pg049_Similarity_Measures_for_Text_Document_Clustering.pdf

• Michael I. Jordan. “The Kernel Trick” (Lecture Notes). https://people.eecs.berkeley.edu/~jordan/courses/281B-spring04/lectures/lec3.pdf

• Eric Kim. 2017. “Everything You Ever Wanted to Know about the Kernel Trick (But Were Afraid to Ask).” http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick_blog_ekim_12_20_2017.pdf

Derived Data Representations - Dimensionality Reduction, Compression, Decomposition, Embeddings

The groupings here and under the measurement/multivariate statistics section are particularly arbitrary. For example, “k-Means clustering” can be viewed as a technique for “dimensionality reduction,” “compression,” “feature extraction,” “latent variable measurement,” “unsupervised learning,” or “collaborative filtering.”
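
As a rough illustration of how one method can sit under several of these labels, the sketch below runs a bare-bones k-means (Lloyd's algorithm) in plain Python and then reads the result as compression: each point is replaced by one integer code plus a shared codebook of centroids. This is not drawn from any of the readings; the toy data and the function names (`sq_dist`, `assign`, `kmeans`) are made up for illustration, and a real analysis would more likely use a library implementation.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Index of the nearest centroid for each point."""
    return [min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points]


def kmeans(points, k, iters=20, seed=0):
    """Minimal Lloyd's algorithm; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize from k distinct data points
    for _ in range(iters):
        labels = assign(points, centroids)
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assign(points, centroids)


# Toy data (hypothetical): two well-separated blobs in the plane.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.9, 5.2)]
codebook, codes = kmeans(points, k=2)

# Read as clustering: `codes` are cluster labels.
# Read as compression or quantization: store one small integer per point
# plus the codebook, and reconstruct each point as its centroid.
reconstructed = [codebook[c] for c in codes]
print(codes)
print(reconstructed)
```

The same `codes` vector could equally be treated as a one-dimensional discrete feature (feature extraction) or as an estimate of a latent class, which is the sense in which the groupings overlap.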

Clustering, hashing, quantization, blocking, compression

• ‡[Compression] Khalid Sayood. 2012. Introduction to Data Compression, 4th ed. Springer. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780124157965 (e.g., coding, blocking via vector quantization)

• MMDS, Ch. 3 “Finding Similar Items” (minhashing, locality sensitive hashing)

• Multivariate-R, Ch. 6 “Clustering”

• DataScience-Python NB 05.11 “k-Means Clustering,” NB 05.12 “Gaussian Mixture Models”

• ‡[KMeansHashing] Kaiming He, Fang Wen, Jian Sun. 2013. “K-means Hashing: An Affinity-Preserving Quantization Method for Learning Binary Compact Codes.” CVPR. https://www.cv-foundation.org/openaccess/content_cvpr_2013/papers/He_K-Means_Hashing_An_2013_CVPR_paper.pdf

• †[Squashing] Madigan, D., Raghavan, N., Dumouchel, W., Nason, M., Posse, C., and Ridgeway, G. 2002. “Likelihood-based data squashing: A modeling approach to instance construction.” Data Mining and Knowledge Discovery 6(2): 173-190. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1023/A:1014095614948

• †[Core-sets] Piotr Indyk, Sepideh Mahabadi, Mohammad Mahdian, Vahab S. Mirrokni. 2014. “Composable core-sets for diversity and coverage maximization.” PODS ’14: 100-8. https://doi.org/10.1145/2594538.2594560

Feature selection, feature extraction, feature engineering, weighting, preprocessing

• PatternRecognition, Sections 2.6-7 “Feature Selection, Feature Extraction”

• DSHandbook-Py, Ch. 7 “Interlude: Feature Extraction Ideas”

• DataScience-Python NB 05.04 “Feature Engineering”

• Features in text: NLP and InfoRetrieval re: tf-idf and similar; FightinWords

• Features in images: DataScience-Python NB 05.14 “Image Features”

Dimensionality reduction, decomposition, factorization, change of basis, reparameterization, matrix completion, latent variables, source separation

• MMDS, Ch. 9 “Recommendation Systems,” Ch. 11 “Dimensionality Reduction”

• DSHandbook-Py, Ch. 10 “Unsupervised Learning: Clustering and Dimensionality Reduction”

• ‡[Shalizi-ADA] Cosma Rohilla Shalizi. 2017. Advanced Data Analysis from an Elementary Point of View. Ch. 16 (“Principal Components Analysis”), Ch. 17 (“Factor Models”). http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV

See also Multivariate-R, Ch. 3 “Principal Components Analysis,” Ch. 4 “Multidimensional Scaling,” Ch. 5 “Exploratory Factor Analysis.” Latent

• DeepLearning, Ch. 2 “Linear Algebra,” Ch. 13 “Linear Factor Models”

• ‡[GloVe] Jeffrey Pennington, Richard Socher, Christopher Manning. 2014. “GloVe: Global vectors for word representation.” EMNLP. https://nlp.stanford.edu/projects/glove

• †[NMF] Daniel D. Lee and H. Sebastian Seung. 1999. “Learning the parts of objects by non-negative matrix factorization.” Nature 401: 788-791. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1038/44565

• †[CUR] Michael W. Mahoney and Petros Drineas. 2009. “CUR matrix decompositions for improved data analysis.” PNAS. http://www.pnas.org/content/106/3/697.full

• ‡[ICA] Aapo Hyvärinen and Erkki Oja. 2000. “Independent Component Analysis: Algorithms and Applications.” Neural Networks 13(4-5): 411-30. https://www.cs.helsinki.fi/u/ahyvarin/papers/NN00new.pdf

• [RandomProjection] Ella Bingham and Heikki Mannila. 2001. “Random projection in dimensionality reduction: applications to image and text data.” KDD. https://doi.org/10.1145%2F502512.502546

• Compression, e.g., “Transform coding,” “Wavelets”

Nonlinear dimensionality reduction, Manifold learning

• DeepLearning, Section 5.11.3 “Manifold Learning”

• DataScience-Python NB 05.10 “Manifold Learning” (Locally linear embedding [LLE], Isomap)

• ‡[KernelPCA] Sebastian Raschka. 2014. “Kernel tricks and nonlinear dimensionality reduction via RBF kernel PCA.” http://sebastianraschka.com/Articles/2014_kernel_pca.html

• Shalizi-ADA, Ch. 18 “Nonlinear Dimensionality Reduction” (LLE)

• ‡[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi. 2003. “Laplacian eigenmaps for dimensionality reduction and data representation.” Neural Computation 15(6): 1373–1396. http://web.cse.ohio-state.edu/~belkin.8/papers/LEM_NC_03.pdf

• DeepLearning, Ch. 14 (“Autoencoders”)

• ‡[word2vec] Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” https://arxiv.org/abs/1301.3781

• ‡[word2vecExplained] Yoav Goldberg and Omer Levy. 2014. “word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method.” https://arxiv.org/abs/1402.3722

• ‡[t-SNE] Laurens van der Maaten and Geoffrey Hinton. 2008. “Visualizing Data using t-SNE.” Journal of Machine Learning Research 9: 2579-2605. http://jmlr.csail.mit.edu/papers/volume9/vandermaaten08a/vandermaaten08a.pdf

Computation and Scaling Up

Scientific computing, computation at scale

• NRCreport, Ch. 10 “The Seven Computational Giants of Massive Data Analysis”

• DSHandbook-Py, Ch. 21 “Performance and Computer Memory,” Ch. 22 “Computer Memory and Data Structures”

• ‡[ComputationalStatistics-Python] Cliburn Chan. Computational Statistics in Python. https://people.duke.edu/~ccc14/sta-663/index.html

• CRAN Task View: High Performance Computing, https://cran.r-project.org/web/views/HighPerformanceComputing.html

Numerical computing

• [NumericalComputing] Ward Cheney and David Kincaid. 2013. Numerical Mathematics and Computing, 7th ed. Brooks/Cole: Cengage Learning.

• CRAN Task View: Numerical Mathematics, https://cran.r-project.org/web/views/NumericalMathematics.html

Optimization (e.g., MLE, gradient descent, stochastic gradient descent, EM algorithm, neural nets)

• DSHandbook-Py, Ch. 23 “Maximum Likelihood Estimation and Optimization” (gradient descent)

• DeepLearning, Ch. 4 “Numerical Computation,” Ch. 6 “Deep Feedforward Networks,” Ch. 8 “Optimization for Training Deep Models”

• CRAN Task View: Optimization and Mathematical Programming, https://cran.r-project.org/web/views/Optimization.html

Linear algebra, matrix computations

• [MatrixComputations] Gene H. Golub and Charles F. Van Loan. 2013. Matrix Computations, 4th ed. Johns Hopkins University Press.

• ‡[NetflixMatrix] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. “Matrix Factorization Techniques for Recommender Systems.” Computer, August: 42-9. https://datajobs.com/data-science-repo/Recommender-Systems-[Netflix].pdf

• ‡[BigDataPCA] Jianqing Fan, Qiang Sun, Wen-Xin Zhou, Ziwei Zhu. “Principal Component Analysis for Big Data.” http://www.princeton.edu/~ziweiz/pca.pdf

• ‡[RandomSVD] Andrew Tulloch. 2009. “Fast Randomized SVD.” https://research.fb.com/fast-randomized-svd

• ‡[FactorSGD] Rainer Gemulla, Peter J. Haas, Erik Nijkamp, and Yannis Sismanis. 2011. “Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent.” KDD. http://www.cs.utah.edu/~hari/teaching/bigdata/gemulla11dsgd.pdf

• ‡[Sparse] Max Grossman. 2015. “101 Ways to Store a Sparse Matrix.” https://medium.com/@jmaxg3/101-ways-to-store-a-sparse-matrix-c7f2bf15a229

• ‡[DontInvert] John D. Cook. 2010. “Don’t Invert that Matrix.” https://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix

Simulation-based inference, resampling, Monte Carlo methods, MCMC, Bayes, approximate inference

• ‡[MarkovVisually] Victor Powell. “Markov Chains Explained Visually.” http://setosa.io/ev/markov-chains

• DSHandbook-Py, Ch. 25 “Stochastic Modeling” (Markov chains, MCMC, HMM)

• Bayes: ComputationalStatistics-Py

• DeepLearning, Ch. 17 “Monte Carlo Methods,” Ch. 19 “Approximate Inference”

• ‡[VariationalInference] Jason Eisner. 2011. “High-Level Explanation of Variational Inference.” https://www.cs.jhu.edu/~jason/tutorials/variational.html

• CRAN Task View: Bayesian Inference, https://cran.r-project.org/web/views/Bayesian.html

Parallelism, MapReduce, Split-Apply-Combine

• ‡[MapReduceIntuition] Jigsaw Academy. 2014. “Big Data Specialist: MapReduce.” https://www.youtube.com/watch?v=TwcYQzFqg-8&feature=youtu.be

• MMDS, Ch. 2 “Map-Reduce and the New Software Stack” (3rd Edition discusses Spark & TensorFlow)

• Algorithms, Ch. 3 “Scalable Algorithms and Associative Statistics,” Ch. 4 “Hadoop and MapReduce”

• NRCreport, Ch. 6 “Resources, Trade-offs, and Limitations”

• TidyData-R (Split-apply-combine is the motivating principle behind the “tidyverse” approach). See also Part III “Program” (pipes, functions, vectors, iteration).

• ‡[TidySAC-Video] Hadley Wickham. 2017. “Data Science in the Tidyverse.” https://www.rstudio.com/resources/videos/data-science-in-the-tidyverse

• See also FactorSGD.
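
For readers new to the split-apply-combine idea named in this group of readings, a minimal sketch in plain Python may help fix the pattern before turning to the tidyverse or MapReduce treatments. The records and variable names below are hypothetical and chosen only for illustration; the sketch shows the pattern itself, not how any particular library implements it.

```python
from collections import defaultdict

# Hypothetical records: (group key, value).
records = [("PA", 3), ("NY", 5), ("PA", 7), ("NY", 1), ("OH", 4)]

# Split: partition the records by key.
groups = defaultdict(list)
for key, value in records:
    groups[key].append(value)

# Apply: compute a summary (here, the mean) independently within each group.
# Groups do not interact, so this step is the part that could run in parallel.
summaries = {key: sum(values) / len(values) for key, values in groups.items()}

# Combine: gather the per-group results into a single table, sorted by key.
result = sorted(summaries.items())
print(result)
```

MapReduce follows the same outline at scale: the map and shuffle phases play the role of the split, and the reduce phase applies a summary per key before results are combined.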

Functional Programming

• ComputationalStatistics-Python, “Functions are first class objects” through first exercises

• DSHandbook-Py, Ch. 20 “Programming Language Concepts”

• See also Haskell.

Scaling, iteration, streaming data, online algorithms (Spark)

• NRCreport, Ch. 4 “Temporal Data and Real-Time Algorithms”

• MMDS, Ch. 4 “Mining Data Streams”

• ‡[BDAS] AMPLab. BDAS: The Berkeley Data Analytics Stack. https://amplab.cs.berkeley.edu/software

• †[Spark] Mohammed Guller. 2015. Big Data Analytics with Spark: A Practitioner’s Guide to Using Spark for Large-Scale Data Processing, Machine Learning, and Graph Analytics, and High-Velocity Data Stream Processing. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-0964-6

• DSHandbook-Py, Ch. 13 “Big Data”

• DeepLearning, Ch. 10 “Sequence Modeling: Recurrent and Recursive Nets”

General Resources for Python and R

• †[DSHandbook-Py] Field Cady. 2017. The Data Science Handbook (Python-based). Wiley. http://onlinelibrary.wiley.com.ezaccess.libraries.psu.edu/book/10.1002/9781119092919

• ‡[Tutorials-R] Ujjwal Karn. 2017. “A curated list of R tutorials for Data Science, NLP, and Machine Learning.” https://github.com/ujjwalkarn/DataScienceR

• ‡[Tutorials-Python] Ujjwal Karn. 2017. “A curated list of Python tutorials for Data Science, NLP, and Machine Learning.” https://github.com/ujjwalkarn/DataSciencePython

Cutting and Bleeding Edge of Data Science Languages

• [Scala] †Vishal Layka and David Pollak. 2015. Beginning Scala. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-0232-6; †Nicolas Patrick. 2014. Scala for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1901910; ‡Scala site: https://www.scala-lang.org (See also )

• [Julia] †Ivo Balbaert. 2015. Getting Started with Julia. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1973847; ‡Douglas Bates. 2013. “Julia for R Programmers.” http://www.stat.wisc.edu/~bates/JuliaForRProgrammers.pdf; ‡Julia site: https://julialang.org

• [Haskell] †Richard Bird. 2014. Thinking Functionally with Haskell. https://doi-org.ezaccess.libraries.psu.edu/10.1017/CBO9781316092415; †Hakim Cassimally. 2017. Learning Haskell Programming. https://www.lynda.com/Haskell-tutorials/Learning-Haskell-Programming/604926-2.html; †James Church. 2017. Learning Haskell for Data Analysis. https://www.lynda.com/Developer-tutorials/Learning-Haskell-Data-Analysis/604234-2.html; Haskell site: https://www.haskell.org

• [Clojure] †Mark McDonnell. 2017. Quick Clojure: Essential Functional Programming. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2952-1; †Akhil Wali. 2014. Clojure for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1674848; †Arthur Ulfeldt. 2015. https://www.lynda.com/Clojure-tutorials/Up-Running-Clojure/413127-2.html; ‡Clojure site: https://clojure.org

• [TensorFlow] ‡Abadi et al. (Google). 2016. “TensorFlow: A System for Large-Scale Machine Learning.” https://arxiv.org/abs/1605.08695; ‡Udacity, “Deep Learning,” https://www.udacity.com/course/deep-learning--ud730; ‡TensorFlow site: https://www.tensorflow.org

• [H2O] Darren Cook. 2016. Practical Machine Learning with H2O: Powerful, Scalable Techniques for Deep Learning and AI. O’Reilly; Arno Candel and Viraj Parmar. 2015. Deep Learning with H2O; ‡H2O site: https://www.h2o.ai

Social Data Structures

Space and Time

• †[GIA] David O’Sullivan, David J. Unwin. 2010. Geographic Information Analysis, Second Edition. John Wiley & Sons. http://onlinelibrary.wiley.com/book/10.1002/9780470549094

• [Space-Time] Donna J. Peuquet. 2003. Representations of Space and Time. Guilford.

• Roger S. Bivand, Edzer Pebesma, Virgilio Gómez-Rubio. 2013. Applied Spatial Data Analysis in R. Free through library: http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-7618-4. See also Edzer Pebesma. 2016. “Handling and Analyzing Spatial, Spatiotemporal, and Movement Data in R.” https://edzer.github.io/UseR2016

• ‡[PySAL] Sergio J. Rey and Dani Arribas-Bel. 2016. “Geographic Data Science with PySAL and the pydata Stack.” http://darribas.org/gds_scipy16

• DataScience-Python NB 04.13 “Geographic Data with Basemap”

• Time: “Longitudinal,” intra-individual data, as common in developmental psychology. †[Longitudinal] Garrett Fitzmaurice, Marie Davidian, Geert Verbeke, and Geert Molenberghs, editors. 2009. Longitudinal Data Analysis. Chapman & Hall/CRC. http://pensu.eblib.com/patron/FullRecord.aspx?p=359998

• Time: “Time Series, Time Series Cross Section, Panel,” as common in economics, political science, and sociology. DataScience-Python Ch. 3 re: pandas; DSHandbook-Py Ch. 17 “Time Series Analysis”; Econometrics; GelmanHill.

• Time: “Sequential, Data Streams,” as in NLP, machine learning. Spark and other “streaming data” readings.

• Space-Time: Data with continuity, connectivity, neighborhood structure. See DeepLearning Chapter 9 re: convolution.

• CRAN Task Views: Spatial, https://cran.r-project.org/web/views/Spatial.html; Spatiotemporal, https://cran.r-project.org/web/views/SpatioTemporal.html; Time Series, https://cran.r-project.org/web/views/TimeSeries.html

Network Graphs

• ‡[Networks] David Easley and Jon Kleinberg. 2010. Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press. http://www.cs.cornell.edu/home/kleinber/networks-book

• MMDS, Ch. 5 “Link Analysis,” Ch. 10 “Mining Social-Network Graphs”

• †[Networks-R] Douglas Luke. 2015. A User’s Guide to Network Analysis in R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-3-319-23883-8

• †[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry. 2017. Python for Graph and Network Analysis. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-53004-8

Hierarchy, Clustered Data, Aggregation, Mixed Models

• †[GelmanHill] Andrew Gelman and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press. http://pensu.eblib.com/patron/FullRecord.aspx?p=288457

• †[MixedModels] Eugene Demidenko. 2013. Mixed Models: Theory and Applications with R, 2nd ed. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10748641

Social Data Channels

Language, Text, Speech, Audio

• ‡[NLP] Daniel Jurafsky and James H. Martin. Forthcoming (2017). Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics, 3rd ed. Prentice-Hall. Preprint: https://web.stanford.edu/~jurafsky/slp3

• InfoRetrieval

• ‡[CoreNLP] Stanford CoreNLP. https://stanfordnlp.github.io/CoreNLP

• †[TextAsData] Grimmer and Stewart. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis. https://doi.org/10.1093/pan/mps028

• Quinn-Topics

• FightinWords

• †[NLTK-Python] Jacob Perkins. 2010. Python Text Processing with NLTK 2.0 Cookbook. Packt. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10435387. †[TextAnalytics-Python] Dipanjan Sarkar. 2016. Text Analytics with Python. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2388-8

• DSHandbook-Py Ch. 16 “Natural Language Processing”

• Matthew J. Denny. http://www.mjdenny.com/Text_Processing_In_R.html

• CRAN Natural Language Processing: https://cran.r-project.org/web/views/NaturalLanguageProcessing.html

• †[AudioData] Dean Knox and Christopher Lucas. 2017. “The Speaker-Affect Model: Measuring Emotion in Political Speech with Audio Data.” http://christopherlucas.org/files/PDFs/sam.pdf; https://www.youtube.com/watch?v=Hs8A9dwkMzI

• †Morgan & Claypool Synthesis Lectures on Human Language Technologies. http://www.morganclaypool.com/toc/hlt/1/1

• †Morgan & Claypool Synthesis Lectures on Speech and Audio Processing. http://www.morganclaypool.com/toc/sap/1/1

Vision, Image, Video

• ‡[ComputerVision] Richard Szeliski. 2010. Computer Vision: Algorithms and Applications. Springer. http://szeliski.org/Book/

• MixedModels Ch. 11 “Statistical Analysis of Shape,” Ch. 12 “Statistical Image Analysis”

• Audio/Video/Volumetric: See DeepLearning Chapter 9 re: convolution.

• †Morgan & Claypool Synthesis Lectures on Image, Video, and Multimedia Processing. http://www.morganclaypool.com/toc/ivm/1/1

• †Morgan & Claypool Synthesis Lectures on Computer Vision. http://www.morganclaypool.com/toc/cov/1/1

Approaches to Learning from Data (The Analytics Layer)

• NRCreport Ch. 7 “Building Models from Massive Data”

• MMDS (data-mining)

• CausalInference

• ‡[VisualAnalytics] Daniel Keim, Jörn Kohlhammer, Geoffrey Ellis, and Florian Mansmann, editors. 2010. Mastering the Information Age: Solving Problems with Visual Analytics. The Eurographics Association, Goslar, Germany. http://www.vismaster.eu/book/. See also †Morgan & Claypool Synthesis Lectures on Visualization. http://www.morganclaypool.com/toc/vis/2/1

• ‡[Econometrics] Bruce E. Hansen. 2014. Econometrics. http://www.ssc.wisc.edu/~bhansen/econometrics/

• ‡[StatisticalLearning] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning with Applications in R. Springer. http://www-bcf.usc.edu/~gareth/ISL/index.html

• [MachineLearning] Christopher Bishop. 2006. Pattern Recognition and Machine Learning. Springer.

• †[Bayes] Peter D. Hoff. 2009. A First Course in Bayesian Statistical Methods. Springer. http://link.springer.com/book/10.1007%2F978-0-387-92407-6

• †[GraphicalModels] Søren Højsgaard, David Edwards, Steffen Lauritzen. 2012. Graphical Models with R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-2299-0

• [InformationTheory] David MacKay. 2003. Information Theory, Inference, and Learning Algorithms. Cambridge University Press.

• ‡[DeepLearning] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org (see also Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. “Deep Learning.” Nature 521(7553): 436–444. https://doi.org/10.1038/nature14539)

See also ‡[Keras] Keras: The Python Deep Learning Library. https://keras.io. François Chollet, Directory of Keras tutorials: https://github.com/fchollet/keras-resources

• †[SignalProcessing] Jose Maria Giron-Sierra. 2013. Digital Signal Processing with MATLAB Examples (Volumes 1-3). Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-981-10-2534-1

Penn State Policy Statements

Academic Integrity

Academic integrity is the pursuit of scholarly activity in an open, honest, and responsible manner. Academic integrity is a basic guiding principle for all academic activity at The Pennsylvania State University, and all members of the University community are expected to act in accordance with this principle. Consistent with this expectation, the University’s Code of Conduct states that all students should act with personal integrity, respect other students’ dignity, rights, and property, and help create and maintain an environment in which all can succeed through the fruits of their efforts.

Academic integrity includes a commitment by all members of the University community not to engage in or tolerate acts of falsification, misrepresentation, or deception. Such acts of dishonesty violate the fundamental ethical principles of the University community and compromise the worth of work completed by others.

Disability Accommodation

Penn State welcomes students with disabilities into the University’s educational programs. Every Penn State campus has an office for students with disabilities. The Student Disability Resources (SDR) website provides contact information for every Penn State campus (http://equity.psu.edu/sdr/disability-coordinator). For further information, please visit the Student Disability Resources website (http://equity.psu.edu/sdr).

In order to receive consideration for reasonable accommodations, you must contact the appropriate disability services office at the campus where you are officially enrolled, participate in an intake interview, and provide documentation. See documentation guidelines at (http://equity.psu.edu/sdr/guidelines). If the documentation supports your request for reasonable accommodations, your campus disability services office will provide you with an accommodation letter. Please share this letter with your instructors and discuss the accommodations with them as early as possible. You must follow this process for every semester that you request accommodations.

Psychological Services

Many students at Penn State face personal challenges or have psychological needs that may interfere with their academic progress, social development, or emotional wellbeing. The university offers a variety of confidential services to help you through difficult times, including individual and group counseling, crisis intervention, consultations, online chats, and mental health screenings. These services are provided by staff who welcome all students and embrace a philosophy respectful of clients’ cultural and religious backgrounds, and sensitive to differences in race, ability, gender identity, and sexual orientation.

• Counseling and Psychological Services at University Park (CAPS) (http://studentaffairs.psu.edu/counseling/): 814-863-0395

• Penn State Crisis Line (24 hours/7 days/week): 877-229-6400

• Crisis Text Line (24 hours/7 days/week): Text LIONS to 741741

Educational Equity

Penn State takes great pride to foster a diverse and inclusive environment for students, faculty, and staff. Consistent with University Policy AD29, students who believe they have experienced or observed a hate crime, an act of intolerance, discrimination, or harassment that occurs at Penn State are urged to report these incidents as outlined on the University’s Report Bias webpage (http://equity.psu.edu/reportbias/).

Background Knowledge

Students will have a variety of backgrounds. Prior to beginning interdisciplinary coursework to fulfill Social Data Analytics degree requirements, including SoDA 501 and 502, students are expected to have advanced (graduate) training in at least one of the component areas of Social Data Analytics and a familiarity with basic concepts in the others.

With regard to specialization, students are expected to have advanced (graduate) training in ONE of the following:

• quantitative social science methodology and a discipline of social science (as would be the case for a second-year PhD student in Political Science, Sociology, Criminology, Human Development and Family Studies, or Demography), OR

• statistics (as would be the case for a second-year PhD student in Statistics), OR

• information science or informatics (as would be the case for a second-year PhD student in Information Science and Technology, or a second-year PhD student in Geography specializing in GIScience), OR

• computer science (as would be the case for a second-year PhD student in Computer Science and Engineering).

This requirement is met as a matter of meeting home program requirements for students in the dual-title PhD, but may require additional coursework on the part of students in other programs wishing to pursue the graduate minor.

With regard to general preparation, students are expected to have ALL of the following technical knowledge:

• basic programming skills, including basic facility with R and/or Python, AND

• basic knowledge of relational databases and/or geographic information systems, AND

• basic knowledge of probability, applied statistics, and/or social science research design, AND

• basic familiarity with a substantive or theoretical area of social science (e.g., 300-level coursework in political science, sociology, criminology, human development, psychology, economics, communication, anthropology, human geography, social informatics, or similar fields).

It is not unusual for students to have one or more gaps in this preparation at time of application to the SoDA program. Students should work with Social Data Analytics advisers to develop a plan for timely remediation of any deficiencies, which generally will not require formal coursework for students whose training and interests are otherwise appropriate for pursuit of the Social Data Analytics degree. Where possible, this will be addressed at time of application to the Social Data Analytics program.

To this end, some free training materials are linked in the reference section, and there are also abundant free, high-quality, self-paced, course-style training materials on these and related subjects available through edX (http://www.edx.org), Udacity (http://www.udacity.com), Codecademy (http://codecademy.com), and Lynda (http://www.lynda.psu).

• †[Transparency] E. Miguel, C. Camerer, K. Casey, J. Cohen, K. M. Esterling, A. Gerber, R. Glennerster, D. P. Green, M. Humphreys, G. Imbens, D. Laitin, T. Madon, L. Nelson, B. A. Nosek, M. Petersen, R. Sedlmayr, J. P. Simmons, U. Simonsohn, M. Van der Laan. 2014. “Promoting Transparency in Social Science Research.” Science 343(6166): 30–1. https://doi.org/10.1126/science.1245317

• †[Reproducibility] Marcus R. Munafò, Brian A. Nosek, Dorothy V. M. Bishop, Katherine S. Button, Christopher D. Chambers, Nathalie Percie du Sert, Uri Simonsohn, Eric-Jan Wagenmakers, Jennifer J. Ware, and John P. A. Ioannidis. 2017. “A Manifesto for Reproducible Science.” Nature Human Behavior 0021 (2017). https://doi.org/10.1038/s41562-016-0021

• ‡[SoftwareCarpentry] esp. “Lessons”: http://software-carpentry.org/lessons/

• DSHandbook-Py Ch. 9 “Technical Communication and Documentation,” Ch. 15 “Software Engineering Best Practices”

• †[TeamScienceToolkit] Vogel AL, Hall KL, Fiore SM, Klein JT, Bennett LM, Gadlin H, Stokols D, Nebeling LC, Wuchty S, Patrick K, Spotts EL, Pohl C, Riley WT, Falk-Krzesinski HJ. 2013. “The Team Science Toolkit: enhancing research collaboration through online knowledge sharing.” American Journal of Preventive Medicine 45: 787-9. https://doi.org/10.1016/j.amepre.2013.09.001

• §[Databrary] Kara Hall, Robert Croyle, and Amanda Vogel. Forthcoming (2017). Advancing Social and Behavioral Health Research through Cross-disciplinary Team Science. Springer. Includes Rick O. Gilmore and Karen E. Adolph, “Open Sharing of Research Video: Breaking the Boundaries of the Research Team.” (See http://databrary.org)

• CRAN: https://cran.r-project.org/web/views/ReproducibleResearch.html

Social Bias, Fair Algorithms

• MachineBias

• ‡[EmbeddingsBias] Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. “Semantics Derived Automatically from Language Corpora Contain Human Biases.” Science. https://arxiv.org/abs/1608.07187

• ‡[Debiasing] Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, Adam Kalai. 2016. “Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings.” https://arxiv.org/abs/1607.06520

• ‡[AvoidingBias] Moritz Hardt, Eric Price, Nathan Srebro. 2016. “Equality of Opportunity in Supervised Learning.” https://arxiv.org/abs/1610.02413

• ‡[InevitableBias] Jon Kleinberg, Sendhil Mullainathan, Manish Raghavan. 2016. “Inherent Trade-Offs in the Fair Determination of Risk Scores.” https://arxiv.org/abs/1609.05807

Databases and Data Management

• NRCreport Ch. 3 “Scaling the Infrastructure for Data Management”

• †[SQL] Jan L. Harrington. 2010. SQL Clearly Explained. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780123756978

• DSHandbook-Py Ch. 14 “Databases”

• †[noSQL] Guy Harrison. 2015. Next Generation Databases: NoSQL, NewSQL, and Big Data. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-1329-2

• †[Cloud] Divyakant Agrawal, Sudipto Das, and Amr El Abbadi. 2012. Data Management in the Cloud: Challenges and Opportunities. Morgan & Claypool. http://www.morganclaypool.com.ezaccess.libraries.psu.edu/doi/pdfplus/10.2200/S00456ED1V01Y201211DTM032

• †Morgan & Claypool Synthesis Lectures on Data Management. http://www.morganclaypool.com/toc/dtm/1/1

Data Wrangling

Theoretically-structured approaches to data wrangling

• ‡[TidyData-R] Garrett Grolemund and Hadley Wickham. 2017. R for Data Science (http://r4ds.had.co.nz/), esp. Ch. 12 “Tidy Data”; also Wickham. 2014. “Tidy Data.” Journal of Statistical Software 59(10). http://www.jstatsoft.org/v59/i10/paper. Tools: The tidyverse, http://tidyverse.org

• ‡[DataScience-Py] Jake VanderPlas. 2016. Python Data Science Handbook (https://github.com/jakevdp/PythonDataScienceHandbook), esp. Ch. 3 on pandas, http://pandas.pydata.org

• ‡[DataCarpentry] Colin Gillespie and Robin Lovelace. 2017. Efficient R Programming. https://csgillespie.github.io/efficientR/, esp. Ch. 6 on “efficient data carpentry”

• §[Wrangler] Joseph M. Hellerstein, Jeffrey Heer, Tye Rattenbury, and Sean Kandel. 2017. Data Wrangling: Practical Techniques for Data Preparation. Tool: Trifacta Wrangler, http://www.trifacta.com/products/wrangler/

Data wrangling practice

• DSHandbook-Py Ch. 4 “Data Munging: String Manipulation, Regular Expressions, and Data Cleaning”

• †[DataSimplification] Jules J. Berman. 2016. Data Simplification: Taming Information with Open Source Tools. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780128037812

• †[Wrangling-Py] Jacqueline Kazil, Katharine Jarmul. 2016. Data Wrangling with Python. O’Reilly. http://proquestcombo.safaribooksonline.com.ezaccess.libraries.psu.edu/9781491948804

• †[Wrangling-R] Bradley C. Boehmke. 2016. Data Wrangling with R. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45599-0

Record Linkage, Entity Resolution, Deduplication

• †[DataMatching] Peter Christen. 2012. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-642-31164-2 (esp. Ch. 2 “The Data Matching Process”)

• DataSimplification Chapter 5 “Identifying and Deidentifying Data”

• ‡[EntityResolution] Lise Getoor and Ashwin Machanavajjhala. 2013. “Entity Resolution for Big Data.” Tutorial, KDD. http://www.umiacs.umd.edu/~getoor/Tutorials/ER_KDD2013.pdf

• ‡[SyrianCasualties] Peter Sadosky, Anshumali Shrivastava, Megan Price, and Rebecca C. Steorts. 2015. “Blocking Methods Applied to Casualty Records from the Syrian Conflict.” https://arxiv.org/abs/1510.07714 (For more on blocking, see DataMatching Ch. 4 “Indexing”)

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html, “Statistical Matching and Record Linkage”

“Making up data”: Imputation, Smoothers, Kernels, Priors, Filters, Teleportation, Negative Sampling, Convolution, Augmentation, Adversarial Training

• †[Imputation] Yi Deng, Changgee Chang, Moges Seyoum Ido, and Qi Long. 2016. “Multiple Imputation for General Missing Data Patterns in the Presence of High-dimensional Data.” Scientific Reports 6(21689). https://www.nature.com/articles/srep21689

• ‡[Overimputation] Matthew Blackwell, James Honaker, and Gary King. 2017. “A Unified Approach to Measurement Error and Missing Data: Overview and Applications.” Sociological Methods & Research 46(3): 303-341. http://gking.harvard.edu/files/gking/files/measure.pdf

• Shalizi-ADA Sect. 1.5 (Linear Smoothers), Ch. 8 (Splines), Sect. 14.4 (Kernel Density Estimates); DataScience-Python NB 05.13 “Kernel Density Estimation”

• FightinWords (re: Bayesian priors as additional data, impact of priors on regularization)

• DeepLearning Section 7.5 (Data Augmentation), 7.12 (Dropout), 7.13 (Adversarial Examples), Chapter 9 (Convolutional Networks)

• †[Adversarial] Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy. 2015. “Explaining and Harnessing Adversarial Examples.” https://arxiv.org/abs/1412.6572

• See also: resampling and simulation methods

• See also: feature engineering, preprocessing

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html, “Imputation”

(Direct) Data Representations, Data Mappings

• NRCreport Ch. 5 “Large-Scale Data Representations”

• †[PatternRecognition] M. Narasimha Murty and V. Susheela Devi. 2011. Pattern Recognition: An Algorithmic Approach. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-0-85729-495-1. Section 2.1 “Data Structures for Pattern Representation”

• ‡[InfoRetrieval] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2009. Introduction to Information Retrieval. Cambridge University Press. http://nlp.stanford.edu/IR-book/. Ch. 1, 2, 6 (also note slides used in their class)

• †[Algorithms] Brian Steele, John Chandler, Swarna Reddy. 2016. Algorithms for Data Science. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45797-0. Ch. 2 “Data Mapping and Data Dictionaries”

• See also Social Data Structures

Similarity, Distance, Association, Covariance, The Kernel Trick

• PatternRecognition Section 2.3 “Proximity Measures”

• ‡[Similarity] Brendan O’Connor. 2012. “Cosine similarity, Pearson correlation, and OLS coefficients.” https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients/

• For relatively comprehensive lists, see also:

M.-J. Lesot, M. Rifqi, and H. Benhadda. 2009. “Similarity measures for binary and numerical data: a survey.” Int. J. Knowledge Engineering and Soft Data Paradigms 1(1): 63-. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.212.6533&rep=rep1&type=pdf

Seung-Seok Choi, Sung-Hyuk Cha, Charles C. Tappert. 2010. “A Survey of Binary Similarity and Distance Measures.” Journal of Systemics, Cybernetics & Informatics 8(1): 43-8. http://www.iiisci.org/Journal/CV$/sci/pdfs/GS315JG.pdf

Sung-Hyuk Cha. 2007. “Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions.” International Journal of Mathematical Models and Methods in Applied Sciences 4(1): 300-7. http://csis.pace.edu/ctappert/dps/d861-12/session4-p2.pdf

Anna Huang. 2008. “Similarity Measures for Text Document Clustering.” Proceedings of the 6th New Zealand Computer Science Research Student Conference: 49-56. http://www.nzcsrsc08.canterbury.ac.nz/site/proceedings/Individual_Papers/pg049_Similarity_Measures_for_Text_Document_Clustering.pdf

• Michael I. Jordan. “The Kernel Trick” (Lecture Notes). https://people.eecs.berkeley.edu/~jordan/courses/281B-spring04/lectures/lec3.pdf

• Eric Kim. 2017. “Everything You Ever Wanted to Know about the Kernel Trick (But Were Afraid to Ask).” http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick_blog_ekim_12_20_2017.pdf

Derived Data Representations - Dimensionality Reduction, Compression, Decomposition, Embeddings

The groupings here and under the measurement / multivariate statistics section are particularly arbitrary. For example, “k-Means clustering” can be viewed as a technique for “dimensionality reduction,” “compression,” “feature extraction,” “latent variable measurement,” “unsupervised learning,” or “collaborative filtering.”
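
To make the k-means example concrete, here is a minimal sketch in Python of why one procedure can wear several of these labels. The toy data, the starting centroids, and the function names (`assign`, `kmeans_1d`) are all hypothetical choices for illustration and are not taken from any of the readings. The fitted labels are an unsupervised grouping; the labels plus the centroid "codebook" are a lossy compressed copy of the data (vector quantization); and the label itself is a single derived feature standing in for the raw value.

```python
# Hypothetical toy data: six one-dimensional observations in two loose groups.
points = [1.0, 1.5, 0.5, 9.0, 9.5, 8.5]


def assign(data, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda k: abs(x - centroids[k]))
        for x in data
    ]


def kmeans_1d(data, centroids, iterations=10):
    """Plain Lloyd-style k-means in one dimension from given starting centroids."""
    centroids = list(centroids)
    for _ in range(iterations):
        labels = assign(data, centroids)
        for k in range(len(centroids)):
            members = [x for x, lab in zip(data, labels) if lab == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = sum(members) / len(members)
    return centroids, assign(data, centroids)


# Starting centroids are an arbitrary assumption; real uses try several starts.
centroids, labels = kmeans_1d(points, centroids=[0.0, 10.0])

# "Unsupervised learning" / "latent variable": labels is a discovered grouping.
# "Compression": keep only labels plus the short codebook of centroids.
# "Feature extraction": the label is a one-number summary of each observation.
reconstructed = [centroids[lab] for lab in labels]
squared_error = sum((x - r) ** 2 for x, r in zip(points, reconstructed))
```

The reconstruction error in `squared_error` is the price of the compression: two centroids and six small integers replace six real numbers.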

Clustering, hashing, quantization, blocking, compression

• ‡[Compression] Khalid Sayood. 2012. Introduction to Data Compression, 4th ed. Springer. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780124157965 (e.g., coding, blocking via vector quantization)

• MMDS Ch. 3 “Finding Similar Items” (minhashing, locality sensitive hashing)

• Multivariate-R Ch. 6 “Clustering”

• DataScience-Python NB 05.11 “k-Means Clustering,” NB 05.12 “Gaussian Mixture Models”

• ‡[KMeansHashing] Kaiming He, Fang Wen, Jian Sun. 2013. “K-means Hashing: An Affinity-Preserving Quantization Method for Learning Binary Compact Codes.” CVPR. https://www.cv-foundation.org/openaccess/content_cvpr_2013/papers/He_K-Means_Hashing_An_2013_CVPR_paper.pdf

• †[Squashing] Madigan, D., Raghavan, N., Dumouchel, W., Nason, M., Posse, C., and Ridgeway, G. (2002). “Likelihood-based data squashing: A modeling approach to instance construction.” Data Mining and Knowledge Discovery 6(2): 173-190. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1023/A:1014095614948

• †[Core-sets] Piotr Indyk, Sepideh Mahabadi, Mohammad Mahdian, Vahab S. Mirrokni. 2014. “Composable core-sets for diversity and coverage maximization.” PODS ’14: 100-8. https://doi.org/10.1145/2594538.2594560

Feature selection, feature extraction, feature engineering, weighting, preprocessing

• PatternRecognition Sections 2.6-7 “Feature Selection, Feature Extraction”

• DSHandbook-Py Ch. 7 “Interlude: Feature Extraction Ideas”

• DataScience-Python NB 05.04 “Feature Engineering”

• Features in text: NLP and InfoRetrieval re: tf-idf and similar; FightinWords

• Features in images: DataScience-Python NB 05.14 “Image Features”

Dimensionality reduction, decomposition, factorization, change of basis, reparameterization, matrix completion, latent variables, source separation

• MMDS Ch. 9 “Recommendation Systems,” Ch. 11 “Dimensionality Reduction”

• DSHandbook-Py Ch. 10 “Unsupervised Learning: Clustering and Dimensionality Reduction”

• ‡[Shalizi-ADA] Cosma Rohilla Shalizi. 2017. Advanced Data Analysis from an Elementary Point of View. Ch. 16 (“Principal Components Analysis”), Ch. 17 (“Factor Models”). http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/

See also Multivariate-R Ch. 3 “Principal Components Analysis,” Ch. 4 “Multidimensional Scaling,” Ch. 5 “Exploratory Factor Analysis.” Latent

• DeepLearning Ch. 2 “Linear Algebra,” Ch. 13 “Linear Factor Models”

• ‡[GloVe] Jeffrey Pennington, Richard Socher, Christopher Manning. 2014. “GloVe: Global vectors for word representation.” EMNLP. https://nlp.stanford.edu/projects/glove/

• †[NMF] Daniel D. Lee and H. Sebastian Seung. 1999. “Learning the parts of objects by non-negative matrix factorization.” Nature 401: 788-791. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1038/44565

• †[CUR] Michael W. Mahoney and Petros Drineas. 2009. “CUR matrix decompositions for improved data analysis.” PNAS. http://www.pnas.org/content/106/3/697.full

• ‡[ICA] Aapo Hyvärinen and Erkki Oja. 2000. “Independent Component Analysis: Algorithms and Applications.” Neural Networks 13(4-5): 411-30. https://www.cs.helsinki.fi/u/ahyvarin/papers/NN00new.pdf

• [RandomProjection] Ella Bingham and Heikki Mannila. 2001. “Random projection in dimensionality reduction: applications to image and text data.” KDD. https://doi.org/10.1145%2F502512.502546

• Compression, e.g., “Transform coding,” “Wavelets”

Nonlinear dimensionality reduction, Manifold learning

• DeepLearning Section 5.11.3 “Manifold Learning”

• DataScience-Python NB 05.10 “Manifold Learning” (Locally linear embedding [LLE], Isomap)

• ‡[KernelPCA] Sebastian Raschka. 2014. “Kernel tricks and nonlinear dimensionality reduction via RBF kernel PCA.” http://sebastianraschka.com/Articles/2014_kernel_pca.html

• Shalizi-ADA Ch. 18 “Nonlinear Dimensionality Reduction” (LLE)

• ‡[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi. 2003. “Laplacian eigenmaps for dimensionality reduction and data representation.” Neural Computation 15(6): 1373–1396. http://web.cse.ohio-state.edu/~belkin.8/papers/LEM_NC_03.pdf

• DeepLearning Ch. 14 (“Autoencoders”)

• ‡[word2vec] Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” https://arxiv.org/abs/1301.3781

• ‡[word2vecExplained] Yoav Goldberg and Omer Levy. 2014. “word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method.” https://arxiv.org/abs/1402.3722

• ‡[t-SNE] Laurens van der Maaten and Geoffrey Hinton. 2008. “Visualising Data using t-SNE.” Journal of Machine Learning Research 9: 2579-2605. http://jmlr.csail.mit.edu/papers/volume9/vandermaaten08a/vandermaaten08a.pdf

Computation and Scaling Up

Scientific computing, computation at scale

• NRCreport Ch. 10 “The Seven Computational Giants of Massive Data Analysis”

• DSHandbook-Py Ch. 21 “Performance and Computer Memory,” Ch. 22 “Computer Memory and Data Structures”

• ‡[ComputationalStatistics-Python] Cliburn Chan. Computational Statistics in Python. https://people.duke.edu/~ccc14/sta-663/index.html

• CRAN Task View: High Performance Computing. https://cran.r-project.org/web/views/HighPerformanceComputing.html

Numerical computing

• [NumericalComputing] Ward Cheney and David Kincaid. 2013. Numerical Mathematics and Computing, 7th ed. Brooks/Cole: Cengage Learning.

• CRAN Task View: Numerical Mathematics. https://cran.r-project.org/web/views/NumericalMathematics.html

Optimization (e.g., MLE, gradient descent, stochastic gradient descent, EM algorithm, neural nets)

• DSHandbook-Py Ch. 23 “Maximum Likelihood Estimation and Optimization” (gradient descent)

• DeepLearning Ch. 4 “Numerical Computation,” Ch. 6 “Deep Feedforward Networks,” Ch. 8 “Optimization for Training Deep Models”

• CRAN Task View: Optimization and Mathematical Programming. https://cran.r-project.org/web/views/Optimization.html

Linear algebra, matrix computations

• [MatrixComputations] Gene H. Golub and Charles F. Van Loan. 2013. Matrix Computations, 4th ed. Johns Hopkins University Press.

• ‡[NetflixMatrix] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. “Matrix Factorization Techniques for Recommender Systems.” Computer, August: 42-9. https://datajobs.com/data-science-repo/Recommender-Systems-[Netflix].pdf

• ‡[BigDataPCA] Jianqing Fan, Qiang Sun, Wen-Xin Zhou, Ziwei Zhu. “Principal Component Analysis for Big Data.” http://www.princeton.edu/~ziweiz/pca.pdf

• ‡[RandomSVD] Andrew Tulloch. 2009. “Fast Randomized SVD.” https://research.fb.com/fast-randomized-svd/

• ‡[FactorSGD] Rainer Gemulla, Peter J. Haas, Erik Nijkamp, and Yannis Sismanis. 2011. “Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent.” KDD. http://www.cs.utah.edu/~hari/teaching/bigdata/gemulla11dsgd.pdf

• ‡[Sparse] Max Grossman. 2015. “101 Ways to Store a Sparse Matrix.” https://medium.com/@jmaxg3/101-ways-to-store-a-sparse-matrix-c7f2bf15a229

• ‡[DontInvert] John D. Cook. 2010. “Don’t Invert that Matrix.” https://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix/

Simulation-based inference, resampling, Monte Carlo methods, MCMC, Bayes, approximate inference

• ‡[MarkovVisually] Victor Powell. “Markov Chains Explained Visually.” http://setosa.io/ev/markov-chains/

• DSHandbook-Py Ch. 25 “Stochastic Modeling” (Markov chains, MCMC, HMM)

• Bayes; ComputationalStatistics-Py

• DeepLearning Ch. 17 “Monte Carlo Methods,” Ch. 19 “Approximate Inference”

• ‡[VariationalInference] Jason Eisner. 2011. “High-Level Explanation of Variational Inference.” https://www.cs.jhu.edu/~jason/tutorials/variational.html

• CRAN Task View: Bayesian Inference. https://cran.r-project.org/web/views/Bayesian.html

Parallelism, MapReduce, Split-Apply-Combine

• ‡[MapReduceIntuition] Jigsaw Academy. 2014. “Big Data Specialist: MapReduce.” https://www.youtube.com/watch?v=TwcYQzFqg-8&feature=youtu.be

• MMDS Ch. 2 “Map-Reduce and the New Software Stack” (3rd Edition discusses Spark & TensorFlow)

• Algorithms Ch. 3 “Scalable Algorithms and Associative Statistics,” Ch. 4 “Hadoop and MapReduce”

• NRCreport Ch. 6 “Resources, Trade-offs, and Limitations”

• TidyData-R (Split-apply-combine is the motivating principle behind the “tidyverse” approach). See also Part III “Program” (pipes, functions, vectors, iteration).

• ‡[TidySAC-Video] Hadley Wickham. 2017. “Data Science in the Tidyverse.” https://www.rstudio.com/resources/videos/data-science-in-the-tidyverse/

• See also FactorSGD
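
As a rough illustration of the split-apply-combine pattern named in this section, here is a minimal sketch using only the Python standard library. The records and the helper names (`split_apply_combine`, `mean`) are hypothetical; the readings above use the tidyverse and pandas for the same idea, and MapReduce follows a similar shape (emit key-value pairs, group by key, reduce each group).

```python
from collections import defaultdict

# Hypothetical records: (group key, value) pairs, e.g. a score per state.
records = [("PA", 3.0), ("NY", 5.0), ("PA", 5.0), ("NY", 7.0), ("OH", 2.0)]


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def split_apply_combine(rows, apply_fn):
    """Group values by key, apply a function per group, collect the results."""
    groups = defaultdict(list)
    for key, value in rows:  # split: route each value to its group
        groups[key].append(value)
    # apply + combine: one summary per group, gathered into a single dict
    return {key: apply_fn(values) for key, values in groups.items()}


group_means = split_apply_combine(records, mean)
group_sizes = split_apply_combine(records, len)
```

Because each group is summarized independently, the apply step is the part that can be farmed out to separate workers, which is what makes the pattern a natural fit for parallel execution.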

Functional Programming

• ComputationalStatistics-Python “Functions are first class objects” through first exercises

• DSHandbook-Py Ch. 20 “Programming Language Concepts”

• See also Haskell

Scaling, iteration, streaming data, online algorithms (Spark)

• NRCreport Ch. 4 “Temporal Data and Real-Time Algorithms”

• MMDS Ch. 4 “Mining Data Streams”

• ‡[BDAS] AMPLab. BDAS, The Berkeley Data Analytics Stack. https://amplab.cs.berkeley.edu/software/

• †[Spark] Mohammed Guller. 2015. Big Data Analytics with Spark: A Practitioner’s Guide to Using Spark for Large-Scale Data Processing, Machine Learning, and Graph Analytics, and High-Velocity Data Stream Processing. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-0964-6

• DSHandbook-Py Ch. 13 “Big Data”

• DeepLearning Ch. 10 “Sequence Modeling: Recurrent and Recursive Nets”

General Resources for Python and R

• †[DSHandbook-Py] Field Cady. 2017. The Data Science Handbook (Python-based). Wiley. http://onlinelibrary.wiley.com.ezaccess.libraries.psu.edu/book/10.1002/9781119092919

• ‡[Tutorials-R] Ujjwal Karn. 2017. “A curated list of R tutorials for Data Science, NLP, and Machine Learning.” https://github.com/ujjwalkarn/DataScienceR

• ‡[Tutorials-Python] Ujjwal Karn. 2017. “A curated list of Python tutorials for Data Science, NLP, and Machine Learning.” https://github.com/ujjwalkarn/DataSciencePython

Cutting and Bleeding Edge of Data Science Languages

• [Scala] †Vishal Layka and David Pollak. 2015. Beginning Scala. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-0232-6. †Nicolas Patrick. 2014. Scala for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1901910. ‡Scala site: https://www.scala-lang.org (See also )

• [Julia] †Ivo Balbaert. 2015. Getting Started with Julia. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1973847. ‡Douglas Bates. 2013. “Julia for R Programmers.” http://www.stat.wisc.edu/~bates/JuliaForRProgrammers.pdf. ‡Julia site: https://julialang.org

• [Haskell] †Richard Bird. 2014. Thinking Functionally with Haskell. https://doi-org.ezaccess.libraries.psu.edu/10.1017/CBO9781316092415. †Hakim Cassimally. 2017. Learning Haskell Programming. https://www.lynda.com/Haskell-tutorials/Learning-Haskell-Programming/604926-2.html. †James Church. 2017. Learning Haskell for Data Analysis. https://www.lynda.com/Developer-tutorials/Learning-Haskell-Data-Analysis/604234-2.html. Haskell site: https://www.haskell.org

• [Clojure] †Mark McDonnell. 2017. Quick Clojure: Essential Functional Programming. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2952-1. †Akhil Wali. 2014. Clojure for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1674848. †Arthur Ulfeldt. 2015. https://www.lynda.com/Clojure-tutorials/Up-Running-Clojure/413127-2.html. ‡Clojure site: https://clojure.org

• [TensorFlow] ‡Abadi et al. (Google). 2016. “TensorFlow: A System for Large-Scale Machine Learning.” https://arxiv.org/abs/1605.08695. ‡Udacity, “Deep Learning.” https://www.udacity.com/course/deep-learning--ud730. ‡TensorFlow site: https://www.tensorflow.org

• [H2O] Darren Cook. 2016. Practical Machine Learning with H2O: Powerful, Scalable Techniques for Deep Learning and AI. O’Reilly. Arno Candel and Viraj Parmar. 2015. Deep Learning with H2O. ‡H2O site: https://www.h2o.ai

Social Data Structures

Space and Time

• †[GIA] David O’Sullivan and David J. Unwin. 2010. Geographic Information Analysis, Second Edition. John Wiley & Sons. http://onlinelibrary.wiley.com/book/10.1002/9780470549094

• [Space-Time] Donna J. Peuquet. 2003. Representations of Space and Time. Guilford.

• Roger S. Bivand, Edzer Pebesma, Virgilio Gómez-Rubio. 2013. Applied Spatial Data Analysis in R. Free through library: http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-7618-4. See also Edzer Pebesma. 2016. “Handling and Analyzing Spatial, Spatiotemporal and Movement Data in R.” https://edzer.github.io/UseR2016/

• ‡[PySAL] Sergio J. Rey and Dani Arribas-Bel. 2016. “Geographic Data Science with PySAL and the pydata Stack.” http://darribas.org/gds_scipy16/


• DataScience-Python NB 04.13 “Geographic Data with Basemap”

• Time: “Longitudinal” intra-individual data, as common in developmental psychology: †[Longitudinal] Garrett Fitzmaurice, Marie Davidian, Geert Verbeke, and Geert Molenberghs, editors. 2009. Longitudinal Data Analysis. Chapman & Hall / CRC. http://pensu.eblib.com/patron/FullRecord.aspx?p=359998

• Time: “Time Series, Time Series Cross Section, Panel,” as common in economics, political science, and sociology: DataScience-Python Ch. 3 re pandas; DSHandbook-Py Ch. 17 “Time Series Analysis”; Econometrics; GelmanHill

• Time: “Sequential, Data Streams,” as in NLP and machine learning: Spark and other “streaming data” readings

• Space-Time: Data with continuity, connectivity, or neighborhood structure. See DeepLearning Chapter 9 re convolution.

• CRAN Task Views: Spatial: https://cran.r-project.org/web/views/Spatial.html; Spatiotemporal: https://cran.r-project.org/web/views/SpatioTemporal.html; Time Series: https://cran.r-project.org/web/views/TimeSeries.html

Network Graphs

• ‡[Networks] David Easley and Jon Kleinberg. 2010. Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press. http://www.cs.cornell.edu/home/kleinber/networks-book/

• MMDS Ch. 5 “Link Analysis,” Ch. 10 “Mining Social-Network Graphs”

• †[Networks-R] Douglas Luke. 2015. A User’s Guide to Network Analysis in R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-3-319-23883-8

• †[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry. 2017. Python for Graph and Network Analysis. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-53004-8

Hierarchy Clustered Data Aggregation Mixed Models

• †[GelmanHill] Andrew Gelman and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press. http://pensu.eblib.com/patron/FullRecord.aspx?p=288457

• †[MixedModels] Eugene Demidenko. 2013. Mixed Models: Theory and Applications with R, 2nd ed. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10748641

Social Data Channels

Language Text Speech Audio

• ‡[NLP] Daniel Jurafsky and James H. Martin. Forthcoming (2017). Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics, 3rd ed. Prentice-Hall. Preprint: https://web.stanford.edu/~jurafsky/slp3/

• InfoRetrieval

• ‡[CoreNLP] Stanford CoreNLP. https://stanfordnlp.github.io/CoreNLP/


• †[TextAsData] Grimmer and Stewart. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis. https://doi.org/10.1093/pan/mps028

• Quinn-Topics

• FightinWords

• †[NLTK-Python] Jacob Perkins. 2010. Python Text Processing with NLTK 2.0 Cookbook. Packt. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10435387; †[TextAnalytics-Python] Dipanjan Sarkar. 2016. Text Analytics with Python. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2388-8

• DSHandbook-Py Ch. 16 “Natural Language Processing”

• Matthew J. Denny. http://www.mjdenny.com/Text_Processing_In_R.html

• CRAN: Natural Language Processing. https://cran.r-project.org/web/views/NaturalLanguageProcessing.html

• †[AudioData] Dean Knox and Christopher Lucas. 2017. “The Speaker-Affect Model: Measuring Emotion in Political Speech with Audio Data.” http://christopherlucas.org/files/PDFs/sam.pdf; https://www.youtube.com/watch?v=Hs8A9dwkMzI

• †Morgan & Claypool Synthesis Lectures on Human Language Technologies. http://www.morganclaypool.com/toc/hlt/1/1

• †Morgan & Claypool Synthesis Lectures on Speech and Audio Processing. http://www.morganclaypool.com/toc/sap/1/1

Vision Image Video

• ‡[ComputerVision] Richard Szeliski. 2010. Computer Vision: Algorithms and Applications. Springer. http://szeliski.org/Book/

• MixedModels Ch. 11 “Statistical Analysis of Shape,” Ch. 12 “Statistical Image Analysis”

• Audio/Video/Volumetric: See DeepLearning Chapter 9 re convolution.

• †Morgan & Claypool Synthesis Lectures on Image, Video, and Multimedia Processing. http://www.morganclaypool.com/toc/ivm/1/1

• †Morgan & Claypool Synthesis Lectures on Computer Vision. http://www.morganclaypool.com/toc/cov/1/1

Approaches to Learning from Data (The Analytics Layer)

• NRCreport Ch. 7 “Building Models from Massive Data”

• MMDS (data-mining)

• CausalInference

• ‡[VisualAnalytics] Daniel Keim, Jörn Kohlhammer, Geoffrey Ellis, and Florian Mansmann, editors. 2010. Mastering the Information Age: Solving Problems with Visual Analytics. The Eurographics Association, Goslar, Germany. http://www.vismaster.eu/book/. See also †Morgan & Claypool Synthesis Lectures on Visualization. http://www.morganclaypool.com/toc/vis/2/1

• ‡[Econometrics] Bruce E. Hansen. 2014. Econometrics. http://www.ssc.wisc.edu/~bhansen/econometrics/


• ‡[StatisticalLearning] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning with Applications in R. Springer. http://www-bcf.usc.edu/~gareth/ISL/index.html

• [MachineLearning] Christopher Bishop. 2006. Pattern Recognition and Machine Learning. Springer.

• †[Bayes] Peter D. Hoff. 2009. A First Course in Bayesian Statistical Methods. Springer. http://link.springer.com/book/10.1007%2F978-0-387-92407-6

• †[GraphicalModels] Søren Højsgaard, David Edwards, Steffen Lauritzen. 2012. Graphical Models with R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-2299-0

• [InformationTheory] David MacKay. 2003. Information Theory, Inference, and Learning Algorithms. Cambridge University Press.

• ‡[DeepLearning] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org (see also Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. “Deep Learning.” Nature 521(7553): 436–444. https://doi.org/10.1038/nature14539)

See also ‡[Keras] Keras: The Python Deep Learning Library. https://keras.io; François Chollet. Directory of Keras tutorials. https://github.com/fchollet/keras-resources

• †[SignalProcessing] Jose Maria Giron-Sierra. 2013. Digital Signal Processing with MATLAB Examples (Volumes 1-3). Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-981-10-2534-1


Penn State Policy Statements

Academic Integrity

Academic integrity is the pursuit of scholarly activity in an open, honest, and responsible manner. Academic integrity is a basic guiding principle for all academic activity at The Pennsylvania State University, and all members of the University community are expected to act in accordance with this principle. Consistent with this expectation, the University’s Code of Conduct states that all students should act with personal integrity, respect other students’ dignity, rights, and property, and help create and maintain an environment in which all can succeed through the fruits of their efforts.

Academic integrity includes a commitment by all members of the University community not to engage in or tolerate acts of falsification, misrepresentation, or deception. Such acts of dishonesty violate the fundamental ethical principles of the University community and compromise the worth of work completed by others.

Disability Accommodation

Penn State welcomes students with disabilities into the University’s educational programs. Every Penn State campus has an office for students with disabilities. The Student Disability Resources (SDR) website provides contact information for every Penn State campus (http://equity.psu.edu/sdr/disability-coordinator). For further information, please visit the Student Disability Resources website (http://equity.psu.edu/sdr).

In order to receive consideration for reasonable accommodations, you must contact the appropriate disability services office at the campus where you are officially enrolled, participate in an intake interview, and provide documentation. See documentation guidelines at http://equity.psu.edu/sdr/guidelines. If the documentation supports your request for reasonable accommodations, your campus disability services office will provide you with an accommodation letter. Please share this letter with your instructors and discuss the accommodations with them as early as possible. You must follow this process for every semester that you request accommodations.

Psychological Services

Many students at Penn State face personal challenges or have psychological needs that may interfere with their academic progress, social development, or emotional wellbeing. The university offers a variety of confidential services to help you through difficult times, including individual and group counseling, crisis intervention, consultations, online chats, and mental health screenings. These services are provided by staff who welcome all students and embrace a philosophy respectful of clients’ cultural and religious backgrounds, and sensitive to differences in race, ability, gender identity, and sexual orientation.

• Counseling and Psychological Services at University Park (CAPS) (http://studentaffairs.psu.edu/counseling/): 814-863-0395

• Penn State Crisis Line (24 hours/7 days/week): 877-229-6400

• Crisis Text Line (24 hours/7 days/week): Text LIONS to 741741

Educational Equity

Penn State takes great pride to foster a diverse and inclusive environment for students, faculty, and staff. Consistent with University Policy AD29, students who believe they have experienced or observed a hate crime, an act of intolerance, discrimination, or harassment that occurs at Penn State are urged to report these incidents as outlined on the University’s Report Bias webpage (http://equity.psu.edu/reportbias/).


Background Knowledge

Students will have a variety of backgrounds. Prior to beginning interdisciplinary coursework to fulfill Social Data Analytics degree requirements, including SoDA 501 and 502, students are expected to have advanced (graduate) training in at least one of the component areas of Social Data Analytics and a familiarity with basic concepts in the others.

With regard to specialization, students are expected to have advanced (graduate) training in ONE of the following:

• quantitative social science methodology and a discipline of social science (as would be the case for a second-year PhD student in Political Science, Sociology, Criminology, Human Development and Family Studies, or Demography), OR

• statistics (as would be the case for a second-year PhD student in Statistics), OR

• information science or informatics (as would be the case for a second-year PhD student in Information Science and Technology or a second-year PhD student in Geography specializing in GIScience), OR

• computer science (as would be the case for a second-year PhD student in Computer Science and Engineering).

This requirement is met as a matter of meeting home program requirements for students in the dual-title PhD, but may require additional coursework on the part of students in other programs wishing to pursue the graduate minor.

With regard to general preparation, students are expected to have ALL of the following technical knowledge:

• basic programming skills, including basic facility with R &/or Python, AND

• basic knowledge of relational databases &/or geographic information systems, AND

• basic knowledge of probability, applied statistics, &/or social science research design, AND

• basic familiarity with a substantive or theoretical area of social science (e.g., 300-level coursework in political science, sociology, criminology, human development, psychology, economics, communication, anthropology, human geography, social informatics, or similar fields).

It is not unusual for students to have one or more gaps in this preparation at time of application to the SoDA program. Students should work with Social Data Analytics advisers to develop a plan for timely remediation of any deficiencies, which generally will not require formal coursework for students whose training and interests are otherwise appropriate for pursuit of the Social Data Analytics degree. Where possible, this will be addressed at time of application to the Social Data Analytics program.

To this end, some free training materials are linked in the reference section, and there are also abundant free, high-quality, self-paced, course-style training materials on these and related subjects available through edX (http://www.edx.org), Udacity (http://www.udacity.com), Codecademy (http://codecademy.com), and Lynda (http://www.lynda.psu).


• †Morgan & Claypool Synthesis Lectures on Data Management. http://www.morganclaypool.com/toc/dtm/1/1

Data Wrangling

Theoretically-structured approaches to data wrangling

• ‡[TidyData-R] Garrett Grolemund and Hadley Wickham. 2017. R for Data Science (http://r4ds.had.co.nz), esp. Ch. 12 “Tidy Data”; also Wickham. 2014. “Tidy Data.” Journal of Statistical Software 59(10). http://www.jstatsoft.org/v59/i10/paper. Tools: The tidyverse, http://tidyverse.org

• ‡[DataScience-Python] Jake VanderPlas. 2016. Python Data Science Handbook (https://github.com/jakevdp/PythonDataScienceHandbook), esp. Ch. 3 on pandas, http://pandas.pydata.org

• ‡[DataCarpentry] Colin Gillespie and Robin Lovelace. 2017. Efficient R Programming. https://csgillespie.github.io/efficientR/, esp. Ch. 6 on “efficient data carpentry”

• §[Wrangler] Joseph M. Hellerstein, Jeffrey Heer, Tye Rattenbury, and Sean Kandel. 2017. Data Wrangling: Practical Techniques for Data Preparation. Tool: Trifacta Wrangler, http://www.trifacta.com/products/wrangler/

Data wrangling practice

• DSHandbook-Py Ch. 4 “Data Munging: String Manipulation, Regular Expressions, and Data Cleaning”

• †[DataSimplification] Jules J. Berman. 2016. Data Simplification: Taming Information with Open Source Tools. Elsevier. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780128037812

• †[Wrangling-Py] Jacqueline Kazil, Katharine Jarmul. 2016. Data Wrangling with Python. O’Reilly. http://proquestcombo.safaribooksonline.com.ezaccess.libraries.psu.edu/9781491948804

• †[Wrangling-R] Bradley C. Boehmke. 2016. Data Wrangling with R. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45599-0

Record Linkage Entity Resolution Deduplication

• †[DataMatching] Peter Christen. 2012. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-642-31164-2 (esp. Ch. 2 “The Data Matching Process”)

• DataSimplification Chapter 5 “Identifying and Deidentifying Data”

• ‡[EntityResolution] Lise Getoor and Ashwin Machanavajjhala. 2013. “Entity Resolution for Big Data.” Tutorial, KDD. http://www.umiacs.umd.edu/~getoor/Tutorials/ER_KDD2013.pdf

• ‡[SyrianCasualties] Peter Sadosky, Anshumali Shrivastava, Megan Price, and Rebecca C. Steorts. 2015. “Blocking Methods Applied to Casualty Records from the Syrian Conflict.” https://arxiv.org/abs/1510.07714 (For more on blocking, see DataMatching Ch. 4 “Indexing.”)

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html, “Statistical Matching and Record Linkage”

“Making up data”: Imputation, Smoothers, Kernels, Priors, Filters, Teleportation, Negative Sampling, Convolution, Augmentation, Adversarial Training


• †[Imputation] Yi Deng, Changgee Chang, Moges Seyoum Ido, and Qi Long. 2016. “Multiple Imputation for General Missing Data Patterns in the Presence of High-dimensional Data.” Scientific Reports 6(21689). https://www.nature.com/articles/srep21689

• ‡[Overimputation] Matthew Blackwell, James Honaker, and Gary King. 2017. “A Unified Approach to Measurement Error and Missing Data: Overview and Applications.” Sociological Methods & Research 46(3): 303-341. http://gking.harvard.edu/files/gking/files/measure.pdf

• Shalizi-ADA Sect. 1.5 (Linear Smoothers), Ch. 8 (Splines), Sect. 14.4 (Kernel Density Estimates); DataScience-Python NB 05.13 “Kernel Density Estimation”

• FightinWords (re Bayesian priors as additional data; impact of priors on regularization)

• DeepLearning Section 7.5 (Data Augmentation), 7.12 (Dropout), 7.13 (Adversarial Examples), Chapter 9 (Convolutional Networks)

• †[Adversarial] Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy. 2015. “Explaining and Harnessing Adversarial Examples.” https://arxiv.org/abs/1412.6572

• See also resampling and simulation methods

• See also feature engineering, preprocessing

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html, “Imputation”

(Direct) Data Representations Data Mappings

• NRCreport Ch. 5 “Large-Scale Data Representations”

• †[PatternRecognition] M. Narasimha Murty and V. Susheela Devi. 2011. Pattern Recognition: An Algorithmic Approach. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-0-85729-495-1, Section 2.1 “Data Structures for Pattern Representation”

• ‡[InfoRetrieval] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2009. Introduction to Information Retrieval. Cambridge University Press. http://nlp.stanford.edu/IR-book/, Ch. 1, 2, 6 (also note slides used in their class)

• †[Algorithms] Brian Steele, John Chandler, Swarna Reddy. 2016. Algorithms for Data Science. Wiley. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45797-0, Ch. 2 “Data Mapping and Data Dictionaries”

• See also Social Data Structures

Similarity Distance Association Covariance The Kernel Trick

• PatternRecognition Section 2.3 “Proximity Measures”

• ‡[Similarity] Brendan O’Connor. 2012. “Cosine similarity, Pearson correlation, and OLS coefficients.” https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients/

• For relatively comprehensive lists, see also:

M-J. Lesot, M. Rifqi, and H. Benhadda. 2009. “Similarity measures for binary and numerical data: a survey.” Int. J. Knowledge Engineering and Soft Data Paradigms 1(1): 63-. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.212.6533&rep=rep1&type=pdf

Seung-Seok Choi, Sung-Hyuk Cha, Charles C. Tappert. 2010. “A Survey of Binary Similarity and Distance Measures.” Journal of Systemics, Cybernetics & Informatics 8(1): 43-8. http://www.iiisci.org/Journal/CV$/sci/pdfs/GS315JG.pdf


Sung-Hyuk Cha. 2007. “Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions.” International Journal of Mathematical Models and Methods in Applied Sciences 4(1): 300-7. http://csis.pace.edu/ctappert/dps/d861-12/session4-p2.pdf

Anna Huang. 2008. “Similarity Measures for Text Document Clustering.” Proceedings of the 6th New Zealand Computer Science Research Student Conference: 49-56. http://www.nzcsrsc08.canterbury.ac.nz/site/proceedings/Individual_Papers/pg049_Similarity_Measures_for_Text_Document_Clustering.pdf

• Michael I. Jordan. “The Kernel Trick” (Lecture Notes). https://people.eecs.berkeley.edu/~jordan/courses/281B-spring04/lectures/lec3.pdf

• Eric Kim. 2017. “Everything You Ever Wanted to Know about the Kernel Trick (But Were Afraid to Ask).” http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick_blog_ekim_12_20_2017.pdf

Derived Data Representations - Dimensionality Reduction, Compression, Decomposition, Embeddings

The groupings here and under the measurement / multivariate statistics section are particularly arbitrary. For example, “k-Means clustering” can be viewed as a technique for “dimensionality reduction,” “compression,” “feature extraction,” “latent variable measurement,” “unsupervised learning,” or “collaborative filtering.”
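To make the point about k-means concrete, here is a minimal sketch (not drawn from any of the listed readings) of Lloyd's algorithm on a handful of scalar values, using only the Python standard library. The function name `kmeans_1d`, the toy data, and the starting centroids are all hypothetical choices made for illustration. The same fitted model is read two ways: the labels act as a cluster assignment (unsupervised learning), while replacing each value with its centroid is a lossy compression of nine numbers into a three-entry codebook plus one small integer per value.

```python
from statistics import mean


def kmeans_1d(values, centroids, n_iter=20):
    """Lloyd's algorithm on scalars; returns (centroids, labels)."""
    centroids = list(centroids)
    labels = [0] * len(values)
    for _ in range(n_iter):
        # Assignment step: each value goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: (v - centroids[k]) ** 2)
            for v in values
        ]
        # Update step: each centroid moves to the mean of its members
        # (an empty cluster keeps its previous centroid).
        new_centroids = []
        for k in range(len(centroids)):
            members = [v for v, lab in zip(values, labels) if lab == k]
            new_centroids.append(mean(members) if members else centroids[k])
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, labels


# Hypothetical toy data with three visually obvious groups.
values = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.1, 8.9]
centroids, labels = kmeans_1d(values, centroids=[0.0, 5.0, 10.0])

# "Clustering" view: labels are group memberships.
# "Compression" view: store the codebook plus one label per value,
# and reconstruct each value as its centroid.
codebook = centroids
reconstructed = [codebook[lab] for lab in labels]
```

The starting centroids are fixed rather than random so that the run is reproducible; real implementations usually randomize the initialization and restart several times.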

Clustering hashing quantization blocking compression

• ‡[Compression] Khalid Sayood. 2012. Introduction to Data Compression, 4th ed. Springer. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780124157965 (e.g., coding, blocking via vector quantization)

• MMDS Ch. 3 “Finding Similar Items” (minhashing, locality sensitive hashing)

• Multivariate-R Ch. 6 “Clustering”

• DataScience-Python NB 05.11 “k-Means Clustering,” NB 05.12 “Gaussian Mixture Models”

• ‡[KMeansHashing] Kaiming He, Fang Wen, Jian Sun. 2013. “K-means Hashing: An Affinity-Preserving Quantization Method for Learning Binary Compact Codes.” CVPR. https://www.cv-foundation.org/openaccess/content_cvpr_2013/papers/He_K-Means_Hashing_An_2013_CVPR_paper.pdf

• †[Squashing] Madigan, D., Raghavan, N., Dumouchel, W., Nason, M., Posse, C., and Ridgeway, G. (2002). “Likelihood-based data squashing: A modeling approach to instance construction.” Data Mining and Knowledge Discovery 6(2): 173-190. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1023/A:1014095614948

• †[Core-sets] Piotr Indyk, Sepideh Mahabadi, Mohammad Mahdian, Vahab S. Mirrokni. 2014. “Composable core-sets for diversity and coverage maximization.” PODS ’14: 100-8. https://doi.org/10.1145/2594538.2594560

Feature selection feature extraction feature engineering weighting preprocessing

• PatternRecognition Sections 2.6-7 “Feature Selection, Feature Extraction”

• DSHandbook-Py Ch. 7 “Interlude: Feature Extraction Ideas”

• DataScience-Python NB 05.04 “Feature Engineering”

• Features in text: NLP and InfoRetrieval re tf-idf and similar; FightinWords

• Features in images: DataScience-Python NB 05.14 “Image Features”


Dimensionality reduction, decomposition, factorization, change of basis, reparameterization, matrix completion, latent variables, source separation

• MMDS Ch. 9 “Recommendation Systems,” Ch. 11 “Dimensionality Reduction”

• DSHandbook-Py Ch. 10 “Unsupervised Learning: Clustering and Dimensionality Reduction”

• ‡[Shalizi-ADA] Cosma Rohilla Shalizi. 2017. Advanced Data Analysis from an Elementary Point of View. Ch. 16 (“Principal Components Analysis”), Ch. 17 (“Factor Models”). http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/

See also Multivariate-R Ch. 3 “Principal Components Analysis,” Ch. 4 “Multidimensional Scaling,” Ch. 5 “Exploratory Factor Analysis.” Latent

• DeepLearning Ch. 2 “Linear Algebra,” Ch. 13 “Linear Factor Models”

• ‡[GloVe] Jeffrey Pennington, Richard Socher, Christopher Manning. 2014. “GloVe: Global vectors for word representation.” EMNLP. https://nlp.stanford.edu/projects/glove/

• †[NMF] Daniel D. Lee and H. Sebastian Seung. 1999. “Learning the parts of objects by non-negative matrix factorization.” Nature 401: 788-791. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1038/44565

• †[CUR] Michael W. Mahoney and Petros Drineas. 2009. “CUR matrix decompositions for improved data analysis.” PNAS. http://www.pnas.org/content/106/3/697.full

• ‡[ICA] Aapo Hyvärinen and Erkki Oja. 2000. “Independent Component Analysis: Algorithms and Applications.” Neural Networks 13(4-5): 411-30. https://www.cs.helsinki.fi/u/ahyvarin/papers/NN00new.pdf

• [RandomProjection] Ella Bingham and Heikki Mannila. 2001. “Random projection in dimensionality reduction: applications to image and text data.” KDD. https://doi.org/10.1145%2F502512.502546

• Compression, e.g., “Transform coding,” “Wavelets”

Nonlinear dimensionality reduction Manifold learning

• DeepLearning Section 5.11.3 “Manifold Learning”

• DataScience-Python NB 05.10 “Manifold Learning” (Locally linear embedding [LLE], Isomap)

• ‡[KernelPCA] Sebastian Raschka. 2014. “Kernel tricks and nonlinear dimensionality reduction via RBF kernel PCA.” http://sebastianraschka.com/Articles/2014_kernel_pca.html

• Shalizi-ADA Ch. 18 “Nonlinear Dimensionality Reduction” (LLE)

• ‡[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi. 2003. “Laplacian eigenmaps for dimensionality reduction and data representation.” Neural Computation 15(6): 1373–1396. http://web.cse.ohio-state.edu/~belkin.8/papers/LEM_NC_03.pdf

• DeepLearning Ch. 14 (“Autoencoders”)

• ‡[word2vec] Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” https://arxiv.org/abs/1301.3781

• ‡[word2vecExplained] Yoav Goldberg and Omer Levy. 2014. “word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method.” https://arxiv.org/abs/1402.3722

• ‡[t-SNE] Laurens van der Maaten and Geoffrey Hinton. 2008. “Visualizing Data using t-SNE.” Journal of Machine Learning Research 9: 2579-2605. http://jmlr.csail.mit.edu/papers/volume9/vandermaaten08a/vandermaaten08a.pdf


Computation and Scaling Up

Scientific computing computation at scale

• NRCreport Ch. 10 “The Seven Computational Giants of Massive Data Analysis”

• DSHandbook-Py Ch. 21 “Performance and Computer Memory,” Ch. 22 “Computer Memory and Data Structures”

• ‡[ComputationalStatistics-Python] Cliburn Chan. Computational Statistics in Python. https://people.duke.edu/~ccc14/sta-663/index.html

• CRAN Task View: High Performance Computing. https://cran.r-project.org/web/views/HighPerformanceComputing.html

Numerical computing

• [NumericalComputing] Ward Cheney and David Kincaid. 2013. Numerical Mathematics and Computing, 7th ed. Brooks/Cole: Cengage Learning.

• CRAN Task View: Numerical Mathematics. https://cran.r-project.org/web/views/NumericalMathematics.html

Optimization (e.g., MLE, gradient descent, stochastic gradient descent, EM algorithm, neural nets)

• DSHandbook-Py Ch. 23 “Maximum Likelihood Estimation and Optimization” (gradient descent)

• DeepLearning Ch. 4 “Numerical Computation,” Ch. 6 “Deep Feedforward Networks,” Ch. 8 “Optimization for Training Deep Models”

• CRAN Task View: Optimization and Mathematical Programming. https://cran.r-project.org/web/views/Optimization.html

Linear algebra matrix computations

• [MatrixComputations] Gene H. Golub and Charles F. Van Loan. 2013. Matrix Computations, 4th ed. Johns Hopkins University Press.

• ‡[NetflixMatrix] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. “Matrix Factorization Techniques for Recommender Systems.” Computer, August: 42-9. https://datajobs.com/data-science-repo/Recommender-Systems-[Netflix].pdf

• ‡[BigDataPCA] Jianqing Fan, Qiang Sun, Wen-Xin Zhou, Ziwei Zhu. “Principal Component Analysis for Big Data.” http://www.princeton.edu/~ziweiz/pca.pdf

• ‡[RandomSVD] Andrew Tulloch. 2009. “Fast Randomized SVD.” https://research.fb.com/fast-randomized-svd/

• ‡[FactorSGD] Rainer Gemulla, Peter J. Haas, Erik Nijkamp, and Yannis Sismanis. 2011. “Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent.” KDD. http://www.cs.utah.edu/~hari/teaching/bigdata/gemulla11dsgd.pdf

• ‡[Sparse] Max Grossman. 2015. “101 Ways to Store a Sparse Matrix.” https://medium.com/@jmaxg3/101-ways-to-store-a-sparse-matrix-c7f2bf15a229

• ‡[DontInvert] John D. Cook. 2010. “Don’t Invert that Matrix.” https://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix/

Simulation-based inference resampling Monte Carlo methods MCMC Bayes approximate inference


• ‡[MarkovVisually] Victor Powell. “Markov Chains Explained Visually.” http://setosa.io/ev/markov-chains/

• DSHandbook-Py Ch. 25 “Stochastic Modeling” (Markov chains, MCMC, HMM)

• Bayes, ComputationalStatistics-Python

• DeepLearning Ch. 17 “Monte Carlo Methods,” Ch. 19 “Approximate Inference”

• ‡[VariationalInference] Jason Eisner. 2011. “High-Level Explanation of Variational Inference.” https://www.cs.jhu.edu/~jason/tutorials/variational.html

• CRAN Task View: Bayesian Inference. https://cran.r-project.org/web/views/Bayesian.html

Parallelism MapReduce Split-Apply-Combine

• ‡[MapReduceIntuition] Jigsaw Academy. 2014. “Big Data Specialist: MapReduce.” https://www.youtube.com/watch?v=TwcYQzFqg-8&feature=youtu.be

• MMDS Ch. 2 “Map-Reduce and the New Software Stack” (3rd Edition discusses Spark & TensorFlow)

• Algorithms Ch. 3 “Scalable Algorithms and Associative Statistics,” Ch. 4 “Hadoop and MapReduce”

• NRCreport Ch. 6 “Resources, Trade-offs, and Limitations”

• TidyData-R (Split-apply-combine is the motivating principle behind the “tidyverse” approach.) See also Part III “Program” (pipes, functions, vectors, iteration).

• ‡[TidySAC-Video] Hadley Wickham. 2017. “Data Science in the Tidyverse.” https://www.rstudio.com/resources/videos/data-science-in-the-tidyverse/

• See also FactorSGD
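As a rough orientation to the pattern these readings share, the sketch below walks through a word count in the map, shuffle, reduce style using only the Python standard library. It is an illustration under stated assumptions, not the Hadoop or Spark API: the function names `map_phase`, `shuffle`, and `reduce_phase` and the three toy documents are hypothetical. Split-apply-combine has the same shape: split records by key, apply a summary to each group, and combine the per-group results.

```python
from collections import defaultdict
from functools import reduce


def map_phase(doc):
    """Map: emit a (word, 1) pair for every token in one document."""
    return [(word.lower(), 1) for word in doc.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key (the 'split' step)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: fold each key's values into one result ('apply' and 'combine')."""
    return {key: reduce(lambda a, b: a + b, vals) for key, vals in groups.items()}


# Hypothetical toy corpus; in a real system each document could be
# mapped on a different worker, since map_phase needs no shared state.
docs = ["big social data", "social data analytics", "big data"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
```

Because addition is associative, the reduce step could also be applied to partial groups on separate machines and the partial sums added afterward, which is what makes this pattern scale.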

Functional Programming

• ComputationalStatistics-Python, “Functions are first class objects” through first exercises

• DSHandbook-Py Ch. 20 “Programming Language Concepts”

• See also Haskell

Scaling iteration streaming data online algorithms (Spark)

• NRCreport Ch. 4 “Temporal Data and Real-Time Algorithms”

• MMDS Ch. 4 “Mining Data Streams”

• ‡[BDAS] AMPLab. BDAS: The Berkeley Data Analytics Stack. https://amplab.cs.berkeley.edu/software/

• †[Spark] Mohammed Guller. 2015. Big Data Analytics with Spark: A Practitioner’s Guide to Using Spark for Large-Scale Data Processing, Machine Learning, and Graph Analytics, and High-Velocity Data Stream Processing. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-0964-6

bull DSHandbook-Py Ch 13 ldquoBig Datardquo

bull DeepLearning Ch 10 ldquoSequence Modeling Recurrent and Recursive Netsrdquo

General Resources for Python and R

26

bull dagger[DSHandbook-Py] Field Cady 2017 The Data Science Handbook (Python-based) Wiley http

onlinelibrarywileycomezaccesslibrariespsuedubook1010029781119092919

bull Dagger[Tutorials-R] Ujjwal Karn 2017 ldquoA curated list of R tutorials for Data Science NLP and MachineLearningrdquo httpsgithubcomujjwalkarnDataScienceR

bull Dagger[Tutorials-Python] Ujjwal Karn 2017 ldquoA curated list of Python tutorials for Data Science NLPand Machine Learningrdquo httpsgithubcomujjwalkarnDataSciencePython

Cutting and Bleeding Edge of Data Science Languages

• [Scala] †Vishal Layka and David Pollak. 2015. Beginning Scala. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-0232-6 †Patrick R. Nicolas. 2014. Scala for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1901910 ‡Scala site: https://www.scala-lang.org (See also )

• [Julia] †Ivo Balbaert. 2015. Getting Started with Julia. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1973847 ‡Douglas Bates. 2013. “Julia for R Programmers.” http://www.stat.wisc.edu/~bates/JuliaForRProgrammers.pdf ‡Julia site: https://julialang.org

• [Haskell] †Richard Bird. 2014. Thinking Functionally with Haskell. https://doi-org.ezaccess.libraries.psu.edu/10.1017/CBO9781316092415 †Hakim Cassimally. 2017. Learning Haskell Programming. https://www.lynda.com/Haskell-tutorials/Learning-Haskell-Programming/604926-2.html †James Church. 2017. Learning Haskell for Data Analysis. https://www.lynda.com/Developer-tutorials/Learning-Haskell-Data-Analysis/604234-2.html Haskell site: https://www.haskell.org

• [Clojure] †Mark McDonnell. 2017. Quick Clojure: Essential Functional Programming. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2952-1 †Akhil Wali. 2014. Clojure for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1674848 †Arthur Ulfeldt. 2015. https://www.lynda.com/Clojure-tutorials/Up-Running-Clojure/413127-2.html ‡Clojure site: https://clojure.org

• [TensorFlow] ‡Abadi et al. (Google). 2016. “TensorFlow: A System for Large-Scale Machine Learning.” https://arxiv.org/abs/1605.08695 ‡Udacity, “Deep Learning.” https://www.udacity.com/course/deep-learning--ud730 ‡TensorFlow site: https://www.tensorflow.org

• [H2O] Darren Cook. 2016. Practical Machine Learning with H2O: Powerful, Scalable Techniques for Deep Learning and AI. O’Reilly. Arno Candel and Viraj Parmar. 2015. Deep Learning with H2O. ‡H2O site: https://www.h2o.ai

Social Data Structures

Space and Time

• †[GIA] David O’Sullivan and David J. Unwin. 2010. Geographic Information Analysis, Second Edition. John Wiley & Sons. http://onlinelibrary.wiley.com/book/10.1002/9780470549094

• [Space-Time] Donna J. Peuquet. 2003. Representations of Space and Time. Guilford.

• Roger S. Bivand, Edzer Pebesma, and Virgilio Gómez-Rubio. 2013. Applied Spatial Data Analysis in R. Free through library: http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-7618-4 See also Edzer Pebesma. 2016. “Handling and Analyzing Spatial, Spatiotemporal and Movement Data in R.” https://edzer.github.io/UseR2016/

• ‡[PySAL] Sergio J. Rey and Dani Arribas-Bel. 2016. “Geographic Data Science with PySAL and the pydata Stack.” http://darribas.org/gds_scipy16/

• DataScience-Python, NB 04.13, “Geographic Data with Basemap”

• Time: “Longitudinal,” intra-individual data, as common in developmental psychology. †[Longitudinal] Garrett Fitzmaurice, Marie Davidian, Geert Verbeke, and Geert Molenberghs, editors. 2009. Longitudinal Data Analysis. Chapman & Hall / CRC. http://pensu.eblib.com/patron/FullRecord.aspx?p=359998

• Time: “Time Series, Time Series Cross Section, Panel,” as common in economics, political science, and sociology. DataScience-Python, Ch. 3, re: pandas; DSHandbook-Py, Ch. 17, “Time Series Analysis”; Econometrics; GelmanHill

• Time: “Sequential, Data Streams,” as in NLP and machine learning. Spark and other “streaming data” readings

• Space-Time: Data with continuity, connectivity, neighborhood structure. See DeepLearning, Chapter 9, re: convolution

• CRAN Task Views: Spatial, https://cran.r-project.org/web/views/Spatial.html; Spatiotemporal, https://cran.r-project.org/web/views/SpatioTemporal.html; Time Series, https://cran.r-project.org/web/views/TimeSeries.html

Network Graphs

• ‡[Networks] David Easley and Jon Kleinberg. 2010. Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press. http://www.cs.cornell.edu/home/kleinber/networks-book/

• MMDS, Ch. 5, “Link Analysis”; Ch. 10, “Mining Social-Network Graphs”

• †[Networks-R] Douglas Luke. 2015. A User’s Guide to Network Analysis in R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-3-319-23883-8

• †[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry. 2017. Python for Graph and Network Analysis. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-53004-8

Hierarchy, Clustered Data, Aggregation, Mixed Models

• †[GelmanHill] Andrew Gelman and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press. http://pensu.eblib.com/patron/FullRecord.aspx?p=288457

• †[MixedModels] Eugene Demidenko. 2013. Mixed Models: Theory and Applications with R, 2nd ed. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10748641

Social Data Channels

Language, Text, Speech, Audio

• ‡[NLP] Daniel Jurafsky and James H. Martin. Forthcoming (2017). Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics, 3rd ed. Prentice-Hall. Preprint: https://web.stanford.edu/~jurafsky/slp3/

• InfoRetrieval

• ‡[CoreNLP] Stanford CoreNLP. https://stanfordnlp.github.io/CoreNLP/

• †[TextAsData] Grimmer and Stewart. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis. https://doi.org/10.1093/pan/mps028

• Quinn-Topics

• FightinWords

• †[NLTK-Python] Jacob Perkins. 2010. Python Text Processing with NLTK 2.0 Cookbook. Packt. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10435387 †[TextAnalytics-Python] Dipanjan Sarkar. 2016. Text Analytics with Python. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2388-8

• DSHandbook-Py, Ch. 16, “Natural Language Processing”

• Matthew J. Denny. http://www.mjdenny.com/Text_Processing_In_R.html

• CRAN: Natural Language Processing. https://cran.r-project.org/web/views/NaturalLanguageProcessing.html

• †[AudioData] Dean Knox and Christopher Lucas. 2017. “The Speaker-Affect Model: Measuring Emotion in Political Speech with Audio Data.” http://christopherlucas.org/files/PDFs/sam.pdf https://www.youtube.com/watch?v=Hs8A9dwkMzI

• †Morgan & Claypool Synthesis Lectures on Human Language Technologies. http://www.morganclaypool.com/toc/hlt/1/1

• †Morgan & Claypool Synthesis Lectures on Speech and Audio Processing. http://www.morganclaypool.com/toc/sap/1/1

Vision, Image, Video

• ‡[ComputerVision] Richard Szeliski. 2010. Computer Vision: Algorithms and Applications. Springer. http://szeliski.org/Book/

• MixedModels, Ch. 11, “Statistical Analysis of Shape”; Ch. 12, “Statistical Image Analysis”

• Audio/Video/Volumetric: See DeepLearning, Chapter 9, re: convolution

• †Morgan & Claypool Synthesis Lectures on Image, Video, and Multimedia Processing. http://www.morganclaypool.com/toc/ivm/1/1

• †Morgan & Claypool Synthesis Lectures on Computer Vision. http://www.morganclaypool.com/toc/cov/1/1

Approaches to Learning from Data (The Analytics Layer)

• NRCreport, Ch. 7, “Building Models from Massive Data”

• MMDS (data-mining)

• CausalInference

• ‡[VisualAnalytics] Daniel Keim, Jörn Kohlhammer, Geoffrey Ellis, and Florian Mansmann, editors. 2010. Mastering the Information Age: Solving Problems with Visual Analytics. The Eurographics Association, Goslar, Germany. http://www.vismaster.eu/book/ See also †Morgan & Claypool Synthesis Lectures on Visualization. http://www.morganclaypool.com/toc/vis/2/1

• ‡[Econometrics] Bruce E. Hansen. 2014. Econometrics. http://www.ssc.wisc.edu/~bhansen/econometrics/

• ‡[StatisticalLearning] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning with Applications in R. Springer. http://www-bcf.usc.edu/~gareth/ISL/index.html

• [MachineLearning] Christopher Bishop. 2006. Pattern Recognition and Machine Learning. Springer.

• †[Bayes] Peter D. Hoff. 2009. A First Course in Bayesian Statistical Methods. Springer. http://link.springer.com/book/10.1007%2F978-0-387-92407-6

• †[GraphicalModels] Søren Højsgaard, David Edwards, and Steffen Lauritzen. 2012. Graphical Models with R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-2299-0

• [InformationTheory] David MacKay. 2003. Information Theory, Inference, and Learning Algorithms. Cambridge University Press.

• ‡[DeepLearning] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org (see also Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. “Deep Learning.” Nature 521(7553): 436–444. https://doi.org/10.1038/nature14539)

See also ‡[Keras] Keras: The Python Deep Learning Library. https://keras.io François Chollet, Directory of Keras tutorials: https://github.com/fchollet/keras-resources

• †[SignalProcessing] Jose Maria Giron-Sierra. 2013. Digital Signal Processing with MATLAB Examples (Volumes 1-3). Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-981-10-2534-1

Penn State Policy Statements

Academic Integrity

Academic integrity is the pursuit of scholarly activity in an open, honest and responsible manner. Academic integrity is a basic guiding principle for all academic activity at The Pennsylvania State University, and all members of the University community are expected to act in accordance with this principle. Consistent with this expectation, the University’s Code of Conduct states that all students should act with personal integrity, respect other students’ dignity, rights and property, and help create and maintain an environment in which all can succeed through the fruits of their efforts.

Academic integrity includes a commitment by all members of the University community not to engage in or tolerate acts of falsification, misrepresentation or deception. Such acts of dishonesty violate the fundamental ethical principles of the University community and compromise the worth of work completed by others.

Disability Accommodation

Penn State welcomes students with disabilities into the University’s educational programs. Every Penn State campus has an office for students with disabilities. The Student Disability Resources (SDR) website provides contact information for every Penn State campus (http://equity.psu.edu/sdr/disability-coordinator). For further information, please visit the Student Disability Resources website (http://equity.psu.edu/sdr).

In order to receive consideration for reasonable accommodations, you must contact the appropriate disability services office at the campus where you are officially enrolled, participate in an intake interview, and provide documentation. See documentation guidelines at http://equity.psu.edu/sdr/guidelines. If the documentation supports your request for reasonable accommodations, your campus disability services office will provide you with an accommodation letter. Please share this letter with your instructors and discuss the accommodations with them as early as possible. You must follow this process for every semester that you request accommodations.

Psychological Services

Many students at Penn State face personal challenges or have psychological needs that may interfere with their academic progress, social development, or emotional wellbeing. The university offers a variety of confidential services to help you through difficult times, including individual and group counseling, crisis intervention, consultations, online chats, and mental health screenings. These services are provided by staff who welcome all students and embrace a philosophy respectful of clients’ cultural and religious backgrounds, and sensitive to differences in race, ability, gender identity and sexual orientation.

• Counseling and Psychological Services at University Park (CAPS) (http://studentaffairs.psu.edu/counseling/): 814-863-0395

• Penn State Crisis Line (24 hours/7 days/week): 877-229-6400

• Crisis Text Line (24 hours/7 days/week): Text LIONS to 741741

Educational Equity

Penn State takes great pride to foster a diverse and inclusive environment for students, faculty, and staff. Consistent with University Policy AD29, students who believe they have experienced or observed a hate crime, an act of intolerance, discrimination, or harassment that occurs at Penn State are urged to report these incidents as outlined on the University’s Report Bias webpage (http://equity.psu.edu/reportbias/).

Background Knowledge

Students will have a variety of backgrounds. Prior to beginning interdisciplinary coursework to fulfill Social Data Analytics degree requirements, including SoDA 501 and 502, students are expected to have advanced (graduate) training in at least one of the component areas of Social Data Analytics and a familiarity with basic concepts in the others.

With regard to specialization, students are expected to have advanced (graduate) training in ONE of the following:

• quantitative social science methodology and a discipline of social science (as would be the case for a second-year PhD student in Political Science, Sociology, Criminology, Human Development and Family Studies, or Demography), OR

• statistics (as would be the case for a second-year PhD student in Statistics), OR

• information science or informatics (as would be the case for a second-year PhD student in Information Science and Technology, or a second-year PhD student in Geography specializing in GIScience), OR

• computer science (as would be the case for a second-year PhD student in Computer Science and Engineering).

This requirement is met as a matter of meeting home program requirements for students in the dual-title PhD, but may require additional coursework on the part of students in other programs wishing to pursue the graduate minor.

With regard to general preparation, students are expected to have ALL of the following technical knowledge:

• basic programming skills, including basic facility with R and/or Python, AND

• basic knowledge of relational databases and/or geographic information systems, AND

• basic knowledge of probability, applied statistics, and/or social science research design, AND

• basic familiarity with a substantive or theoretical area of social science (e.g., 300-level coursework in political science, sociology, criminology, human development, psychology, economics, communication, anthropology, human geography, social informatics, or similar fields).

It is not unusual for students to have one or more gaps in this preparation at time of application to the SoDA program. Students should work with Social Data Analytics advisers to develop a plan for timely remediation of any deficiencies, which generally will not require formal coursework for students whose training and interests are otherwise appropriate for pursuit of the Social Data Analytics degree. Where possible, this will be addressed at time of application to the Social Data Analytics program.

To this end, some free training materials are linked in the reference section, and there are also abundant free, high-quality, self-paced, course-style training materials on these and related subjects available through edX (http://www.edx.org), Udacity (http://www.udacity.com), Codecademy (http://codecademy.com), and Lynda (http://www.lynda.psu).

• †[Imputation] Yi Deng, Changgee Chang, Moges Seyoum Ido, and Qi Long. 2016. “Multiple Imputation for General Missing Data Patterns in the Presence of High-dimensional Data.” Scientific Reports 6(21689). https://www.nature.com/articles/srep21689

• ‡[Overimputation] Matthew Blackwell, James Honaker, and Gary King. 2017. “A Unified Approach to Measurement Error and Missing Data: Overview and Applications.” Sociological Methods & Research 46(3): 303-341. http://gking.harvard.edu/files/gking/files/measure.pdf

• Shalizi-ADA, Sect. 1.5 (Linear Smoothers), Ch. 8 (Splines), Sect. 14.4 (Kernel Density Estimates); DataScience-Python, NB 05.13, “Kernel Density Estimation”

• FightinWords (re: Bayesian priors as additional data; impact of priors on regularization)

• DeepLearning, Section 7.5 (Data Augmentation), 7.12 (Dropout), 7.13 (Adversarial Examples), Chapter 9 (Convolutional Networks)

• †[Adversarial] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. 2015. “Explaining and Harnessing Adversarial Examples.” https://arxiv.org/abs/1412.6572

• See also: resampling and simulation methods

• See also: feature engineering, preprocessing

• CRAN Task Views: https://cran.r-project.org/web/views/OfficialStatistics.html, “Imputation”

(Direct) Data Representations, Data Mappings

• NRCreport, Ch. 5, “Large-Scale Data Representations”

• †[PatternRecognition] M. Narasimha Murty and V. Susheela Devi. 2011. Pattern Recognition: An Algorithmic Approach. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-0-85729-495-1 Section 2.1, “Data Structures for Pattern Representation”

• ‡[InfoRetrieval] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2009. Introduction to Information Retrieval. Cambridge University Press. http://nlp.stanford.edu/IR-book/ Ch. 1, 2, 6 (also note slides used in their class)

• †[Algorithms] Brian Steele, John Chandler, and Swarna Reddy. 2016. Algorithms for Data Science. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-45797-0 Ch. 2, “Data Mapping and Data Dictionaries”

• See also: Social Data Structures

Similarity, Distance, Association, Covariance, The Kernel Trick

• PatternRecognition, Section 2.3, “Proximity Measures”

• ‡[Similarity] Brendan O’Connor. 2012. “Cosine similarity, Pearson correlation, and OLS coefficients.” https://brenocon.com/blog/2012/03/cosine-similarity-pearson-correlation-and-ols-coefficients/

• For relatively comprehensive lists, see also:

M-J. Lesot, M. Rifqi, and H. Benhadda. 2009. “Similarity measures for binary and numerical data: a survey.” Int. J. Knowledge Engineering and Soft Data Paradigms 1(1): 63-. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.212.6533&rep=rep1&type=pdf

Seung-Seok Choi, Sung-Hyuk Cha, and Charles C. Tappert. 2010. “A Survey of Binary Similarity and Distance Measures.” Journal of Systemics, Cybernetics & Informatics 8(1): 43-8. http://www.iiisci.org/Journal/CV$/sci/pdfs/GS315JG.pdf

Sung-Hyuk Cha. 2007. “Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions.” International Journal of Mathematical Models and Methods in Applied Sciences 4(1): 300-7. http://csis.pace.edu/ctappert/dps/d861-12/session4-p2.pdf

Anna Huang. 2008. “Similarity Measures for Text Document Clustering.” Proceedings of the 6th New Zealand Computer Science Research Student Conference, 49-56. http://www.nzcsrsc08.canterbury.ac.nz/site/proceedings/Individual_Papers/pg049_Similarity_Measures_for_Text_Document_Clustering.pdf

• Michael I. Jordan. “The Kernel Trick” (Lecture Notes). https://people.eecs.berkeley.edu/~jordan/courses/281B-spring04/lectures/lec3.pdf

• Eric Kim. 2017. “Everything You Ever Wanted to Know about the Kernel Trick (But Were Afraid to Ask).” http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick_blog_ekim_12_20_2017.pdf

Derived Data Representations - Dimensionality Reduction, Compression, Decomposition, Embeddings

The groupings here and under the measurement / multivariate statistics section are particularly arbitrary. For example, “k-Means clustering” can be viewed as a technique for “dimensionality reduction,” “compression,” “feature extraction,” “latent variable measurement,” “unsupervised learning,” or “collaborative filtering.”
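As a rough illustration of that point, the sketch below runs a bare-bones k-means (Lloyd's algorithm) on made-up toy data and then reads the same fitted result three ways: as cluster codes, as a lossy compression, and as derived features. It is not taken from any of the readings; the function names, the toy points, and the hand-picked starting centroids are all assumptions chosen to keep the example small and deterministic.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(p, centroids):
    """Index of the centroid closest to point p."""
    return min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))


def kmeans(points, init_centroids, n_iter=20):
    """Plain Lloyd's algorithm with caller-supplied starting centroids."""
    centroids = [tuple(c) for c in init_centroids]
    for _ in range(n_iter):
        # Assignment step: label each point with its nearest centroid.
        labels = [nearest(p, centroids) for p in points]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    # Final assignment against the final centroids.
    labels = [nearest(p, centroids) for p in points]
    return labels, centroids


# Toy data (made up for illustration): two well-separated groups in 2-D.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.2),
          (5.0, 5.0), (5.2, 4.8), (4.8, 5.2)]

labels, centroids = kmeans(points, init_centroids=[points[0], points[3]])

# "Clustering" / "latent variable": one discrete code per observation.
codes = labels

# "Compression": store k centroids plus one small integer per point,
# then reconstruct each point (lossily) as its assigned centroid.
reconstructed = [centroids[c] for c in codes]

# "Feature extraction": distances to the k centroids give a k-dimensional
# representation (a reduction whenever k is below the original dimension).
features = [[sq_dist(p, c) ** 0.5 for c in centroids] for p in points]

print(codes)
print(centroids)
```

The same two outputs (codes and centroids) support each reading; which label applies depends only on what is done with them afterward.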

Clustering, hashing, quantization, blocking, compression

• ‡[Compression] Khalid Sayood. 2012. Introduction to Data Compression, 4th ed. Springer. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780124157965 (e.g., coding; blocking via vector quantization)

• MMDS, Ch. 3, “Finding Similar Items” (minhashing, locality sensitive hashing)

• Multivariate-R, Ch. 6, “Clustering”

• DataScience-Python, NB 05.11, “k-Means Clustering”; NB 05.12, “Gaussian Mixture Models”

• ‡[KMeansHashing] Kaiming He, Fang Wen, and Jian Sun. 2013. “K-means Hashing: An Affinity-Preserving Quantization Method for Learning Binary Compact Codes.” CVPR. https://www.cv-foundation.org/openaccess/content_cvpr_2013/papers/He_K-Means_Hashing_An_2013_CVPR_paper.pdf

• †[Squashing] Madigan, D., Raghavan, N., Dumouchel, W., Nason, M., Posse, C., and Ridgeway, G. (2002). “Likelihood-based data squashing: A modeling approach to instance construction.” Data Mining and Knowledge Discovery 6(2): 173-190. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1023/A:1014095614948

• †[Core-sets] Piotr Indyk, Sepideh Mahabadi, Mohammad Mahdian, and Vahab S. Mirrokni. 2014. “Composable core-sets for diversity and coverage maximization.” PODS ’14, 100-8. https://doi.org/10.1145/2594538.2594560

Feature selection, feature extraction, feature engineering, weighting, preprocessing

• PatternRecognition, Sections 2.6-7, “Feature Selection / Feature Extraction”

• DSHandbook-Py, Ch. 7, “Interlude: Feature Extraction Ideas”

• DataScience-Python, NB 05.04, “Feature Engineering”

• Features in text: NLP and InfoRetrieval re: tf.idf and similar; FightinWords

• Features in images: DataScience-Python, NB 05.14, “Image Features”

Dimensionality reduction, decomposition, factorization, change of basis, reparameterization, matrix completion, latent variables, source separation

• MMDS, Ch. 9, “Recommendation Systems”; Ch. 11, “Dimensionality Reduction”

• DSHandbook-Py, Ch. 10, “Unsupervised Learning: Clustering and Dimensionality Reduction”

• ‡[Shalizi-ADA] Cosma Rohilla Shalizi. 2017. Advanced Data Analysis from an Elementary Point of View. Ch. 16 (“Principal Components Analysis”), Ch. 17 (“Factor Models”). http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/

See also Multivariate-R, Ch. 3, “Principal Components Analysis”; Ch. 4, “Multidimensional Scaling”; Ch. 5, “Exploratory Factor Analysis”; Latent

• DeepLearning, Ch. 2, “Linear Algebra”; Ch. 13, “Linear Factor Models”

• ‡[GloVe] Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. “GloVe: Global vectors for word representation.” EMNLP. https://nlp.stanford.edu/projects/glove/

• †[NMF] Daniel D. Lee and H. Sebastian Seung. 1999. “Learning the parts of objects by non-negative matrix factorization.” Nature 401: 788-791. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1038/44565

• †[CUR] Michael W. Mahoney and Petros Drineas. 2009. “CUR matrix decompositions for improved data analysis.” PNAS. http://www.pnas.org/content/106/3/697.full

• ‡[ICA] Aapo Hyvärinen and Erkki Oja. 2000. “Independent Component Analysis: Algorithms and Applications.” Neural Networks 13(4-5): 411-30. https://www.cs.helsinki.fi/u/ahyvarin/papers/NN00new.pdf

• [RandomProjection] Ella Bingham and Heikki Mannila. 2001. “Random projection in dimensionality reduction: applications to image and text data.” KDD. https://doi.org/10.1145%2F502512.502546

• Compression, e.g., “Transform coding,” “Wavelets”

Nonlinear dimensionality reduction, Manifold learning

• DeepLearning, Section 5.11.3, “Manifold Learning”

• DataScience-Python, NB 05.10, “Manifold Learning” (Locally linear embedding [LLE], Isomap)

• ‡[KernelPCA] Sebastian Raschka. 2014. “Kernel tricks and nonlinear dimensionality reduction via RBF kernel PCA.” http://sebastianraschka.com/Articles/2014_kernel_pca.html

• Shalizi-ADA, Ch. 18, “Nonlinear Dimensionality Reduction” (LLE)

• ‡[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi. 2003. “Laplacian eigenmaps for dimensionality reduction and data representation.” Neural Computation 15(6): 1373–1396. http://web.cse.ohio-state.edu/~belkin.8/papers/LEM_NC_03.pdf

• DeepLearning, Ch. 14 (“Autoencoders”)

• ‡[word2vec] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” https://arxiv.org/abs/1301.3781

• ‡[word2vecExplained] Yoav Goldberg and Omer Levy. 2014. “word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method.” https://arxiv.org/abs/1402.3722

• ‡[t-SNE] Laurens van der Maaten and Geoffrey Hinton. 2008. “Visualizing Data using t-SNE.” Journal of Machine Learning Research 9: 2579-2605. http://jmlr.csail.mit.edu/papers/volume9/vandermaaten08a/vandermaaten08a.pdf

Computation and Scaling Up

Scientific computing, computation at scale

• NRCreport, Ch. 10, “The Seven Computational Giants of Massive Data Analysis”

• DSHandbook-Py, Ch. 21, “Performance and Computer Memory”; Ch. 22, “Computer Memory and Data Structures”

• ‡[ComputationalStatistics-Python] Cliburn Chan. Computational Statistics in Python. https://people.duke.edu/~ccc14/sta-663/index.html

• CRAN Task View: High Performance Computing. https://cran.r-project.org/web/views/HighPerformanceComputing.html

Numerical computing

• [NumericalComputing] Ward Cheney and David Kincaid. 2013. Numerical Mathematics and Computing, 7th ed. Brooks/Cole, Cengage Learning.

• CRAN Task View: Numerical Mathematics. https://cran.r-project.org/web/views/NumericalMathematics.html

Optimization (e.g., MLE, gradient descent, stochastic gradient descent, EM algorithm, neural nets)

• DSHandbook-Py, Ch. 23, “Maximum Likelihood Estimation and Optimization” (gradient descent)

• DeepLearning, Ch. 4, “Numerical Computation”; Ch. 6, “Deep Feedforward Networks”; Ch. 8, “Optimization for Training Deep Models”

• CRAN Task View: Optimization and Mathematical Programming. https://cran.r-project.org/web/views/Optimization.html

Linear algebra, matrix computations

• [MatrixComputations] Gene H. Golub and Charles F. Van Loan. 2013. Matrix Computations, 4th ed. Johns Hopkins University Press.

• ‡[NetflixMatrix] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. “Matrix Factorization Techniques for Recommender Systems.” Computer, August: 42-9. https://datajobs.com/data-science-repo/Recommender-Systems-[Netflix].pdf

• ‡[BigDataPCA] Jianqing Fan, Qiang Sun, Wen-Xin Zhou, and Ziwei Zhu. “Principal Component Analysis for Big Data.” http://www.princeton.edu/~ziweiz/pca.pdf

• ‡[RandomSVD] Andrew Tulloch. 2009. “Fast Randomized SVD.” https://research.fb.com/fast-randomized-svd/

• ‡[FactorSGD] Rainer Gemulla, Peter J. Haas, Erik Nijkamp, and Yannis Sismanis. 2011. “Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent.” KDD. http://www.cs.utah.edu/~hari/teaching/bigdata/gemulla11dsgd.pdf

• ‡[Sparse] Max Grossman. 2015. “101 Ways to Store a Sparse Matrix.” https://medium.com/@jmaxg3/101-ways-to-store-a-sparse-matrix-c7f2bf15a229

• ‡[DontInvert] John D. Cook. 2010. “Don’t Invert that Matrix.” https://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix/

Simulation-based inference, resampling, Monte Carlo methods, MCMC, Bayes, approximate inference

• ‡[MarkovVisually] Victor Powell. “Markov Chains Explained Visually.” http://setosa.io/ev/markov-chains/

• DSHandbook-Py, Ch. 25, “Stochastic Modeling” (Markov chains, MCMC, HMM)

• Bayes; ComputationalStatistics-Python

• DeepLearning, Ch. 17, “Monte Carlo Methods”; Ch. 19, “Approximate Inference”

• ‡[VariationalInference] Jason Eisner. 2011. “High-Level Explanation of Variational Inference.” https://www.cs.jhu.edu/~jason/tutorials/variational.html

• CRAN Task View: Bayesian Inference. https://cran.r-project.org/web/views/Bayesian.html

Parallelism, MapReduce, Split-Apply-Combine

• ‡[MapReduceIntuition] Jigsaw Academy. 2014. “Big Data Specialist: MapReduce.” https://www.youtube.com/watch?v=TwcYQzFqg-8&feature=youtube

• MMDS, Ch. 2, “Map-Reduce and the New Software Stack” (3rd Edition discusses Spark & TensorFlow)

• Algorithms, Ch. 3, “Scalable Algorithms and Associative Statistics”; Ch. 4, “Hadoop and MapReduce”

• NRCreport, Ch. 6, “Resources, Trade-offs, and Limitations”

• TidyData-R (Split-apply-combine is the motivating principle behind the “tidyverse” approach). See also Part III, “Program” (pipes, functions, vectors, iteration)

• ‡[TidySAC-Video] Hadley Wickham. 2017. “Data Science in the Tidyverse.” https://www.rstudio.com/resources/videos/data-science-in-the-tidyverse/

• See also: FactorSGD

Functional Programming

bull ComputationalStatistics-Python ldquoFunctions are first class objectsrdquo through first exercises

bull DSHandbook-Py Ch 20 ldquoProgramming Language Conceptsrdquo

bull See also Haskell

Scaling iteration streaming data online algorithms (Spark)

bull NRCreport Ch 4 ldquoTemporal Data and Real-Time Algorithmsrdquo

bull MMDS Ch 4 ldquoMining Data Streamsrdquo

bull Dagger[BDAS] AMPLab BDAS The Berkeley Data Analytics Stack httpsamplabcsberkeleyedu

software

bull dagger[Spark] Mohammed Guller 2015 Big Data Analytics with Spark A Practitioners Guide toUsing Spark for Large-Scale Data Processing Machine Learning and Graph Analytics and High-Velocity Data Stream Processing Apress httpslink-springer-comezaccesslibrariespsu

edubook101007978-1-4842-0964-6

bull DSHandbook-Py Ch 13 ldquoBig Datardquo

bull DeepLearning Ch 10 ldquoSequence Modeling Recurrent and Recursive Netsrdquo

General Resources for Python and R

26

bull dagger[DSHandbook-Py] Field Cady 2017 The Data Science Handbook (Python-based) Wiley http

onlinelibrarywileycomezaccesslibrariespsuedubook1010029781119092919

bull Dagger[Tutorials-R] Ujjwal Karn 2017 ldquoA curated list of R tutorials for Data Science NLP and MachineLearningrdquo httpsgithubcomujjwalkarnDataScienceR

bull Dagger[Tutorials-Python] Ujjwal Karn 2017 ldquoA curated list of Python tutorials for Data Science NLPand Machine Learningrdquo httpsgithubcomujjwalkarnDataSciencePython

Cutting and Bleeding Edge of Data Science Languages

bull [Scala] daggerVishal Layka and David Pollak 2015 Beginning Scala httpslink-springer-

comezaccesslibrariespsuedubook1010072F978-1-4842-0232-6 daggerNicolas Patrick 2014Scala for Machine Learning httpsebookcentralproquestcomlibpensudetailaction

docID=1901910 DaggerScala site httpswwwscala-langorg (See also )

bull [Julia] daggerIvo Baobaert 2015 Getting Started with Julia httpsebookcentralproquestcom

libpensudetailactiondocID=1973847 DaggerDouglas Bates 2013 ldquoJulia for R Programmersrdquohttpwwwstatwiscedu~batesJuliaForRProgrammerspdf DaggerJulia site httpsjulialang

org

bull [Haskell] daggerRichard Bird 2014 Thinking Functionally with Haskell httpsdoi-orgezaccess

librariespsuedu101017CBO9781316092415 daggerHakim Cassimally 2017 Learning Haskell Pro-gramming httpswwwlyndacomHaskell-tutorialsLearning-Haskell-Programming604926-2html daggerJames Church 2017 Learning Haskell for Data Analysis httpswwwlyndacomDeveloper-tutorialsLearning-Haskell-Data-Analysis604234-2html Haskell site httpswwwhaskellorg

bull [Clojure] daggerMark McDonnell 2017 Quick Clojure Essential Functional Programming https

link-springer-comezaccesslibrariespsuedubook1010072F978-1-4842-2952-1 daggerAkhillWali 2014 Clojure for Machine Learning httpsebookcentralproquestcomlibpensu

detailactiondocID=1674848 daggerArthur Ulfeldt 2015 httpswwwlyndacomClojure-tutorialsUp-Running-Clojure413127-2html DaggerClojure site httpsclojureorg

bull [TensorFlow] DaggerAbadi et al (Google) 2016 ldquoTensorFlow A System for Large-Scale Ma-chine Learningrdquo httpsarxivorgabs160508695 DaggerUdacity ldquoDeep Learningrdquo httpswww

udacitycomcoursedeep-learning--ud730 DaggerTensorFlow site httpswwwtensorfloworg

bull [H2O] Darren Cool 2016 Practical Machine Learning with H2O Powerful Scalable Techniques forDeep Learning and AI OrsquoReilly Arno Candel and Viraj Parmar 2015 Deep Learning with H2ODaggerH2O site httpswwwh2oai

Social Data Structures

Space and Time

• †[GIA] David O’Sullivan and David J. Unwin. 2010. Geographic Information Analysis, Second Edition. John Wiley & Sons. http://onlinelibrary.wiley.com/book/10.1002/9780470549094

• [Space-Time] Donna J. Peuquet. 2003. Representations of Space and Time. Guilford.

• Roger S. Bivand, Edzer Pebesma, and Virgilio Gómez-Rubio. 2013. Applied Spatial Data Analysis in R. Free through library: http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-7618-4 See also Edzer Pebesma. 2016. “Handling and Analyzing Spatial, Spatiotemporal and Movement Data in R.” https://edzer.github.io/UseR2016

• ‡[PySAL] Sergio J. Rey and Dani Arribas-Bel. 2016. “Geographic Data Science with PySAL and the pydata Stack.” http://darribas.org/gds_scipy16

• DataScience-Python NB 04.13, “Geographic Data with Basemap”

• Time: “Longitudinal” intra-individual data, as common in developmental psychology. †[Longitudinal] Garrett Fitzmaurice, Marie Davidian, Geert Verbeke, and Geert Molenberghs, editors. 2009. Longitudinal Data Analysis. Chapman & Hall/CRC. http://pensu.eblib.com/patron/FullRecord.aspx?p=359998

• Time: “Time Series, Time Series Cross Section, Panel,” as common in economics, political science, and sociology. DataScience-Python Ch. 3 re pandas; DSHandbook-Py Ch. 17, “Time Series Analysis”; Econometrics; GelmanHill.

• Time: “Sequential, Data Streams,” as in NLP, machine learning. Spark and other “streaming data” readings.

• Space-Time: Data with continuity, connectivity, neighborhood structure. See DeepLearning Chapter 9 re convolution.

• CRAN Task Views: Spatial: https://cran.r-project.org/web/views/Spatial.html Spatiotemporal: https://cran.r-project.org/web/views/SpatioTemporal.html Time Series: https://cran.r-project.org/web/views/TimeSeries.html

Network Graphs

• ‡[Networks] David Easley and Jon Kleinberg. 2010. Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press. http://www.cs.cornell.edu/home/kleinber/networks-book

• MMDS Ch. 5, “Link Analysis”; Ch. 10, “Mining Social-Network Graphs”

• †[Networks-R] Douglas Luke. 2015. A User’s Guide to Network Analysis in R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-3-319-23883-8

• †[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry. 2017. Python for Graph and Network Analysis. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-53004-8

Hierarchy, Clustered Data, Aggregation, Mixed Models

• †[GelmanHill] Andrew Gelman and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press. http://pensu.eblib.com/patron/FullRecord.aspx?p=288457

• †[MixedModels] Eugene Demidenko. 2013. Mixed Models: Theory and Applications with R, 2nd ed. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10748641

Social Data Channels

Language, Text, Speech, Audio

• ‡[NLP] Daniel Jurafsky and James H. Martin. Forthcoming (2017). Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics, 3rd ed. Prentice-Hall. Preprint: https://web.stanford.edu/~jurafsky/slp3

• InfoRetrieval

• ‡[CoreNLP] Stanford CoreNLP. https://stanfordnlp.github.io/CoreNLP

• †[TextAsData] Grimmer and Stewart. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis. https://doi.org/10.1093/pan/mps028

• Quinn-Topics

• FightinWords

• †[NLTK-Python] Jacob Perkins. 2010. Python Text Processing with NLTK 2.0 Cookbook. Packt. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10435387 †[TextAnalytics-Python] Dipanjan Sarkar. 2016. Text Analytics with Python. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2388-8

• DSHandbook-Py Ch. 16, “Natural Language Processing”

• Matthew J. Denny. http://www.mjdenny.com/Text_Processing_In_R.html

• CRAN: Natural Language Processing. https://cran.r-project.org/web/views/NaturalLanguageProcessing.html

• †[AudioData] Dean Knox and Christopher Lucas. 2017. “The Speaker-Affect Model: Measuring Emotion in Political Speech with Audio Data.” http://christopherlucas.org/files/PDFs/sam.pdf https://www.youtube.com/watch?v=Hs8A9dwkMzI

• †Morgan & Claypool Synthesis Lectures on Human Language Technologies. http://www.morganclaypool.com/toc/hlt/1/1

• †Morgan & Claypool Synthesis Lectures on Speech and Audio Processing. http://www.morganclaypool.com/toc/sap/1/1

Vision, Image, Video

• ‡[ComputerVision] Richard Szeliski. 2010. Computer Vision: Algorithms and Applications. Springer. http://szeliski.org/Book

• MixedModels Ch. 11, “Statistical Analysis of Shape”; Ch. 12, “Statistical Image Analysis”

• Audio/Video/Volumetric: See DeepLearning Chapter 9 re convolution.

• †Morgan & Claypool Synthesis Lectures on Image, Video, and Multimedia Processing. http://www.morganclaypool.com/toc/ivm/1/1

• †Morgan & Claypool Synthesis Lectures on Computer Vision. http://www.morganclaypool.com/toc/cov/1/1

Approaches to Learning from Data (The Analytics Layer)

• NRCreport Ch. 7, “Building Models from Massive Data”

• MMDS (data-mining)

• CausalInference

• ‡[VisualAnalytics] Daniel Keim, Jörn Kohlhammer, Geoffrey Ellis, and Florian Mansmann, editors. 2010. Mastering the Information Age: Solving Problems with Visual Analytics. The Eurographics Association, Goslar, Germany. http://www.vismaster.eu/book See also †Morgan & Claypool Synthesis Lectures on Visualization. http://www.morganclaypool.com/toc/vis/2/1

• ‡[Econometrics] Bruce E. Hansen. 2014. Econometrics. http://www.ssc.wisc.edu/~bhansen/econometrics

• ‡[StatisticalLearning] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning with Applications in R. Springer. http://www-bcf.usc.edu/~gareth/ISL/index.html

• [MachineLearning] Christopher Bishop. 2006. Pattern Recognition and Machine Learning. Springer.

• †[Bayes] Peter D. Hoff. 2009. A First Course in Bayesian Statistical Methods. Springer. http://link.springer.com/book/10.1007%2F978-0-387-92407-6

• †[GraphicalModels] Søren Højsgaard, David Edwards, and Steffen Lauritzen. 2012. Graphical Models with R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-2299-0

• [InformationTheory] David MacKay. 2003. Information Theory, Inference, and Learning Algorithms. Springer.

• ‡[DeepLearning] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org (see also Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. “Deep Learning.” Nature 521(7553): 436–444. https://doi.org/10.1038/nature14539)

See also ‡[Keras] Keras: The Python Deep Learning Library. https://keras.io François Chollet, Directory of Keras tutorials: https://github.com/fchollet/keras-resources

• †[SignalProcessing] Jose Maria Giron-Sierra. 2013. Digital Signal Processing with MATLAB Examples (Volumes 1-3). Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-981-10-2534-1

Penn State Policy Statements

Academic Integrity

Academic integrity is the pursuit of scholarly activity in an open, honest, and responsible manner. Academic integrity is a basic guiding principle for all academic activity at The Pennsylvania State University, and all members of the University community are expected to act in accordance with this principle. Consistent with this expectation, the University’s Code of Conduct states that all students should act with personal integrity; respect other students’ dignity, rights, and property; and help create and maintain an environment in which all can succeed through the fruits of their efforts.

Academic integrity includes a commitment by all members of the University community not to engage in or tolerate acts of falsification, misrepresentation, or deception. Such acts of dishonesty violate the fundamental ethical principles of the University community and compromise the worth of work completed by others.

Disability Accommodation

Penn State welcomes students with disabilities into the University’s educational programs. Every Penn State campus has an office for students with disabilities. The Student Disability Resources (SDR) website provides contact information for every Penn State campus (http://equity.psu.edu/sdr/disability-coordinator). For further information, please visit the Student Disability Resources website (http://equity.psu.edu/sdr).

In order to receive consideration for reasonable accommodations, you must contact the appropriate disability services office at the campus where you are officially enrolled, participate in an intake interview, and provide documentation. See documentation guidelines at http://equity.psu.edu/sdr/guidelines. If the documentation supports your request for reasonable accommodations, your campus disability services office will provide you with an accommodation letter. Please share this letter with your instructors and discuss the accommodations with them as early as possible. You must follow this process for every semester that you request accommodations.

Psychological Services

Many students at Penn State face personal challenges or have psychological needs that may interfere with their academic progress, social development, or emotional wellbeing. The university offers a variety of confidential services to help you through difficult times, including individual and group counseling, crisis intervention, consultations, online chats, and mental health screenings. These services are provided by staff who welcome all students and embrace a philosophy respectful of clients’ cultural and religious backgrounds, and sensitive to differences in race, ability, gender identity, and sexual orientation.

• Counseling and Psychological Services at University Park (CAPS) (http://studentaffairs.psu.edu/counseling): 814-863-0395

• Penn State Crisis Line (24 hours/7 days/week): 877-229-6400

• Crisis Text Line (24 hours/7 days/week): Text LIONS to 741741

Educational Equity

Penn State takes great pride to foster a diverse and inclusive environment for students, faculty, and staff. Consistent with University Policy AD29, students who believe they have experienced or observed a hate crime, an act of intolerance, discrimination, or harassment that occurs at Penn State are urged to report these incidents as outlined on the University’s Report Bias webpage (http://equity.psu.edu/reportbias).

Background Knowledge

Students will have a variety of backgrounds. Prior to beginning interdisciplinary coursework to fulfill Social Data Analytics degree requirements, including SoDA 501 and 502, students are expected to have advanced (graduate) training in at least one of the component areas of Social Data Analytics and a familiarity with basic concepts in the others.

With regard to specialization, students are expected to have advanced (graduate) training in ONE of the following:

• quantitative social science methodology and a discipline of social science (as would be the case for a second-year PhD student in Political Science, Sociology, Criminology, Human Development and Family Studies, or Demography), OR

• statistics (as would be the case for a second-year PhD student in Statistics), OR

• information science or informatics (as would be the case for a second-year PhD student in Information Science and Technology, or a second-year PhD student in Geography specializing in GIScience), OR

• computer science (as would be the case for a second-year PhD student in Computer Science and Engineering).

This requirement is met as a matter of meeting home program requirements for students in the dual-title PhD, but may require additional coursework on the part of students in other programs wishing to pursue the graduate minor.

With regard to general preparation, students are expected to have ALL of the following technical knowledge:

• basic programming skills, including basic facility with R and/or Python, AND

• basic knowledge of relational databases and/or geographic information systems, AND

• basic knowledge of probability, applied statistics, and/or social science research design, AND

• basic familiarity with a substantive or theoretical area of social science (e.g., 300-level coursework in political science, sociology, criminology, human development, psychology, economics, communication, anthropology, human geography, social informatics, or similar fields).

It is not unusual for students to have one or more gaps in this preparation at time of application to the SoDA program. Students should work with Social Data Analytics advisers to develop a plan for timely remediation of any deficiencies, which generally will not require formal coursework for students whose training and interests are otherwise appropriate for pursuit of the Social Data Analytics degree. Where possible, this will be addressed at time of application to the Social Data Analytics program.

To this end, some free training materials are linked in the reference section, and there are also abundant free, high-quality, self-paced, course-style training materials on these and related subjects available through edX (http://www.edx.org), Udacity (http://www.udacity.com), Codecademy (http://codecademy.com), and Lynda (http://www.lynda.psu).


Sung-Hyuk Cha. 2007. “Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions.” International Journal of Mathematical Models and Methods in Applied Sciences 4(1): 300-7. http://csis.pace.edu/ctappert/dps/d861-12/session4-p2.pdf

Anna Huang. 2008. “Similarity Measures for Text Document Clustering.” Proceedings of the 6th New Zealand Computer Science Research Student Conference: 49-56. http://www.nzcsrsc08.canterbury.ac.nz/site/proceedings/Individual_Papers/pg049_Similarity_Measures_for_Text_Document_Clustering.pdf

• Michael I. Jordan. “The Kernel Trick” (Lecture Notes). https://people.eecs.berkeley.edu/~jordan/courses/281B-spring04/lectures/lec3.pdf

• Eric Kim. 2017. “Everything You Ever Wanted to Know about the Kernel Trick (But Were Afraid to Ask).” http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick_blog_ekim_12_20_2017.pdf

Derived Data Representations - Dimensionality Reduction, Compression, Decomposition, Embeddings

The groupings here, and under the measurement / multivariate statistics section, are particularly arbitrary. For example, “k-Means clustering” can be viewed as a technique for “dimensionality reduction,” “compression,” “feature extraction,” “latent variable measurement,” “unsupervised learning,” or “collaborative filtering.”
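To make the overlap concrete, here is a minimal sketch of k-means used as compression (vector quantization): each value is replaced by the index of its nearest centroid, and the centroids serve as a small codebook. The function name `kmeans_1d`, the toy data, and the deterministic initialisation are all illustrative assumptions, not taken from any of the readings; a real analysis would use a library implementation on multivariate data.

```python
def kmeans_1d(values, k, iters=20):
    """Lloyd's algorithm on 1-D data; returns (centroids, labels)."""
    ordered = sorted(values)
    # Illustrative deterministic start: spread k centroids across the sorted data.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda j: (v - centroids[j]) ** 2) for v in values]
        # Update step: each centroid moves to the mean of its assigned points.
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:
                centroids[j] = sum(members) / len(members)
    return centroids, labels


# Hypothetical toy data with three obvious groups.
data = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7, 9.9, 10.1, 10.0]
codebook, codes = kmeans_1d(data, k=3)

# "Compression": nine floats become nine small integers plus a 3-entry codebook.
reconstructed = [codebook[c] for c in codes]
max_error = max(abs(a - b) for a, b in zip(data, reconstructed))
print(codes, codebook, max_error)
```

The same output can be read several ways: `codes` is a cluster assignment (unsupervised learning), a one-dimensional discrete feature (feature extraction), and a lossy encoding whose reconstruction error is `max_error` (compression).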

Clustering, hashing, quantization, blocking, compression

• ‡[Compression] Khalid Sayood. 2012. Introduction to Data Compression, 4th ed. Springer. http://www.sciencedirect.com.ezaccess.libraries.psu.edu/science/book/9780124157965 (e.g., coding, blocking via vector quantization)

• MMDS Ch. 3, “Finding Similar Items” (minhashing, locality sensitive hashing)

• Multivariate-R Ch. 6, “Clustering”

• DataScience-Python NB 05.11, “k-Means Clustering”; NB 05.12, “Gaussian Mixture Models”

• ‡[KMeansHashing] Kaiming He, Fang Wen, and Jian Sun. 2013. “K-means Hashing: An Affinity-Preserving Quantization Method for Learning Binary Compact Codes.” CVPR. https://www.cv-foundation.org/openaccess/content_cvpr_2013/papers/He_K-Means_Hashing_An_2013_CVPR_paper.pdf

• †[Squashing] Madigan, D., Raghavan, N., Dumouchel, W., Nason, M., Posse, C., and Ridgeway, G. (2002). “Likelihood-based data squashing: A modeling approach to instance construction.” Data Mining and Knowledge Discovery 6(2): 173-190. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1023/A:1014095614948

• †[Core-sets] Piotr Indyk, Sepideh Mahabadi, Mohammad Mahdian, and Vahab S. Mirrokni. 2014. “Composable core-sets for diversity and coverage maximization.” PODS ’14: 100-8. https://doi.org/10.1145/2594538.2594560

Feature selection, feature extraction, feature engineering, weighting, preprocessing

• PatternRecognition Sections 26-7, “Feature Selection,” “Feature Extraction”

• DSHandbook-Py Ch. 7, “Interlude: Feature Extraction Ideas”

• DataScience-Python NB 05.04, “Feature Engineering”

• Features in text: NLP and InfoRetrieval re tfidf and similar; FightinWords

• Features in images: DataScience-Python NB 05.14, “Image Features”

Dimensionality reduction, decomposition, factorization, change of basis, reparameterization, matrix completion, latent variables, source separation

• MMDS Ch. 9, “Recommendation Systems”; Ch. 11, “Dimensionality Reduction”

• DSHandbook-Py Ch. 10, “Unsupervised Learning: Clustering and Dimensionality Reduction”

• ‡[Shalizi-ADA] Cosma Rohilla Shalizi. 2017. Advanced Data Analysis from an Elementary Point of View. Ch. 16 (“Principal Components Analysis”), Ch. 17 (“Factor Models”). http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV

See also Multivariate-R Ch. 3, “Principal Components Analysis”; Ch. 4, “Multidimensional Scaling”; Ch. 5, “Exploratory Factor Analysis”; Latent

• DeepLearning Ch. 2, “Linear Algebra”; Ch. 13, “Linear Factor Models”

• ‡[GloVe] Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. “GloVe: Global vectors for word representation.” EMNLP. https://nlp.stanford.edu/projects/glove

• †[NMF] Daniel D. Lee and H. Sebastian Seung. 1999. “Learning the parts of objects by non-negative matrix factorization.” Nature 401: 788-791. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1038/44565

• †[CUR] Michael W. Mahoney and Petros Drineas. 2009. “CUR matrix decompositions for improved data analysis.” PNAS. http://www.pnas.org/content/106/3/697.full

• ‡[ICA] Aapo Hyvärinen and Erkki Oja. 2000. “Independent Component Analysis: Algorithms and Applications.” Neural Networks 13(4-5): 411-30. https://www.cs.helsinki.fi/u/ahyvarin/papers/NN00new.pdf

• [RandomProjection] Ella Bingham and Heikki Mannila. 2001. “Random projection in dimensionality reduction: applications to image and text data.” KDD. https://doi.org/10.1145%2F502512.502546

• Compression, e.g., “Transform coding,” “Wavelets”

Nonlinear dimensionality reduction, Manifold learning

• DeepLearning Section 5.11.3, “Manifold Learning”

• DataScience-Python NB 05.10, “Manifold Learning” (Locally linear embedding [LLE], Isomap)

• ‡[KernelPCA] Sebastian Raschka. 2014. “Kernel tricks and nonlinear dimensionality reduction via RBF kernel PCA.” http://sebastianraschka.com/Articles/2014_kernel_pca.html

• Shalizi-ADA Ch. 18, “Nonlinear Dimensionality Reduction” (LLE)

• ‡[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi. 2003. “Laplacian eigenmaps for dimensionality reduction and data representation.” Neural Computation 15(6): 1373-1396. http://web.cse.ohio-state.edu/~belkin.8/papers/LEM_NC_03.pdf

• DeepLearning Ch. 14 (“Autoencoders”)

• ‡[word2vec] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” https://arxiv.org/abs/1301.3781

• ‡[word2vecExplained] Yoav Goldberg and Omer Levy. 2014. “word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method.” https://arxiv.org/abs/1402.3722

• ‡[t-SNE] Laurens van der Maaten and Geoffrey Hinton. 2008. “Visualising Data using t-SNE.” Journal of Machine Learning Research 9: 2579-2605. http://jmlr.csail.mit.edu/papers/volume9/vandermaaten08a/vandermaaten08a.pdf

Computation and Scaling Up

Scientific computing, computation at scale

• NRCreport Ch. 10, “The Seven Computational Giants of Massive Data Analysis”

• DSHandbook-Py Ch. 21, “Performance and Computer Memory”; Ch. 22, “Computer Memory and Data Structures”

• ‡[ComputationalStatistics-Python] Cliburn Chan. Computational Statistics in Python. https://people.duke.edu/~ccc14/sta-663/index.html

• CRAN Task View: High Performance Computing. https://cran.r-project.org/web/views/HighPerformanceComputing.html

Numerical computing

• [NumericalComputing] Ward Cheney and David Kincaid. 2013. Numerical Mathematics and Computing, 7th ed. Brooks/Cole: Cengage Learning.

• CRAN Task View: Numerical Mathematics. https://cran.r-project.org/web/views/NumericalMathematics.html

Optimization (e.g., MLE, gradient descent, stochastic gradient descent, EM algorithm, neural nets)

• DSHandbook-Py Ch. 23, “Maximum Likelihood Estimation and Optimization” (gradient descent)

• DeepLearning Ch. 4, “Numerical Computation”; Ch. 6, “Deep Feedforward Networks”; Ch. 8, “Optimization for Training Deep Models”

• CRAN Task View: Optimization and Mathematical Programming. https://cran.r-project.org/web/views/Optimization.html

Linear algebra, matrix computations

• [MatrixComputations] Gene H. Golub and Charles F. Van Loan. 2013. Matrix Computations, 4th ed. Johns Hopkins University Press.

• ‡[NetflixMatrix] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. “Matrix Factorization Techniques for Recommender Systems.” Computer, August: 42-9. https://datajobs.com/data-science-repo/Recommender-Systems-[Netflix].pdf

• ‡[BigDataPCA] Jianqing Fan, Qiang Sun, Wen-Xin Zhou, and Ziwei Zhu. “Principal Component Analysis for Big Data.” http://www.princeton.edu/~ziweiz/pca.pdf

• ‡[RandomSVD] Andrew Tulloch. 2009. “Fast Randomized SVD.” https://research.fb.com/fast-randomized-svd

• ‡[FactorSGD] Rainer Gemulla, Peter J. Haas, Erik Nijkamp, and Yannis Sismanis. 2011. “Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent.” KDD. http://www.cs.utah.edu/~hari/teaching/bigdata/gemulla11dsgd.pdf

• ‡[Sparse] Max Grossman. 2015. “101 Ways to Store a Sparse Matrix.” https://medium.com/@jmaxg3/101-ways-to-store-a-sparse-matrix-c7f2bf15a229

• ‡[DontInvert] John D. Cook. 2010. “Don’t Invert that Matrix.” https://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix

Simulation-based inference, resampling, Monte Carlo methods, MCMC, Bayes, approximate inference

• ‡[MarkovVisually] Victor Powell. “Markov Chains Explained Visually.” http://setosa.io/ev/markov-chains

• DSHandbook-Py Ch. 25, “Stochastic Modeling” (Markov chains, MCMC, HMM)

• Bayes; ComputationalStatistics-Py

• DeepLearning Ch. 17, “Monte Carlo Methods”; Ch. 19, “Approximate Inference”

• ‡[VariationalInference] Jason Eisner. 2011. “High-Level Explanation of Variational Inference.” https://www.cs.jhu.edu/~jason/tutorials/variational.html

• CRAN Task View: Bayesian Inference. https://cran.r-project.org/web/views/Bayesian.html

Parallelism, MapReduce, Split-Apply-Combine

• ‡[MapReduceIntuition] Jigsaw Academy. 2014. “Big Data Specialist MapReduce.” https://www.youtube.com/watch?v=TwcYQzFqg-8&feature=youtube

• MMDS Ch. 2, “Map-Reduce and the New Software Stack” (3rd Edition discusses Spark & TensorFlow)

• Algorithms Ch. 3, “Scalable Algorithms and Associative Statistics”; Ch. 4, “Hadoop and MapReduce”

• NRCreport Ch. 6, “Resources, Trade-offs, and Limitations”

• TidyData-R (Split-apply-combine is the motivating principle behind the “tidyverse” approach). See also Part III, “Program” (pipes, functions, vectors, iteration).

• ‡[TidySAC-Video] Hadley Wickham. 2017. “Data Science in the Tidyverse.” https://www.rstudio.com/resources/videos/data-science-in-the-tidyverse

• See also FactorSGD
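As a rough illustration of the split-apply-combine pattern named in this list, the sketch below groups records by a key (split), applies a function to each group (apply), and gathers the results into one dictionary (combine). The helper `split_apply_combine` and the records are hypothetical, written only for this example; the tidyverse and MapReduce readings above cover the real tools.

```python
from collections import defaultdict


def split_apply_combine(records, key, value, func):
    """Group `records` by `key`, apply `func` to each group's `value` list."""
    groups = defaultdict(list)
    # Split: route each record's value to the bucket for its key.
    for rec in records:
        groups[rec[key]].append(rec[value])
    # Apply + combine: one result per group, gathered into a single dict.
    return {group: func(vals) for group, vals in groups.items()}


# Hypothetical records: counts of collected documents by state.
records = [
    {"state": "PA", "docs": 10},
    {"state": "OH", "docs": 4},
    {"state": "PA", "docs": 5},
    {"state": "OH", "docs": 6},
    {"state": "NY", "docs": 7},
]

totals = split_apply_combine(records, "state", "docs", sum)
print(totals)
```

Because `sum` is associative, the groups could be processed on separate machines and their partial results merged afterward, which is the property that MapReduce-style systems rely on.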

Functional Programming

• ComputationalStatistics-Python, “Functions are first class objects” through first exercises

• DSHandbook-Py Ch. 20, “Programming Language Concepts”

• See also Haskell

Scaling iteration, streaming data, online algorithms (Spark)

• NRCreport Ch. 4, “Temporal Data and Real-Time Algorithms”

• MMDS Ch. 4, “Mining Data Streams”

• ‡[BDAS] AMPLab. BDAS: The Berkeley Data Analytics Stack. https://amplab.cs.berkeley.edu/software

• †[Spark] Mohammed Guller. 2015. Big Data Analytics with Spark: A Practitioner’s Guide to Using Spark for Large-Scale Data Processing, Machine Learning, and Graph Analytics, and High-Velocity Data Stream Processing. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-0964-6

• DSHandbook-Py Ch. 13, “Big Data”

• DeepLearning Ch. 10, “Sequence Modeling: Recurrent and Recursive Nets”

General Resources for Python and R

• †[DSHandbook-Py] Field Cady. 2017. The Data Science Handbook (Python-based). Wiley. http://onlinelibrary.wiley.com.ezaccess.libraries.psu.edu/book/10.1002/9781119092919

• ‡[Tutorials-R] Ujjwal Karn. 2017. “A curated list of R tutorials for Data Science, NLP and Machine Learning.” https://github.com/ujjwalkarn/DataScienceR

• ‡[Tutorials-Python] Ujjwal Karn. 2017. “A curated list of Python tutorials for Data Science, NLP and Machine Learning.” https://github.com/ujjwalkarn/DataSciencePython

Cutting and Bleeding Edge of Data Science Languages

• [Scala] †Vishal Layka and David Pollak. 2015. Beginning Scala. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-0232-6 †Nicolas Patrick. 2014. Scala for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1901910 ‡Scala site: https://www.scala-lang.org (See also )

• [Julia] †Ivo Balbaert. 2015. Getting Started with Julia. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1973847 ‡Douglas Bates. 2013. “Julia for R Programmers.” http://www.stat.wisc.edu/~bates/JuliaForRProgrammers.pdf ‡Julia site: https://julialang.org

• [Haskell] †Richard Bird. 2014. Thinking Functionally with Haskell. https://doi-org.ezaccess.libraries.psu.edu/10.1017/CBO9781316092415 †Hakim Cassimally. 2017. Learning Haskell Programming. https://www.lynda.com/Haskell-tutorials/Learning-Haskell-Programming/604926-2.html †James Church. 2017. Learning Haskell for Data Analysis. https://www.lynda.com/Developer-tutorials/Learning-Haskell-Data-Analysis/604234-2.html Haskell site: https://www.haskell.org


Disability Accomodation

Penn State welcomes students with disabilities into the Universitys educational programs Every Penn Statecampus has an office for students with disabilities Student Disability Resources (SDR) website provides con-tact information for every Penn State campus (httpequitypsuedusdrdisability-coordinator)For further information please visit the Student Disability Resources website (httpequitypsuedusdr)

In order to receive consideration for reasonable accommodations you must contact the appropriate disabilityservices office at the campus where you are officially enrolled participate in an intake interview and pro-vide documentation See documentation guidelines at (httpequitypsuedusdrguidelines) If thedocumentation supports your request for reasonable accommodations your campus disability services officewill provide you with an accommodation letter Please share this letter with your instructors and discussthe accommodations with them as early as possible You must follow this process for every semester thatyou request accommodations

Psychological Services

Many students at Penn State face personal challenges or have psychological needs that may interfere withtheir academic progress social development or emotional wellbeing The university offers a variety ofconfidential services to help you through difficult times including individual and group counseling crisisintervention consultations online chats and mental health screenings These services are provided by staffwho welcome all students and embrace a philosophy respectful of clients cultural and religious backgroundsand sensitive to differences in race ability gender identity and sexual orientation

bull Counseling and Psychological Services at University Park (CAPS) (httpstudentaffairspsueducounseling) 814-863-0395

bull Penn State Crisis Line (24 hours7 daysweek) 877-229-6400

bull Crisis Text Line (24 hours7 daysweek) Text LIONS to 741741

Educational Equity

Penn State takes great pride to foster a diverse and inclusive environment for students faculty and staffConsistent with University Policy AD29 students who believe they have experienced or observed a hatecrime an act of intolerance discrimination or harassment that occurs at Penn State are urged to report theseincidents as outlined on the Universitys Report Bias webpage (httpequitypsuedureportbias)

31

Background Knowledge

Students will have a variety of backgrounds Prior to beginning interdisciplinary coursework tofulfill Social Data Analytics degree requirements including SoDA 501 and 502 students are ex-pected to have advanced (graduate) training in at least one of the component areas of Social DataAnalytics and a familiarity with basic concepts in the others

With regard to specialization students are expected to have advanced (graduate) training in ONEof the following

bull quantitative social science methodology and a discipline of social science (as would be thecase for a second-year PhD student in Political Science Sociology Criminology Human De-velopment and Family Studies or Demography) OR

bull statistics (as would be the case for a second-year PhD student in Statistics) OR

bull information science or informatics (as would be the case for a second-year PhD student inInformation Science and Technology or a second-year PhD student in Geography specializingin GIScience) OR

bull computer science (as would be the case for a second-year PhD student in Computer Scienceand Engineering)

This requirement is met as a matter of meeting home program requirements for students in thedual-title PhD but may require additional coursework on the part of students in other programswishing to pursue the graduate minor

With regard to general preparation students are expected to have ALL of the following technicalknowledge

bull basic programming skills including basic facility with R ampor Python AND

bull basic knowledge of relational databases ampor geographic information systems AND

bull basic knowledge of probability applied statistics ampor social science research design AND

bull basic familiarity with a substantive or theoretical area of social science (eg 300-level course-work in political science sociology criminology human development psychology economicscommunication anthropology human geography social informatics or similar fields)

It is not unusual for students to have one or more gaps in this preparation at time of applicationto the SoDA program Students should work with Social Data Analytics advisers to develop aplan for timely remediation of any deficiencies which generally will not require formal courseworkfor students whose training and interests are otherwise appropriate for pursuit of the Social DataAnalytics degree Where possible this will be addressed at time of application to the Social DataAnalytics program

To this end some free training materials are linked in the reference section and there are alsoabundant free high-quality self-paced course-style training materials on these and related subjectsavailable through edX (httpwwwedxorg) Udacity (httpwwwudacitycom) Codecademy(httpcodecademycom) and Lynda (httpwwwlyndapsu)

32


Dimensionality reduction, decomposition, factorization, change of basis, reparameterization, matrix completion, latent variables, source separation

• MMDS, Ch. 9 “Recommendation Systems”; Ch. 11 “Dimensionality Reduction”

• DSHandbook-Py, Ch. 10 “Unsupervised Learning: Clustering and Dimensionality Reduction”

• ‡[Shalizi-ADA] Cosma Rohilla Shalizi. 2017. Advanced Data Analysis from an Elementary Point of View. Ch. 16 (“Principal Components Analysis”), Ch. 17 (“Factor Models”). http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/

See also Multivariate-R, Ch. 3 “Principal Components Analysis”, Ch. 4 “Multidimensional Scaling”, Ch. 5 “Exploratory Factor Analysis”; Latent

• DeepLearning, Ch. 2 “Linear Algebra”; Ch. 13 “Linear Factor Models”

• ‡[GloVe] Jeffrey Pennington, Richard Socher, Christopher Manning. 2014. “GloVe: Global vectors for word representation.” EMNLP. https://nlp.stanford.edu/projects/glove/

• †[NMF] Daniel D. Lee and H. Sebastian Seung. 1999. “Learning the parts of objects by non-negative matrix factorization.” Nature 401:788-791. http://dx.doi.org.ezaccess.libraries.psu.edu/10.1038/44565

• †[CUR] Michael W. Mahoney and Petros Drineas. 2009. “CUR matrix decompositions for improved data analysis.” PNAS. http://www.pnas.org/content/106/3/697.full

• ‡[ICA] Aapo Hyvarinen and Erkki Oja. 2000. “Independent Component Analysis: Algorithms and Applications.” Neural Networks 13(4-5): 411-30. https://www.cs.helsinki.fi/u/ahyvarin/papers/NN00new.pdf

• [RandomProjection] Ella Bingham and Heikki Mannila. 2001. “Random projection in dimensionality reduction: applications to image and text data.” KDD. https://doi.org/10.1145%2F502512.502546

• Compression, e.g., “Transform coding”, “Wavelets”

Nonlinear dimensionality reduction, manifold learning

• DeepLearning, Section 5.11.3 “Manifold Learning”

• DataScience-Python, NB 05.10 “Manifold Learning” (Locally linear embedding [LLE], Isomap)

• ‡[KernelPCA] Sebastian Raschka. 2014. “Kernel tricks and nonlinear dimensionality reduction via RBF kernel PCA.” http://sebastianraschka.com/Articles/2014_kernel_pca.html

• Shalizi-ADA, Ch. 18 “Nonlinear Dimensionality Reduction” (LLE)

• ‡[LaplacianEigenmaps] Mikhail Belkin and Partha Niyogi. 2003. “Laplacian eigenmaps for dimensionality reduction and data representation.” Neural Computation 15(6): 1373-1396. http://web.cse.ohio-state.edu/~belkin.8/papers/LEM_NC_03.pdf

• DeepLearning, Ch. 14 (“Autoencoders”)

• ‡[word2vec] Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” https://arxiv.org/abs/1301.3781

• ‡[word2vecExplained] Yoav Goldberg and Omer Levy. 2014. “word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method.” https://arxiv.org/abs/1402.3722

• ‡[t-SNE] Laurens van der Maaten and Geoffrey Hinton. 2008. “Visualising Data using t-SNE.” Journal of Machine Learning Research 9: 2579-2605. http://jmlr.csail.mit.edu/papers/volume9/vandermaaten08a/vandermaaten08a.pdf

Computation and Scaling Up

Scientific computing, computation at scale

• NRCreport, Ch. 10 “The Seven Computational Giants of Massive Data Analysis”

• DSHandbook-Py, Ch. 21 “Performance and Computer Memory”; Ch. 22 “Computer Memory and Data Structures”

• ‡[ComputationalStatistics-Python] Cliburn Chan. Computational Statistics in Python. https://people.duke.edu/~ccc14/sta-663/index.html

• CRAN Task View: High Performance Computing. https://cran.r-project.org/web/views/HighPerformanceComputing.html

Numerical computing

• [NumericalComputing] Ward Cheney and David Kincaid. 2013. Numerical Mathematics and Computing, 7th ed. Brooks/Cole, Cengage Learning.

• CRAN Task View: Numerical Mathematics. https://cran.r-project.org/web/views/NumericalMathematics.html

Optimization (e.g., MLE, gradient descent, stochastic gradient descent, EM algorithm, neural nets)

• DSHandbook-Py, Ch. 23 “Maximum Likelihood Estimation and Optimization” (gradient descent)

• DeepLearning, Ch. 4 “Numerical Computation”; Ch. 6 “Deep Feedforward Networks”; Ch. 8 “Optimization for Training Deep Models”

• CRAN Task View: Optimization and Mathematical Programming. https://cran.r-project.org/web/views/Optimization.html

Linear algebra, matrix computations

• [MatrixComputations] Gene H. Golub and Charles F. Van Loan. 2013. Matrix Computations, 4th ed. Johns Hopkins University Press.

• ‡[NetflixMatrix] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. “Matrix Factorization Techniques for Recommender Systems.” Computer, August: 42-9. https://datajobs.com/data-science-repo/Recommender-Systems-[Netflix].pdf

• ‡[BigDataPCA] Jianqing Fan, Qiang Sun, Wen-Xin Zhou, Ziwei Zhu. “Principal Component Analysis for Big Data.” http://www.princeton.edu/~ziweiz/pca.pdf

• ‡[RandomSVD] Andrew Tulloch. 2009. “Fast Randomized SVD.” https://research.fb.com/fast-randomized-svd/

• ‡[FactorSGD] Rainer Gemulla, Peter J. Haas, Erik Nijkamp, and Yannis Sismanis. 2011. “Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent.” KDD. http://www.cs.utah.edu/~hari/teaching/bigdata/gemulla11dsgd.pdf

• ‡[Sparse] Max Grossman. 2015. “101 Ways to Store a Sparse Matrix.” https://medium.com/@jmaxg3/101-ways-to-store-a-sparse-matrix-c7f2bf15a229

• ‡[DontInvert] John D. Cook. 2010. “Don’t Invert that Matrix.” https://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix/

Simulation-based inference, resampling, Monte Carlo methods, MCMC, Bayes, approximate inference

• ‡[MarkovVisually] Victor Powell. “Markov Chains Explained Visually.” http://setosa.io/ev/markov-chains/

• DSHandbook-Py, Ch. 25 “Stochastic Modeling” (Markov chains, MCMC, HMM)

• Bayes, ComputationalStatistics-Py

• DeepLearning, Ch. 17 “Monte Carlo Methods”; Ch. 19 “Approximate Inference”

• ‡[VariationalInference] Jason Eisner. 2011. “High-Level Explanation of Variational Inference.” https://www.cs.jhu.edu/~jason/tutorials/variational.html

• CRAN Task View: Bayesian Inference. https://cran.r-project.org/web/views/Bayesian.html

Parallelism, MapReduce, Split-Apply-Combine

• ‡[MapReduceIntuition] Jigsaw Academy. 2014. “Big Data Specialist: MapReduce.” https://www.youtube.com/watch?v=TwcYQzFqg-8&feature=youtu.be

• MMDS, Ch. 2 “Map-Reduce and the New Software Stack” (3rd Edition discusses Spark & TensorFlow)

• Algorithms, Ch. 3 “Scalable Algorithms and Associative Statistics”; Ch. 4 “Hadoop and MapReduce”

• NRCreport, Ch. 6 “Resources, Trade-offs, and Limitations”

• TidyData-R (Split-apply-combine is the motivating principle behind the “tidyverse” approach). See also Part III “Program” (pipes, functions, vectors, iteration).

• ‡[TidySAC-Video] Hadley Wickham. 2017. “Data Science in the Tidyverse.” https://www.rstudio.com/resources/videos/data-science-in-the-tidyverse/

• See also FactorSGD
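
As a rough illustration of the map, shuffle, and reduce phases that the readings above cover, here is a minimal single-machine sketch in Python. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample documents are hypothetical, chosen only for this example. Frameworks such as Hadoop or Spark distribute these steps across machines, which this sketch does not attempt.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key (the 'split' step)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine each key's values with an associative operation."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(documents):
    """Run map, shuffle, and reduce over a list of documents."""
    pairs = []
    for doc in documents:
        pairs.extend(map_phase(doc))
    return reduce_phase(shuffle(pairs))


if __name__ == "__main__":
    docs = ["big social data", "social data analytics", "big data"]
    print(word_count(docs))
```

Because the reduce step here is a sum, which is associative, partial counts computed on separate shards could be added together in any order. That property is what makes this pattern amenable to parallel execution.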

Functional Programming

• ComputationalStatistics-Python, “Functions are first class objects” through first exercises

• DSHandbook-Py, Ch. 20 “Programming Language Concepts”

• See also Haskell

Scaling iteration, streaming data, online algorithms (Spark)

• NRCreport, Ch. 4 “Temporal Data and Real-Time Algorithms”

• MMDS, Ch. 4 “Mining Data Streams”

• ‡[BDAS] AMPLab. BDAS, The Berkeley Data Analytics Stack. https://amplab.cs.berkeley.edu/software/

• †[Spark] Mohammed Guller. 2015. Big Data Analytics with Spark: A Practitioner’s Guide to Using Spark for Large-Scale Data Processing, Machine Learning, and Graph Analytics, and High-Velocity Data Stream Processing. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-0964-6

• DSHandbook-Py, Ch. 13 “Big Data”

• DeepLearning, Ch. 10 “Sequence Modeling: Recurrent and Recursive Nets”
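
An online algorithm processes a stream one observation at a time, without storing the stream. The following is a minimal sketch, assuming Welford's method for a running mean and variance. This is a standard technique chosen for illustration, not one taken from the readings above, and the class name `RunningStats` is hypothetical.

```python
class RunningStats:
    """Running mean and variance via Welford's online update."""

    def __init__(self):
        self.n = 0        # number of observations seen so far
        self.mean = 0.0   # running mean
        self._m2 = 0.0    # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new observation into the summary in O(1) time and memory."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance (n - 1 denominator); NaN with fewer than two points."""
        if self.n < 2:
            return float("nan")
        return self._m2 / (self.n - 1)


if __name__ == "__main__":
    stats = RunningStats()
    for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
        stats.update(value)
    print(stats.n, stats.mean, stats.variance)
```

The summary occupies constant memory however long the stream runs, which is the property that matters when the data cannot be held in memory or arrive continuously.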

General Resources for Python and R

• †[DSHandbook-Py] Field Cady. 2017. The Data Science Handbook (Python-based). Wiley. http://onlinelibrary.wiley.com.ezaccess.libraries.psu.edu/book/10.1002/9781119092919

• ‡[Tutorials-R] Ujjwal Karn. 2017. “A curated list of R tutorials for Data Science, NLP and Machine Learning.” https://github.com/ujjwalkarn/DataScienceR

• ‡[Tutorials-Python] Ujjwal Karn. 2017. “A curated list of Python tutorials for Data Science, NLP and Machine Learning.” https://github.com/ujjwalkarn/DataSciencePython

Cutting and Bleeding Edge of Data Science Languages

• [Scala] †Vishal Layka and David Pollak. 2015. Beginning Scala. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-0232-6 †Nicolas Patrick. 2014. Scala for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1901910 ‡Scala site: https://www.scala-lang.org (See also )

• [Julia] †Ivo Balbaert. 2015. Getting Started with Julia. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1973847 ‡Douglas Bates. 2013. “Julia for R Programmers.” http://www.stat.wisc.edu/~bates/JuliaForRProgrammers.pdf ‡Julia site: https://julialang.org

• [Haskell] †Richard Bird. 2014. Thinking Functionally with Haskell. https://doi-org.ezaccess.libraries.psu.edu/10.1017/CBO9781316092415 †Hakim Cassimally. 2017. Learning Haskell Programming. https://www.lynda.com/Haskell-tutorials/Learning-Haskell-Programming/604926-2.html †James Church. 2017. Learning Haskell for Data Analysis. https://www.lynda.com/Developer-tutorials/Learning-Haskell-Data-Analysis/604234-2.html Haskell site: https://www.haskell.org

• [Clojure] †Mark McDonnell. 2017. Quick Clojure: Essential Functional Programming. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2952-1 †Akhil Wali. 2014. Clojure for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1674848 †Arthur Ulfeldt. 2015. https://www.lynda.com/Clojure-tutorials/Up-Running-Clojure/413127-2.html ‡Clojure site: https://clojure.org

• [TensorFlow] ‡Abadi et al. (Google). 2016. “TensorFlow: A System for Large-Scale Machine Learning.” https://arxiv.org/abs/1605.08695 ‡Udacity, “Deep Learning.” https://www.udacity.com/course/deep-learning--ud730 ‡TensorFlow site: https://www.tensorflow.org

• [H2O] Darren Cook. 2016. Practical Machine Learning with H2O: Powerful, Scalable Techniques for Deep Learning and AI. O’Reilly. Arno Candel and Viraj Parmar. 2015. Deep Learning with H2O. ‡H2O site: https://www.h2o.ai

Social Data Structures

Space and Time

• †[GIA] David O’Sullivan, David J. Unwin. 2010. Geographic Information Analysis, Second Edition. John Wiley & Sons. http://onlinelibrary.wiley.com/book/10.1002/9780470549094

• [Space-Time] Donna J. Peuquet. 2003. Representations of Space and Time. Guilford.

• Roger S. Bivand, Edzer Pebesma, Virgilio Gómez-Rubio. 2013. Applied Spatial Data Analysis in R. Free through library: http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-7618-4 See also Edzer Pebesma. 2016. “Handling and Analyzing Spatial, Spatiotemporal and Movement Data in R.” https://edzer.github.io/UseR2016/

• ‡[PySAL] Sergio J. Rey and Dani Arribas-Bel. 2016. “Geographic Data Science with PySAL and the pydata Stack.” http://darribas.org/gds_scipy16/

• DataScience-Python, NB 04.13 “Geographic Data with Basemap”

• Time, “Longitudinal”: intra-individual data, as common in developmental psychology. †[Longitudinal] Garrett Fitzmaurice, Marie Davidian, Geert Verbeke, and Geert Molenberghs, editors. 2009. Longitudinal Data Analysis. Chapman & Hall / CRC. http://pensu.eblib.com/patron/FullRecord.aspx?p=359998

• Time, “Time Series, Time Series Cross Section, Panel”: as common in economics, political science, and sociology. DataScience-Python, Ch. 3 re: pandas; DSHandbook-Py, Ch. 17 “Time Series Analysis”; Econometrics; GelmanHill

• Time, “Sequential, Data Streams”: as in NLP, machine learning. Spark and other “streaming data” readings

• Space-Time: Data with continuity, connectivity, neighborhood structure. See DeepLearning, Chapter 9 re: convolution

• CRAN Task Views: Spatial: https://cran.r-project.org/web/views/Spatial.html Spatiotemporal: https://cran.r-project.org/web/views/SpatioTemporal.html Time Series: https://cran.r-project.org/web/views/TimeSeries.html

Network, Graphs

• ‡[Networks] David Easley and Jon Kleinberg. 2010. Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press. http://www.cs.cornell.edu/home/kleinber/networks-book/

• MMDS, Ch. 5 “Link Analysis”; Ch. 10 “Mining Social-Network Graphs”

• †[Networks-R] Douglas Luke. 2015. A User’s Guide to Network Analysis in R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-3-319-23883-8

• †[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry. 2017. Python for Graph and Network Analysis. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-53004-8

Hierarchy, Clustered Data, Aggregation, Mixed Models

• †[GelmanHill] Andrew Gelman and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press. http://pensu.eblib.com/patron/FullRecord.aspx?p=288457

• †[MixedModels] Eugene Demidenko. 2013. Mixed Models: Theory and Applications with R, 2nd ed. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10748641

Social Data Channels

Language, Text, Speech, Audio

• ‡[NLP] Daniel Jurafsky and James H. Martin. Forthcoming (2017). Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics, 3rd ed. Prentice-Hall. Preprint: https://web.stanford.edu/~jurafsky/slp3/

• InfoRetrieval

• ‡[CoreNLP] Stanford CoreNLP. https://stanfordnlp.github.io/CoreNLP/

• †[TextAsData] Grimmer and Stewart. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis. https://doi.org/10.1093/pan/mps028

• Quinn-Topics

• FightinWords

• †[NLTK-Python] Jacob Perkins. 2010. Python Text Processing with NLTK 2.0 Cookbook. Packt. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10435387

• †[TextAnalytics-Python] Dipanjan Sarkar. 2016. Text Analytics with Python. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2388-8

• DSHandbook-Py, Ch. 16 “Natural Language Processing”

• Matthew J. Denny. http://www.mjdenny.com/Text_Processing_In_R.html

• CRAN Natural Language Processing. https://cran.r-project.org/web/views/NaturalLanguageProcessing.html

• †[AudioData] Dean Knox and Christopher Lucas. 2017. “The Speaker-Affect Model: Measuring Emotion in Political Speech with Audio Data.” http://christopherlucas.org/files/PDFs/sam.pdf https://www.youtube.com/watch?v=Hs8A9dwkMzI

• †Morgan & Claypool Synthesis Lectures on Human Language Technologies. http://www.morganclaypool.com/toc/hlt/1/1

• †Morgan & Claypool Synthesis Lectures on Speech and Audio Processing. http://www.morganclaypool.com/toc/sap/1/1

Vision, Image, Video

• ‡[ComputerVision] Richard Szeliski. 2010. Computer Vision: Algorithms and Applications. Springer. http://szeliski.org/Book/

• MixedModels, Ch. 11 “Statistical Analysis of Shape”; Ch. 12 “Statistical Image Analysis”

• Audio/Video/Volumetric: See DeepLearning, Chapter 9 re: convolution

• †Morgan & Claypool Synthesis Lectures on Image, Video, and Multimedia Processing. http://www.morganclaypool.com/toc/ivm/1/1

• †Morgan & Claypool Synthesis Lectures on Computer Vision. http://www.morganclaypool.com/toc/cov/1/1

Approaches to Learning from Data (The Analytics Layer)

• NRCreport, Ch. 7 “Building Models from Massive Data”

• MMDS (data-mining)

• CausalInference

• ‡[VisualAnalytics] Daniel Keim, Jorn Kohlhammer, Geoffrey Ellis, and Florian Mansmann, editors. 2010. Mastering the Information Age: Solving Problems with Visual Analytics. The Eurographics Association, Goslar, Germany. http://www.vismaster.eu/book/ See also †Morgan & Claypool Synthesis Lectures on Visualization. http://www.morganclaypool.com/toc/vis/2/1

• ‡[Econometrics] Bruce E. Hansen. 2014. Econometrics. http://www.ssc.wisc.edu/~bhansen/econometrics/

• ‡[StatisticalLearning] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning with Applications in R. Springer. http://www-bcf.usc.edu/~gareth/ISL/index.html

• [MachineLearning] Christopher Bishop. 2006. Pattern Recognition and Machine Learning. Springer.

• †[Bayes] Peter D. Hoff. 2009. A First Course in Bayesian Statistical Methods. Springer. http://link.springer.com/book/10.1007%2F978-0-387-92407-6

• †[GraphicalModels] Søren Højsgaard, David Edwards, Steffen Lauritzen. 2012. Graphical Models with R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-2299-0

• [InformationTheory] David MacKay. 2003. Information Theory, Inference, and Learning Algorithms. Cambridge University Press.

• ‡[DeepLearning] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org (see also Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. “Deep Learning.” Nature 521(7553): 436–444. https://doi.org/10.1038/nature14539)

See also ‡[Keras] Keras: The Python Deep Learning Library. https://keras.io François Chollet. Directory of Keras tutorials. https://github.com/fchollet/keras-resources

• †[SignalProcessing] Jose Maria Giron-Sierra. 2013. Digital Signal Processing with MATLAB Examples (Volumes 1-3). Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-981-10-2534-1

Penn State Policy Statements

Academic Integrity

Academic integrity is the pursuit of scholarly activity in an open, honest, and responsible manner. Academic integrity is a basic guiding principle for all academic activity at The Pennsylvania State University, and all members of the University community are expected to act in accordance with this principle. Consistent with this expectation, the University’s Code of Conduct states that all students should act with personal integrity, respect other students’ dignity, rights, and property, and help create and maintain an environment in which all can succeed through the fruits of their efforts.

Academic integrity includes a commitment by all members of the University community not to engage in or tolerate acts of falsification, misrepresentation, or deception. Such acts of dishonesty violate the fundamental ethical principles of the University community and compromise the worth of work completed by others.

Disability Accommodation

Penn State welcomes students with disabilities into the University’s educational programs. Every Penn State campus has an office for students with disabilities. The Student Disability Resources (SDR) website provides contact information for every Penn State campus (http://equity.psu.edu/sdr/disability-coordinator). For further information, please visit the Student Disability Resources website (http://equity.psu.edu/sdr).

In order to receive consideration for reasonable accommodations, you must contact the appropriate disability services office at the campus where you are officially enrolled, participate in an intake interview, and provide documentation. See the documentation guidelines at http://equity.psu.edu/sdr/guidelines. If the documentation supports your request for reasonable accommodations, your campus disability services office will provide you with an accommodation letter. Please share this letter with your instructors and discuss the accommodations with them as early as possible. You must follow this process for every semester that you request accommodations.

Psychological Services

Many students at Penn State face personal challenges or have psychological needs that may interfere with their academic progress, social development, or emotional wellbeing. The university offers a variety of confidential services to help you through difficult times, including individual and group counseling, crisis intervention, consultations, online chats, and mental health screenings. These services are provided by staff who welcome all students and embrace a philosophy respectful of clients’ cultural and religious backgrounds, and sensitive to differences in race, ability, gender identity, and sexual orientation.

• Counseling and Psychological Services at University Park (CAPS) (http://studentaffairs.psu.edu/counseling/): 814-863-0395

• Penn State Crisis Line (24 hours/7 days/week): 877-229-6400

• Crisis Text Line (24 hours/7 days/week): Text LIONS to 741741

Educational Equity

Penn State takes great pride in fostering a diverse and inclusive environment for students, faculty, and staff. Consistent with University Policy AD29, students who believe they have experienced or observed a hate crime, an act of intolerance, discrimination, or harassment that occurs at Penn State are urged to report these incidents as outlined on the University’s Report Bias webpage (http://equity.psu.edu/reportbias/).

Background Knowledge

Students will have a variety of backgrounds. Prior to beginning interdisciplinary coursework to fulfill Social Data Analytics degree requirements, including SoDA 501 and 502, students are expected to have advanced (graduate) training in at least one of the component areas of Social Data Analytics, and a familiarity with basic concepts in the others.

With regard to specialization, students are expected to have advanced (graduate) training in ONE of the following:

• quantitative social science methodology and a discipline of social science (as would be the case for a second-year PhD student in Political Science, Sociology, Criminology, Human Development and Family Studies, or Demography), OR

• statistics (as would be the case for a second-year PhD student in Statistics), OR

• information science or informatics (as would be the case for a second-year PhD student in Information Science and Technology, or a second-year PhD student in Geography specializing in GIScience), OR

• computer science (as would be the case for a second-year PhD student in Computer Science and Engineering).

This requirement is met as a matter of meeting home program requirements for students in the dual-title PhD, but may require additional coursework on the part of students in other programs wishing to pursue the graduate minor.

With regard to general preparation, students are expected to have ALL of the following technical knowledge:

• basic programming skills, including basic facility with R and/or Python, AND

• basic knowledge of relational databases and/or geographic information systems, AND

• basic knowledge of probability, applied statistics, and/or social science research design, AND

• basic familiarity with a substantive or theoretical area of social science (e.g., 300-level coursework in political science, sociology, criminology, human development, psychology, economics, communication, anthropology, human geography, social informatics, or similar fields).

It is not unusual for students to have one or more gaps in this preparation at time of application to the SoDA program. Students should work with Social Data Analytics advisers to develop a plan for timely remediation of any deficiencies, which generally will not require formal coursework for students whose training and interests are otherwise appropriate for pursuit of the Social Data Analytics degree. Where possible, this will be addressed at time of application to the Social Data Analytics program.

To this end, some free training materials are linked in the reference section, and there are also abundant free, high-quality, self-paced, course-style training materials on these and related subjects available through edX (http://www.edx.org), Udacity (http://www.udacity.com), Codecademy (http://codecademy.com), and Lynda (http://www.lynda.psu).

Page 25: SoDA 501: Approaches and Issues in Big Social Data Spring ... · \Approaches and Issues in Big Social Data," o ered in the spring semester. Some SoDA / IGERT students will take more

Computation and Scaling Up

Scientific computing computation at scale

bull NRCreport Ch 10 ldquoThe Seven Computational Giants of Massive Data Analysisrdquo

bull DSHandbook-Py Ch 21 ldquoPerformance and Computer Memoryrdquo Ch 22 ldquoComputer Memory andData Structuresrdquo

bull Dagger[ComputationalStatistics-Python] Cliburn Chan Computational Statistics in Python https

peopledukeedu~ccc14sta-663indexhtml

bull CRAN Task View High Performance Computinghttpscranr-projectorgwebviewsHighPerformanceComputinghtml

Numerical computing

• [NumericalComputing] Ward Cheney and David Kincaid. 2013. Numerical Mathematics and Computing, 7th ed. Brooks/Cole, Cengage Learning.

• CRAN Task View: Numerical Mathematics. https://cran.r-project.org/web/views/NumericalMathematics.html

Optimization (e.g., MLE, gradient descent, stochastic gradient descent, EM algorithm, neural nets)

• DSHandbook-Py, Ch. 23, “Maximum Likelihood Estimation and Optimization” (gradient descent)

• DeepLearning, Ch. 4, “Numerical Computation”; Ch. 6, “Deep Feedforward Networks”; Ch. 8, “Optimization for Training Deep Models”

• CRAN Task View: Optimization and Mathematical Programming. https://cran.r-project.org/web/views/Optimization.html

Linear algebra, matrix computations

• [MatrixComputations] Gene H. Golub and Charles F. Van Loan. 2013. Matrix Computations, 4th ed. Johns Hopkins University Press.

• ‡[NetflixMatrix] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. “Matrix Factorization Techniques for Recommender Systems.” Computer, August, 42-9. https://datajobs.com/data-science-repo/Recommender-Systems-[Netflix].pdf

• ‡[BigDataPCA] Jianqing Fan, Qiang Sun, Wen-Xin Zhou, Ziwei Zhu. “Principal Component Analysis for Big Data.” http://www.princeton.edu/~ziweiz/pca.pdf

• ‡[RandomSVD] Andrew Tulloch. 2009. “Fast Randomized SVD.” https://research.fb.com/fast-randomized-svd

• ‡[FactorSGD] Rainer Gemulla, Peter J. Haas, Erik Nijkamp, and Yannis Sismanis. 2011. “Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent.” KDD. http://www.cs.utah.edu/~hari/teaching/bigdata/gemulla11dsgd.pdf

• ‡[Sparse] Max Grossman. 2015. “101 Ways to Store a Sparse Matrix.” https://medium.com/@jmaxg3/101-ways-to-store-a-sparse-matrix-c7f2bf15a229

• ‡[DontInvert] John D. Cook. 2010. “Don’t Invert that Matrix.” https://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix

Simulation-based inference, resampling, Monte Carlo methods, MCMC, Bayes, approximate inference

• ‡[MarkovVisually] Victor Powell. “Markov Chains Explained Visually.” http://setosa.io/ev/markov-chains

• DSHandbook-Py, Ch. 25, “Stochastic Modeling” (Markov chains, MCMC, HMM)

• Bayes, ComputationalStatistics-Python

• DeepLearning, Ch. 17, “Monte Carlo Methods”; Ch. 19, “Approximate Inference”

• ‡[VariationalInference] Jason Eisner. 2011. “High-Level Explanation of Variational Inference.” https://www.cs.jhu.edu/~jason/tutorials/variational.html

• CRAN Task View: Bayesian Inference. https://cran.r-project.org/web/views/Bayesian.html

Parallelism, MapReduce, Split-Apply-Combine

• ‡[MapReduceIntuition] Jigsaw Academy. 2014. “Big Data Specialist MapReduce.” https://www.youtube.com/watch?v=TwcYQzFqg-8&feature=youtu.be

• MMDS, Ch. 2, “Map-Reduce and the New Software Stack” (3rd Edition discusses Spark & TensorFlow)

• Algorithms, Ch. 3, “Scalable Algorithms and Associative Statistics”; Ch. 4, “Hadoop and MapReduce”

• NRCreport, Ch. 6, “Resources, Trade-offs, and Limitations”

• TidyData-R (Split-apply-combine is the motivating principle behind the “tidyverse” approach). See also Part III, “Program” (pipes, functions, vectors, iteration).

• ‡[TidySAC-Video] Hadley Wickham. 2017. “Data Science in the Tidyverse.” https://www.rstudio.com/resources/videos/data-science-in-the-tidyverse

• See also FactorSGD

Functional Programming

• ComputationalStatistics-Python, “Functions are first class objects” through first exercises

• DSHandbook-Py, Ch. 20, “Programming Language Concepts”

• See also Haskell

Scaling iteration, streaming data, online algorithms (Spark)

• NRCreport, Ch. 4, “Temporal Data and Real-Time Algorithms”

• MMDS, Ch. 4, “Mining Data Streams”

• ‡[BDAS] AMPLab. BDAS, The Berkeley Data Analytics Stack. https://amplab.cs.berkeley.edu/software

• †[Spark] Mohammed Guller. 2015. Big Data Analytics with Spark: A Practitioner’s Guide to Using Spark for Large-Scale Data Processing, Machine Learning, and Graph Analytics, and High-Velocity Data Stream Processing. Apress. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-1-4842-0964-6

• DSHandbook-Py, Ch. 13, “Big Data”

• DeepLearning, Ch. 10, “Sequence Modeling: Recurrent and Recursive Nets”

General Resources for Python and R

• †[DSHandbook-Py] Field Cady. 2017. The Data Science Handbook (Python-based). Wiley. http://onlinelibrary.wiley.com.ezaccess.libraries.psu.edu/book/10.1002/9781119092919

• ‡[Tutorials-R] Ujjwal Karn. 2017. “A curated list of R tutorials for Data Science, NLP and Machine Learning.” https://github.com/ujjwalkarn/DataScienceR

• ‡[Tutorials-Python] Ujjwal Karn. 2017. “A curated list of Python tutorials for Data Science, NLP and Machine Learning.” https://github.com/ujjwalkarn/DataSciencePython

Cutting and Bleeding Edge of Data Science Languages

• [Scala] †Vishal Layka and David Pollak. 2015. Beginning Scala. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-0232-6. †Nicolas Patrick. 2014. Scala for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1901910. ‡Scala site: https://www.scala-lang.org (See also )

• [Julia] †Ivo Balbaert. 2015. Getting Started with Julia. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1973847. ‡Douglas Bates. 2013. “Julia for R Programmers.” http://www.stat.wisc.edu/~bates/JuliaForRProgrammers.pdf. ‡Julia site: https://julialang.org

• [Haskell] †Richard Bird. 2014. Thinking Functionally with Haskell. https://doi-org.ezaccess.libraries.psu.edu/10.1017/CBO9781316092415. †Hakim Cassimally. 2017. Learning Haskell Programming. https://www.lynda.com/Haskell-tutorials/Learning-Haskell-Programming/604926-2.html. †James Church. 2017. Learning Haskell for Data Analysis. https://www.lynda.com/Developer-tutorials/Learning-Haskell-Data-Analysis/604234-2.html. Haskell site: https://www.haskell.org

• [Clojure] †Mark McDonnell. 2017. Quick Clojure: Essential Functional Programming. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2952-1. †Akhil Wali. 2014. Clojure for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1674848. †Arthur Ulfeldt. 2015. https://www.lynda.com/Clojure-tutorials/Up-Running-Clojure/413127-2.html. ‡Clojure site: https://clojure.org

• [TensorFlow] ‡Abadi et al. (Google). 2016. “TensorFlow: A System for Large-Scale Machine Learning.” https://arxiv.org/abs/1605.08695. ‡Udacity. “Deep Learning.” https://www.udacity.com/course/deep-learning--ud730. ‡TensorFlow site: https://www.tensorflow.org

• [H2O] Darren Cook. 2016. Practical Machine Learning with H2O: Powerful, Scalable Techniques for Deep Learning and AI. O’Reilly. Arno Candel and Viraj Parmar. 2015. Deep Learning with H2O. ‡H2O site: https://www.h2o.ai

Social Data Structures

Space and Time

• †[GIA] David O’Sullivan, David J. Unwin. 2010. Geographic Information Analysis, Second Edition. John Wiley & Sons. http://onlinelibrary.wiley.com/book/10.1002/9780470549094

• [Space-Time] Donna J. Peuquet. 2003. Representations of Space and Time. Guilford.

• Roger S. Bivand, Edzer Pebesma, Virgilio Gómez-Rubio. 2013. Applied Spatial Data Analysis in R. Free through library: http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-7618-4. See also Edzer Pebesma. 2016. “Handling and Analyzing Spatial, Spatiotemporal and Movement Data in R.” https://edzer.github.io/UseR2016

• ‡[PySAL] Sergio J. Rey and Dani Arribas-Bel. 2016. “Geographic Data Science with PySAL and the pydata Stack.” http://darribas.org/gds_scipy16

• DataScience-Python, NB 04.13, “Geographic Data with Basemap”

• Time: “Longitudinal,” intra-individual data as common in developmental psychology. †[Longitudinal] Garrett Fitzmaurice, Marie Davidian, Geert Verbeke, and Geert Molenberghs, editors. 2009. Longitudinal Data Analysis. Chapman & Hall / CRC. http://pensu.eblib.com/patron/FullRecord.aspx?p=359998

• Time: “Time Series, Time Series Cross Section, Panel,” as common in economics, political science, and sociology. DataScience-Python, Ch. 3, re: pandas; DSHandbook-Py, Ch. 17, “Time Series Analysis”; Econometrics; GelmanHill.

• Time: “Sequential, Data Streams,” as in NLP, machine learning. Spark and other “streaming data” readings.

• Space-Time: Data with continuity, connectivity, neighborhood structure. See DeepLearning, Chapter 9, re: convolution.

• CRAN Task Views: Spatial, https://cran.r-project.org/web/views/Spatial.html; Spatiotemporal, https://cran.r-project.org/web/views/SpatioTemporal.html; Time Series, https://cran.r-project.org/web/views/TimeSeries.html

Network, Graphs

• ‡[Networks] David Easley and Jon Kleinberg. 2010. Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press. http://www.cs.cornell.edu/home/kleinber/networks-book

• MMDS, Ch. 5, “Link Analysis”; Ch. 10, “Mining Social-Network Graphs”

• †[Networks-R] Douglas Luke. 2015. A User’s Guide to Network Analysis in R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-3-319-23883-8

• †[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry. 2017. Python for Graph and Network Analysis. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-53004-8

Hierarchy, Clustered Data, Aggregation, Mixed Models

• †[GelmanHill] Andrew Gelman and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press. http://pensu.eblib.com/patron/FullRecord.aspx?p=288457

• †[MixedModels] Eugene Demidenko. 2013. Mixed Models: Theory and Applications with R, 2nd ed. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10748641

Social Data Channels

Language, Text, Speech, Audio

• ‡[NLP] Daniel Jurafsky and James H. Martin. Forthcoming (2017). Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics, 3rd ed. Prentice-Hall. Preprint: https://web.stanford.edu/~jurafsky/slp3

• InfoRetrieval

• ‡[CoreNLP] Stanford CoreNLP. https://stanfordnlp.github.io/CoreNLP

• †[TextAsData] Grimmer and Stewart. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis. https://doi.org/10.1093/pan/mps028

• Quinn-Topics

• FightinWords

• †[NLTK-Python] Jacob Perkins. 2010. Python Text Processing with NLTK 2.0 Cookbook. Packt. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10435387

• †[TextAnalytics-Python] Dipanjan Sarkar. 2016. Text Analytics with Python. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2388-8

• DSHandbook-Py, Ch. 16, “Natural Language Processing”

• Matthew J. Denny. http://www.mjdenny.com/Text_Processing_In_R.html

• CRAN: Natural Language Processing. https://cran.r-project.org/web/views/NaturalLanguageProcessing.html

• †[AudioData] Dean Knox and Christopher Lucas. 2017. “The Speaker-Affect Model: Measuring Emotion in Political Speech with Audio Data.” http://christopherlucas.org/files/PDFs/sam.pdf; https://www.youtube.com/watch?v=Hs8A9dwkMzI

• †Morgan & Claypool Synthesis Lectures on Human Language Technologies. http://www.morganclaypool.com/toc/hlt/1/1

• †Morgan & Claypool Synthesis Lectures on Speech and Audio Processing. http://www.morganclaypool.com/toc/sap/1/1

Vision, Image, Video

• ‡[ComputerVision] Richard Szeliski. 2010. Computer Vision: Algorithms and Applications. Springer. http://szeliski.org/Book

• MixedModels, Ch. 11, “Statistical Analysis of Shape”; Ch. 12, “Statistical Image Analysis”

• Audio/Video/Volumetric: See DeepLearning, Chapter 9, re: convolution.

• †Morgan & Claypool Synthesis Lectures on Image, Video, and Multimedia Processing. http://www.morganclaypool.com/toc/ivm/1/1

• †Morgan & Claypool Synthesis Lectures on Computer Vision. http://www.morganclaypool.com/toc/cov/1/1

Approaches to Learning from Data (The Analytics Layer)

• NRCreport, Ch. 7, “Building Models from Massive Data”

• MMDS (data-mining)

• CausalInference

• ‡[VisualAnalytics] Daniel Keim, Jörn Kohlhammer, Geoffrey Ellis, and Florian Mansmann, editors. 2010. Mastering the Information Age: Solving Problems with Visual Analytics. The Eurographics Association, Goslar, Germany. http://www.vismaster.eu/book. See also †Morgan & Claypool Synthesis Lectures on Visualization. http://www.morganclaypool.com/toc/vis/2/1

• ‡[Econometrics] Bruce E. Hansen. 2014. Econometrics. http://www.ssc.wisc.edu/~bhansen/econometrics

• ‡[StatisticalLearning] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning with Applications in R. Springer. http://www-bcf.usc.edu/~gareth/ISL/index.html

• [MachineLearning] Christopher Bishop. 2006. Pattern Recognition and Machine Learning. Springer.

• †[Bayes] Peter D. Hoff. 2009. A First Course in Bayesian Statistical Methods. Springer. http://link.springer.com/book/10.1007%2F978-0-387-92407-6

• †[GraphicalModels] Søren Højsgaard, David Edwards, Steffen Lauritzen. 2012. Graphical Models with R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-2299-0

• [InformationTheory] David MacKay. 2003. Information Theory, Inference, and Learning Algorithms. Springer.

• ‡[DeepLearning] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org (see also Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. “Deep Learning.” Nature 521(7553): 436–444. https://doi.org/10.1038/nature14539)

See also ‡[Keras] Keras: The Python Deep Learning Library. https://keras.io. François Chollet, Directory of Keras tutorials: https://github.com/fchollet/keras-resources

• †[SignalProcessing] Jose Maria Giron-Sierra. 2013. Digital Signal Processing with MATLAB Examples (Volumes 1-3). Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-981-10-2534-1

Penn State Policy Statements

Academic Integrity

Academic integrity is the pursuit of scholarly activity in an open, honest, and responsible manner. Academic integrity is a basic guiding principle for all academic activity at The Pennsylvania State University, and all members of the University community are expected to act in accordance with this principle. Consistent with this expectation, the University’s Code of Conduct states that all students should act with personal integrity; respect other students’ dignity, rights, and property; and help create and maintain an environment in which all can succeed through the fruits of their efforts.

Academic integrity includes a commitment by all members of the University community not to engage in or tolerate acts of falsification, misrepresentation, or deception. Such acts of dishonesty violate the fundamental ethical principles of the University community and compromise the worth of work completed by others.

Disability Accommodation

Penn State welcomes students with disabilities into the University’s educational programs. Every Penn State campus has an office for students with disabilities. The Student Disability Resources (SDR) website provides contact information for every Penn State campus (http://equity.psu.edu/sdr/disability-coordinator). For further information, please visit the Student Disability Resources website (http://equity.psu.edu/sdr).

In order to receive consideration for reasonable accommodations, you must contact the appropriate disability services office at the campus where you are officially enrolled, participate in an intake interview, and provide documentation. See the documentation guidelines at http://equity.psu.edu/sdr/guidelines. If the documentation supports your request for reasonable accommodations, your campus disability services office will provide you with an accommodation letter. Please share this letter with your instructors and discuss the accommodations with them as early as possible. You must follow this process for every semester that you request accommodations.

Psychological Services

Many students at Penn State face personal challenges or have psychological needs that may interfere with their academic progress, social development, or emotional wellbeing. The university offers a variety of confidential services to help you through difficult times, including individual and group counseling, crisis intervention, consultations, online chats, and mental health screenings. These services are provided by staff who welcome all students and embrace a philosophy respectful of clients’ cultural and religious backgrounds, and sensitive to differences in race, ability, gender identity, and sexual orientation.

• Counseling and Psychological Services at University Park (CAPS) (http://studentaffairs.psu.edu/counseling): 814-863-0395

• Penn State Crisis Line (24 hours/7 days/week): 877-229-6400

• Crisis Text Line (24 hours/7 days/week): Text LIONS to 741741

Educational Equity

Penn State takes great pride to foster a diverse and inclusive environment for students, faculty, and staff. Consistent with University Policy AD29, students who believe they have experienced or observed a hate crime, an act of intolerance, discrimination, or harassment that occurs at Penn State are urged to report these incidents as outlined on the University’s Report Bias webpage (http://equity.psu.edu/reportbias).

Background Knowledge

Students will have a variety of backgrounds. Prior to beginning interdisciplinary coursework to fulfill Social Data Analytics degree requirements, including SoDA 501 and 502, students are expected to have advanced (graduate) training in at least one of the component areas of Social Data Analytics and a familiarity with basic concepts in the others.

With regard to specialization, students are expected to have advanced (graduate) training in ONE of the following:

• quantitative social science methodology and a discipline of social science (as would be the case for a second-year PhD student in Political Science, Sociology, Criminology, Human Development and Family Studies, or Demography), OR

• statistics (as would be the case for a second-year PhD student in Statistics), OR

• information science or informatics (as would be the case for a second-year PhD student in Information Science and Technology, or a second-year PhD student in Geography specializing in GIScience), OR

• computer science (as would be the case for a second-year PhD student in Computer Science and Engineering).

This requirement is met as a matter of meeting home program requirements for students in the dual-title PhD, but may require additional coursework on the part of students in other programs wishing to pursue the graduate minor.

With regard to general preparation, students are expected to have ALL of the following technical knowledge:

• basic programming skills, including basic facility with R &/or Python, AND

• basic knowledge of relational databases &/or geographic information systems, AND

• basic knowledge of probability, applied statistics, &/or social science research design, AND

• basic familiarity with a substantive or theoretical area of social science (e.g., 300-level coursework in political science, sociology, criminology, human development, psychology, economics, communication, anthropology, human geography, social informatics, or similar fields).

It is not unusual for students to have one or more gaps in this preparation at time of application to the SoDA program. Students should work with Social Data Analytics advisers to develop a plan for timely remediation of any deficiencies, which generally will not require formal coursework for students whose training and interests are otherwise appropriate for pursuit of the Social Data Analytics degree. Where possible, this will be addressed at time of application to the Social Data Analytics program.

To this end, some free training materials are linked in the reference section, and there are also abundant free, high-quality, self-paced, course-style training materials on these and related subjects available through edX (http://www.edx.org), Udacity (http://www.udacity.com), Codecademy (http://codecademy.com), and Lynda (http://www.lynda.psu).

32

Page 26: SoDA 501: Approaches and Issues in Big Social Data Spring ... · \Approaches and Issues in Big Social Data," o ered in the spring semester. Some SoDA / IGERT students will take more

bull Dagger[MarkovVisually] Victor Powell ldquoMarkov Chains Explained Visuallyrdquo httpsetosaioev

markov-chains

bull DSHandbook-Py Ch 25 ldquoStochastic Modelingrdquo (Markov chains MCMC HMM)

bull Bayes ComputationalStatistics-Py

bull DeepLearning Ch 17 ldquoMonte Carlo Methodsrdquo Ch 19 ldquoApproximate Inferencerdquo

bull Dagger[VariationalInference] Jason Eisner 2011 ldquoHigh-Level Explanation of Variational Inferencerdquohttpswwwcsjhuedu~jasontutorialsvariationalhtml

bull CRAN Task View Bayesian Inferencehttpscranr-projectorgwebviewsBayesianhtml

Parallelism MapReduce Split-Apply-Combine

bull Dagger[MapReduceIntuition] Jigsaw Academy 2014 ldquoBig Data Specialist MapReducerdquo https

wwwyoutubecomwatchv=TwcYQzFqg-8ampfeature=youtube

bull MMDS Ch 2 ldquoMap-Reduce and the New Software Stackrdquo (3rd Edition discusses Spark amp Tensor-Flow)

bull Algorithms Ch 3ldquoScalable Algorithms and Associative Statisticsrdquo Ch 4 ldquoHadoop and MapReducerdquo

bull NRCreport Ch 6 ldquoResources Trade-offs and Limitationsrdquo

bull TidyData-R (Split-apply-combine is the motivating principle behind the ldquotidyverserdquo approach) Seealso Part III ldquoProgramrdquo (pipes functions vectors iteration)

bull Dagger[TidySAC-Video] Hadley Wickham 2017 ldquoData Science in the Tidyverserdquo httpswww

rstudiocomresourcesvideosdata-science-in-the-tidyverse

bull See also FactorSGD

Functional Programming

bull ComputationalStatistics-Python ldquoFunctions are first class objectsrdquo through first exercises

bull DSHandbook-Py Ch 20 ldquoProgramming Language Conceptsrdquo

bull See also Haskell

Scaling iteration streaming data online algorithms (Spark)

bull NRCreport Ch 4 ldquoTemporal Data and Real-Time Algorithmsrdquo

bull MMDS Ch 4 ldquoMining Data Streamsrdquo

bull Dagger[BDAS] AMPLab BDAS The Berkeley Data Analytics Stack httpsamplabcsberkeleyedu

software

bull dagger[Spark] Mohammed Guller 2015 Big Data Analytics with Spark A Practitioners Guide toUsing Spark for Large-Scale Data Processing Machine Learning and Graph Analytics and High-Velocity Data Stream Processing Apress httpslink-springer-comezaccesslibrariespsu

edubook101007978-1-4842-0964-6

bull DSHandbook-Py Ch 13 ldquoBig Datardquo

bull DeepLearning Ch 10 ldquoSequence Modeling Recurrent and Recursive Netsrdquo

General Resources for Python and R

26

bull dagger[DSHandbook-Py] Field Cady 2017 The Data Science Handbook (Python-based) Wiley http

onlinelibrarywileycomezaccesslibrariespsuedubook1010029781119092919

bull Dagger[Tutorials-R] Ujjwal Karn 2017 ldquoA curated list of R tutorials for Data Science NLP and MachineLearningrdquo httpsgithubcomujjwalkarnDataScienceR

bull Dagger[Tutorials-Python] Ujjwal Karn 2017 ldquoA curated list of Python tutorials for Data Science NLPand Machine Learningrdquo httpsgithubcomujjwalkarnDataSciencePython

Cutting and Bleeding Edge of Data Science Languages

bull [Scala] daggerVishal Layka and David Pollak 2015 Beginning Scala httpslink-springer-

comezaccesslibrariespsuedubook1010072F978-1-4842-0232-6 daggerNicolas Patrick 2014Scala for Machine Learning httpsebookcentralproquestcomlibpensudetailaction

docID=1901910 DaggerScala site httpswwwscala-langorg (See also )

bull [Julia] daggerIvo Baobaert 2015 Getting Started with Julia httpsebookcentralproquestcom

libpensudetailactiondocID=1973847 DaggerDouglas Bates 2013 ldquoJulia for R Programmersrdquohttpwwwstatwiscedu~batesJuliaForRProgrammerspdf DaggerJulia site httpsjulialang

org

bull [Haskell] daggerRichard Bird 2014 Thinking Functionally with Haskell httpsdoi-orgezaccess

librariespsuedu101017CBO9781316092415 daggerHakim Cassimally 2017 Learning Haskell Pro-gramming httpswwwlyndacomHaskell-tutorialsLearning-Haskell-Programming604926-2html daggerJames Church 2017 Learning Haskell for Data Analysis httpswwwlyndacomDeveloper-tutorialsLearning-Haskell-Data-Analysis604234-2html Haskell site httpswwwhaskellorg

bull [Clojure] daggerMark McDonnell 2017 Quick Clojure Essential Functional Programming https

link-springer-comezaccesslibrariespsuedubook1010072F978-1-4842-2952-1 daggerAkhillWali 2014 Clojure for Machine Learning httpsebookcentralproquestcomlibpensu

detailactiondocID=1674848 daggerArthur Ulfeldt 2015 httpswwwlyndacomClojure-tutorialsUp-Running-Clojure413127-2html DaggerClojure site httpsclojureorg

bull [TensorFlow] DaggerAbadi et al (Google) 2016 ldquoTensorFlow A System for Large-Scale Ma-chine Learningrdquo httpsarxivorgabs160508695 DaggerUdacity ldquoDeep Learningrdquo httpswww

udacitycomcoursedeep-learning--ud730 DaggerTensorFlow site httpswwwtensorfloworg

bull [H2O] Darren Cool 2016 Practical Machine Learning with H2O Powerful Scalable Techniques forDeep Learning and AI OrsquoReilly Arno Candel and Viraj Parmar 2015 Deep Learning with H2ODaggerH2O site httpswwwh2oai

Social Data Structures

Space and Time

bull dagger[GIA] David OrsquoSullivan David J Unwin 2010 Geographic Information Analysis Second EditionJohn Wiley amp Sons httponlinelibrarywileycombook1010029780470549094

bull [Space-Time] Donna J Peuquet 2003 Representations of Space and Time Guilford

bull Roger S Bivand Edzer Pebesma Virgilio Gmez-Rubio 2013 Applied Spatial Data Analysis in RFree through library httplinkspringercomezaccesslibrariespsuedubook101007

2F978-1-4614-7618-4 See also Edzer Pebesma 2016 ldquoHandling and Analyzing Spatial Spa-tiotemporal and Movement Data in Rrdquo httpsedzergithubioUseR2016

bull Dagger[PySAL] Sergio J Rey and Dani Arribas-Bel 2016 ldquoGeographic Data Science with PySAL andthe pydata Stackrdquo httpdarribasorggds_scipy16

27

bull DataScience-Python NB 0413 ldquoGeographic Data with Basemaprdquo

bull Time ldquoLongitudinalrdquo intra-individual data as common in developmental psychology dagger[Longitudinal]Garrett Fitzmaurice Marie Davidian Geert Verbeke and Geert Molenberghs editors 2009 Longitu-dinal Data Analysis Chapman amp Hall CRC httppensueblibcompatronFullRecordaspxp=359998

bull Time ldquoTime Series Time Series Cross Section Panelrdquo as common in economics political scienceand sociology DataScience-Python Ch 3 re pandas DSHandbook-Py Ch 17 ldquoTime Series AnalysisrdquoEconometrics GelmanHill

bull Time ldquoSequential Data Streamsrdquo as in NLP machine learning Spark and other rdquostreaming datardquoreadings

bull Space-Time Data with continuity connectivity neighborhood structure See DeepLearning Chapter 9re convolution

bull CRAN Task Views Spatial httpscranr-projectorgwebviewsSpatialhtml Spatiotempo-ral httpscranr-projectorgwebviewsSpatioTemporalhtml Time Series httpscran

r-projectorgwebviewsTimeSerieshtml

Network Graphs

bull Dagger[Networks] David Easley and Jon Kleinberg 2010 Networks Crowds and Markets ReasoningAbout a Highly Connected World Cambridge University Press httpwwwcscornelleduhome

kleinbernetworks-book

bull MMDS Ch 5 ldquoLink Analysisrdquo Ch 10 ldquoMining Social-Network Graphsrdquo

bull dagger[Networks-R] Douglas Luke 2015 A Userrsquos Guide to Network Analysis in R Springer https

link-springer-comezaccesslibrariespsuedubook101007978-3-319-23883-8

bull dagger[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry 2017 Python for Graph andNetwork Analysis Springer httpslink-springer-comezaccesslibrariespsuedubook

1010072F978-3-319-53004-8

Hierarchy Clustered Data Aggregation Mixed Models

bull dagger[GelmanHill] Andrew Gelman and Jennifer Hill 2006 Data Analysis Using Regression and Multi-levelHierarchical Models Cambridge University Press httppensueblibcompatronFullRecordaspxp=288457

bull dagger[MixedModels] Eugene Demidenko 2013 Mixed Models Theory and Applications with R 2nd edWiley httpsiteebrarycomezaccesslibrariespsuedulibpennstatedetailaction

docID=10748641

Social Data Channels

Language Text Speech Audio

bull Dagger[NLP] Daniel Jurafsky and James H Martin Forthcoming (2017) Speech and Language ProcessingAn Introduction to Natural Language Processing Speech Recognition and Computational Linguistics3rd ed Prentice-Hall Preprint httpswebstanfordedu~jurafskyslp3

bull InfoRetrieval

bull Dagger[CoreNLP] Stanford CoreNLP httpsstanfordnlpgithubioCoreNLP

28

bull dagger[TextAsData] Grimmer and Stewart 2013 ldquoText as Data The Promise and Pitfalls of AutomaticContent Analysis Methods for Political Textsrdquo Political Analysis httpsdoiorg101093pan

mps028

bull Quinn-Topics

bull FightinWords

bull dagger[NLTK-Python] Jacob Perkins 2010 Python Text Processing with NLTK 20 Cookbook PackthttpsiteebrarycomezaccesslibrariespsuedulibpennstatedetailactiondocID=10435387dagger[TextAnalytics-Python] Dipanjan Sarkar 2016 Text Analytics with Python Springer http

linkspringercomezaccesslibrariespsuedubook1010072F978-1-4842-2388-8

bull DSHandbook-Py Ch 16 ldquoNatural Language Processingrdquo

bull Matthew J Denny httpwwwmjdennycomText_Processing_In_Rhtml

bull CRAN Natural Language Processing httpscranr-projectorgwebviewsNaturalLanguageProcessinghtml

bull dagger[AudioData] Dean Knox and Christopher Lucas 2017 ldquoThe Speaker-Affect Model MeasuringEmotion in Political Speech with Audio Datardquo httpchristopherlucasorgfilesPDFssam

pdf httpswwwyoutubecomwatchv=Hs8A9dwkMzI

bull daggerMorgan amp Claypool Synthesis Lectures on Human Language Technologies httpwwwmorganclaypoolcomtochlt11

bull daggerMorgan amp Claypool Synthesis Lectures on Speech and Audio Processing httpwwwmorganclaypoolcomtocsap11

Vision Image Video

bull Dagger[ComputerVision] Richard Szelinski 2010 Computer Vision Algorithms and Applications SpringerhttpszeliskiorgBook

bull MixedModels Ch 11 ldquoStatistical Analysis of Shaperdquo Ch 12 ldquoStatistical Image Analysisrdquo

bull AudioVideoVolumetric See DeepLearning Chapter 9 re convolution

bull daggerMorgan amp Claypool Synthesis Lectures on Image Video and Multimedia Processing httpwww

morganclaypoolcomtocivm11

bull daggerMorgan amp Claypool Synthesis Lectures on Computer Vision httpwwwmorganclaypoolcom

toccov11

Approaches to Learning from Data (The Analytics Layer)

bull NRCreport Ch 7 ldquoBuilding Models from Massive Datardquo

bull MMDS (data-mining)

bull CausalInference

bull Dagger[VisualAnalytics] Daniel Keim Jorn Kohlhammer Geoffrey Ellis and Florian Mansmann editors2010 Mastering the Information Age Solving Problems with Visual Analytics The EurographicsAssociation Goslar Germany httpwwwvismastereubook See also daggerMorgan amp ClaypoolSynthesis Lectures on Visualization httpwwwmorganclaypoolcomtocvis21

bull Dagger[Econometrics] Bruce E Hansen 2014 Econometrics httpwwwsscwiscedu~bhansen

econometrics

29

• ‡[StatisticalLearning] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning with Applications in R. Springer. http://www-bcf.usc.edu/~gareth/ISL/index.html

• [MachineLearning] Christopher Bishop. 2006. Pattern Recognition and Machine Learning. Springer.

• †[Bayes] Peter D. Hoff. 2009. A First Course in Bayesian Statistical Methods. Springer. http://link.springer.com/book/10.1007%2F978-0-387-92407-6

• †[GraphicalModels] Søren Højsgaard, David Edwards, Steffen Lauritzen. 2012. Graphical Models with R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-2299-0

• [InformationTheory] David MacKay. 2003. Information Theory, Inference, and Learning Algorithms. Cambridge University Press.

• ‡[DeepLearning] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org (see also Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. “Deep Learning.” Nature 521(7553): 436–444. https://doi.org/10.1038/nature14539)

See also ‡[Keras] Keras: The Python Deep Learning Library. https://keras.io. François Chollet, Directory of Keras tutorials. https://github.com/fchollet/keras-resources

• †[SignalProcessing] Jose Maria Giron-Sierra. 2013. Digital Signal Processing with MATLAB Examples (Volumes 1-3). Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-981-10-2534-1

Penn State Policy Statements

Academic Integrity

Academic integrity is the pursuit of scholarly activity in an open, honest, and responsible manner. Academic integrity is a basic guiding principle for all academic activity at The Pennsylvania State University, and all members of the University community are expected to act in accordance with this principle. Consistent with this expectation, the University's Code of Conduct states that all students should act with personal integrity, respect other students' dignity, rights, and property, and help create and maintain an environment in which all can succeed through the fruits of their efforts.

Academic integrity includes a commitment by all members of the University community not to engage in or tolerate acts of falsification, misrepresentation, or deception. Such acts of dishonesty violate the fundamental ethical principles of the University community and compromise the worth of work completed by others.

Disability Accommodation

Penn State welcomes students with disabilities into the University's educational programs. Every Penn State campus has an office for students with disabilities. The Student Disability Resources (SDR) website provides contact information for every Penn State campus (http://equity.psu.edu/sdr/disability-coordinator). For further information, please visit the Student Disability Resources website (http://equity.psu.edu/sdr).

In order to receive consideration for reasonable accommodations, you must contact the appropriate disability services office at the campus where you are officially enrolled, participate in an intake interview, and provide documentation. See the documentation guidelines at http://equity.psu.edu/sdr/guidelines. If the documentation supports your request for reasonable accommodations, your campus disability services office will provide you with an accommodation letter. Please share this letter with your instructors and discuss the accommodations with them as early as possible. You must follow this process for every semester that you request accommodations.

Psychological Services

Many students at Penn State face personal challenges or have psychological needs that may interfere with their academic progress, social development, or emotional wellbeing. The university offers a variety of confidential services to help you through difficult times, including individual and group counseling, crisis intervention, consultations, online chats, and mental health screenings. These services are provided by staff who welcome all students and embrace a philosophy respectful of clients' cultural and religious backgrounds, and sensitive to differences in race, ability, gender identity, and sexual orientation.

• Counseling and Psychological Services at University Park (CAPS) (http://studentaffairs.psu.edu/counseling/): 814-863-0395

• Penn State Crisis Line (24 hours/7 days/week): 877-229-6400

• Crisis Text Line (24 hours/7 days/week): Text LIONS to 741741

Educational Equity

Penn State takes great pride to foster a diverse and inclusive environment for students, faculty, and staff. Consistent with University Policy AD29, students who believe they have experienced or observed a hate crime, an act of intolerance, discrimination, or harassment that occurs at Penn State are urged to report these incidents as outlined on the University's Report Bias webpage (http://equity.psu.edu/reportbias/).

Background Knowledge

Students will have a variety of backgrounds. Prior to beginning interdisciplinary coursework to fulfill Social Data Analytics degree requirements, including SoDA 501 and 502, students are expected to have advanced (graduate) training in at least one of the component areas of Social Data Analytics, and a familiarity with basic concepts in the others.

With regard to specialization, students are expected to have advanced (graduate) training in ONE of the following:

• quantitative social science methodology and a discipline of social science (as would be the case for a second-year PhD student in Political Science, Sociology, Criminology, Human Development and Family Studies, or Demography), OR

• statistics (as would be the case for a second-year PhD student in Statistics), OR

• information science or informatics (as would be the case for a second-year PhD student in Information Science and Technology, or a second-year PhD student in Geography specializing in GIScience), OR

• computer science (as would be the case for a second-year PhD student in Computer Science and Engineering).

This requirement is met as a matter of meeting home program requirements for students in the dual-title PhD, but may require additional coursework on the part of students in other programs wishing to pursue the graduate minor.

With regard to general preparation, students are expected to have ALL of the following technical knowledge:

• basic programming skills, including basic facility with R and/or Python, AND

• basic knowledge of relational databases and/or geographic information systems, AND

• basic knowledge of probability, applied statistics, and/or social science research design, AND

• basic familiarity with a substantive or theoretical area of social science (e.g., 300-level coursework in political science, sociology, criminology, human development, psychology, economics, communication, anthropology, human geography, social informatics, or similar fields).

It is not unusual for students to have one or more gaps in this preparation at time of application to the SoDA program. Students should work with Social Data Analytics advisers to develop a plan for timely remediation of any deficiencies, which generally will not require formal coursework for students whose training and interests are otherwise appropriate for pursuit of the Social Data Analytics degree. Where possible, this will be addressed at time of application to the Social Data Analytics program.

To this end, some free training materials are linked in the reference section, and there are also abundant free, high-quality, self-paced, course-style training materials on these and related subjects available through edX (http://www.edx.org), Udacity (http://www.udacity.com), Codecademy (http://codecademy.com), and Lynda (http://www.lynda.psu).

• †[DSHandbook-Py] Field Cady. 2017. The Data Science Handbook (Python-based). Wiley. http://onlinelibrary.wiley.com.ezaccess.libraries.psu.edu/book/10.1002/9781119092919

• ‡[Tutorials-R] Ujjwal Karn. 2017. “A curated list of R tutorials for Data Science, NLP and Machine Learning.” https://github.com/ujjwalkarn/DataScienceR

• ‡[Tutorials-Python] Ujjwal Karn. 2017. “A curated list of Python tutorials for Data Science, NLP and Machine Learning.” https://github.com/ujjwalkarn/DataSciencePython

Cutting and Bleeding Edge of Data Science Languages

• [Scala] †Vishal Layka and David Pollak. 2015. Beginning Scala. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-0232-6. †Nicolas Patrick. 2014. Scala for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1901910. ‡Scala site: https://www.scala-lang.org (See also )

• [Julia] †Ivo Balbaert. 2015. Getting Started with Julia. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1973847. ‡Douglas Bates. 2013. “Julia for R Programmers.” http://www.stat.wisc.edu/~bates/JuliaForRProgrammers.pdf. ‡Julia site: https://julialang.org

• [Haskell] †Richard Bird. 2014. Thinking Functionally with Haskell. https://doi-org.ezaccess.libraries.psu.edu/10.1017/CBO9781316092415. †Hakim Cassimally. 2017. Learning Haskell Programming. https://www.lynda.com/Haskell-tutorials/Learning-Haskell-Programming/604926-2.html. †James Church. 2017. Learning Haskell for Data Analysis. https://www.lynda.com/Developer-tutorials/Learning-Haskell-Data-Analysis/604234-2.html. Haskell site: https://www.haskell.org

• [Clojure] †Mark McDonnell. 2017. Quick Clojure: Essential Functional Programming. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2952-1. †Akhil Wali. 2014. Clojure for Machine Learning. https://ebookcentral.proquest.com/lib/pensu/detail.action?docID=1674848. †Arthur Ulfeldt. 2015. https://www.lynda.com/Clojure-tutorials/Up-Running-Clojure/413127-2.html. ‡Clojure site: https://clojure.org

• [TensorFlow] ‡Abadi et al. (Google). 2016. “TensorFlow: A System for Large-Scale Machine Learning.” https://arxiv.org/abs/1605.08695. ‡Udacity, “Deep Learning.” https://www.udacity.com/course/deep-learning--ud730. ‡TensorFlow site: https://www.tensorflow.org

• [H2O] Darren Cook. 2016. Practical Machine Learning with H2O: Powerful, Scalable Techniques for Deep Learning and AI. O’Reilly. Arno Candel and Viraj Parmar. 2015. Deep Learning with H2O. ‡H2O site: https://www.h2o.ai

Social Data Structures

Space and Time

• †[GIA] David O’Sullivan, David J. Unwin. 2010. Geographic Information Analysis, Second Edition. John Wiley & Sons. http://onlinelibrary.wiley.com/book/10.1002/9780470549094

• [Space-Time] Donna J. Peuquet. 2003. Representations of Space and Time. Guilford.

• Roger S. Bivand, Edzer Pebesma, Virgilio Gómez-Rubio. 2013. Applied Spatial Data Analysis in R. Free through library: http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-7618-4. See also Edzer Pebesma. 2016. “Handling and Analyzing Spatial, Spatiotemporal and Movement Data in R.” https://edzer.github.io/UseR2016/

• ‡[PySAL] Sergio J. Rey and Dani Arribas-Bel. 2016. “Geographic Data Science with PySAL and the pydata Stack.” http://darribas.org/gds_scipy16/

• DataScience-Python, NB 04.13, “Geographic Data with Basemap”

• Time: “Longitudinal” intra-individual data, as common in developmental psychology. †[Longitudinal] Garrett Fitzmaurice, Marie Davidian, Geert Verbeke, and Geert Molenberghs, editors. 2009. Longitudinal Data Analysis. Chapman & Hall / CRC. http://pensu.eblib.com/patron/FullRecord.aspx?p=359998

• Time: “Time Series, Time Series Cross Section, Panel,” as common in economics, political science, and sociology. DataScience-Python, Ch. 3 re: pandas; DSHandbook-Py, Ch. 17, “Time Series Analysis”; Econometrics; GelmanHill.

• Time: “Sequential, Data Streams,” as in NLP and machine learning. Spark and other “streaming data” readings.

• Space-Time: Data with continuity, connectivity, or neighborhood structure. See DeepLearning, Chapter 9 re: convolution.

• CRAN Task Views: Spatial, https://cran.r-project.org/web/views/Spatial.html; Spatiotemporal, https://cran.r-project.org/web/views/SpatioTemporal.html; Time Series, https://cran.r-project.org/web/views/TimeSeries.html

Network / Graphs

• ‡[Networks] David Easley and Jon Kleinberg. 2010. Networks, Crowds, and Markets: Reasoning About a Highly Connected World. Cambridge University Press. http://www.cs.cornell.edu/home/kleinber/networks-book/

• MMDS, Ch. 5, “Link Analysis”; Ch. 10, “Mining Social-Network Graphs”

• †[Networks-R] Douglas Luke. 2015. A User’s Guide to Network Analysis in R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-3-319-23883-8

• †[Networks-Python] Mohammed Zuhair Al-Taie and Seifedine Kadry. 2017. Python for Graph and Network Analysis. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-3-319-53004-8

Hierarchy / Clustered Data / Aggregation / Mixed Models

• †[GelmanHill] Andrew Gelman and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press. http://pensu.eblib.com/patron/FullRecord.aspx?p=288457

• †[MixedModels] Eugene Demidenko. 2013. Mixed Models: Theory and Applications with R, 2nd ed. Wiley. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10748641

Social Data Channels

Language / Text / Speech / Audio

• ‡[NLP] Daniel Jurafsky and James H. Martin. Forthcoming (2017). Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics, 3rd ed. Prentice-Hall. Preprint: https://web.stanford.edu/~jurafsky/slp3/

• InfoRetrieval

• ‡[CoreNLP] Stanford CoreNLP. https://stanfordnlp.github.io/CoreNLP/

• †[TextAsData] Grimmer and Stewart. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis. https://doi.org/10.1093/pan/mps028

• Quinn-Topics

• FightinWords

• †[NLTK-Python] Jacob Perkins. 2010. Python Text Processing with NLTK 2.0 Cookbook. Packt. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10435387. †[TextAnalytics-Python] Dipanjan Sarkar. 2016. Text Analytics with Python. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2388-8

• DSHandbook-Py, Ch. 16, “Natural Language Processing”

• Matthew J. Denny. http://www.mjdenny.com/Text_Processing_In_R.html

• CRAN: Natural Language Processing. https://cran.r-project.org/web/views/NaturalLanguageProcessing.html

• †[AudioData] Dean Knox and Christopher Lucas. 2017. “The Speaker-Affect Model: Measuring Emotion in Political Speech with Audio Data.” http://christopherlucas.org/files/PDFs/sam.pdf; https://www.youtube.com/watch?v=Hs8A9dwkMzI

• †Morgan & Claypool Synthesis Lectures on Human Language Technologies. http://www.morganclaypool.com/toc/hlt/1/1

• †Morgan & Claypool Synthesis Lectures on Speech and Audio Processing. http://www.morganclaypool.com/toc/sap/1/1

Vision / Image / Video

• ‡[ComputerVision] Richard Szeliski. 2010. Computer Vision: Algorithms and Applications. Springer. http://szeliski.org/Book/

• MixedModels, Ch. 11, “Statistical Analysis of Shape”; Ch. 12, “Statistical Image Analysis”

• Audio/Video/Volumetric: See DeepLearning, Chapter 9 re: convolution

• †Morgan & Claypool Synthesis Lectures on Image, Video, and Multimedia Processing. http://www.morganclaypool.com/toc/ivm/1/1

• †Morgan & Claypool Synthesis Lectures on Computer Vision. http://www.morganclaypool.com/toc/cov/1/1


To this end some free training materials are linked in the reference section and there are alsoabundant free high-quality self-paced course-style training materials on these and related subjectsavailable through edX (httpwwwedxorg) Udacity (httpwwwudacitycom) Codecademy(httpcodecademycom) and Lynda (httpwwwlyndapsu)

32

• †[TextAsData] Grimmer and Stewart. 2013. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis. https://doi.org/10.1093/pan/mps028
• Quinn-Topics
• FightinWords
• †[NLTK-Python] Jacob Perkins. 2010. Python Text Processing with NLTK 2.0 Cookbook. Packt. http://site.ebrary.com.ezaccess.libraries.psu.edu/lib/pennstate/detail.action?docID=10435387
• †[TextAnalytics-Python] Dipanjan Sarkar. 2016. Text Analytics with Python. Springer. http://link.springer.com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4842-2388-8
• DSHandbook-Py, Ch. 16, “Natural Language Processing.”
• Matthew J. Denny. http://www.mjdenny.com/Text_Processing_In_R.html
• CRAN: Natural Language Processing. https://cran.r-project.org/web/views/NaturalLanguageProcessing.html
• †[AudioData] Dean Knox and Christopher Lucas. 2017. “The Speaker-Affect Model: Measuring Emotion in Political Speech with Audio Data.” http://christopherlucas.org/files/PDFs/sam.pdf https://www.youtube.com/watch?v=Hs8A9dwkMzI
• †Morgan & Claypool Synthesis Lectures on Human Language Technologies. http://www.morganclaypool.com/toc/hlt/1/1
• †Morgan & Claypool Synthesis Lectures on Speech and Audio Processing. http://www.morganclaypool.com/toc/sap/1/1

Vision, Image, Video

• ‡[ComputerVision] Richard Szeliski. 2010. Computer Vision: Algorithms and Applications. Springer. http://szeliski.org/Book
• MixedModels, Ch. 11, “Statistical Analysis of Shape”; Ch. 12, “Statistical Image Analysis.”
• Audio/Video/Volumetric: See DeepLearning, Chapter 9, re convolution.
• †Morgan & Claypool Synthesis Lectures on Image, Video, and Multimedia Processing. http://www.morganclaypool.com/toc/ivm/1/1
• †Morgan & Claypool Synthesis Lectures on Computer Vision. http://www.morganclaypool.com/toc/cov/1/1

Approaches to Learning from Data (The Analytics Layer)

• NRCreport, Ch. 7, “Building Models from Massive Data.”
• MMDS (data-mining)
• CausalInference
• ‡[VisualAnalytics] Daniel Keim, Jörn Kohlhammer, Geoffrey Ellis, and Florian Mansmann, editors. 2010. Mastering the Information Age: Solving Problems with Visual Analytics. The Eurographics Association, Goslar, Germany. http://www.vismaster.eu/book/ See also: †Morgan & Claypool Synthesis Lectures on Visualization. http://www.morganclaypool.com/toc/vis/2/1
• ‡[Econometrics] Bruce E. Hansen. 2014. Econometrics. http://www.ssc.wisc.edu/~bhansen/econometrics/

• ‡[StatisticalLearning] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning with Applications in R. Springer. http://www-bcf.usc.edu/~gareth/ISL/index.html
• [MachineLearning] Christopher Bishop. 2006. Pattern Recognition and Machine Learning. Springer.
• †[Bayes] Peter D. Hoff. 2009. A First Course in Bayesian Statistical Methods. Springer. http://link.springer.com/book/10.1007%2F978-0-387-92407-6
• †[GraphicalModels] Søren Højsgaard, David Edwards, Steffen Lauritzen. 2012. Graphical Models with R. Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007%2F978-1-4614-2299-0
• [InformationTheory] David MacKay. 2003. Information Theory, Inference, and Learning Algorithms. Cambridge University Press.
• ‡[DeepLearning] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org (see also Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. “Deep Learning.” Nature 521(7553): 436–444. https://doi.org/10.1038/nature14539)
  See also: ‡[Keras] Keras: The Python Deep Learning Library. https://keras.io. François Chollet, Directory of Keras tutorials. https://github.com/fchollet/keras-resources
• †[SignalProcessing] Jose Maria Giron-Sierra. 2013. Digital Signal Processing with MATLAB Examples (Volumes 1-3). Springer. https://link-springer-com.ezaccess.libraries.psu.edu/book/10.1007/978-981-10-2534-1

Penn State Policy Statements

Academic Integrity

Academic integrity is the pursuit of scholarly activity in an open, honest, and responsible manner. Academic integrity is a basic guiding principle for all academic activity at The Pennsylvania State University, and all members of the University community are expected to act in accordance with this principle. Consistent with this expectation, the University's Code of Conduct states that all students should act with personal integrity, respect other students' dignity, rights, and property, and help create and maintain an environment in which all can succeed through the fruits of their efforts.

Academic integrity includes a commitment by all members of the University community not to engage in or tolerate acts of falsification, misrepresentation, or deception. Such acts of dishonesty violate the fundamental ethical principles of the University community and compromise the worth of work completed by others.

Disability Accommodation

Penn State welcomes students with disabilities into the University's educational programs. Every Penn State campus has an office for students with disabilities. The Student Disability Resources (SDR) website provides contact information for every Penn State campus (http://equity.psu.edu/sdr/disability-coordinator). For further information, please visit the Student Disability Resources website (http://equity.psu.edu/sdr).

In order to receive consideration for reasonable accommodations, you must contact the appropriate disability services office at the campus where you are officially enrolled, participate in an intake interview, and provide documentation. See the documentation guidelines at http://equity.psu.edu/sdr/guidelines. If the documentation supports your request for reasonable accommodations, your campus disability services office will provide you with an accommodation letter. Please share this letter with your instructors and discuss the accommodations with them as early as possible. You must follow this process for every semester that you request accommodations.

Psychological Services

Many students at Penn State face personal challenges or have psychological needs that may interfere with their academic progress, social development, or emotional wellbeing. The university offers a variety of confidential services to help you through difficult times, including individual and group counseling, crisis intervention, consultations, online chats, and mental health screenings. These services are provided by staff who welcome all students and embrace a philosophy respectful of clients' cultural and religious backgrounds, and sensitive to differences in race, ability, gender identity, and sexual orientation.

• Counseling and Psychological Services at University Park (CAPS) (http://studentaffairs.psu.edu/counseling/): 814-863-0395
• Penn State Crisis Line (24 hours/7 days/week): 877-229-6400
• Crisis Text Line (24 hours/7 days/week): Text LIONS to 741741

Educational Equity

Penn State takes great pride in fostering a diverse and inclusive environment for students, faculty, and staff. Consistent with University Policy AD29, students who believe they have experienced or observed a hate crime, an act of intolerance, discrimination, or harassment that occurs at Penn State are urged to report these incidents as outlined on the University's Report Bias webpage (http://equity.psu.edu/reportbias/).

Background Knowledge

Students will have a variety of backgrounds. Prior to beginning interdisciplinary coursework to fulfill Social Data Analytics degree requirements, including SoDA 501 and 502, students are expected to have advanced (graduate) training in at least one of the component areas of Social Data Analytics and a familiarity with basic concepts in the others.

With regard to specialization, students are expected to have advanced (graduate) training in ONE of the following:

• quantitative social science methodology and a discipline of social science (as would be the case for a second-year PhD student in Political Science, Sociology, Criminology, Human Development and Family Studies, or Demography), OR
• statistics (as would be the case for a second-year PhD student in Statistics), OR
• information science or informatics (as would be the case for a second-year PhD student in Information Science and Technology, or a second-year PhD student in Geography specializing in GIScience), OR
• computer science (as would be the case for a second-year PhD student in Computer Science and Engineering).

This requirement is met as a matter of meeting home program requirements for students in the dual-title PhD, but may require additional coursework on the part of students in other programs wishing to pursue the graduate minor.

With regard to general preparation, students are expected to have ALL of the following technical knowledge:

• basic programming skills, including basic facility with R &/or Python, AND
• basic knowledge of relational databases &/or geographic information systems, AND
• basic knowledge of probability, applied statistics, &/or social science research design, AND
• basic familiarity with a substantive or theoretical area of social science (e.g., 300-level coursework in political science, sociology, criminology, human development, psychology, economics, communication, anthropology, human geography, social informatics, or similar fields).

It is not unusual for students to have one or more gaps in this preparation at time of application to the SoDA program. Students should work with Social Data Analytics advisers to develop a plan for timely remediation of any deficiencies, which generally will not require formal coursework for students whose training and interests are otherwise appropriate for pursuit of the Social Data Analytics degree. Where possible, this will be addressed at time of application to the Social Data Analytics program.

To this end, some free training materials are linked in the reference section, and there are also abundant free, high-quality, self-paced, course-style training materials on these and related subjects available through edX (http://www.edx.org), Udacity (http://www.udacity.com), Codecademy (http://codecademy.com), and Lynda (http://www.lynda.psu).
