ntc16 - open data and open source data science


Upload: steph-nagoski

Post on 16-Feb-2017


Page 1: NTC16 - Open Data and Open Source Data Science

3 Affordable Solutions

Open Data and Open Source Data Science for You

March, 2016

Page 2: NTC16 - Open Data and Open Source Data Science

© TechSoup Global | All rights reserved

Introduction

Paula Alves @LadyData

Steph Nagoski @InformationChef

Session ID 104 #16NTCopendata

Materials & Collaboration Notes: http://po.st/opendata-16NTC

Evaluation Link: http://po.st/fUt2gY

** WARNING: This presentation exposes information that you may find disturbing.

Page 3: NTC16 - Open Data and Open Source Data Science

Outline

Data Wrangling, Merging

Small Data Problems

Open Data Examples

Online Abuse management - Social and Technical, together

Abusive Community Analysis - Example using Reddit data

Bot Detection & Usage

Page 4: NTC16 - Open Data and Open Source Data Science

Data Wrangling / Data Merging Tools

Cleaning and merging multiple data sources: databases, CSV, TXT files, JSON, XML, web services & Open Data files

Trifacta www.trifacta.com/trifacta-wrangler/

OpenRefine - previously Google Refine http://openrefine.org/

Microsoft offerings you might already have: SSIS & Azure Data Factory

Other options include Crowdflower for data cleansing & tagging http://crowdflower.com
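The tools above handle this through a user interface. As a rough illustration of what "cleaning and merging" means underneath, here is a minimal Python sketch that trims stray whitespace from a CSV export and joins it to a JSON export on a shared key. The sample data, the `org_id` column, and the helper names are all hypothetical and come from none of the tools listed.

```python
import csv
import io
import json

# Hypothetical sample inputs standing in for two real exports.
CSV_TEXT = """org_id,name
1, Food Bank
2,Youth Center
3,Animal Shelter
"""
JSON_TEXT = '[{"org_id": "1", "donations": 1200}, {"org_id": "3", "donations": 450}]'


def load_csv(text):
    """Read CSV rows as dicts, trimming stray whitespace from every value."""
    reader = csv.DictReader(io.StringIO(text))
    return [{key: value.strip() for key, value in row.items()} for row in reader]


def merge_on_key(left, right, key):
    """Left-join two lists of dicts on a shared key column."""
    lookup = {str(row[key]): row for row in right}
    merged = []
    for row in left:
        match = lookup.get(str(row[key]), {})
        # Values from the left source win if both sources share a field.
        merged.append({**match, **row})
    return merged


orgs = load_csv(CSV_TEXT)
donations = json.loads(JSON_TEXT)
combined = merge_on_key(orgs, donations, "org_id")
for row in combined:
    print(row)
```

Rows with no match in the second source are kept without the extra fields, which is usually what you want when one file is the master list.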

Page 5: NTC16 - Open Data and Open Source Data Science

Data Wrangling - Trifacta

Page 6: NTC16 - Open Data and Open Source Data Science

Data Wrangling - OpenRefine

Clean, Merge, and Transform data – for Javascript developers

Page 7: NTC16 - Open Data and Open Source Data Science

Crowdflower

Tool to enrich your data through technical and crowdsourced tagging, flagging, and manual review.

Page 8: NTC16 - Open Data and Open Source Data Science

Big Data? We all hope we grow that big. For now…

Page 9: NTC16 - Open Data and Open Source Data Science

Small Data Problem Examples

San Francisco Health Improvement Partnership - Alcohol Policy Partnership Working Group w/Trifacta

https://jrnew.shinyapps.io/sfhip-app/

Is neighborhood crime correlated with alcohol sales?
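A question like this one usually starts with a correlation coefficient. The sketch below computes Pearson's r in plain Python. The neighborhood figures are made-up toy numbers, not San Francisco data, and the working group's actual method is not described on the slide.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / math.sqrt(var_x * var_y)


# Toy, invented values: one entry per hypothetical neighborhood.
alcohol_outlets = [2, 5, 1, 8, 4]
crime_incidents = [30, 55, 20, 90, 50]

r = pearson(alcohol_outlets, crime_incidents)
print(f"Pearson r = {r:.2f}")
```

A value near 1 or -1 indicates a strong linear relationship; it does not by itself show that one thing causes the other.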

Page 10: NTC16 - Open Data and Open Source Data Science

Small Data Problem Examples

Bosnian/Herzegovinian electoral data w/Google Refine https://www.youtube.com/watch?v=BcxgAOCFppY

Southern Poverty Law Center hate group list https://www.splcenter.org/hate-map

Conversion Therapy source list http://www.truthwinsout.org/ex-gay-consumer-fraud-division/

Govt Data sources - 18F - College Information https://collegescorecard.ed.gov/search/?major=computer&sort=advantage:desc

Page 11: NTC16 - Open Data and Open Source Data Science

Outline

Data Wrangling, Merging

Small Data Problems

Open Data Examples

Online Abuse management - Social and Technical, together

Abusive Community Analysis - Example using Reddit data

Bot Detection & Usage

Page 12: NTC16 - Open Data and Open Source Data Science

Reusable Open Data Analysis

DataKind - http://www.datakind.org/blog/open-data-in-action-our-top-25

Page 13: NTC16 - Open Data and Open Source Data Science

Reusable Open Data Analysis

CivicTech – Trends in Civic Tech Investment tool http://knightfoundation.org/features/civictech/

Page 14: NTC16 - Open Data and Open Source Data Science

Reusable Open Data Analysis

Data For Good: http://datalook.io/non-techies/

Library of reusable projects, with a focus on non-tech users!

Page 15: NTC16 - Open Data and Open Source Data Science

Open Data Formats -> Open Data Services

18F - GSA branch committed to open development & open data https://18f.gsa.gov/

Open Data Maker: convert CSV files to an extensible open API w/analytics https://github.com/18F/open-data-maker

First large example of the Open Data Maker API in use: https://collegescorecard.ed.gov/
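To make the "CSV file to API" idea concrete, here is a minimal in-memory sketch: rows are loaded from CSV and filtered by field, the way query parameters would filter them in a web API. This is not how Open Data Maker itself is implemented. The `query` helper, the column names, and the sample institutions are hypothetical.

```python
import csv
import io

# Hypothetical sample file; a real deployment would read a published CSV.
CSV_TEXT = """name,state,size
Alpha College,CA,1200
Beta Institute,NY,8000
Gamma University,CA,15000
"""


def load_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def query(rows, **filters):
    """Return rows whose fields equal every filter value (compared as strings)."""
    return [
        row
        for row in rows
        if all(row.get(field) == str(value) for field, value in filters.items())
    ]


rows = load_rows(CSV_TEXT)
for match in query(rows, state="CA"):
    print(match["name"], match["size"])
```

Serving the data this way means every consumer filters the same centrally maintained file instead of downloading and cleaning a private copy.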

Page 16: NTC16 - Open Data and Open Source Data Science

Free Speech, and Groups that may disagree w/you

#BlackLivesMatter

Feminist Frequency - Media Criticism from Feminist perspective

Jewish and Islamic communities

Disability Organizations

Reproductive Health and Women’s rights

Any nonprofit that advocates for oppressed minorities

Page 17: NTC16 - Open Data and Open Source Data Science

Handling Online Abuse 1

Crowdsourcing support/handling: Online Abuse Prevention Initiative (OAPI) http://onlineabuseprevention.org/

Projects: https://github.com/oapi

Hollaback’s new Heartmob https://iheartmob.org/

Shared Block Lists - https://blocktogether.org/

Hiding blocked users from Twitter Search http://blog.randi.io/2016/01/13/hiding-blocked-users-from-twitter-search/

GoodGame AutoBlocker https://github.com/freebsdgirl/ggautoblocker

Page 18: NTC16 - Open Data and Open Source Data Science

Outline

Data Wrangling, Merging

Small Data Problems

Open Data Examples

Online Abuse management - Social and Technical, together

Abusive Community Analysis - Example using Reddit data

Bot Detection & Usage

Page 19: NTC16 - Open Data and Open Source Data Science

Reddit Common Terms in Offensive Thread

http://reddit.com/r/WhiteRights

Page 20: NTC16 - Open Data and Open Source Data Science

Top 25 Most Frequent Words

Page 21: NTC16 - Open Data and Open Source Data Science

Sample from Top 50 Bigrams in Reddit dataset

Word1     Word2     Rank
bin       laden     5
ann       coulter   9
jim       crow      14
hip       hop       18
pearl     harbor    22
nelson    mandela   27
martin    luther    39
charlie   hebdo     40
bernie    sanders   48
anglo     saxon     50

Page 22: NTC16 - Open Data and Open Source Data Science

Code Examples of Reddit Analysis

Placeholder
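The slide's own examples are not included in this transcript. As a stand-in, here is a minimal sketch of how word and bigram counts like those on the two previous slides could be produced. It is not the presenters' code: the stopword list, the tokenizer, and the two sample comments are illustrative assumptions, and no Reddit data is fetched.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real analysis would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stopwords."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def top_words(comments, n=25):
    """Most frequent single words across all comments."""
    counts = Counter()
    for comment in comments:
        counts.update(tokenize(comment))
    return counts.most_common(n)


def top_bigrams(comments, n=50):
    """Most frequent adjacent word pairs, counted within each comment."""
    counts = Counter()
    for comment in comments:
        tokens = tokenize(comment)
        # Stopwords are removed first, so pairs can span a dropped word.
        counts.update(zip(tokens, tokens[1:]))
    return counts.most_common(n)


# Hypothetical stand-in comments; replace with text exported from a thread.
comments = [
    "Open data helps small nonprofits.",
    "Small nonprofits need open data and open tools.",
]

print(top_words(comments, 5))
print(top_bigrams(comments, 5))
```

Bigrams often surface names and fixed phrases that single-word counts split apart, which is why the previous slide lists pairs such as "jim crow" and "charlie hebdo".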

Page 23: NTC16 - Open Data and Open Source Data Science

Handling Online Abuse: Bots

Bot Detection: http://www.erinshellman.com/bot-or-not/

Page 24: NTC16 - Open Data and Open Source Data Science

Handling Online Abuse: Bots

Productized simple analysis of Twitter bots: https://www.twitteraudit.com/

Page 25: NTC16 - Open Data and Open Source Data Science

Takeaways

Many tools for merging, cleaning & preparing your data for analysis are now accessible to end-users, many of them open source or free for nonprofits.

Accessing Open Data through API-based applications is more efficient: the data is centrally updated and fresher, performance is better, and the experience is end-user focused.

Lots of tools are available to help monitor and manage Social Media.

Advanced Data Science tools to detect problems are starting to be used in more end-user friendly ways.

Page 26: NTC16 - Open Data and Open Source Data Science

What do YOU think?

Collaborative Q&A Session