Introduction to Open Data and Data Science

Open Data Learn-up

Upload: suraj-kumar-jana

Post on 24-Jan-2018

TRANSCRIPT

Page 1: Introduction to Open Data and Data Science

Open Data Learn-up

Page 2: Introduction to Open Data and Data Science

@_opendatahack

About opendatahack.org

Open Data Hack is a collaborative effort to solve the day-to-day difficulties faced by local communities, civic bodies and non-profit institutions. Technologists, designers, innovators and government bodies with great social insight come together for a day to build technology-based solutions using freely accessible open data.

Page 3: Introduction to Open Data and Data Science

Our current projects in India

● Real-time environment vitals monitoring system with suggestions

● Health factors heat map of urban localities in India

● Mapping of all quality abortion clinics in India

Page 4: Introduction to Open Data and Data Science

Introduction to Open Data

Page 5: Introduction to Open Data and Data Science

What is Data?

A collection of facts, information and statistics that can be analyzed to develop new knowledge

Page 6: Introduction to Open Data and Data Science

What is Open Data?

Page 7: Introduction to Open Data and Data Science

Definition by OKF

A piece of data or content is open if anyone is free to use, reuse, and redistribute it, subject only, at most, to the requirement to attribute and/or share-alike.

Page 8: Introduction to Open Data and Data Science

Definition by ODI

Open data is data that is made available by organizations, businesses and individuals for anyone to access, use and share.

Page 9: Introduction to Open Data and Data Science

Let’s define it...

Page 10: Introduction to Open Data and Data Science

Open Data is accessible public data that people, companies and organizations can use to launch new ventures, analyze patterns and trends, make data-driven decisions, and solve complex problems.

Page 11: Introduction to Open Data and Data Science

Benefits of Open Data

● Data Driven Decision Making

● Performance Measurement

● Reduction of Government Costs

● Support an Open Government Initiative

– e.g. Transparency

● Economic Development

● Increased Citizen Engagement

● Talent Attraction / Retention

Page 12: Introduction to Open Data and Data Science

Types of Open Data

● Government data

● Commercial data

● Crowdsourced data

Page 13: Introduction to Open Data and Data Science

A few Open Data projects...

Page 14: Introduction to Open Data and Data Science

Open Data sources

Page 15: Introduction to Open Data and Data Science

Open Data Licenses

● Open Data Commons Public Domain Dedication and Licence (ODC PDDL) – Public domain

● Creative Commons CCZero – Public domain

● Open Data Commons Attribution License – Attribution for data(bases)

● Open Data Commons Open Database License (ODbL) – Attribution-ShareAlike for data(bases)

Page 16: Introduction to Open Data and Data Science

Introduction to Data Science

Page 17: Introduction to Open Data and Data Science

What is Data Science?

Data science ~ computer science + mathematics/statistics + visualization

Page 18: Introduction to Open Data and Data Science

Data is just like crude oil

● It’s valuable, but if unrefined it cannot really be used.

● It has to be changed into gas, plastic, chemicals, etc. to create a valuable entity that drives profitable activity.

- Data must be broken down and analyzed for it to have value.

Page 19: Introduction to Open Data and Data Science

Outline

● Harvesting

● Cleaning

● Analyzing

● Visualizing

● Publishing

DATA

Page 20: Introduction to Open Data and Data Science

Data harvesting

● Locally available data

● Data dumps from Web

● Data through Web APIs

● Structured data in Web documents

Page 21: Introduction to Open Data and Data Science

Data cleansing

● Harvested data may come with lots of noise or interesting anomalies.

● The goal is to provide a structured representation for analysis.

- Network (graph)

- Values with dimension
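
As a rough illustration of the cleansing step, the sketch below normalizes a few harvested records using only the Python standard library. The field names (`city`, `pm25`) and the sample rows are hypothetical, invented for this example; real data will need its own rules for what counts as noise.

```python
def clean_records(rows):
    """Return rows with trimmed text and numeric readings; drop unusable rows."""
    cleaned = []
    for row in rows:
        city = (row.get("city") or "").strip().title()
        raw_value = (row.get("pm25") or "").strip()
        if not city or not raw_value:
            continue  # a missing field is treated as noise
        try:
            value = float(raw_value)
        except ValueError:
            continue  # non-numeric reading such as "n/a"
        cleaned.append({"city": city, "pm25": value})
    return cleaned


# Hypothetical harvested rows with typical noise.
raw_rows = [
    {"city": " delhi ", "pm25": "153.2"},
    {"city": "Mumbai", "pm25": "n/a"},
    {"city": "", "pm25": "40"},
    {"city": "kolkata", "pm25": " 88 "},
]

cleaned_rows = clean_records(raw_rows)
print(cleaned_rows)
```

Dropping bad rows is only one possible policy; an anomaly may be interesting in its own right, so it can be worth logging what was discarded.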

Page 22: Introduction to Open Data and Data Science

Data Science Tools

Page 23: Introduction to Open Data and Data Science

Data harvesting

● urllib & BeautifulSoup

● Scrapy
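
To give a feel for harvesting structured data from a Web document, here is a minimal link-extraction sketch. It uses the standard library's `html.parser` as a stand-in for BeautifulSoup so that it runs without extra packages; the `LinkCollector` class, the sample HTML and the `example.org` URLs are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from the href attribute of every <a> tag."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


# Hypothetical page content; a real run would download it with
# urllib.request.urlopen(url).read().decode("utf-8").
sample_html = """
<html><body>
  <a href="/datasets/air-quality.csv">Air quality</a>
  <a href="https://example.org/health.json">Health</a>
  <a name="top">No link here</a>
</body></html>
"""

collector = LinkCollector("https://example.org/catalog/")
collector.feed(sample_html)
print(collector.links)
```

With BeautifulSoup installed, the same idea is usually shorter, and Scrapy adds crawling, scheduling and export on top.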

Page 24: Introduction to Open Data and Data Science

Some tips & ethics

● Use the mobile version of the sites if available

● No cookies

● Respect robots.txt

● Identify yourself

● If possible, download bulk data first, process it later

● Prefer dumps over APIs, APIs over scraping

● Be polite and request permission to gather the data

● Worth checking: https://scraperwiki.com/
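
Two of these tips, respecting robots.txt and identifying yourself, can be sketched with the standard library alone. The robots.txt rules, the bot name and the `example.org` URLs below are hypothetical, and nothing is actually sent over the network.

```python
from urllib.request import Request
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would load it with
# parser.set_url("https://example.org/robots.txt") followed by parser.read().
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

USER_AGENT = "opendata-learnup-bot"  # hypothetical name that identifies the scraper

parser = RobotFileParser()
parser.parse(robots_lines)

allowed_public = parser.can_fetch(USER_AGENT, "https://example.org/datasets/")
allowed_private = parser.can_fetch(USER_AGENT, "https://example.org/private/report")
print(allowed_public, allowed_private)

# Identify yourself: attach the User-Agent header to every request you build.
# The request is only constructed here, not sent.
request = Request("https://example.org/datasets/", headers={"User-Agent": USER_AGENT})
print(request.full_url)
```

Checking `can_fetch` before each download keeps the scraper within the limits the site owner has published.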

Page 25: Introduction to Open Data and Data Science

Data analyzing

● NumPy

- Offers an efficient multidimensional array object, ndarray

- Basic linear algebra operations and data types

- Requires GNU Fortran

● SciPy

- Builds on top of NumPy

- Modules for statistics, optimization, signal processing, ...

- Add-ons (called SciKits) for machine learning, data mining, etc.

● For analysing networks

- NetworkX

- igraph
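
A small sketch of what the `ndarray` and the basic linear algebra operations look like in practice. It assumes NumPy is installed; the readings are hypothetical numbers made up for the example.

```python
import numpy as np

# Hypothetical readings: 3 monitoring stations (rows) x 4 days (columns).
readings = np.array([
    [10.0, 12.0, 14.0, 16.0],
    [20.0, 18.0, 22.0, 20.0],
    [5.0, 7.0, 6.0, 6.0],
])

station_means = readings.mean(axis=1)  # one mean per station
daily_totals = readings.sum(axis=0)    # one total per day

# Basic linear algebra: a matrix-vector product with equal day weights,
# which reproduces the per-station means.
weights = np.array([0.25, 0.25, 0.25, 0.25])
weighted = readings @ weights

print(readings.shape)
print(station_means)
print(daily_totals)
print(weighted)
```

The `axis` argument decides whether an operation collapses rows or columns, which is usually the first thing to check when a result has an unexpected shape.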

Page 26: Introduction to Open Data and Data Science

Data visualizing

● Matplotlib

● NetworkX

● PyGraphviz

Page 27: Introduction to Open Data and Data Science

Page 28: Introduction to Open Data and Data Science

NumPy + SciPy + Matplotlib + IPython

● Provides a MATLAB-ish environment

● IPython provides an extended interactive interpreter (tab completion, magic functions for object querying, debugging, ...)

Page 29: Introduction to Open Data and Data Science

Some convenient data formats

● JSON (import json, or simplejson)

● XML (import xml)

● RDF (import rdflib, SPARQLWrapper)

● GraphML (import networkx)

● CSV (import csv)
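
For two of these formats, a minimal round trip with the standard library's `json` and `csv` modules might look like the sketch below. The records are hypothetical, and an in-memory buffer stands in for a file on disk.

```python
import csv
import io
import json

# Hypothetical records, invented for this example.
records = [
    {"city": "Delhi", "clinics": 12},
    {"city": "Pune", "clinics": 7},
]

# JSON round trip: types such as integers survive.
json_text = json.dumps(records)
from_json = json.loads(json_text)

# CSV round trip through an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["city", "clinics"])
writer.writeheader()
writer.writerows(records)

buffer.seek(0)
from_csv = list(csv.DictReader(buffer))

print(from_json)
print(from_csv)
```

Note that CSV carries no type information, so every value read back is a string and numeric columns must be converted explicitly.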

Page 30: Introduction to Open Data and Data Science

Resource Description Framework (RDF)

● Collection of W3C standards for modeling complex relations and exchanging information

● Allows data from multiple sources to combine nicely

● RDF describes data with triples

- each triple has the form subject - predicate - object

e.g. PyconIndia2017 is organized in Delhi
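
The triple idea can be sketched with plain Python tuples. This is not rdflib and not real RDF syntax: the identifiers, the predicate names and the `match` helper are hypothetical, chosen only to show how subject - predicate - object statements can be stored and queried.

```python
# Each triple is (subject, predicate, object). The identifiers are hypothetical.
triples = [
    ("PyconIndia2017", "organizedIn", "Delhi"),
    ("PyconIndia2017", "type", "Conference"),
    ("Delhi", "locatedIn", "India"),
]


def match(store, subject=None, predicate=None, obj=None):
    """Return the triples that agree with every pattern element that is not None."""
    return [
        (s, p, o)
        for (s, p, o) in store
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


about_pycon = match(triples, subject="PyconIndia2017")
in_delhi = match(triples, predicate="organizedIn", obj="Delhi")
print(about_pycon)
print(in_delhi)
```

Because every statement has the same three-part shape, triples from different sources can simply be appended to the same store, which is what lets RDF data combine nicely.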

Page 31: Introduction to Open Data and Data Science

Why R for Data Science?

● Algorithms

● Visualizations

● Data manipulation

● Integrations

● Easily scalable

Page 32: Introduction to Open Data and Data Science

Simple R code for bar graph

# Create the data for the chart.
H <- c(7,12,28,3,41)

# Give the chart file a name.
png(file = "barchart.png")

# Plot the bar chart.
barplot(H)

# Save the file.
dev.off()

Page 33: Introduction to Open Data and Data Science

Shiny R

https://shiny.rstudio.com/gallery/

Page 34: Introduction to Open Data and Data Science

A few commonly used algorithms

● Naïve Bayes Classifier Algorithm

● K Means Clustering Algorithm

● Support Vector Machine Algorithm

● Apriori Algorithm

● Linear Regression

● Logistic Regression

● Artificial Neural Networks

● Random Forests

● Decision Trees

● Nearest Neighbours
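
To make one of these concrete, here is a minimal sketch of simple linear regression (ordinary least squares with a single input) in plain Python. The `fit_line` helper and the data points are hypothetical; in practice a library implementation would normally be used.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

The sketch assumes at least two distinct x values; with identical inputs the variance is zero and the slope is undefined.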

Page 35: Introduction to Open Data and Data Science

Anaconda

Page 36: Introduction to Open Data and Data Science

Thank you

opendatahack.org

fb.com/opendatahack