Introduction to Open Data and Data Science
Open Data Learn-up
@_opendatahack
About opendatahack.org
Open Data Hack is a collaborative effort to solve the day-to-day difficulties faced by local communities, civic bodies and non-profit institutions. Technologists, designers, innovators and government bodies with strong social insight come together for a day to build technology-based solutions using freely accessible open data.
Our current projects in India
● Real-time environment vitals monitoring system with suggestions
● Health factors heat map of urban localities in India
● Mapping of all quality abortion clinics in India
Introduction to Open Data
What is Data?
A collection of facts, information and statistics that can be analyzed to develop new knowledge
What is Open Data?
Definition by OKF
A piece of data or content is open if anyone is free to use, reuse, and redistribute it, subject only, at most, to the requirement to attribute and/or share-alike.
Definition by ODI
Open data is data that is made available by organizations, businesses and individuals for
anyone to access, use and share.
Let’s define it...
Open Data is accessible public data that people, companies and organizations can use to launch new ventures, analyze patterns and trends, make data-driven decisions, and solve complex problems.
Benefits of Open Data
● Data Driven Decision Making
● Performance Measurement
● Reduction of Government Costs
● Support an Open Government Initiative
- e.g. Transparency
● Economic Development
● Increased Citizen Engagement
● Talent Attraction / Retention
Types of Open Data
● Government data
● Commercial data
● Crowd sourced data
A few Open Data projects...
Open Data sources
Open Data Licenses
● Open Data Commons Public Domain Dedication and Licence (ODC PDDL) – Public domain
● Creative Commons CCZero – Public domain
● Open Data Commons Attribution License – Attribution for data(bases)
● Open Data Commons Open Database License (ODbL) – Attribution-ShareAlike for data(bases)
Introduction to Data Science
What is Data Science?
Data science ~ computer science + mathematics/statistics + visualization
Data is just like crude oil
● It’s valuable, but if unrefined it cannot really be used.
● It has to be changed into gas, plastic, chemicals, etc. to create a valuable entity that drives profitable activity.
- In the same way, data must be broken down and analyzed for it to have value.
Outline
● Harvesting
● Cleaning
● Analyzing
● Visualizing
● Publishing
Data harvesting
● Locally available data
● Data dumps from Web
● Data through Web APIs
● Structured data in Web documents
Data cleansing
● Harvested data may come with lots of noise or interesting anomalies.
● The goal is to provide a structured representation for analysis.
- Network (graph)
- Values with dimension
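A minimal sketch of the cleaning step, assuming the harvested records arrive as a list of dictionaries. The field names (`city`, `pm25`) and all sample values are made up for illustration; real data will need its own rules.

```python
# Hypothetical harvested records: note the stray whitespace, the
# missing value, the duplicate row and the non-numeric reading.
raw_records = [
    {"city": " Delhi ", "pm25": "153"},
    {"city": "Mumbai", "pm25": ""},
    {"city": "delhi", "pm25": "153"},
    {"city": "Pune", "pm25": "n/a"},
    {"city": "Chennai", "pm25": "41.5"},
]


def clean(records):
    """Return de-duplicated (city, value) pairs with numeric values."""
    seen = set()
    cleaned = []
    for record in records:
        # Normalise the text field so " Delhi " and "delhi" compare equal.
        city = record["city"].strip().title()
        try:
            value = float(record["pm25"])
        except ValueError:
            continue  # drop rows whose reading is missing or not a number
        key = (city, value)
        if key in seen:
            continue  # drop exact duplicates after normalisation
        seen.add(key)
        cleaned.append(key)
    return cleaned


rows = clean(raw_records)
print(rows)
```

Dropping bad rows is only one possible policy; anomalies can also be kept aside and inspected, since they are sometimes the interesting part.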
Data Science Tools
Data harvesting
● urllib & BeautifulSoup
● Scrapy
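A rough sketch of pulling structured data (here, links) out of a Web document. To stay self-contained it uses the standard-library `html.parser` instead of BeautifulSoup, and a hard-coded HTML string instead of a page fetched with urllib. The HTML and the dataset paths are invented for the example.

```python
from html.parser import HTMLParser

# Stand-in for a downloaded page; with network access the text could come
# from urllib.request.urlopen(url).read().decode("utf-8").
SAMPLE_HTML = """
<html><body>
  <a href="/datasets/air-quality.csv">Air quality</a>
  <a href="/datasets/health.json">Health</a>
  <a name="top">No link here</a>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


collector = LinkCollector()
collector.feed(SAMPLE_HTML)
print(collector.links)
```

BeautifulSoup and Scrapy offer much more convenient selectors and crawling support; the idea of walking tags and picking attributes is the same.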
Some tips & ethics
● Use the mobile version of the sites if available
● No cookies
● Respect robots.txt
● Identify yourself
● If possible, download bulk data first, process it later
● Prefer dumps over APIs, APIs over scraping
● Be polite and request permission to gather the data
● Worth checking: https://scraperwiki.com/
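One way to respect robots.txt from Python is the standard-library `urllib.robotparser`. The sketch below parses a made-up robots.txt from a string so it runs offline; the user-agent name and the example.org URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real scraper would load it with
# parser.set_url(".../robots.txt") followed by parser.read().
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

USER_AGENT = "opendata-learnup-bot"  # made-up name that identifies the scraper

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask before fetching: is this path allowed for our user agent?
allowed = parser.can_fetch(USER_AGENT, "https://example.org/datasets/index.html")
blocked = parser.can_fetch(USER_AGENT, "https://example.org/private/report.html")

# Seconds the site asks crawlers to wait between requests, if stated.
delay = parser.crawl_delay(USER_AGENT)
print(allowed, blocked, delay)
```

Sending the same name in the request's User-Agent header, ideally with a contact address, covers the "identify yourself" tip.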
Data analyzing
● NumPy
- Offers efficient multidimensional array object, ndarray
- Basic linear algebra operations and data types
- Requires GNU Fortran
● SciPy
- Builds on top of NumPy
- Modules for statistics, optimization, signal processing, ...
- Add-ons (called SciKits) for machine learning, data mining, etc.
● For analysing networks
- NetworkX
- igraph
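A small sketch of what the ndarray and the basic linear algebra routines look like in practice, assuming NumPy is installed. The readings are made-up numbers, not real measurements.

```python
import numpy as np

# Made-up readings: rows are days, columns are three monitoring stations.
readings = np.array([
    [40.0, 55.0, 61.0],
    [42.0, 50.0, 65.0],
    [38.0, 58.0, 60.0],
])

station_means = readings.mean(axis=0)   # average per column (station)
daily_max = readings.max(axis=1)        # highest reading per row (day)

# Basic linear algebra: solve the 2x2 system A @ x = b.
A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([3.0, 5.0])
x = np.linalg.solve(A, b)

print(station_means, daily_max, x)
```

The `axis` argument picks the dimension that is collapsed, which is why whole columns or rows can be summarised without writing loops.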
Data visualizing
● Matplotlib
● NetworkX
● PyGraphviz
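A minimal Matplotlib sketch, assuming the library is installed. The labels, values and the output file name `barchart_example.png` are illustrative only.

```python
import matplotlib
matplotlib.use("Agg")  # render to a file without needing a display
import matplotlib.pyplot as plt

# Made-up values, for illustration only.
labels = ["A", "B", "C", "D", "E"]
values = [7, 12, 28, 3, 41]

fig, ax = plt.subplots()
ax.bar(labels, values)
ax.set_xlabel("Category")
ax.set_ylabel("Value")
ax.set_title("Example bar chart")

# Write the chart to a PNG file in the current directory.
fig.savefig("barchart_example.png")
plt.close(fig)
```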
NumPy + SciPy + Matplotlib + IPython
● Provides a Matlab-ish environment
● IPython provides an extended interactive interpreter (tab completion, magic functions for object querying, debugging, ...)
Some convenient data formats
● JSON (import json, or simplejson)
● XML (import xml.etree.ElementTree)
● RDF (import rdflib, SPARQLWrapper)
● GraphML (import networkx)
● CSV (import csv)
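A short sketch of reading two of these formats with the standard-library `csv` and `json` modules. The data is a tiny made-up sample held in strings, so nothing has to exist on disk.

```python
import csv
import io
import json

# Small in-memory samples standing in for files on disk.
CSV_TEXT = "city,population\nDelhi,100\nPune,50\n"
JSON_TEXT = '{"dataset": "demo", "rows": [1, 2, 3]}'

# csv.DictReader yields one dict per row, keyed by the header line.
csv_rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))

# json.loads turns JSON text into Python dicts, lists, strings and numbers.
document = json.loads(JSON_TEXT)

# Round trip: serialise the CSV rows as JSON text.
as_json = json.dumps(csv_rows)

print(csv_rows[0]["city"], document["rows"], as_json)
```

Note that CSV values always come back as strings, so numeric columns need an explicit conversion during cleaning.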
Resource Description Framework (RDF)
● Collection of W3C standards for modeling complex relations and exchanging information
● Allows data from multiple sources to combine nicely
● RDF describes data with triples
- each triple has the form subject - predicate - object
- e.g. PyconIndia2017 is organized in Delhi
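The triple idea can be sketched in plain Python with tuples. This is not rdflib and not real RDF syntax: the predicate names below are invented labels rather than URIs from a published vocabulary, and `match` is a hypothetical helper written for this example.

```python
# Each triple is (subject, predicate, object). The names are illustrative
# labels, not real URIs from any published vocabulary.
triples = [
    ("PyconIndia2017", "isOrganizedIn", "Delhi"),
    ("Delhi", "isLocatedIn", "India"),
    ("PyconIndia2017", "hasTopic", "Python"),
]


def match(store, subject=None, predicate=None, obj=None):
    """Return the triples that agree with every field that is not None."""
    return [
        (s, p, o)
        for (s, p, o) in store
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


# Where is PyconIndia2017 organized?
venue = match(triples, subject="PyconIndia2017", predicate="isOrganizedIn")

# Combining data from another source is just adding more triples.
other_source = [("Delhi", "hasPopulationDataset", "census-demo")]
merged = triples + other_source
about_delhi = match(merged, subject="Delhi")
print(venue, about_delhi)
```

Because every statement has the same three-part shape, data from different sources merges by simple concatenation, which is the "combine nicely" property. Libraries such as rdflib add URIs, namespaces and SPARQL queries on top of this model.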
Why R for Data Science?
● Algorithms
● Visualizations
● Data manipulation
● Integrations
● Easily scalable
Simple R code for bar graph
# Create the data for the chart.
H <- c(7,12,28,3,41)
# Give the chart file a name.
png(file = "barchart.png")
# Plot the bar chart.
barplot(H)
# Save the file.
dev.off()
Shiny R
https://shiny.rstudio.com/gallery/
A few commonly used algorithms
● Naïve Bayes Classifier Algorithm
● K Means Clustering Algorithm
● Support Vector Machine Algorithm
● Apriori Algorithm
● Linear Regression
● Logistic Regression
● Artificial Neural Networks
● Random Forests
● Decision Trees
● Nearest Neighbours
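To give a feel for one entry in the list, here is a bare-bones nearest neighbours classifier in plain Python. The training points, the labels and the function name `knn_predict` are made up for the example; in practice a library implementation would normally be used.

```python
import math
from collections import Counter

# Made-up training points: (feature vector, label).
training = [
    ((1.0, 1.0), "low"),
    ((1.5, 2.0), "low"),
    ((2.0, 1.5), "low"),
    ((6.0, 6.5), "high"),
    ((7.0, 6.0), "high"),
    ((6.5, 7.5), "high"),
]


def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(points, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    by_distance = sorted(points, key=lambda item: euclidean(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


print(knn_predict(training, (1.2, 1.4)))
print(knn_predict(training, (6.8, 6.9)))
```

The choice of `k` and of the distance measure both affect the result, so they are usually tuned on held-out data.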
Anaconda
Thank you
opendatahack.org
fb.com/opendatahack