Post on 12-Feb-2017


Page 1: Automatic Data Validation and Cleaning with PySemantic

Automatic Data Validation & Cleaning with PySemantic

Jaidev Deshpande
Data Scientist, Cube26 Software Pvt Ltd

Page 2: Automatic Data Validation and Cleaning with PySemantic

About Me

● Data Scientist at Cube26 Software Pvt Ltd
● Previously software developer at Enthought
● Research assistant at TIFR and UoP
● Active contributor to the SciPy stack

/ jaidevd

Page 3: Automatic Data Validation and Cleaning with PySemantic

Typical Data Pipeline

Page 4: Automatic Data Validation and Cleaning with PySemantic

The Problem

● Curating the data and standardizing it across the team
● Data quality problems:
    ○ Unstructured data
    ○ Unorganized data
    ○ Duplicated data
    ○ Irrelevant data
● Communication problems:
    ○ Large and distributed teams
    ○ “What has happened to get the dataset to the current stage?”
    ○ Messier data means more communication.

HOW DO I DESCRIBE THE STRUCTURE OF THE DATA EFFECTIVELY?

Page 5: Automatic Data Validation and Cleaning with PySemantic
Page 6: Automatic Data Validation and Cleaning with PySemantic

PySemantic

Page 7: Automatic Data Validation and Cleaning with PySemantic

Pythonically, PySemantic is:
● A wrapper around pandas parsers and dataframe manipulation routines.
● Not a parser
● A loader for feature extraction for machine learning tasks
● A logger for all operations on a dataset

PySemantic supports:
● Recursive elimination of parser errors
● Automatic validation based on rules
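To make "automatic validation based on rules" concrete, here is a minimal sketch in plain pandas. It is not PySemantic's implementation: the rule keys (`min`, `max`, `allowed`), the column names, and the `apply_rules` helper are all hypothetical, chosen only to illustrate the idea of declaring per-column rules once and filtering every load against them.

```python
import pandas as pd

# Hypothetical per-column rules; these key names are illustrative assumptions,
# not PySemantic's schema vocabulary.
RULES = {
    "age": {"min": 0, "max": 120},
    "sex": {"allowed": ["m", "f"]},
}


def apply_rules(df, rules):
    """Return only the rows of df that satisfy every rule."""
    keep = pd.Series(True, index=df.index)
    for column, rule in rules.items():
        if "min" in rule:
            keep &= df[column] >= rule["min"]
        if "max" in rule:
            keep &= df[column] <= rule["max"]
        if "allowed" in rule:
            keep &= df[column].isin(rule["allowed"])
    return df[keep]


raw = pd.DataFrame({"age": [34, -1, 57, 300], "sex": ["m", "f", "x", "f"]})
clean = apply_rules(raw, RULES)
```

Only the first row survives here: the second and fourth violate the age bounds, and the third has a value outside the allowed set.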

Page 8: Automatic Data Validation and Cleaning with PySemantic

How it works

$ semantic add mydictionary.yaml

mydataset1:
  path: /path/to/mydataset.csv
  nrows: 100
  use_columns:
    - col_a
    - col_b
    - col_c

>>> from pysemantic import Project
>>> project = Project("myproject")
>>> project.load_dataset("mydataset1")
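One way to read the schema above is as a set of keyword arguments for a pandas parser. The sketch below shows that mapping by hand, assuming `nrows` and `use_columns` correspond to the `nrows` and `usecols` arguments of `pandas.read_csv`. The `load_from_schema` helper, the dict form of the schema, and the in-memory CSV standing in for the file path are illustrative assumptions, not PySemantic code.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-in for /path/to/mydataset.csv.
CSV_TEXT = "col_a,col_b,col_c,col_d\n1,2,3,4\n5,6,7,8\n"

# The schema entry from the slide, written as a plain dict (path omitted
# because this sketch reads from a buffer instead of a file).
SCHEMA = {
    "mydataset1": {
        "nrows": 100,
        "use_columns": ["col_a", "col_b", "col_c"],
    }
}


def load_from_schema(name, buffer):
    """Translate one schema entry into pandas.read_csv arguments."""
    spec = SCHEMA[name]
    return pd.read_csv(buffer, nrows=spec["nrows"], usecols=spec["use_columns"])


df = load_from_schema("mydataset1", io.StringIO(CSV_TEXT))
```

The resulting frame holds only the three listed columns, and at most 100 rows are read.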

Page 9: Automatic Data Validation and Cleaning with PySemantic

PySemantic Internals

● Infer and validate parser arguments from the schema using traits
● Dynamically change parser arguments based on the errors raised, if any
● Log everything
● After loading a dataset, apply common preprocessing methods by default
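The second and third points above can be sketched as a retry loop: attempt the parse, and if pandas raises, relax the argument that caused the failure, record what happened, and try again. This is only an illustration of the idea under simple assumptions (the only adjustable argument is `dtype`, and failures surface as `ValueError`). The helper names `find_offender` and `read_with_retries` are hypothetical and do not come from PySemantic.

```python
import io

import pandas as pd

# Hypothetical data: col_a is declared as int but contains a stray "x".
CSV_TEXT = "col_a,col_b\n1,2.5\nx,3.5\n"


def find_offender(text, dtype):
    """Return the first column whose declared dtype fails on its own."""
    for column, typ in dtype.items():
        try:
            pd.read_csv(io.StringIO(text), usecols=[column], dtype={column: typ})
        except ValueError:
            return column
    return None


def read_with_retries(text, dtype):
    """Parse text, dropping dtype constraints that raise, and log each change."""
    dtype = dict(dtype)
    log = []
    while True:
        try:
            frame = pd.read_csv(io.StringIO(text), dtype=dtype or None)
            return frame, log
        except ValueError as err:
            offender = find_offender(text, dtype)
            if offender is None:
                raise  # the error is not a dtype problem; give up
            log.append((offender, str(err)))
            del dtype[offender]


df, log = read_with_retries(CSV_TEXT, {"col_a": int, "col_b": float})
```

Here the first attempt fails on `col_a`, its constraint is dropped, and the second attempt succeeds with `col_b` still parsed as float. The log keeps the column name and the original error message.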

Page 10: Automatic Data Validation and Cleaning with PySemantic

Software Development Practices

● Fully test-driven
● Fully documented
● Pylint score > 9.0

Page 11: Automatic Data Validation and Cleaning with PySemantic

Limitations

● Only supports local files and MySQL tables (untested)
● Not as smart as MS Excel
● Architecture isn’t very clean - the main classes are somewhat confusing

Page 12: Automatic Data Validation and Cleaning with PySemantic

Feedback, Issues, PRs Welcome!

http://github.com/jaidevd/pysemantic