Categorical Data Analysis in Python
By
Jaidev Deshpande
Data Scientist, DataCulture Analytics
twitter.com/jaidevd
Problem: Who's likely to attend the next meetup?
● Who comes often?
● Men / Women?
● Where do you live? How far from the venue?
● Proficiency with Python (Beginner / Intermediate / Advanced)?
● Area of interest?
Something like..
| Attendees | Attendance (%) | Gender | Pincode | Proficiency in Python | Interest | ... |
|---|---|---|---|---|---|---|
| attendee_1 | 80 | M | 411013 | Intermediate | Web | ... |
| attendee_2 | 30 | F | 411040 | Advanced | Test / Automation | ... |
| attendee_3 | 55 | M | 411001 | Beginners | Scientific | ... |
| ... | ... | ... | ... | ... | ... | ... |
● 1. Numerical features – continuous and quantitative
● 2. Categorical features – discrete and qualitative
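The table above can be sketched as a pandas DataFrame to make the two kinds of features concrete. This is only an illustration: the column names are made up, and storing pincodes as strings is an assumption made here so that they are treated as labels rather than quantities.

```python
import pandas as pd

# The three example attendees from the table above (column names are illustrative).
attendees = pd.DataFrame(
    {
        "attendance_pct": [80, 30, 55],
        "gender": ["M", "F", "M"],
        # Pincodes look numeric but are labels, so they are stored as strings here.
        "pincode": ["411013", "411040", "411001"],
        "proficiency": ["Intermediate", "Advanced", "Beginner"],
        "interest": ["Web", "Test / Automation", "Scientific"],
    },
    index=["attendee_1", "attendee_2", "attendee_3"],
)

# Split the columns by kind: numeric dtypes versus everything else.
numerical = attendees.select_dtypes(include="number").columns.tolist()
categorical = attendees.select_dtypes(exclude="number").columns.tolist()

print(numerical)
print(categorical)
```

Only the attendance column ends up in the numerical group; everything else is categorical.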
Common Numerical Operations on Data
● Obviously – add, subtract, multiply, divide
● Statistical moments
● Operations in vector spaces
  – Distance measures
  – Slicing
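The operations listed above might look like this with NumPy. The two arrays are hypothetical attendance percentages, chosen only for illustration.

```python
import numpy as np

# Hypothetical attendance percentages for two groups of three attendees.
a = np.array([80.0, 30.0, 55.0])
b = np.array([60.0, 50.0, 55.0])

# Arithmetic is defined element-wise.
total = a + b

# Statistical moments: mean and (population) variance.
mean_a = a.mean()
var_a = a.var()

# Vector-space operations: Euclidean distance and slicing.
distance = np.linalg.norm(a - b)
first_two = a[:2]

print(total, mean_a, var_a, distance, first_two)
```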
Comparison of Operations
| Numerical Data | Categorical Data (Strings, etc.) |
|---|---|
| Addition, subtraction, multiplication, division | What's the product of two strings? |
| Mean, Variance, Standard Deviation | The average pincode of two areas? |
| Vector Spaces – the very idea of 'measuring' | &%%#&$$*&!!!! |
At least get some numbers!
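A small sketch of why the comparison above breaks down in plain Python. The strings and pincodes are taken from the example table; nothing else is assumed.

```python
# Multiplying two strings is not defined in Python.
try:
    "Web" * "Scientific"
    product_defined = True
except TypeError:
    product_defined = False

# Pincodes can be averaged as integers, but the result does not name a real area.
pincodes = [411013, 411040]
average_pincode = sum(pincodes) / len(pincodes)

print(product_defined)
print(average_pincode)
```

The average runs without error, which is the more dangerous case: the arithmetic works, but the number means nothing.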
One-hot Encoding
● [Apples, Oranges, Mangoes] → [0, 0, 1; 0, 1, 0; 1, 0, 0]
● sklearn.preprocessing.OneHotEncoder
● sklearn.feature_extraction.DictVectorizer
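A minimal sketch of the two scikit-learn classes named above, applied to the fruit example. The feature name `fruit` is made up for the illustration. Both classes order categories alphabetically by default, so the column order may differ from the matrix shown on the slide.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import OneHotEncoder

fruits = ["Apples", "Oranges", "Mangoes"]

# OneHotEncoder expects a 2-D input: one row per sample, one column per feature.
encoder = OneHotEncoder()
onehot = encoder.fit_transform([[f] for f in fruits]).toarray()

# DictVectorizer takes one dict per sample and one-hot encodes string values.
vectorizer = DictVectorizer(sparse=False)
onehot_dict = vectorizer.fit_transform([{"fruit": f} for f in fruits])

print(encoder.categories_[0])     # column order chosen by scikit-learn
print(onehot)
print(vectorizer.feature_names_)  # e.g. names of the form "fruit=<value>"
print(onehot_dict)
```

Each row contains exactly one 1, marking which category that sample belongs to.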
Original Data
| Attendees | Attendance (%) | Gender | Pincode | Proficiency in Python | Interest | ... |
|---|---|---|---|---|---|---|
| attendee_1 | 80 | [0 1] | [1 0 0 … 0] | [0 1 0] | [1 0 0 0 0 0] | ... |
| attendee_2 | 30 | [1 0] | [0 1 0 … 0] | [1 0 0] | [0 1 0 0 0 0] | ... |
| attendee_3 | 55 | [0 1] | [0 0 1 … 0] | [0 0 1] | [0 0 1 0 0 0] | ... |
| ... | ... | ... | ... | ... | ... | ... |
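An encoded table like the one above can be sketched with `pandas.get_dummies`, which is one of several ways to do it and not necessarily what the slides used. The column names are illustrative, and only the three example attendees are included.

```python
import pandas as pd

# The categorical columns of the three example attendees (names are illustrative).
attendees = pd.DataFrame(
    {
        "gender": ["M", "F", "M"],
        "pincode": ["411013", "411040", "411001"],
        "proficiency": ["Intermediate", "Advanced", "Beginner"],
        "interest": ["Web", "Test / Automation", "Scientific"],
    },
    index=["attendee_1", "attendee_2", "attendee_3"],
)

# One indicator column per distinct value of each categorical column.
encoded = pd.get_dummies(attendees)

n_original = attendees.shape[1]
n_encoded = encoded.shape[1]
print(n_original, n_encoded)
```

Even with three rows, four categorical columns expand to eleven indicator columns. A feature such as pincode adds one column per distinct value, so the width grows quickly on real data.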
Curse of Dimensionality
Correspondence Analysis
● Contingency tables (pandas.crosstab)
proficiency  advanced  beginner  intermediate
gender
F                   1         0             0
M                   0         1             1
● Different numerical measures
● Perceptual maps
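The contingency table above can be reproduced with `pandas.crosstab` from the three example attendees. The column names used here are illustrative.

```python
import pandas as pd

# Gender and proficiency of the three example attendees.
attendees = pd.DataFrame(
    {
        "gender": ["M", "F", "M"],
        "proficiency": ["intermediate", "advanced", "beginner"],
    }
)

# Rows are genders, columns are proficiency levels, cells are counts.
table = pd.crosstab(attendees["gender"], attendees["proficiency"])
print(table)
```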
Correspondence Analysis
● How are proficiencies related w.r.t. gender? (Row profiles)
● How are genders related w.r.t. proficiency? (Column profiles)
  – Cosine similarity
  – Correlation / Covariance
● How are they interrelated?
  – Weighted chi-squared distance
● Can the dimensionality be reduced?
  – Singular value decomposition / PCA
  – sklearn.decomposition.PCA
  – sklearn.decomposition.TruncatedSVD
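The profile and distance ideas above can be sketched in NumPy using the small gender-by-proficiency table. The helper names `chi2_distance` and `cosine_similarity` are hypothetical, written for this illustration and not taken from any library.

```python
import numpy as np

# Gender (rows: F, M) by proficiency (columns: advanced, beginner, intermediate).
counts = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 1.0]])

# Row profiles: each row divided by its total. Column profiles: same for columns.
row_profiles = counts / counts.sum(axis=1, keepdims=True)
col_profiles = counts / counts.sum(axis=0, keepdims=True)

# Column masses: the share of all observations falling in each column.
col_masses = counts.sum(axis=0) / counts.sum()


def chi2_distance(p, q, masses):
    """Chi-squared distance between two profiles, weighted by 1 / mass."""
    return np.sqrt(np.sum((p - q) ** 2 / masses))


def cosine_similarity(p, q):
    """Cosine of the angle between two profiles."""
    return p @ q / (np.linalg.norm(p) * np.linalg.norm(q))


d_rows = chi2_distance(row_profiles[0], row_profiles[1], col_masses)
cos_rows = cosine_similarity(row_profiles[0], row_profiles[1])
print(d_rows, cos_rows)
```

In this tiny table the two row profiles share no category, so their cosine similarity is zero.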
Sample Problem
● Consider the proficiency and interest features from the original problem
● Fake data with 100 observations
● Contingency matrix:
              automation  scientific  web
advanced               8           1    7
beginner              13           9   35
intermediate           7           1   19
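One way to run a simple correspondence analysis on this matrix is a singular value decomposition of the standardized residuals. The sketch below uses plain NumPy and a textbook formulation. It is not the `mca` package linked in the sources and is not claimed to reproduce the talk's results.

```python
import numpy as np

rows = ["advanced", "beginner", "intermediate"]
cols = ["automation", "scientific", "web"]
counts = np.array([[8, 1, 7],
                   [13, 9, 35],
                   [7, 1, 19]], dtype=float)

n = counts.sum()
P = counts / n                 # correspondence matrix
r = P.sum(axis=1)              # row masses
c = P.sum(axis=0)              # column masses

# Standardized residuals from the independence model r * c^T.
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))

U, sigma, Vt = np.linalg.svd(S, full_matrices=False)

# Principal coordinates of the rows and columns.
row_coords = (U * sigma) / np.sqrt(r)[:, None]
col_coords = (Vt.T * sigma) / np.sqrt(c)[:, None]

# Inertia per axis; the total equals the chi-squared statistic divided by n.
inertia = sigma ** 2
total_inertia = inertia.sum()

for name, xy in zip(rows, row_coords[:, :2]):
    print(name, xy)
for name, xy in zip(cols, col_coords[:, :2]):
    print(name, xy)
```

The first two columns of `row_coords` and `col_coords` are what would be plotted on a perceptual map. For a 3 × 3 table, at most two axes carry non-zero inertia.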
Results
Source and Tutorials
● http://github.com/motherbox/mca