Text modeling with R, Python, and Spark
Modeling Text Data
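[Diagram: grid of data size (Small, Big) against method (Cluster, Topics), with three numbered cases: 1 = Small data, Cluster; 2 = Small data, Topics; 3 = Big data, Topics]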
Technologies
[Diagram: the Small/Big by Cluster/Topics grid, showing the technologies used (R, Python, Spark)]
Clustering the SOTU
[Diagram: case 1 highlighted (Small data, Cluster)]
Data Set
• 70 years of the State of the Union address
• 1945 (Truman) - 2015 (Obama)
• Avg. length: ~6,700 words
• Longest: ~34,000 words
• Shortest: ~2,000 words
• Total: 467,000 words
• Raw data: 2.4 MB
Pipeline
Config Wrangle Model Cluster Visualize
America has enjoyed twenty-two months of uninterrupted economic recovery. But recovery is not enough. If we are to prevail in the long run, we must expand the long-run strength of our economy.
america enjoyed twenty-two months uninterrupted economic recovery recovery not enough prevail long run expand long-run strength economy
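The before and after text above suggests a Wrangle step that lowercases the text, strips punctuation while keeping hyphenated words, and drops common stop words. A minimal sketch of that idea follows. The `wrangle` function and the `STOP_WORDS` set are assumptions made for illustration; the stop-word list holds only the words removed in this one example, and the talk's actual tooling is not shown in the slides.

```python
import re

# Assumed stop-word list: only the words dropped in the slide's example.
# A real pipeline would likely use a fuller list from an NLP library.
STOP_WORDS = {"has", "of", "but", "is", "if", "we", "are",
              "to", "in", "the", "must", "our"}


def wrangle(text):
    """Lowercase, keep words (including hyphenated ones), drop stop words."""
    # Matches runs of letters, optionally joined by single hyphens.
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    return " ".join(t for t in tokens if t not in STOP_WORDS)


raw = ("America has enjoyed twenty-two months of uninterrupted economic "
       "recovery. But recovery is not enough. If we are to prevail in the "
       "long run, we must expand the long-run strength of our economy.")

cleaned = wrangle(raw)
print(cleaned)
```

Note that "not" is kept, as in the slide: dropping negations would change the meaning of "recovery is not enough".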
Topic Modeling SOTU
[Diagram: case 2 highlighted (Small data, Topics)]
Data Set
• 70 years of the State of the Union address
• 1945 (Truman) - 2015 (Obama)
• Avg. length: ~6,700 words
• Longest: ~34,000 words
• Shortest: ~2,000 words
• Total: 467,000 words
• Raw data: 2.4 MB
Pipeline
Config Wrangle Model Extract Visualize
America has enjoyed twenty-two months of uninterrupted economic recovery. But recovery is not enough. If we are to prevail in the long run, we must expand the long-run strength of our economy.
america enjoy twenty-two month uninterrupted economy recovery recovery not enough prevail long run expand long-run strength economy
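For topic modeling, the wrangled text above also reduces words to a base form ("enjoyed" becomes "enjoy", "months" becomes "month", "economic" becomes "economy"), so that variants of a word count as the same term. A minimal sketch of that extra step follows. The `BASE_FORMS` lookup is a hypothetical, hand-built stand-in for a real stemmer or lemmatizer, and `STOP_WORDS` and `normalize` are likewise assumptions that cover only this example.

```python
import re

# Assumed stop-word list: only the words dropped in the slide's example.
STOP_WORDS = {"has", "of", "but", "is", "if", "we", "are",
              "to", "in", "the", "must", "our"}

# Hypothetical lookup standing in for a real stemmer or lemmatizer.
BASE_FORMS = {"enjoyed": "enjoy", "months": "month", "economic": "economy"}


def normalize(text):
    """Lowercase, drop stop words, then map each word to its base form."""
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    kept = [t for t in tokens if t not in STOP_WORDS]
    # Words without an entry in the lookup pass through unchanged.
    return " ".join(BASE_FORMS.get(t, t) for t in kept)


raw = ("America has enjoyed twenty-two months of uninterrupted economic "
       "recovery. But recovery is not enough. If we are to prevail in the "
       "long run, we must expand the long-run strength of our economy.")

normalized = normalize(raw)
print(normalized)
```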
Topic Modeling Congress
[Diagram: case 3 highlighted (Big data, Topics)]
Data Set (Congress loves to talk)
• 20 years of congressional hearings (1995 - 2015)
• 19,381 documents (about 1,000 a year)
• Avg. length: ~32,000 words (5x SOTU)
• Longest: ~900,000 words (length of all 7 Harry Potter books)
• Shortest: ~50 words
• Total: 613 million words (1,300x SOTU)
• Raw data: 3.8 GB
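The comparisons with the State of the Union corpus can be checked with quick arithmetic. The sketch below uses only the rounded figures quoted on the slides; the variable names are chosen here for illustration.

```python
# Approximate figures as quoted on the slides.
sotu_avg_words = 6_700
sotu_total_words = 467_000
congress_avg_words = 32_000
congress_total_words = 613_000_000
congress_docs = 19_381
congress_years = 20

# Average hearing length relative to the average address (quoted as 5x).
avg_ratio = congress_avg_words / sotu_avg_words

# Total corpus size relative to the SOTU corpus (quoted as 1,300x).
total_ratio = congress_total_words / sotu_total_words

# Documents per year (quoted as about 1,000).
docs_per_year = congress_docs / congress_years

# Total implied by document count times average length,
# for comparison with the quoted 613 million words.
implied_total = congress_docs * congress_avg_words

print(f"average length ratio: {avg_ratio:.1f}x")
print(f"total size ratio: {total_ratio:.0f}x")
print(f"documents per year: {docs_per_year:.0f}")
print(f"implied total words: {implied_total:,}")
```

The ratios come out near 4.8x and 1,313x, in line with the rounded "5x" and "1,300x" on the slide, and the implied total of about 620 million words is close to the quoted 613 million.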
Pipeline
Config Wrangle Model Extract Visualize
America has enjoyed twenty-two months of uninterrupted economic recovery. But recovery is not enough. If we are to prevail in the long run, we must expand the long-run strength of our economy.
america enjoy twenty-two month uninterrupted economy recovery recovery not enough prevail long run expand long-run strength economy
exaptive.com/blog
Frank D. Evans
@frankdevans
@exaptive
slideshare.net/frankdevans
github.com/frankdevans/odsc_meetup_text_processing