bionic info pro - taxonomies and machine learning sla 2014
DESCRIPTION
Presentation for the Special Libraries Association on machine-assisted taxonomy creation and the human element.
TRANSCRIPT
Bionic Info Pro: New Takes on an Old Theme
Machine Learning, Taxonomy Creation, Big Data, Competitive Intelligence, and the Human Element
Elaine M. Lasda Bergman
Annual Conference, Special Libraries Association
Vancouver, BC, Canada
Monday, June 9, 2014
Overview
• A little bit about Machine Learning
• A little bit about Taxonomies
• A little bit about Big Data
• A little bit about Hybrid Techniques
NOT NEW: Machine Learning for CI
Mena, Jesus. (1996). Data Mining for Competitive Intelligence, Competitive Intelligence Review, 7(4):18-25.
Refinement of Machine Learning
• Decision Trees/Classification
• Clustering
• Anomaly Detection
Refinement of Machine Learning
• Support Vector Machines: Predictive Classification
• Association Rules: Market basket analysis
• Natural Language Processing: Sentiment Analysis
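As a rough illustration of the market basket analysis named above, the sketch below computes the two standard association-rule measures, support and confidence, for a single rule. The transactions are invented for the example and are not from the presentation.

```python
# Hypothetical shopping baskets, invented for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Support of antecedent and consequent together, divided by
    the support of the antecedent alone."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / base


# Rule under test: {diapers} -> {beer}
rule_support = support({"diapers", "beer"}, transactions)
rule_confidence = confidence({"diapers"}, {"beer"}, transactions)
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

A rule is usually kept only if both measures clear thresholds chosen by the analyst; where to set those thresholds is a judgment call that depends on the dataset.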
Getting up to Speed
• http://efytimes.com
• 6 Video Tutorials and Playlists on Machine Learning (January 2014)
NOT NEW: Taxonomies in Information Retrieval
http://comsaad.blogspot.com/p/old-computer-photos.html
http://commons.wikimedia.org/wiki/File:A_Library_Primer_illustration_Joined_Hand.jpg
Need for Taxonomic Structures
http://farm9.staticflickr.com/8262/8673326413_4492b5dc68_o.jpg
NOT NEW: Datasets
http://www.conceptdraw.com/solution-park/resource/images/solutions/entity-relationship-diagram-(erd)/Diagramming-Crow's-Foot-ERD-Sample60.png
Enter BIG DATA
http://commons.wikimedia.org/wiki/File:DARPA_Big_Data.jpg
Big Data Sources and Analysis
Data Type | Qualities | Analysis Tools | Result
Social Media | Demographics | API integration | More profiles of like-minded users
“Social Influencers” | User Reviews | NLP, Text Analysis | Sentiment readings
“Internet of Things” | Logs/Sensors/Check-Ins | Parsing | Usage and behavior patterns
SaaS | Cloud/Web-based/Subscription software | Dist. data integration/in-memory caching technology/API integration | Usage behavior patterns, customer data, etc.
Public Data | e.g., Amazon Data Market, WorldBank, Wikipedia | All above (depends on data structure) | Depends on Dataset (and there are LOTS of them!)
Hadoop/MapReduce | Volume! | Parallel Processing/Parsing/Reduction | Big patterns, correlations, needles in haystacks
Data Warehouses | Internal transactional data | Likely same as above | Correlations, market basket, etc.
NoSQL/Columnar | Volume! | Fills gaps in parallel processing tools | Real time activity and patterns
In-Stream Monitoring | Network traffic (streaming videos, system outages) | Packet evaluation, distributed query processing | Network/Stream usage patterns
Legacy Data | Usually PDFs & Documents/Semi-Structured | Transformation tools (e.g., Xenos d2e) + above | Depends on content (could be all)
http://www.zdnet.com/top-10-categories-for-big-data-sources-and-mining-technologies-7000000926/
Why “Concept Hierarchies” in an Unstructured Environment?
Advantages
• When a term is too low in the hierarchy to appear in frequent item/rule sets
• Create more interesting rules using more general, aggregated concepts
[DVD, wheat bread, home electronics, electronics, food]
Kumar, T.S. (2005) Introduction to Data Science
Disadvantages
• How low and how high in the hierarchy do you set the threshold?
• Increased computation time
• If threshold is too high, redundant rules for more specific terms can be summarized by rules using more general terms
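To make the concept-hierarchy trade-off concrete, here is a minimal sketch in which specific items are rolled up to parent concepts before support is counted. The parent links, the extra items ("TV", "white bread"), the transactions, and the 0.5 threshold are all assumptions made for the example; the slide lists the terms but not this exact structure.

```python
# Hypothetical concept hierarchy (child -> parent), assumed for illustration.
PARENT = {
    "DVD": "home electronics",
    "TV": "home electronics",
    "home electronics": "electronics",
    "wheat bread": "food",
    "white bread": "food",
}

# Hypothetical transactions, invented for illustration only.
transactions = [
    {"DVD", "wheat bread"},
    {"TV", "white bread"},
    {"DVD", "white bread"},
    {"TV", "wheat bread"},
]


def generalize(item, levels):
    """Climb at most `levels` parent links, stopping at the top concept."""
    for _ in range(levels):
        if item not in PARENT:
            break
        item = PARENT[item]
    return item


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


MIN_SUPPORT = 0.5  # assumed threshold

# At the most specific level the pair is too rare to be "frequent".
specific_support = support({"DVD", "wheat bread"}, transactions)

# After rolling every item up one level, the aggregated pair is frequent.
rolled_up = [{generalize(i, 1) for i in t} for t in transactions]
general_support = support({"home electronics", "food"}, rolled_up)

print(specific_support >= MIN_SUPPORT, general_support >= MIN_SUPPORT)
```

The `levels` argument is where the "how low and how high" question shows up: every extra level rewrites each transaction again, which adds computation and can make the specific rules redundant.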
Hybrid Taxonomic Development
• Understand your auto-classification model
• Work with domain experts to create basic taxonomy
• Test Taxonomy in the Model
• Rinse, repeat
Wendy Pohs, ASIS&T Bulletin, 12/1/13
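The test-and-repeat loop above might look something like the following sketch. A plain keyword matcher stands in for the auto-classification model, and the taxonomy, documents, and function names are hypothetical; a real model would be more sophisticated, but the review cycle is the same.

```python
import re


def classify(text, taxonomy):
    """Return the sorted categories whose terms appear in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(cat for cat, terms in taxonomy.items() if words & terms)


def review_round(docs, taxonomy):
    """Classify every document and collect the ones no category caught."""
    assignments = {doc: classify(doc, taxonomy) for doc in docs}
    unclassified = [doc for doc, cats in assignments.items() if not cats]
    return assignments, unclassified


# Basic taxonomy drafted with domain experts (hypothetical).
taxonomy = {
    "pricing": {"price", "discount"},
    "products": {"launch", "product"},
}

# Hypothetical sample documents.
docs = [
    "Competitor announces product launch",
    "Rival cuts price by ten percent",
    "Competitor files new patent",
]

# Round 1: test the taxonomy in the model and look at what fell through.
_, gaps_before = review_round(docs, taxonomy)

# Domain experts review the gaps and extend the taxonomy.
taxonomy["intellectual property"] = {"patent", "trademark"}

# Round 2: rinse, repeat.
_, gaps_after = review_round(docs, taxonomy)
print(len(gaps_before), len(gaps_after))
```

The unclassified list is the hand-off point between machine and human: the model surfaces what the taxonomy misses, and the experts decide how to fix it.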
Domain Knowledge and Thick Data
• Thick Data analysis primarily relies on human brain power to process a small “N” while big data analysis requires computational power (of course with humans writing the algorithms) to process a large “N”.
• Big Data reveals insights with a particular range of data points, while Thick Data reveals the social context of and connections between data points. Big Data delivers numbers; thick data delivers stories. Big data relies on machine learning; thick data relies on human learning.
http://ethnographymatters.net/blog/2013/05/13/big-data-needs-thick-data/ (Tricia Wang)
Data Driven CI is Meaningless Without
Human/Domain Knowledge
http://www.wired.com/2014/04/your-big-data-is-worthless-if-you-dont-bring-it-into-the-real-world/
Recap
• Data Mining for CI is not new
• Refinement and Improvement
• Bigger, Weirder Data
Recap
• Where it’s at: Hybrid Schemas
• Thick Data, not just Big Data
• HUMAN ELEMENT IS ESSENTIAL
Questions?
Elaine Lasda Bergman
University at Albany
http://www.slideshare.net/librarian68
@ElaineLibrarian