On the Quest for Changing Knowledge. Capturing Emerging Entities from Social Media. WebScience 2016...
On the Quest for Changing Knowledge
Marco Brambilla, Stefano Ceri, Florian Daniel, Emanuele Della Valle
@marcobrambi
Data-driven innovation
and
Innovation-driven data
Innovation requires
Precise
To the point
Up-to-date
Domain-specific
information
There are more things In heaven and earth, Horatio, Than are dreamt of in your philosophy.
Shakespeare (Hamlet Act 1, scene 5)
From Data to Wisdom
Formalizing new knowledge is hard
Only high frequency emerges
The long tail challenge
Knowledge Extraction
Text mining
Semantic Web
Search and recommendation systems
No specific care for emerging knowledge
Heaven and Earth
How to peer through an effective window on the real world?
Social media, our blessing and curse
Domain experts matter
Can we use social networks to discover emerging knowledge?
Beware the streetlamp effect
The bias of the source
The bias of the observer
Famous Emerging
Evolving Knowledge
Overview
Knowledge Enrichment Setting
Emerging Knowledge Harvesting
Domain Types
Types selected by the experts
Relevant for the domain
Seed characterization
Selected by the expert
Belonging to an expert type
Thoroughly described
# @ a w
Social Media Sourcing
Content coming from the seeds’ accounts
Candidate Selection
Potentially any entity extracted from the social streams
Resulting in huge sets of candidates
# @ a w ♥
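A minimal sketch of candidate selection, assuming that the "#" and "@" symbols above stand for hashtags and user mentions found in the seeds' posts. The function name `extract_candidates`, the regular expressions, and the sample posts are hypothetical and only illustrate how quickly the candidate set can grow:

```python
import re
from collections import Counter

# Assumption: candidates are hashtags and user mentions in the post text.
HASHTAG = re.compile(r"#(\w+)")
MENTION = re.compile(r"@(\w+)")

def extract_candidates(posts):
    """Count hashtag and mention occurrences across a list of post texts."""
    counts = Counter()
    for text in posts:
        for tag in HASHTAG.findall(text):
            counts["#" + tag.lower()] += 1
        for user in MENTION.findall(text):
            counts["@" + user.lower()] += 1
    return counts

# Hypothetical posts written by seed accounts.
posts = [
    "Loving the new #Collection by @SomeBrand",
    "Front row at #MFW with @somebrand",
]
candidates = extract_candidates(posts)
```

Every distinct handle or hashtag becomes a candidate, which is why a pruning step is needed afterwards.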
Candidate Typing
Candidate Pruning
Initial pruning of candidates based on
TF-DF := df * tf / (N - df + 1)
(*) variant of TF-IDF that does not discount document frequency because we are actually happy about frequent appearance
(we don’t look for information entropy!)
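The TF-DF score above can be sketched as follows. The slide does not define its symbols, so this sketch assumes that tf is the total number of occurrences of a candidate, df is the number of documents in which it appears, and N is the total number of documents. What counts as a "document" (for example, one seed's stream) is also an assumption, and `tf_df_scores` is a hypothetical name:

```python
from collections import Counter

def tf_df_scores(documents):
    """Score each candidate as df * tf / (N - df + 1).

    documents: list of lists, each holding the candidates seen in one document.
    """
    n_docs = len(documents)
    tf = Counter()  # total occurrences of each candidate
    df = Counter()  # number of documents containing each candidate
    for doc in documents:
        tf.update(doc)
        df.update(set(doc))
    return {c: df[c] * tf[c] / (n_docs - df[c] + 1) for c in tf}

# Hypothetical data: three documents with extracted candidates.
documents = [
    ["@brand_a", "@brand_a", "#style"],
    ["@brand_a", "#style"],
    ["@brand_a", "@brand_b"],
]
scores = tf_df_scores(documents)
```

Unlike TF-IDF, a candidate that appears in many documents gets a higher score, because the denominator shrinks as df grows.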
Candidate Ranking
Candidate Vector Space
Purely syntactic
Semantic:
Based on entity extraction / DBpedia
Based on deep learning on images / ClarifAI
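One way to picture ranking in a candidate vector space is to order candidates by their distance from the seeds. The sketch below assumes cosine distance to the centroid of the seed vectors; the slides mention three distances without naming them, so this choice, the function names, and the toy vectors are all illustrative assumptions rather than the authors' method:

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)

def centroid(vectors):
    """Component-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def rank_candidates(seed_vectors, candidate_vectors):
    """Sort candidate names by increasing distance from the seed centroid."""
    center = centroid(seed_vectors)
    return sorted(
        candidate_vectors,
        key=lambda name: cosine_distance(candidate_vectors[name], center),
    )

# Hypothetical feature vectors (syntactic or semantic features).
seeds = [[1.0, 0.0, 1.0], [1.0, 0.2, 0.8]]
candidates = {
    "cand_x": [0.9, 0.1, 1.0],
    "cand_y": [0.0, 1.0, 0.0],
    "cand_z": [0.5, 0.5, 0.5],
}
ranking = rank_candidates(seeds, candidates)
```

Candidates whose features resemble those of the seeds end up at the top of the list.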
Example Analysis
Experiments
Fashion brands
Writers
Painters
Exhibitions
4,400 strategies evaluated
44 alternative feature vectors (12 basic features and 32 aggregations)
9 different weighting values for aggregations
5 levels of recall for entity extraction
3 different distances
Pruning Phase
From 4,400 down to 10 strategies
Eliminating the less relevant parameters
Italian Fashion Brands
Precision @5 = 0.2
Increasing the number of seeds reduces precision
Australian Writers – 22 seeds
Precision @5 = 0.8
Innovative Painters – 21 seeds
Precision @5 = 0.6
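For reference, Precision @5 is the fraction of the top five ranked candidates that the experts judge relevant. A minimal sketch with hypothetical data (the item names and relevance judgments below are made up, not the experimental results):

```python
def precision_at_k(ranked, relevant, k=5):
    """Fraction of the first k ranked items that are in the relevant set."""
    top_k = ranked[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k

# Hypothetical ranking and expert judgments.
ranked = ["w1", "w2", "w3", "w4", "w5", "w6"]
relevant = {"w1", "w2", "w4", "w5"}
p_at_5 = precision_at_k(ranked, relevant, k=5)
```

Here four of the top five candidates are relevant, so the value is 0.8.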
Twitter vs. Instagram
P@5 = 1.0 vs. P@5 = 0.8
Fashion: Twitter + Instagram
Writers: Twitter + Instagram
Prec. = 1
Conclusion
It’s about time to build innovation based on data
and build knowledge based on innovation
Harvesting can be iterative
On the Quest for Changing Knowledge
Contact us
Marco Brambilla, @marcobrambi
http://datascience.deib.polimi.it