analysis and knowledge extraction of user behaviour and social media content for art culture events

Post on 21-Jan-2018

Analysis & Knowledge Extraction of Online User Behaviour and Visual Content

for Art and Culture Events

Marco Brambilla Tahereh Arabghalizi Behnam Rahdari

Contacts: @marcobrambi, marco.brambilla@polimi.it, http://datascience.deib.polimi.it

UNIVERSITY OF PITTSBURGH

Agenda

Context

Method

• Pre-processing

• Topic analysis

• User clustering

• Multimedia: Images

• concepts vs. text extraction

• color schema and the main color pattern(s)

• Prediction of interests

Challenges & Conclusions

Context

• Role of social media in our life

• Social media for cultural and artistic events

• Behaviour and content

• Multi-disciplinary collaboration on social media analysis and cultural heritage

• Collaboration: Politecnico di Milano, Musei di Brescia, University of Pittsburgh

Research Questions

Topics of interest of visitors?

Categorization of users?

Demographics of visitors?

Engagement and online participation?

Relation between photos, time, location, text and the event?

Approach

Domain-specific pipeline to profile social media users and content in cultural or art events

Case Study

The Floating Piers by Christo and Jeanne Claude

Iseo Lake, Italy

June 2016

Case Study

• Cost: $17 million

• 220,000 floating blocks

• 1.5 million visitors in 16 days

Pre-processing

Data Extraction

• Using Instagram and Twitter APIs

• Extract relevant tweets/posts during the event

• Extract all relevant users

o those who tweet/post directly

o those who like, comment, retweet, etc.

• Extract all properties

o Textual: bio, tweet/post text, hashtag, etc.

o Quantitative: #followers, #followings, etc.

o Media: photos, metadata (geotag, …)

Collected Data (from June 10th to July 30th)

                 Tweets (Twitter)    Posts (Instagram)
Tweets/Posts          14,062              30,256
Users                 23,916              94,666
  Authors              7,724              16,681
  Reacting            16,197              77,985

• Text normalization (NLP)

• Language identification and translation

• Gender detection

• Data cleansing

• Store clean and transformed data

Preprocessing
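
The text-normalization step listed above could be sketched roughly as follows. This is a minimal illustration, not the pipeline the authors used: the function name `normalize` and the specific cleaning rules (dropping URLs and mentions, keeping hashtag words) are assumptions made for the example.

```python
import re
import unicodedata

# Patterns are illustrative assumptions, not taken from the slides.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
NON_WORD_RE = re.compile(r"[^\w\s]")


def normalize(text):
    """Lower-case a tweet/post, strip URLs, mentions and punctuation, and tokenize."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = URL_RE.sub(" ", text)       # remove links
    text = MENTION_RE.sub(" ", text)   # remove @mentions
    text = text.replace("#", " ")      # keep the hashtag word, drop the symbol
    text = NON_WORD_RE.sub(" ", text)  # drop remaining punctuation
    return text.split()


tokens = normalize("Walking on #FloatingPiers with @friend! https://example.com/x")
```

Language identification, translation and gender detection would need external resources and are left out of this sketch.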

Time Distribution (Twitter)

Time series – Instagram vs. Twitter

Instagram Likes and Comments

Italy Lombardy Region Iseo Lake

Geographical Distribution (Instagram)

Data Analysis Process

1. Document Term Matrix (DTM)

2. Topic Extraction

3. Dimension Reduction

4. Cluster Analysis and Validation

5. Prediction

6. Media Analysis

7. Content Network Analysis

Topics

Document-term Matrix

A matrix that describes the frequency of terms that occur in a collection of documents

Rows: documents; columns: terms

          Art   Travel   Italy   Design   …
Post 1     0      1        1       0
Post 2     1      2        0       1
Post 3     0      0        1       0
Post 4     1      1        3       1
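
A document-term matrix like the one above can be built with a few lines of standard-library Python. The sketch below is illustrative only: the helper name `document_term_matrix` and the tokenized example posts are assumptions, chosen so that the counts match the example table (with columns in alphabetical order).

```python
from collections import Counter


def document_term_matrix(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in document i."""
    vocab = sorted({term for doc in docs for term in doc})
    counts = [Counter(doc) for doc in docs]
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix


# Hypothetical tokenized posts reproducing the example table.
posts = [
    ["travel", "italy"],
    ["art", "travel", "travel", "design"],
    ["italy"],
    ["art", "travel", "italy", "italy", "italy", "design"],
]
vocab, dtm = document_term_matrix(posts)
# vocab is ["art", "design", "italy", "travel"]; each row of dtm is one post.
```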

Topic Extraction

Latent Dirichlet Allocation (LDA):

documents as mixtures of topics (with probability)

Input: Document Term Matrix

Outputs: Topics, Topic Probabilities Matrix

          Topic 1   Topic 2   Topic 3   Topic 4   Topic 5   Topic 6   …
Post 1     0.19      0.16      0.27      0.14      0.11      0.13
Post 2     0.31      0.18      0.21      0.08      0.10      0.12
Post 3     0.25      0.24      0.20      0.17      0.09      0.05
Post 4     0.19      0.32      0.22      0.10      0.07      0.10
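
To make the input/output relationship concrete, here is a small collapsed Gibbs sampler for LDA in plain Python. It is a teaching sketch under stated assumptions, not the implementation used in the study: the function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, the iteration count and the toy posts are all illustrative, and a real analysis would normally rely on an established library.

```python
import random


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Return the document-topic probability matrix estimated by collapsed Gibbs sampling."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    n_vocab = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]                 # topic counts per document
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assign = []
    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)
    # Resample each token's topic conditioned on all other assignments.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta) / (topic_total[k] + n_vocab * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    # Smoothed topic proportions; each row sums to 1.
    return [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in row]
        for row, doc in zip(doc_topic, docs)
    ]


# Hypothetical tokenized posts.
posts = [
    ["travel", "italy", "lake"],
    ["art", "design", "christo", "art"],
    ["italy", "lake", "travel", "travel"],
    ["art", "christo", "design"],
]
theta = lda_gibbs(posts, n_topics=2)
```

Each row of `theta` plays the role of one row of the Topic Probabilities Matrix shown above.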

Dimensionality Reduction

• Hundreds of topics extracted with LDA

• Using Principal Component Analysis (PCA) to extract a smaller set of linearly uncorrelated topics

[Plot: variance share and cumulative variance share per component, with the > 0.95 cumulative threshold marked]
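
The 0.95 cumulative-variance cutoff can be illustrated with a short helper. The sketch assumes the per-component variances have already been obtained from some PCA routine; the function name `components_for_variance` and the example variances are hypothetical.

```python
from itertools import accumulate


def components_for_variance(variances, threshold=0.95):
    """Return (k, shares, cumulative): the smallest k whose cumulative share exceeds threshold."""
    ordered = sorted(variances, reverse=True)
    total = sum(ordered)
    shares = [v / total for v in ordered]
    cumulative = list(accumulate(shares))
    for k, value in enumerate(cumulative, start=1):
        if value > threshold:
            return k, shares, cumulative
    return len(ordered), shares, cumulative


# Hypothetical component variances (not values from the study).
k, shares, cumulative = components_for_variance([6.0, 2.0, 1.0, 0.6, 0.4])
```

With these made-up variances the first four components are kept, since three components cover only about 90% of the variance.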

User Clustering

Cluster Analysis

• Apply clustering algorithms over Topic Probabilities Matrix to cluster users

• Multiple data slices

• Multiple algorithms

o K-means

o Hierarchical

o DBSCAN
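
As an illustration of the first algorithm in the list, the following is a compact K-means sketch over rows of a topic-probability matrix. It is not the authors' code: the helper names, the deterministic farthest-first initialization (used here instead of random starts so the example is reproducible) and the six toy user vectors are assumptions.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first(points, k):
    """Deterministic initialization: start at points[0], then add the farthest point each time."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, n_iter=50):
    """Plain Lloyd iterations; returns (labels, centroids)."""
    centroids = farthest_first(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest centroid for every point.
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        # Update step: mean of each cluster (keep the old centroid if a cluster is empty).
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append([sum(col) / len(members) for col in zip(*members)])
            else:
                updated.append(centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids


# Hypothetical users described by three topic probabilities each.
users = [
    [0.8, 0.1, 0.1], [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1], [0.2, 0.7, 0.1],
    [0.1, 0.1, 0.8], [0.1, 0.2, 0.7],
]
labels, centroids = kmeans(users, k=3)
```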

[Figure: users plotted along Topic 1, Topic 2 and Topic 3 axes]

Cluster Validity

• How to evaluate the “goodness” of the resulting clusters?

• Validation Measures

– Internal: e.g. Silhouette Coefficient, Dunn’s Index, Calinski-Harabasz index, etc.

– External: e.g. Entropy, Purity, Rand index, etc.
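
Among the internal measures, the Silhouette Coefficient is easy to compute directly. The sketch below follows the standard definition (mean intra-cluster distance `a` versus mean distance to the nearest other cluster `b`); the function name, the convention of scoring singleton clusters as 0, and the toy points are assumptions for illustration. It requires at least two clusters.

```python
import math


def silhouette(points, labels):
    """Mean silhouette coefficient over all points (needs at least two clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to any other cluster
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical 2-D points forming two tight groups.
pts = [[0, 0], [0, 1], [10, 0], [10, 1]]
good = silhouette(pts, [0, 0, 1, 1])  # labels that match the groups
bad = silhouette(pts, [0, 1, 0, 1])   # labels that cut across the groups
```

Values close to 1 indicate compact, well-separated clusters; values below 0 suggest points assigned to the wrong cluster.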

User Clustering

Travel Lovers

Art Lovers

Internet & Tech Lovers

Users’ Biography Word Clouds

Cluster Labeling

Word Network for Clusters

Travel Lovers

Art Lovers

Tech Lovers

Hierarchical Clustering

Language

Gender

Impact of Demographics

Prediction

Predict the category or the interest area of potential new users for similar cultural or art events in the future

Decision Trees

o Prepare Required Data

o Grow Decision Tree

o Extract rules from the tree

o Predict using test data

o Evaluate

Extracted Rules

Rule 1: if (0.36 < Bio_score < 0.37 OR Bio_score < 0.35) then Travel Lover

Rule 2: if (0.35 < Bio_score < 0.36 AND Status_count > 14.5) OR (Bio_score > 0.37 AND Language != Italian) then Art Lover

Rule 3: if (Bio_score > 0.37 AND Language = Italian) then Tech Lover

Otherwise: Not Interested

Accuracy = 62%
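
The extracted rules translate directly into a small classifier. The function below is a literal transcription of the rules as printed; the function and parameter names are illustrative, and how `Bio_score` is computed is not specified in the slides.

```python
def predict_category(bio_score, status_count, language):
    """Apply the extracted decision-tree rules to one user."""
    # Rule 1
    if 0.36 < bio_score < 0.37 or bio_score < 0.35:
        return "Travel Lover"
    # Rule 2
    if (0.35 < bio_score < 0.36 and status_count > 14.5) or (
        bio_score > 0.37 and language != "Italian"
    ):
        return "Art Lover"
    # Rule 3
    if bio_score > 0.37 and language == "Italian":
        return "Tech Lover"
    # Everything else, including the boundary values 0.35, 0.36 and 0.37
    return "Not Interested"


example = predict_category(bio_score=0.40, status_count=5, language="Italian")
```

Because the rules use strict inequalities, users whose score falls exactly on a threshold are classified as "Not Interested" in this transcription.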

Prediction rules

Decision Tree

Image Analysis

Only Instagram

Used Instagram Filters

People in Pictures

Age

Sex: 50.4% female, 49.6% male

Visitor Analytics

Race

Bias of the medium?

Image content analysis

Concept extraction (DNN-based third-party service)

Comparison with hashtags / text

Image low-level feature analysis

Concepts in Pictures Hashtags

Users tend not to report the actual content of the photos in their textual descriptions/hashtags
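
One simple way to quantify this gap is the Jaccard overlap between the concepts detected in a photo and the hashtags attached to it. The sketch below is an assumption about how such a comparison could be done, not the measure used in the study; the function name and the example tags are hypothetical.

```python
def tag_concept_overlap(concepts, hashtags):
    """Jaccard similarity between detected image concepts and the post's hashtags."""
    concept_set = {c.lower() for c in concepts}
    hashtag_set = {h.lower().lstrip("#") for h in hashtags}
    union = concept_set | hashtag_set
    return len(concept_set & hashtag_set) / len(union) if union else 0.0


# Hypothetical post: concepts from an image tagger vs. user-written hashtags.
overlap = tag_concept_overlap(
    ["water", "people", "bridge"],
    ["#floatingpiers", "#christo", "#water"],
)
```

A low overlap across many posts would be consistent with the observation above.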

Object Extraction from Pictures

Main color shades among all photos
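
A basic way to obtain main color shades is to quantize every pixel into coarse RGB bins and count the most frequent bins. This is a sketch of one possible low-level technique, not necessarily the one behind the slide; the function name, the number of levels and the pixel values are assumptions, and loading pixels from image files is left out.

```python
from collections import Counter


def dominant_colors(pixels, levels=4, top=3):
    """Quantize RGB pixels into levels**3 bins and return the most common bin centers."""
    step = 256 // levels

    def quantize(rgb):
        # Map each channel to the center of its bin.
        return tuple((channel // step) * step + step // 2 for channel in rgb)

    counts = Counter(quantize(p) for p in pixels)
    return counts.most_common(top)


# Hypothetical pixels: mostly orange tones plus one blue pixel.
pixels = [(250, 140, 20)] * 3 + [(245, 150, 10)] * 2 + [(10, 20, 200)]
main = dominant_colors(pixels, top=1)
```

Here both orange variants fall into the same bin, so it is returned as the single dominant shade with a count of 5.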

Color Detection for Subject Identification

Confusion Matrix

Simple techniques “good enough”?

Objects or Colors?

Ongoing Challenges

Future Challenges of KE

Determining exact positioning based on perspective

Future Challenges of KE

Network structures and their temporal evolution

Max graph perturbation

Daily graph variations

Future Challenges

Real cross-disciplinarity (cultural heritage, humanities, social science)

No visitors for the cultural part of the event! (exhibition at the museum)

Conclusions

• (Sometimes) Simple methods work just fine

• Interesting profiling and behaviour detection

• Still far from cross-disciplinary approaches

Contacts: Marco Brambilla, @marcobrambi, marco.brambilla@polimi.it

http://datascience.deib.polimi.it

http://www.marco-brambilla.com
