Automating Assessment of Web Site Design
Melody Y. Ivory and Marti A. Hearst
UC Berkeley
1999 - 2002
2
Usability affects the bottom line
IBM case study [1999]
– Spent $millions to redesign site
– 84% decrease in help usage
– 400% increase in sales
– Attributed to improvements in information architecture
Creative Good Study [1999]
– Studied 10 e-commerce sites
– 59% of attempts failed
– If 25% of these had succeeded -> estimated additional $3.9B in sales
3
Problem Statement
Non-professionals need help designing high-quality Web sites
– Design guidelines conflict; are not empirically validated; and ignore context
One solution
– Empirically validated, automated analysis of Web sites
4
The WebTango Approach
[Figure: Web Site Design, Quality Designs, Profiles, and Quality Checker. Outputs listed: Predictions, Similarities, Differences, Suggestions, Design Modification]
5
Developing Statistical Profiles: The WebTango Approach
1. Create a large set of measures to assess various design attributes (benchmark)
2. Obtain a large set of evaluated sites
3. Create models of good vs. avg. vs. poor sites (guidelines)
• Take into account the context and type of site
4. Use models to evaluate other sites (guideline review)
5. Validate models
Idea: Reverse engineer design patterns from high-quality sites and use to check the quality of other sites
[Figure: Measures, Data, Models, Evaluate, Validate]
6
WebTango Architecture
7
Step 1: Measuring Web Design Aspects
Identified key aspects from the literature
– Extensive survey of Web design literature: texts from recognized experts; user studies
• the amount of text on a page, text alignment, fonts, colors, consistency of page layout in the site, use of frames, …
– Example guidelines
• Use 2–4 words in text links [Nielsen00].
• Use links with 7–12 useful words [Sawyer & Schroeder00].
• Consistent layout of graphical interfaces results in a 10–25% speedup in performance [Mahajan & Shneiderman96].
• Use several layouts (e.g., one for each page style) for variation within the site [Sano96].
• Adhere to accessibility principles in order to create sites that serve a broad user community [Cooper99; Nielsen00].
• Avoid using ‘Click Here’ for link text [Nielsen00].
• Use left-justified, ragged-right margins for text [Schriver97].
– No theories about what to measure
8
157 Web Design Measures (Metrics Computation Tool)
Text Elements (31) – # words, type of words
Link Elements (6) – # graphic links, type of links
Graphic Elements (6) – # images, type of images
Text Formatting (24) – # font styles, colors, alignment, clustering
Link Formatting (3) – # colors used for links, standard colors
Graphics Formatting (7) – max width of images, page area
Page Formatting (27) – quality of color combos, scrolling
Page Performance (37) – download time, accessibility, scent quality
Site Architecture (16) – consistency, breadth, depth
[Figure labels: information, navigation, & graphic design; experience design]
9
Page-Level Measures
10
Word Count: 157
11
Good Word Count: 81
12
Body Word Count: 94
13
Link Count: 34
14
Page Title Hits: 3
15
Visible Link Text Hits: 25
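The page-level measures on the preceding slides are counts taken from a page's HTML. The sketch below is not the Metrics Computation Tool; it is a minimal illustration, using only Python's standard `html.parser`, of how a word count, a link count, and an average number of words per link could be computed for static HTML. The class and function names (`PageMeasures`, `measure_page`) and the decision to ignore `<title>`, `<script>`, and `<style>` text are assumptions made for this example.

```python
from html.parser import HTMLParser


class PageMeasures(HTMLParser):
    """Collects a few simple page-level counts from static HTML (illustrative only)."""

    SKIPPED = ("script", "style", "title")  # assumption: not counted as body text

    def __init__(self):
        super().__init__()
        self.word_count = 0       # whitespace-separated words in visible text
        self.link_count = 0       # <a> elements that carry an href
        self.link_word_count = 0  # words that appear inside link text
        self._in_link = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "a" and any(name == "href" for name, _ in attrs):
            self.link_count += 1
            self._in_link = True

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "a":
            self._in_link = False

    def handle_data(self, data):
        if self._skip_depth:
            return
        words = data.split()
        self.word_count += len(words)
        if self._in_link:
            self.link_word_count += len(words)


def measure_page(html):
    """Return a dict of hypothetical measure names to values for one page."""
    parser = PageMeasures()
    parser.feed(html)
    parser.close()
    average = parser.link_word_count / parser.link_count if parser.link_count else 0.0
    return {
        "word_count": parser.word_count,
        "link_count": parser.link_count,
        "average_link_words": average,
    }


sample = (
    "<html><head><title>Home page</title><style>p{}</style></head>"
    "<body><p>Welcome to our site</p>"
    "<a href='/a'>Course list</a> <a href='/b'>Contact us today</a></body></html>"
)
page = measure_page(sample)
```

Measures such as Good Word Count or Page Title Hits depend on word lists and scoring rules that the slides do not define, so they are left out here.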
16
Site-Level Measures
17
Text Element Variation: 119%
Good Word Count = 81; Average Link Words = 3 …
Good Word Count = 733; Average Link Words = 2 …
Good Word Count = 240; Average Link Words = 2 …
Good Word Count = 292; Average Link Words = 2 …
Good Word Count = 236; Average Link Words = 2 …
Good Word Count = 142; Average Link Words = 2 …
Good Word Count = 72; Average Link Words = 2 …
Good Word Count = 29; Average Link Words = 2 …
Good Word Count = 785; Average Link Words = 2 …
Good Word Count = 294; Average Link Words = 2 …
Good Word Count = 363; Average Link Words = 2 …
Good Word Count = 1350; Average Link Words = 2 …
18
Page Title Variation: 185%
Page Title Hits = 3; Page Title Score = 3
Page Title Hits = 3; Page Title Score = 3
Page Title Hits = 0; Page Title Score = 0
Page Title Hits = 2; Page Title Score = 2
19
Webby Awards Data
20
Step 2: Obtaining a Sample of Evaluated Sites
Webby Awards 2000
– Only large corpus of rated Web sites
3000 sites initially
– 27 topical categories
• Studied sites from informational categories
– Finance, education, community, living, health, services
100 judges
– International Academy of Digital Arts & Sciences
• Internet professionals, familiarity with a category
– 3 rounds of judging (only first round used)
• Scores are averaged from 3 or more judges
• Converted scores into good (top 33%), average (middle 34%), and poor (bottom 33%)
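The conversion from averaged judge scores to quality classes can be sketched as a rank-based split. The example below is only an illustration of the 33% / 34% / 33% cut described above: the function name `label_sites`, the rounding of class sizes, and the handling of ties (input order) are assumptions, and the scores are made-up toy values.

```python
from statistics import mean


def label_sites(judge_scores):
    """Map {site: [judge scores]} to {site: 'good' | 'average' | 'poor'}.

    Sites with fewer than 3 judge scores are skipped, mirroring the
    "3 or more judges" condition on the slide.
    """
    means = {
        site: mean(scores)
        for site, scores in judge_scores.items()
        if len(scores) >= 3
    }
    ranked = sorted(means, key=means.get, reverse=True)  # best first
    n = len(ranked)
    n_good = round(n * 0.33)  # assumption: class sizes are rounded
    n_poor = round(n * 0.33)
    labels = {}
    for rank, site in enumerate(ranked):
        if rank < n_good:
            labels[site] = "good"
        elif rank >= n - n_poor:
            labels[site] = "poor"
        else:
            labels[site] = "average"
    return labels


# Toy data: nine hypothetical sites whose three judges all gave the same score.
toy_scores = {f"site{i}": [i, i, i] for i in range(1, 10)}
toy_labels = label_sites(toy_scores)
```

With nine sites, the three highest-scoring sites are labelled good, the three lowest poor, and the remaining three average.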
21
Example Page from Good Site
22
Example Page from Avg. Site
23
Example Page from Poor Site
24
Webby Awards 2000
6 criteria
– Content
– Structure & navigation
– Visual design
– Functionality
– Interactivity
– Overall experience
Scale: 1–10 (highest)
Nearly normally distributed
25
Which criteria contribute most to overall rating?
Figure 2a. Review Stage: Contribution of Specific Criteria to Overall Site Rating
[Chart: x-axis Content, Navigation, Visual Design, Interactivity, Functionality; y-axis from 0.3 to 1]
26
Summary of Analysis of Webby Awards Data
The specific ratings do explain overall experience.
The best predictor of overall score is content.
The second best predictor is interactivity.
The worst predictor is visual design.
These results varied by type of site
– Art vs health, for example.
27
Do Webby Ratings Reflect Usability?
Do the profiles assess usability or something else?
User study (30 participants)
– Usability ratings (WAMMI scale) for 57 sites
• Two conditions – actual and perceived usability
– Contrast to judges’ ratings
Results
– Some correlation between users’ and judges’ ratings
– Not a strong finding
– Virtually no difference between actual and perceived usability ratings
• Participants thought it would be easier to find info in the perceived usability condition
28
Building the Data Set
Downloaded pages from sites using a Site Crawler Tool
– Downloads informational pages at multiple levels of the site
Used a Metrics Computation Tool to compute measures for the sample
– Processes static HTML, English pages
• Measures for 5346 pages
• Measures for 333 sites
– No discussion of site-level models
29
Step 3: Creating Prediction Models
Statistical analysis of quantitative measures
– Methods
• Classification & regression tree, linear discriminant classification, & K-means clustering analysis
– Context sensitive models
• Content category, page style, etc.
– Models identify a subset of measures relevant for each prediction
[Figure: unknown page (??) classified as Good, Average, or Poor]
Page-Level Models (5346 Pages)
Model (sample size)                               Method   Accuracy: Good   Avg.   Poor
Overall page quality (~1782 pgs/class)            C&RT     96%              94%    93%
Content category quality (~297 pgs/class & cat)   LDC      92%              91%    94%
ANOVAs showed that all differences in measures were significant (good vs. avg, good vs. poor, etc.)
31
Page-Level Models (5346 Pages)
Model (sample size)                               Method   Accuracy: Good   Avg.   Poor
Page type quality (~356 pgs/class & type)         LDC      84%              78%    84%
Overall page quality                              C&RT     96%              94%    93%
Content category quality                          LDC      92%              91%    94%
ANOVAs showed that all differences in measures were significant (good vs. avg, good vs. poor, etc.)
Page Type Classifier (decision tree)
– Types: home page, content, form, link, other
– 1770 manually-classified pages, 84% accurate
32
Characteristics of Good Pages
K-means clustering to identify 3 subgroups
ANOVAs revealed key differences
– # words on page, HTML bytes, table count
Characterize clusters as:
– Small-page cluster (1008 pages)
– Large-page cluster (364 pages)
– Formatted-page cluster (450 pages)
Use for detailed analysis of pages
[Figure: example Small page, Large page, and Formatted page]
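To make the clustering step concrete, here is a minimal pure-Python sketch of K-means (Lloyd's algorithm) over three features named after the key differences above: words on the page, HTML size, and table count. The data points, the use of kilobytes rather than bytes, the fixed starting centroids, and the function name `kmeans` are all assumptions for illustration; the slides do not describe the actual feature set, scaling, or initialization.

```python
import math


def kmeans(points, centroids, iterations=20):
    """Plain Lloyd's algorithm; `centroids` are the starting cluster centers."""
    centroids = [list(c) for c in centroids]
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignment = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Toy pages as (word count, HTML kilobytes, table count); values are invented.
pages = [
    (100, 5, 1), (150, 6, 2), (120, 4, 1),  # short, light pages
    (1200, 40, 2), (1400, 45, 3),           # long pages
    (400, 30, 12), (450, 35, 15),           # table-heavy pages
]
start = [pages[0], pages[3], pages[5]]       # assumption: hand-picked seeds
clusters, centers = kmeans(pages, start)
```

In this toy run the three groups separate cleanly. With real measures on very different scales, standardizing each feature before clustering would usually be needed so that one measure does not dominate the distance.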
33
The Models in More Detail
34
Step 4: Evaluate Other Sites
Embed prediction profiles into an Analysis Tool
– For each model
• Prediction: good, average, poor, mapped cluster
• Rationale: decision tree rule, deviant measures, etc.
– Example page-level feedback
• Overall page quality model
– Predicted quality: poor
– Rationale: if (Italicized Body Word Count is not missing AND (Italicized Body Word Count > 2.5))
• Good page cluster model
– Mapped cluster: small-page, Cluster distance: 22.74
– Similar measures: Word Count; Good Word Count …
– Deviant measures: Link Count [12.0] out of range (12.40--41.24); Text Link Count [2.0] out of range (4.97--27.98) …
– Limitation: no suggestions for improvement or examples
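The example feedback above combines two checks: a decision-tree rule and a comparison of each measure against the range seen in the mapped cluster. The sketch below reproduces just those two pieces using the rule and ranges quoted on the slide. The function names, the dictionary layout, and the Italicized Body Word Count value of 4 are hypothetical; the real Analysis Tool applies full models that are not shown here.

```python
def overall_quality_rule(measures):
    """Apply the single decision-tree rule quoted on the slide.

    Returns "poor" when the rule fires, otherwise None (the rest of the
    tree is not given in the slides).
    """
    value = measures.get("Italicized Body Word Count")
    if value is not None and value > 2.5:
        return "poor"
    return None


def deviant_measures(measures, cluster_ranges):
    """List measures that fall outside the mapped cluster's range."""
    report = []
    for name, (low, high) in cluster_ranges.items():
        value = measures.get(name)
        if value is not None and not (low <= value <= high):
            report.append(f"{name} [{value}] out of range ({low:.2f}--{high:.2f})")
    return report


# Ranges for the small-page cluster as quoted on the slide.
small_page_ranges = {
    "Link Count": (12.40, 41.24),
    "Text Link Count": (4.97, 27.98),
}
# Link counts are from the slide; the italicized word count is a made-up value.
example_page = {
    "Link Count": 12.0,
    "Text Link Count": 2.0,
    "Italicized Body Word Count": 4,
}
prediction = overall_quality_rule(example_page)
deviations = deviant_measures(example_page, small_page_ranges)
```

Reporting the violated range alongside the observed value is what lets a designer see how far a page is from the good-page cluster, even though the tool itself stops short of suggesting a fix.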
35
Assessment of GVU Home Page
Predicted page style: link (average)
Overall Quality: Average
Rationale: min graphic width > 8.5
Cluster: Small page
Differences: word counts
Education Quality: Average
36
Assessment of the School Home Page
Take away: example of when the system fails due to extensive use of scripts
Predicted page style: home
Home Page Quality: poor
Rationale: too few redundant links, interactive objects; too many scripts, italicized body text
Overall Quality: poor
Rationale: use of italicized body text
Cluster: Formatted page
Education Quality: poor
37
Example Assessment
Demonstrate use of profiles to assess site quality and identify areas for improvement
Site drawn from Yahoo Education/Health
– Discusses training programs on numerous health issues
– Not in original study
– Chose one that looked good at first glance, but on further inspection seemed to have problems.
– Only 9 pages were available, at level 0 and 1
38
Sample Page (Before)
39
Page-Level Assessment
Decision tree predicts: all 9 pages consistent with poor pages
– Content page does not have accent color; has colored, bolded body text words
• Avoid mixing text attributes (e.g., color, bolding, and size) [Flanders & Willis98]
• Avoid italicizing and underlining text [Schriver97]
40
Page-Level Assessment
Cluster mapping
– All pages mapped into the small-page cluster
– Deviated on key measures, including
• text link, link cluster, interactive object, content link word, ad
• Most deviations can be attributed to using graphic links without corresponding text links
– Use corresponding text links [Flanders & Willis98, Sano96]
[Figure: Top deviant measures for content page: Link Count, Text Link Count, Good Link Word Count, Font Count, Sans Serif Word Count, Display Word Count]
41
Page-Level Assessment
Compared to models for health and education categories
– All pages found to be poor for both models
Compared to models for the 5 page styles
– All 9 pages were considered poor pages by page style (after correcting predicted types)
42
Improving the Site
Eventually want to automate the translation from differences to recommendations
Revised the pages by hand as follows:
– To improve color count and link count:
• Added a link text cluster that mirrors the content of the graphic links
– To improve text element and text formatting variation
• Added headings to break up paragraphs
• Added font variations for body text and headings and made the copyright text smaller
– Several other changes based on small-page cluster characteristics
43
Sample Page (After)
Added linked menu that mirrors image menu.
Removed colored and italicized body words.
Added an accent color.
44
After the Changes
All pages now classified correctly by style
All pages rated good overall
All pages rated good health pages
Most pages rated as average education pages
Most pages rated as average by style
45
Before & After Pages
Participants improved pages based on overall page quality measures and closest good-page cluster models.
46
Step 5: Validating the Prediction Models
Small study
– Hypothesis: pages and sites modified based on the profiles are preferred over original versions
– 5 sites modified based on profiles (including the example site)
• Modifications by 2 undergraduate (Deep Debroy & Toni Wadjiji) and 1 graduate student (Wai-ling Ho-Ching)
– Students had little to no design experience
– Same procedure as in the example assessment
– Minimal changes based on overall page quality and good page cluster models
– 13 participants
• 4 professional, 3 non-professional, and 6 non Web designers
47
Profile Evaluation
– Page-level comparisons (15 page pairs)
• Participants preferred modified pages (57.4% vs. 42.6% of the time, p = .038)
– Site-level ratings (original and modified versions of 2 sites)
• Participants rated modified sites higher than original sites (3.5 vs. 3.0, p = .025)
• Non Web designers had difficulty gauging Web design quality
– Freeform comments
• Subtle changes result in major improvements
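The slide reports the preference split and its p-value but names neither the statistical test nor the number of comparisons. As a rough plausibility check only, the sketch below assumes that all 13 participants judged all 15 page pairs (195 comparisons) and applies a two-sided sign test with a normal approximation. Both the comparison count and the choice of test are assumptions, not details taken from the study.

```python
import math


def two_sided_sign_test(successes, trials):
    """Normal approximation to a two-sided binomial test against p = 0.5.

    Returns (z, p_value). No continuity correction is applied.
    """
    expected = trials * 0.5
    sd = math.sqrt(trials * 0.25)
    z = (successes - expected) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


trials = 13 * 15                   # ASSUMPTION: every participant saw every pair
preferred = round(0.574 * trials)  # modified page preferred 57.4% of the time
z_score, p_value = two_sided_sign_test(preferred, trials)
```

Under these assumptions the p-value lands close to the reported .038. An exact binomial test or a continuity-corrected version would give a somewhat larger value, so this shows the reported figure is plausible rather than reconstructing the original analysis.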
48
Summary of the Approach
Advantages
– Derived from empirical data
– Context-sensitive
– More insight for improving designs
– Evolve over time
– Applicable to other types of UIs
Limitations
– Based on expert ratings
– Correlation, not causality
– Not a substitute for user studies
[Figure: Measures, Data, Models, Evaluate, Validate]
49
Conclusions
Let’s hear from you!