Trove crowdsourcing behaviour
Paul Hagon
@paulhagon
@TroveAustralia
Monday, 4 March 13
Crowdsourcing profile
Look beyond the numbers
2 things I want you to take away today.
2,328,207
www.nla.gov.au
Visits to www.nla.gov.au in 2011-2012. Is this a lot? Is it not very much? You don’t know.
67%
How does your perception of that number change when I say that 67% of our visitors spent less than 30 seconds on our site?
[Figure: click map of www.nla.gov.au, with the share of clicks on each link ranging from 27% down to 0.3%]
Does it change further when we measure where people click? You start to get another idea of how your site is used. Studies like this led to the recent redesign of our website.
Hive from National Archives of Australia - http://transcribe.naa.gov.au
What’s on the menu from the New York Public Library, where users can transcribe menus - http://menus.nypl.org
British Library georeferencing, where users can place maps over Google Earth to give coordinates - http://www.bl.uk/maps/
Flickr Commons where institutions can upload photos for people to add comments & tags. - http://www.flickr.com/commons
And Trove from the National Library of Australia with their newspaper corrections. - http://trove.nla.gov.au/newspaper
Search
A bit of background. When digitised text is simply put online, it’s the equivalent of a browse interface to the physical object. In the case of newspapers you need to know the title, the date, the page etc. Not a good experience. So we need search...
Search
OCR
Search needs text, so in this case we need to apply OCR to the digitised pages. But OCR isn’t perfect, so...
Search
OCR
Correction
We can improve the OCR by adding in human correction - crowdsourcing.
Search
OCR
Correction
In turn, this improves search.
[Chart: OCR corrections and articles digitised, by year from 1803 to 1979; y-axis 0 to 1,600,000]
OCR correction levels. There’s a relatively high OCR correction rate for articles. Human correction is the “icing on the cake”.
316 million resources
60,000+ unique visitors per day
10+% visits from mobile
78 million newspaper lines corrected
1.7 million tags added
75,000 registered users
46,000 comments
29,000 Trove lists
These are a little out of date but they are the sorts of stats that we typically report on for things like the annual report. Fine for overall figures, but not really good at telling us exactly how users are using our resources and how we can improve our services based upon this. I’m really interested in the newspaper corrections.
$12 million
We’ve estimated that if we had to employ staff to do this, it would have cost in the vicinity of $12 million. Massive benefit to the Library & to the community.
[Pie charts: "Work count" and "Usage", each split into Archived websites; Australian newspapers; Books; Diaries, letters, archives; Journal articles; Lists; Maps; Music, sound & video; People & organisations; Pictures & photos. Slices shown for work count: 45%, 24%, 22%, 6%, 2%, 1%. Slices shown for usage: 85%, 8%, 3%, 1%]
This shows what Trove is made up of. Journals, Archived websites & Newspapers are the resources with the most content.
This shows what is being used. 85% of Trove use is from newspapers. It starts to give us an indication of where we can focus time, energy & resources.
I’m really interested in newspapers & the activity surrounding that.
There’s more to Trove than newspaper corrections
One thing to keep in the back of your minds is there’s more to Trove than text corrections. Say after me....
[Pie chart: share of text corrections. Registered users 85%, anonymous users 15%]
One thing to focus on is newspapers & one of the appeals of newspapers is the correction. 85% of text corrections have been made by users that have created an account on Trove and are logged in. This is a commitment & an indication of having a relationship with the Library.
[Chart: number of corrected lines per user, from top of leaderboard to bottom of leaderboard; y-axis 0 to 1,500,000]
We have a leaderboard with 23,000 users that have made corrections. It’s not quite gamification, and other studies have shown that competitiveness isn’t a main motivation for corrections. There’s a very small number of users that have done a lot of corrections & then a super long tail of users that have made very few corrections.
50% users < 100 corrections
75% users < 500 corrections
0.01% users > 1 million corrections
Now less than 100 lines of corrections isn’t a small amount of correction. The big numbers that you most commonly hear are being done by a very small number of users.
[Diagram: Users vs Corrections. 23,000 users; 68,000,000 corrections]
Let’s look at it in a different way. We can’t track the behaviour of non-logged-in users. Approx 23,000 logged-in users have made 68,000,000 corrections.
[Diagram: Users vs Corrections. Top 100 users: 43% of corrections]
100 users have made 43% of corrections
[Diagram: Users vs Corrections. Top 100 users: 43% of corrections; users 101 to 1000: a further 38%]
The top 1000 users have made 81% of all corrections
[Diagram: Users vs Corrections. Top 100 users: 43% of corrections; users 101 to 1000: a further 38%; users 1001 to 5000: a further 15%]
The top 5000 users have made 96% of all corrections
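As a rough sketch of how figures like "the top 100 users have made 43% of corrections" can be worked out from a leaderboard: sort the per-user counts and divide the top slice by the total. The function name `top_share` and the sample counts are made up for illustration; they are not Trove data or Trove code.

```python
def top_share(counts, n):
    """Return the fraction of all corrections made by the n most active users."""
    ranked = sorted(counts, reverse=True)
    total = sum(ranked)
    if total == 0:
        return 0.0
    return sum(ranked[:n]) / total


# Hypothetical per-user correction counts (not Trove data): a few heavy
# contributors followed by a long tail of occasional ones.
sample_counts = [5000, 3000, 1000, 500, 200, 100, 50, 50, 50, 50]

for n in (1, 3, 5):
    print(f"top {n}: {top_share(sample_counts, n):.0%} of corrections")
```

Run against a real leaderboard export, the same function with n = 100, 1000 and 5000 would give the three cumulative shares quoted in these slides.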
Top 100 users extremely important
Top 1000 users very important
So we’re starting to see the top users are extremely important to the corrections program.
How do these patterns compare across other crowdsourcing activities? Hive from National Archives of Australia
[Chart: correction activity at Hive, from top of leaderboard to bottom of leaderboard; y-axis 0 to 1,200,000]
Hive from National Archives. Much smaller numbers at 448 users, but the usage patterns are nearly identical.
Let’s look at the 6 Australian institutions that participate in Flickr Commons, crowdsourcing their photographs using tags & comments.
[Chart: number of tags per user, from top of leaderboard to bottom of leaderboard; y-axis 0 to 6000]
Flickr tags. Takes the same shape.
[Diagram: Users vs Tags. Top 10 users: 62% of tags; users 11 to 100: a further 27%]
Approx 1,005 users have added 31,026 tags
The top 100 users have added 89% of all tags
[Chart: number of comments per user, from top of leaderboard to bottom of leaderboard; y-axis 0 to 1600]
Flickr comments per user. Once again we start to see an identical pattern.
[Diagram: Users vs Comments. Top 20 users: 20% of comments; users 21 to 100: a further 10%; users 101 to 1000: a further 17%]
Approx 12,753 users have added 26,173 comments
The top 1000 users have made 47% of all comments
[Chart: number of corrected lines per user, from top of leaderboard to bottom of leaderboard; y-axis 0 to 1,500,000]
Given the user behaviour, how can we encourage someone from the top 1000 to keep at it to reach the top 100? It’s a massive difference in the number of corrections needed. To get there, you need to give up work and start text correcting full time.
Number of times user edited the same article
Recency
OCR accuracy
Could this be ranked not just by the number of corrections but also by incorporating how often the user returns, how efficient they are at correcting (not returning to the same article) or how “difficult” the article might be (as a measure of the initial OCR accuracy)? Let’s look at a few options.
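One way to picture such a ranking is a single score that starts from the raw line count and nudges it up for recency, efficiency and difficulty. The sketch below is only an illustration: the field names, the multiplicative formula and the equal weighting are all assumptions, not anything Trove does.

```python
from dataclasses import dataclass


@dataclass
class CorrectorStats:
    # All fields are hypothetical; they mirror the three factors on the slide.
    lines_corrected: int
    days_since_last_correction: int
    articles_edited: int       # distinct articles the user has corrected
    edit_sessions: int         # total edits, counting returns to the same article
    mean_ocr_accuracy: float   # 0.0 to 1.0, average initial OCR accuracy


def leaderboard_score(s: CorrectorStats) -> float:
    """Combine volume with recency, efficiency and difficulty (assumed formula)."""
    recency = 1.0 / (1.0 + s.days_since_last_correction)   # 1.0 if active today
    efficiency = s.articles_edited / s.edit_sessions if s.edit_sessions else 0.0
    difficulty = 1.0 - s.mean_ocr_accuracy                 # worse OCR, harder article
    return s.lines_corrected * (1 + recency) * (1 + efficiency) * (1 + difficulty)


# Two made-up users with the same raw line count.
regular = CorrectorStats(1000, 0, 10, 10, 0.60)
lapsed = CorrectorStats(1000, 400, 5, 10, 0.95)

print(round(leaderboard_score(regular)), round(leaderboard_score(lapsed)))
```

With a score like this, two users with identical line counts are separated by how recently they were active and how hard their articles were, which a pure count cannot show.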
http://wraggelabs.com/shed/presentations/nla/pages/years_accuracy.html
Could we rank on accuracy or difficulty of article? We have an approximate OCR accuracy rate and we know the exact amount of characters corrected.
[Chart: days since last correction, top 100 users. Buckets: 0, 1, 2, 3, 4, 5, 6, 7, 8-14, 15-30, 31-60, 61-120, 121-364, 365+; axis 0% to 50% of users]
How often do users make corrections? This uses the same recency buckets as Google Analytics. Over 40% of the top 100 users return on a daily basis.
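A minimal sketch of the bucketing behind these recency charts, assuming we already know the number of days since each user's last correction. The bucket boundaries are the ones on the chart axis; the function names and sample data are hypothetical.

```python
from collections import Counter

# Multi-day ranges taken from the chart axis; 0 to 7 days each get their own bucket.
RANGES = [(8, 14), (15, 30), (31, 60), (61, 120), (121, 364)]


def recency_bucket(days: int) -> str:
    """Map days since a user's last correction to its chart bucket label."""
    if days < 0:
        raise ValueError("days must be non-negative")
    if days <= 7:
        return str(days)
    for low, high in RANGES:
        if low <= days <= high:
            return f"{low}-{high}"
    return "365+"


def recency_distribution(days_list):
    """Return the share of users falling in each bucket."""
    counts = Counter(recency_bucket(d) for d in days_list)
    total = len(days_list)
    return {bucket: n / total for bucket, n in counts.items()}


# Made-up sample: two users active today, one last week, one lapsed.
print(recency_distribution([0, 0, 10, 400]))
```

Running this separately over the top 100, the top 1000 and all registered users would give the three series compared in the following slides.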
[Chart: days since last correction, top 1000 users. Buckets: 0, 1, 2, 3, 4, 5, 6, 7, 8-14, 15-30, 31-60, 61-120, 121-364, 365+; axis 0% to 50% of users]
The top 1000 aren’t quite as dedicated. There’s a decrease in the immediate recency & an increase in the long term return rates.
[Chart: days since last correction, overall. Buckets: 0, 1, 2, 3, 4, 5, 6, 7, 8-14, 15-30, 31-60, 61-120, 121-364, 365+; axis 0% to 50% of users]
Looking at the pattern for the overall registered user base, 7% of users have made corrections within the past week. Nearly 70% of users haven’t made corrections in the previous 6 months & for nearly 45% of users it’s been more than 12 months since they last made a correction.
[Chart: days since last correction, comparing top 100, top 1000 and overall. Buckets: 0, 1, 2, 3, 4, 5, 6, 7, 8-14, 15-30, 31-60, 61-120, 121-364, 365+; axis 0% to 50% of users]
So the behaviour for the top 100 users is the opposite to the general behaviour patterns.
48,822
http://www.flickr.com/photos/denial_land/4183422564/
To give a bit of an idea of the patterns, 48,822 lines were corrected on Christmas Day. Given that an average day sees in the vicinity of 120,000 corrections, so this is about 40% of a normal day, it’s quite amazing.
[Chart: days between first correction & last correction (lifespan), per user; axis 0 to 2,000]
Is there a burnout time, when people have had enough of text correction? This measures the time between the first time a user made a correction & the last time they made one. There isn’t really a burnout.
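The lifespan measure is simple to compute once you have the dates of each user's corrections. This is a small sketch under that assumption; the data layout (a dict of user name to list of dates) and the sample users are invented for illustration.

```python
from datetime import date


def lifespan_days(correction_dates):
    """Days between a user's first and last correction (0 for a one-off user)."""
    return (max(correction_dates) - min(correction_dates)).days


# Hypothetical correction dates per user, not Trove data.
history = {
    "one_off": [date(2010, 3, 1)],
    "returning": [date(2009, 1, 1), date(2009, 1, 11), date(2009, 1, 5)],
}

for user, dates in history.items():
    print(user, lifespan_days(dates))
```

If there were a burnout point, plotting these lifespans would show users clustering around one particular length before dropping away.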
[Chart: articles with corrections, OCR corrections and articles digitised, by year from 1803 to 1979; y-axis 0 to 1,600,000]
There don’t appear to be any specific time periods that people are targeting (e.g. the First World War).
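[Pie charts: "Article types" and "Corrected article types", each split into Advertising; Article; Detailed lists, results, guides; Family Notices; Literature; Other. Slices shown for article types: 75%, 14%, 10%, 1%. Slices shown for corrected article types: 72%, 15%, 10%, 2%]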
Each article is classified according to the type of article it is: an article, advertisement, births, deaths & marriages etc. Trove newspapers are mostly articles. Not many are Family Notices.
Once we look at what type of articles are being corrected, there’s some definite activity around family notices.
[Bar chart: % of article types corrected, for Advertising; Article; Detailed lists, results, guides; Family Notices; Literature; Other; axis 0 to 100]
If we look at it a bit differently, as a percentage of each article type, nearly 64% of the family notices have had some level of correction.
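The difference between the two views is just a change of denominator: share of all corrected articles versus share of each type that has been corrected. A small sketch with made-up counts (chosen only so that Family Notices comes out at 64%; they are not the real Trove figures):

```python
def percent_corrected(totals, corrected):
    """For each article type, the percentage of its articles with any correction."""
    return {
        kind: 100.0 * corrected.get(kind, 0) / count
        for kind, count in totals.items()
        if count > 0
    }


# Hypothetical article counts per type and how many have been corrected.
totals = {"Article": 1000, "Advertising": 200, "Family Notices": 50}
corrected = {"Article": 100, "Advertising": 10, "Family Notices": 32}

for kind, pct in percent_corrected(totals, corrected).items():
    print(f"{kind}: {pct:.0f}% corrected")
```

In this toy data Family Notices are the smallest category by far, yet they have the highest correction rate, which is the effect the bar chart brings out.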
The future?
How can we use this information to dictate the future direction of Trove newspapers?
[Chart: number of articles, 2008-06 to 2012-06; y-axis 0 to 80,000,000]
We keep adding articles to Trove. This isn’t going to stop. It’s increasing in a linear fashion.
[Chart: number of corrections per month, 2008-06 to 2012-06; y-axis 0 to 3,000,000]
The number of corrections that are happening each month isn’t increasing at the same rate.
[Chart: number of users making corrections per month, 2008-06 to 2012-06; y-axis 0 to 4,000]
Likewise the number of users making corrections isn’t increasing in a linear fashion. Are we reaching a plateau in what our existing users are capable of doing?
[Chart: number of corrections (axis 0 to 3,000,000) and number of articles (axis 0 to 80,000,000), 2008-06 to 2012-08, annotated with "?"]
Let’s get back to the situation we faced at the start of the project. What’s going to happen over the next couple of years into the future? If we keep on putting more & more pages up - what happens when our correctors can’t keep up?
Search
OCR
Correction
If articles keep getting added & the corresponding number of users aren’t joining or correcting, will search slowly become less effective?
Search
OCR
Correction
Do we need to improve OCR through automated means? Improvements in OCR technology, general text pattern analysis.
Search
OCR
Correction
Or improve manual corrections through marketing, promotion, incentives? Do we need to change our API to allow write access so machines could programmatically correct text?
http://wraggelabs.com/shed/presentations/nla/pages/years_accuracy.html
Do we redesign the interface to highlight articles that have a low correction level? For instance, do we concentrate on years around 1880 or 1930 and not so much the years surrounding the First World War?
https://twitter.com/paulhagon/status/122846665722957826
How can we get our passionate users doing high value tasks? Would other crowdsourcing activities like geo-spatial references be more valuable? Could they be set up doing specific tasks on uncatalogued material?
We have:
Great content
Passionate users
Family History
We have great content. We have passionate users who want to help us. We have niche interest groups like Family History. The challenge is getting all of these factors to align with our strategic directions.
78 million newspaper lines corrected
Do you look at it the same way as you did 20 minutes ago?
@paulhagon
@TroveAustralia