Trove crowdsourcing behaviour

Paul Hagon

@paulhagon

@TroveAustralia

Monday, 4 March 13

Crowdsourcing profile

Look beyond the numbers

2 things I want you to take away today.

2,328,207

www.nla.gov.au

Visits to www.nla.gov.au in 2011-2012. Is this a lot? Is it not very much? You don’t know.

67%

How does your perception of that number change when I say that 67% of our visitors spent less than 30 seconds on our site?

[Figure: click map of www.nla.gov.au showing the percentage of clicks on each link, ranging from 27% down to 0.3%]

Does it change further when we measure where people click? You start to get another idea of how your site is used. Studies like this led to the recent redesign of our website.

Hive from National Archives of Australia - http://transcribe.naa.gov.au

What’s on the menu from the New York Public Library, where users can transcribe menus - http://menus.nypl.org

British Library georeference, where users can place maps over Google Earth to give coordinates - http://www.bl.uk/maps/

Flickr Commons where institutions can upload photos for people to add comments & tags. - http://www.flickr.com/commons

And Trove from the National Library of Australia with their newspaper corrections. - http://trove.nla.gov.au/newspaper

Search

A bit of background. When digitising text, if the scanned material just goes up online, it’s the equivalent of a browse interface to the physical object. In the case of newspapers you need to know the title, the date, the page etc. Not a good experience. So we need search...

Search

OCR

Search needs text, so in this case we need to apply OCR over the digitised text. But OCR isn’t perfect, so...

Search

OCR

Correction

We can improve the OCR by adding in human correction - crowdsourcing.

In turn, this improves search

[Chart: OCR corrections and articles digitised, by year (1803 to 1979)]

OCR correction levels. There’s a relatively high OCR correction rate for articles. Human correction is the “icing on the cake”

316 million resources
60,000+ unique visitors per day
10+% visits from mobile
78 million newspaper lines corrected
1.7 million tags added
75,000 registered users
46,000 comments
29,000 Trove lists

These are a little out of date but they are the sorts of stats that we typically report on for things like the annual report. Fine for overall figures, but not really good at telling us exactly how users are using our resources and how we can improve our services based upon this. I’m really interested in the newspaper corrections.

$12 million

We’ve estimated that if we had to employ staff to make these corrections, it would have cost in the vicinity of $12 million. That is a massive benefit to the Library & to the community.

[Charts: Trove content by work count and by usage. Categories: Archived websites; Australian newspapers; Books; Diaries, letters, archives; Journal articles; Lists; Maps; Music, sound & video; People & organisations; Pictures & photos]

This shows what Trove is made up of. Journals, Archived websites & Newspapers are the resources with the most content.

This shows what is being used. 85% of Trove use is from newspapers. It starts to give us an indication of where we can focus time, energy & resources.

I’m really interested in newspapers & the activity surrounding that.

There’s more to Trove than newspaper corrections

One thing to keep in the back of your minds is there’s more to Trove than text corrections. Say after me....

[Chart: share of text corrections. Registered users: 85%; anonymous users: 15%]

One thing to focus on is newspapers & one of the appeals of newspapers is the correction. 85% of text corrections have been made by users that have created an account on Trove and are logged in. This is a commitment & an indication of having a relationship with the Library.

[Chart: number of corrected lines per user, from top of leaderboard to bottom of leaderboard]

We have a leaderboard with 23,000 users that have made corrections. It’s not quite gamification, and other studies have shown that competitiveness isn’t a main motivation for corrections. A very small number of users have done a lot of corrections, followed by a super long tail of users that have made very few corrections.

50% users < 100 corrections

75% users < 500 corrections

0.01% users > 1 million corrections

Now less than 100 lines of corrections isn’t a small amount of correction. The big numbers that you most commonly hear are being done by a very small number of users.

Users: 23,000

Corrections: 68,000,000

Let’s look at it in a different way. We can’t track the behaviour of users who aren’t logged in. Approximately 23,000 logged-in users have made 68,000,000 corrections.

[Chart: share of corrections by user rank. Top 100 users: 43%]

The top 100 users have made 43% of corrections

[Chart: share of corrections by user rank. Top 100 users: 43%; next 900 users (top 1000 in total): 38%]

The top 1000 users have made 81% of all corrections

[Chart: share of corrections by user rank. Top 100 users: 43%; next 900 users: 38%; next 4,000 users (top 5000 in total): 15%]

The top 5000 users have made 96% of all corrections
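
The cumulative figures above (43%, then 43% + 38% = 81%, then 81% + 15% = 96%) are each a "share of all corrections made by the top N users". A minimal sketch of that calculation follows, assuming only that a per-user count of corrected lines is available. The function name and the sample counts are illustrative and are not Trove data.

```python
def top_n_share(corrections_per_user, n):
    """Return the fraction of all corrections made by the n most active users."""
    ranked = sorted(corrections_per_user, reverse=True)
    total = sum(ranked)
    if total == 0:
        return 0.0
    return sum(ranked[:n]) / total


# Made-up counts for ten users, chosen only to show a long-tail shape.
counts = [500, 300, 100, 50, 30, 10, 5, 3, 1, 1]
for n in (1, 3, 5):
    print(f"top {n}: {top_n_share(counts, n):.0%}")
# top 1: 50%
# top 3: 90%
# top 5: 98%
```

The same function applies unchanged to tags or comments per user, which makes it easy to compare the shape of participation across projects.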

Top 100 users extremely important

Top 1000 users very important

So we’re starting to see the top users are extremely important to the corrections program.

How do these patterns compare across other crowdsourcing activities? Hive from National Archives of Australia

[Chart: correction activity at Hive, per user]

Hive from National Archives. Much smaller numbers at 448 users, but the usage patterns are nearly identical.

Let’s look at the 6 Australian institutions that participate in Flickr Commons, crowdsourcing their photographs using tags & comments.

[Chart: number of tags per user]

Flickr tags. Takes the same shape.

[Chart: share of tags by user rank. Top 10 users: 62%; next 90 users (top 100 in total): 27%]

Approx 1,005 users have added 31,026 tags

The top 100 users have added 89% of all tags

[Chart: number of comments per user]

Flickr comments per user. Once again we start to see an identical pattern.

[Chart: share of comments by user rank. Top 20 users: 20%; next 80 users: 10%; next 900 users (top 1000 in total): 17%]

Approx 12,753 users have added 26,173 comments

The top 1000 users have made 47% of all comments

[Chart: number of corrected lines per user]

Given the user behaviour, how can we encourage someone from the top 1000 to keep at it to reach the top 100? It’s a massive difference in the number of corrections needed. To get there, you need to give up work and start text correcting full time.

Number of times user edited the same article

Recency

OCR accuracy

Could this be ranked not just by the number of corrections but also by incorporating how often the user returns, how efficient they are at correcting (not returning to the same article) or how “difficult” the article might be (as a measure of the initial OCR accuracy)? Let’s look at a few options.
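
One way such a combined ranking could be expressed is sketched below. Everything in it is an assumption made for illustration: the field names, the 30-day half-life for recency, and the way the factors are multiplied together are hypothetical and do not describe how the Trove leaderboard works.

```python
from dataclasses import dataclass


@dataclass
class CorrectorStats:
    lines_corrected: int      # total lines the user has corrected
    days_since_last: int      # recency: days since the user's last correction
    articles_corrected: int   # distinct articles the user has corrected
    repeat_edits: int         # times the user went back to an article already edited
    mean_ocr_accuracy: float  # average initial OCR accuracy (0.0 to 1.0) of those articles


def composite_score(s, recency_half_life=30.0):
    """Hypothetical leaderboard score; every weight here is an assumption."""
    # Recency factor: 1.0 for a correction made today, 0.5 after one half-life.
    recency = 0.5 ** (s.days_since_last / recency_half_life)
    # Difficulty factor: lower initial OCR accuracy means a harder article.
    difficulty = 1.0 - s.mean_ocr_accuracy
    # Efficiency factor: fewer return visits to the same article scores higher.
    visits = s.articles_corrected + s.repeat_edits
    efficiency = s.articles_corrected / visits if visits else 0.0
    return s.lines_corrected * (0.5 + recency) * (0.5 + difficulty) * efficiency
```

With this shape, two users with the same number of corrected lines are separated by how recently they were active and how hard their articles were, which is the kind of ranking the question above is asking about.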

http://wraggelabs.com/shed/presentations/nla/pages/years_accuracy.html

Could we rank on accuracy or difficulty of article? We have an approximate OCR accuracy rate and we know the exact number of characters corrected.



[Chart: days since last correction, top 100 users. Buckets: 0, 1, 2, 3, 4, 5, 6, 7, 8-14, 15-30, 31-60, 61-120, 121-364, 365+; axis from 0% to 50% of users]

How often do users make corrections? This uses the same recency buckets as Google Analytics. Over 40% of the top 100 users return on a daily basis.
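
The bucketing behind these recency charts can be sketched as follows. The bucket boundaries are the ones shown on the chart axis; the function names and the idea of passing in each user's last correction date are assumptions for illustration.

```python
from collections import Counter
from datetime import date

# Inclusive upper bound and label for each bucket on the chart axis.
BUCKETS = [(d, str(d)) for d in range(8)] + [
    (14, "8-14"), (30, "15-30"), (60, "31-60"), (120, "61-120"), (364, "121-364"),
]


def recency_bucket(days):
    """Return the chart bucket label for a number of days since the last correction."""
    for upper, label in BUCKETS:
        if days <= upper:
            return label
    return "365+"


def recency_distribution(last_correction_dates, today):
    """Map each bucket label to the share of users whose last correction falls in it."""
    counts = Counter(recency_bucket((today - d).days) for d in last_correction_dates)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}
```

Running `recency_distribution` separately over the top 100, the top 1000 and all registered users would give the three series being compared.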

[Chart: days since last correction, top 1000 users. Same buckets; axis from 0% to 50% of users]

The top 1000 aren’t quite as dedicated. There’s a decrease in the immediate recency & an increase in the long term return rates.

[Chart: days since last correction, all registered users. Same buckets; axis from 0% to 50% of users]

Looking at the pattern for the overall registered user base, 7% of users have made corrections within the past week. Nearly 70% of users haven’t made corrections in the previous 6 months & for nearly 45% of users it’s been more than 12 months since they last made a correction.

[Chart: days since last correction, comparing top 100, top 1000 and overall. Same buckets; axis from 0% to 50% of users]

So the behaviour for the top 100 users is the opposite to the general behaviour patterns.

48,822

http://www.flickr.com/photos/denial_land/4183422564/

To give a bit of an idea of the patterns, 48,822 lines were corrected on Christmas Day. Given that an average day will see in the vicinity of 120,000 corrections, it’s quite amazing.

[Chart: days between first correction & last correction (lifespan), per user; axis from 0 to 2,000 days]

Is there a burnout time, when people have had enough of text correction? This measures the days between the first time a user made a correction & the last time they made a correction. There isn’t really a burnout.
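
The lifespan measure is simple enough to sketch directly. The input shape below (a mapping from a user identifier to the dates of that user's corrections) is an assumption for illustration.

```python
from datetime import date


def lifespans(correction_dates_by_user):
    """Days between each user's first and last correction (0 for a single visit)."""
    return {
        user: (max(dates) - min(dates)).days
        for user, dates in correction_dates_by_user.items()
        if dates  # users with no recorded corrections are skipped
    }
```

A user who corrected only once has a lifespan of 0 days, so a long tail of one-off contributors would show up as a cluster at zero.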

[Chart: articles with corrections, OCR corrections and articles digitised, by year (1803 to 1979)]

There doesn’t appear to be any specific time periods that people are targeting (eg: First World War etc).

[Charts: article types and corrected article types, each broken down into Advertising; Article; Detailed lists, results, guides; Family Notices; Literature; Other]

Each article is classified according to the type of article it is: an article, advertisement, births deaths marriages etc. Trove newspapers are mostly articles. Not many articles are Family Notices.

Once we look at what type of articles are being corrected, there’s some definite activity around family notices.

[Chart: % of article types corrected, for Advertising; Article; Detailed lists, results, guides; Family Notices; Literature; Other; axis from 0 to 100]

If we look at it a bit differently, as a percentage of the total for each article type, nearly 64% of the family notices have had some level of correction.
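
The difference between the two views is the denominator: the earlier breakdown divides by all corrected articles, while this one divides by the number of articles of each type. A minimal sketch of the second calculation, assuming each article is available as a pair of its type and a flag for whether it has any corrections (both names are illustrative):

```python
def corrected_share_by_type(articles):
    """articles: iterable of (article_type, has_corrections) pairs.

    Returns {article_type: percentage of that type with at least one correction}.
    """
    totals, corrected = {}, {}
    for article_type, has_corrections in articles:
        totals[article_type] = totals.get(article_type, 0) + 1
        if has_corrections:
            corrected[article_type] = corrected.get(article_type, 0) + 1
    return {t: 100.0 * corrected.get(t, 0) / n for t, n in totals.items()}
```

A small category can therefore have a high corrected percentage even though it contributes only a modest slice of all corrected articles.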

The future?

How can we use this information to dictate the future direction of Trove newspapers?

[Chart: number of articles, 2008-06 to 2012-06; axis from 0 to 80,000,000]

We keep adding articles to Trove. This isn’t going to stop. It’s increasing in a linear fashion.

[Chart: number of corrections per month, 2008-06 to 2012-06; axis from 0 to 3,000,000]

The number of corrections that are happening each month isn’t increasing at the same rate.

[Chart: number of users making corrections per month, 2008-06 to 2012-06; axis from 0 to 4,000]

Likewise the number of users making corrections isn’t increasing in a linear fashion. Are we reaching a plateau in what our existing users are capable of doing?

[Chart: number of corrections per month (axis from 0 to 3,000,000) against number of articles (axis from 0 to 80,000,000), 2008-06 to 2012-08, with a question mark over what comes next]

Let’s get back to the situation we faced at the start of the project. What’s going to happen over the next couple of years into the future? If we keep on putting more & more pages up - what happens when our correctors can’t keep up?
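
One way to put a number on "can’t keep up" is to track corrections per month relative to the size of the collection. The sketch below is an assumption about how that could be measured; the function name and the sample figures are made up for illustration and are not Trove data.

```python
def corrections_per_thousand_articles(cumulative_articles, monthly_corrections):
    """For each month, corrections made that month per 1,000 articles then online."""
    return [
        1000.0 * corrections / articles if articles else 0.0
        for articles, corrections in zip(cumulative_articles, monthly_corrections)
    ]


# Made-up series: articles grow linearly while monthly corrections level off.
articles = [10_000, 20_000, 30_000, 40_000]
corrections = [500, 600, 600, 600]
print(corrections_per_thousand_articles(articles, corrections))
# [50.0, 30.0, 20.0, 15.0]
```

If the collection grows linearly while monthly corrections plateau, this ratio falls month after month, which is the situation the question above describes.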

If articles keep getting added & the corresponding number of users aren’t joining or correcting, will search slowly become less effective?

Do we need to improve OCR through automated means? Improvements in OCR technology, general text pattern analysis.

Or improve manual corrections through marketing, promotion, incentives? Do we need to change our API to allow write access so machines could programmatically correct text?


Do we redesign the interface to highlight articles that have a low correction level? For instance, do we concentrate on years around 1880 or 1930 and not so much the years surrounding the First World War?
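
Choosing which years to highlight could start from the ratio of corrected to digitised articles per year. A minimal sketch, assuming per-year counts are available as two dictionaries (the names and the tie-breaking by year are illustrative choices):

```python
def least_corrected_years(digitised_by_year, corrected_by_year, k=5):
    """Return the k years with the lowest share of corrected articles."""
    ratios = {
        year: corrected_by_year.get(year, 0) / digitised
        for year, digitised in digitised_by_year.items()
        if digitised  # skip years with nothing digitised
    }
    # Sort by ratio first, then by year so that ties come out in a stable order.
    return sorted(ratios, key=lambda year: (ratios[year], year))[:k]
```

The interface could then surface articles from the returned years first, although whether that steers correctors in practice is an open question.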



https://twitter.com/paulhagon/status/122846665722957826

How can we get our passionate users doing high value tasks? Would other crowdsourcing activities like geo-spatial references be more valuable? Could they be set up doing specific tasks on uncatalogued material?

We have:

Great content

Passionate users

Family History

We have great content. We have passionate users who want to help us. We have niche interest groups like Family History. The task is getting all of these factors to align with our strategic directions.

78 million newspaper lines corrected

Do you look at it the same way as you did 20 minutes ago?

[email protected]

@paulhagon

@TroveAustralia
