Panel on citizen science and crowdsourcing games - March 27, 2015

Expert-guided classifier design, gene-centric web portal, bioinformatics algorithm optimization. Andrew Su, Ph.D., @andrewsu, [email protected], http://sulab.org, Slides: slideshare.net/andrewsu

Upload: andrew-su

Post on 17-Jul-2015


TRANSCRIPT

• Gene-specific review article for every human gene
• Data integration for genes, drugs, diseases
• Robust classifiers of breast cancer prognosis
• Annotation of biomedical literature
• Expert-guided classifier design
• Gene-centric web portal
• Bioinformatics algorithm optimization

Andrew Su, Ph.D.
@andrewsu
[email protected]
http://sulab.org
Slides: slideshare.net/andrewsu

Mark2Cure – biocuration by microtasking

• Challenge: The biomedical literature is massive and growing exponentially, but it is largely inaccessible
• Opportunity: Better access to existing knowledge can make the scientific process more efficient and productive
• Current situation
  – Manual biocuration by experts
  – Natural language processing

Mark2Cure – biocuration by microtasking

• Our approach: Use the Amazon Mechanical Turk platform for paid microtask crowdsourcing
• Results: reproduced an expert-generated gold standard at equivalent accuracy, in a shorter time, and at a fraction of the cost

Figure: precision and recall (K = 6, F score = 0.87)

• 593 documents
• 9 days
• 145 workers
• $0.06 / task
• Total cost: $630.96
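
The K and F score values above suggest that worker annotations were merged by a voting threshold and then scored against the expert gold standard. A minimal sketch of that idea follows. It assumes that an annotation is kept when at least k workers marked it and that "F score" means the balanced F1 measure; neither detail is stated on the slide. The function names and the toy data are hypothetical.

```python
from collections import Counter


def aggregate_by_votes(worker_annotations, k):
    """Keep every annotation that at least k workers marked (assumed rule)."""
    votes = Counter()
    for annotations in worker_annotations:
        votes.update(set(annotations))  # one vote per worker per annotation
    return {annotation for annotation, n in votes.items() if n >= k}


def precision_recall_f(predicted, gold):
    """Score a predicted annotation set against a gold standard (balanced F1)."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Toy data, invented for illustration: three workers annotating one document.
workers = [
    {"doc1:mention_a", "doc1:mention_b", "doc1:mention_c"},
    {"doc1:mention_a", "doc1:mention_b"},
    {"doc1:mention_a", "doc1:mention_d"},
]
gold_standard = {"doc1:mention_a", "doc1:mention_b", "doc1:mention_d"}

consensus = aggregate_by_votes(workers, k=2)
precision, recall, f_score = precision_recall_f(consensus, gold_standard)
print(sorted(consensus), precision, recall, f_score)
```

Raising k generally trades recall for precision, which is why a single threshold such as k = 6 has to be chosen when reporting one F score.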

Mark2Cure – biocuration by citizen science

• Our approach: Use volunteer-based citizen science for microtask crowdsourcing
• Results: reproduced an expert-generated gold standard at equivalent accuracy, in a shorter time, and at no cost

• 593 documents
• 28 days
• 212 workers
• Total cost: $0.00

Figure: precision and recall by voting threshold (k = 6, F score = 0.84)

http://mark2cure.org

Collaborative knowledge management

• Challenge: Biomedical research allows for genome-scale profiling, but few of the genes are already familiar to the researcher
• Opportunity: Better access to existing knowledge can make the scientific process more efficient and productive
• Current situation
  – Review articles (but sparse coverage)
  – Lots of reading of primary literature

Collaborative knowledge management

• Our approach: Create a gene-specific review article for every human gene that is collaboratively written, continuously updated, and community reviewed
• Results: 5M page views and >1000 edits per month

Collaborative knowledge management

• Our approach: Create a gene-specific Wikidata database entry for every human gene that is collaboratively integrated, continuously updated, and community reviewed
• Results: all human genes and diseases loaded in Wikidata, with drugs and relationships to follow soon
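
As an illustration of what gene entries in Wikidata make possible, the sketch below builds a SPARQL query for human genes and sends it to the public Wikidata query service. The identifiers used (P31 "instance of", Q7187 "gene", P703 "found in taxon", Q15978631 "Homo sapiens", P351 "Entrez Gene ID") reflect how genes appear to be modeled in Wikidata and are assumptions, not taken from the slides. The function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Public Wikidata SPARQL endpoint.
WIKIDATA_SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"


def human_gene_query(limit=10):
    """Build a SPARQL query listing human genes with their Entrez Gene IDs."""
    return f"""
SELECT ?gene ?geneLabel ?entrez WHERE {{
  ?gene wdt:P31 wd:Q7187 ;        # instance of: gene (assumed modeling)
        wdt:P703 wd:Q15978631 ;   # found in taxon: Homo sapiens
        wdt:P351 ?entrez .        # Entrez Gene ID
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
LIMIT {int(limit)}
""".strip()


def run_query(query):
    """Send a query to the endpoint and return the result rows (needs network)."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    request = urllib.request.Request(
        WIKIDATA_SPARQL_ENDPOINT + "?" + params,
        headers={"User-Agent": "gene-query-sketch/0.1 (illustrative example)"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["results"]["bindings"]


# Usage (requires network access):
#   for row in run_query(human_gene_query(limit=5)):
#       print(row["entrez"]["value"], row["geneLabel"]["value"])
```

Because the entries are structured data rather than article text, the same pattern could in principle be extended to diseases, drugs, and their relationships once those are loaded.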

Bioinformatics algorithm optimization

• Challenge: Antibody sequence clustering is computationally expensive (CPU and memory)
• Opportunity: Large-scale clustering of antibody sequences can aid vaccine development
• Current situation: Research-grade code can cluster ~100k sequences in 1.7 hours on a high-memory (150 GB) machine.

Bioinformatics algorithm optimization

• Our approach: Ran a TopCoder contest for 10 days, offering $7500 in prize money
• Results: Best solution can cluster 2.3M sequences in 30 seconds on a typical desktop computer (1.1 GB)
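
To put the before and after figures side by side, the short calculation below compares throughput and memory use. It is a rough sketch: it assumes the "(1.1 GB)" figure refers to memory, treats "~100k" as exactly 100,000, and ignores the fact that the two runs used different hardware and that clustering time need not scale linearly with the number of sequences. The variable names are hypothetical.

```python
# Figures quoted on the slides.
baseline_sequences = 100_000        # "~100k sequences"
baseline_seconds = 1.7 * 3600       # 1.7 hours
baseline_memory_gb = 150            # high-memory machine

best_sequences = 2_300_000          # "2.3M sequences"
best_seconds = 30
best_memory_gb = 1.1                # assumed to be memory use


def throughput(sequences, seconds):
    """Sequences clustered per second."""
    return sequences / seconds


baseline_rate = throughput(baseline_sequences, baseline_seconds)
best_rate = throughput(best_sequences, best_seconds)

throughput_ratio = best_rate / baseline_rate
memory_ratio = baseline_memory_gb / best_memory_gb

print(f"baseline: {baseline_rate:.1f} sequences/s")
print(f"contest winner: {best_rate:.0f} sequences/s")
print(f"throughput ratio: {throughput_ratio:.0f}x")
print(f"memory ratio: {memory_ratio:.0f}x")
```

Under these assumptions the winning solution processes sequences several thousand times faster per second while using roughly one hundredth of the memory.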

Benchmarks

Figure: axes are log(# sequences processed) and log(execution time)

Other group members
Cyrus Afrasiabi, Ramya Gamini, Louis Gioia, Salvatore Loguercio, Adam Mark, Erick Scott, Greg Stupp, Kevin Xin

Mark2Cure
Ben Good, Max Nanis, Ginger Tsueng, Chunlei Wu, All Mark2Curators!

Gene Wiki
Ben Good, Sebastian Burgstaller, Andra Waagmeester, Elvira Mitraka (UMB), Lynn Schriml (UMB), Paul Pavlidis (UBC), Gang Fu (NCBI)

Contests
Chunlei Wu, Ben Good, Brian Briney (TSRI), Dennis Burton (TSRI), Rinat Sergeev (HBS), Jin Paik (HBS), Karim Laklani (HBS), Jingbo Shang, Rashid Sial (Appirio)

Funding and Support
BioGPS: GM83924
Gene Wiki: GM089820
BD2K Center of Excellence: GM114833

Contact
http://sulab.org
[email protected]
@andrewsu
+Andrew Su

Join the team! bit.ly/sulabawesome

Game for breast cancer prognosis

• Challenge: Genomic classifiers of disease are difficult to train in a way that consistently validates on secondary datasets
• Opportunity: Better classifiers of disease diagnosis and/or prognosis have many clinical applications
• Current situation: Most attempts to train classifiers rely on machine learning methods that utilize little or no biological knowledge

Game for breast cancer prognosis

• Our approach: Enlist a crowd of expert game players with diverse perspectives to identify the most biologically relevant genes
• Results: Gene sets derived from game player data showed comparable performance to expert-generated gene sets

• 1077 registered players
• 15,669 games played
• Demographics
  – 59% male, 41% female
  – 21-29 is the most frequent age group
  – 35% had a graduate degree, 32% were biologists
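
One simple way to turn many games into a gene set is to count how often each gene was picked and keep the most frequently chosen ones. The sketch below shows that idea only. It is not the procedure used in the study, which the slides do not describe; the function name and the game records are invented for illustration.

```python
from collections import Counter


def top_genes(games, n):
    """Rank genes by the number of games in which they were picked; keep the top n.

    Ties are broken alphabetically so the result is deterministic.
    """
    picks = Counter()
    for selected in games:
        picks.update(set(selected))  # count each gene once per game
    ranked = sorted(picks.items(), key=lambda item: (-item[1], item[0]))
    return [gene for gene, _ in ranked[:n]]


# Invented game records: each inner list is the genes picked in one game.
games = [
    ["ESR1", "ERBB2", "MKI67"],
    ["ESR1", "MKI67"],
    ["ESR1", "GAPDH"],
    ["ERBB2", "MKI67"],
]

gene_set = top_genes(games, n=2)
print(gene_set)
```

A gene set produced this way could then be used as the feature list for a conventional classifier and compared with an expert-curated list on held-out data.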