Mapping Brain Connectivity through Large-scale Segmentation and Analysis by Stephen Plaza

Mapping Brain Connectivity through Large-scale Segmentation and Analysis. Stephen Plaza, Stuart Berg. @janeliaflyem @janelia-flyem @stephenplaza https://www.janelia.org/project-team/fly-em

Upload: spark-summit

Post on 16-Apr-2017



TRANSCRIPT

Page 1: Mapping Brain Connectivity Through Large-Scale Segmentation and Analysis by Stephen Plaza

Mapping Brain Connectivity through Large-scale Segmentation and Analysis

Stephen Plaza, Stuart Berg

@janeliaflyem @janelia-flyem @stephenplaza

https://www.janelia.org/project-team/fly-em

Page 2: Mapping Brain Connectivity Through Large-Scale Segmentation and Analysis by Stephen Plaza
Page 3: Mapping Brain Connectivity Through Large-Scale Segmentation and Analysis by Stephen Plaza

Outline
• Connectomics Background
• Image Segmentation and Challenges
• Large-scale Segmentation Framework
• Spark Architectural Details
• Results and Discussion

Page 4: Mapping Brain Connectivity Through Large-Scale Segmentation and Analysis by Stephen Plaza

What is a (Structural) Connectome?
• A map of brain connectivity
• A list of neurons (graph nodes) and how they are connected through synapses (graph edges)

[Figure: neurons, and the corresponding graph with connection strengths]
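The graph described on this slide can be sketched as a weighted, directed adjacency map. This is only an illustration: the neuron names are cell types that appear on a later slide, and the synapse list is invented for the example rather than taken from the dataset.

```python
from collections import defaultdict

def build_connectome(synapses):
    """Collapse a list of (pre, post) synapses into a weighted directed graph."""
    graph = defaultdict(lambda: defaultdict(int))
    for pre, post in synapses:
        graph[pre][post] += 1  # connection strength = number of synapses
    return {pre: dict(posts) for pre, posts in graph.items()}

# Hypothetical synapse list: the names are cell types, the counts are made up.
synapses = [("L1", "Mi1"), ("L1", "Mi1"), ("L1", "Tm3"),
            ("Mi1", "T4"), ("Tm3", "T4")]
connectome = build_connectome(synapses)
print(connectome)
```

Storing a count per edge keeps the "connection strength" from the figure without committing to any particular graph library.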

Page 5: Mapping Brain Connectivity Through Large-Scale Segmentation and Analysis by Stephen Plaza

Why a Connectome?
• Better understand how the brain works
• However: anatomy often provides just clues (like a map, often necessary, not sufficient to get somewhere)

Example Problem: How do animals detect motion?
• Theory: Hassenstein & Reichardt (1956)
• Connectome Helps Uncover Answer (Takemura, FlyEM, et al., Nature ‘13)

[Figure: motion detector model with photoreceptors, time delay (Δ), and multiplication (X)]
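The delay-and-multiply idea in the figure can be sketched in a few lines. This is a toy illustration, not the model from the cited papers: the signals, the one-sample delay, and the subtraction of the mirrored term (the usual opponent form of the detector, which the slide's labels do not show) are assumptions made for the example.

```python
def reichardt_response(left, right, delay=1):
    """Sum of opponent delay-and-multiply correlations of two photoreceptor signals."""
    total = 0.0
    for t in range(delay, len(left)):
        # Delayed left times current right favors left-to-right motion;
        # the mirrored term is subtracted so the sign encodes direction.
        total += left[t - delay] * right[t] - right[t - delay] * left[t]
    return total

# A bright spot passes the left receptor first, then the right one.
left = [0, 1, 0, 0, 0]
right = [0, 0, 1, 0, 0]
rightward = reichardt_response(left, right)   # positive
leftward = reichardt_response(right, left)    # negative
print(rightward, leftward)
```

The response changes sign when the two inputs are swapped, which is what makes the circuit direction selective.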

Page 6: Mapping Brain Connectivity Through Large-Scale Segmentation and Analysis by Stephen Plaza

How to Obtain a Connectome?
1. Extract animal brain
2. Image brain (electron microscopy) to generate images
3. Find neurons (cell membrane)
4. Find synapses

Page 7: Mapping Brain Connectivity Through Large-Scale Segmentation and Analysis by Stephen Plaza

Problem: Datasets are Very Large
• fly ~10^5 neurons (~100 TB of image data)
• rodents ~10^9 neurons
• human ~10^11 neurons

[Figure: arrows labeled 100x between scales]

Page 8: Mapping Brain Connectivity Through Large-Scale Segmentation and Analysis by Stephen Plaza

Our Group: FlyEM
• FlyEM mission: Perform cutting-edge connectomics using Electron Microscopy (EM) in Drosophila (fruit fly)

[Figure: EM, cell library, and synapses feeding a graph of cell types (L1, T4, Mi1, Tm3); team expertise: imaging, compute/algs, bio expertise, theorists]

Page 9: Mapping Brain Connectivity Through Large-Scale Segmentation and Analysis by Stephen Plaza

EM Reconstruction Pipeline

Page 10: Mapping Brain Connectivity Through Large-Scale Segmentation and Analysis by Stephen Plaza

Example: Fly Optic Lobe
• ~5 years of total human effort
• 315,421 synaptic connections
• ~842 reconstructed cells
• ~27,000 cubic microns (~27 GB of data) << whole fly brain

Video courtesy of Ting Zhao

Page 11: Mapping Brain Connectivity Through Large-Scale Segmentation and Analysis by Stephen Plaza

Bottlenecks in Generating Connectomes

Imaging Challenges
• Years to image something like a mouse brain (even with latest advances)
• Fly brain is already 100 TB of data

Proofreading Dataset
• Extensive manual component (depends on segmentation quality)
• Worse than imaging (e.g., 1 week of imaging → 1 year of proofreading)

Goal: Improve Segmentation (better segmentation → less manual proofreading)

Page 12: Mapping Brain Connectivity Through Large-Scale Segmentation and Analysis by Stephen Plaza

Outline
• Connectomics Background
• Image Segmentation and Challenges
• Large-scale Segmentation Framework
• Spark Architectural Details
• Results and Discussion

Page 13: Mapping Brain Connectivity Through Large-Scale Segmentation and Analysis by Stephen Plaza

Segmentation Pipeline

image stack → Boundary Prediction → voxel probabilities → Watershed → superpixels (over-segmentation, conservative) → Agglomeration (merge regions) → segmentation: segments (neurons?)

[Figure: example grid of voxel probabilities and superpixels 1, 2, 3; merging 1 and 2 leaves segments 1 and 3]
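The agglomeration step in the figure (merge superpixels 1 and 2, keep 3) can be sketched with a union-find over superpixel pairs. This is a simplified illustration, not NeuroProof's algorithm: the edge probabilities and the 0.5 threshold are invented, and a real agglomerator would typically recompute boundary evidence after each merge.

```python
def agglomerate(edges, threshold=0.5):
    """Merge superpixels whose shared boundary probability is below threshold.

    edges: dict mapping (a, b) superpixel pairs to a boundary probability.
    Returns a dict mapping each superpixel to its final segment label.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Visit the weakest boundaries first.
    for (a, b), prob in sorted(edges.items(), key=lambda kv: kv[1]):
        root_a, root_b = find(a), find(b)
        if prob < threshold and root_a != root_b:
            parent[max(root_a, root_b)] = min(root_a, root_b)  # keep smaller label
    return {x: find(x) for x in parent}

# Hypothetical boundary probabilities between the three superpixels.
edges = {(1, 2): 0.1, (2, 3): 0.9, (1, 3): 0.7}
segments = agglomerate(edges)
print(segments)
```

With these made-up values only the weak boundary between 1 and 2 is removed, matching the "merge 1 and 2" step in the figure.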

Page 14: Mapping Brain Connectivity Through Large-Scale Segmentation and Analysis by Stephen Plaza

Manual Components

image stack → Boundary Prediction → Watershed (over-segmentation, conservative) → Agglomeration (merge regions) → segmentation

Boundary training
• Small dataset
• Label background/foreground/etc.

Superpixel training
• Small dataset
• Yes/no questions

Manual revision
• Whole dataset
• Time consuming

Page 15: Mapping Brain Connectivity Through Large-Scale Segmentation and Analysis by Stephen Plaza

Segmentation Approaches
• Boundary Prediction: random forest, CNN (e.g., Ilastik [Fred Hamprecht lab])
• Agglomeration: greedy agglomeration, multi-cut, etc. (e.g., NeuroProof [Plaza, Parag])

Page 16: Mapping Brain Connectivity Through Large-Scale Segmentation and Analysis by Stephen Plaza

Ideal Segmentation

Page 17: Mapping Brain Connectivity Through Large-Scale Segmentation and Analysis by Stephen Plaza

Segmentation Algorithmic Challenges
• Overall quality susceptible to small errors (99% correct boundary prediction might not be enough)
• Neuron can span 1000s of images
• Small error causes big connectome change
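A back-of-the-envelope calculation shows why 99% may not be enough. The sketch below makes two assumptions that are not on the slide: that the 99% figure applies per image a neuron passes through, and that errors in different images are independent.

```python
def chance_of_error_free_neuron(per_image_accuracy, images_spanned):
    """Probability a neuron is followed correctly through every image,
    assuming (hypothetically) independent errors from image to image."""
    return per_image_accuracy ** images_spanned

p = chance_of_error_free_neuron(0.99, 1000)
print(f"{p:.1e}")
```

Under those assumptions a neuron spanning 1,000 images comes out error-free only a few times in 100,000, which is the intuition behind the slide's warning.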

Page 18: Mapping Brain Connectivity Through Large-Scale Segmentation and Analysis by Stephen Plaza

Segmentation Algorithmic Challenges
• Poor classifier generalizability (untrained areas can perform poorly)
• Imaging Artifacts (e.g., membrane holes)

Page 19: Mapping Brain Connectivity Through Large-Scale Segmentation and Analysis by Stephen Plaza

Segmentation Algorithmic Challenges
• Small neurites stress resolution of imaging (hard to segment manually, traditionally ignored in evaluation)
  – Small neurites: 10-40 nanometer
• How to evaluate algorithms
  – Need large ground truth (but hard to produce)
  – Small things cause big errors (humans aren’t perfect but do much better)

Page 20: Mapping Brain Connectivity Through Large-Scale Segmentation and Analysis by Stephen Plaza

Practical Considerations for Large-Scale Segmentation
• Dataset too large to fit in memory
• Complexity of distributed, large-scale compute: barrier to entry for algorithm developers
• Robustness: greater risk that long-running operation dies
• Flexibility: ability to partially rerun segmentation with better algorithms

[Figure: cycle of segment, proofread, new algorithm]

Page 21: Mapping Brain Connectivity Through Large-Scale Segmentation and Analysis by Stephen Plaza

Outline
• Connectomics Background
• Image Segmentation and Challenges
• Large-scale Segmentation Framework
• Spark Architectural Details
• Results and Discussion

Page 22: Mapping Brain Connectivity Through Large-Scale Segmentation and Analysis by Stephen Plaza

Our Solutions
• Distributed, scalable segmentation framework
• Robustness: Implement checkpoints for failure recovery
• Infrastructure and tools to enable community contributions
  – Plugin architecture to allow custom algorithm drop-in
  – Segmentation evaluation tools to focus on relevant errors

Page 23: Mapping Brain Connectivity Through Large-Scale Segmentation and Analysis by Stephen Plaza

Scalable Segmentation Framework
• Mostly local computation (easy to shard)
• Pretty scalable (not compute limited currently)

Dataset (e.g., >200 GB-2 TB, >100,000 cubic microns)
→ Map (overlapping subvolumes): boundary prediction, watershed, agglomeration
→ “Reduce”: stitch local volumes (consistent labeling)
→ Write: commit segmentation
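The "map" step starts by cutting the dataset into overlapping subvolumes. A minimal sketch of that partitioning follows; the function name, the block size, and the 10-voxel overlap are illustrative assumptions, not values from the talk.

```python
def overlapping_subvolumes(shape, block, overlap):
    """Yield (start, stop) corners that tile `shape` with blocks of size
    `block`, each padded by `overlap` voxels and clipped to the volume."""
    def axis_ranges(size, step):
        return [(max(0, s - overlap), min(size, s + step + overlap))
                for s in range(0, size, step)]

    zs, ys, xs = (axis_ranges(n, b) for n, b in zip(shape, block))
    for z0, z1 in zs:
        for y0, y1 in ys:
            for x0, x1 in xs:
                yield (z0, y0, x0), (z1, y1, x1)

# Hypothetical 100^3 volume cut into 50^3 blocks with a 10-voxel overlap.
subvolumes = list(overlapping_subvolumes((100, 100, 100), (50, 50, 50), overlap=10))
print(len(subvolumes), subvolumes[0], subvolumes[-1])
```

Each tuple is independent of the others, so a list like this can be handed to a cluster framework with one subvolume per task; the shared overlap is what the later stitching step relies on.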

Page 24: Mapping Brain Connectivity Through Large-Scale Segmentation and Analysis by Stephen Plaza

Check-Point and Rollback
• Group subvolumes into separate iterations
• Serialize each iteration (subvolume segmentation) to distributed disk
• On an error in iteration N, allow rollback to the N-1 saved stages

[Figure: iterations 1, 2, 3 each run boundary prediction, watershed, agglomeration and are saved to disk; their union is passed to stitch local volumes (consistent labeling)]
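The checkpoint scheme can be sketched as a loop that saves each iteration's results and skips iterations already on disk when rerun. This is an illustration only: a local temporary directory stands in for the distributed disk, `pickle` stands in for whatever serialization the framework uses, and the failing `flaky` function simulates a crash.

```python
import os
import pickle
import tempfile

def run_with_checkpoints(iterations, segment, checkpoint_dir):
    """Segment each group of subvolumes, reusing groups already saved on disk."""
    results = []
    for i, group in enumerate(iterations):
        path = os.path.join(checkpoint_dir, f"iteration_{i}.pkl")
        if os.path.exists(path):
            with open(path, "rb") as f:       # reuse the earlier checkpoint
                results.extend(pickle.load(f))
            continue
        done = [segment(sv) for sv in group]  # may raise; earlier files survive
        with open(path, "wb") as f:
            pickle.dump(done, f)
        results.extend(done)
    return results                            # the union of all iterations

checkpoint_dir = tempfile.mkdtemp()
groups = [[1, 2], [3, 4], [5, 6]]
calls = []

def flaky(sv):
    """Toy segmentation that fails the first time it sees subvolume 5."""
    calls.append(sv)
    if sv == 5 and calls.count(5) == 1:
        raise RuntimeError("simulated executor failure")
    return sv * 10

try:
    run_with_checkpoints(groups, flaky, checkpoint_dir)
except RuntimeError:
    pass  # the first two iterations are already on disk
labels = run_with_checkpoints(groups, flaky, checkpoint_dir)
print(labels, calls)
```

On the second run only the failed iteration is recomputed, which is the behavior the rollback bullet describes.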

Page 25: Mapping Brain Connectivity Through Large-Scale Segmentation and Analysis by Stephen Plaza

Rerun Segmentation
Goal: reuse previous proofread neurons

Subvolume Segmentation Task: Boundary Prediction → Watershed → Agglomeration

[Figure: example grid of voxel probabilities and superpixels 1, 2, 3; merging 1 and 2 leaves segments 1 and 3 (neurons?)]

Page 26: Mapping Brain Connectivity Through Large-Scale Segmentation and Analysis by Stephen Plaza

Rerun Segmentation
Goal: reuse previous proofread neurons

Subvolume Segmentation Task: Boundary Prediction → Watershed → Agglomeration

[Figure: the same example with a proofread neuron added; superpixels 1 and 2 are merged, while labels 4 and 5 appear both before and after agglomeration]

Page 27: Mapping Brain Connectivity Through Large-Scale Segmentation and Analysis by Stephen Plaza

Stitching Subvolume Segmentation
Goal: create global segmentation (do not propagate ‘small’ segmentation errors)
• Stitch by overlap (ideal case)
• Stitch by overlap (pathological case): a bad segmentation leads to a false merge
• Stitch by conservative overlap (avoid branching): don’t merge

How to avoid being too conservative?

[Figure: numbered segments 1-4 in adjacent subvolumes illustrating each case]
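One way to read "conservative overlap" is to link two labels only when each is the other's largest-overlap partner, so a single label never joins two segments on the other side. The sketch below implements that reading; it is an assumption about the rule, not the stitching criterion used in DVIDSparkServices, and the label arrays are invented.

```python
from collections import Counter

def conservative_matches(labels_a, labels_b):
    """Pair labels from two subvolumes over their shared overlap region.

    labels_a, labels_b: equal-length sequences giving each overlap voxel's
    label in subvolume A and in subvolume B. A pair is kept only when each
    label is the other's largest-overlap partner.
    """
    overlap = Counter(zip(labels_a, labels_b))
    best_for_a, best_for_b = {}, {}
    for (a, b), n in overlap.items():
        if n > best_for_a.get(a, (None, 0))[1]:
            best_for_a[a] = (b, n)
        if n > best_for_b.get(b, (None, 0))[1]:
            best_for_b[b] = (a, n)
    return sorted((a, b) for a, (b, _) in best_for_a.items()
                  if best_for_b[b][0] == a)

# Hypothetical overlap region: B's label 10 wrongly covers both 1 and 2 from A.
labels_a = [1, 1, 1, 2, 2, 3]
labels_b = [10, 10, 10, 10, 10, 30]
matches = conservative_matches(labels_a, labels_b)
print(matches)
```

Plain overlap stitching would join 1 and 2 through label 10 (the pathological false merge); the mutual-best rule links only 1 with 10 and leaves 2 unmerged, at the cost of sometimes being too conservative.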

Page 28: Mapping Brain Connectivity Through Large-Scale Segmentation and Analysis by Stephen Plaza

Large-scale Segmentation Evaluation
Goal: allow large-scale evaluation of different algorithms

seg 1 and seg 2 (or ground truth) → map (subvolumes) → subvolume comparison → combine statistics → final report

Example metrics
• Large-scale similarity
• Small process similarity
• Edit distance
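The compare-then-combine flow can be sketched with per-subvolume contingency tables that are summed before a statistic is computed. The statistic here (fraction of voxels outside the dominant partner of their segment) is a simple stand-in chosen for the example, not one of the metrics listed on the slide, and the label arrays are invented and assumed to use globally consistent labels.

```python
from collections import Counter
from functools import reduce

def compare_subvolume(seg1, seg2):
    """Map step: contingency table of (seg1 label, seg2 label) voxel counts."""
    return Counter(zip(seg1, seg2))

def combine(table_a, table_b):
    """Reduce step: contingency tables simply add."""
    return table_a + table_b

def false_merge_fraction(table):
    """Fraction of voxels lying outside the dominant seg2 label of their seg1 segment."""
    per_segment = {}
    for (a, _), n in table.items():
        per_segment.setdefault(a, []).append(n)
    total = sum(table.values())
    return sum(sum(ns) - max(ns) for ns in per_segment.values()) / total

# Two hypothetical subvolumes, each given as (seg 1 labels, seg 2 labels).
subvolume_pairs = [([1, 1, 2, 2], [7, 7, 8, 8]),
                   ([1, 1, 2, 2], [7, 9, 8, 8])]
tables = [compare_subvolume(s1, s2) for s1, s2 in subvolume_pairs]
report = false_merge_fraction(reduce(combine, tables))
print(report)
```

Because the tables add, the comparison can run independently on every subvolume and only the small tables need to be brought together.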

Page 29: Mapping Brain Connectivity Through Large-Scale Segmentation and Analysis by Stephen Plaza

Accessing Large Image Data
• Goals:
  – Simple image-oriented API to GET/POST subvolume data
  – Abstract storage layer from client
  – Previous segmentations versioned and saved
• Use Distributed, Versioned, Image-oriented Dataservice (DVID)

Bill Katz
https://github.com/janelia-flyem/dvid
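To make "GET/POST subvolume data" concrete, the sketch below only builds a request URL and performs no network call. The path layout is an assumption modeled on DVID's voxel-data HTTP API and should be checked against the DVID documentation; the server address, UUID, and data name are placeholders.

```python
def raw_subvolume_url(server, uuid, data_name, size, offset):
    """Build a URL for a 3D subvolume request (path layout is an assumption)."""
    size_str = "_".join(str(n) for n in size)
    offset_str = "_".join(str(n) for n in offset)
    return f"{server}/api/node/{uuid}/{data_name}/raw/0_1_2/{size_str}/{offset_str}"

# Placeholder server, version UUID, and data instance name.
url = raw_subvolume_url("http://localhost:8000", "abc123", "grayscale",
                        size=(512, 512, 512), offset=(0, 0, 100))
print(url)
```

The version identifier in the path is what lets a client ask for an earlier saved segmentation, and the client never needs to know which storage engine sits behind the service.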

Page 30: Mapping Brain Connectivity Through Large-Scale Segmentation and Analysis by Stephen Plaza

Outline
• Connectomics Background
• Image Segmentation and Challenges
• Large-scale Segmentation Framework
• Spark Architectural Details
• Results and Discussion

Page 31: Mapping Brain Connectivity Through Large-Scale Segmentation and Analysis by Stephen Plaza

Why Implement in Spark?
• Portability between cluster environments (e.g., AWS, Google compute, in-house SGE)
• Simple model for distributed computing
  – Encourage greater community involvement
  – Easier to maintain and extend
• Store entire segmentation for large volumes in memory (enable future work requiring global, distributed memory access to segmentation)

Page 32: Mapping Brain Connectivity Through Large-Scale Segmentation and Analysis by Stephen Plaza

Design Features
• Written in Python (pyspark)
• Primarily long-running, disjoint tasks on large subvolumes => need compression and robustness to crashes
• Allow customizable plugins (can be an executable called by Python)

Plugin 1: Boundary Prediction (Input: grayscale; Output: voxel probabilities)
Plugin 2: Watershed (Input: voxel probabilities; Output: labels)
Plugin 3: Agglomeration (Input: voxel probabilities; Output: labels)
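The plugin idea can be sketched as a registry of interchangeable callables chained by name. Everything here is hypothetical: the registry, the function names, and the toy 1D algorithms are stand-ins for illustration and do not reflect the actual DVIDSparkServices plugin interface. Agglomeration is left out to keep the example short.

```python
def boundary_prediction(grayscale):
    """Toy stand-in: dark voxels are treated as likely membrane."""
    return [1.0 - v / 255 for v in grayscale]

def watershed(probabilities, threshold=0.5):
    """Toy stand-in: start a new label after every run of boundary voxels."""
    labels, current, in_boundary = [], 1, False
    for p in probabilities:
        if p >= threshold:
            labels.append(0)          # 0 marks boundary voxels
            in_boundary = True
        else:
            if in_boundary:
                current += 1
                in_boundary = False
            labels.append(current)
    return labels

# Hypothetical registry: swapping an algorithm means changing one entry.
PLUGINS = {"boundary": boundary_prediction, "watershed": watershed}

def run_plugins(grayscale, names):
    """Feed each plugin's output into the next one."""
    data = grayscale
    for name in names:
        data = PLUGINS[name](data)
    return data

labels = run_plugins([250, 240, 10, 245, 250], ["boundary", "watershed"])
print(labels)
```

Because each stage only agrees on its input and output types, a stage could equally wrap an external executable, as the slide notes.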

Page 33: Mapping Brain Connectivity Through Large-Scale Segmentation and Analysis by Stephen Plaza

Design Features
• Fast lz4 compression for pyspark serialization
• Fast lz4 compression directly on numpy arrays (cpickle performs slowly on large datasets)

[Figure: numpy array → lz4 → cpickle is faster than numpy array → cpickle → lz4; example label volume: ~1 GB → lz4 → ~30 MB]
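The benefit of compressing a label volume's raw buffer can be illustrated with the standard library alone. Note the substitutions: `zlib` stands in for lz4 and a plain `bytes` buffer stands in for a numpy array, so this shows the idea (label volumes with long uniform runs compress very well) rather than the speed comparison on the slide.

```python
import zlib

def compress_buffer(raw):
    """Compress raw bytes directly (zlib standing in for lz4)."""
    return zlib.compress(raw, 1)  # level 1 favors speed over ratio

# A toy "label volume": long runs of identical 64-bit labels.
label_volume = b"".join(label.to_bytes(8, "little") * 50_000
                        for label in (1, 2, 3))
packed = compress_buffer(label_volume)
restored = zlib.decompress(packed)
print(len(label_volume), "->", len(packed), "bytes")
```

Shrinking each element before it is serialized reduces how much data has to move between workers and to disk.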

Page 34: Mapping Brain Connectivity Through Large-Scale Segmentation and Analysis by Stephen Plaza

High-level Architecture

[Figure: Spark application “CreateSegmentation” connected through SparkDVID to DVID (contains dataset)]

Main Tasks
1. partition dataset
2. subvolume segmentation (checkpoints: disk backup)
   • 1 subvolume/partition
   • Time-consuming
   • No shuffling
3. match boundaries
   • Extract substack boundary region
   • Shuffle/reduce
4. stitch
   • Several stitches/subvolume
   • Very fast
5. write subvolume
   • Remap labels
   • Foreach write on subvolumes

Minimize large shuffles (mostly overlap boundaries)

Page 35: Mapping Brain Connectivity Through Large-Scale Segmentation and Analysis by Stephen Plaza

Outline
• Connectomics Background
• Image Segmentation and Challenges
• Large-scale Segmentation Framework
• Spark Architectural Details
• Results and Discussion

Page 36: Mapping Brain Connectivity Through Large-Scale Segmentation and Analysis by Stephen Plaza

Experimental Setup
• Goal: re-segment partially proofread region
• Dataset: portion of fly optic lobe
  – 232,000 cubic microns
  – 453 GB
  – 3,375 subvolumes
• Each worker (16 cores, 90 GB memory)
• Cluster size: (32 workers, 512 cores, 2880 GB memory)
• Only single server DB behind DVID (for now)
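The setup numbers can be cross-checked with a few lines of arithmetic. The per-subvolume figures are derived here and are not stated on the slide; they assume decimal units (1 GB = 1000 MB) and ignore the overlap between subvolumes.

```python
workers, cores_per_worker, memory_per_worker_gb = 32, 16, 90
dataset_gb, subvolumes, volume_um3 = 453, 3375, 232_000

total_cores = workers * cores_per_worker             # matches the slide: 512
total_memory_gb = workers * memory_per_worker_gb     # matches the slide: 2880
mb_per_subvolume = dataset_gb * 1000 / subvolumes    # roughly 134 MB each
um3_per_subvolume = volume_um3 / subvolumes          # roughly 69 cubic microns each
subvolumes_per_core = subvolumes / total_cores       # roughly 6.6

print(total_cores, total_memory_gb,
      round(mb_per_subvolume), round(um3_per_subvolume, 1),
      round(subvolumes_per_core, 1))
```

Since 3,375 = 15 × 15 × 15, the subvolume count is consistent with a 15 × 15 × 15 grid, although the slide does not say how the volume was divided.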

Page 37: Mapping Brain Connectivity Through Large-Scale Segmentation and Analysis by Stephen Plaza

Results
• Runtimes
  – Subvolume segmentation: 42 hours (depends greatly on plugins used) (some spark recomputation due to executor failure)
  – ~4-5 hours per iteration (7 iterations)
  – Shuffling and stitching: 58 minutes
  – Writing segmentation: 20 hours
• Fast restart from checkpoint: ~30 sec, only 95 GB serialized segmentation
• >25 hours due to serial read/writes through single-server backend (will be fixed soon)

Page 38: Mapping Brain Connectivity Through Large-Scale Segmentation and Analysis by Stephen Plaza

Conclusions
• Open-source large-scale segmentation in Spark: DVIDSparkServices (https://github.com/janelia-flyem/DVIDSparkServices)
• Fast checkpointing and rollback capabilities
• Robust stitching
• Flexible plugin architecture
• Enables in-memory manipulation of segmented data

Spark Challenges
• Centralized system for custom task-level logging (monitoring/debugging)
• Dynamic cluster sizing/settings (e.g., some tasks require more memory)
• Serialization of large (over 2 GB) RDD elements

Page 39: Mapping Brain Connectivity Through Large-Scale Segmentation and Analysis by Stephen Plaza

Future Work
• Improve throughput of backend datastore
• Test and deploy on the cloud (Google, AWS)
• Increase robustness/flexibility by allowing partial segmentation write-out (stitch using overlap with previously written results)

Page 40: Mapping Brain Connectivity Through Large-Scale Segmentation and Analysis by Stephen Plaza

Questions?

Page 41: Mapping Brain Connectivity Through Large-Scale Segmentation and Analysis by Stephen Plaza

Walkthrough: DVIDSparkServices

Create a custom workflow
1. Define JSON schema
2. Inherit from “Workflow” – e.g., CustomWorkflow(Workflow)
3. Implement “dumpschema” (return JSON schema string)
4. Implement “execute” (runs actual spark application)

DVIDSparkServices (python module)
• “workflows” (contains plugins), described by a JSON schema and run from a JSON file (run application)
  – IngestGrayscale
  – ComputeGraph
  – EvaluateSeg
  – CreateSegmentation
• reconutils
• sparkdvid
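The four steps for a custom workflow can be sketched as follows. The `Workflow` class below is a hypothetical stand-in defined locally so the example runs on its own; the real base class lives in DVIDSparkServices and its constructor, and whether `dumpschema` is a static method, are assumptions here. The schema contents are also invented.

```python
import json

class Workflow:
    """Hypothetical stand-in for the DVIDSparkServices base class."""
    def __init__(self, config):
        self.config = config

class CustomWorkflow(Workflow):
    # Step 1: define a JSON schema for the workflow's configuration.
    schema = {
        "type": "object",
        "properties": {"message": {"type": "string"}},
        "required": ["message"],
    }

    # Step 3: return the JSON schema as a string.
    @staticmethod
    def dumpschema():
        return json.dumps(CustomWorkflow.schema)

    # Step 4: a real workflow would build and run its Spark job here.
    def execute(self):
        return f"running with {self.config['message']}"

workflow = CustomWorkflow({"message": "hello"})
print(CustomWorkflow.dumpschema())
print(workflow.execute())
```

Keeping the schema next to the code lets a launcher validate a config.json before any cluster resources are used.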

Page 42: Mapping Brain Connectivity Through Large-Scale Segmentation and Analysis by Stephen Plaza

Walkthrough: DVIDSparkServices

Running a workflow (locally)
1. Install dvidsparkservices (with conda)
2. Download spark binary
3. Add spark to path
4. spark-submit --master local[8] workflows/launchworkflow.py CustomWorkflow -c config.json

[Figure: DVIDSparkServices (python module) containing “workflows” (contains plugins), reconutils, and sparkdvid; JSON schema and JSON (run application)]

Page 43: Mapping Brain Connectivity Through Large-Scale Segmentation and Analysis by Stephen Plaza

Walkthrough: DVIDSparkServices

Launching an application on the cluster with DVIDServicesServer
1. Install DvidServicesServer
2. Modify config.json as necessary
3. Modify SparkLaunch/* config as necessary
4. Launch server (DVIDServicesServer –port X config.json)
5. Navigate to web front-end and launch job
6. Use web page to monitor job

DVIDServicesServer: What it does
• Scripts to launch spark on cluster
• Provide intuitive web interface
• Simple API for job tracking

[Figure: DVIDServicesServer, configured by config.json and SparkLaunch/*, launching DVIDSparkServices “workflows”]

Page 44: Mapping Brain Connectivity Through Large-Scale Segmentation and Analysis by Stephen Plaza

Demo