

Deep learning tools for the measurement of animal behavior in neuroscience

Mackenzie Weygandt Mathis 1,* and Alexander Mathis 1

1 Harvard University, Cambridge, MA USA | *[email protected]

Recent advances in computer vision have made accurate, fast, and robust measurement of animal behavior a reality. In the past years, powerful tools specifically designed to aid the measurement of behavior have come to fruition. Here we discuss how capturing the postures of animals - pose estimation - has been rapidly advancing with new deep learning methods. While challenges still remain, we envision that the fast-paced development of new deep learning tools will rapidly change the landscape of realizable real-world neuroscience.

Highlights:

1. Deep neural networks are shattering performance benchmarks in computer vision for various tasks.

2. Using modern deep neural network (DNN) approaches in the lab is a fruitful route to robust, fast, and efficient measurement of animal behavior.

3. New DNN-based tools allow for customized tracking approaches, which open new avenues for more flexible and ethologically relevant real-world neuroscience.

Introduction

Behavior is the most important output of the underlying neural computations in the brain. Behavior is complex, often multi-faceted, and highly context dependent, both in how conspecifics or other observers understand it and in how it is emitted. The study of animal behavior - ethology - has a rich history rooted in the understanding that behavior gives an observer a unique look into an animal’s umwelt (1–3); what are the motivations, instincts, and needs of an animal? What survival value do they provide? In order to understand the brain, we need to measure behavior in all its beauty and depth, and distill it down into meaningful metrics. Observing and efficiently describing behavior is a core tenet of modern ethology, neuroscience, medicine, and technology.

In 1973 Tinbergen, Lorenz, and von Frisch were the first ethologists awarded the Nobel Prize in Physiology or Medicine for their pioneering work on the patterns of individual and social group behavior (4). The award heralded a coming-of-age for behavior, and for how rigorously documenting behavior can influence how we study the brain (4). Manual methods are powerful, but also highly labor intensive and subject to the limits of our senses. Matching (and extending) the capabilities of biologists with technology is a highly non-trivial problem (5, 6), yet harbors tremendous potential. How does one compress an animal’s behavior over long time periods into meaningful metrics? How does one use behavioral quantification to build a better understanding of the brain and an animal’s umwelt (1)?

In this review we discuss the advances, and challenges, in animal pose estimation and its impact on neuroscience. Pose estimation refers to methods for measuring posture, while posture denotes the geometrical configuration of body parts. While there are many ways to record behavior (7–9), videography is a non-invasive way to observe the posture of animals. Estimated poses across time can then, depending on the application, be transformed into kinematics, dynamics, and actions (3, 5–7, 10). Due to the low-dimensional nature of posture, these applications are computationally tractable.
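To make this last point concrete, here is a minimal sketch (our illustration, not from any cited toolbox) of how a 2D keypoint time series might be turned into simple kinematic measures; the array shapes and the 60 Hz frame rate are illustrative assumptions.

```python
import numpy as np

# poses: (n_frames, n_bodyparts, 2) array of x, y pixel coordinates,
# e.g. as exported by a pose estimation toolbox (illustrative shape).
poses = np.random.rand(1000, 3, 2) * 640  # placeholder data
fps = 60.0                                # assumed frame rate

# Velocity of each body part (pixels/second) via finite differences.
velocity = np.gradient(poses, 1.0 / fps, axis=0)
speed = np.linalg.norm(velocity, axis=2)  # (n_frames, n_bodyparts)

# Joint angle at body part 1, formed by the segments to parts 0 and 2
# (e.g. shoulder-elbow-wrist), in radians.
u = poses[:, 0] - poses[:, 1]
v = poses[:, 2] - poses[:, 1]
cos_ang = (u * v).sum(axis=1) / (np.linalg.norm(u, axis=1) * np.linalg.norm(v, axis=1))
joint_angle = np.arccos(np.clip(cos_ang, -1.0, 1.0))
```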

A very brief history of pose estimation.

The postures and actions of animals have been documented as far back as cave paintings, illustrating the human desire to distill the essence of an animal for conveying information. As soon as it was possible to store data on a computer, researchers have built systems for automated analysis. Over time, these systems reflected all flavors of artificial intelligence, from rule-based via expert systems to machine learning (11, 12). Traditionally, posture was measured by placing markers on the subject (3), or markerlessly by using body models (i.e. cylinder-based models with edge features (13)). Other computer vision techniques, such as using texture or color to segment the person from the background to create silhouettes (14, 15), or using so-called hand-crafted features with decoders (11, 12, 16), were also popular before deep learning flourished.
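For intuition, a minimal sketch of the classical background-subtraction approach mentioned above (our illustration, not any specific published pipeline): segment the animal as a silhouette by thresholding the difference from a background image, then track its centroid.

```python
import numpy as np

def silhouette_centroid(frame, background, thresh=30):
    """Classic pre-deep-learning tracking: threshold the difference
    from a static background to get a silhouette, return its centroid."""
    diff = np.abs(frame.astype(float) - background.astype(float))
    mask = diff > thresh                      # boolean silhouette
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        return None                           # animal not found
    return xs.mean(), ys.mean()               # centroid (x, y)

# Usage with grayscale frames (illustrative arrays):
background = np.zeros((480, 640), dtype=np.uint8)
frame = background.copy()
frame[200:220, 300:330] = 255                 # a bright "animal"
print(silhouette_centroid(frame, background)) # -> approx (314.5, 209.5)
```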

The deep learning revolution for posture.

Pose estimation is a challenging problem, but it has been tremendously advanced in the last five years due to advances in deep learning. Deep neural networks (DNNs) are computational algorithms that consist of simple units, which are organized in layers and then serially stacked to form “deep networks”. The connections between the units are trained on data and therefore learn to extract information from raw data in order to solve tasks. The current deep learning revolution started with achieving human-level accuracy for object recognition on the ImageNet challenge, a popular benchmark with many categories and millions of images (16, 17). A combination of large annotated data sets, sophisticated network architectures, and advances in hardware made this possible and quickly impacted many problems in computer vision (see reviews (12, 16, 18)).
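As a concrete (toy) illustration of “simple units organized in layers,” here is a minimal stacked network in PyTorch; the layer sizes are arbitrary and purely illustrative.

```python
import torch
import torch.nn as nn

# A toy "deep network": simple units (linear maps + nonlinearities)
# organized in layers and serially stacked. The connection weights
# are what training adjusts to extract task-relevant information.
net = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),   # layer 1
    nn.Linear(256, 64),  nn.ReLU(),   # layer 2
    nn.Linear(64, 10),                # output layer (e.g. 10 classes)
)

x = torch.randn(32, 784)              # a batch of flattened images
logits = net(x)                       # forward pass: (32, 10)
loss = nn.functional.cross_entropy(logits, torch.randint(0, 10, (32,)))
loss.backward()                       # gradients for training the weights
```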




Figure. 1. 2D pose estimation, 3D pose estimation & dense representations of humans and other animals: a: Example 2D multi-human pose estimation from OpenPose (19). b: Example 3D human pose estimation from (20). c: Dense representations of humans with DensePose, adapted from Guler et al. (21). d: Animals have diverse bodies and experimenters are often interested in specific key points, making tailored networks attractive solutions. DNNs open a realm of possibilities: from mice to cuttlefish. e: 3D pose estimation requires multiple camera views, or 2D to 3D “lifting”. f: The new SMALST model, which fits full 3D models to images, from Zuffi et al. (22), applied to zebras.

2D and 3D (human) pose estimation.

In 2014 “DeepPose” was the first paper to apply deep learning to human 2D pose estimation (23), and immediately new networks were proposed that improved accuracy by introducing a translation invariant model (24), and convolutional networks plus geometric constraints (25, 26). In the few years since, numerous human pose estimation papers (approx. 4,000 on Google Scholar), and new benchmarks with standardized datasets plus evaluation metrics, appeared, which allow better comparisons of “state-of-the-art” performance (27). This culture has driven rapid and remarkable increases in performance: from 44% of body parts correctly labeled to nearly 94% - with the top 15 networks being within a few percentage points of each other (an example top network is shown in Figure 1a) (19, 28–32). The history and many advances in 2D human pose estimation are comprehensively reviewed in (11, 33).

3D human pose estimation is a more challenging task, and 3D labeled data is more difficult to acquire. There have been massive improvements in networks; see review (34). Yet currently, the highest accuracy is achieved by using multiple 2D views to reconstruct a 3D estimate (Figure 1b; (20, 35)), but other ways of “lifting” 2D into 3D are being actively explored (20, 36, 37).
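In the spirit of the simple lifting baselines cited above (e.g. (35)), a hedged sketch of what “2D to 3D lifting” means in code: a small fully connected network regresses 3D joint positions from 2D detections. The layer widths and the 17-joint skeleton are our assumptions, not the cited architecture.

```python
import torch
import torch.nn as nn

N_JOINTS = 17  # assumed skeleton size (COCO-style)

# "Lifting" network: maps a flattened vector of 2D keypoints to 3D.
lifter = nn.Sequential(
    nn.Linear(N_JOINTS * 2, 1024), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(1024, 1024), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(1024, N_JOINTS * 3),
)

pose_2d = torch.randn(8, N_JOINTS * 2)           # batch of 2D detections
pose_3d = lifter(pose_2d).view(8, N_JOINTS, 3)   # predicted 3D joints
```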

Dense representations of bodies.

Other video-based approaches for capturing the posture and soft tissue of humans (and other animals) also exist. Depth cameras such as the Microsoft Kinect have been used in humans (38, 39) and rodents (40, 41). Recently, dense-pose representations, i.e. 3D point clouds or meshes (Figure 1c), have become a popular and elegant way to capture the soft tissue and shape of bodies, which are highly important features for person identification, fashion (i.e. clothing sales), and in medicine (21, 42, 43). However, state-of-the-art performance currently requires body-scanning of many subjects to make body models. Typically, large datasets are collected to enable the creation of robust algorithms for inference on diverse humans (or, for animals, scanning toy models has been fruitful (44)). Recently, outstanding improvements have been made to capture shapes of animals from images (22, 45, 46). However, there are no animal-specific toolboxes geared towards neuroscience applications, although we believe that this will change in the near future, as for many applications having the soft tissue measured will be highly important, e.g. in obesity or pregnancy research.

Animal pose estimation

The remarkable performance when using deep learning for human 2D & 3D pose estimation plus dense representations made this large body of work ripe for exploring its utility in neuroscience (Figure 1d-f). In the past two years, deep learning tools for laboratory experiments have arrived (Figure 2a-d).

Many of the properties of DNNs were extremely appealing: remarkable and robust performance, relatively fast inference due to GPU hardware, and efficient code due to modern packages like TensorFlow and PyTorch (reviewed in (47)). Furthermore, unlike for many previous algorithms, neither body models nor tedious manual tuning of parameters is required. Given the algorithms, the crucial ingredient for human pose estimation success was large-scale, well-annotated data sets of humans with the locations of the body parts.

Here, we identify 5 key areas that were important for making DNN-based pose estimation tools useful for neuroscience laboratories, and review the progress in the last two years:

1. Can DNNs be harnessed with small training datasets? Due to the nature of “small-scale” laboratory experiments, labeling 20,000 or more frames (the typical size of human benchmark datasets) is not a feasible approach.

2. The end result must be as accurate as human manually-applied labels (i.e. the gold standard), and computationally tractable (fast).

3. The resulting networks should be robust to changes in experimental setups and, for long-term storage and re-analysis of video data, to video compression.

4. Animals move in 3D, thus having efficient solutions for 3D pose estimation would be highly valuable, especially in the context of studying motor learning and control.

5. Tracking multiple subjects and objects is important for many experiments studying social behaviors as well as for animal-object interactions.

1. Small training sets for lab-sized experiments.

While the challenges discussed above for human pose estimation also apply to other animals, one important challenge for applying these methods to neuroscience was annotated data sets - could DNNs be harnessed for much smaller datasets, at sizes reasonable for typical labs? Thus, while it was clear that, given enough annotated frames, the same algorithms would be able to learn to track the body parts of any animal, there were feasibility concerns.

Human networks are typically trained on thousands of images, and nearly all the current state-of-the-art networks provide tailored solutions that utilize the skeleton structure during inference (19, 30). Thus, applying these tools to new datasets was not immediately straightforward, and to create animal-specific networks one would potentially need to curate large datasets of the animal(s) to be tracked. Additionally, researchers would need tailored DNNs to track their subjects (plus the ability to track unique objects, such as the corners of a box, or an implanted fiber).

Thus, one of the most important challenges is creating tailored DNNs that are robust and generalize well with little training data. One potential solution for making networks for animal pose estimation that could generalize well, even with little data, was to use transfer learning - the ability to take a network that has been trained on one task and use it to perform another. The advantage is that these networks are pretrained on larger datasets (for different tasks where a lot of data is available, like ImageNet), and are therefore effectively imbued with good image representations.

This is indeed what “DeepLabCut,” the first tool to leverage the advances in human pose estimation for application to animals, did (48). DeepLabCut was built on a subset of “DeeperCut” (29), which was an attractive option due to its use of ResNets, which are powerful for transfer learning (49, 50). Moreover, transfer learning reduces training times (49–51), and there is a significant performance gain over using randomly-initialized networks, especially for smaller datasets (50).
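A minimal sketch of this transfer-learning recipe in PyTorch: take an ImageNet-pretrained ResNet, drop its classification head, and attach a head that predicts one heatmap (“score map”) per body part. This illustrates the general idea under our own simplifying assumptions; it is not the exact DeepLabCut/DeeperCut architecture.

```python
import torch
import torch.nn as nn
import torchvision

N_BODYPARTS = 4  # e.g. snout, ears, tail base (illustrative)

# ImageNet-pretrained backbone (transfer learning): reuse all layers
# except the average-pool and classification head. (Newer torchvision
# versions use the `weights=` argument instead of `pretrained=`.)
resnet = torchvision.models.resnet50(pretrained=True)
backbone = nn.Sequential(*list(resnet.children())[:-2])  # (B, 2048, H/32, W/32)

# New head: upsample features and predict one heatmap per body part.
head = nn.Sequential(
    nn.ConvTranspose2d(2048, 256, kernel_size=4, stride=2, padding=1),
    nn.ReLU(),
    nn.Conv2d(256, N_BODYPARTS, kernel_size=1),
)

x = torch.randn(1, 3, 256, 256)          # one video frame
heatmaps = head(backbone(x))             # (1, N_BODYPARTS, 16, 16)
# During fine-tuning, train with a pixelwise loss against target
# heatmaps centered on the labeled keypoints.
```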

The major result from DeepLabCut was benchmarking on smaller datasets and finding that only a few hundred annotated images are enough to achieve excellent results for diverse pose estimation tasks like locomotion, reaching and trail-tracking in mice, egg-laying in flies, and hunting in cheetahs, due to transfer learning (Figure 2f,g,h) (50, 52, 53). “DeepBehavior,” which utilized different DNN packages for various pose estimation problems, also illustrated the gain of transfer learning (51).
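In practice, the workflow with such a toolbox is short. The sketch below follows the DeepLabCut 2.x Python API as documented in its user guide (function names may change across versions; check the current docs), with hypothetical project and video names.

```python
import deeplabcut

# Hypothetical project and videos; a few hundred labeled frames can
# suffice thanks to transfer learning (48, 50).
config = deeplabcut.create_new_project(
    "reaching", "mackenzie", ["/data/videos/session1.mp4"]
)
deeplabcut.extract_frames(config)            # pick diverse frames to label
deeplabcut.label_frames(config)              # GUI for manual annotation
deeplabcut.create_training_dataset(config)
deeplabcut.train_network(config)             # fine-tune pretrained backbone
deeplabcut.evaluate_network(config)          # train/test pixel errors
deeplabcut.analyze_videos(config, ["/data/videos/session2.mp4"])
```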

2. Accuracy & speed.

To be useful, pose estimation tools need to be as good as human annotation of frames (or tracking markers, another proxy for a human-applied label). DeepLabCut was shown to reach human labeling accuracy (48), and can replace physical markers (58). Moreover, these tools need to be efficient (fast) for both offline and online analysis. Speed is often related to the depth of the network. Stacked-hourglass networks, which use iterative refinement (28, 59) and fewer layers, are fast. Two toolboxes, “LEAP” (55) and “DeepPoseKit” (60), adopted variants of stacked-hourglass networks. LEAP allows the user to rapidly compute postures and then perform unsupervised behavioral analysis (Figure 2d,e) (61). This is an attractive solution for real-time applications, but it is not quite as accurate. For various datasets, DeepPoseKit reports it is about three times as accurate as LEAP, yet similar to DeepLabCut (60). They also report about twice the video-processing speed of DeepLabCut and LEAP for batch processing (on small frame sizes).

Deeper networks are slower, but often have more generalization ability (49). DeepLabCut was designed for generalization and therefore utilized deeper networks (ResNets) that are inherently slower than stacked-hourglass networks, yet DeepLabCut can match human accuracy in labeling (Figure 2f) (48). The speed has been shown to be compatible with online-feedback applications (52, 62, 63). Other networks recently added to DeepLabCut (with a MobileNetV2 backbone) give slightly lower accuracy, with twice the speed (50). Overall, on GPU hardware all packages are fast, and reach speeds of several hundred frames per second in offline modes.
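Regardless of backbone, most of these tools read out keypoints the same way: the network emits a confidence map (“heatmap”) per body part, and the prediction is its peak. A minimal numpy sketch of that readout (our illustration; the threshold and stride values are assumptions):

```python
import numpy as np

def heatmaps_to_keypoints(heatmaps, min_confidence=0.5, stride=8):
    """heatmaps: (n_bodyparts, h, w) confidence maps from the network.
    Returns (n_bodyparts, 3): x, y in image pixels plus peak confidence.
    `stride` maps heatmap cells back to image coordinates (assumed value)."""
    n, h, w = heatmaps.shape
    keypoints = np.zeros((n, 3))
    for i, hm in enumerate(heatmaps):
        flat = hm.argmax()
        y, x = divmod(flat, w)                     # location of the peak
        conf = hm[y, x]
        if conf < min_confidence:
            keypoints[i] = (np.nan, np.nan, conf)  # treat as occluded
        else:
            keypoints[i] = (x * stride, y * stride, conf)
    return keypoints

# Example: one bump per body part (toy data).
hm = np.zeros((2, 32, 32)); hm[0, 10, 20] = 0.9; hm[1, 5, 5] = 0.2
print(heatmaps_to_keypoints(hm))  # part 1 confident, part 2 flagged NaN
```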



Figure. 2. DNNs applied to animal pose estimation. a: Knee tracking during cycling, adapted from (54). b: 3D limb estimates from (52). c: A Lichen Katydid tracked with DeepLabCut, courtesy of the authors. d: Fly with LEAP-annotated body parts. The circles indicate the fraction for which predicted positions of the particular body part are closer to the ground truth than the radii on test images (adapted from (55)). e: Density plot of t-SNE plus frequency-transformed body-part trajectories of freely moving flies. Patches with higher probability indicate more common movements, like different types of grooming behaviors (adapted from (55)). f: DeepLabCut requires little data to match human performance. Average error (RMSE) for several splits of training and test data vs. number of training images, compared to the RMSE of a human scorer. Each split is denoted by a cross, the average by a dot. For 80% of the data, the algorithm achieves human-level accuracy on the test set. As expected, test RMSE increases for fewer training images. Around 100 diverse frames are enough to provide high tracking performance (<5-pixel accuracy; adapted from (48)). g: Networks that perform better on ImageNet perform better for predicting 22 body parts on horses, on within-domain data (similar data distribution as the training set, red) and out-of-domain data (novel horses, black). The faint lines are individual splits (adapted from (50)). h: Due to the convolutional network architecture, a network trained on one mouse generalizes to detect body parts of three mice (adapted from (48)). i: 3D reaching kinematics of a rat (adapted from (56)). j: 3D pose estimation on a cheetah for 2 example poses from 6 cameras, as well as example 2D views (adapted from (53)). k: Pupil and pupil-size tracking (adapted from (57)).

3. Robustness.

Neuroscience experiments based on video recordings produce large quantities of data and are collected over extensive periods of time. Thus, analysis pipelines should be robust to a myriad of perturbations, such as changes in setups (backgrounds, light sources, cameras, etc.), subject appearance (due to different animal strains), and compression algorithms (which allow storage of perceptually good videos with little memory demand (64)).

How can robustness be increased within the DNN? Both transfer learning (discussed above) and data augmentation strategies are popular and rapidly evolving approaches to increase robustness in DNNs (see review (65)). Moreover, active learning approaches allow an experimenter to continuously build more robust and diverse datasets for large-scale projects by expanding the training set with images where the network fails (48, 53, 55). So far, the toolboxes have been tested on data from the same distribution (i.e. by splitting frames from videos into test and training data), which is important for assessing performance (48, 55, 60), but does not directly test out-of-domain robustness.
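One concrete augmentation example (a minimal sketch; real pipelines use the libraries reviewed in (65)): a horizontal flip must transform the keypoint labels together with the image, and paired body parts (e.g. left/right ears) must be swapped.

```python
import numpy as np

def flip_horizontal(image, keypoints, swap_pairs=((1, 2),)):
    """image: (h, w, 3); keypoints: (n_bodyparts, 2) as (x, y).
    Flips both, then swaps left/right body part labels.
    `swap_pairs` lists index pairs of mirrored parts (assumed layout)."""
    h, w = image.shape[:2]
    image_aug = image[:, ::-1]                    # mirror the image
    kps_aug = keypoints.copy()
    kps_aug[:, 0] = (w - 1) - kps_aug[:, 0]       # mirror x coordinates
    for i, j in swap_pairs:                       # relabel left <-> right
        kps_aug[[i, j]] = kps_aug[[j, i]]
    return image_aug, kps_aug

# Usage: part 0 = snout, parts 1/2 = left/right ear (hypothetical).
img = np.zeros((480, 640, 3), dtype=np.uint8)
kps = np.array([[320.0, 100.0], [300.0, 80.0], [340.0, 80.0]])
img2, kps2 = flip_horizontal(img, kps)
```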

Over the course of long-term experiments the background or even the animal strain can change, which means having robust networks would be highly advantageous. We recently tested the generalization ability of DeepLabCut with different network backbones for pose estimation. We find that pretraining on ImageNet strongly improves out-of-domain performance, and that networks that perform better on ImageNet are more robust (Figure 2g) (50). There is still a gap to close in out-of-domain performance, however.

DeepLabCut is also robust to video compression, as compression by more than 1,000X only mildly affects accuracy (less than 1 pixel of additional average error) (52). The International Brain Lab (IBL) independently and synergistically showed that, for tracking multiple body parts in a rodent decision-making task, DeepLabCut is robust to video compression (66). Thus, in practice users can substantially compress videos while retaining accurate posture information.
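To exploit this in practice, one can re-encode videos with a standard codec before analysis; the snippet below shows a typical ffmpeg invocation from Python (the CRF value and file names are illustrative choices; higher CRF means stronger compression).

```python
import subprocess

# Re-encode with H.264 (64); CRF ~17 is near visually lossless, larger
# values compress harder. Given the robustness results above, substantial
# compression typically costs well under a pixel of tracking accuracy.
subprocess.run([
    "ffmpeg", "-i", "raw_session.avi",
    "-c:v", "libx264", "-crf", "23",
    "-preset", "medium",
    "compressed_session.mp4",
], check=True)
```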

4. 3D animal pose estimation.

Currently, there are several animal pose estimation toolboxes that explicitly support 2D and 3D key-point detection (53, 67–69). DeepLabCut uses 2D pose estimation to train a single camera-invariant 2D network (or multiple 2D networks) that is then used to perform traditional triangulation to extract 3D key points (Figure 2i, j; (53, 56)). A pipeline built on DeepLabCut called “Anipose” allows for 3D reconstruction from multiple cameras using a wider variety of methods (67). “DeepFly3D” (68) uses the network architecture from Newell et al. (28) and then adds elegant tools to compute an accurate 3D estimate of Drosophila melanogaster by using the fly itself rather than standard calibration boards. Zhang et al. use epipolar geometry to train across views and thereby improve 3D pose estimation for mice, dogs, and monkeys (69).
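The triangulation step these pipelines share is classical computer vision: given camera projection matrices and matched 2D detections, solve the direct linear transform (DLT) least-squares problem for the 3D point. A minimal two-camera numpy sketch (our illustration; toolboxes like Anipose wrap calibrated, multi-camera versions of this):

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """DLT triangulation. P1, P2: (3, 4) camera projection matrices;
    x1, x2: (x, y) pixel detections of the same body part in each view.
    Returns the 3D point minimizing the algebraic reprojection error."""
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]              # from homogeneous coordinates

# Toy example: two cameras offset along x, point at (0, 0, 5).
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
x1 = np.array([0.0, 0.0])            # projection in camera 1
x2 = np.array([-0.2, 0.0])           # projection in camera 2
print(triangulate(P1, P2, x1, x2))   # ~ [0, 0, 5]
```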

5. Multi-animal & object tracking.

Many experiments in neuroscience require measuring interactions of multiple animals or interactions with objects. Having the ability to flexibly track user-defined objects or multiple animals is therefore highly desirable. There are many pre-deep-learning algorithms that allow tracking of objects (one such modern example, called “Tracktor,” also nicely summarizes this body of work (70)). Recently, researchers have also applied deep learning to this problem. For example, the impressive “idTracker.ai” (71) allows users to track a hundred individual, unmarked animals. Arac et al. used YOLO, a popular and fast object localization network, for tracking two mice during a social behavior (51). These, and others, can then be combined with pose estimation packages for estimating the pose of multiple animals. Currently, two paths are possible: one is to apply pose estimation algorithms after tracking individuals (for which any package could be used); the other is to extract multiple detections for each part on each animal (Figure 2h; (48, 72)) and link them using part affinity fields (19), pairwise predictions (29), or geometrical constraints (72), plus combinatorics.
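A minimal sketch of the first path (track, then assign detections to individuals), using the Hungarian algorithm from SciPy to match new keypoint detections to existing animal tracks by distance; the threshold and shapes are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_detections(tracks, detections, max_dist=50.0):
    """tracks: (n_animals, 2) last known positions; detections: (m, 2)
    new snout detections. Returns {track_index: detection_index},
    matched optimally by Euclidean distance (Hungarian algorithm)."""
    cost = np.linalg.norm(tracks[:, None] - detections[None, :], axis=2)
    rows, cols = linear_sum_assignment(cost)
    return {r: c for r, c in zip(rows, cols) if cost[r, c] < max_dist}

# Two mice, two new detections (toy coordinates).
tracks = np.array([[100.0, 100.0], [400.0, 300.0]])
dets = np.array([[405.0, 296.0], [97.0, 104.0]])
print(assign_detections(tracks, dets))  # {0: 1, 1: 0}
```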

The impact on experimental neuroscience

In the short time period these tools have become available, there has been rather wide adoption by the neuroscience and ethology community. Beyond the original publications, DeepLabCut has already been used for pose estimation and behavioral analysis in many contexts. For instance, knee movement quantification during cycling (Figure 2a) (54), postural analysis during underwater running (73), social behavior in bats (74), benchmarking thermal constraints with optogenetics (75), fly leg movement analysis (76, 77), 3D rat reaching (Figure 2i) (56), hydra in a thermal stimulation assay (78), and pupillometry (Figure 2k) (57). Inanimate objects can also be tracked: it has been used to track metal beads subjected to a high voltage (79), and magic tricks (i.e. coins and the magician) (80). LEAP (55) has been used to track ants (81) and mice (82).

Pose estimation is just the beginning; the next steps involve careful analysis of kinematics, building detailed, multi-scale ethograms of behaviors, new modeling techniques to understand large-scale brain activity and behaviors across a multitude of timescales, and beyond. We envision three branches where powerful feature tracking and extensions will be useful: motor control studies (often involving complex motor actions), naturalistic behaviors in the lab and in the wild, and better quantification of robust and seemingly simple “non-motor” tasks (Figure 3).

Many paradigms in neuroscience can be loosely arranged along three branches as natural (i.e. mouse parenting behavior), simple trial-based tasks (i.e. classical conditioning), and/or complex motor actions like skilled reaching (Figure 3). For example, you can have simple and natural tasks such as licking for water, or complex and natural behaviors such as escaping from a looming stimulus, which would rarely produce repeated trajectories. For simplicity, here we discuss how pose estimation can potentially enhance these studies along those three branches, namely complex movements (Motor control & kinematics), natural behaviors (Natural behaviors & ethologically relevant features), and simple motor-output tasks (Revisiting classic tasks).

Motor control & kinematics.

Often in neuroscience-minded motor control studies, end-effector proxies (such as manipulanda or joysticks) are used to measure the motor behavior of subjects or animals. There are relatively few marker-tracking-based movement neuroscience studies in which many degrees of freedom were measured alongside neural activity, with notable exceptions like (83, 84). The ease with which kinematics of limbs and digits can now be quantified (48, 51, 56, 77) should greatly simplify such studies in the future. We expect many more highly detailed kinematic studies will emerge that utilize DNN-based analyses, especially for freely moving animals, for small and aquatic animals that cannot be marked, and for motor control studies that can leverage large-scale recordings and behavioral monitoring.

Figure. 3. The Behavioral Space in Neuroscience: new applications for deep learning-assisted analysis. This diagram depicts how pose estimation with non-invasive videography can benefit behavioral paradigms that span from “trial-based behaviors” such as classical conditioning, to “complex motor behaviors/tasks” as in 3D reaching assays, to “naturalistic tasks,” often without any trial structure, that are more akin to real-world ’tasks’ that animals undertake. For example, a looming stimulus is ethologically relevant and complex, and pup retrieval can be natural yet repeatable in a laboratory setting. With new tools that allow for fast and accurate analysis of movement, these types of experiments become more feasible (with much less human labor than previously required).

Natural behaviors & ethologically relevant features.

There is a trend in motor neuroscience towards natural behaviors, i.e. less constrained tasks, everyday skills, and even “in the wild” studies (85). For instance, we used DeepLabCut for 3D pose estimation in hunting cheetahs captured via multiple GoPro cameras (Figure 2j; (53)). Another “in the wild” example is given by a recent study by Chambers et al. (86), who revisited the classic question of how people synchronize their walking, but with a modern twist: using videos from YouTube analyzed with OpenPose (19). Consistent with studies performed in the laboratory, they found a tendency for pairs of people to walk either exactly in or exactly out of phase (86).
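For readers wanting to attempt this kind of analysis, walking phase can be extracted from a pose trajectory with a Hilbert transform; a minimal sketch (our illustration, not the method of (86)) comparing two walkers:

```python
import numpy as np
from scipy.signal import hilbert

fps = 30.0
t = np.arange(0, 10, 1 / fps)

# Toy vertical ankle trajectories of two side-by-side walkers,
# oscillating at ~2 Hz with a small phase offset (stand-ins for
# keypoint time series from OpenPose or similar).
walker1 = np.sin(2 * np.pi * 2.0 * t)
walker2 = np.sin(2 * np.pi * 2.0 * t + 0.3)

# Instantaneous phase via the analytic signal (mean-subtract first).
phase1 = np.angle(hilbert(walker1 - walker1.mean()))
phase2 = np.angle(hilbert(walker2 - walker2.mean()))

# Relative phase near 0 -> in phase; near pi -> exactly out of phase.
rel_phase = np.angle(np.exp(1j * (phase1 - phase2)))
print(np.abs(rel_phase).mean())  # ~0.3 for this toy pair
```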

How else can DNNs help? Specialized body parts often play a key role in ethologically relevant behaviors. For instance, ants use their antennae to follow odor trails (87), while moles use their snouts for sampling bilateral nasal cues to localize odorants (88). To accurately measure such behaviors, highly accurate detectors of often tiny, highly dexterous body parts are needed. This is a situation where deep learning algorithms can excel. Pose estimation algorithms can not only be used to detect the complete “pose”; due to their flexibility they are also extremely useful for tracking ethologically relevant body parts in challenging situations. Incidentally, DeepLabCut was created, in part, to accurately track the snouts of mice following odor trails that were printed onto a treadmill (48). There are of course many other specialized body parts that are hard to track, like whiskers, bee stingers, jellyfish tentacles, or octopus arms, and we believe that studying these beautiful systems in more natural and ethologically relevant environments has now become easier.

Revisiting classic tasks.

Measuring behavior is already impacting “classic” decision-making paradigms. For example, by carefully quantifying behavior, several groups have shown broad movement encoding across the brain during decision-making tasks (89, 90). Moreover, large-scale efforts to use these “simple” yet robust trial-based behaviors across labs and brain areas are leveraging deep learning, and comparing its utility to that of classical behavior-monitoring approaches. For example, the IBL has surmised that DeepLabCut could replace traditional methods used for eye, paw, and lick detection (66). We believe that detailed behavioral analysis will impact many paradigms that were historically not considered “motor” studies, as it is now much easier to measure movement.

Remaining challenges in pose estimation.

Advances in deep learning have changed how easily posture can be measured and have impacted many studies. However, pose estimation remains a hard computer vision problem and challenges remain (16, 27, 33, 91–93). In the context of multi-animal/human pose estimation, dealing with highly crowded scenes, in which different individuals cover each other, remains highly challenging (27, 33, 91). In general, dealing with occlusions remains a challenge, and in some experiments occlusions are hard to avoid. Thus, networks that can constrain body part detection based on anatomical relationships can be advantageous, but are computationally more complex and slower (29). As we highlighted in the robustness section, it is hard to train networks to generalize to out-of-domain scenarios (50, 92, 94). Even though very large data sets have been amassed to build robust DNNs (19, 30–32), they can still fail in sufficiently different scenarios (50, 92, 93). Making robust networks will be highly useful for creating shareable behavior- or animal-specific networks that can generalize across laboratories. There will also be much work towards even faster, lighter models.

Outlook & Conclusions

The field of 2D, 3D, and dense pose estimation will continue to evolve, for example with respect to handling occlusions and robustness to out-of-domain data. Perhaps larger and more balanced datasets will be created to better span the behavioral space, more temporal information will be utilized when training networks or analyzing data, and new algorithmic solutions will be found.

Will we always need training data? A hot topic in object recognition is training from very few examples (one-shot or zero-shot learning) (95). Can this be achieved in pose estimation? Perhaps, as new architectures and training regimes come to fruition, this could be possible. Alternatively, specialized networks could now be built that leverage large datasets of specific animals. It is hard to envision a universal “animal pose detector” network (for object recognition this is possible), as animals have highly diverse body plans and experimentalists often have extremely different needs. Currently many individual labs create their own specialized networks, but we plan to create shareable networks for specific animals (much like the specialized networks, e.g. the hand network in OpenPose (96), or the human network in DeepLabCut (29, 48)). For example, many open field experiments could benefit from robust and easy-to-use DNNs for video analysis across similar body points of the mouse. Indeed, efforts are underway to create networks where one can simply analyze data without training, and we hope the community will join these efforts. Nothing improves DNNs more than more training data. These efforts, together with making code open source, will contribute to the reproducibility of science and make these tools broadly accessible.

In summary, we aimed to review the progress in computer vision for human pose estimation, how it influenced animal pose estimation, and how neuroscience laboratories can leverage these tools for better quantification of behavior. Taken together, the tremendous advances in computer vision have provided tools that are practical for use in the laboratory, and they will only get better. They can be as accurate as human labeling (or marker-based tracking), and are fast enough for closed-loop experiments, which is key for understanding the link between neural systems and behavior. We expect that, in light of shared, easy-to-use tools and additional deep learning advances, there will be thrilling and unforeseen advances in real-world neuroscience.

Acknowledgments:

We thank Eldar Insafutdinov, Alex Gomez-Marin, the M. Mathis Lab, and the anonymous reviewers for comments on the manuscript, and Julia Kuhl for illustrations. Funding was provided by the Rowland Institute at Harvard University (M.W.M.) and NIH U19MH114823 (A.M.). The authors declare no conflicts of interest.

References

1. J. Baron Uexküll. Streifzüge durch die Umwelten von Tieren und Menschen: Ein Bilderbuch unsichtbarer Welten. Springer, Berlin, Heidelberg, 1956.
2. Nikolaas Tinbergen. On aims and methods of Ethology. Zeitschrift für Tierpsychologie, 20(4):410–433, 1963.
3. Nikolai A. Bernstein. The co-ordination and regulation of movements, volume 1. Oxford, New York, Pergamon Press, 1967.
4. Nikolaas Tinbergen. Nobel lecture. The Nobel Prize in Physiology or Medicine 1973, 1973.
5. Andreas T Schaefer and Adam Claridge-Chang. The surveillance state of behavioral automation. Current Opinion in Neurobiology, 2012.
6. David J Anderson and Pietro Perona. Toward a science of computational ethology. Neuron, 84(1):18–31, 2014.
7. Anthony I Dell, John A Bender, Kristin Branson, Iain D Couzin, Gonzalo G de Polavieja, Lucas PJJ Noldus, Alfonso Pérez-Escudero, Pietro Perona, Andrew D Straw, Martin Wikelski, et al. Automated image-based tracking and its application in ecology. Trends in Ecology & Evolution, 29(7):417–428, 2014.

8. Roian Egnor and Kristin Branson. Computational analysis of behavior. Annual Review of Neuroscience, 39:217–236, 2016.
9. Valentina Camomilla, Elena Bergamini, Silvia Fantozzi, and Giuseppe Vannozzi. Trends supporting the in-field use of wearable inertial sensors for sport performance evaluation: A systematic review. Sensors, 18(3):873, 2018.
10. Alex Gomez-Marin, Joseph J Paton, Adam R Kampff, Rui M Costa, and Zachary F Mainen. Big behavioral data: psychology, ethology and the foundations of neuroscience. Nature Neuroscience, 17(11):1455, 2014.
11. Ronald Poppe. Vision-based human motion analysis: An overview. Computer Vision and Image Understanding, 108(1):4–18, 2007. doi: 10.1016/j.cviu.2006.10.016. Special Issue on Vision for Human-Computer Interaction.
12. Geert Litjens, Thijs Kooi, Babak Ehteshami Bejnordi, Arnaud Arindra Adiyoso Setio, Francesco Ciompi, Mohsen Ghafoorian, Jeroen AWM van der Laak, Bram van Ginneken, and Clara I Sánchez. A survey on deep learning in medical image analysis. Medical Image Analysis, 42:60–88, 2017.
13. David Hogg. Model-based vision: a program to see a walking person. Image and Vision Computing, 1(1):5–20, 1983. doi: 10.1016/0262-8856(83)90003-3.
14. Christopher Richard Wren, Ali Azarbayejani, Trevor Darrell, and Alex Paul Pentland. Pfinder: Real-time tracking of the human body. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):780–785, 1997.
15. Daniel Cremers, Mikael Rousson, and Rachid Deriche. A review of statistical approaches to level set segmentation: integrating color, texture, motion and shape. International Journal of Computer Vision, 72(2):195–215, 2007.
16. Thomas Serre. Deep learning: the good, the bad, and the ugly. Annual Review of Vision Science, 5:399–426, 2019. doi: 10.1146/annurev-vision-091718-014951.
17. Md Zahangir Alom, Tarek M Taha, Christopher Yakopcic, Stefan Westberg, Paheding Sidike, Mst Shamima Nasrin, Brian C Van Essen, Abdul A S Awwal, and Vijayan K Asari. The history began from AlexNet: a comprehensive survey on deep learning approaches. arXiv preprint arXiv:1803.01164, 2018.

18. Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015. doi: 10.1016/j.neunet.2014.09.003.
19. Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
20. Dushyant Mehta, Helge Rhodin, Dan Casas, Oleksandr Sotnychenko, Weipeng Xu, and Christian Theobalt. Monocular 3d human pose estimation using transfer learning and improved CNN supervision. CoRR, abs/1611.09813, 2016.
21. Rıza Alp Güler, Natalia Neverova, and Iasonas Kokkinos. Densepose: Dense human pose estimation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7297–7306, 2018.
22. Silvia Zuffi, Angjoo Kanazawa, Tanja Berger-Wolf, and Michael Black. Three-D safari: Learning to estimate zebra pose, shape, and texture from images "in the wild". In ICCV. IEEE Computer Society, 08 2019.
23. Alexander Toshev and Christian Szegedy. Deeppose: Human pose estimation via deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1653–1660, 2014.
24. Arjun Jain, Jonathan Tompson, Yann LeCun, and Christoph Bregler. Modeep: A deep learning framework using motion features for human pose estimation. CoRR, abs/1409.7963, 2014.
25. Jonathan J Tompson, Arjun Jain, Yann LeCun, and Christoph Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. In Advances in Neural Information Processing Systems 27, pages 1799–1807. Curran Associates, Inc., 2014.
26. Jonathan Tompson, Ross Goroshin, Arjun Jain, Yann LeCun, and Christoph Bregler. Efficient object localization using convolutional networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
27. Mykhaylo Andriluka, Umar Iqbal, Eldar Insafutdinov, Leonid Pishchulin, Anton Milan, Juergen Gall, and Bernt Schiele. Posetrack: A benchmark for human pose estimation and tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5167–5176, 2018.
28. Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision, pages 483–499. Springer, 2016.
29. Eldar Insafutdinov, Leonid Pishchulin, Bjoern Andres, Mykhaylo Andriluka, and Bernt Schiele. DeeperCut: A deeper, stronger, and faster multi-person pose estimation model. In European Conference on Computer Vision, pages 34–50. Springer, 2016.
30. Eldar Insafutdinov, Mykhaylo Andriluka, Leonid Pishchulin, Siyu Tang, Evgeny Levinkov, Bjoern Andres, and Bernt Schiele. Arttrack: Articulated multi-person tracking in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
31. Sven Kreiss, Lorenzo Bertoni, and Alexandre Alahi. Pifpaf: Composite fields for human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11977–11986, 2019.
32. Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. Deep high-resolution representation learning for human pose estimation. arXiv preprint arXiv:1902.09212, 2019.
33. Q. Dang, J. Yin, B. Wang, and W. Zheng. Deep learning based 2d human pose estimation: A survey. Tsinghua Science and Technology, 24(6):663–676, Dec 2019. doi: 10.26599/TST.2018.9010100.
34. Nikolaos Sarafianos, Bogdan Boteanu, Bogdan Ionescu, and Ioannis A Kakadiaris. 3d human pose estimation: A review of the literature and analysis of covariates. Computer Vision and Image Understanding, 152:1–20, 2016.
35. Julieta Martinez, Rayat Hossain, Javier Romero, and James J Little. A simple yet effective baseline for 3d human pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, pages 2640–2649, 2017.
36. Denis Tomè, Chris Russell, and Lourdes Agapito. Lifting from the deep: Convolutional 3d pose estimation from a single image. CoRR, abs/1701.00295, 2017.
37. Ching-Hang Chen and Deva Ramanan. 3d human pose estimation = 2d pose estimation + matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7035–7043, 2017.
38. Jamie Shotton, Ross B. Girshick, Andrew W. Fitzgibbon, Toby Sharp, Mat Cook, Mark Finocchio, Richard Moore, Pushmeet Kohli, Antonio Criminisi, Alex Kipman, and Andrew Blake. Efficient human pose estimation from single depth images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35:2821–2840, 2012.
39. Š. Obdržálek, G. Kurillo, F. Ofli, R. Bajcsy, E. Seto, H. Jimison, and M. Pavel. Accuracy and robustness of kinect pose estimation in the context of coaching of elderly population. In 2012 Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pages 1188–1193, Aug 2012. doi: 10.1109/EMBC.2012.6346149.

40. Alexander B Wiltschko, Matthew J Johnson, Giuliano Iurilli, Ralph E Peterson, Jesse M Katon, Stan L Pashkovski, Victoria E Abraira, Ryan P Adams, and Sandeep Robert Datta. Mapping sub-second structure in mouse behavior. Neuron, 88(6):1121–1135, 2015.
41. Weizhe Hong, Ann Kennedy, Xavier P Burgos-Artizzu, Moriel Zelikowsky, Santiago G Navonne, Pietro Perona, and David J Anderson. Automated measurement of mouse social behaviors using depth sensing, video tracking, and machine learning. Proceedings of the National Academy of Sciences, 112(38):E5351–E5360, 2015.
42. Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. A. Osman, Dimitrios Tzionas, and Michael J. Black. Expressive body capture: 3d hands, face, and body from a single image. In Proceedings IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
43. Angjoo Kanazawa, Michael J. Black, David W. Jacobs, and Jitendra Malik. End-to-end recovery of human shape and pose. In Computer Vision and Pattern Recognition (CVPR), 2018.
44. Silvia Zuffi, Angjoo Kanazawa, David Jacobs, and Michael Black. 3d menagerie: Modeling the 3d shape and pose of animals. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 11 2016.
45. Benjamin Biggs, Thomas Roddick, Andrew Fitzgibbon, and Roberto Cipolla. Creatures great and SMAL: Recovering the shape and motion of animals from video. In Asian Conference on Computer Vision, pages 3–19. Springer, 2018.
46. Silvia Zuffi, Angjoo Kanazawa, and Michael J. Black. Lions and tigers and bears: Capturing non-rigid, 3d, articulated shape from images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June 2018.
47. Giang Nguyen, Stefan Dlugolinsky, Martin Bobák, Viet Tran, Álvaro López García, Ignacio Heredia, Peter Malík, and Ladislav Hluchý. Machine learning and deep learning frameworks and libraries for large-scale data mining: a survey. Artificial Intelligence Review, 52(1):77–124, Jun 2019. doi: 10.1007/s10462-018-09679-z.
48. Alexander Mathis, Pranav Mamidanna, Kevin M Cury, Taiga Abe, Venkatesh N Murthy, Mackenzie Weygandt Mathis, and Matthias Bethge. DeepLabCut: markerless pose estimation of user-defined body parts with deep learning. Nature Neuroscience, 21:1281–1289, 2018.
49. Simon Kornblith, Jonathon Shlens, and Quoc V Le. Do better imagenet models transfer better? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2661–2671, 2019.
50. Alexander Mathis, Mert Yüksekgönül, Byron Rogers, Matthias Bethge, and Mackenzie W Mathis. Pretraining boosts out-of-domain robustness for pose estimation. arXiv preprint arXiv:1909.11229, 2019.
51. Ahmet Arac, Pingping Zhao, Bruce H Dobkin, S Thomas Carmichael, and Peyman Golshani. Deepbehavior: A deep learning toolbox for automated analysis of animal and human behavior imaging data. Frontiers in Systems Neuroscience, 13:20, 2019.
52. Alexander Mathis and Richard A. Warren. On the inference speed and video-compression robustness of deeplabcut. bioRxiv, 2018. doi: 10.1101/457242.
53. Tanmay Nath, Alexander Mathis, An Chi Chen, Amir Patel, Matthias Bethge, and Mackenzie W Mathis. Using DeepLabCut for 3d markerless pose estimation across species and behaviors. Nature Protocols, 14:2152–2176, 2019.
54. Oral Kaplan, Goshiro Yamamoto, Takafumi Taketomi, Alexander Plopski, and Hirokazu Kato. Video-based visualization of knee movement in cycling for quantitative and qualitative monitoring. In 2019 12th Asia Pacific Workshop on Mixed and Augmented Reality (APMAR), pages 1–5. IEEE, 2019.
55. Talmo D Pereira, Diego E Aldarondo, Lindsay Willmore, Mikhail Kislin, Samuel S-H Wang, Mala Murthy, and Joshua W Shaevitz. Fast animal pose estimation using deep neural networks. Nature Methods, 16(1):117, 2019.
56. Alexandra Bova, Krista Kernodle, Kaitlyn Mulligan, and Daniel Leventhal. Automated rat single-pellet reaching with 3-dimensional reconstruction of paw and digit trajectories. Journal of Visualized Experiments: JoVE, 2019.
57. Balaji Sriram, Lillian Li, Alberto Cruz-Martín, and Anirvan Ghosh. A sparse probabilistic code underlies the limits of behavioral discrimination. Cerebral Cortex, 2019.
58. Dalton D. Moore, Jeff D. Walker, Jason N. MacLean, and Nicholas G. Hatsopoulos. Comparing pose estimation performance using deeplabcut and marker-based tracking. Society for Neuroscience, 2019.
59. Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12):2481–2495, 2017.
60. Jacob M Graving, Daniel Chae, Hemal Naik, Liang Li, Benjamin Koger, Blair R Costelloe, and Iain D Couzin. DeepPoseKit, a software toolkit for fast and robust animal pose estimation using deep learning. eLife, 8:e47994, Oct 2019. doi: 10.7554/eLife.47994.
61. Gordon J. Berman, Daniel M. Choi, William Bialek, and Joshua W. Shaevitz. Mapping the stereotyped behaviour of freely moving fruit flies. Journal of The Royal Society Interface, 11(99), 2014. doi: 10.1098/rsif.2014.0672.
62. Brandon Forys, Dongsheng Xiao, Pankaj Gupta, Jamie D Boyd, and Timothy H Murphy. Real-time markerless video tracking of body parts in mice using deep neural networks. bioRxiv, page 482349, 2018.
63. Vilim Štih, Luigi Petrucco, Andreas M Kist, and Ruben Portugues. Stytra: An open-source, integrated system for stimulation, tracking and closed-loop behavioral experiments. PLoS Computational Biology, 15(4):e1006699, 2019.
64. Thomas Wiegand, Gary J Sullivan, Gisle Bjontegaard, and Ajay Luthra. Overview of the H.264/AVC video coding standard. IEEE Transactions on Circuits and Systems for Video Technology, 13(7):560–576, 2003.
65. Connor Shorten and Taghi M. Khoshgoftaar. A survey on image data augmentation for deep learning. Journal of Big Data, 6(1):60, Jul 2019. doi: 10.1186/s40537-019-0197-0.
66. Guido Meijer, Michael Schartner, Gaëlle Chapuis, and Karel Svoboda. Video hardware and software for the international brain laboratory. International Brain Laboratory, 2019.
67. Pierre Karashchuk. lambdaloop/anipose: v0.5.0, August 2019.
68. Semih Günel, Helge Rhodin, Daniel Morales, João H Campagnolo, Pavan Ramdya, and Pascal Fua. DeepFly3D, a deep learning-based approach for 3d limb and appendage tracking in tethered, adult Drosophila. eLife, 8:e48571, Oct 2019. doi: 10.7554/eLife.48571.
69. Yilun Zhang and Hyun Soo Park. Multiview supervision by registration. arXiv preprint arXiv:1811.11251v2, 2019.
70. Vivek Hari Sridhar, Dominique G. Roche, and Simon Gingins. Tracktor: Image-based automated tracking of animal movement and behaviour. Methods in Ecology and Evolution, 10(6):815–820, 2019. doi: 10.1111/2041-210X.13166.
71. Francisco Romero-Ferrero, Mattia G Bergomi, Robert C Hinz, Francisco JH Heras, and Gonzalo G de Polavieja. idtracker.ai: tracking all individuals in small or large collectives of unmarked animals. Nature Methods, 16(2):179, 2019.

72. Zheheng Jiang, Zhihua Liu, Long Chen, Lei Tong, Xian-grong Zhang, Xiangyuan Lan, Danny Crookes, Ming-HsuanYang, and Huiyu Zhou. Detection and tracking of mul-tiple mice using part proposal networks. arXiv preprintarXiv:1906.02831, 2019.

73. Neil J Cronin, Timo Rantalainen, Juha P Ahtiainen, EsaHynynen, and Ben Waller. Markerless 2d kinematic analysisof underwater running: A deep learning approach. Journalof biomechanics, 87:75–82, 2019.

74. Wujie Zhang and Michael M Yartsev. Correlated neural ac-tivity across the brains of socially interacting bats. Cell, 178,2019.

75. Scott F Owen, Max H Liu, and Anatol C Kreitzer. Thermalconstraints on in vivo optogenetic manipulations. Natureneuroscience, 22, 2019.

76. Salil S Bidaye, Meghan Laturney, Amy K Chang, YuejiangLiu, Till Bockemuhl, Ansgar Buschges, and Kristin Scott.Two brain pathways initiate distinct forward walking pro-grams in drosophila. bioRxiv, page 798439, 2019.

77. Anthony W Azevedo, Pralaksha Gurung, Lalanti Venkata-subramanian, Richard Mann, and John C Tuthill. A sizeprinciple for leg motor control in drosophila. bioRxiv, page730218, 2019.

78. Constantine Nicholas Tzouanas, Soonyoung Kim, Kr-ishna N Badhiwala, Benjamin W Avants, and Jacob Robin-son. Thermal stimulation temperature is encoded as a firingrate in a hydra nerve ring. bioRxiv, page 787648, 2019.

79. Benjamin De Bari, James A Dixon, Bruce A Kay, and DilipKondepudi. Oscillatory dynamics of an electrically drivendissipative structure. PloS one, 14(5):e0217305, 2019.

80. Regina Zaghi-Lara, Miguel Ángel Gea, Jordi Camí, Luis MMartínez, and Alex Gomez-Marin. Playing magic tricks todeep neural networks untangles human deception. arXivpreprint arXiv:1908.07446, 2019.

81. Glenna Clifton, David Holway, and Nick Gravish. Roughsubstrates constrain walking speed in ants through modula-tion of stride frequency and not stride length. bioRxiv, page731380, 2019.

82. Yuanlei Yue, Pan Xu, Zhichao Liu, Zekai Chen, Juntao Su,Ai Chen, Ryan T. Ash, Emre Barut, Rahul Simha, SteliosSmirnakis, Chen Zeng, and Hui Lu. Mecp2 deletion im-paired layer 2/3-dominant dynamic reorganization of corti-cal circuit during motor skill learning. bioRxiv, 2019. doi:10.1101/786822.

83. Carlos E Vargas-Irwin, Gregory Shakhnarovich, PaymanYadollahpour, John MK Mislow, Michael J Black, and John PDonoghue. Decoding complete reach and grasp actionsfrom local primary motor cortex populations. Journal of neu-roscience, 30(29):9659–9669, 2010.

84. Stefan Schaffelhofer, Andres Agudelo-Toro, and HansjörgScherberger. Decoding a wide range of hand configurationsfrom macaque motor, premotor, and parietal cortices. Jour-nal of Neuroscience, 35(3):1068–1081, 2015.

85. Alexander Mathis, Andrea R Pack, Rodrigo S Maeda, andSamuel David McDougle. Highlights from the 29th annualmeeting of the society for the neural control of movement.Journal of Neurophysiology, 122:1777–1783, 2019. doi: 10.1152/jn.00484.2019.

86. Claire Chambers, Gaiqing Kong, Kunlin Wei, and Konrad Kording. Pose estimates from online videos show that side-by-side walkers synchronize movement under naturalistic conditions. PLoS ONE, 14(6):e0217861, 2019.

87. Ryan W Draft, Matthew R McGill, Vikrant Kapoor, and Venkatesh N Murthy. Carpenter ants use diverse antennae sampling strategies to track odor trails. Journal of Experimental Biology, 221(22):jeb185124, 2018.

88. Kenneth C Catania. Stereo and serial sniffing guide navigation to an odour source in a mammal. Nature Communications, 4:1441, 2013.

89. Carsen Stringer, Marius Pachitariu, Nicholas Steinmetz, Charu Bai Reddy, Matteo Carandini, and Kenneth D. Harris. Spontaneous behaviors drive multidimensional, brainwide activity. Science, 364(6437), 2019. ISSN 0036-8075. doi: 10.1126/science.aav7893.

90. Simon Musall, Matthew T. Kaufman, Ashley L. Juavinett, Steven Gluf, and Anne K. Churchland. Single-trial neural dynamics are dominated by richly varied movements. Nature Neuroscience, 22, 2019. doi: 10.1038/s41593-019-0502-4.

91. Thomas Golda, Tobias Kalb, Arne Schumann, and Jürgen Beyerer. Human pose estimation for real-world crowded scenarios. arXiv preprint arXiv:1907.06922, 2019.

92. Helge Rhodin, Jörg Spörri, Isinsu Katircioglu, Victor Constantin, Frédéric Meyer, Erich Müller, Mathieu Salzmann, and Pascal Fua. Learning monocular 3d human pose estimation from multi-view images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8437–8446, 2018.

93. Nidhi Seethapathi, Shaofei Wang, Rachit Saluja, Gunnar Blohm, and Konrad P Kording. Movement science needs different pose tracking algorithms. arXiv preprint arXiv:1907.10226, 2019.

94. Claudio Michaelis, Benjamin Mitzkus, Robert Geirhos, Evgenia Rusak, Oliver Bringmann, Alexander S Ecker, Matthias Bethge, and Wieland Brendel. Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484, 2019.

95. Yongqin Xian, Christoph H Lampert, Bernt Schiele, and Zeynep Akata. Zero-shot learning - a comprehensive evaluation of the good, the bad and the ugly. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(9):2251–2265, Sep. 2019. ISSN 0162-8828. doi: 10.1109/TPAMI.2018.2857768.

96. Tomas Simon, Hanbyul Joo, Iain Matthews, and Yaser Sheikh. Hand keypoint detection in single images using multiview bootstrapping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.

Highlighted References:

[1*] A survey on deep learning in medical image analysis (12) A comprehensive review of the deep learning algorithms used in medical image analysis as of 2017, as well as a discussion of the most successful approaches; together with [2**], it is a fantastic introduction for newcomers to the field.

[2**] Deep learning: The good, the bad, and the ugly (16) An excellent review of deep learning progress, including a detailed description of recent successes as well as the limitations of computer vision algorithms.

[3**] Realtime multi-person 2d pose estimation using part affinity fields (19) OpenPose was the first real-time multi-person system to jointly detect human body parts by using part affinity fields, an elegant way to link body part proposals across individuals. The toolbox is well maintained and now boasts body, hand, facial, and foot keypoints (135 keypoints in total) as well as 3D support.
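
To make the part-affinity-field idea concrete, here is a minimal NumPy sketch of the association step: two candidate keypoints are scored by integrating the predicted 2D vector field along the line segment between them, so that well-aligned limb hypotheses score highly. The function name, sampling scheme, and array layout are our illustrative simplifications, not OpenPose's actual API.

```python
import numpy as np

def paf_score(paf, p1, p2, n_samples=10):
    """Score a candidate limb connecting keypoints p1 -> p2.

    paf: (H, W, 2) array, the predicted vector field for this limb type.
    p1, p2: (x, y) keypoint candidates in image coordinates (assumed in bounds).
    """
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    v = p2 - p1
    length = np.linalg.norm(v)
    if length < 1e-8:
        return 0.0
    v_unit = v / length
    # Sample the field at evenly spaced points along the segment and
    # measure how well it aligns with the limb direction.
    scores = []
    for t in np.linspace(0.0, 1.0, n_samples):
        x, y = (p1 + t * v).astype(int)
        scores.append(paf[y, x].dot(v_unit))
    return float(np.mean(scores))
```

In OpenPose these scores then feed a greedy bipartite matching that links part proposals into per-person skeletons.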

[4*] DensePose: Dense human pose estimation in the wild (21) Using a large dataset of annotated humans (50K), the authors build dense correspondences between RGB images and the human body. They apply this to humans "in the wild" and build tools for efficiently dealing with occlusions. It is highly accurate and runs at up to 25 frames per second.

[5**] Three-D Safari: Learning to Estimate Zebra Pose, Shape, and Texture from Images "In the Wild" (22) Zuffi et al. push dense pose estimation further with a new SMALST model that captures a zebra's pose, soft-tissue shape, and even texture "in the wild." This is a difficult challenge, as zebras evolved to blend into their savanna background. The paper makes significant improvements in accuracy and realism, and builds on a line of elegant work from these authors (44, 46).

[6**] DeeperCut: A deeper, stronger, and faster multi-person pose estimation model (29) DeeperCut is a highly accurate algorithm for multi-human pose estimation thanks to improved deep-learning-based body part detectors and image-conditioned pairwise terms that predict the location of body parts based on the locations of other body parts. These terms are then used to assemble accurate poses of individuals via graph partitioning. In ArtTrack (30) the work was extended to fast multi-human pose estimation in videos.
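
To illustrate how the unary (detector) and pairwise terms interact, the toy routine below groups body part detections into individuals. One caveat: DeeperCut solves an integer linear program over the full detection graph; the greedy pass and the placeholder `pairwise` function here are simplifications for exposition only.

```python
import numpy as np

def group_detections(detections, pairwise, threshold=0.5):
    """detections: list of dicts {'part': int, 'xy': (x, y), 'score': float}.
    pairwise(a, b): image-conditioned plausibility that a and b belong to
    the same individual (a stand-in for DeeperCut's learned pairwise terms).
    Returns clusters of detections, one per putative individual.
    """
    clusters = []
    # Visit the strongest detections first.
    for det in sorted(detections, key=lambda d: -d['score']):
        best, best_score = None, threshold
        for cluster in clusters:
            # Each cluster may hold at most one detection per body part.
            if any(c['part'] == det['part'] for c in cluster):
                continue
            s = np.mean([pairwise(det, c) for c in cluster])
            if s > best_score:
                best, best_score = cluster, s
        if best is None:
            clusters.append([det])   # start a new individual
        else:
            best.append(det)
    return clusters
```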

[7*] Recovering the shape and motion of animals from video (45) The authors combine multiple methods to efficiently fit 3D shapes to multiple quadrupeds, from camels to bears. They also provide a novel dataset of joint annotations and silhouette segmentations for eleven animal videos.

[8**] DeepLabCut: markerless pose estimation of user-defined body parts with deep learning (48) DeepLabCut was the first deep learning toolbox for animal pose estimation. The key advance was to benchmark a subset of the feature detectors in DeeperCut (29). The paper showed nearly human-level performance with only 50-200 labeled images, benchmarking flies moving in a 3D chamber, hand articulation and open-field behavior in mice, and it provided open-source tools for creating new datasets, data loaders for training deep neural networks, and post-hoc analysis. Subsequent work has improved accuracy and speed, and introduced more network variants into the Python package (50, 52, 53).

[9*] DeepBehavior: A deep learning toolbox for automated analysis of animal and human behavior imaging data (51) Arac et al. use three different DNN packages (OpenPose (19), YOLO, and Tensorbox) for 3D analysis of pellet reaching, the three-chamber test, and social behavior in mice, as well as 3D human kinematics for clinical motor function assessment with OpenPose (19). They also provide MATLAB scripts for additional analysis after pose estimation.

[10*] Using DeepLabCut for 3D markerless pose estimation across species and behaviors (53) A Nature Protocols user guide to DeepLabCut 2.0, with 3D pose estimation of hunting cheetahs and improved network performance. The toolbox is provided as a Python package with graphical user interfaces for labeling and active-learning-based network refinement, together with Jupyter Notebooks that can be run on cloud resources such as Google Colaboratory (for free).
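
For orientation, the protocol's typical workflow reduces to a handful of Python calls, sketched below with placeholder project names and video paths (consult the protocol and the package documentation for the many available options):

```python
import deeplabcut

# Create a project from one or more videos (names and paths are placeholders).
config = deeplabcut.create_new_project('reaching', 'experimenter1',
                                       ['/data/session1.avi'])
deeplabcut.extract_frames(config)             # select frames to label
deeplabcut.label_frames(config)               # opens the labeling GUI
deeplabcut.create_training_dataset(config)
deeplabcut.train_network(config)
deeplabcut.evaluate_network(config)
deeplabcut.analyze_videos(config, ['/data/session1.avi'])        # pose estimates
deeplabcut.create_labeled_video(config, ['/data/session1.avi'])  # visual check
```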

[11*] Fast animal pose estimation using deep neural networks (55) LEAP (LEAP Estimates Animal Pose) is a DNN method for predicting the positions of animal body parts. The framework includes a graphical interface for labeling body parts and training the network. Training and inference are fast due to the lightweight architecture. The authors also analyzed insect gait using unsupervised behavioral methods (61), applied directly to the estimated postures rather than to compressed image features.
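
To convey why the lightweight architecture is fast, here is a schematic PyTorch stand-in (LEAP itself is built on Keras/TensorFlow, and the layer sizes below are arbitrary illustrative choices): a small fully-convolutional network maps a frame to one confidence map per body part, and each part's coordinate is read out as the argmax of its map.

```python
import torch
import torch.nn as nn

class TinyPoseNet(nn.Module):
    """Toy confidence-map network in the spirit of LEAP (not its actual model)."""
    def __init__(self, n_parts: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                       # downsample
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2),           # back to input resolution
            nn.Conv2d(64, n_parts, 3, padding=1),  # one confidence map per part
        )

    def forward(self, x):       # x: (batch, 1, H, W) grayscale frames
        return self.net(x)      # (batch, n_parts, H, W) confidence maps

def peak_coordinates(maps):
    """Read out (x, y) per body part as each confidence map's argmax."""
    b, p, h, w = maps.shape
    flat = maps.reshape(b, p, -1).argmax(dim=-1)
    return torch.stack((flat % w, flat // w), dim=-1)  # (batch, parts, 2)
```

Training such a network typically regresses the maps toward Gaussians centered on the labeled keypoints (e.g., with a mean-squared-error loss); the shallow convolutional stack is what keeps both training and inference fast.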
