Machine Learning Problems in Species Occupancy Modeling
Rebecca Hutchinson
March 25, 2010
Post on 20-Dec-2015
Multiple Visits
• Visit each site more than once, recording detection histories Yit
• Population closure assumption: the species' occupancy status does not change over the course of the visits to a site.
• E.g.
Site  Visit 1  Visit 2  Visit 3
1     1        0        0
2     0        0        0
3     1        1        1
4     0        1        1
5     0        0        0
6     0        1        0
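A minimal sketch of how the example detection histories might be held in code, assuming plain Python lists. The function names are illustrative and not part of the talk. It computes the naive occupancy estimate (fraction of sites with at least one detection) and lists the all-zero sites, which are the ones where the species was either absent or present but missed.

```python
# Detection histories from the example table: one row per site, one entry per visit.
histories = [
    [1, 0, 0],  # site 1
    [0, 0, 0],  # site 2
    [1, 1, 1],  # site 3
    [0, 1, 1],  # site 4
    [0, 0, 0],  # site 5
    [0, 1, 0],  # site 6
]


def naive_occupancy(histories):
    """Fraction of sites with at least one detection (ignores imperfect detection)."""
    detected = [1 if any(h) else 0 for h in histories]
    return sum(detected) / len(histories)


def all_zero_sites(histories):
    """1-based site numbers whose detection history contains no detections."""
    return [i + 1 for i, h in enumerate(histories) if not any(h)]


print(naive_occupancy(histories))  # 4 of the 6 sites have a detection
print(all_zero_sites(histories))   # sites 2 and 5
```

The naive estimate is a lower bound on occupancy under the no-misidentification assumption, because an all-zero history does not prove absence.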
Assumptions
• Species is never misidentified.
• Occupancy status is constant across visits.
• Visits are separated enough to be conditionally independent, given the occupancy status.
• Sites are independent.
[Figure: plate diagram of the occupancy model. Nodes: Xi, oi, Zi, Yit, dit, Wit. Plates: sites i=1,…,M and visits t=1,…,T.]
Key:
• square = discrete
• circle = continuous
• unshaded = latent
• grey = observed
• pink = parameter
• blue = deterministic function of inputs
• dashed = repeated section
Variables:
• Xi = occupancy covariates at site i
• oi = probability of occupancy at site i
• Zi = true, unobserved occupancy status of site i
• [symbol lost] = parameters of occupancy model
• Wit = detection covariates at site i, visit t
• dit = probability of detection at site i, visit t
• Yit = observed presence/absence at site i, visit t
• [symbol lost] = parameters of detection model
Some details
• Conditional distributions:
• Conditional log-likelihood
• Expected joint log-likelihood
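The equations for these three items did not survive in the transcript. As a hedged reference, the block below gives the standard single-season occupancy-detection form that matches the variable definitions in this deck. The parameter symbols \alpha and \beta, the linear predictors, and the name \hat{z}_i are assumptions introduced here, and "conditional log-likelihood" is read as the likelihood with Z_i summed out.

```latex
% Assumed links: logistic functions of linear predictors (\alpha, \beta are assumed symbols)
o_i = \operatorname{logistic}(\alpha^\top X_i), \qquad
d_{it} = \operatorname{logistic}(\beta^\top W_{it})

% Conditional distributions
Z_i \mid X_i \sim \mathrm{Bernoulli}(o_i), \qquad
Y_{it} \mid Z_i, W_{it} \sim \mathrm{Bernoulli}(Z_i \, d_{it})

% Log-likelihood with the latent Z_i summed out
\ell(\alpha, \beta) = \sum_{i=1}^{M} \log \Big[
  o_i \prod_{t=1}^{T} d_{it}^{Y_{it}} (1 - d_{it})^{1 - Y_{it}}
  + (1 - o_i) \, \mathbb{1}\Big( \sum_{t=1}^{T} Y_{it} = 0 \Big) \Big]

% Expected joint log-likelihood (EM), with \hat{z}_i = E[Z_i \mid Y_{i1}, \dots, Y_{iT}]
Q(\alpha, \beta) = \sum_{i=1}^{M} \Big[
  \hat{z}_i \log o_i + (1 - \hat{z}_i) \log (1 - o_i)
  + \hat{z}_i \sum_{t=1}^{T} \big( Y_{it} \log d_{it} + (1 - Y_{it}) \log (1 - d_{it}) \big) \Big]
```

In Q, the Z_i = 0 branch contributes log 1 = 0 for all-zero histories, and sites with at least one detection have \hat{z}_i = 1, so no extra term appears.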
Typical Usage
• Fit a small number of models with differing (small) sets of covariates, using the conditional log-likelihood objective
– E.g. model 1 vs. model 2 where
• o1 ~ rainfall + elevation, d1 ~ weather + time-of-day
• o2 ~ rainfall + temperature, d2 ~ underbrush-density
• Evaluate models with AIC
• Books on this approach: MacKenzie et al. 2006, Royle et al. 2007.
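A minimal sketch of the AIC comparison step, assuming each candidate model has already been fit by maximum likelihood. The log-likelihood values are placeholders for illustration only, not results from the talk. The parameter counts assume one intercept plus one coefficient per covariate in each sub-model.

```python
def aic(log_likelihood, num_params):
    """Akaike information criterion: 2k - 2 log L (lower is better)."""
    return 2 * num_params - 2 * log_likelihood


# Hypothetical fitted models. "loglik" values are made-up placeholders.
# k counts intercepts plus covariate coefficients for occupancy and detection.
candidates = {
    "model 1": {"loglik": -120.4, "k": 6},  # o: 1 + 2 covariates, d: 1 + 2 covariates
    "model 2": {"loglik": -118.9, "k": 5},  # o: 1 + 2 covariates, d: 1 + 1 covariate
}

scores = {name: aic(m["loglik"], m["k"]) for name, m in candidates.items()}
best = min(scores, key=scores.get)

for name, score in scores.items():
    print(name, round(score, 1))
print("preferred:", best)
```

With these placeholder numbers the second model is preferred, since it has both a higher log-likelihood and fewer parameters.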
Outline
• Citizen Science: 2 motivating datasets
• Problem 1: Integrating more flexible models for occupancy and detection
– Regularization
– Boosted regression trees
– (Joint work with Tom Dietterich)
• Problem 2: Alternative detection models
– Experts vs. novices
– Relaxing assumptions
– (Joint work with Weng-Keen Wong and Jun Yu)
Cornell Lab of Ornithology
Mission: To interpret and conserve the earth’s biological diversity through research, education, and citizen science focused on birds.
Birds in Forested Landscapes (BFL)
• Goals:
– Determine habitat/landscape requirements of forest-dwelling birds (especially thrushes)
– Translate results into management recommendations for conservation
– Develop a network of experienced citizen scientists
• BFL is a continent-wide project that has engaged over 1,000 volunteers who surveyed over 3,000 study sites.
• Have data from 1997-2006
• Participants follow a rigorously tested protocol that includes:
– selecting suitable study sites,
– visiting these sites at least twice during the breeding season, and
– measuring a variety of habitat variables.
• http://www.birds.cornell.edu/bfl/
BFL data
• Select forest patches, then survey points, and one or more species of interest.
• Visit 1: earliest date when all your study species have arrived
– Want beginning of breeding period, but no birds still migrating.
• Visit 2: 2-4 weeks later
– Breeding should be underway, different evidence available.
• Record presence/absence of 22 possible breeding behaviors observed in each period on each visit.
• Record presence/absence of competitors/predators on each visit.
• Record environmental variables at large, medium, and small scales.
• Observers work in teams of 1-4 people.
BFL data: visit protocol example
• Observation Period (mandatory 10 minutes)
– Look and listen for predators, cowbirds, and study species
• Playback Period (mandatory 5 minutes per species)
– Species 1: play songs, calls, or drums for 1 minute
– Species 1: watch/listen for 1 minute
– Species 1: repeat songs, calls, or drums for 1 minute
– Species 1: watch/listen for 2 minutes
– Species 2: play songs, calls, or drums for 1 minute
– Species 2: watch/listen for 1 minute
– Species 2: repeat songs, calls, or drums for 1 minute
– Species 2: watch/listen for 2 minutes
• Behavior Watch Period (mandatory 10 minutes)
– Play eastern or western mobbing calls for 5 minutes while looking and listening for study species
– Watch/listen for 5 minutes
BFL data: habitat characteristics
• Survey point (where observer stands)
– Latitude/longitude
– Elevation
– Distance to nearest edge, road, water, occupied building
• Study site (radius=150m)
– Hydrology during breeding season
– Forest cover type
– Slope
– Land use
– Land ownership
– Canopy characteristics
– Low vegetation characteristics
• Landscape level (2500 acres)
– Patch edge (what habitats are adjacent)
– Forest patch size
– Percentage of forest
– Linear distance of edge
– Distance to nearest 100 & 500 acre patches (if patch is less than 1000 acres)
Increasing model flexibility
• Why?
– Many possible habitat variables
• interactions?
– Exploratory modeling with many covariates rather than hypothesis testing with few
• 2 ideas:
– Regularization
– Boosted regression trees
How to regularize these models?
• One possible penalty:
• How should the two components be weighted?
– tug-of-war between occupancy and detection to explain the all-zero detection histories
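The penalty shown on the slide is not recoverable from the transcript. As one hedged illustration of the weighting question, the sketch below subtracts an L2 penalty with separate weights for the occupancy and detection coefficients from the log-likelihood. The names `alpha`, `beta`, `lam_o`, and `lam_d`, the linear predictors, and the choice of an L2 penalty are all assumptions made here.

```python
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(w, x):
    return sum(a * b for a, b in zip(w, x))


def site_log_lik(y, o, d):
    """Log-likelihood of one detection history y, given occupancy
    probability o and per-visit detection probabilities d."""
    p_hist = 1.0
    for y_t, d_t in zip(y, d):
        p_hist *= d_t if y_t == 1 else (1.0 - d_t)
    lik = o * p_hist
    if not any(y):
        # An all-zero history can also come from an unoccupied site.
        lik += 1.0 - o
    return math.log(lik)


def penalized_objective(alpha, beta, X, W, Y, lam_o, lam_d):
    """Log-likelihood minus separately weighted L2 penalties (assumed form).
    X[i] is the occupancy covariate vector for site i; W[i][t] is the
    detection covariate vector for site i, visit t; Y[i] is the history."""
    total = 0.0
    for x_i, w_i, y_i in zip(X, W, Y):
        o_i = logistic(dot(alpha, x_i))
        d_i = [logistic(dot(beta, w_it)) for w_it in w_i]
        total += site_log_lik(y_i, o_i, d_i)
    penalty = lam_o * sum(a * a for a in alpha) + lam_d * sum(b * b for b in beta)
    return total - penalty
```

Raising `lam_d` relative to `lam_o` shrinks the detection model harder, which pushes the occupancy model to account for more of the all-zero histories, and vice versa. For simplicity this sketch penalizes every coefficient, including any intercept.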
Preliminary synthetic data results
• 8 covariates for each model, half of which truly had non-zero coefficients
• Choice of objective function seems more important than regularization parameters
Posterior Regularization
• [Ganchev, Gillenwater, Graca, and Taskar, 2009]
• Regularization constraints on posterior expectations instead of parameters, for example:
– Expected occupancy is less than 60%
– Of the all-zero detection histories, only half can be ‘explained away’ by the detection model
Boosted Regression Trees
• Popular in species distribution modeling
– [Elith et al 2006]
• Functional gradient ascent [Friedman 2001]
– regression trees predict F(X) and G(W)
– F and G are fed through logistic() to get o and d
• Current challenge: tuning
– learning rate (shrinkage)
– number of trees to grow at each stage
– depth of trees
– number of stages
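A hedged sketch of the quantity a functional gradient ascent stage would fit trees to: the derivative of each site's log-likelihood with respect to F at that site and G at each visit. It assumes the standard occupancy likelihood with o = logistic(F) and d = logistic(G). The function names are illustrative, and the tree-fitting and shrinkage steps are left out.

```python
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def site_log_lik(F_i, G_i, y_i):
    """Log-likelihood of history y_i given F_i and per-visit scores G_i."""
    o = logistic(F_i)
    p = 1.0
    for g, y in zip(G_i, y_i):
        d = logistic(g)
        p *= d if y == 1 else (1.0 - d)
    lik = o * p + ((1.0 - o) if not any(y_i) else 0.0)
    return math.log(lik)


def functional_gradients(F_i, G_i, y_i):
    """Derivatives of the site log-likelihood w.r.t. F_i and each G_it.
    These per-example values would be the regression targets for the next trees."""
    o = logistic(F_i)
    d = [logistic(g) for g in G_i]
    p = 1.0
    for d_t, y in zip(d, y_i):
        p *= d_t if y == 1 else (1.0 - d_t)
    all_zero = 0.0 if any(y_i) else 1.0
    lik = o * p + (1.0 - o) * all_zero
    grad_F = (p - all_zero) * o * (1.0 - o) / lik
    grad_G = [o * p * (y - d_t) / lik for d_t, y in zip(d, y_i)]
    return grad_F, grad_G
```

For a site with at least one detection these reduce to the familiar logistic-regression residuals, 1 - o for F and y - d for G. For all-zero sites both gradients are scaled by the posterior weight on occupancy, which is where the two models interact.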
eBird—Current Stats (2009)
• >1,500,000 checklists submitted
• 2,945 species reported
• ~70,000 users
• 21 million observations reported
• ~540,000 site visitors
• 173 countries/territories
Northern Cardinal Distribution (Frequency of Detection)
• Gray – not reported
• Tan – insufficient data
• White – not covered
Extensions needed for eBird?
• Alternative detection model
– add a node for expertise of observer
• Relax the assumption of no-misidentifications
– Y|Z=1 ~ Bernoulli(d)
– Y|Z=0 ~ Bernoulli(h) (instead of 0)
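A minimal sketch of how the relaxed assumption changes the site likelihood, assuming a single false-positive probability `h` shared by all visits (the slide gives only Bernoulli(h), so the scalar form is an assumption, as are the function names). Setting `h = 0` gives back the original model, in which any detection implies the site is occupied.

```python
def branch_probs(y, d, h):
    """Return (P(y | Z=1), P(y | Z=0)) for one detection history y.
    d: per-visit detection probabilities; h: false-positive probability."""
    p_occupied, p_empty = 1.0, 1.0
    for y_t, d_t in zip(y, d):
        p_occupied *= d_t if y_t == 1 else (1.0 - d_t)
        p_empty *= h if y_t == 1 else (1.0 - h)
    return p_occupied, p_empty


def site_likelihood(y, o, d, h):
    """Marginal probability of history y with occupancy probability o."""
    p1, p0 = branch_probs(y, d, h)
    return o * p1 + (1.0 - o) * p0


def posterior_occupied(y, o, d, h):
    """P(Z=1 | y). Assumes the history has non-zero probability under the model."""
    p1, p0 = branch_probs(y, d, h)
    return o * p1 / (o * p1 + (1.0 - o) * p0)


y = [0, 1, 0]
d = [0.5, 0.5, 0.5]
print(posterior_occupied(y, 0.5, d, 0.0))  # no false positives: a detection settles it
print(posterior_occupied(y, 0.5, d, 0.1))  # with false positives: no longer certain
```

With false positives allowed, a single detection among several non-detections no longer forces the posterior occupancy to 1.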
Preliminary results: Synthetic data
Area under ROC curve. Synthetic data generated from EOM with different levels of false positives (FP).
        0% FP                         5% FP                         10% FP
        LRM       EOM       EONM      LRM       EOM       EONM      LRM       EOM       EONM
Run 1   0.930384  0.971115  0.977238  0.910816  0.945408  0.96979   0.885863  0.88316   0.935426
Run 2   0.903887  0.980729  0.975234  0.899543  0.940208  0.97605   0.892114  0.828198  0.921102
Run 3   0.94002   0.974756  0.958459  0.886119  0.937266  0.957319  0.890704  0.842421  0.947113
Run 4   0.932801  0.963297  0.958912  0.90578   0.908399  0.953098  0.880455  0.904996  0.939507
Run 5   0.887101  0.96577   0.950276  0.897433  0.92701   0.965392  0.899887  0.918916  0.96166
Run 6   0.895025  0.971983  0.969289  0.855337  0.850694  0.941323  0.883817  0.937461  0.956812
Run 7   0.880148  0.959791  0.965582  0.874276  0.93516   0.91473   0.888834  0.898839  0.91642
Run 8   0.895394  0.955807  0.97103   0.901834  0.94291   0.961646  0.930379  0.924846  0.967828
Run 9   0.922637  0.984437  0.983202  0.896268  0.924409  0.941531  0.917951  0.917625  0.946334
Run 10  0.901313  0.967825  0.971087  0.87269   0.865213  0.945943  0.85779   0.862085  0.900124
Mean    0.908871  0.969551  0.968031  0.890009  0.917668  0.952682  0.892779  0.891855  0.939232
Slide courtesy of Jun Yu
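For readers unfamiliar with the metric in the table, here is a minimal sketch of area under the ROC curve, computed as the probability that a randomly chosen positive example is scored above a randomly chosen negative one (ties count one half). The function name and the toy labels and scores are illustrative, not data from the talk.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 3 of the 4 positive/negative pairs are ordered correctly.
print(auc([1, 0, 1, 0], [0.9, 0.1, 0.4, 0.6]))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking.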
Preliminary results: eBird data
• Data from New York from May and June of 2006, 2007, and 2008.
• 27 by 64 checkerboarding [New York State: width 285 miles (455 km), length 330 miles (530 km)]
• Each Cell is roughly 16.8 km by 8.3 km.
• There are roughly 200 sites generated during training.
Red: Confusing birds
Blue: Common birds
                       2006                          2007                          2008
Species                LRM       EOM       EONM      LRM       EOM       EONM      LRM       EOM       EONM
Cardinalis_cardinalis  0.720568  0.706608  0.796598  0.607293  0.751957  0.643921  0.702938  0.77029   0.76419
Cyanocitta_cristata    0.692996  0.697844  0.69964   0.577138  0.60214   0.623012  0.705667  0.746821  0.748514
Picoides_pubescens     0.782     0.768099  0.74265   0.643775  0.59784   0.696546  0.712933  0.727307  0.749194
Carpodacus_mexicanus   0.527888  0.628586  0.718772  0.47844   0.29375   0.597955  0.600186  0.655433  0.621303
Sitta_carolinensis     0.681425  0.787735  0.751278  0.643512  0.698183  0.665727  0.662763  0.727671  0.701876
Ardea_herodias         0.586583  0.652467  0.681455  0.509398  0.718355  0.727694  0.751928  0.684067  0.711016
Cathartes_aura         0.597628  0.817579  0.788721  0.659417  0.79799   0.776976  0.571758  0.624674  0.652208
Picoides_villosus      0.72861   0.759946  0.786267  0.539668  0.510911  0.584867  0.593766  0.59877   0.560525
Carpodacus_purpureus   0.465655  0.794822  0.82295   0.640588  0.6002    0.614478  0.676547  0.69174   0.705615
Accipiter_striatus     0.466528  0.531984  0.311784  0.689652  0.930526  0.932894  0.493635  0.563391  0.711689
Mean                   0.624988  0.714567  0.710011  0.598888  0.650185  0.686407  0.647212  0.679016  0.692613
Slide courtesy of Jun Yu
More challenges
• Sampling bias
• Spatial autocorrelation
• For BFL, modeling multiple occupancy states
• For eBird, modeling abundance
• Multi-species approaches
• Dynamic models
– migration
– range shift