machine learning problems in species occupancy modeling rebecca hutchinson march 25, 2010

Machine Learning Problems in Species Occupancy Modeling

Rebecca Hutchinson
March 25, 2010

Toy Example


Adding Covariates


Challenge #2


Birds move. And hide.


Multiple Visits

• Visit each site more than once, recording detection histories Yit

• E.g.

Site   Visit 1   Visit 2   Visit 3
1      1         0         0
2      0         0         0
3      1         1         1
4      0         1         1
5      0         0         0
6      0         1         0

• Population closure assumption: the species occupancy status does not change over the course of the visits to a site.


Assumptions

• Species is never misidentified.

• Occupancy status is constant across visits.

• Visits are separated enough to be conditionally independent, given the occupancy status.

• Sites are independent.


[Figure: plate diagram of the occupancy model with nodes Xi, oi, Zi, Yit, dit, Wit. The site plate runs over i=1,…,M and the visit plate over t=1,…,T.]

Key:
• square = discrete
• circle = continuous
• unshaded = latent
• grey = observed
• pink = parameter
• blue = deterministic function of inputs
• dashed = repeated section

Definitions:
• Xi = occupancy covariates at site i
• oi = probability of occupancy at site i
• Zi = true, unobserved occupancy status of site i
• Wit = detection covariates at site i, visit t
• dit = probability of detection at site i, visit t
• Yit = observed presence/absence at site i, visit t
• Parameters of the occupancy model
• Parameters of the detection model
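The generative story behind these definitions can be sketched as a small simulation. This is only an illustration under stated assumptions: the parameter names `alpha` and `beta`, the logistic link from covariates to oi and dit, and the standard-normal covariates are all choices made here, not details given on the slide.

```python
import math
import random


def logistic(x):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def simulate(M, T, alpha, beta, seed=0):
    """Simulate M sites with T visits each.

    alpha, beta: hypothetical coefficient lists (intercept first) for the
    occupancy and detection models. Covariates are drawn from N(0, 1).
    """
    rng = random.Random(seed)
    sites = []
    for _ in range(M):
        x = [rng.gauss(0, 1) for _ in alpha[1:]]           # Xi
        o = logistic(alpha[0] + sum(a * v for a, v in zip(alpha[1:], x)))
        z = 1 if rng.random() < o else 0                   # Zi ~ Bernoulli(oi)
        history = []
        for _ in range(T):
            w = [rng.gauss(0, 1) for _ in beta[1:]]        # Wit
            d = logistic(beta[0] + sum(b * v for b, v in zip(beta[1:], w)))
            # Detection is only possible at occupied sites (no misidentification).
            y = 1 if (z == 1 and rng.random() < d) else 0  # Yit
            history.append(y)
        sites.append({"x": x, "z": z, "y": history})
    return sites


sites = simulate(M=200, T=3, alpha=[0.0, 1.0], beta=[-0.5, 1.0])
```

In real data only `y` and the covariates are observed; `z` is kept here so the simulated truth can be compared with model estimates.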

Some details

• Conditional distributions: Zi ~ Bernoulli(oi); Yit | Zi ~ Bernoulli(Zi dit)

• Conditional log-likelihood

• Expected joint log-likelihood
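The formulas for these two objectives did not survive in the text, so the sketch below uses the standard forms for this model rather than the slide's exact notation. The conditional log-likelihood sums over the latent Zi for each site; the expected joint log-likelihood averages log P(Y, Z) under a posterior q_i = P(Zi = 1 | Yi), as in EM. The function names and the values o = 0.7, d = 0.5 are hypothetical.

```python
import math


def detection_prob(y, d):
    """P(history y | site occupied): product of per-visit Bernoulli terms."""
    p = 1.0
    for y_t, d_t in zip(y, d):
        p *= d_t if y_t else 1.0 - d_t
    return p


def site_likelihood(y, o, d):
    """P(history y), summing over Z = 1 and Z = 0.

    An unoccupied site can only produce the all-zero history.
    """
    all_zero = 1.0 if sum(y) == 0 else 0.0
    return o * detection_prob(y, d) + (1.0 - o) * all_zero


def conditional_log_likelihood(Y, o, d):
    return sum(math.log(site_likelihood(y, o_i, d_i))
               for y, o_i, d_i in zip(Y, o, d))


def posterior_occupied(y, o, d):
    """q_i = P(Z_i = 1 | y_i); equals 1 whenever the species was detected."""
    return o * detection_prob(y, d) / site_likelihood(y, o, d)


def expected_joint_log_likelihood(Y, o, d, q):
    total = 0.0
    for y, o_i, d_i, q_i in zip(Y, o, d, q):
        total += q_i * (math.log(o_i) + math.log(detection_prob(y, d_i)))
        if q_i < 1.0:  # only all-zero histories put mass on Z = 0
            total += (1.0 - q_i) * math.log(1.0 - o_i)
    return total


# Six-site toy detection histories; o and d are hypothetical constants.
Y = [[1, 0, 0], [0, 0, 0], [1, 1, 1], [0, 1, 1], [0, 0, 0], [0, 1, 0]]
o = [0.7] * len(Y)
d = [[0.5] * 3 for _ in Y]
q = [posterior_occupied(y, o_i, d_i) for y, o_i, d_i in zip(Y, o, d)]
cll = conditional_log_likelihood(Y, o, d)
ejll = expected_joint_log_likelihood(Y, o, d, q)
```

When q is the exact posterior, the expected joint log-likelihood is a lower bound on the conditional log-likelihood, and the gap is the entropy of q.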

Typical Usage

• Fit a small number of models with differing (small) sets of covariates, using the conditional log-likelihood objective
– E.g. model 1 vs. model 2 where
  • o1 ~ rainfall + elevation, d1 ~ weather + time-of-day
  • o2 ~ rainfall + temperature, d2 ~ underbrush-density

• Evaluate models with AIC

• Books on this approach: MacKenzie et al. 2006, Royle et al. 2007.
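A minimal sketch of this fit-then-compare workflow, under simplifying assumptions: the two candidate models below have no covariates (model A uses one detection probability, model B lets visit 1 differ from later visits), and a coarse grid search stands in for a proper optimizer. AIC is computed as 2k - 2 log L, where k is the number of free parameters. The data are a six-site toy set of detection histories.

```python
import math
from itertools import product

# Toy detection histories: six sites, three visits each.
HISTORIES = [[1, 0, 0], [0, 0, 0], [1, 1, 1], [0, 1, 1], [0, 0, 0], [0, 1, 0]]


def log_likelihood(histories, o, d):
    """Conditional log-likelihood with constant occupancy o and per-visit d."""
    total = 0.0
    for y in histories:
        p = 1.0
        for y_t, d_t in zip(y, d):
            p *= d_t if y_t else 1.0 - d_t
        all_zero = 1.0 if sum(y) == 0 else 0.0
        total += math.log(o * p + (1.0 - o) * all_zero)
    return total


def aic(log_lik, n_params):
    """Akaike information criterion: lower is better."""
    return 2 * n_params - 2 * log_lik


GRID = [k / 20 for k in range(1, 20)]  # 0.05, 0.10, ..., 0.95

# Model A: parameters (o, d), so k = 2.
ll_a = max(log_likelihood(HISTORIES, o, [d, d, d])
           for o, d in product(GRID, GRID))

# Model B: parameters (o, d_first, d_later), so k = 3.
ll_b = max(log_likelihood(HISTORIES, o, [d1, d2, d2])
           for o, d1, d2 in product(GRID, GRID, GRID))

aic_a, aic_b = aic(ll_a, 2), aic(ll_b, 3)
preferred = "A" if aic_a <= aic_b else "B"
```

Because model A is nested in model B, B always fits at least as well; AIC asks whether the improvement justifies the extra parameter.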

Outline

• Citizen Science: 2 motivating datasets

• Problem 1: Integrating more flexible models for occupancy and detection
– Regularization
– Boosted regression trees
– (Joint work with Tom Dietterich)

• Problem 2: Alternative detection models
– Experts vs. novices
– Relaxing assumptions
– (Joint work with Weng-Keen Wong and Jun Yu)

Mission: To interpret and conserve the earth’s biological diversity through research, education, and citizen science focused on birds.

Cornell Lab of Ornithology


Birds in Forested Landscapes (BFL)

• Goals:
– Determine habitat/landscape requirements of forest-dwelling birds (especially thrushes)
– Translate results into management recommendations for conservation
– Develop a network of experienced citizen scientists

• BFL is a continent-wide project that has engaged over 1,000 volunteers who surveyed over 3,000 study sites.

• Have data from 1997-2006

• Participants follow a rigorously tested protocol that includes:
– selecting suitable study sites
– visiting these sites at least twice during the breeding season
– measuring a variety of habitat variables

• http://www.birds.cornell.edu/bfl/


BFL data

• Select forest patches, then survey points, and one or more species of interest.

• Visit 1: earliest date when all your study species have arrived
– Want beginning of breeding period, but no birds still migrating.

• Visit 2: 2-4 weeks later
– Breeding should be underway, different evidence available.

• Record presence/absence of 22 possible breeding behaviors observed in each period on each visit.

• Record presence/absence of competitors/predators on each visit.

• Record environmental variables at large, medium, and small scales.

• Observers work in teams of 1-4 people.


BFL data: visit protocol example

• Observation Period (mandatory 10 minutes)
– Look and listen for predators, cowbirds, and study species

• Playback Period (mandatory 5 minutes per species)
– Species 1: play songs, calls, or drums for 1 minute
– Species 1: watch/listen for 1 minute
– Species 1: repeat songs, calls, or drums for 1 minute
– Species 1: watch/listen for 2 minutes
– Species 2: play songs, calls, or drums for 1 minute
– Species 2: watch/listen for 1 minute
– Species 2: repeat songs, calls, or drums for 1 minute
– Species 2: watch/listen for 2 minutes

• Behavior Watch Period (mandatory 10 minutes)
– Play eastern or western mobbing calls for 5 minutes while looking and listening for study species
– Watch/listen for 5 minutes


BFL data: habitat characteristics

• Survey point (where observer stands)
– Latitude/longitude
– Elevation
– Distance to nearest edge, road, water, occupied building

• Study site (radius=150m)
– Hydrology during breeding season
– Forest cover type
– Slope
– Land use
– Land ownership
– Canopy characteristics
– Low vegetation characteristics

• Landscape level (2500 acres)
– Patch edge (what habitats are adjacent)
– Forest patch size
– Percentage of forest
– Linear distance of edge
– Distance to nearest 100 & 500 acre patches (if patch is less than 1000 acres)

Increasing model flexibility

• Why?
– Many possible habitat variables
  • interactions?
– Exploratory modeling with many covariates rather than hypothesis testing with few

• 2 ideas:
– Regularization
– Boosted regression trees

How to regularize these models?

• One possible penalty:

• How should the two components be weighted?
– tug-of-war between occupancy and detection to explain the all-zero detection histories
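The penalty formula itself is not preserved in the text. As one plausible illustration only, the sketch below assumes an L2 penalty with separate weights for the occupancy and detection coefficients, which makes the weighting question concrete. The names `alpha`, `beta`, `lambda_o`, and `lambda_d`, and all numeric values, are hypothetical.

```python
def l2_penalty(coefs):
    """Sum of squared coefficients, leaving the intercept (index 0) unpenalized."""
    return sum(c * c for c in coefs[1:])


def penalized_objective(log_lik, alpha, beta, lambda_o, lambda_d):
    """Log-likelihood minus separately weighted penalties on each component.

    alpha: occupancy-model coefficients, beta: detection-model coefficients.
    """
    return log_lik - lambda_o * l2_penalty(alpha) - lambda_d * l2_penalty(beta)


# Placeholder inputs: a log-likelihood of -10.0 and arbitrary coefficients.
value = penalized_objective(
    log_lik=-10.0,
    alpha=[0.2, 1.0, -2.0],
    beta=[0.1, 0.5],
    lambda_o=0.1,
    lambda_d=1.0,
)
```

A larger `lambda_d` relative to `lambda_o` shrinks the detection model harder, pushing the occupancy model to account for more of the all-zero histories, and vice versa.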

Preliminary synthetic data results

• 8 covariates for each model, half of which truly had non-zero coefficients

• Choice of objective function seems more important than regularization parameters

Posterior Regularization

• [Ganchev, Gillenwater, Graca, and Taskar, 2009]

• Regularization constraints on posterior expectations instead of parameters, for example:
– Expected occupancy is less than 60%
– Of the all-zero detection histories, only half can be ‘explained away’ by the detection model
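To make the two example constraints concrete, the sketch below only evaluates them for a fixed set of parameters; it does not implement the projection step of the posterior regularization framework. It assumes constant o and d (hypothetical values), and it reads "explained away" as the average posterior probability that an all-zero site is occupied but undetected, which is an interpretation rather than the slide's definition.

```python
def posterior_occupied(y, o, d):
    """P(Z = 1 | detection history y) with constant o and d."""
    p = 1.0
    for y_t in y:
        p *= d if y_t else 1.0 - d
    all_zero = 1.0 if sum(y) == 0 else 0.0
    return o * p / (o * p + (1.0 - o) * all_zero)


# Toy detection histories and hypothetical parameter values.
histories = [[1, 0, 0], [0, 0, 0], [1, 1, 1], [0, 1, 1], [0, 0, 0], [0, 1, 0]]
o, d = 0.7, 0.5

q = [posterior_occupied(y, o, d) for y in histories]

# Constraint 1: expected occupancy across sites is less than 60%.
expected_occupancy = sum(q) / len(q)
occupancy_ok = expected_occupancy < 0.60

# Constraint 2: among all-zero histories, at most half are attributed
# to missed detections at occupied sites.
zero_q = [q_i for q_i, y in zip(q, histories) if sum(y) == 0]
explained_away = sum(zero_q) / len(zero_q)
explained_ok = explained_away <= 0.50
```

With these toy numbers four of six sites have detections, so the first constraint fails regardless of the parameters; on real data a violated constraint would pull the posterior toward the feasible region during fitting.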

Boosted Regression Trees

• Popular in species distribution modeling
– [Elith et al 2006]

• Functional gradient ascent [Friedman 2001]
– regression trees predict F(X) and G(W)
– F and G are fed through logistic() to get o and d

• Current challenge: tuning
– learning rate (shrinkage)
– number of trees to grow at each stage
– depth of trees
– number of stages
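In functional gradient ascent, each new regression tree is fit to the gradient of the objective with respect to the current function values. A sketch of those gradients for one site, assuming the objective is the conditional log-likelihood with o = logistic(F) and d = logistic(G); the tree-fitting step itself is omitted, and the function names are hypothetical.

```python
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def site_log_lik(F_i, G_i, y_i):
    """Log-likelihood of one site's history given scores F_i and G_i[t]."""
    o = logistic(F_i)
    p = 1.0
    for g, y in zip(G_i, y_i):
        d = logistic(g)
        p *= d if y else 1.0 - d
    all_zero = 1.0 if sum(y_i) == 0 else 0.0
    return math.log(o * p + (1.0 - o) * all_zero)


def functional_gradients(F_i, G_i, y_i):
    """Derivatives of site_log_lik with respect to F_i and each G_i[t].

    These are the targets the next occupancy tree and detection tree
    would be fit to.
    """
    o = logistic(F_i)
    ds = [logistic(g) for g in G_i]
    p = 1.0
    for d, y in zip(ds, y_i):
        p *= d if y else 1.0 - d
    all_zero = 1.0 if sum(y_i) == 0 else 0.0
    lik = o * p + (1.0 - o) * all_zero
    grad_F = o * (1.0 - o) * (p - all_zero) / lik
    grad_G = [o * p * (y - d) / lik for d, y in zip(ds, y_i)]
    return grad_F, grad_G


grad_F_detected, grad_G_detected = functional_gradients(0.3, [0.1, -0.4, 0.2], [1, 0, 0])
grad_F_zero, grad_G_zero = functional_gradients(0.3, [0.1, -0.4, 0.2], [0, 0, 0])
```

For a site with at least one detection, the gradients reduce to the familiar logistic-regression residuals (1 - o) and (y - d). For an all-zero site, both gradients are negative, since lowering either occupancy or detection explains the history better.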

Where Birding Meets Science!

eBird—Current Stats (2009)

• >1,500,000 checklists submitted

• 2,945 species reported

• ~70,000 users

• 21 million observations reported

• ~540,000 site visitors

• 173 countries/territories

Northern Cardinal Distribution (Frequency of Detection)

• Gray – not reported
• Tan – insufficient data
• White – not covered

Extensions needed for eBird?

• Alternative detection model
– add a node for expertise of observer

• Relax the assumption of no misidentifications
– Y|Z=1 ~ Bernoulli(d)
– Y|Z=0 ~ Bernoulli(h)
  • (instead of 0)
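The effect of the false-positive rate h on a site's likelihood can be sketched as follows, assuming constant o, d, and h across visits. The function name and the numeric values are hypothetical. Setting h = 0 recovers the original model, in which an unoccupied site can only produce an all-zero history.

```python
def site_likelihood_fp(y, o, d, h):
    """P(history y) when unoccupied sites can yield false positives.

    Y | Z=1 ~ Bernoulli(d) and Y | Z=0 ~ Bernoulli(h) at every visit.
    """
    p_occupied = 1.0
    p_unoccupied = 1.0
    for y_t in y:
        p_occupied *= d if y_t else 1.0 - d
        p_unoccupied *= h if y_t else 1.0 - h
    return o * p_occupied + (1.0 - o) * p_unoccupied


# Hypothetical values: occupancy 0.7, detection 0.5, false-positive rate 0.05.
no_fp = site_likelihood_fp([1, 0, 0], o=0.7, d=0.5, h=0.0)
with_fp = site_likelihood_fp([1, 0, 0], o=0.7, d=0.5, h=0.05)
```

With h > 0, a single detection no longer proves occupancy, so the posterior probability of occupancy for such a site drops below 1.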

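Model with expertise node

[Figure: plate diagram of the model with an observer-expertise node. Node labels: Xi, Zis, Wics, Yics, Bic, Uj, Ej. Plate indices: i, s, c, j.]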

Preliminary results: Synthetic data

Area under ROC curve. Synthetic data generated from EOM with different levels of false positives (FP).

        0% FP                         5% FP                         10% FP
        LRM       EOM       EONM      LRM       EOM       EONM      LRM       EOM       EONM
Run 1   0.930384  0.971115  0.977238  0.910816  0.945408  0.96979   0.885863  0.88316   0.935426
Run 2   0.903887  0.980729  0.975234  0.899543  0.940208  0.97605   0.892114  0.828198  0.921102
Run 3   0.94002   0.974756  0.958459  0.886119  0.937266  0.957319  0.890704  0.842421  0.947113
Run 4   0.932801  0.963297  0.958912  0.90578   0.908399  0.953098  0.880455  0.904996  0.939507
Run 5   0.887101  0.96577   0.950276  0.897433  0.92701   0.965392  0.899887  0.918916  0.96166
Run 6   0.895025  0.971983  0.969289  0.855337  0.850694  0.941323  0.883817  0.937461  0.956812
Run 7   0.880148  0.959791  0.965582  0.874276  0.93516   0.91473   0.888834  0.898839  0.91642
Run 8   0.895394  0.955807  0.97103   0.901834  0.94291   0.961646  0.930379  0.924846  0.967828
Run 9   0.922637  0.984437  0.983202  0.896268  0.924409  0.941531  0.917951  0.917625  0.946334
Run 10  0.901313  0.967825  0.971087  0.87269   0.865213  0.945943  0.85779   0.862085  0.900124
Mean    0.908871  0.969551  0.968031  0.890009  0.917668  0.952682  0.892779  0.891855  0.939232

Slide courtesy of Jun Yu

Preliminary results: eBird data

• Data from New York from May and June of 2006, 2007, and 2008.

• 27 by 64 checkerboarding (New York State: width 285 miles (455 km), length 330 miles (530 km)).

• Each cell is roughly 16.8 km by 8.3 km.

• There are roughly 200 sites generated during training.
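The cell dimensions follow from dividing the state's extent by the grid counts: 455 km / 27 and 530 km / 64. The sketch below reproduces that arithmetic and shows one way a checkerboard could assign locations to alternating groups. The train/test use of the two colors and the function `cell_group` are assumptions for illustration; the slide does not say how the cells were used.

```python
WIDTH_KM, LENGTH_KM = 455.0, 530.0  # New York State extent, from the slide
N_COLS, N_ROWS = 27, 64             # checkerboard grid, from the slide

cell_w = WIDTH_KM / N_COLS   # cell width in km
cell_h = LENGTH_KM / N_ROWS  # cell height in km


def cell_group(x_km, y_km):
    """Assign a location (km from the grid's corner) to a checkerboard color.

    Hypothetical use: alternate cells serve as training and test regions.
    """
    col = min(int(x_km // cell_w), N_COLS - 1)
    row = min(int(y_km // cell_h), N_ROWS - 1)
    return "train" if (row + col) % 2 == 0 else "test"


corner = cell_group(0.0, 0.0)
neighbor = cell_group(1.5 * cell_w, 0.0)
```

Alternating cells keeps training and test locations spatially separated, which limits how much spatial autocorrelation can inflate test scores.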

Legend: red = confusing birds, blue = common birds

                      2006                          2007                          2008
                      LRM       EOM       EONM      LRM       EOM       EONM      LRM       EOM       EONM
Cardinalis_cardinalis 0.720568  0.706608  0.796598  0.607293  0.751957  0.643921  0.702938  0.77029   0.76419
Cyanocitta_cristata   0.692996  0.697844  0.69964   0.577138  0.60214   0.623012  0.705667  0.746821  0.748514
Picoides_pubescens    0.782     0.768099  0.74265   0.643775  0.59784   0.696546  0.712933  0.727307  0.749194
Carpodacus_mexicanus  0.527888  0.628586  0.718772  0.47844   0.29375   0.597955  0.600186  0.655433  0.621303
Sitta_carolinensis    0.681425  0.787735  0.751278  0.643512  0.698183  0.665727  0.662763  0.727671  0.701876
Ardea_herodias        0.586583  0.652467  0.681455  0.509398  0.718355  0.727694  0.751928  0.684067  0.711016
Cathartes_aura        0.597628  0.817579  0.788721  0.659417  0.79799   0.776976  0.571758  0.624674  0.652208
Picoides_villosus     0.72861   0.759946  0.786267  0.539668  0.510911  0.584867  0.593766  0.59877   0.560525
Carpodacus_purpureus  0.465655  0.794822  0.82295   0.640588  0.6002    0.614478  0.676547  0.69174   0.705615
Accipiter_striatus    0.466528  0.531984  0.311784  0.689652  0.930526  0.932894  0.493635  0.563391  0.711689
Mean                  0.624988  0.714567  0.710011  0.598888  0.650185  0.686407  0.647212  0.679016  0.692613

Slide courtesy of Jun Yu

More challenges

• Sampling bias
• Spatial autocorrelation
• For BFL, modeling multiple occupancy states
• For eBird, modeling abundance
• Multi-species approaches
• Dynamic models
– migration
– range shift

Questions? Comments? Suggestions?
