
Unsupervised Approaches for Post-Processing in Computationally Efficient Waveform-Similarity-Based Earthquake Detection

Karianne Bergen1, Clara Yoon2, Ossian O’Reilly2, Gregory Beroza2

1Institute for Computational and Mathematical Engineering, Stanford University, 2Department of Geophysics, Stanford University email: [email protected]

Introduction

Fingerprint and Similarity Thresholding (FAST) promises to allow large-scale blind search for similar waveforms in long-duration continuous seismic data [1].

- Waveform similarity search applied to datasets spanning months to years will identify significantly more low-magnitude events than traditional methods for earthquake detection.
- New approaches for processing the output of similarity-based detection are required: manual inspection is infeasible for large data volumes.
- We explore data mining techniques for improved detection post-processing.

FAST: Method Overview

FAST is inspired by the Waveprint [2] algorithm for identifying audio clips, adapted to continuous seismic waveform data.

[Figure: FAST algorithmic pipeline]
Data: continuous time series data
Preprocessing: spectrogram (after bandpass filtering)
Feature Extraction: Spectral Image -> Haar Transform -> Top coefficients (most discriminative) -> Binary Fingerprint
Database Generation & Search: fast approximate similarity search using MinHash and Locality-Sensitive Hashing
Detection Results
Post-Processing: identifying events, combining over network, removing false positives, clustering waveforms
Example panels (window #1267): log10(|spectral image|), time (s) vs. frequency (Hz); log10(|Haar transform|); sign of top wavelet coefficients; binary fingerprints.

- Database search returns a list of “candidate pairs”; post-processing is necessary to eliminate non-earthquakes (false positives, correlated noise).
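The way a MinHash plus Locality-Sensitive Hashing search produces such candidate pairs can be sketched as follows. This is a minimal illustration, not the FAST implementation: the function names, the hash family `(a * bit + b) % prime`, and the table parameters are all assumptions chosen for the example, and each fingerprint is represented as a non-empty set of active bit indices.

```python
import random
from collections import defaultdict
from itertools import combinations

PRIME = 2_147_483_647  # modulus for the illustrative hash family (assumption)

def minhash_signature(active_bits, hash_params):
    """MinHash signature of a non-empty set of active fingerprint bit indices."""
    return tuple(
        min((a * bit + b) % PRIME for bit in active_bits)
        for a, b in hash_params
    )

def lsh_candidate_pairs(fingerprints, n_tables=20, hashes_per_table=3, seed=0):
    """Return index pairs (i, j), i < j, that share a bucket in at least one table."""
    rng = random.Random(seed)
    n_hashes = n_tables * hashes_per_table
    params = [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
              for _ in range(n_hashes)]
    signatures = [minhash_signature(fp, params) for fp in fingerprints]

    candidates = set()
    for t in range(n_tables):
        lo = t * hashes_per_table
        buckets = defaultdict(list)
        for idx, sig in enumerate(signatures):
            # Fingerprints land in the same bucket only if this whole band matches.
            buckets[sig[lo:lo + hashes_per_table]].append(idx)
        for members in buckets.values():
            candidates.update(combinations(members, 2))
    return candidates

fingerprints = [{1, 5, 9, 12}, {1, 5, 9, 12}, {100, 200, 300, 400}]
pairs = lsh_candidate_pairs(fingerprints)
```

Pairs are found without comparing every fingerprint against every other one, which is what makes the search scale; the price is that some returned pairs are not earthquakes, hence the post-processing.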

Event Identification and Network Detection

How do we identify earthquakes from waveform pairs returned by FAST?

[Figure: candidate detections linking event 1 and event 2, labelled with similarity values 0.988, 0.975, and 0.970.]

- Output of FAST (single channel): a sparse matrix of (candidate) pairs of similar waveforms.
- A single event pair often results in multiple detections, because time-adjacent windows overlap.
- Multiple (sequential) detections of a single event pair appear along a diagonal line (fixed inter-event time ∆t) in the similarity matrix.
- Link all detections for each event pair for improved thresholding.
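The linking step can be sketched as below, assuming each candidate pair is a tuple `(i, j, sim)` of two window indices and a similarity value. The function name `link_detections`, the `max_gap` parameter, and the summary fields are hypothetical choices for illustration; the toy values are not results.

```python
from collections import defaultdict

def link_detections(pairs, max_gap=1):
    """Merge candidate pairs that share a diagonal (j - i) and are adjacent in time."""
    by_diagonal = defaultdict(list)
    for i, j, sim in pairs:
        by_diagonal[j - i].append((i, sim))

    events = []
    for dt, hits in by_diagonal.items():
        hits.sort()
        start = prev = hits[0][0]
        total, peak = 0.0, 0.0
        for i, sim in hits:
            if i - prev > max_gap:
                # Gap along the diagonal: close the current event pair.
                events.append({"dt": dt, "start": start, "end": prev,
                               "sum_sim": total, "max_sim": peak})
                start, total, peak = i, 0.0, 0.0
            total += sim
            peak = max(peak, sim)
            prev = i
        events.append({"dt": dt, "start": start, "end": prev,
                       "sum_sim": total, "max_sim": peak})
    return events

toy_pairs = [(100, 500, 0.970), (101, 501, 0.988), (102, 502, 0.975),
             (300, 900, 0.600)]
events = link_detections(toy_pairs)
```

Thresholding on `sum_sim` rather than on individual entries rewards event pairs that are detected in several consecutive windows, which is one way to read "improved thresholding".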

How do we combine single-station detection results from FAST over a network of seismic stations?

- Network detection can improve detection sensitivity.
- Limited move-out (multiple channels at a single station or nearby stations): sum single-channel similarity matrices → network similarity matrix.
- Challenge: move-out varies between stations and is unknown a priori in blind search.
- Inter-event time is uniform across the network for a given event pair.
- Pseudo-association: group detections by inter-event time (diagonal) across multiple stations.
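A rough sketch of pseudo-association follows, assuming each station contributes a list of linked detections `(dt, start, sim)`, where `dt` is the inter-event time and `start` the time of the first event. The function name, the `max_moveout` window, and the exact-match grouping on `dt` are simplifying assumptions; real data would likely need a tolerance on `dt`.

```python
from collections import defaultdict

def associate_across_stations(station_events, max_moveout=30, min_stations=2):
    """Group single-station detections that share an inter-event time dt."""
    by_dt = defaultdict(list)
    for station, detections in station_events.items():
        for dt, start, sim in detections:
            by_dt[dt].append((start, station, sim))

    groups = []
    for dt, hits in by_dt.items():
        hits.sort()
        group = [hits[0]]
        for hit in hits[1:]:
            # Same diagonal; allow start times to differ by the move-out window.
            if hit[0] - group[0][0] <= max_moveout:
                group.append(hit)
            else:
                groups.append((dt, group))
                group = [hit]
        groups.append((dt, group))

    network_events = []
    for dt, group in groups:
        stations = sorted({station for _, station, _ in group})
        if len(stations) >= min_stations:
            network_events.append({
                "dt": dt,
                "start": group[0][0],
                "stations": stations,
                "network_sim": sum(sim for _, _, sim in group),
            })
    return network_events

toy = {
    "PB01": [(917, 83160, 0.80)],
    "PB08": [(917, 83165, 0.70)],
    "PATCX": [(917, 83172, 0.75), (300, 5000, 0.90)],
}
network_events = associate_across_stations(toy)
```

The isolated PATCX detection is dropped because no other station sees the same inter-event time, which is how grouping by diagonal suppresses single-station false positives without knowing the move-out.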

Data set: Iquique foreshocks, 2014-03-21

[Figure: waveforms of an event pair recorded across multiple stations (PSGCX, PB11, PB08, PB01, PATCX), plotted over 0-60 s from 83158 s and from 84075 s. Cross-correlation labels in the figure: CC = 0.627, 0.792, 0.814, 0.775, 0.829.]

[Figure: summed network similarity, time index 1 (83160-83220) vs. time index 2 (84080-84140), color scale 0 to >0.7. Labels in the figure: PB01, PB08, PATCX, "2 stations", PSGCX, PB11.]

Similarity matrix: an event pair detected across multiple stations appears along the same diagonal, but with minimal temporal overlap.

Clustering Waveforms

Clustering is a set of techniques for identifying groups of similar waveforms within the full set of detections returned by FAST. It can be used to:
- organize detection results for easier interpretation (i.e., find interesting structure or patterns in the data),
- identify new template waveforms for template matching or subspace detection, and
- remove additional false alarms (e.g., outliers, non-earthquake clusters).

Application: Guy-Greenbrier Fault, central Arkansas

- FAST detects 746 new earthquakes that were not identified by template matching in one month of data (July 2010) at station WHAR [3].
- The similarity matrix for the new detections has a block-like structure; we apply spectral clustering to identify 8 broad waveform clusters.
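How spectral clustering exploits a block-like similarity matrix can be illustrated with the sketch below. It is an assumption-laden toy, not the procedure used for the WHAR data: it uses the symmetric normalized affinity, the top eigenvectors from `numpy.linalg.eigh`, and a small k-means with farthest-point initialisation, and the function name and parameters are hypothetical.

```python
import numpy as np

def spectral_clusters(similarity, n_clusters, n_iter=50):
    """Cluster items from a symmetric, non-negative similarity matrix."""
    S = np.asarray(similarity, dtype=float)
    d_inv_sqrt = 1.0 / np.sqrt(S.sum(axis=1))
    # Symmetric normalized affinity D^{-1/2} S D^{-1/2}.
    A = d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(A)          # eigenvalues in ascending order
    X = vecs[:, -n_clusters:]            # embedding: top n_clusters eigenvectors
    X = X / np.linalg.norm(X, axis=1, keepdims=True)

    # Farthest-point initialisation, then plain k-means on the embedded rows.
    centers = [X[0]]
    for _ in range(1, n_clusters):
        dists = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(dists)])
    centers = np.array(centers)
    for _ in range(n_iter):
        sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = np.argmin(sq_dist, axis=1)
        for k in range(n_clusters):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels

# Toy block-structured matrix: events 0-2 are similar, events 3-4 are similar.
S = np.full((5, 5), 0.05)
S[:3, :3] = 0.9
S[3:, 3:] = 0.9
np.fill_diagonal(S, 1.0)
labels = spectral_clusters(S, n_clusters=2)
```

Reordering events by the returned labels is what makes the block structure visible in a reordered similarity matrix.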

[Figure: 3-channel event similarities (normalized CC), event index 1 vs. event index 2, shown in original order and reordered, with clusters labelled 1-8.]

Representative waveforms (three-component) from each cluster

[Figure: representative waveforms for clusters 1-8 on channels WHAR.HHE, WHAR.HHN, and WHAR.HHZ; time axis 0.0-4.0 s.]

- Reclustering within large clusters can identify representative waveforms or small clusters, e.g. cluster 8.
- For example, hierarchical clustering (complete-linkage) identifies representative waveforms within clusters.
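Complete-linkage reclustering can be sketched in a few lines of plain Python. The stopping threshold `min_similarity` and the rule for picking a representative (highest summed similarity to its cluster) are assumptions made for this illustration, not details taken from the poster.

```python
def complete_linkage(similarity, min_similarity):
    """Merge clusters while their least similar members stay above the threshold."""
    clusters = [[i] for i in range(len(similarity))]

    def linkage(a, b):
        # Complete linkage: a pair of clusters is only as similar as its worst pair.
        return min(similarity[i][j] for i in a for j in b)

    while len(clusters) > 1:
        best, (p, q) = max(
            (linkage(clusters[p], clusters[q]), (p, q))
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        if best < min_similarity:
            break
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters

def representative(cluster, similarity):
    """Member with the highest summed similarity to its own cluster."""
    return max(cluster, key=lambda i: sum(similarity[i][j] for j in cluster))

S = [
    [1.00, 0.90, 0.80, 0.10],
    [0.90, 1.00, 0.85, 0.20],
    [0.80, 0.85, 1.00, 0.15],
    [0.10, 0.20, 0.15, 1.00],
]
clusters = complete_linkage(S, min_similarity=0.5)
rep = representative(clusters[0], S)
```

Because complete linkage requires every pair in a cluster to be similar, the resulting groups are tight, which suits the goal of extracting a single representative waveform per group.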

(Right) Clustering can aid in visualization and interpretation of a large number of new detections: cluster membership of new FAST detections plotted over time. Injection began at well #1, closest to the Guy-Greenbrier Fault, on 7 July 2010 (at 518400 s in figure).

[Figure: similarity (maximum normalized CC, 0 to 1.0) vs. time (s) from 2010-07-01 00:00:00.00, 0 to 2.5×10^6 s.]

Feature Extraction

“Good” feature extraction can reduce false detections.
- Binary fingerprints act as proxies for waveforms in efficient similarity search.
- Fingerprints must be discriminative: (dis)similar waveforms should have (dis)similar fingerprints.
- False detections are preferred to missed detections, but too many hurt performance.
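As a small illustration of fingerprints acting as proxies, the sketch below encodes the sign of selected coefficients with two bits each and compares fingerprints with Jaccard similarity, the quantity that MinHash estimates. The two-bit sign encoding (positive as `10`, negative as `01`, unselected as `00`) follows the Waveprint convention and is an assumption here, as are the function names and toy values.

```python
def binary_fingerprint(coefficients, top_indices):
    """Encode the sign of each selected coefficient with two bits."""
    keep = set(top_indices)
    bits = []
    for idx, value in enumerate(coefficients):
        if idx in keep and value > 0:
            bits += [1, 0]
        elif idx in keep and value < 0:
            bits += [0, 1]
        else:
            bits += [0, 0]   # coefficient not selected (or exactly zero)
    return bits

def jaccard(fp_a, fp_b):
    """Shared active bits divided by bits active in either fingerprint."""
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    return both / either if either else 0.0

fp_a = binary_fingerprint([3.0, -2.0, 0.1, 1.5], top_indices=[0, 1, 3])
fp_b = binary_fingerprint([2.5, -1.0, 0.2, -1.2], top_indices=[0, 1, 3])
similarity = jaccard(fp_a, fp_b)
```

The two toy windows agree in sign on two of three selected coefficients, so half of the bits active in either fingerprint are shared.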

How are “most discriminative” Haar coefficients selected?

- Top magnitude coefficients (often used for efficient compression)
- Most atypical coefficients, as measured by:
  - Z-score (mean, standard deviation), or
  - Median Absolute Deviation (MAD) across the data set
- MAD-based Haar coefficient selection demonstrates the best performance in low SNR settings and is most efficient.
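The difference between top-magnitude and MAD-based selection can be seen in a minimal sketch, assuming each window is a list of Haar coefficients and that statistics are taken per coefficient index across all windows. The helper names are hypothetical, no consistency constant is applied to the MAD, and coefficients with zero MAD are simply given a score of 0.

```python
from statistics import median

def mad_standardize(coeff_windows):
    """Score each coefficient as (value - median) / MAD across the data set."""
    n_coeff = len(coeff_windows[0])
    med = [median(w[k] for w in coeff_windows) for k in range(n_coeff)]
    mad = [median(abs(w[k] - med[k]) for w in coeff_windows)
           for k in range(n_coeff)]
    return [
        [(w[k] - med[k]) / mad[k] if mad[k] > 0 else 0.0 for k in range(n_coeff)]
        for w in coeff_windows
    ]

def top_k_indices(values, k):
    """Indices of the k entries with the largest absolute value."""
    order = sorted(range(len(values)), key=lambda i: abs(values[i]), reverse=True)
    return order[:k]

# Coefficient 0 is always large (typical); coefficient 1 is unusual in window 3.
windows = [
    [10.0, 0.1, 1.0],
    [10.2, 0.2, 1.1],
    [9.8, 0.0, 0.9],
    [10.1, 2.0, 1.0],
    [9.9, 0.1, 1.0],
]
scores = mad_standardize(windows)
by_magnitude = top_k_indices(windows[3], 1)
by_mad = top_k_indices(scores[3], 1)
```

Top-magnitude selection keeps the coefficient that is large in every window and so carries little discriminating information, while the MAD score picks the coefficient that is atypical for that window.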

[Figure: panels for two noise samples (noise sample 1, noise sample 2) comparing Top Magnitude, Top Z-score, and Top MAD coefficient selection.]

Synthetic Test

Comparison of the performance of Haar coefficient-selection methods on a synthetic test. The MAD-based coefficient selection best separates the repeated waveforms from the noise.

(Right) Test data (a): 12 pairs of repeated waveforms (SNR 1.25-5) planted at known times in 3 hr of noise (bandpass 1-10 Hz). Detection results from FAST shown for (b) top magnitude, (c) top Z-score, and (d) top MAD Haar coefficients. Locations of true repeated events are indicated by orange vertical lines, and the detection statistic (similarity value) is plotted in blue. The top 400 coefficients are selected in the results pictured, but the results hold for the top 100-800 coefficients.

[Figure (left): frequency of coefficient activation (normalized) vs. % bits in binary fingerprint (cumulative), both 0 to 1. Legend: top 400 Haar coefficients in magnitude; top 400 standardized Haar coefficients (Z-score); top 400 standardized Haar coefficients (MAD); ideal line for perfectly efficient representation.]

[Figure (right): panels (a)-(d) over time 0-10000 s; (a) test data, (b)-(d) similarity value (0 to 1.0).]

(Left) Efficiency of binary representations (ordered from least to most efficient): top magnitude (blue), top Z-score (orange), and top MAD (purple), with Gini index of 0.73, 0.28, and 0.11, respectively.
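One way to compute such a Gini index is sketched below, assuming it is the standard Lorenz-curve Gini coefficient applied to how often each fingerprint bit is activated across the data set; the poster does not spell out the exact formula, so this is an interpretation. A value of 0 means every bit is used equally often, and values near 1 mean a few bits carry almost all activations.

```python
def gini_index(activation_counts):
    """Gini coefficient of per-bit activation counts (0 = perfectly even use)."""
    x = sorted(activation_counts)
    n, total = len(x), sum(x)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum form of the Gini coefficient for sorted values.
    weighted = sum((rank + 1) * value for rank, value in enumerate(x))
    return 2.0 * weighted / (n * total) - (n + 1) / n

even_use = gini_index([5, 5, 5, 5])        # every bit active equally often
concentrated = gini_index([0, 0, 0, 20])   # one bit carries all activations
```

Under this reading, a lower Gini index corresponds to a representation that spreads information more evenly over the fingerprint bits.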

Alternate Feature Extraction Approaches (on-going work)

- Time-domain features: bag-of-waveforms, wavelets, random projections
- Data-driven features: spectral hashing, shift-invariant sparse coding, nonnegative matrix factorization (NMF)-based features

References

[1] Yoon, C., et al. (2015). “Earthquake detection through computationally efficient similarity search.” Science Advances, 1(11).

[2] Baluja, S., and Covell, M. (2008). “Waveprint: Efficient wavelet-based audio fingerprinting.” Pattern Recognition, 41(11).

[3] Yoon, C., et al. (2015). AGU Fall Meeting Abstract S13B-2850.

Read more about FAST (doi:10.1126/sciadv.1501057)
