


Locality-Sensitive Hashing for Earthquake Detection: A Case Study of Scaling Data-Driven Science

Kexin Rong∗, Clara E. Yoon†, Karianne J. Bergen‡, Hashem Elezabi∗, Peter Bailis∗, Philip Levis∗, Gregory C. Beroza†

Stanford University

ABSTRACT

In this work, we report on a novel application of Locality Sensitive Hashing (LSH) to seismic data at scale. Based on the high waveform similarity between reoccurring earthquakes, our application identifies potential earthquakes by searching for similar time series segments via LSH. However, a straightforward implementation of this LSH-enabled application has difficulty scaling beyond 3 months of continuous time series data measured at a single seismic station. As a case study of a data-driven science workflow, we illustrate how domain knowledge can be incorporated into the workload to improve both the efficiency and result quality. We describe several end-to-end optimizations of the analysis pipeline from pre-processing to post-processing, which allow the application to scale to time series data measured at multiple seismic stations. Our optimizations enable an over 100× speedup in the end-to-end analysis pipeline. This improved scalability enabled seismologists to perform seismic analysis on more than ten years of continuous time series data from over ten seismic stations, and has directly enabled the discovery of 597 new earthquakes near the Diablo Canyon nuclear power plant in California and 6123 new earthquakes in New Zealand.

PVLDB Reference Format:
Kexin Rong, Clara E. Yoon, Karianne J. Bergen, Hashem Elezabi, Peter Bailis, Philip Levis, Gregory C. Beroza. Locality-Sensitive Hashing for Earthquake Detection: A Case Study of Scaling Data-Driven Science. PVLDB, 11(11): 1674-1687, 2018.
DOI: https://doi.org/10.14778/3236187.3236214

1. INTRODUCTION

Locality Sensitive Hashing (LSH) [29] is a well studied computational primitive for efficient nearest neighbor search in high-dimensional spaces. LSH hashes items into low-dimensional spaces such that similar items have a higher collision probability in the hash table. Successful applications of LSH include entity resolution [65], genome sequence comparison [18], text and image search [41, 52], near duplicate detection [20, 46], and video identification [37].

∗Department of Computer Science
†Department of Geophysics
‡Institute for Computational and Mathematical Engineering

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Articles from this volume were invited to present their results at The 44th International Conference on Very Large Data Bases, August 2018, Rio de Janeiro, Brazil.
Copyright 2018 VLDB Endowment 2150-8097/18/07.
DOI: https://doi.org/10.14778/3236187.3236214


Figure 1: Example of near identical waveforms between occurrences of the same earthquake two months apart, observed at three seismic stations in New Zealand. The stations experience increased ground motions upon the arrivals of seismic waves (e.g., P and S waves). This paper scales LSH to over 30 billion data points and discovers 597 and 6123 new earthquakes near the Diablo Canyon nuclear power plant in California and in New Zealand, respectively.

In this paper, we present an innovative use of LSH—and associated challenges at scale—in large-scale earthquake detection across seismic networks. Earthquake detection is particularly interesting in both its abundance of raw data and scarcity of labeled examples:

First, seismic data is large. Earthquakes are monitored by seismic networks, which can contain thousands of seismometers that continuously measure ground motion and vibration. For example, Southern California alone has over 500 seismic stations, each collecting continuous ground motion measurements at 100Hz. As a result, this network alone has collected over ten trillion (10^13) data points in the form of time series in the past decade [5].

Second, despite large measurement volumes, only a small fraction of earthquake events are cataloged, or confirmed and hand-labeled. As earthquake magnitude (i.e., size) decreases, the frequency of earthquake events increases exponentially. Worldwide, major earthquakes (magnitude 7+) occur approximately once a month, while magnitude 2.0 and smaller earthquakes occur several thousand times a day. At low magnitudes, it is difficult to detect earthquake signals because earthquake energy approaches the noise floor, and conventional seismological analyses can fail to disambiguate between signal and noise. Nevertheless, detecting these small earthquakes is important in uncovering sources of earthquakes [24, 32], improving the understanding of earthquake mechanics [49, 58], and better predicting the occurrences of future events [38].

To take advantage of the large volume of unlabeled raw measurement data, seismologists have developed an unsupervised, data-driven earthquake detection method, Fingerprint And Similarity Thresholding (FAST), based on waveform similarity [25]. Seismic sources repeatedly generate earthquakes over the course of days, months or even years, and these earthquakes show near identical waveforms when recorded at the same seismic station, regardless of the earthquake's magnitude [27, 56]. Figure 1 illustrates this



phenomenon by depicting a pair of reoccurring earthquakes that are two months apart, observed at three seismic stations in New Zealand. By applying LSH to identify similar waveforms from seismic data, seismologists were able to discover new, low-magnitude earthquakes without knowledge of prior earthquake events.

Despite early successes, seismologists had difficulty scaling their LSH-based analysis beyond 3 months of time series data (7.95×10^8 data points) at a single seismic station [24]. The FAST implementation faces severe scalability challenges. Contrary to what LSH theory suggests, the actual LSH runtime in FAST grows near quadratically with the input size due to correlations in the seismic signals: in an initial performance benchmark, the similarity search took 5 CPU-days to process 3 months of data, and, with a 5× increase in dataset size, LSH query time increased by 30×. In addition, station-specific repeated background noise leads to an overwhelming number of similar but non-earthquake time series matches, both crippling throughput and seismologists' ability to sift through the output, which can number in the hundreds of millions of events. Ultimately, these scalability bottlenecks prevented seismologists from making use of the decades of data at their disposal.

In this paper, we show how systems, algorithms, and domain expertise can go hand-in-hand to deliver substantial scalability improvements for this seismological analysis. Via algorithmic design, optimization using domain knowledge, and data engineering, we scale the FAST workload to years of continuous data at multiple stations. In turn, this scalability has enabled new scientific discoveries, including previously unknown earthquakes near a nuclear reactor in San Luis Obispo, California, and in New Zealand.

Specifically, we build a scalable end-to-end earthquake detection pipeline comprised of three main steps. First, the fingerprint extraction step encodes time-frequency features of the original time series into compact binary fingerprints that are more robust to small variations. To address the bottleneck caused by repeating non-seismic signals, we apply domain-specific filters based on the frequency bands and the frequency of occurrences of earthquakes. Second, the search step applies LSH on the binary fingerprints to identify all pairs of similar time series segments. We pinpoint high hash collision rates caused by physical correlations in the input data as a core culprit of LSH performance degradation and alleviate the impact of large buckets by increasing hash selectivity while keeping the detection threshold constant. Third, the alignment step significantly reduces the size of detection results and confirms seismic behavior by performing spatiotemporal correlation with nearby seismic stations in the network [14]. To scale this analysis, we leverage domain knowledge of the invariance of the time difference between a pair of earthquake events across all stations at which they are recorded.

In summary, as an innovative systems and applications paper, this work makes several contributions:

• We report on a new application of LSH in seismology as well as a complete end-to-end data science pipeline, including non-trivial pre-processing and post-processing, that scales to a decade of continuous time series for earthquake detection.

• We present a case study for using domain knowledge to improve the accuracy and efficiency of the pipeline. We illustrate how applying seismological domain knowledge in each component of the pipeline is critical to scalability.

• We demonstrate that our optimizations enable a cumulative two order-of-magnitude speedup in the end-to-end detection pipeline. These quantitative improvements enable qualitative discoveries: we discovered 597 new earthquakes near the Diablo Canyon nuclear power plant in California and 6123 new earthquakes in New Zealand, allowing seismologists to determine the size and shape of nearby fault structures.

Beyond these contributions to a database audience, our solution is an open source tool, available for use by the broader scientific community. We have already run workshops for seismologists at Stanford [2] and believe that the pipeline can not only facilitate targeted seismic analysis but also contribute to the label generation for supervised methods in seismic data [50].

The rest of the paper proceeds as follows. We review background information about earthquake detection in Section 2 and discuss additional related work in Section 3. We give a brief overview of the end-to-end detection pipeline and key technical challenges in Section 4. Sections 5, 6 and 7 present details as well as optimizations in the fingerprint extraction, similarity search and spatiotemporal alignment steps of the pipeline. We perform a detailed evaluation on both the quantitative performance improvements of our optimizations as well as qualitative results of new seismic findings in Section 8. In Section 9, we reflect on lessons learned and conclude.

2. BACKGROUND

With the deployment of denser and increasingly sensitive sensor arrays, seismology is experiencing a rapid growth of high-resolution data [30]. Seismic networks with up to thousands of sensors have been recording years of continuous seismic data streams, typically at 100Hz frequencies. The rising data volume has fueled strong interest in the seismology community to develop and apply scalable data-driven algorithms that improve the monitoring and prediction of earthquake events [21, 40, 42].

In this work, we focus on the problem of detecting new, low-magnitude earthquakes from historical seismic data. Earthquakes, which are primarily caused by the rupture of geological faults, radiate energy that travels through the Earth in the form of seismic waves. Seismic waves induce ground motion that is recorded by seismometers. Modern seismometers typically include 3 components that measure simultaneous ground motion along the north-south, east-west, and vertical axes. Ground motions along each of these three axes are recorded as a separate channel of time series data.

Channels capture complementary signals for different seismic waves, such as the P-wave and the S-wave. The P-waves travel along the direction of propagation, like sound, while the S-waves travel perpendicular to the direction of propagation, like ocean waves. The vertical channel, therefore, better captures the up and down motions caused by the P-waves while the horizontal channels better capture the side to side motions caused by the S-waves. P-waves travel the fastest and are the first to arrive at seismic stations, followed by the slower but usually larger amplitude S-waves. Hence, the P-wave and S-wave of an earthquake typically register as two "big wiggles" on the ground motion measurements (Figure 1). These impulsive arrivals of seismic waves are example characteristics of earthquakes that seismologists look for in the data.

While it is easy for human eyes to identify large earthquakes on a single channel, accurately detecting small earthquakes usually requires looking at data from multiple channels or stations. These low-magnitude earthquakes pose challenges for conventional methods for detection, which we outline below. Traditional energy-based earthquake detectors such as a short-term average (STA)/long-term average (LTA) identify earthquake events by their impulsive, high signal-to-noise P-wave and S-wave arrivals. However, these detectors are prone to high false positive and false negative rates at low magnitudes, especially with noisy backgrounds [28]. Template matching, or the waveform cross-correlation with template waveforms of known earthquakes, has proven more effective for detecting known seismic signals in noisy data [15, 57]. However, the method relies on template waveforms of prior events and is not suitable for discovering events from unknown sources.



As a result, almost all earthquakes greater than magnitude 5 are detected [26]. In comparison, an estimated 1.5 million earthquakes with magnitude between 2 and 5 are not detected by conventional means, and 1.3 million of these are between magnitude 2 and 2.9. The estimate is based on the magnitude frequency distribution of earthquakes [31]. We are interested in detecting these low-magnitude earthquakes missing from public earthquake catalogs to better understand earthquake mechanics and sources, which inform seismic hazard estimates and prediction [32, 38, 49, 58].

The earthquake detection pipeline we study in the paper is an unsupervised and data-driven approach that does not rely on supervised (i.e., labeled) examples of prior earthquake events, and is designed to complement existing, supervised detection methods. As in template matching, the method we optimize takes advantage of the high similarity between waveforms generated by reoccurring earthquakes. However, instead of relying on waveform templates from only known events, the pipeline leverages the recurring nature of seismic activities to detect similar waveforms in time and across stations. To do so, the pipeline performs an all-pair time series similarity search, treating each segment of the input waveform data as a "template" for potential earthquakes. This pipeline will not detect an earthquake that occurs only once and is not similar enough to any other earthquakes in the input data. Therefore, to improve detection recall, it is critical to be able to scale the analysis to input data with a longer duration (e.g., years instead of weeks or months).

3. RELATED WORK

In this section, we address related work in earthquake detection, LSH-based applications and time series similarity search.

Earthquake Detection. The original FAST work appeared in the seismology community, and has proven a useful tool in scientific discovery [24, 25]. In this paper, we present FAST to a database audience for the first time, and report on both the pipeline composition and optimization from a computational perspective. The results presented in this paper are the result of over a year of collaboration between our database research group and the Stanford earthquake seismology research group. The optimizations we present in this paper and the resulting scalability results of the optimized pipeline have not previously been published. We believe this represents a useful and innovative application of LSH to a real domain science tool that will be of interest to both the database community and researchers of LSH and time-series analytics.

The problem of earthquake detection is decades old [6], and many classic techniques—many of which are in use today—were developed for an era in which humans manually inspected seismographs for readings [35, 66]. With the rise of machine learning and large-scale data analytics, there has been increasing interest in further automating these techniques. While FAST is optimized to find many small-scale earthquakes, alternative approaches in the seismology community utilize template matching [15, 57], social media [54], and machine learning techniques [8, 64]. Most recently, with sufficient training data, supervised approaches have shown promising results of being able to detect non-repeating earthquake events [50]. In contrast, our LSH-based detection method does not rely on labeled earthquake events and detects reoccurring earthquake events. In the evaluation, we compare against two supervised methods [50, 55] and show that our unsupervised pipeline is able to detect qualitatively different events from the existing earthquake catalog.

Locality Sensitive Hashing. In this work, we perform a detailed case study of the practical challenges and the domain-specific solutions of applying LSH to the field of seismology. We do not contribute to the advance of the state-of-the-art LSH algorithms; instead, we show that classic LSH techniques, combined with domain-specific optimizations, can lead to scientific discoveries when applied at scale. Existing work shows that LSH performance is sensitive to key parameters such as the number of hash functions [23, 52]; we provide supporting evidence and analysis on the performance implication of LSH parameters in our application domain. In addition to the core LSH techniques, we also present nontrivial preprocessing and postprocessing steps that enable an end-to-end detection pipeline, including spatiotemporal alignment of LSH matches.

Our work targets CPU workloads, complementing existing efforts that speed up similarity search on GPUs [34]. To preserve the integrity of the established science pipeline, we focus on optimizing the existing MinHash based LSH rather than replacing it with potentially more efficient LSH variants such as LSH forest [10] and multi-probe LSH [45]. While we share observations with prior work that parallelizes and distributes a different LSH family [61], we present the unique challenges and opportunities of optimizing MinHash LSH in our application domain. We provide performance benchmarks against alternative similarity search algorithms in the evaluation, such as set similarity joins [47] and an alternative LSH library based on recent theoretical advances in LSH for cosine similarity [7]. We believe the resulting experience report, as well as our open source implementation, will be valuable to researchers developing LSH techniques in the future.

Time Series Analytics. Time series analytics is a core topic in large-scale data analytics and data mining [39, 44, 68]. In our application, we utilize time series similarity search as a core workhorse for earthquake detection. There are a number of distance metrics for time series [22], including Euclidean distance and its variants [69], Dynamic Time Warping [51], and edit distance [62]. However, our input time series from seismic sensors is high frequency (e.g. 100Hz) and often noisy. Therefore, small time-shifts, outliers and scaling can result in large changes in time-domain metrics [19]. Instead, we encode time-frequency features of the input time series into binary vectors and focus on the Jaccard similarity between the binary feature vectors. This feature extraction procedure is an adaptation of the Waveprint algorithm [9] initially designed for audio data; the key modification made for seismic data was to focus on frequency features that are the most discriminative from background noise, such that the average similarity between non-seismic signals is reduced [13]. An alternative binary representation models time series as points on a grid, and uses the non-empty grid cells as a set representation of the time series [48]. However, this representation does not take advantage of the physical properties distinguishing background from seismic signals.

4. PIPELINE OVERVIEW

In this section, we provide an overview of the three main steps of our end-to-end detection pipeline. We elaborate on each step—and our associated optimizations—in later sections, referenced inline.

The input of the detection pipeline consists of continuous ground motion measurements in the form of time series, collected from multiple stations in the seismic network. The output is a list of potential earthquakes, specified in the form of timestamps when the seismic wave arrives at each station. From there, seismologists can compare with public earthquake catalogs to identify new events, and visually inspect the measurements to confirm seismic findings.

Figure 2 illustrates the three major components of the end-to-end detection pipeline: fingerprint extraction, similarity search, and spatiotemporal alignment. For each input time series, or continuous ground motion measurements from a seismic channel, the algorithm



Figure 2: The three steps of the end-to-end earthquake detection pipeline: fingerprinting transforms time series into binary vectors (Section 5); similarity search identifies pairs of similar binary vectors (Section 6); alignment aggregates and reduces false positives in results (Section 7).

slices the input into short windows of overlapping time series segments and encodes time-frequency features of each window into a binary fingerprint; the similarity of the fingerprints resembles that of the original waveforms (Section 5). The algorithm then performs an all pairs similarity search via LSH on the binary fingerprints and identifies pairs of highly similar fingerprints (Section 6). Finally, like a traditional associator that maps earthquake detections at each station to a consistent seismic source, in the spatiotemporal alignment stage, the algorithm combines, filters and clusters the outputs from all seismic channels to generate a list of candidate earthquake detections with high confidence (Section 7).

A naïve implementation of the pipeline imposes several scalability challenges. For example, we observed LSH performance degradation in our application caused by the non-uniformity and correlation in the binary fingerprints; the correlations induce undesired LSH hash collisions, which significantly increase the number of lookups per similarity search query (Section 6.3). In addition, the similarity search does not distinguish seismic from non-seismic signals. In the presence of repeating background signals, similar noise waveforms could outnumber similar earthquake waveforms, leading to more than an order of magnitude slowdown in runtime and increase in output size (Section 6.5). As the input time series and the output of the similarity search become larger, the pipeline must adapt to data sizes that are too large to fit into main memory (Sections 6.4, 7.2).

In this paper, we focus on single-machine, main-memory execution on commodity servers with multicore processors. We parallelize the pipeline within a given server but otherwise do not distribute the computation to multiple servers. In principle, the parallelization efforts extend to distributed execution. However, given the poor quadratic scalability of the unoptimized pipeline, distribution alone would not have been a viable option for scaling to the desired data volume. As a result of the optimizations described in this paper, we are able to scale to a decade of data on a single node without requiring distribution. However, we view distributed execution as a valuable extension for future work.

In the remaining sections of this paper, we describe the design decisions as well as performance optimizations for each pipeline component. Most of our optimizations focus on the all pairs similarity search, where the initial implementation exhibited near quadratic growth in runtime with the input size. We show in the evaluation that these optimizations enable speedups of more than two orders of magnitude in the end-to-end pipeline.

5. FINGERPRINT EXTRACTION

In this section, we describe the fingerprint extraction step that encodes time-frequency features of the input time series into compact binary vectors for similarity search. We begin with an overview of the fingerprinting algorithm [13] and the benefits of using fingerprints in place of the time series (Section 5.1). We then describe a new optimization that parallelizes and accelerates the fingerprint generation via sampling (Section 5.2).

Figure 3: The fingerprinting algorithm encodes time-frequency features of the original time series into binary vectors.

5.1 Fingerprint Overview

Inspired by the success of feature extraction techniques for indexing audio snippets [13], the fingerprint extraction step transforms continuous time series data into compact binary vectors (fingerprints) for similarity search. Each fingerprint encodes representative time-frequency features of the time series. The Jaccard similarity of two fingerprints, defined as the size of the intersection of the non-zero entries divided by the size of the union, preserves the waveform similarity of the corresponding time series segments. Compared to directly computing similarity on the time series, fingerprinting introduces frequency-domain features into the detection and provides additional robustness against translation and small variations [13].

Figure 3 illustrates the individual steps of fingerprinting (a short code sketch follows the list):

1. Spectrogram. Compute the spectrogram, a time-frequency representation, of the time series. Slice the spectrogram into short overlapping segments using a sliding window and smooth by downsampling each segment into a spectral image.

2. Wavelet Transform. Compute the two-dimensional discrete Haar wavelet transform on each spectral image. The wavelet coefficients serve as a lossy compression of the spectral images.

3. Normalization. Normalize each wavelet coefficient by its median and the median absolute deviation (MAD) on the full, background-dominated dataset.

4. Top coefficient. Extract the top K most anomalous wavelet coefficients, or the largest coefficients after MAD normalization, from each spectral image. By selecting the most anomalous coefficients, we focus only on coefficients that are most distinct from coefficients that characterize noise, which empirically leads to better detection results.

5. Binarize. Binarize the signs and positions of the top wavelet coefficients. We encode the sign of each normalized coefficient using 2 bits: −1 → 01, 0 → 00, 1 → 10.
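A minimal sketch of these five steps, assuming the Python scientific stack the paper mentions (scipy, numpy, PyWavelets); the window sizes, top-K value, and function names here are illustrative rather than the pipeline's actual parameters:

```python
import numpy as np
import pywt
from scipy import signal


def fingerprint(trace, med, mad, fs=100, k_top=200):
    """Illustrative fingerprint extraction: spectrogram -> Haar wavelet ->
    MAD normalization -> top-K coefficients -> binary encoding.
    med/mad are the dataset-wide per-coefficient median and MAD."""
    # 1. Spectrogram of the time series (window sizes are illustrative).
    _, _, spec = signal.spectrogram(trace, fs=fs, nperseg=64, noverlap=32)

    fingerprints = []
    width = 32  # spectral-image width in spectrogram frames (illustrative)
    for start in range(0, spec.shape[1] - width, width // 2):
        image = spec[:, start:start + width]
        # 2. Two-dimensional discrete Haar wavelet transform (lossy compression).
        cA, (cH, cV, cD) = pywt.dwt2(image, 'haar')
        coeffs = np.concatenate([c.ravel() for c in (cA, cH, cV, cD)])
        # 3. Normalize each coefficient by the dataset-wide median and MAD.
        norm = (coeffs - med) / mad
        # 4. Keep the K most anomalous (largest-magnitude) normalized coefficients.
        top = np.argsort(-np.abs(norm))[:k_top]
        signs = np.sign(norm[top]).astype(int)
        # 5. Binarize sign and position: -1 -> 01, 0 -> 00, 1 -> 10.
        bits = np.zeros(2 * coeffs.size, dtype=np.uint8)
        bits[2 * top] = signs == 1
        bits[2 * top + 1] = signs == -1
        fingerprints.append(bits)
    return fingerprints
```

The Jaccard similarity of two such binary fingerprints a and b can then be computed as np.count_nonzero(a & b) / np.count_nonzero(a | b).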



5.2 Optimization: MAD via sampling

The fingerprint extraction is implemented via scientific modules such as scipy, numpy and PyWavelets in Python. While its runtime grows linearly with input size, fingerprinting ten years of time series data can take several days on a single core.

In the unoptimized procedure, normalizing the wavelet coefficients requires two full passes over the data. The first pass calculates the median and the MAD (for X = {x1, x2, ..., xn}, MAD = median(|xi − median(X)|), the median of the absolute deviations from the median) for each wavelet coefficient over the whole population, and the second pass normalizes the wavelet representation of each fingerprint accordingly. Given the median and MAD for each wavelet coefficient, the input time series can be partitioned and normalized in parallel. Therefore, the computation of the median and MAD remains the runtime bottleneck.

We accelerate the computation by approximating the true median and MAD with statistics calculated from a small random sample of the input data. The confidence interval for MAD with a sample size of n shrinks with n^{1/2} [59]. We further investigate the trade-off between speed and accuracy under different sampling rates in the evaluation (Section 8.3). We empirically find that, on one month of input time series data, sampling provides an order of magnitude speedup with almost no loss in accuracy. For input time series of longer duration, sampling 1% or less of the input can suffice.
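A sketch of the sampling-based approximation (function name and the default 1% rate are illustrative; the actual rates are evaluated in Section 8.3):

```python
import numpy as np


def sampled_median_mad(wavelet_coeffs, sample_frac=0.01, seed=0):
    """Estimate the per-coefficient median and MAD from a random sample of
    spectral images instead of a full pass over the data.

    wavelet_coeffs: array of shape (n_images, n_coeffs)."""
    rng = np.random.default_rng(seed)
    n = wavelet_coeffs.shape[0]
    idx = rng.choice(n, size=max(1, int(sample_frac * n)), replace=False)
    sample = wavelet_coeffs[idx]
    med = np.median(sample, axis=0)
    mad = np.median(np.abs(sample - med), axis=0)
    return med, mad
```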

6. LSH-BASED SIMILARITY SEARCH

In this section, we present the time series similarity search step based on LSH. We start with a description of the algorithm and the baseline implementation (Section 6.1), upon which we build the optimizations. Our contributions include: an optimized hash signature generation procedure (Section 6.2), an empirical analysis of the impact of hash collisions and LSH parameters on query performance (Section 6.3), partition and parallelization of LSH that reduce the runtime and memory usage (Section 6.4), and finally, two domain-specific filters that improve both the performance and detection quality of the search (Section 6.5).

6.1 Similarity Search Overview

Reoccurring earthquakes originating from nearby seismic sources appear as near-identical waveforms at the same seismic station. Given continuous ground motion measurements from a seismic station, our pipeline identifies similar time series segments from the input as candidates for reoccurring earthquake events.

Concretely, we perform an approximate similarity search via MinHash LSH on the binary fingerprints to identify all pairs of fingerprints whose Jaccard similarity exceeds a predefined threshold [17]. MinHash LSH performs a random projection of high-dimensional data into lower dimensional space, hashing similar items to the same hash table "bucket" with high probability (Figure 4). Instead of performing naïve pairwise comparisons between all fingerprints, LSH limits the comparisons to fingerprints sharing the same hash bucket, significantly reducing the computation. The ratio of the average number of comparisons per query to the size of the dataset, or selectivity, is a machine-independent proxy for query efficiency [23].

Hash signature generation. The MinHash of a fingerprint is the first non-zero element of the fingerprint under a given random permutation of its elements. The permutation is defined by a hash function mapping fingerprint elements to random indices. Let p denote the collision probability of a hash signature generated with a single hash function. By increasing the number of hash functions k, the collision probability of the hash signature decreases to p^k [43].
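For concreteness, a MinHash signature can be computed from random permutations of the element indices; this is an illustrative sketch, not the pipeline's actual code, and the dimension and number of hash functions below are assumptions:

```python
import numpy as np


def minhash_signature(fingerprint_bits, hash_mappings):
    """MinHash: for each random mapping of element indices, take the smallest
    mapped index among the fingerprint's non-zero elements.

    fingerprint_bits: 1-D binary array.
    hash_mappings: array of shape (n_funcs, n_dims) of random indices."""
    nonzero = np.flatnonzero(fingerprint_bits)
    return hash_mappings[:, nonzero].min(axis=1)


# Grouping k MinHash values into one signature lowers the collision
# probability of that signature from p to roughly p^k.
rng = np.random.default_rng(0)
n_funcs, n_dims = 64, 8192
hash_mappings = np.argsort(rng.random((n_funcs, n_dims)), axis=1)
```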



Figure 4: Locality-sensitive hashing hashes similar items to the same hash "bucket" with high probability.

Hash table construction. Each hash table stores an independent mapping of fingerprints to hash buckets. The tables are initialized by mapping hash signatures to a list of fingerprints that share the same signature. Empirically, we find that using t = 100 hash tables suffices for our application, and there is little gain in further increasing the number of hash tables.

Search. The search queries the hash tables for each fingerprint's near neighbor candidates, or other fingerprints that share the query fingerprint's hash buckets. We keep track of the number of times the query fingerprint and candidates have matching hash signatures in the hash tables, and output candidates with matches above a predefined threshold. The number of matches is also used as a proxy for the confidence of the similarity in the final step of the pipeline.
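A minimal sketch of the construction and query steps described above (build_tables and query are hypothetical helper names, not the pipeline's API, and the match threshold is illustrative):

```python
from collections import Counter, defaultdict


def build_tables(signatures):
    """signatures[i] is a list of t hashable per-table signatures for
    fingerprint i (each signature built from k MinHash values)."""
    t = len(signatures[0])
    tables = [defaultdict(list) for _ in range(t)]
    for fp_id, sig in enumerate(signatures):
        for table, s in zip(tables, sig):
            table[s].append(fp_id)
    return tables


def query(fp_id, signatures, tables, min_matches=5):
    """Count how many tables each candidate shares with the query fingerprint
    and keep candidates matching in at least min_matches tables."""
    counts = Counter()
    for table, s in zip(tables, signatures[fp_id]):
        for cand in table.get(s, ()):
            if cand != fp_id:
                counts[cand] += 1
    return {cand: m for cand, m in counts.items() if m >= min_matches}
```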

6.2 Optimization: Hash signature generation

In this subsection, we present both memory access pattern and algorithmic improvements to speed up the generation of hash signatures. We show that, together, the optimizations lead to an over 3× improvement in hash generation time (Section 8.1).

Similar to observations made for SimHash (a different hash family for angular distances) [61], a naïve implementation of the MinHash generation can suffer from poor memory locality due to the sparsity of input data. SimHash functions are evaluated as a dot product between the input and hash mapping vectors, while MinHash functions are evaluated as a minimum of hash mappings corresponding to non-zero elements of the input. For sparse input, both functions access scattered, non-contiguous elements in the hash mapping vector, causing an increase in cache misses. We improve the memory access pattern by blocking the access to the hash mappings. We use dimensions of the fingerprint, rather than hash functions, as the main loop for each fingerprint. As a result, the lookups for each non-zero element in the fingerprint are blocked into rows in the hash mapping array. For our application, this loop order has the additional advantage of exploiting the high overlap (e.g. over 60% in one example) between neighboring fingerprints. The overlap means that previously accessed elements in hash mappings are likely to get reused while in cache, further improving the memory locality.

In addition, we speed up the hash signature generation by replacing MinHash with Min-Max hash. MinHash only keeps the minimum value for each hash mapping, while Min-Max hash keeps both the min and the max. Therefore, to generate hash signatures with similar collision probability, Min-Max hash reduces the number of required hash functions by half. Previous work showed that Min-Max hash is an unbiased estimator of pairwise Jaccard similarity, and achieves similar and sometimes smaller mean squared error (MSE) in estimating pairwise Jaccard similarity in practice [33]. We include pseudocode for the optimized hash signature calculation in Appendix D of the extended technical report [53].
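A sketch of the blocked loop order combined with Min-Max hash (the array layout and names are illustrative assumptions; the actual pseudocode is in Appendix D of the technical report [53]):

```python
import numpy as np


def minmax_signatures(fingerprint_bits, hash_mappings):
    """hash_mappings is an integer array of shape (n_dims, n_funcs), so each
    non-zero dimension of the fingerprint reads one contiguous row; looping
    over dimensions in the outer loop keeps those rows hot in cache across
    overlapping, neighboring fingerprints."""
    n_funcs = hash_mappings.shape[1]
    mins = np.full(n_funcs, np.iinfo(np.int64).max)
    maxs = np.full(n_funcs, -1)
    for d in np.flatnonzero(fingerprint_bits):
        row = hash_mappings[d]              # one contiguous row per dimension
        np.minimum(mins, row, out=mins)
        np.maximum(maxs, row, out=maxs)
    # Min-Max hash keeps both extremes, so about half as many hash functions
    # give collision behavior comparable to plain MinHash.
    return np.concatenate([mins, maxs])
```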

6.3 Optimization: Alleviating hash collisions

Perhaps surprisingly, our initial LSH implementation demonstrated poor scaling with the input size: with a 5× increase in input, the runtime increases by 30×. In this subsection, we analyze the cause of LSH performance degradation and the performance implications of core LSH parameters in our application.




Figure 5: Probability that each element in the fingerprint is equal to 1, averaged over 15.7M fingerprints, each of dimension 8192, generated from a year of time series data. The heatmap shows that some elements of the fingerprint are much more likely to be non-zero compared to others.

Cause of hash collisions. Poor distribution of hash signatures can lead to large LSH hash buckets or high query selectivity, significantly degrading the performance of the similarity search [10, 36]. For example, in the extreme case when all fingerprints are hashed into a single bucket, the selectivity equals 1 and the LSH performance is equivalent to that of the exhaustive O(n^2) search.

Our input fingerprints encode physical properties of the waveform data. As a result, the probability that each element in the fingerprint is non-zero is highly non-uniform (Figure 5). Moreover, fingerprint elements are not necessarily independent, meaning that certain fingerprint elements are likely to co-occur: given an element a_i is non-zero, the element a_j has a much higher probability of being non-zero (P[a_i = 1, a_j = 1] > P[a_i = 1] × P[a_j = 1]).

This correlation has a direct impact on the collision probability of MinHash signatures. For example, if a hash signature contains k independent MinHashes of a fingerprint and two of the non-zero elements responsible for the MinHash are dependent, then the signature has effectively similar collision probability as a signature with only k − 1 MinHashes. In other words, more fingerprints are likely to be hashed to the same bucket under this signature. For fingerprints shown in Figure 5, the largest 0.1% of the hash buckets contain an average of 32.9% of the total fingerprints for hash tables constructed with 6 hash functions.

Performance impact of LSH parameters. The precision and recall of the LSH can be tuned via two key parameters: the number of hash functions k and the number of hash table matches m. Intuitively, using k hash functions is equivalent to requiring two fingerprints to agree at k randomly selected non-zero positions. Therefore, the larger the number of hash functions, the lower the probability of collision. To improve recall, we increase the number of independent permutations to make sure that similar fingerprints can land in the same hash bucket with high probability.

Formally, given two fingerprints with Jaccard similarity s, the probability that with k hash functions, the fingerprints are hashed to the same bucket at least m times out of t = 100 hash tables is:

P[s] = 1 - \sum_{i=0}^{m-1} \binom{t}{i} (1 - s^k)^{t-i} (s^k)^i

The probability of detection success as a function of Jaccard similarity has the form of an S-curve (Figure 6). The S-curve shifts to the right with the increase in the number of hash functions k or the number of matches m, increasing the Jaccard similarity threshold for LSH. Figure 6 illustrates near-identical probability of success curves under different parameter settings.
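This S-curve follows directly from the expression above; a short sketch using scipy's binomial survival function (the function name is an illustrative assumption):

```python
from scipy.stats import binom


def detection_probability(s, k, m, t=100):
    """Probability that two fingerprints with Jaccard similarity s collide in
    at least m of t hash tables, each table using k MinHash functions."""
    p_table = s ** k                      # single-table collision probability
    return binom.sf(m - 1, t, p_table)    # P[#collisions >= m]
```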

Due to the presence of correlations in the input data, LSH parameters with the same theoretical success probability can have vastly different runtimes in practice. Specifically, as the number of


Figure 6: Theoretical probability of a successful search versus Jaccard similarity between fingerprints (k: number of hash functions, m: number of matches; settings shown: k=4, m=5; k=6, m=5; k=7, m=3; k=8, m=2; k=10, m=2). Different LSH parameter settings can have near identical detection probability with vastly different runtime.

hash functions increases, the expected average size of hash buckets decreases, which can lead to an order of magnitude speedup in the similarity search for seismic data in practice. However, to keep the success probability curve constant with increased hash functions, the number of matches needs to be lowered, which increases the probability of spurious matches. These spurious matches can be suppressed by scaling up the number of total hash tables, at the cost of larger memory usage. We further investigate the performance impact of LSH parameters in the evaluation.

6.4 Optimization: Partitioning

In this subsection, we describe the partition and parallelization of the LSH that further reduce its runtime and memory footprint.

Partition. Using a 1-second lag for adjacent fingerprints results in around 300M total fingerprints for 10 years of time series data. Given a hash signature of 64 bits and 100 total hash tables, the total size of hash signatures is approximately 250 GB. To avoid expensive disk I/O, we also want to keep all hash tables in memory for lookups. Taken together, this requires several hundred gigabytes of memory, which can exceed available main memory.

To scale to larger input data on a single node with the existing LSH implementation, we perform similarity search in partitions. We evenly partition the fingerprints and populate the hash tables with one partition at a time, while still keeping the lookup table of fingerprints to hash signatures in memory. During query, we output matches between fingerprints in the current partition (or in the hash tables) with all other fingerprints and subsequently repeat this process for each partition. The partitioned search yields identical results to the original search, with the benefit that only a subset of the fingerprints are stored in the hash tables in memory. We can partition the lookup table of hash signatures similarly to further reduce memory. We illustrate the performance and memory trade-offs under different numbers of partitions in Section 8.3.
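A sketch of the partitioned loop, reusing the hypothetical build_tables helper from the Section 6.1 sketch (partition bounds and the match threshold are illustrative):

```python
from collections import Counter


def partitioned_search(signatures, n_partitions, min_matches=5):
    """Populate the hash tables with one partition of fingerprints at a time
    and query every fingerprint against them; only the current partition's
    tables live in memory, and the union of results over all partitions
    matches the unpartitioned search."""
    n = len(signatures)
    for p in range(n_partitions):
        lo, hi = p * n // n_partitions, (p + 1) * n // n_partitions
        tables = build_tables(signatures[lo:hi])   # partition-local ids 0..hi-lo-1
        for fp_id in range(n):
            counts = Counter()
            for table, s in zip(tables, signatures[fp_id]):
                for local_id in table.get(s, ()):
                    cand = lo + local_id           # map back to a global id
                    if cand != fp_id:
                        counts[cand] += 1
            matches = {c: m for c, m in counts.items() if m >= min_matches}
            if matches:
                yield fp_id, matches
```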

The idea of populating the hash table with a subset of the input could also be favorable for performing a small number of nearest neighbor queries on a large dataset, e.g., a thousand queries on a million items. There are two ways to execute the queries. We can hash the full dataset and then perform a thousand queries to retrieve near neighbor candidates in each query item's hash buckets; alternatively, we can hash only the query items and, for every other item in the dataset, check whether it is mapped to an existing bucket in the table. While the two methods yield identical query results, the latter could be 8.6× faster since the cost of initializing the hash table dominates that of the search.

It is possible to further improve LSH performance and memory usage with more space-efficient variants such as multi-probe LSH [45]. However, given that the alignment step uses the number of hash buckets shared between fingerprints as a proxy for similarity, and that switching to a multi-probe implementation would alter this similarity measure, we preserve the original LSH implementation for




Figure 7: The short, three-spike pattern is an example of similar and repeating background signals not due to seismic activity. These repeating noise patterns cause scalability challenges for LSH.

backwards compatibility with FAST. We compare against alternative LSH implementations and demonstrate the potential benefits of adopting multi-probe LSH in the evaluation (Section 8.4).

Parallelization. Once the hash mappings are generated, we can easily partition the input fingerprints and generate the hash signatures in parallel. Similarly, the query procedure can be parallelized by running nearest neighbor queries for different fingerprints and outputting results to files in parallel. We show in Section 8.3 that the total hash signature generation time and similarity search time reduce near linearly with the number of processes.
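For instance, signature generation can be spread over a process pool, since the hash mappings are read-only; a sketch reusing the hypothetical minmax_signatures helper from the Section 6.2 sketch (process count and chunk size are illustrative):

```python
from functools import partial
from multiprocessing import Pool


def all_signatures(fingerprints, hash_mappings, n_procs=8):
    """Compute hash signatures for fingerprints in parallel; workers share
    nothing but the read-only hash mappings, so no coordination is needed."""
    worker = partial(minmax_signatures, hash_mappings=hash_mappings)
    with Pool(n_procs) as pool:
        return pool.map(worker, fingerprints, chunksize=1024)
```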

6.5 Optimization: Domain-specific filters

Like many other sensor measurements, seismometer readings can be noisy. In this subsection, we address a practical challenge of the detection pipeline, where similar non-seismic signals dominate seismic findings in runtime and detection results. We show that by leveraging domain knowledge, we can greatly increase both the efficiency and the quality of the detection.

Filtering irrelevant frequencies. Some input time series contain station-specific narrow-band noise that repeats over time. Patterns of the repeating noise are captured in the fingerprints and are identified as near neighbors, or earthquake candidates, in the similarity search.

To address this problem, we apply a bandpass filter to exclude frequency bands that show high average amplitudes and repeating patterns while containing low seismic activities. The bandpass filter is selected manually by examining short spectrogram samples, typically an hour long, of the input time series, based on seismological knowledge. Typical bandpass filter ranges span from 2 to 20Hz. Prior work [13, 14, 24, 25] proposes the idea of filtering irrelevant frequencies, but only on input time series. We extend the filter to the fingerprinting algorithm and cut off spectrograms at the corners of the bandpass filter, which empirically improves detection performance. We perform a quantitative evaluation of the impact of bandpass filters on both the runtime and result quality (Section 8.2).
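A sketch of such a filter with scipy (the 2-20 Hz corners and the filter order here are illustrative; in practice the corners are chosen per station from spectrogram inspection):

```python
from scipy.signal import butter, sosfiltfilt


def bandpass(trace, fs=100.0, low=2.0, high=20.0, order=4):
    """Zero-phase Butterworth bandpass that suppresses station-specific
    narrow-band noise outside the seismically informative band."""
    sos = butter(order, [low, high], btype='bandpass', fs=fs, output='sos')
    return sosfiltfilt(sos, trace)
```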

Removing correlated noise. Repeating non-seismic signals can also occur in frequency bands containing rich earthquake signals. Figure 7 shows an example of strong repeating background signals from a New Zealand seismic station. A large cluster of repeating signals with high pairwise similarity could produce nearest neighbor matches that dominate the similarity search, leading to a 10× increase in runtime and an over 100× increase in output size compared to results from similar stations. This poses problems both for computational scalability and for seismological interpretability.

We develop an occurrence filter for the similarity search by exploiting the rarity of the earthquake signals. Specifically, if a specific fingerprint is generating too many nearest neighbor matches in a short duration of time, we can be fairly confident that it is not an earthquake signal. This observation holds in general except for special scenarios such as volcanic earthquakes [12].

During the similarity search, we dynamically generate a list of fingerprints to exclude from future search. If the number of near neighbor candidates a fingerprint generates is larger than a predefined percentage of the total fingerprints, we exclude this fingerprint as well as its neighbors from future similarity search. To capture repeating noise over a short duration of time, the filter can be applied on top of the partitioned search. In this case, the filtering threshold is defined as the percentage of fingerprints in the current partition, rather than in the whole dataset. On the example dataset above, this approach filtered out around 30% of the total fingerprints with no false positives. We evaluate the effect of the occurrence filter on different datasets under different filtering thresholds in Section 8.2.

7. SPATIOTEMPORAL ALIGNMENT

The LSH-based similarity search outputs pairs of similar fingerprints (or waveforms) from the input, without knowing whether or not the pairs correspond to actual earthquake events. In this section, we show that by incorporating domain knowledge, we are able to significantly reduce the size of the output and prioritize seismic findings in the similarity search results. We briefly summarize the aggregation and filtering techniques on the level of seismic channels, seismic stations and seismic networks introduced in a recent paper in seismology [14] (Section 7.1). We then describe the implementation challenges and our out-of-core adaptations enabling the algorithm to scale to large output volumes (Section 7.2).

7.1 Alignment Overview

The similarity search computes a sparse similarity matrix M, where the non-zero entry M[i, j] represents the similarity of fingerprints i and j. In order to identify weak events in low signal-to-noise ratio settings, seismologists set lenient detection thresholds for the similarity search, resulting in large outputs in practice. For example, one year of input time series data can easily generate 100 GB of output, or more than 5 billion pairs of similar fingerprints. Since it is infeasible for seismologists to inspect all results manually, we need to automatically filter and align the similar fingerprint pairs into a list of potential earthquakes with high confidence. Based on algorithms proposed in a recent work in seismology [14], we seek to reduce similarity search results at the level of seismic channels, stations and also across a seismic network. Figure 8 gives an overview of the spatiotemporal alignment procedure.

Channel Level. Seismic channels at the same station experience ground movements at the same time. Therefore, we can directly merge detection results from each channel of the station by summing the corresponding similarity matrices. Given that earthquake-triggered fingerprint matches tend to register at multiple channels whereas matches induced by local noise might only appear on one channel, we can prune detections by imposing a slightly higher similarity threshold on the combined similarity matrix. This is to make sure that we include either matches with high similarity, or weaker matches registered at more than one channel.

Station Level. Given a combined similarity matrix for each seismic station, domain scientists have found that earthquake events can be characterized by thin diagonal-shaped clusters in the matrix, which correspond to groups of similar fingerprint pairs separated by a constant offset [14]. The constant offset represents the time difference, or the inter-event time, between a pair of reoccurring earthquake events. One pair of reoccurring earthquake events can generate multiple fingerprint matches in the similarity matrix, since event waveforms are longer than a fingerprint time window. We exclude "self-matches" generated from adjacent/overlapping fingerprints that are not attributable to reoccurring earthquakes. After grouping similar fingerprint pairs into clusters of thin diagonals, we



Figure 8: The alignment procedure combines similarity search outputs from all channels in the same station (Channel Level), groups similar fingerprint matches generated from the same pair of reoccurring earthquakes (Station Level), and checks across seismic stations to reduce false positives in the final detection list (Network Level).

Figure 9: Earthquakes from the same seismic source have a fixed travel time to each seismic station (e.g. δtA, δtB in the figure). The inter-event time between two occurrences of the same earthquake is invariant across seismic stations.

reduce each cluster to a few summary statistics, such as the bounding box of the diagonal, the total number of similar pairs in the bounding box, and the sum of their similarity. Compared to storing every similar fingerprint pair, the clusters and summary statistics significantly reduce the size of the output.

Network Level. Earthquake signals also show strong temporal correlation across the seismic network, which we exploit to further suppress non-earthquake matches. Since an earthquake's travel time is only a function of its distance from the source but not of the magnitude, reoccurring earthquakes generated from the same source take a fixed travel time from the source to the seismic stations on each occurrence. Assume that an earthquake originating from source X takes δtA and δtB to travel to seismic stations A and B and that the source generates two earthquakes at time t1 and t2 (Figure 9). Station A experiences the arrivals of the two earthquakes at time t1 + δtA and t2 + δtA, while station B experiences the arrivals at t1 + δtB and t2 + δtB. The inter-event time ∆t of these two earthquake events is independent of the location of the stations:

∆t = (t2 + δtA) − (t1 + δtA) = (t2 + δtB) − (t1 + δtB) = t2 − t1.

This means that, in practice, diagonals with the same offset ∆t and close starting times at multiple stations can be attributed to the same earthquake event. We require a pair of earthquake events to be observed at more than a user-specified number of stations in order to be considered a detection.
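A minimal sketch of this network-level check, assuming each station's output has already been summarized as (dt, start) cluster summaries (Section 7.2); the grouping rule, tolerance, and names are illustrative rather than the pipeline's exact logic:

    from collections import defaultdict

    def network_detections(station_events, min_stations, start_tolerance):
        # Group cluster summaries by inter-event offset dt, then require that
        # clusters with close start times come from enough distinct stations.
        by_offset = defaultdict(list)
        for station, clusters in station_events.items():
            for dt, start in clusters:
                by_offset[dt].append((start, station))

        detections = []
        for dt, candidates in by_offset.items():
            candidates.sort(key=lambda c: c[0])
            group = [candidates[0]]
            for cand in candidates[1:]:
                if cand[0] - group[-1][0] <= start_tolerance:
                    group.append(cand)
                else:
                    if len({s for _, s in group}) >= min_stations:
                        detections.append((dt, group[0][0]))
                    group = [cand]
            if len({s for _, s in group}) >= min_stations:
                detections.append((dt, group[0][0]))
        return detections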

On a run with 7 to 10 years of time series data from 11 seismic stations (27 channels), the postprocessing procedure effectively reduced the output from more than 2 terabytes of similar fingerprint pairs to around 30K timestamps of potential earthquakes.

7.2 Implementation and Optimization

The volume of similarity search output poses serious challenges for the alignment procedure, as we often need to process results larger than the main memory of a single node. In this subsection, we describe our implementation and the new out-of-core adaptations of the algorithm that enable scaling to large output volumes.

Similarity search output format. The similarity search produces outputs in the form of triplets. A triplet (dt, idx1, sim) is a non-zero entry in the similarity matrix, representing that fingerprints idx1 and (idx1 + dt) are hashed into the same bucket sim times (out of t independent trials). We use sim as an approximation of the similarity between the two fingerprints.

Channel. First, given outputs of similar fingerprint pairs (or the non-zero entries of the similarity matrix) from different channels at the same station, we want to compute the combined similarity matrix with only entries above a predefined threshold.

Naïvely, we could update a shared hashmap of the non-zero entries of the similarity matrix for each channel in the station. However, since the hashmap might not fit in the main memory of a single machine, we instead use the following sort-merge-reduce procedure (a minimal sketch follows the numbered steps below):

1. In the sorting phase, we perform an external merge sort on the outputs from each channel, with dt as the primary sort key and idx1 as the secondary sort key. That is, we sort the similar fingerprint pairs first by the diagonal that they belong to in the similarity matrix, and within each diagonal, by the start time of the pairs.

2. In the merging phase, we perform a similar external merge sort on the already sorted outputs from each channel. This ensures that all matches generated by the same pair of fingerprints idx1 and idx1 + dt at different channels appear in consecutive rows of the merged file.

3. In the reduce phase, we traverse the merged file and combine the similarity scores of consecutive rows that share the same dt and idx1. We discard results whose combined similarity is smaller than the threshold.
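The merge and reduce phases above can be sketched as follows. This is a minimal illustration that assumes each channel's triplets have already been externally sorted by (dt, idx1) and written as whitespace-separated text lines; the file format and function names are ours, not the pipeline's:

    import heapq
    from itertools import groupby

    def read_triplets(path):
        # Each line of a sorted channel file: "dt idx1 sim"
        with open(path) as f:
            for line in f:
                dt, idx1, sim = line.split()
                yield int(dt), int(idx1), float(sim)

    def merge_and_reduce(sorted_channel_files, threshold):
        streams = [read_triplets(p) for p in sorted_channel_files]
        # Merge phase: heapq.merge preserves the global (dt, idx1) order across channels.
        merged = heapq.merge(*streams, key=lambda t: (t[0], t[1]))
        # Reduce phase: sum the similarity of consecutive rows sharing the same (dt, idx1).
        for (dt, idx1), group in groupby(merged, key=lambda t: (t[0], t[1])):
            combined = sum(sim for _, _, sim in group)
            if combined >= threshold:
                yield dt, idx1, combined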

Station. Given a combined similarity matrix for each seismic station, represented in the form of its non-zero entries sorted by their corresponding diagonals and starting times, we want to cluster fingerprint matches generated by potential earthquake events, i.e., cluster non-zero entries along the narrow diagonals of the matrix.

We look for sequences of detections (non-zero entries) along each diagonal dt, where the largest gap between consecutive detections is smaller than a predefined gap parameter. Empirically, permitting a gap helps ensure that an earthquake's P and S wave arrivals are assigned to the same cluster. Identifying the initial clusters along each diagonal dt requires a linear pass through the similarity matrix. We then iteratively merge clusters in adjacent diagonals dt − 1 and dt + 1, with the restriction that the final cluster has a relatively narrow width. We store a few summary statistics for each cluster (e.g., the cluster's bounding box and the total number of entries) and prune small clusters and isolated fingerprint matches, which significantly reduces the output size.
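A minimal sketch of the initial clustering along a single diagonal, under the simplifying assumption that the non-zero entries of that diagonal are given as a sorted list of idx1 values; the gap parameter and the summary statistics kept here are illustrative:

    def cluster_diagonal(entries, gap):
        # entries: sorted idx1 values of the non-zero entries on one diagonal dt.
        if not entries:
            return []
        clusters, current = [], [entries[0]]
        for idx in entries[1:]:
            if idx - current[-1] <= gap:
                current.append(idx)       # still within the same event's matches
            else:
                clusters.append(current)  # gap too large: start a new cluster
                current = [idx]
        clusters.append(current)
        # Summarize each cluster as (start, end, number of matches).
        return [(c[0], c[-1], len(c)) for c in clusters]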

The station-level clustering dominates the runtime of the spatiotemporal alignment. In order to speed up the clustering, we partition the similarity matrix according to the diagonals, or ranges of dt values of the matched fingerprints, and perform clustering in parallel on each partition. A naïve equal-sized partitioning of the similarity matrix could lead to missed detections if a cluster that is split into two partitions gets pruned in both due to its decreased size. Instead, we look for proper partition points in the similarity matrix where there is a small gap between neighboring occupied diagonals.



Again, we take advantage of the ordered nature of the similarity matrix entries. We uniformly sample entries in the similarity matrix, and for every pair of neighboring sampled entries, we only check the entries in between for partition points if the two sampled entries lie on diagonals far enough apart to be in two partitions. Empirically, a sampling rate of around 1% works well for our datasets, in that most sampled entries are skipped because they are too close to be partitioned.
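A simplified sketch of the sampling-based search for partition points, assuming the triplets are sorted by diagonal dt; the criterion here is reduced to a single min_diag_gap parameter, which is our own simplification of the "far enough apart to be in two partitions" test:

    def find_partition_points(entries, min_diag_gap, sample_rate=0.01):
        # entries: (dt, idx1, sim) triplets sorted by dt.
        step = max(1, int(1 / sample_rate))
        partition_points, prev = [], None
        for pos in range(0, len(entries), step):
            if prev is not None and entries[pos][0] - entries[prev][0] >= min_diag_gap:
                # Only scan between samples that are far enough apart in diagonal space.
                best_gap, best_dt = 0, None
                for i in range(prev + 1, pos + 1):
                    gap = entries[i][0] - entries[i - 1][0]
                    if gap > best_gap:
                        best_gap, best_dt = gap, entries[i][0]
                if best_gap >= min_diag_gap:
                    partition_points.append(best_dt)  # a new partition can start here
            prev = pos
        return partition_points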

Network. Given groups of potential events at each station, we perform a similar summarization across the network in order to identify subsets of events that can be attributed to the same seismic source. In principle, we could also partition and parallelize the network-level detection. In practice, however, we found that the summarized event information at each station is already small enough that it suffices to compute in serial.

8. EVALUATION

In this section, we perform both quantitative evaluation of the performance of the detection pipeline and qualitative analysis of the detection results. Our goal is to demonstrate that:

1. Each of our optimizations contributes meaningfully to the performance improvement; together, our optimizations enable an over 100× speedup in the end-to-end detection pipeline.

2. Incorporating domain knowledge in the pipeline improves both the performance and the quality of the detection.

3. The improved scalability of the pipeline enables new scientific discoveries on two public datasets: we discovered 597 new earthquakes from a decade of seismic data near the Diablo Canyon nuclear power plant in California, as well as 6123 new earthquakes from a year of seismic data from New Zealand.

Dataset. We evaluate on two public datasets used in seismological analyses with our domain collaborators. The first dataset includes 1 year of 100Hz time series data (3.15 billion points per station) from 5 seismic stations (LTZ, MQZ, KHZ, THZ, OXZ) in New Zealand. We use the vertical channel (usually the least noisy) from each station [3]. The second dataset of interest includes 7 to 10 years of 100Hz time series data from 11 seismic stations and 27 total channels near the Diablo Canyon power plant in California [4].

Experimental Setup. We report results from evaluating the pipeline on a server with 512GB of RAM and two 28-thread Intel Xeon E5-2690 v4 2.6GHz CPUs. Our test server has L1, L2, and L3 cache sizes of 32K, 256K, and 35840K, respectively. We report runtime averages from multiple trials.

8.1 End-to-end Evaluation

In this subsection, we report the runtime breakdown of the baseline implementation of the pipeline, as well as the effects of applying different optimizations.

To evaluate how our optimizations scale with data size, we evaluate the end-to-end pipeline on 1 month and 1 year of time series data from station LTZ in the New Zealand dataset. We applied a bandpass filter of 3-20Hz on the original time series to exclude noisy low-frequency bands. For fingerprinting, we used a sliding window with a length of 30 seconds and a slide of 2 seconds, which results in 1.28M binary fingerprints for 1 month of time series data (15.7M for one year), each of dimension 8192; for similarity search, we use 6 hash functions, and require a detection threshold of 5 matches out of 100 hash tables. We further investigate the effect of varying these parameters in the microbenchmarks in Section 8.3.
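As a back-of-the-envelope check of the fingerprint counts quoted above (a sketch only; the exact counts depend on month length and windowing details):

    window, slide = 30, 2                     # seconds
    month, year = 30 * 86400, 365 * 86400     # seconds of continuous 100Hz data
    fp_month = (month - window) // slide + 1  # ~1.30M, close to the quoted 1.28M
    fp_year = (year - window) // slide + 1    # ~15.8M, close to the quoted 15.7M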

Figure 10 shows the cumulative runtime after applying each optimization. Cumulatively, our optimizations scale well with the size of the dataset, and enable an over 100× improvement in end-to-end processing time. We analyze each of these components in turn.

First, we apply a 1% occurrence filter (+ occur filter, Section 6.5) during similarity search to exclude frequent fingerprint matches generated by repeating background noise. This enables a 2-5× improvement in similarity search runtime while reducing the output size by 10-50×, reflected in the decrease in postprocessing time.

Second, we further reduce the search time by increasing the number of hash functions to 8 and lowering the detection threshold to 2 (+ increase #funcs, Section 6.3). While this increases the hash signature generation time and output size, it enables around a 10× improvement in search time for both datasets.

Third, we reduce the hash signature generation time by improving the cache locality and reducing the computation with Min-Max hash instead of MinHash (+ locality MinMax, Section 6.2), which leads to a 3× speedup for both datasets.

Fourth, we speed up fingerprinting by 2× by estimating MAD statistics with a 10% sample (+ MAD sample, Section 5.2).

Finally, we enable parallelism and run the pipeline with 12 threads (Sections 5.2, 6.4, 7.2). As a result, we see an almost linear decrease in runtime in each part of the pipeline. Notably, due to the overall lack of data dependencies in this scientific pipeline, simple parallelization can already enable significant speedups.

The improved scalability enables us to scale the analysis from 3 months to over 10 years of data. We discuss qualitative detection results from both datasets in Section 8.5.

8.2 Effect of domain-specific optimizations

Here, we investigate the effect of applying domain-specific optimizations to the pipeline. We demonstrate that incorporating domain knowledge can improve both performance and result quality.

Occurrence filter. We evaluate the effect of applying the occurrence filter during similarity search on the five stations from the New Zealand dataset. For this evaluation, we use a partition size of 1 month as the duration for the occurrence threshold; a >1% threshold indicates that a fingerprint matches over 1% (10K) of the other fingerprints in the same month. We report the total percentage of filtered fingerprints under varying thresholds in Table 1. We also evaluate the accuracy of the occurrence filter by comparing the timestamps of filtered fingerprints with the catalog of arrival times of known earthquakes at each station. In Table 1, we report the false positive rate, or the number of filtered earthquakes over the total number of cataloged events, of the filter under varying thresholds.
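In the pipeline the occurrence filter is applied during the search itself (Section 6.5); purely for illustration, the sketch below applies the same per-partition rule post hoc to the triplet output. The bookkeeping and names are ours:

    from collections import Counter

    def occurrence_filter(triplets, num_fingerprints, threshold=0.01):
        # Count how many matches each fingerprint participates in within one
        # partition (e.g., one month), and drop fingerprints above the threshold.
        match_counts = Counter()
        for dt, idx1, _ in triplets:
            match_counts[idx1] += 1
            match_counts[idx1 + dt] += 1
        limit = threshold * num_fingerprints
        noisy = {idx for idx, count in match_counts.items() if count > limit}
        return [(dt, idx1, sim) for dt, idx1, sim in triplets
                if idx1 not in noisy and idx1 + dt not in noisy]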

The results show that as the occurrence filter becomes stronger, the percentage of filtered fingerprints and the false positive rate both increase. For seismic stations suffering from correlated noise, the occurrence filter can effectively eliminate a significant number of fingerprints from the similarity search. For station LTZ, a >1% threshold filters out up to 30% of the total fingerprints without any false positives, which results in a 4× improvement in runtime. For the other stations, the occurrence filter has little influence on the results. This is expected, since these stations do not have the repeating noise signals present at station LTZ (Figure 7). In practice, correlated noise is rather prevalent in seismic data. In the Diablo Canyon dataset, for example, we applied the occurrence filter on three out of the eleven seismic stations in order for the similarity search to finish in a tractable time.

Bandpass filter. We compare similarity search on the same dataset (Nyquist frequency 50Hz) before and after applying bandpass filters. The first bandpass filter (bp: 1-20Hz) is selected because most seismic signals are under 20Hz; the second (bp: 3-20Hz) is selected after manually examining sample spectrograms of the dataset and excluding noisy low frequencies.



Figure 10: Factor analysis of processing 1 month (left) and 1 year (right) of 100Hz data from the LTZ station in the New Zealand dataset. Each of our optimizations contributes to the performance improvements, enabling an over 100× end-to-end speedup.

Table 1: The percentage of fingerprints filtered (Filtered) and the false positive rate (FP) both increase as the occurrence filter becomes stronger (from filtering matches above 5.0% to above 0.1%). The runtime (in hours) measures similarity search time.

          LTZ (1548 events)     MQZ (1544 events)     KHZ (1542 events)     THZ (1352 events)     OXZ (1248 events)
Thresh    FP    Filtered Time   FP    Filtered Time    FP    Filtered Time   FP    Filtered Time   FP    Filtered Time
>5.0%     0     0.09     149.3  0     0        2.8     0     0        2.2    0     0        2.4    0     0        2.6
>1.0%     0     30.1     31.0   0     0        2.7     0     0        2.3    0     0        2.3    0     0        2.6
>0.5%     0     31.2     32.1   0     0.09     2.8     0     0        2.4    0     0        2.4    0.08  0.08     2.7
>0.1%     0     32.1     28.6   0.07  0.3      2.7     0     0.03     2.4    0     0.02     2.3    0.08  0.17     2.6

Figure 11: LSH runtime under different bandpass filters. Matches of noise in the non-seismic frequency bands can lead to a significant increase in runtime for unfiltered time series.

Figure 11 reports the similarity search runtime for fingerprints generated with different bandpass filters. Overall, similarity search suffers from additional matches generated from the noisy frequency bands outside the interest of seismology. For example, at station OXZ, removing the bandpass filter leads to a 16× slowdown in runtime and a 209× increase in output size.

We compare detection recall on 8811 catalog earthquake events for different bandpass filters. The recall for the unfiltered data (0-50Hz), the 1-20Hz and the 3-20Hz bandpass filters is 20.3%, 23.7% and 45.2%, respectively. The overall low recall is expected, as we only used 4 (out of over 50) stations in the seismic network that contributes to the generation of catalog events. Empirically, a narrow, domain-informed bandpass filter focuses the comparison of fingerprint similarity only on frequencies that are characteristic of seismic events, leading to improved similarity between earthquake events and therefore increased recall. We provide guidelines for setting the bandpass filter in the extended report ([53], Appendix C).
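For concreteness, the kind of preprocessing described above can be sketched as follows; the use of a fourth-order, zero-phase Butterworth filter is our own illustrative choice rather than the pipeline's exact implementation:

    import numpy as np
    from scipy.signal import butter, sosfiltfilt

    def bandpass(trace, fs=100.0, low=3.0, high=20.0, order=4):
        # Zero-phase Butterworth bandpass, e.g., 3-20Hz on a 100Hz trace.
        sos = butter(order, [low, high], btype="bandpass", fs=fs, output="sos")
        return sosfiltfilt(sos, trace)

    filtered = bandpass(np.random.randn(360_000))  # one hour of synthetic 100Hz data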

8.3 Effect of pipeline parameters

In this section, we evaluate the space/quality and time trade-offs for core pipeline parameters.

MAD sampling rate. We evaluate the speed and quality trade-off of calculating the median and MAD of the wavelet coefficients for fingerprints via sampling. We measure the runtime and accuracy on the 1 month dataset from Section 8.1 (1.3M fingerprints) under varying sampling rates. Overall, runtime and accuracy both decrease with the sampling rate, as expected.

Figure 12: Effect of LSH parameters on similarity search runtime and average query lookups. Increasing the number of hash functions significantly decreases the average number of lookups per query, which results in a 10× improvement in runtime.

For example, a 10% and a 1% sampling rate produce fingerprints with 99.7% and 98.7% accuracy, while enabling a near-linear speedup of 10.5× and 99.8×, respectively. Below 1%, runtime improvements suffer from diminishing returns, as IO begins to dominate the MAD calculation; on this dataset, a 0.1% sampling rate only speeds up the MAD calculation by 350×. We include additional results on this trade-off in [53].
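A minimal sketch of the sampling-based estimate, assuming the wavelet coefficients are held in a NumPy array with one row per fingerprint window; the exact sampling scheme in the pipeline may differ:

    import numpy as np

    def estimate_mad(coefficients, sample_rate=0.1, seed=0):
        rng = np.random.default_rng(seed)
        n = coefficients.shape[0]
        rows = rng.choice(n, size=max(1, int(n * sample_rate)), replace=False)
        sample = coefficients[rows]
        median = np.median(sample, axis=0)
        mad = np.median(np.abs(sample - median), axis=0)  # median absolute deviation
        return median, mad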

LSH parameters. We report the runtime of the similarity search under different LSH parameters in Figure 12. As indicated in Figure 6, the three sets of parameters that we evaluate yield near-identical probabilities of detection given the Jaccard similarity of two fingerprints. However, by increasing the number of hash functions and thereby increasing the selectivity of hash signatures, we decrease the average number of lookups per query by over 10×. This results in around a 10× improvement in similarity search time.
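The detection probabilities referenced above follow from the standard MinHash LSH analysis: with k hash functions per signature, t hash tables, and a threshold of m matching tables, two fingerprints with Jaccard similarity s collide in any one table with probability s^k, and the detection probability is the corresponding binomial tail. A sketch (ours, not the paper's code):

    from math import comb

    def detection_probability(s, k, t, m):
        # Probability of colliding in at least m of t tables, each using k MinHashes.
        p = s ** k
        return sum(comb(t, i) * p**i * (1 - p)**(t - i) for i in range(m, t + 1))

    # The three settings compared in Figure 12 give similar values, e.g. at s = 0.7:
    for k, m in [(6, 5), (7, 3), (8, 2)]:
        print(k, m, round(detection_probability(0.7, k, 100, m), 3))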

Number of partitions. We report the runtime and memory usage of the similarity search with a varying number of partitions in Figure 13. As the number of partitions increases, the runtime increases slightly due to the overhead of initializing and deleting hash tables. In contrast, memory usage decreases, as we only need to keep a subset of the hash signatures in the hash tables at any time. Overall, by increasing the number of partitions from 1 to 8, we are able to decrease the memory usage by over 60% while incurring less than 20% runtime overhead. This allows us to run LSH on larger datasets with the same amount of memory.



Figure 13: Runtime and memory usage for similarity search under a varying number of partitions. By increasing the number of search partitions, we are able to decrease the memory usage by over 60% while incurring less than 20% runtime overhead.

Figure 14: Hash generation scales near-linearly up to 32 threads.

Parallelism. Finally, to quantify the speedups from parallelism, we report the runtime of LSH hash signature generation and similarity search using a varying number of threads. For hash signature generation, we report the time taken to generate hash mappings as well as the time taken to compute the Min-Max hash for each fingerprint. For similarity search, we fix the input hash signatures and vary the number of threads assigned to the search. We show the runtime averaged over four seismic stations in Figure 14. Overall, hash signature generation scales almost perfectly (linearly) up to 32 threads, while similarity search scales slightly worse; both experience significant performance degradation when running with all available threads.

8.4 Comparison with Alternatives

In this section, we evaluate against alternative similarity search algorithms and supervised methods. We include additional experiment details in the extended technical report ([53], Appendix A).

Alternative Similarity Search Algorithms. We compare the single-core query performance of our MinHash LSH to 1) an alternative open-source LSH library, FALCONN [1], and 2) four state-of-the-art set similarity join algorithms: PPJoin [67], GroupJoin [16], AllPairs [11], and AdaptJoin [63]. We use 74,795 fingerprints with dimension 2048 and 10% non-zero entries, and a Jaccard similarity threshold of 0.5 for all libraries. Compared to exact algorithms like set similarity joins, the LSHs incur a 6% false negative rate. However, MinHash LSH enables a 24× to 65× speedup against FALCONN and a 63× to 197× speedup against the set similarity joins (Table 2). Characteristics of the input fingerprints contribute to the performance differences: the fixed number of non-zero entries in fingerprints makes pruning techniques in set similarity joins based on set length irrelevant; our results corroborate previous findings that MinHash outperforms SimHash on binary, sparse input [60].

Supervised Methods. We report results from evaluating two supervised models, WEASEL [55] and ConvNetQuake [50], on the Diablo Canyon dataset. Both models were trained on labeled catalog events (3585 events from 2010 to 2017) and randomly sampled noise windows at station PG.LMD. We also augment the earthquake training examples by 1) adding earthquake examples from another station, PG.DCD, 2) perturbing existing events with white noise, and 3) shifting the location of the earthquake event in the window.

Table 2: Single-core per-datapoint query time for LSH and set similarity joins. MinHash LSH incurs a 6.6% false negative rate while enabling up to a 197× speedup.

Algorithm                   Average Query Time   Speedup
MinHash LSH                 36 µs                –
FALCONN vanilla LSH         0.87 ms              24×
FALCONN multi-probe LSH     2.4 ms               65×
AdaptJoin [63]              2.3 ms               63×
AllPairs [11]               7.1 ms               197×
GroupJoin [16]              5.7 ms               159×
PPJoin [67]                 5.5 ms               151×

Table 3: Supervised methods trained on catalog events exhibit a high false positive rate and a 20% accuracy gap between predictions on catalog and FAST-detected events.

                           WEASEL [55]   ConvNetQuake [50]
Test Catalog Acc. (%)      90.8          90.6
Test FAST Acc. (%)         68.0          70.5
True Negative Rate (%)     98.6          92.2
False Positive Rate (%)    90.0±5.88     90.0±5.88

Table 3 reports the test accuracy of the two models on a sample of 306 unseen catalog events and 449 new events detected by our pipeline (FAST events), as well as the false positive rate estimated from manual inspection of 100 random earthquake predictions. While the supervised methods achieve high accuracy in classifying unseen catalog and noise events, they exhibit a high false positive rate (90±5.88%) and miss 30-32% of the new earthquake events detected by our pipeline. The experiment suggests that unsupervised methods like our pipeline are able to detect qualitatively different events from the existing catalog, and that supervised methods are complements to, rather than replacements for, unsupervised methods for earthquake detection.

8.5 Qualitative Results

We first report our findings from running the pipeline over a decade (06/2007 to 10/2017) of continuous seismic data from 11 seismic stations (27 total channels) near the Diablo Canyon nuclear power plant in central California. The chosen area is of special interest as there are many active faults near the power plant. Detecting additional small earthquakes in this region will allow seismologists to determine the size and shape of nearby fault structures, which can potentially inform seismic hazard estimates.

We applied station-specific bandpass filters between 3 and 12 Hz to remove repeating background noise from the time series. In addition, we applied the occurrence filter on three out of the eleven seismic stations that experienced corrupted sensor measurements. The number of input binary fingerprints for each seismic channel ranges from 180 million to 337 million; the similarity search runtime ranges from 3 hours to 12 hours with 48 threads.

Among the 5048 detections above our detection threshold, 397 detections (about 8%) were false positives, confirmed via visual inspection: 30 were duplicate earthquakes with a lower similarity, 18 were catalog quarry blasts, and 5 were deep teleseismic earthquakes (large earthquakes from >1000 km away). There were also 62 non-seismic signals detected across the seismic network; we suspect that some of these waveforms are sonic booms.

Overall, we were able to detect and locate 3957 catalog earthquakes, as well as 597 new local earthquakes. Figure 15 shows an overview of the origin times of the detected earthquakes, which are spread over the entire ten-year span. The detected events include both low-magnitude events near the seismic stations and larger events that are farther away. Figure 16 visualizes the locations of both catalog events and newly detected earthquakes, and Figure 17 zooms in on earthquakes in the vicinity of the power plant.



Figure 15: The left axis shows origin times and magnitudes of detected earthquakes, with catalog events marked in blue and new events marked in red. The colored bands on the right axis represent the duration of data used for detection, collected from 11 seismic stations and 27 total channels. Overall, we detected 3957 catalog earthquakes (diamonds) as well as 597 new local earthquakes (circles) from this dataset.

Figure 16: Overview of the locations of detected catalog events (gray open circles) and new events (red diamonds). The pipeline was able to detect earthquakes close to the seismic network (boxed) as well as all over California.

Despite the low rate of local earthquake activity (535 total catalog events from 2007 to 2017 within the area shown in Figure 17), we were able to detect 355 new events that are between −0.2 and 2.4 in magnitude and located within the seismic network, where many active faults exist. We missed 261 catalog events, almost all of which originated from outside the network of interest. Running the detection pipeline at scale enables scientists to discover earthquakes from unknown sources. These newly detected local events will be used to determine the details of active fault structures near the power plant.

We are also actively working with our domain collaborators on additional analysis of the New Zealand dataset. The pipeline detected 11419 events, including 4916 catalog events, 355 teleseismic events, 6123 new local earthquakes, and 25 false positives (noise waveforms), verified by the seismologists. We are preparing these results for publication in seismological venues, and expect to further improve the detection results by scaling up the analysis to more seismic stations over a longer duration of time.

9. CONCLUSION

In this work, we reported on a novel application of LSH to large-scale seismological data, as well as the challenges and optimizations required to scale the system to over a decade of continuous sensor data. This experience in scaling LSH for large-scale earthquake detection illustrates both the potential and the challenge of applying core data analytics primitives to data-driven domain science on large datasets.

Figure 17: Zoomed-in view of the locations of newly detected earthquakes (red diamonds) and cataloged events (blue circles) near the seismic network (box in Figure 16). The new local earthquakes contribute detailed information about the structure of faults.

On the one hand, LSH and, more generally, time series similarity search, is well studied, with scores of algorithms for efficient implementation: by applying canonical MinHash-based LSH, our seismologist collaborators were able to meaningfully analyze more data than would have been feasible via manual inspection. On the other hand, the straightforward implementation of LSH in the original FAST detection pipeline failed to scale beyond a few months of data. The particulars of seismological data, such as frequency imbalance in the time series and repeated background noise, placed severe strain on an unmodified LSH implementation and on researchers attempting to understand the output. As a result, the seismological discoveries described in this paper would not have been possible without domain-specific optimizations to the detection pipeline. We believe that these results have important implications for researchers studying LSH (e.g., regarding the importance of skew resistance) and will continue to bear fruit as we scale the system to even more data and larger seismic networks.

Acknowledgements

We thank the many members of the Stanford InfoLab for their valuable feedback on this work. This research was supported in part by affiliate members and other supporters of the Stanford DAWN project (Facebook, Google, Intel, Microsoft, NEC, SAP, Teradata, and VMware), as well as Toyota Research Institute, Keysight Technologies, Hitachi, Northrop Grumman, Amazon Web Services, Juniper Networks, NetApp, PG&E, the Stanford Data Science Initiative, the Secure Internet of Things Project, and the NSF under grant EAR-1818579 and CAREER grant CNS-1651570.



10. REFERENCES

[1] FALCONN - FAst Lookups of Cosine and Other Nearest Neighbors. https://github.com/falconn-lib/falconn.
[2] FAST Detection Pipeline. https://github.com/stanford-futuredata/FAST.
[3] GeoNet. https://www.geonet.org.nz/data/tools/FDSN.
[4] NCEDC. http://service.ncedc.org/.
[5] SCEDC (2013): Southern California Earthquake Center. Caltech. Dataset. doi:10.7909/C3WD3xH1.
[6] D. L. Anderson. Theory of the Earth. Blackwell Scientific Publications, 1989.
[7] A. Andoni, P. Indyk, T. Laarhoven, I. Razenshteyn, and L. Schmidt. Practical and Optimal LSH for Angular Distance. NIPS, 1:1225–1233, 2015.
[8] W. Astuti, R. Akmeliawati, W. Sediono, and M. Salami. Hybrid technique using singular value decomposition (SVD) and support vector machine (SVM) approach for earthquake prediction. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 7(5):1719–1728, 2014.

[9] S. Baluja and M. Covell. Audio fingerprinting: Combining computer vision & data stream processing. In IEEE ICASSP, volume 2, pages II–213–II–216, 2007.
[10] M. Bawa, T. Condie, and P. Ganesan. LSH forest: self-tuning indexes for similarity search. In WWW, pages 651–660. ACM, 2005.
[11] R. J. Bayardo, Y. Ma, and R. Srikant. Scaling Up All Pairs Similarity Search. In WWW, pages 131–140, 2007.
[12] A. F. Bell, S. Hernandez, H. E. Gaunt, P. Mothes, M. Ruiz, D. Sierra, and S. Aguaiza. The rise and fall of periodic 'drumbeat' seismicity at Tungurahua volcano, Ecuador. Earth and Planetary Science Letters, 475:58–70, 2017.
[13] K. Bergen, C. Yoon, and G. C. Beroza. Scalable Similarity Search in Seismology: A New Approach to Large-Scale Earthquake Detection. Similarity Search and Applications, pages 301–308, 2016.
[14] K. J. Bergen and G. C. Beroza. Detecting earthquakes over a seismic network using single-station similarity measures. Geophysical Journal International, 213(3):1984–1998, 2018.
[15] D. Bobrov, I. Kitov, and L. Zerbo. Perspectives of cross-correlation in seismic monitoring at the International Data Centre. Pure and Applied Geophysics, 171(3):439–468, Mar 2014.
[16] P. Bouros, S. Ge, and N. Mamoulis. Spatio-textual Similarity Joins. PVLDB, 6(1):1–12, 2012.
[17] A. Broder. On the resemblance and containment of documents. In Proceedings of the Compression and Complexity of Sequences, pages 21–, 1997.
[18] J. Buhler. Efficient large-scale sequence comparison by locality-sensitive hashing. Bioinformatics, 17(5):419–428, 2001.
[19] F.-P. Chan, A.-C. Fu, and C. Yu. Haar wavelets for efficient similarity search of time-series: with and without time warping. IEEE Transactions on Knowledge and Data Engineering, 15(3):686–705, 2003.
[20] O. Chum, J. Philbin, and A. Zisserman. Near Duplicate Image Detection: min-Hash and tf-idf Weighting. In BMVC 2008 - Proceedings of the British Machine Vision Conference 2008, 01 2008.
[21] G. Cortes et al. Using Principal Component Analysis to Improve Earthquake Magnitude Prediction in Japan. Logic Journal of the IGPL, jzx049:1–14, 10 2017.
[22] H. Ding, G. Trajcevski, P. Scheuermann, X. Wang, and E. Keogh. Querying and Mining of Time Series Data: Experimental Comparison of Representations and Distance Measures. PVLDB, 1(2):1542–1552, 2008.
[23] W. Dong, Z. Wang, W. Josephson, M. Charikar, and K. Li. Modeling LSH for Performance Tuning. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM '08, pages 669–678, 2008.
[24] C. E. Yoon, Y. Huang, W. L. Ellsworth, and G. C. Beroza. Seismicity During the Initial Stages of the Guy-Greenbrier, Arkansas, Earthquake Sequence. Journal of Geophysical Research: Solid Earth, 122(11):9253–9274, 2017.

[25] C. E. Yoon, O. O'Reilly, K. Bergen, and G. C. Beroza. Earthquake detection through computationally efficient similarity search. Science Advances, 1(11):e1501057, 2015.
[26] E. R. Engdahl, R. van der Hilst, and R. Buland. Global teleseismic earthquake relocation with improved travel times and procedures for depth determination. Bulletin of the Seismological Society of America, 88(3):722–743, 1998.
[27] R. J. Geller and C. S. Mueller. Four similar earthquakes in central California. Geophysical Research Letters, 7(10):821–824, 1980.
[28] S. J. Gibbons and F. Ringdal. The detection of low magnitude seismic events using array-based waveform correlation. Geophysical Journal International, 165(1):149–166, 2006.
[29] A. Gionis, P. Indyk, and R. Motwani. Similarity Search in High Dimensions via Hashing. VLDB, pages 518–529, 1999.
[30] Y. J. Gu, A. Okeler, S. Contenti, K. Kocon, L. Shen, and K. Brzak. Broadband seismic array deployment and data analysis in Alberta. CSEG Recorder, September, pages 37–44, 2009.
[31] B. Gutenberg and C. F. Richter. Magnitude and energy of earthquakes. Annals of Geophysics, 9(1):1–15, 1956.
[32] Y. Huang and G. C. Beroza. Temporal variation in the magnitude-frequency distribution during the Guy-Greenbrier earthquake sequence. Geophysical Research Letters, 42(16):6639–6646, 2015.
[33] J. Ji, J. Li, S. Yan, Q. Tian, and B. Zhang. Min-Max Hash for Jaccard Similarity. In 2013 IEEE 13th International Conference on Data Mining, pages 301–309, 2013.
[34] J. Johnson, M. Douze, and H. Jegou. Billion-scale similarity search with GPUs. CoRR, abs/1702.08734, 2017.
[35] M. Joswig. Pattern recognition for earthquake detection. Bulletin of the Seismological Society of America, 80(1):170, 1990.
[36] B. Kang and K. Jung. Robust and efficient locality sensitive hashing for nearest neighbor search in large data sets. In NIPS Workshop on Big Learning (BigLearn), pages 1–8, 2012.
[37] Z. Kang, W. T. Ooi, and Q. Sun. Hierarchical, non-uniform locality sensitive hashing and its application to video identification. In IEEE International Conference on Multimedia and Expo (ICME), volume 1, pages 743–746, 2004.
[38] A. Kato and S. Nakagawa. Multiple slow-slip events during a foreshock sequence of the 2014 Iquique, Chile Mw 8.1 earthquake. Geophysical Research Letters, 41(15):5420–5427, 2014.
[39] E. Keogh and S. Kasetty. On the need for time series data mining benchmarks: a survey and empirical demonstration. Data Mining and Knowledge Discovery, 7(4):349–371, 2003.
[40] Q. Kong, R. M. Allen, L. Schreier, and Y.-W. Kwon. MyShake: A smartphone seismic network for earthquake early warning and beyond. Science Advances, 2(2):e1501055, 2016.

[41] B. Kulis and K. Grauman. Kernelized locality-sensitive hashing for scalable image search. In ICCV, pages 2130–2137, 2009.
[42] R. Kulkarni. A Review of Application of Data Mining in Earthquake Prediction. In International Journal of Computer Science and Information Technology (IJCSIT), 2012.
[43] J. Leskovec, A. Rajaraman, and J. D. Ullman. Mining of Massive Datasets. Cambridge University Press, 2014.
[44] T. W. Liao. Clustering of time series data: a survey. Pattern Recognition, 38(11):1857–1874, 2005.
[45] Q. Lv, W. Josephson, Z. Wang, M. Charikar, and K. Li. Multi-probe LSH: Efficient Indexing for High-dimensional Similarity Search. VLDB, pages 950–961, 2007.
[46] G. S. Manku, A. Jain, and A. Das Sarma. Detecting Near-duplicates for Web Crawling. In WWW, pages 141–150, 2007.
[47] W. Mann, N. Augsten, and P. Bouros. An Empirical Evaluation of Set Similarity Join Techniques. PVLDB, 9(9):636–647, 2016.
[48] J. Peng, H. Wang, J. Li, and H. Gao. Set-based similarity search for time series. In SIGMOD, pages 2039–2052, 2016.
[49] Z. Peng and P. Zhao. Migration of early aftershocks following the 2004 Parkfield earthquake. Nature Geoscience, 2:877, 2009.
[50] T. Perol, M. Gharbi, and M. Denolle. Convolutional neural network for earthquake detection and location. Science Advances, 4(2), 2018.
[51] T. Rakthanmanon, B. Campana, A. Mueen, G. Batista, B. Westover, Q. Zhu, J. Zakaria, and E. Keogh. Searching and mining trillions of time series subsequences under dynamic time warping. In SIGKDD, pages 262–270, 2012.
[52] B. Rao and E. Zhu. Searching Web Data Using MinHash LSH. In SIGMOD, pages 2257–2258, 2016.



[53] K. Rong, C. E. Yoon, K. J. Bergen, H. Elezabi, P. Bailis, P. Levis, and G. C. Beroza. Locality-sensitive hashing for earthquake detection: A case study of scaling data-driven science (extended version). arXiv:1803.09835, 2018.
[54] T. Sakaki, M. Okazaki, and Y. Matsuo. Earthquake shakes Twitter users: real-time event detection by social sensors. In WWW, pages 851–860, 2010.
[55] P. Schafer and U. Leser. Fast and Accurate Time Series Classification with WEASEL. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, CIKM '17, pages 637–646, 2017.
[56] D. P. Schaff and G. C. Beroza. Coseismic and postseismic velocity changes measured by repeating earthquakes. Journal of Geophysical Research: Solid Earth, 109(B10), 2004.
[57] D. P. Schaff and F. Waldhauser. One Magnitude Unit Reduction in Detection Threshold by Cross Correlation Applied to Parkfield (California) and China Seismicity. Bulletin of the Seismological Society of America, 100(6):3224, 2010.
[58] D. R. Shelly, D. P. Hill, F. Massin, J. Farrell, R. B. Smith, and T. Taira. A fluid-driven earthquake swarm on the margin of the Yellowstone caldera. Journal of Geophysical Research: Solid Earth, 118(9):4872–4886, 2013.
[59] W. Shi and B. M. G. Kibria. On some confidence intervals for estimating the mean of a skewed population. International Journal of Mathematical Education in Science and Technology, 38(3):412–421, 2007.
[60] A. Shrivastava and P. Li. In Defense of MinHash Over SimHash. In AISTATS, volume 33, Reykjavik, Iceland, 2014.
[61] N. Sundaram, A. Turmukhametova, N. Satish, T. Mostak, P. Indyk, S. Madden, and P. Dubey. Streaming Similarity Search over One Billion Tweets Using Parallel Locality-sensitive Hashing. PVLDB, 6(14):1930–1941, 2013.
[62] M. Vlachos, G. Kollios, and D. Gunopulos. Discovering similar multidimensional trajectories. In ICDE, pages 673–684, 2002.
[63] J. Wang, G. Li, and J. Feng. Can We Beat the Prefix Filtering? An Adaptive Framework for Similarity Join and Search. In SIGMOD, pages 85–96, 2012.
[64] J. Wang and T.-L. Teng. Identification and picking of S phase using an artificial neural network. Bulletin of the Seismological Society of America, 87(5):1140, 1997.
[65] Q. Wang, M. Cui, and H. Liang. Semantic-Aware Blocking for Entity Resolution. IEEE Transactions on Knowledge and Data Engineering, 28(1):166–180, 2016.
[66] M. Withers, R. Aster, C. Young, J. Beiriger, M. Harris, S. Moore, and J. Trujillo. A comparison of select trigger algorithms for automated global seismic phase and event detection. Bulletin of the Seismological Society of America, 88(1):95, 1998.
[67] C. Xiao, W. Wang, X. Lin, J. X. Yu, and G. Wang. Efficient Similarity Joins for Near-duplicate Detection. TODS, 36(3):15:1–15:41, 2011.
[68] Z. Xing, J. Pei, and E. Keogh. A Brief Survey on Sequence Classification. SIGKDD Explor. Newsl., 12(1):40–48, Nov. 2010.
[69] B.-K. Yi and C. Faloutsos. Fast Time Sequence Indexing for Arbitrary Lp Norms. VLDB, pages 385–394, 2000.
