TRANSCRIPT
Time Series Analysis for Kepler Photometry
Eric Feigelson, Penn State University
Kepler Science Conference V, March 2019
Who am I?
Prof. of Astronomy & Astrophysics and of Statistics, Penn State
Lead author of award-winning 2012 graduate text: Modern Statistical Methods for Astronomy with R Applications
Co-organizer, Penn State Summer Schools in astrostatistics (2005--)
Past-President, IAU Commission on Astroinformatics & Astrostatistics
Editor, Astrostatistics & Astroinformatics Portal
Statistical Scientific Editor, AAS Journals
A space astronomer who has been talking with statisticians for 30 years
Outline
• Broad introduction to time series analysis for evenly spaced cadences (12 slides)
• A new approach to Kepler planet detection: Autoregressive Planet Search (16 slides, poster #11)
• An R tutorial illustrating some time series methods:
– Displaying Kepler lightcurves
– Time series diagnostics (normality, autocorrelation, stationarity)
– Smoothing & gap interpolation
– ARIMA modeling
– Periodicity search
– Wavelet analysis & denoising
The tutorial is available as a Jupyter notebook and an R script.
The Essential Message of Kepler from a Methodology Viewpoint
Since Tycho Brahe & Johannes Kepler, astronomers have had to interpret sparse, irregularly spaced time series with a limited suite of specialized statistical tools.
But Kepler provides a rich, regularly spaced time series. We can thus inherit a vast world of powerful analytic tools from the statistical field of time series analysis designed for signal processing & econometrics.
These tools are described in textbooks and have software implementations in Matlab and R.
Useful books for Kepler researchers
The Analysis of Time Series: An Introduction, C. Chatfield, 6th ed, 2004
Ubiquitous undergraduate text, brief & clear
Time Series Analysis: Forecasting and Control, G. Box, G. Jenkins et al., 5th ed, 2015
Famous text since 1970, foundation of our ARPS methodology
Forecasting: Principles and Practice, R. Hyndman & G. Athanasopoulos, 2nd ed, 2018
Excellent; author of the important `forecast' R/CRAN package. Free ebook.
Introductory Time Series with R, P. Cowpertwait & A. Metcalfe, 2008
Brief presentation & cookbook, Springer Use R! series
Wavelet Methods in Statistics with R, G. Nason
Brief presentation & cookbook, Springer Use R! series
Introduction to Nonparametric Regression, K. Takezawa, 2006
Broad coverage of methods for irregular cadences with R code
Time Series Analysis by State Space Methods, J. Durbin & S. Koopman, 2nd ed, 2012
Advanced likelihood modeling (e.g. stochastic + periodic + astrophysics)
Fundamental concepts of time series
• Cadence & noise Nearly all standard methodology assumes: [a] regularly spaced observations, although missing data (NAs) are often allowed; [b] an additive signal of interest; and [c] homoscedastic Gaussian white noise, ε ~ N(0, σ²).
• Stationarity Statistical properties unchanged by shifts in time. Variations can be deterministic or stochastic. Nonstationarity includes: trends, heteroscedasticity, and change points.
• Periodicity Measured levels repeat with fixed periods. The signal is concentrated in a narrow range of frequencies (spectral, Fourier, harmonic analysis)
Standard TSA
• Autocorrelation Measured levels depend on previous values. The nonparametric autocorrelation function ACF(k) gives the fraction of variance attributed to correlation at lag k:

ACF(k) = Σ_t (x_t − <x>)(x_{t−k} − <x>) / Σ_t (x_t − <x>)²
• Stochastic process Dynamic evolution with random ingredients. Short-memory behaviors are concentrated in a few lags of the ACF. Long-memory behaviors include 1/f^α `red' noise.
• (Non)(semi)parametric methods Nonparametric procedures make no assumption regarding the global behavior (e.g. kernel smoothing). Parametric procedures assume a mathematical form to the global behavior; regression is used to find the best-fit parameter values (e.g. ARIMA). Semi-parametric methods have local but not global models (e.g. Gaussian Processes regression).
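The ACF formula above is simple to compute directly. A minimal Python sketch (the talk's tutorial itself uses R, where `acf()` is built in; the `acf` function below is a toy illustration written for this example, not the R routine):

```python
def acf(x, max_lag):
    """Sample autocorrelation function, following the slide's formula:
    ACF(k) = sum_t (x_t - <x>)(x_{t-k} - <x>) / sum_t (x_t - <x>)^2."""
    n = len(x)
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    return [sum((x[t] - mean) * (x[t - k] - mean) for t in range(k, n)) / denom
            for k in range(max_lag + 1)]

# ACF(0) is 1 by construction; an alternating series is anticorrelated at lag 1.
print(acf([1.0, -1.0] * 4, 1))  # → [1.0, -0.875]
```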
Nonparametric time domain methods
• (Partial) autocorrelation function For zero-mean Gaussian white noise, the ACF is asymptotically normal (i.e. probabilities can be inferred): Var(ACF(k)) = (n−k)/[n(n+2)]. PACF(k) gives the effect at lag k with smaller lags removed.
• Density estimation = smoothing Kernel density estimation, local polynomial regressions (splines, LOESS, …), Gaussian Processes regression. Preferred if a deterministic global functional form is not known.
• Differencing operator Replacing x_t with x_t − x_{t−1} removes many types of trend. Reversed with the integrating operator.
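The effect of the differencing operator on a deterministic trend can be sketched in a few lines of Python (a toy illustration; in R this is simply `diff()`):

```python
def difference(x):
    # First-difference operator: x'_t = x_t - x_{t-1}
    return [x[t] - x[t - 1] for t in range(1, len(x))]

trend = [3.0 * t + 5.0 for t in range(6)]  # deterministic linear trend
print(difference(trend))                   # constant series: the trend is removed
```

A second application of the operator would likewise remove a quadratic trend.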
Parametric time domain methods
• Deterministic trend Variation in time has known form, xt = f(t). Example: an astrophysical Keplerian orbit for radial velocities.
• Stochastic trend Many stochastic behaviors can be successfully fit with linear autoregressive models: ARMA (short-memory stationary), ARIMA (with nonstationary trend), ARFIMA (with long-memory 1/f^α), GARCH (with volatility), etc.
ARMA-type models are fit by maximum likelihood with order (p,q) chosen with the Akaike Information Criterion. There is lots of methodology: parameter errors, goodness-of-fit tests, hierarchical models, Bayesian inference, Kalman filter updates, time series diagnostics, etc.
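As a hedged illustration of order selection by AIC, the toy Python sketch below fits an AR(1) model by conditional least squares and compares its AIC against a white-noise model. A real analysis would use full maximum likelihood, as in R's `arima()`; the `ar1_aic` helper is hypothetical, written for this example:

```python
import math, random

def ar1_aic(x):
    """Fit AR(1) by conditional least squares and compare AIC with a
    white-noise (AR(0)) model: a toy stand-in for the full ML/AIC machinery."""
    n = len(x)
    num = sum(x[t] * x[t - 1] for t in range(1, n))
    den = sum(v * v for v in x[:-1])
    phi = num / den
    rss1 = sum((x[t] - phi * x[t - 1]) ** 2 for t in range(1, n))
    rss0 = sum(v * v for v in x)
    aic1 = n * math.log(rss1 / n) + 2 * 2   # AR(1): phi + noise variance
    aic0 = n * math.log(rss0 / n) + 2 * 1   # AR(0): noise variance only
    return phi, aic0, aic1

random.seed(1)
x, prev = [], 0.0
for _ in range(500):                        # simulate AR(1) with phi = 0.8
    prev = 0.8 * prev + random.gauss(0.0, 1.0)
    x.append(prev)
phi, aic0, aic1 = ar1_aic(x)
print(round(phi, 2), aic1 < aic0)           # phi near 0.8; AIC prefers AR(1)
```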
Nonparametric frequency domain method
• Fourier analysis The Fourier transform and power spectrum (Schuster periodogram) concentrate a strictly periodic signal into a sharp peak. The theorems are highly restrictive: evenly spaced data of infinite duration, Gaussian white noise with a single sinusoidal-shaped signal at fixed period. For realistic data, it is difficult to establish the significance of periodogram peaks.
• Modern Fourier analysis improves the power spectrum with smoothing (reduced variance), tapering (reduced bias), detrending (reduced zero-frequency power), and pre-whitening (removing strong periodicity). Multitapering is recommended.
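The Schuster periodogram can be computed directly from its definition; a minimal O(n²) Python sketch for illustration (a production analysis would use an FFT, and the `periodogram` function here is written only for this example):

```python
import math

def periodogram(x):
    """Classical Schuster periodogram |DFT|^2 / n at the Fourier frequencies."""
    n = len(x)
    mean = sum(x) / n
    x = [v - mean for v in x]               # remove zero-frequency power
    powers = []
    for j in range(1, n // 2 + 1):
        re = sum(x[t] * math.cos(2 * math.pi * j * t / n) for t in range(n))
        im = sum(x[t] * math.sin(2 * math.pi * j * t / n) for t in range(n))
        powers.append((j / n, (re * re + im * im) / n))
    return powers

# A pure sinusoid with 8 cycles in 64 points concentrates power at frequency 8/64.
x = [math.sin(2 * math.pi * 8 * t / 64) for t in range(64)]
best = max(periodogram(x), key=lambda p: p[1])
print(best[0])  # → 0.125
```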
Advanced signal processing methods are being introduced into astronomy by leading astrostatistics groups:
§ Independent Component Analysis (Waldmann et al. 2012)
§ Correntropy (Huijse et al. 2012)
§ Hilbert-Huang transform (Kolotkov et al. 2015)
§ Empirical Mode Decomposition (Roberts et al. 2013)
§ Singular Spectrum Analysis (Boufleur et al. 2018)
§ Dynamic time warping (B. Zhang et al. 2016)
§ …
§ Ensemble methods to identify astronomical instrumental & atmospheric effects: SysRem, TFA, EPD, PDC-MAP, ARC (Tamuz et al. 2005, Kovac et al. 2005, Bakos et al. 2007, Stumpe et al. 2012, Roberts et al. 2013)
§ Time series classification methods (J. Richards et al. 2011, Huppenkothen et al. 2017, etc.)
The ACF, the ARMA maximum likelihood fit, the wavelet transform, and the Fourier transform all contain the same information as the original time series if an infinite number of coefficients are maintained. Each specializes in highlighting different characteristics of the dataset.

Choice of method depends on the properties of the variability and the scientific question being addressed. There is no problem using different methods for different purposes.
General strategy for time series analysis
Change Point Analysis
Wide variety of nonparametric and parametric methods to find sudden changes in level or variability behavior.
• CUSUM (cumulative sum) tests used extensively in manufacturing quality control
• Many extensions to Wald’s Sequential Probability Ratio Test for sudden change in mean, variance, etc.
• Many tests of sudden change of ARMA coefficients.
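A minimal sketch of the CUSUM idea in Python (a toy illustration, not a calibrated statistical test; the `cusum` function is written for this example, and real use requires significance thresholds):

```python
def cusum(x):
    """CUSUM of deviations from the overall mean; a change in level shows up
    as an extreme excursion, and its location estimates the change point."""
    mean = sum(x) / len(x)
    s, path = 0.0, []
    for v in x:
        s += v - mean
        path.append(s)
    k = max(range(len(path)), key=lambda i: abs(path[i]))
    return k + 1, path            # index where the level shift is estimated

x = [0.0] * 50 + [3.0] * 50       # mean jumps from 0 to 3 at t = 50
print(cusum(x)[0])                # → 50
```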
No applications in astronomy yet
R for Kepler
R is the most comprehensive and widely used software environment for statistical analysis of data.
– R is an easily learned scripting language similar to IDL and Matlab. No methods coding, just data wrangling, graphics & interpretation.
– With >12,000 community-written CRAN packages, R has ~100,000 functions: infrastructure, graphics, statistical tests, regressions, machine learning, etc.
– ~300 packages treat evenly-spaced time series (https://cran.r-project.org/web/views/TimeSeries.html)
Comments on standard time series analysis
• Time series analysis is a huge field of methodology and there is a lot for us to learn. I suggest that everyone be familiar at the level of texts like Feigelson & Babu (2012) and Chatfield (6th ed, 2004).
• Characterization of time series with weird ground-based cadences is not in any textbook, and non-trivial challenges arise. Caution is needed in interpretation (no theorems). But regularly cadenced space-based time series can use established standard methods (lots of theorems).
• A vast array of statistical codes are already written, most in R/CRAN and Matlab (Signal Processing & Econometrics toolboxes). One can wrap R/CRAN from Python and vice versa.
Kepler: Transit detection of exoplanets
Photometric time series show periodic transits when a planet passes in front of its star.

But a significant challenge is removal of stellar variability unrelated to the planet.
“[E]xoplanets may still be detected by exploiting differences in timescale, shape and wavelength dependence between the planetary and stellar signals. ... [Stellar] variability, combined with residual instrumental systematics, is still limiting the detection of habitable planets by Kepler.” (Suzanne Aigrain, IAU 2015)
Planets and stellar activity: Hide and seek in the CoRoT-7 system, Haywood et al. 2014
GP local regression fit
Stellar variability reduction methods
Nonparametric modeling
§ Wavelet analysis (Jenkins 2002 & Kepler pipeline, Carter & Winn 2009)
§ Gaussian Processes regression (Gibson 2014, Aigrain et al 2016, Luger et al 2016)
§ Advanced signal processing methods: Independent Component Analysis, correntropy, trend entropy, Empirical Mode Decomposition, Singular Spectrum Analysis, … (Waldmann et al. 2013, Huijse et al. 2012, Roberts et al. 2013, Greco et al. 2016, Boufleur et al. 2018, …)
Parametric modeling
Rarely used because stellar & atmospheric variations do not follow any obvious function: Flux ~ f(time). However, textbooks in time series analysis are dominated by stochastic autoregressive regression models, Flux ~ f(past fluxes, past changes). These are broad model families widely used in engineering signal processing, econometrics, and other fields (millions of Google hits).

We have found these models are also very effective for time domain astronomy!

Eric Feigelson 2018
ARPS: Four steps to exoplanet transit detection
1. Pre-process the data
2. Reduce/remove uninteresting stellar variability but … “Don't eat the planet!” (David Jones, SAMSI)
3. Conduct periodicity search for recurring transits
4. Establish decision criteria to report Planet Candidates
[Flowchart: Ingest → Preprocess → AR fit → TCF periodogram → Features → Classification & vetting → Tables & plots]
Autoregressive modeling for evenly spaced time series
ARIMA(p,d,q) includes d differencing operations: x′_t = x_t − x_{t−1}

ARFIMA(p,d,q) has fractional differencing
CARMA (continuous), VARIMA (vector), SARIMA (periodic), GARCH and dozens of other extensions to the ARMA family
ARMA(p,q) model:

x_t = φ_1 x_{t−1} + … + φ_p x_{t−p} + ε_t + θ_1 ε_{t−1} + … + θ_q ε_{t−q}

where x_t is the current value of the stellar flux, x_{t−i} are recent past flux values, φ_i and θ_j are coefficients of the linear regression, ε_t is the current Gaussian noise value, and ε_{t−j} are recent past noise values.
• AR & MA components model short-memory processes
• The I differencing operator removes many forms of trend and non-stationarity
• The F component models long-memory processes and is equivalent to the astronomers' 1/f^α-type red noise. The coefficient d is arithmetically related to α and the economists' Hurst parameter.
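The fractional differencing operator (1 − B)^d can be expanded with the standard binomial recursion for its weights; a small Python sketch (illustrative only, not the ARFIMA fitting machinery, and the `fracdiff_weights` helper is written for this example):

```python
def fracdiff_weights(d, n_lags):
    """Binomial-expansion weights of the fractional differencing operator
    (1 - B)^d, via the standard recursion w_k = w_{k-1} * (k - 1 - d) / k."""
    w = [1.0]
    for k in range(1, n_lags + 1):
        w.append(w[-1] * (k - 1 - d) / k)
    return w

# Integer d = 1 reduces to the ordinary first difference: weights beyond lag 1 vanish.
print(fracdiff_weights(1.0, 3))
# Fractional d gives slowly decaying weights: the long-memory case.
print(fracdiff_weights(0.4, 3))
```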
Roughly speaking … the ARIMA and ARFIMA models can remove an enormous variety of temporal variations seen in astronomical data.
[Figure: Sample Lightcurve with Fit & Residuals — three panels vs. Time: Original Flux, Fitted Flux, Residual Flux]
Typical 4 yr Kepler lightcurve
Maximum likelihood ARFIMA model
Residuals
KARPS: Kepler AutoRegressive Planet Search
Caceres, Feigelson, et al. 2019b
The ARPS procedure is applied to ~200,000 Kepler stars
Problem:
The differencing operator transforms a box-shaped transit signal into a periodic double-spike signal.
[Figure: Schematic of ARMA-Model Effect on Transit Signal — box-shaped transit dip vs. Time (top); residual double-spike signal (bottom)]
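The box-to-double-spike effect is easy to verify numerically; a small Python sketch with toy values (not actual Kepler fluxes):

```python
def difference(x):
    # First-difference operator: x'_t = x_t - x_{t-1}
    return [x[t] - x[t - 1] for t in range(1, len(x))]

# A box-shaped transit: flux dips by 30 units for 10 cadences.
flux = [980.0] * 20 + [950.0] * 10 + [980.0] * 20
d = difference(flux)
spikes = [(i, v) for i, v in enumerate(d) if v != 0.0]
print(spikes)    # → [(19, -30.0), (29, 30.0)]: ingress and egress spikes
```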
Transit Comb Filter: a matched filter for the periodic double-spike pattern produced by a box-shaped dip subject to the differencing operator. The TCF power plotted for a range of periods gives a periodogram analogous to the BLS periodogram.
Caceres, Feigelson et al. 2019a
Comparison of TCF and BLS periodograms, KIC 010024701

Here neither method detects periodicity in the original lightcurve.

BLS detects periodicity in ARIMA residuals but with lower SNR than TCF.
[Figure: BLS (top) and TCF (bottom) periodograms]
Feature importance in Random Forest classifier
ROC curves for 1 vs. 37 features
Result for discovery of new Kepler planets
Red = previously known KOI planets
Magenta = new KARPS planets
Gray = stars without planets
Caceres, Feigelson et al. 2019b
The ARIMA & ARFIMA TCF periodograms show an interesting spectral peak: Period~0.3 days
The folded lightcurve shows a faint double-spike (left) and box (right).
The Random Forest classifier gives high confidence for a transiting planet: Prob(RF) = 0.53
ARPS Conclusion
Low-dimensional stochastic ARIMA-type models can be very effective in removing autocorrelated noise in stellar lightcurves, leaving periodic transits in the residuals.
TCF gives an effective periodogram for ARIMA residuals.
Random Forest is an effective classifier if strong training sets are available. False Positive rejection is tricky but feasible.
Coding is easy: extensive econometric software in R. Computational burden is reasonable.
Science result: 97 candidate new planets, most hot (P<5 days) sub-Earths (Depth < 100 ppm)
AutoRegressive Planet Search Project
Methodology: Caceres, Feigelson et al. 2019a, submitted arXiv:1901.0511
Application to 4-year Kepler lightcurves: Caceres, Feigelson et al. 2019b, submitted
Simulations for irregularly spaced ground-based transit surveys: Stuhr, Feigelson et al. 2019, submitted
Application to HATSouth lightcurves: Stuhr, Feigelson, Hartman et al., in progress
Application to TESS lightcurves: Wolfgang & Feigelson, in progress
Supported by NSF AST-1614690 and NASA 80NSSC17K0122 at Penn State