Segmentation of binary sequence via minimizing least square error with total variation regularization

  • Jeungju Kim (Department of Statistics, Seoul National University)
  • Johan Lim (Department of Statistics, Seoul National University)
  • Received : 2023.12.24
  • Accepted : 2024.05.20
  • Published : 2024.09.30

Abstract

In this paper, we propose a data-driven procedure for segmenting a binary sequence as an alternative to the popular procedure based on the hidden Markov model (HMM). Unlike the HMM, our procedure makes no distributional or model assumptions about the data. To segment the sequence, we minimize the least squares distance to the observations subject to total variation regularization of the solution, and we develop a polynomial-time algorithm for this problem. We illustrate the algorithm with a toy example and apply it to the Gemini Boat Race data between Oxford University and Cambridge University. Using these examples, we also compare the performance of our procedure numerically with that of HMM-based segmentation.
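To make the optimization concrete, the following is a minimal Python sketch of one plausible reading of the problem, not the authors' algorithm (their R code is in the repository cited in the Acknowledgement). It assumes that the fitted sequence x is itself restricted to {0, 1} and that the objective is sum_i (y_i - x_i)^2 + lam * sum_i |x_i - x_{i-1}|. Under these assumptions the exact minimizer can be found by a two-state dynamic program in time linear in the sequence length. The function name segment_binary and the parameter lam are hypothetical names introduced for illustration.

```python
from typing import List, Sequence


def segment_binary(y: Sequence[int], lam: float) -> List[int]:
    """Illustrative sketch (assumed formulation, not the paper's code).

    Minimizes sum((y_i - x_i)^2) + lam * sum(|x_i - x_{i-1}|)
    over binary sequences x, using a two-state dynamic program.
    """
    n = len(y)
    if n == 0:
        return []

    # cost[s]: best objective value for the prefix seen so far, ending in state s
    cost = [float((y[0] - s) ** 2) for s in (0, 1)]
    # back[i - 1][s]: best state at position i - 1 given state s at position i
    back: List[List[int]] = []

    for i in range(1, n):
        new_cost = [0.0, 0.0]
        choices = [0, 0]
        for s in (0, 1):
            stay = cost[s]               # previous state equals s, no penalty
            switch = cost[1 - s] + lam   # previous state differs, pay lam
            if stay <= switch:
                choices[s], best = s, stay
            else:
                choices[s], best = 1 - s, switch
            new_cost[s] = best + (y[i] - s) ** 2
        back.append(choices)
        cost = new_cost

    # Trace the optimal path backwards from the cheaper final state.
    state = 0 if cost[0] <= cost[1] else 1
    x = [state]
    for choices in reversed(back):
        state = choices[state]
        x.append(state)
    x.reverse()
    return x
```

For instance, with y = [0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1] and lam = 1.5, the sketch returns five 0s followed by six 1s: the two isolated observations are absorbed into their neighbouring segments because following each of them would cost two switches (3.0) instead of one squared error (1.0). Setting lam = 0 reproduces y, and a very large lam yields a constant sequence.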

Acknowledgement

The authors are grateful to the associate editor and two reviewers for several valuable comments. The R code for this paper is available from https://github.com/z0o0/bseg. This work was supported by the National Research Foundation of Korea (No. NRF-2021R1A2C1010786).
