Segmentation of binary sequence via minimizing least square error with total variation regularization

Jeungju Kim; Johan Lim

doi:10.29220/CSAM.2024.31.5.487

Communications for Statistical Applications and Methods

Volume 31, Issue 5, Pages 487-496, 2024
pISSN 2287-7843, eISSN 2383-4757

The Korean Statistical Society (한국통계학회)


Jeungju Kim (Department of Statistics, Seoul National University);
Johan Lim (Department of Statistics, Seoul National University)

Received : 2023.12.24
Accepted : 2024.05.20
Published : 2024.09.30

https://doi.org/10.29220/CSAM.2024.31.5.487

Abstract

In this paper, we propose a data-driven procedure for segmenting a binary sequence as an alternative to the popular procedure based on the hidden Markov model (HMM). Unlike the HMM, our procedure makes no distributional or model assumptions about the data. To segment the sequence, we propose minimizing the least squares distance between the solution and the observations, subject to total variation regularization of the solution, and we develop a polynomial-time algorithm for this problem. We illustrate the algorithm with a toy example and apply it to the Gemini boat race data between Oxford and Cambridge University. We also numerically compare the performance of our procedure with that of HMM-based segmentation on these examples.
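The general idea of least squares fitting with a total variation penalty can be sketched in a few lines. The sketch below is only an illustration and is not the authors' algorithm: it assumes, for simplicity, that the fitted sequence x is itself restricted to the values 0 and 1, and it minimizes sum_i (y_i - x_i)^2 + lam * sum_i |x_{i+1} - x_i| by dynamic programming in time linear in the sequence length. The function names tv_binary_fit and segments, the penalty parameter lam, and the toy data are all hypothetical; the formulation in the paper may differ (for example, it may allow real-valued segment levels).

```python
from typing import List, Sequence, Tuple


def tv_binary_fit(y: Sequence[int], lam: float) -> List[int]:
    """Minimize sum (y_i - x_i)^2 + lam * sum |x_{i+1} - x_i| over x in {0,1}^n.

    Assumption for illustration: the fit x is binary-valued. A two-state
    dynamic program (one state per value of x_i) finds an exact minimizer.
    """
    n = len(y)
    if n == 0:
        return []
    # cost[s]: best objective over the prefix seen so far, ending with x_i = s
    cost = [(y[0] - s) ** 2 for s in (0, 1)]
    back = []  # back[i - 1][s]: best value of x_{i-1} given x_i = s
    for i in range(1, n):
        new_cost, choice = [], []
        for s in (0, 1):
            stay = cost[s]              # no jump between i-1 and i
            switch = cost[1 - s] + lam  # one jump, penalized by lam
            choice.append(s if stay <= switch else 1 - s)
            new_cost.append(min(stay, switch) + (y[i] - s) ** 2)
        cost = new_cost
        back.append(choice)
    # Trace the optimal path backwards from the cheaper final state.
    s = 0 if cost[0] <= cost[1] else 1
    x = [s]
    for choice in reversed(back):
        s = choice[s]
        x.append(s)
    x.reverse()
    return x


def segments(x: Sequence[int]) -> List[Tuple[int, int, int]]:
    """Return the runs of x as (start, end_inclusive, value) triples."""
    runs = []
    start = 0
    for i in range(1, len(x) + 1):
        if i == len(x) or x[i] != x[start]:
            runs.append((start, i - 1, x[start]))
            start = i
    return runs


if __name__ == "__main__":
    y = [0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1]  # made-up toy sequence
    x = tv_binary_fit(y, lam=1.5)
    print(x)
    print(segments(x))
```

In this sketch a larger lam yields fewer, longer segments, while lam = 0 reproduces the observations exactly. How the penalty is chosen in practice is not stated in the abstract.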

Keywords

Acknowledgement

The authors are grateful to the associate editor and two reviewers for several valuable comments. The R code for this paper is available from https://github.com/z0o0/bseg. This work was supported by the National Research Foundation of Korea (No. NRF-2021R1A2C1010786).

