Acknowledgement
This research was supported by the Culture, Sports, and Tourism R&D Program through a Korea Creative Content Agency grant funded by the Ministry of Culture, Sports and Tourism in 2023 (Project: Development of high-speed music search technology using deep learning, No. CR202104004, Contribution Rate: 50%; Project: Development of artificial intelligence-based copyright infringement suspicious element detection and alternative material content recommendation technology for educational content, No. CR202104003, Contribution Rate: 50%).
References
- G. Cortès, A. Ciurana, E. Molina, M. Miron, O. Meyers, J. Six, and X. Serra, BAF: an audio fingerprinting dataset for broadcast monitoring, (Proc. 23rd Int. Soc. Music Inf. Retr. Conf., Bengaluru, India), 2022, pp. 908-916.
- A. Wang, An industrial-strength audio search algorithm, (Proc. Int. Conf. Music Inf. Retr., Baltimore, USA), 2003, pp. 7-13.
- J. Haitsma and T. Kalker, A highly robust audio fingerprinting system, (Proc. Int. Soc. Music Inf. Retr. Conf., Paris, France), 2002, pp. 107-115.
- Y.-N. Hung, C.-W. Wu, I. Orife, A. Hipple, W. Wolcott, and A. Lerch, A large TV dataset for speech and music activity detection, EURASIP J. Audio Speech Music Process. 2022 (2022), no. 21, 1-12.
- B. Meléndez-Catalán, E. Molina, and E. Gómez, Open broadcast media audio from TV: a dataset of TV broadcast audio with relative music loudness annotations, Trans. Int. Soc. Music Inform. Retrieval 2 (2019), no. 1, 43-51.
- N. Schmidt, J. Pons, and M. Miron, PodcastMix: a dataset for separating music and speech in podcasts, (Proc. Interspeech, Incheon, Republic of Korea), 2022, pp. 231-235.
- D. Petermann, G. Wichern, Z.-Q. Wang, and J. Le Roux, The cocktail fork problem: three-stem audio separation for real-world soundtracks, (IEEE Int. Conf. Acoust. Speech Signal Process., IEEE, Singapore), 2022, pp. 526-530.
- M. Defferrard, K. Benzi, P. Vandergheynst, and X. Bresson, FMA: a dataset for music analysis, (18th Int. Soc. Music Inf. Retr. Conf. (ISMIR), Suzhou, China), 2017, pp. 316-323.
- V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, Librispeech: an ASR corpus based on public domain audio books, (IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), IEEE, South Brisbane, Australia), 2015, pp. 5206-5210.
- E. Fonseca, X. Favory, J. Pons, F. Font, and X. Serra, FSD50K: an open dataset of human-labeled sound events, IEEE/ACM Trans. Audio, Speech, Lang. Process. 30 (2022), 829-852.
- D. Ellis, The 2014 LabROSA audio fingerprint system, (Int. Soc. Music Inf. Retr. Conf., Taipei, Taiwan), 2014.
- J. Six, Panako: a scalable audio search system, J. Open Source Softw. 7 (2022), no. 78, 4554.
- B. A. Arcas, B. Gfeller, R. Guo, K. Kilgour, S. Kumar, J. Lyon, J. Odell, M. Ritter, D. Roblek, M. Sharifi, and M. Velimirovic, Now playing: continuous low-power music recognition, (Proc. NeurIPS 2017 Workshop Mach. Learn. Phone Other Consum. Devices, Long Beach, CA, USA), 2017, pp. 1-6.
- A. Báez-Suárez, N. Shah, J. A. Nolazco-Flores, S.-H. S. Huang, O. Gnawali, and W. Shi, SAMAF: sequence-to-sequence autoencoder model for audio fingerprinting, ACM Trans. Multimed. Comput. Commun. Appl. 16 (2020), no. 2, 1-23.
- S. Chang, D. Lee, J. Park, H. Lim, K. Lee, K. Ko, and Y. Han, Neural audio fingerprint for high-specific audio retrieval based on contrastive learning, (Proc. IEEE Int. Conf. Acoust. Speech Signal Process. IEEE, Toronto, Canada), 2021, pp. 3025-3029.
- J. S. Seo, J. Kim, and H. Kim, Audio fingerprint matching based on a power weight, J. Acoust. Soc. Korea 38 (2019), no. 6, 716-723.
- M. Strauss, J. Paulus, M. Torcoli, and B. Edler, A hands-on comparison of DNNs for dialog separation using transfer learning from music source separation, (Proc. Interspeech 2021, Brno, Czech Republic), 2021, pp. 3900-3904.
- F.-R. Stöter, S. Uhlich, A. Liutkus, and Y. Mitsufuji, Open-Unmix: a reference implementation for music source separation, J. Open Source Softw. 4 (2019), no. 41, 1667.
- R. Hennequin, A. Khlif, F. Voituret, and M. Moussallam, Spleeter: a fast and efficient music source separation tool with pretrained models, J. Open Source Softw. 5 (2020), no. 50, 2154.
- A. Défossez, N. Usunier, L. Bottou, and F. Bach, Music source separation in the waveform domain, arXiv preprint, 2019, DOI 10.48550/arXiv.1911.13254.
- W. Choi, M. Kim, J. Chung, and S. Jung, LaSAFT: latent source attentive frequency transformation for conditioned source separation, (Proc. IEEE Int. Conf. Acoust. Speech Signal Process. IEEE, Toronto, Canada), 2021, pp. 171-175.
- Z. Rafii, A. Liutkus, F.-R. Stöter, S. I. Mimilakis, and R. Bittner, MUSDB18: a corpus for music separation, dataset, Zenodo, 2017, DOI 10.5281/zenodo.1117372.
- H. Kim, J. Kim, and J. Park, Performance analysis for background music identification in TV contents according to state-of-the-art music source separation methods, (Proc. Korea Multimedia Society, Seoul, Korea), 2021, pp. 30-32.
- H. Kim, W.-H. Heo, J. Kim, and J. Park, Monaural music-speech source separation based on convolutional neural network for background music identification in TV shows, J. Korean Inst. Commun. Inform. Sci. 45 (2020), no. 5, 855-866.
- Q. Kong, Y. Cao, T. Iqbal, Y. Wang, W. Wang, and M. D. Plumbley, PANNs: large-scale pretrained audio neural networks for audio pattern recognition, IEEE/ACM Trans. Audio, Speech, Lang. Process. 28 (2020), 2880-2894.
- J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, Audio set: an ontology and human-labeled dataset for audio events, (IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), IEEE, New Orleans, USA), 2017, pp. 776-780.
- B.-Y. Jang, W.-H. Heo, J. Kim, and O.-W. Kwon, Music detection from broadcast contents using convolutional neural networks with a Mel-scale kernel, EURASIP J. Audio Speech Music Process. 2019 (2019), no. 11, 1-12.
- S. Lee, H. Kim, and G.-J. Jang, Weakly supervised U-Net with limited upsampling for sound event detection, Appl. Sci. 13 (2023), no. 11.
- B. Weck and X. Serra, Data leakage in cross-modal retrieval training: a case study, (IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), Rhodes Island, Greece), 2023, pp. 1-5.
- A. S. Koepke, A.-M. Oncescu, J. Henriques, Z. Akata, and S. Albanie, Audio retrieval with natural language queries: a benchmark study, IEEE Trans. Multimed. 25 (2022), 2675-2685.
- E. Vincent, R. Gribonval, and C. Févotte, Performance measurement in blind audio source separation, IEEE Trans. Audio Speech Lang. Process. 14 (2006), no. 4, 1462-1469.
- C. Raffel, B. McFee, E. J. Humphrey, J. Salamon, O. Nieto, D. Liang, and D. P. W. Ellis, mir_eval: a transparent implementation of common MIR metrics, (Proc. Int. Soc. Music Inf. Retr. Conf., Taipei, Taiwan), 2014, pp. 367-372.
- A. Mesaros, T. Heittola, and T. Virtanen, Metrics for polyphonic sound event detection, Appl. Sci. 6 (2016), no. 6, 1-17.