Acknowledgement
This research was supported by the Culture, Sports, and Tourism R&D Program through a Korea Creative Content Agency grant funded by the Ministry of Culture, Sports and Tourism in 2023 (Project: Development of high-speed music search technology using deep learning, No. CR202104004, Contribution Rate: 50%; Project: Development of artificial intelligence-based copyright infringement suspicious element detection and alternative material content recommendation technology for educational content, No. CR202104003, Contribution Rate: 50%).
References
- G. Cortès, A. Ciurana, E. Molina, M. Miron, O. Meyers, J. Six, and X. Serra, BAF: an audio fingerprinting dataset for broadcast monitoring, (Proc. 23rd Int. Soc. Music Inf. Retr. Conf., Bengaluru, India), 2022, pp. 908-916.
- A. Wang, An industrial-strength audio search algorithm, (Proc. Int. Conf. Music Inf. Retr., Baltimore, USA), 2003, pp. 7-13.
- J. Haitsma and T. Kalker, A highly robust audio fingerprinting system, (Proc. Int. Soc. Music Inf. Retr. Conf., Paris, France), 2002, pp. 107-115.
- Y.-N. Hung, C.-W. Wu, I. Orife, A. Hipple, W. Wolcott, and A. Lerch, A large TV dataset for speech and music activity detection, EURASIP J. Audio Speech Music Process. 2022 (2022), no. 21, 1-12.
- B. Meléndez-Catalán, E. Molina, and E. Gómez, Open broadcast media audio from TV: a dataset of TV broadcast audio with relative music loudness annotations, Trans. Int. Soc. Music Inform. Retrieval 2 (2019), no. 1, 43-51.
- N. Schmidt, J. Pons, and M. Miron, PodcastMix: a dataset for separating music and speech in podcasts, (Proc. Interspeech, Incheon, Republic of Korea), 2022, pp. 231-235.
- D. Petermann, G. Wichern, Z.-Q. Wang, and J. Le Roux, The cocktail fork problem: three-stem audio separation for real-world soundtracks, (IEEE Int. Conf. Acoust. Speech Signal Process., IEEE, Singapore), 2022, pp. 526-530.
- M. Defferrard, K. Benzi, P. Vandergheynst, and X. Bresson, FMA: a dataset for music analysis, (18th Int. Soc. Music Inf. Retr. Conf. (ISMIR), Suzhou, China), 2017, pp. 316-323.
- V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, Librispeech: an ASR corpus based on public domain audio books, (IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), IEEE, South Brisbane, Australia), 2015, pp. 5206-5210.
- E. Fonseca, X. Favory, J. Pons, F. Font, and X. Serra, FSD50K: an open dataset of human-labeled sound events, IEEE/ACM Trans. Audio, Speech, Lang. Process. 30 (2022), 829-852.
- D. Ellis, The 2014 LabROSA audio fingerprint system, (Int. Soc. Music Inf. Retr. Conf., Taipei, Taiwan), 2014.
- J. Six, Panako: a scalable audio search system, J. Open Source Softw. 7 (2022), no. 78, 4554.
- B. A. Arcas, B. Gfeller, R. Guo, K. Kilgour, S. Kumar, J. Lyon, J. Odell, M. Ritter, D. Roblek, M. Sharifi, and M. Velimirovic, Now playing: continuous low-power music recognition, (Proc. NeurIPS 2017 Workshop Mach. Learn. Phone Other Consum. Devices, Long Beach, CA, USA), 2017, pp. 1-6.
- A. Báez-Suárez, N. Shah, J. A. Nolazco-Flores, S.-H. S. Huang, O. Gnawali, and W. Shi, SAMAF: sequence-to-sequence autoencoder model for audio fingerprinting, ACM Trans. Multimed. Comput. Commun. Appl. 16 (2020), no. 2, 1-23.
- S. Chang, D. Lee, J. Park, H. Lim, K. Lee, K. Ko, and Y. Han, Neural audio fingerprint for high-specific audio retrieval based on contrastive learning, (Proc. IEEE Int. Conf. Acoust. Speech Signal Process. IEEE, Toronto, Canada), 2021, pp. 3025-3029.
- J. S. Seo, J. Kim, and H. Kim, Audio fingerprint matching based on a power weight, J. Acoust. Soc. Korea 38 (2019), no. 6, 716-723.
- M. Strauss, J. Paulus, M. Torcoli, and B. Edler, A hands-on comparison of DNNs for dialog separation using transfer learning from music source separation, (Proc. Interspeech 2021, Brno, Czech Republic), 2021, pp. 3900-3904.
- F.-R. Stöter, S. Uhlich, A. Liutkus, and Y. Mitsufuji, Open-Unmix: a reference implementation for music source separation, J. Open Source Softw. 4 (2019), no. 41, 1667.
- R. Hennequin, A. Khlif, F. Voituret, and M. Moussallam, Spleeter: a fast and efficient music source separation tool with pretrained models, J. Open Source Softw. 5 (2020), no. 50, 2154.
- A. Défossez, N. Usunier, L. Bottou, and F. Bach, Music source separation in the waveform domain, arXiv preprint, 2019, DOI 10.48550/arXiv.1911.13254.
- W. Choi, M. Kim, J. Chung, and S. Jung, LaSAFT: latent source attentive frequency transformation for conditioned source separation, (Proc. IEEE Int. Conf. Acoust. Speech Signal Process. IEEE, Toronto, Canada), 2021, pp. 171-175.
- Z. Rafii, A. Liutkus, F.-R. Stöter, S. I. Mimilakis, and R. Bittner, MUSDB18: a corpus for music separation, dataset, Zenodo, 2017, DOI 10.5281/zenodo.1117372.
- H. Kim, J. Kim, and J. Park, Performance analysis for background music identification in TV contents according to state-of-the-art music source separation methods, (Proc. Korea Multimedia Society, Seoul, Korea), 2021, pp. 30-32.
- H. Kim, W.-H. Heo, J. Kim, and J. Park, Monaural music-speech source separation based on convolutional neural network for background music identification in TV shows, J. Korean Inst. Commun. Inform. Sci. 45 (2020), no. 5, 855-866.
- Q. Kong, Y. Cao, T. Iqbal, Y. Wang, W. Wang, and M. D. Plumbley, PANNs: large-scale pretrained audio neural networks for audio pattern recognition, IEEE/ACM Trans. Audio, Speech, Lang. Process. 28 (2020), 2880-2894.
- J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, Audio set: an ontology and human-labeled dataset for audio events, (IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), IEEE, New Orleans, USA), 2017, pp. 776-780.
- B.-Y. Jang, W.-H. Heo, J. Kim, and O.-W. Kwon, Music detection from broadcast contents using convolutional neural networks with a Mel-scale kernel, EURASIP J. Audio Speech Music Process. 2019 (2019), no. 11, 1-12.
- S. Lee, H. Kim, and G.-J. Jang, Weakly supervised U-Net with limited upsampling for sound event detection, Appl. Sci. 13 (2023), no. 11.
- B. Weck and X. Serra, Data leakage in cross-modal retrieval training: a case study, (IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), Rhodes Island, Greece), 2023, pp. 1-5.
- A. S. Koepke, A.-M. Oncescu, J. Henriques, Z. Akata, and S. Albanie, Audio retrieval with natural language queries: a benchmark study, IEEE Trans. Multimed. 25 (2022), 2675-2685.
- E. Vincent, R. Gribonval, and C. Févotte, Performance measurement in blind audio source separation, IEEE Trans. Audio Speech Lang. Process. 14 (2006), no. 4, 1462-1469.
- C. Raffel, B. McFee, E. J. Humphrey, J. Salamon, O. Nieto, D. Liang, and D. P. W. Ellis, mir_eval: a transparent implementation of common MIR metrics, (Proc. Int. Soc. Music Inf. Retr. Conf., Taipei, Taiwan), 2014, pp. 367-372.
- A. Mesaros, T. Heittola, and T. Virtanen, Metrics for polyphonic sound event detection, Appl. Sci. 6 (2016), no. 6, 1-17.