Optimizing Information Retrieval in Dark Web Academic Literature: A Study Using KeyBERT for Keyword Extraction and Clustering

  • Received : 2024.09.26
  • Accepted : 2024.10.07
  • Published : 2024.11.30

Abstract

The exponential growth in publications and the interconnected nature of research sub-domains make traditional methods of information extraction and organization inadequate, and this inefficiency can impede scientific progress and innovation. To address these challenges, this research uses KeyBERT, a keyword extraction method built on Bidirectional Encoder Representations from Transformers (BERT), and combines it with K-Means clustering to organize topics from large datasets effectively. The study analyzes a dataset of 47,627 articles from SCOPUS in the domains of Reinforcement Learning and Computer Vision. An ablation study demonstrates the generalizability of the approach across these fields, with the optimal number of clusters determined to be three using the Elbow Method. The results show that KeyBERT is effective in extracting and organizing topics within these domains, with a particular focus on applications such as medical imaging, autonomous driving, and real-time detection systems. This methodology offers a scalable way to organize vast academic datasets, enabling researchers to extract meaningful insights efficiently and to apply the approach to other domains.
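As an illustration of the Elbow Method mentioned in the abstract, the following minimal sketch runs a plain K-Means over a small toy dataset and picks the number of clusters at which adding another cluster stops paying off. It is not the authors' implementation. The 2D points are a hypothetical stand-in for keyword embeddings, constructed here to contain three groups. The function names, the farthest-point initialization, and the 10% gain threshold are all assumptions made for this example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_point_init(points, k):
    """Deterministic initialization (an assumption of this sketch):
    start from the first point, then repeatedly add the point that is
    farthest from all centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    return centroids


def kmeans_inertia(points, k, iterations=50):
    """Run Lloyd's K-Means and return the within-cluster sum of squares."""
    centroids = farthest_point_init(points, k)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return sum(min(squared_distance(p, c) for c in centroids) for p in points)


def pick_elbow(inertias, min_gain=0.10):
    """Return the smallest k after which one more cluster reduces inertia
    by less than min_gain times the k=1 inertia (a heuristic chosen for
    this sketch). inertias[i] holds the inertia for k = i + 1."""
    threshold = min_gain * inertias[0]
    for i in range(len(inertias) - 1):
        if inertias[i] - inertias[i + 1] < threshold:
            return i + 1
    return len(inertias)


# Toy data: three tight groups of five points each (hypothetical vectors,
# standing in for keyword embeddings).
centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
offsets = [(-0.4, -0.4), (-0.4, 0.4), (0.4, -0.4), (0.4, 0.4), (0.0, 0.0)]
points = [(cx + dx, cy + dy) for cx, cy in centers for dx, dy in offsets]

inertias = [kmeans_inertia(points, k) for k in range(1, 7)]
best_k = pick_elbow(inertias)
print(best_k)  # 3 for this toy data
```

The elbow found here reflects only how the toy data was built. On real embeddings the inertia curve is usually smoother, so the threshold would need tuning or a visual check of the curve.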

Keywords

Acknowledgement

This research was supported by 'The Construction Project for Regional Base Information Security Cluster', a grant funded by the Ministry of Science and ICT and Busan Metropolitan City in 2024.
