Research on Keyword-Overlap Similarity Algorithm Optimization in Short English Text Based on Lexical Chunk Theory

  • Na Li (Public Foreign Language Teaching and Research Department, Qiqihar University)
  • Cheng Li (College of Computer and Control Engineering, Qiqihar University)
  • Honglie Zhang (College of Computer and Control Engineering, Qiqihar University)
  • Received : 2022.12.14
  • Accepted : 2023.02.26
  • Published : 2023.10.31

Abstract

Short-text similarity calculation is an active research topic in natural language processing. Conventional keyword-overlap similarity algorithms consider only lexical-item information and neglect the effect of word order. Some optimized variants incorporate word order, but their weights are hard to determine. Building on the keyword-overlap similarity algorithm, this paper proposes the short English text similarity algorithm based on lexical chunk theory (LC-SETSA), which introduces lexical chunk theory, a concept from cognitive psychology, into short English text similarity calculation for the first time. Lexical chunks are used to segment short English texts; the segmentation results preserve the semantic connotation and the fixed word order of the lexical chunks, and the overlap similarity of the lexical chunks is then calculated. Finally, comparative experiments are carried out, and the results show that the proposed algorithm is feasible, stable, and effective to a large extent.
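
The contrast between word-level and chunk-level overlap can be illustrated with a minimal Python sketch. It is not the LC-SETSA implementation: the abstract does not give the overlap formula or the chunk inventory, so the Dice-style coefficient, the greedy longest-match segmentation, the `CHUNK_LEXICON` contents, and all function names below are assumptions chosen only for illustration.

```python
from typing import Hashable, List, Sequence, Set, Tuple

# Hypothetical chunk lexicon (assumption); the paper's real inventory is not given here.
CHUNK_LEXICON: Set[Tuple[str, ...]] = {
    ("look", "after"),
    ("as", "a", "result"),
}


def tokenize(text: str) -> List[str]:
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def segment(tokens: Sequence[str],
            lexicon: Set[Tuple[str, ...]],
            max_len: int = 4) -> List[Tuple[str, ...]]:
    """Greedy longest-match segmentation into chunks.

    A word that does not start any known chunk becomes a one-word unit.
    """
    units: List[Tuple[str, ...]] = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + n])
            if n == 1 or candidate in lexicon:
                units.append(candidate)
                i += n
                break
    return units


def overlap_similarity(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Dice-style overlap of two unit sets (assumed formula, not the paper's)."""
    set_a, set_b = set(a), set(b)
    if not set_a or not set_b:
        return 0.0
    return 2 * len(set_a & set_b) / (len(set_a) + len(set_b))


if __name__ == "__main__":
    s1 = tokenize("I look after the dog")
    s2 = tokenize("after the dog I look")

    # Word-level overlap ignores order: both texts share the same word set.
    print(overlap_similarity(s1, s2))

    # Chunk-level overlap: "look after" is a single unit only in s1,
    # so the scrambled text no longer matches it.
    print(overlap_similarity(segment(s1, CHUNK_LEXICON),
                             segment(s2, CHUNK_LEXICON)))
```

In this toy case the two texts are indistinguishable at the word level, while the chunk-level score drops because the fixed word order inside "look after" is broken in the second text. No extra word-order weight is needed, which is the motivation stated in the abstract.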

Acknowledgement

This research was funded by the Education Department of Heilongjiang Province of China (Grant Nos. 135309463 and 135509118).
