http://dx.doi.org/10.3745/JIPS.02.0124

SSF: Sentence Similar Function Based on word2vector Similar Elements  

Yuan, Xinpan (School of Computer Science, Hunan University of Technology)
Wang, Songlin (School of Computer Science, Hunan University of Technology)
Wan, Lanjun (School of Computer Science, Hunan University of Technology)
Zhang, Chengyuan (School of Information Science and Engineering, Central South University)
Publication Information
Journal of Information Processing Systems / v.15, no.6, 2019, pp. 1503-1516
Abstract
To improve the accuracy of similarity calculation for long sentences, we propose a sentence similarity method based on a system similarity function. The algorithm treats word2vector word vectors as the system elements when computing sentence similarity. Its higher accuracy derives from two characteristics: the negative effect of a penalty item, and the fact that the sentence similar function (SSF) based on word2vector similar elements does not satisfy the exchange rule (it is not commutative). Because the time complexity of the algorithm depends on the process of calculating similar elements, we build an index of potentially similar elements while training the word vectors. Experimental results show that our algorithm is more accurate than the word mover's distance (WMD) and achieves the shortest query time of the three SSF calculation methods.
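The ideas named in the abstract (similar elements, a penalty item, a non-commutative score, and a precomputed index) can be illustrated with a minimal sketch. This is not the SSF formula from the paper, which the abstract does not state. The threshold, the penalty weight, the normalization by query length, and all function names (`cosine`, `build_index`, `similar_elements`, `directed_similarity`) are assumptions chosen for illustration, and the word vectors are toy values rather than trained word2vector output.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def build_index(vectors, threshold=0.5):
    """Precompute, for every word, the words whose cosine is >= threshold.

    Assumption: this stands in for the 'index of potentially similar
    elements'; the paper builds its index during word-vector training.
    """
    words = list(vectors)
    index = {w: {} for w in words}
    for i, w in enumerate(words):
        for t in words[i:]:
            s = cosine(vectors[w], vectors[t])
            if s >= threshold:
                index[w][t] = s
                index[t][w] = s
    return index


def similar_elements(query, target, index):
    """Pair each query word with its best indexed match in the target, if any."""
    pairs = []
    for w in query:
        candidates = index.get(w, {})
        scored = [(candidates[t], t) for t in target if t in candidates]
        if scored:
            sim, best = max(scored)
            pairs.append((w, best, sim))
    return pairs


def directed_similarity(query, target, index, penalty=0.1):
    """Illustrative directed score: matched similarity minus a penalty for
    each unmatched query word, normalized by the query length (assumed form).
    """
    if not query:
        return 0.0
    pairs = similar_elements(query, target, index)
    matched = sum(sim for _, _, sim in pairs)
    unmatched = len(query) - len(pairs)
    return max(0.0, (matched - penalty * unmatched) / len(query))


# Toy vectors (hypothetical, not trained embeddings).
vectors = {"cat": (1.0, 0.0), "feline": (1.0, 0.0), "sits": (0.0, 1.0)}
index = build_index(vectors)
a = ["cat", "sits"]
b = ["feline"]
forward = directed_similarity(a, b, index)
backward = directed_similarity(b, a, index)
```

In this toy setting `forward` and `backward` differ because the score is normalized by the length of the first argument and the unmatched word "sits" is penalized in only one direction, which mirrors the non-commutative behavior the abstract describes. Looking up candidates in the precomputed index avoids recomputing cosine similarities at query time.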
Keywords
Long Sentence Similarity; Similar Element; System Similarity; WMD; Word2vector;