A Text Similarity Measurement Method Based on Singular Value Decomposition and Semantic Relevance

  • Li, Xu (School of Information Science and Engineering, Dalian Polytechnic University)
  • Yao, Chunlong (School of Information Science and Engineering, Dalian Polytechnic University)
  • Fan, Fenglong (School of Information Science and Engineering, Dalian Polytechnic University)
  • Yu, Xiaoqiang (School of Information Science and Engineering, Dalian Polytechnic University)
  • Received : 2016.02.15
  • Accepted : 2017.03.14
  • Published : 2017.08.31

Abstract

Traditional text similarity measures based on word-frequency vectors ignore the semantic relationships between words. Together with the high dimensionality and sparsity of document vectors, this is an obstacle to text similarity calculation. To address these problems, an improved singular value decomposition is used to reduce the dimensionality of the text representation model and remove noise from it. The optimal number of singular values is analyzed, and the semantic relevance between words is calculated in the constructed semantic space. An inverted index construction algorithm and definitions of similarity between vectors are proposed to calculate the similarity between two documents at the semantic level. Experimental results on a benchmark corpus show that the proposed method improves the F-measure.
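The general idea of comparing documents in a reduced semantic space can be illustrated with a minimal sketch. It uses a plain truncated SVD from NumPy, not the improved decomposition proposed in the paper, whose details the abstract does not give. The toy corpus, the choice of k = 2 singular values, and all function names are assumptions made for illustration only.

```python
import numpy as np


def term_document_matrix(docs):
    """Build a raw word-frequency matrix (rows: words, columns: documents)."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    A = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.lower().split():
            A[index[w], j] += 1.0
    return A, index


def truncated_svd(A, k):
    """Keep only the k largest singular values and their vectors."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]


def cosine(u, v):
    """Cosine similarity, defined as 0 when either vector is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0


# Hypothetical toy corpus: two topics with no shared vocabulary.
docs = [
    "car engine repair",
    "automobile engine repair",
    "car automobile dealer",
    "fruit apple banana",
    "apple banana juice",
]

A, index = term_document_matrix(docs)
U_k, s_k, Vt_k = truncated_svd(A, k=2)  # k=2 is an assumed value

doc_vecs = (np.diag(s_k) @ Vt_k).T  # one row per document in the semantic space
word_vecs = U_k * s_k               # one row per word in the semantic space

# Document similarity at the semantic level.
sim_related = cosine(doc_vecs[0], doc_vecs[1])
sim_unrelated = cosine(doc_vecs[0], doc_vecs[3])

# Word relevance: "car" and "automobile" co-occur in only one document,
# so their raw frequency rows are only partly similar.
car, auto = index["car"], index["automobile"]
word_sim_raw = cosine(A[car], A[auto])
word_sim_latent = cosine(word_vecs[car], word_vecs[auto])

print(f"doc 0 vs doc 1: {sim_related:.3f}")
print(f"doc 0 vs doc 3: {sim_unrelated:.3f}")
print(f"car vs automobile (raw):    {word_sim_raw:.3f}")
print(f"car vs automobile (latent): {word_sim_latent:.3f}")
```

In this sketch the two car-related documents end up close together in the reduced space even though one says "car" and the other "automobile", while the fruit document stays unrelated. Choosing k is the step the abstract refers to as analyzing the optimal number of singular values; here it is simply fixed by hand.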
