http://dx.doi.org/10.1633/JISTaP.2020.8.2.1

Empirical Comparison of Word Similarity Measures Based on Co-Occurrence, Context, and a Vector Space Model  

Kadowaki, Natsuki (Graduate School of Library and Information Science, Keio University)
Kishida, Kazuaki (Faculty of Letters, Keio University)
Publication Information
Journal of Information Science Theory and Practice / v.8, no.2, 2020, pp. 6-17
Abstract
Word similarity is often measured to enhance system performance in information retrieval and related fields. This paper reports an experimental comparison of word similarity measures computed for 50 intentionally selected words from a Reuters corpus. Three types of measure were compared: (1) co-occurrence-based similarity measures, for which co-occurrence frequency was counted over documents or over sentences; (2) context-based distributional similarity measures obtained from latent Dirichlet allocation (LDA), nonnegative matrix factorization (NMF), and the Word2Vec algorithm; and (3) similarity measures computed from the tf-idf weights of each word according to a vector space model (VSM). The Pearson correlation coefficient between the VSM-based similarity measures and the co-occurrence-based measures counted over documents was the highest of any pair. Group-average agglomerative hierarchical clustering was also applied to the similarity matrices computed by the individual measures. An evaluation of the resulting cluster sets against an answer set revealed that the VSM- and LDA-based similarity measures performed best.
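Two of the measures named in the abstract can be illustrated concisely: counting co-occurrence frequency as the number of documents containing both words, and representing each word by its tf-idf weights across documents (a VSM row vector) and comparing words by cosine similarity. The sketch below is illustrative only, with a toy corpus and a minimal tf-idf formula; it is not the authors' code or data.

```python
# Hedged sketch (not the paper's implementation): document-level co-occurrence
# counts vs. VSM (tf-idf) cosine similarity for word pairs. Toy corpus.
import math
from collections import Counter

docs = [
    "oil price rise market",
    "oil market share price",
    "bank rate rise",
    "bank loan rate market",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

# Document frequency of each word (number of documents containing it).
df = Counter(w for doc in tokenized for w in set(doc))

def tfidf_vector(word):
    # A word is represented by its tf-idf weights over documents,
    # i.e., one row of a word-by-document matrix.
    return [doc.count(word) * math.log(N / df[word]) for doc in tokenized]

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def doc_cooccurrence(w1, w2):
    # Co-occurrence frequency counted as the number of documents
    # in which both words appear (the "documents" variant; the
    # "sentences" variant would count sentences instead).
    return sum(1 for doc in tokenized if w1 in doc and w2 in doc)

print(doc_cooccurrence("oil", "price"))                           # 2
print(round(cosine(tfidf_vector("oil"), tfidf_vector("price")), 3))
```

In this toy corpus, "oil" and "price" appear in exactly the same documents, so their tf-idf vectors are parallel and the cosine similarity is maximal; the paper's finding that document-level co-occurrence correlates strongly with VSM similarity is intuitive from this construction.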
Keywords
word similarity; word clustering; topic model; word embedding
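The clustering step mentioned in the abstract, group-average agglomerative hierarchical clustering over a word-similarity matrix, can be sketched as follows. The similarity values, words, and stopping criterion (a fixed number of clusters) are illustrative assumptions, not the paper's data or exact procedure.

```python
# Hedged sketch: group-average agglomerative hierarchical clustering (AHC)
# driven by a word-similarity matrix. Similarity values are made up.
words = ["oil", "price", "bank", "loan"]
sim = {
    ("oil", "price"): 0.9, ("oil", "bank"): 0.1, ("oil", "loan"): 0.2,
    ("price", "bank"): 0.3, ("price", "loan"): 0.2, ("bank", "loan"): 0.8,
}

def pair_sim(a, b):
    if a == b:
        return 1.0
    return sim[(a, b)] if (a, b) in sim else sim[(b, a)]

def group_average(c1, c2):
    # Group-average linkage: mean similarity over all cross-cluster pairs.
    pairs = [(a, b) for a in c1 for b in c2]
    return sum(pair_sim(a, b) for a, b in pairs) / len(pairs)

def ahc(items, k):
    # Start with singleton clusters; repeatedly merge the most similar pair
    # of clusters until only k clusters remain.
    clusters = [[w] for w in items]
    while len(clusters) > k:
        i, j = max(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: group_average(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

print(ahc(words, 2))  # groups "oil"/"price" and "bank"/"loan"
```

With these illustrative similarities, the two high-similarity pairs merge first, yielding the clusters {oil, price} and {bank, loan}; evaluating such cluster sets against a hand-built answer set is how the paper compares the similarity measures.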