
http://dx.doi.org/10.1633/JISTaP.2020.8.2.1

Empirical Comparison of Word Similarity Measures Based on Co-Occurrence, Context, and a Vector Space Model

Kadowaki, Natsuki (Graduate School of Library and Information Science, Keio University)
Kishida, Kazuaki (Faculty of Letters, Keio University)

Publication Information

Journal of Information Science Theory and Practice / v.8, no.2, 2020, pp. 6-17

Abstract

Word similarity is often measured to enhance system performance in information retrieval and related areas. This paper reports an experimental comparison of word similarity measures whose values were computed for 50 intentionally selected words from a Reuters corpus. Three types of measures were compared: (1) co-occurrence-based similarity measures, for which the co-occurrence frequency is counted as the number of documents or sentences; (2) context-based distributional similarity measures obtained from latent Dirichlet allocation (LDA), nonnegative matrix factorization (NMF), and the Word2Vec algorithm; and (3) similarity measures computed from the tf-idf weights of each word according to a vector space model (VSM). The Pearson correlation coefficient was highest for the pair consisting of the VSM-based similarity measure and the co-occurrence-based similarity measure computed from document counts. Group-average agglomerative hierarchical clustering was also applied to the similarity matrices computed by the individual measures. An evaluation of the resulting cluster sets against an answer set revealed that the VSM- and LDA-based similarity measures performed best.
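As a rough illustration of the kind of comparison the abstract describes, the following sketch computes a VSM-based word similarity (cosine of tf-idf vectors over documents) and a document-level co-occurrence similarity for every word pair in a toy corpus, then correlates the two sets of values with a Pearson coefficient. The corpus, the smoothed idf formula, the choice of the Dice coefficient, and all function names are assumptions made for illustration; the abstract does not specify them, and the LDA, NMF, Word2Vec, and clustering steps are not covered here.

```python
import math
from itertools import combinations

# Toy corpus (hypothetical; the paper used a Reuters corpus).
docs = [
    "oil price rise opec",
    "opec oil output cut",
    "bank rate rise",
    "bank loan rate cut",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(tokenized)

# Document frequency: number of documents containing each word.
df = {w: sum(w in doc for doc in tokenized) for w in vocab}


def tfidf_vector(word):
    """Represent a word by its tf-idf weight in each document."""
    idf = math.log(n_docs / df[word]) + 1.0  # smoothed idf (an assumption)
    return [doc.count(word) * idf for doc in tokenized]


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def vsm_similarity(w1, w2):
    """VSM-based similarity: cosine of the two words' tf-idf vectors."""
    return cosine(tfidf_vector(w1), tfidf_vector(w2))


def cooccurrence_similarity(w1, w2):
    """Dice coefficient over document-level co-occurrence counts (an assumption)."""
    both = sum(w1 in doc and w2 in doc for doc in tokenized)
    return 2 * both / (df[w1] + df[w2])


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Compare the two measures over all word pairs.
pairs = list(combinations(vocab, 2))
vsm_values = [vsm_similarity(a, b) for a, b in pairs]
cooc_values = [cooccurrence_similarity(a, b) for a, b in pairs]
r = pearson(vsm_values, cooc_values)
print(f"Pearson r between VSM and co-occurrence similarities: {r:.3f}")
```

On a corpus this small, both measures depend almost entirely on the same document counts, so the correlation says little; the sketch is only meant to show how two similarity measures can be compared pair by pair.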

Keywords

word similarity; word clustering; topic model; word embedding
