http://dx.doi.org/10.5351/KJAS.2021.34.3.389

Modified multi-sense skip-gram using weighted context and X-means

Jeong, Hyunwoo (Department of Statistics, Sungkyunkwan University)
Lee, Eun Ryung (Department of Statistics, Sungkyunkwan University)
Publication Information
The Korean Journal of Applied Statistics / v.34, no.3, 2021, pp. 389-399
Abstract
In recent years, word embedding has been a popular field of natural language processing research, and the skip-gram has become one of its most successful methods. It assigns an embedding vector to each word using its contexts, which provides an effective way to analyze text data. However, due to the limitation of the vector space model, standard word embedding methods assume that every word has only a single meaning. Since multi-sense words, that is, words with more than one meaning, are common in practice, Neelakantan et al. (2014) proposed the multi-sense skip-gram (MSSG), which finds an embedding vector for each sense of a multi-sense word using a clustering method. In this paper, we propose a modification of the MSSG to improve statistical accuracy. Moreover, we propose a data-adaptive choice of the number of clusters, that is, the number of meanings of a multi-sense word. Numerical evidence is given through real data-based simulations.
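
The following is a minimal sketch, not the authors' implementation, of the two ingredients the paper modifies in the MSSG: (i) a weighted average of the context-word vectors in place of the plain average, and (ii) an x-means-style, BIC-based data-adaptive choice of the number of senses. The toy data, the inverse-distance weights, and the helper names (weighted_context_vector, choose_k_by_bic) are illustrative assumptions; the BIC below is the spherical-Gaussian approximation used by x-means (Pelleg and Moore, 2000).

import numpy as np
from sklearn.cluster import KMeans

def weighted_context_vector(context_vecs, weights):
    # Weighted mean of the context-word embedding vectors; plain MSSG
    # (Neelakantan et al., 2014) corresponds to equal weights.
    w = np.asarray(weights, dtype=float)
    v = np.asarray(context_vecs, dtype=float)
    return (w[:, None] * v).sum(axis=0) / w.sum()

def bic(X, km):
    # Spherical-Gaussian BIC of a fitted k-means solution (x-means style).
    n, d = X.shape
    k = km.n_clusters
    var = max(km.inertia_ / max(n - k, 1), 1e-12)   # pooled ML variance
    loglik = -0.5 * n * d * np.log(2 * np.pi * var) - 0.5 * km.inertia_ / var
    n_params = k * d + (k - 1) + 1                  # centers, mixing weights, variance
    return loglik - 0.5 * n_params * np.log(n)

def choose_k_by_bic(X, k_max=5, seed=0):
    # Fit k-means for k = 1, ..., k_max and keep the best-BIC solution.
    fits = [KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
            for k in range(1, k_max + 1)]
    return max(fits, key=lambda km: bic(X, km))

rng = np.random.default_rng(0)
# Toy "context vectors" of one ambiguous word: two well-separated senses in R^10.
X = np.vstack([rng.normal(-2, 1, (50, 10)), rng.normal(2, 1, (50, 10))])
km = choose_k_by_bic(X)
print("estimated number of senses:", km.n_clusters)   # should recover 2

# Sense assignment for a new occurrence: nearest cluster center to its
# (here, inverse-distance-weighted) context vector.
ctx = rng.normal(2, 1, (4, 10))               # four context-word vectors
w = 1.0 / np.array([2.0, 1.0, 1.0, 2.0])      # words nearer the center word weigh more
v = weighted_context_vector(ctx, w)
print("assigned sense:", km.predict(v[None, :])[0])

In the MSSG itself, the cluster centers act as sense prototypes: each new occurrence of a word is assigned to the sense whose center is closest to its context vector, and only that sense's embedding vector is updated during training.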
Keywords
word embedding; skip-gram; multi-sense word; multi-sense skip-gram; X-means clustering; weighted context vector
References
1 Neelakantan A, Shankar J, Passos A, and McCallum A (2014). Efficient non-parametric estimation of multiple embeddings per word in vector space. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 1059-1069.
2 Huang E, Socher R, Manning C, and Ng A (2012). Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, 1, 873-882.
3 Rong X (2014). word2vec parameter learning explained, arXiv preprint arXiv:1411.2738.
4 Ishioka T (2005). An expansion of X-means for automatically determining the optimal number of clusters. In Proceedings of the International Conference on Computational Intelligence, 2, 91-95.
5 Pelleg D and Moore AW (2000). X-means: extending k-means with efficient estimation of the number of clusters. In Proceedings of the 17th International Conference on Machine Learning, 727-734.
6 Grün B and Hornik K (2011). topicmodels: an R package for fitting topic models, Journal of Statistical Software, 40, 1-30.
7 Mikolov T, Sutskever I, Chen K, Corrado G, and Dean J (2013a). Distributed representations of words and phrases and their compositionality, Advances in Neural Information Processing Systems, 26, 3111-3119.
8 Zheng Y, Shi Y, Guo K, Li WL, and Zhu L (2017). Enhanced word embedding with multiple prototypes. In Proceedings of the 4th International Conference on Industrial Economics System and Industrial Security Engineering, 1-5.
9 Mikolov T, Chen K, Corrado G, and Dean J (2013b). Efficient estimation of word representations in vector space. International Conference on Learning Representations.