http://dx.doi.org/10.3837/tiis.2020.10.003

Microblog User Geolocation by Extracting Local Words Based on Word Clustering and Wrapper Feature Selection

Tian, Hechan (State Key Laboratory of Mathematical Engineering and Advanced Computing)
Liu, Fenlin (State Key Laboratory of Mathematical Engineering and Advanced Computing)
Luo, Xiangyang (State Key Laboratory of Mathematical Engineering and Advanced Computing)
Zhang, Fan (State Key Laboratory of Mathematical Engineering and Advanced Computing)
Qiao, Yaqiong (State Key Laboratory of Mathematical Engineering and Advanced Computing)

Publication Information

KSII Transactions on Internet and Information Systems (TIIS) / v.14, no.10, 2020, pp. 3972-3988

Abstract

Existing methods for microblog user geolocation rely on statistical features to extract local words. The extracted words contain many non-local words, which lowers geolocation accuracy. Considering both the statistical and the semantic features of local words, this paper proposes a microblog user geolocation method that extracts local words through word clustering and wrapper feature selection. First, ordinary words that carry no positional indication are filtered out based on statistical features. Second, a word clustering algorithm based on word vectors is proposed: the remaining words are grouped into clusters of semantically similar words according to the distance between their word vectors. Next, a wrapper feature selection algorithm based on sequential backward subset search is proposed: it selects the subset of clusters that gives the best geolocation performance, and the words in the selected clusters are extracted as local words. Finally, a Naive Bayes classifier is trained on the local words to geolocate microblog users. The proposed method is validated on two different types of microblog data, Twitter and Weibo. The results show that it outperforms two typical existing methods based on statistical features in terms of accuracy, precision, recall, and F1-score.
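The wrapper step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the stopping rule, the evaluation metric used during the search, or the classifier details, so the code assumes a greedy first-improvement search, accuracy on a held-out set, and a multinomial Naive Bayes with add-one smoothing. The word clusters are written by hand here instead of being derived from word vectors, and the function names and toy data are hypothetical, contrived so that one cluster is misleading.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, vocab):
    """Multinomial Naive Bayes counts, restricted to words in vocab (assumed model)."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for words, label in zip(docs, labels):
        word_counts[label].update(w for w in words if w in vocab)
    return class_docs, word_counts, vocab

def predict_nb(model, words):
    """Return the most probable label; ties go to the alphabetically first label."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_docs):
        total = sum(word_counts[label].values())
        score = math.log(class_docs[label] / total_docs)
        for w in words:
            if w in vocab:  # words outside the selected clusters are ignored
                score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def accuracy(clusters, train, dev):
    """Train on the words of the given clusters and score on held-out users."""
    vocab = set().union(*clusters) if clusters else set()
    model = train_nb([d for d, _ in train], [y for _, y in train], vocab)
    return sum(predict_nb(model, d) == y for d, y in dev) / len(dev)

def backward_select(clusters, train, dev):
    """Sequential backward search: start from all clusters and drop one
    cluster at a time as long as held-out accuracy strictly improves."""
    selected = list(clusters)
    best = accuracy(selected, train, dev)
    improved = True
    while improved and len(selected) > 1:
        improved = False
        for i in range(len(selected)):
            candidate = selected[:i] + selected[i + 1:]
            score = accuracy(candidate, train, dev)
            if score > best:
                best, selected, improved = score, candidate, True
                break
    return selected, best

# Hypothetical toy data: two location-bearing clusters and one misleading cluster.
clusters = [{"subway", "bagel"}, {"surf", "beach"}, {"lol", "today"}]
train = [
    (["subway", "bagel", "lol"], "NYC"),
    (["subway", "lol", "lol"], "NYC"),
    (["surf", "beach", "today"], "LA"),
    (["surf", "today"], "LA"),
]
dev = [(["bagel", "today"], "NYC"), (["beach", "lol"], "LA")]

selected, best = backward_select(clusters, train, dev)
print(selected, best)
```

On this toy data the search drops the `{"lol", "today"}` cluster and keeps the other two, because removing it is the only single deletion that raises held-out accuracy. A wrapper method scores each candidate subset with the downstream classifier itself, which is why each step retrains the Naive Bayes model instead of ranking clusters by a standalone statistic.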

Keywords

Location Prediction; Word Clustering; Feature Selection
