http://dx.doi.org/10.1633/JISTaP.2017.5.3.3

Enhancing the Narrow-down Approach to Large-scale Hierarchical Text Classification with Category Path Information

Oh, Heung-Seon (Korea Institute of Science and Technology Information)
Jung, Yuchul (Kumoh National Institute of Technology (KIT))

Publication Information

Journal of Information Science Theory and Practice, v.5, no.3, 2017, pp. 31-47

Abstract

The narrow-down approach, composed of separate search and classification stages, is an effective way of dealing with large-scale hierarchical text classification. Recent approaches incorporate global, local, and path information extracted from web taxonomies in the classification stage. However, few efforts have addressed the limitations of existing uses of path information or developed more sophisticated methods. In this paper, we propose an expansion method that exploits category path information more effectively, based on the observation that the existing method suffers from a term mismatch problem and low discrimination power due to insufficient path information. The key idea of our method is to utilize relevant information that is not present on category paths by adding more useful words. We evaluate the effectiveness of our method on state-of-the-art narrow-down methods and report the results with an in-depth analysis.

Keywords

Hierarchical text classification; Query expansion; Narrow-down approach
