http://dx.doi.org/10.7236/JIIBC.2013.13.4.141

Performance Improvement of Web Document Classification through Incorporation of Feature Selection and Weighting  

Lee, Ah-Ram (School of Electrical and Computer Engineering, University of Seoul)
Kim, Han-Joon (School of Electrical and Computer Engineering, University of Seoul)
Man, Xuan (School of Electrical and Computer Engineering, University of Seoul)
Publication Information
The Journal of the Institute of Internet, Broadcasting and Communication, v.13, no.4, 2013, pp. 141-148
Abstract
Automated classification systems based on machine learning build classification models through a learning process and then classify unknown data into a predefined set of categories according to those models. The performance of such systems depends heavily on the quality of the features that make up the classification models. For textual data, the feature set can be generated from word terms and structural information. In particular, extracting features from Web documents requires analyzing tag and hyperlink information. Recent studies on Web document classification focus on feature engineering rather than on the machine learning algorithms themselves. This paper therefore proposes a novel method that incorporates feature selection and feature weighting to improve classification models effectively. In extensive experiments on the Web-KB document collections, the proposed method outperforms conventional ones.
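As a rough illustration of the general idea summarized above, the following sketch extracts terms from HTML, boosts terms that appear inside selected tags, ranks terms with a chi-square score, and rescales the retained terms by that score. It is not the method proposed in the paper: the tag weights in `TAG_WEIGHTS`, the choice of chi-square as the selection criterion, the multiplication of tag weight by selection score, and all function names are assumptions made for this example only.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical per-tag boosts (an assumption for illustration, not from the paper).
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "a": 1.5}


class TagWeightedTerms(HTMLParser):
    """Accumulate term weights, boosting terms found inside selected tags."""

    def __init__(self):
        super().__init__()
        self.stack = []           # currently open tags
        self.weights = Counter()  # term -> accumulated weight

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        boost = max([TAG_WEIGHTS.get(t, 1.0) for t in self.stack], default=1.0)
        for term in re.findall(r"[a-z]+", data.lower()):
            self.weights[term] += boost


def extract(html):
    """Return a dict mapping each term in the page to its tag-boosted weight."""
    parser = TagWeightedTerms()
    parser.feed(html)
    parser.close()
    return dict(parser.weights)


def chi_square(term, docs, labels, target):
    """Chi-square statistic of term presence versus membership in `target`."""
    n = len(docs)
    a = sum(1 for d, y in zip(docs, labels) if term in d and y == target)
    b = sum(1 for d, y in zip(docs, labels) if term in d and y != target)
    c = sum(1 for d, y in zip(docs, labels) if term not in d and y == target)
    d_ = n - a - b - c
    denom = (a + b) * (c + d_) * (a + c) * (b + d_)
    return n * (a * d_ - b * c) ** 2 / denom if denom else 0.0


def select_and_weight(pages, labels, k):
    """Keep the k best terms by chi-square, then scale tag weights by that score."""
    docs = [extract(p) for p in pages]
    vocab = sorted({t for d in docs for t in d})
    score = {t: max(chi_square(t, docs, labels, c) for c in set(labels))
             for t in vocab}
    kept = sorted(vocab, key=lambda t: (-score[t], t))[:k]
    vectors = [{t: d[t] * score[t] for t in kept if t in d} for d in docs]
    return kept, vectors


# Toy pages and labels, invented for this example.
pages = [
    "<title>Course syllabus</title><p>homework and exam schedule</p>",
    "<title>Course notes</title><p>lecture homework</p>",
    "<title>Faculty page</title><p>research and publications</p>",
    "<title>Faculty profile</title><p>research and teaching</p>",
]
labels = ["course", "course", "faculty", "faculty"]
kept, vectors = select_and_weight(pages, labels, k=4)
print(kept)
print(vectors[0])
```

The resulting sparse vectors could be passed to any standard classifier. Multiplying the two factors is only one simple way to let the selection score influence the final weight; other combinations are equally plausible.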
Keywords
Document Classification; Web; Feature Selection; Feature Weighting; Machine Learning;