Search | Korea Science

Ko, Byeong-Kyu;Oh, Kun-Seok;Kim, Pan-Koo
- Journal of Information Technology and Architecture / v.9 no.4 / pp.413-422 / 2012
Many researchers have studied how to let machines understand the meaning of human natural language, using text-based, page-rank-based, and other approaches. In particular, URL and HTML tag information in web documents is attracting attention again as a way to analyze huge amounts of web documents automatically. In this paper, we propose an STW (Semantic Term Weight) approach based on the syntactic and linguistic structure of web documents in order to classify their genres. For the evaluation, we analyzed more than 1,000 documents from the 20-Genre-collection corpus to train an SVM-based classifier. Afterwards, we tested on the KI-04 corpus to evaluate the performance of the proposed method. We measured accuracy in two experiments, one using STW and one without STW. The proposed STW-based approach showed accuracy approximately 10.2% higher than the approach without STW.
KSCI
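
The abstract does not define how STW is computed, so the following is only a minimal sketch of the general idea of weighting terms by the HTML structure they appear in. The per-tag weights in `TAG_WEIGHTS`, the class `TagWeightedTerms`, and the function `term_weights` are hypothetical names and values chosen for illustration; they are not the authors' formula. The resulting weights could serve as input features for a classifier such as an SVM.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical per-tag weights; the abstract does not give the actual STW values.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "a": 1.5}
DEFAULT_WEIGHT = 1.0
# Void elements never get a closing tag, so they are not tracked as "open".
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class TagWeightedTerms(HTMLParser):
    """Accumulates a weight per term, scaled by the innermost enclosing tag."""

    def __init__(self):
        super().__init__()
        self.stack = []           # currently open tags, innermost last
        self.weights = Counter()  # term -> accumulated weight

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed inner tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        tag = self.stack[-1] if self.stack else None
        weight = TAG_WEIGHTS.get(tag, DEFAULT_WEIGHT)
        for term in re.findall(r"[a-z0-9]+", data.lower()):
            self.weights[term] += weight


def term_weights(html):
    """Return a dict mapping each term to its tag-weighted frequency."""
    parser = TagWeightedTerms()
    parser.feed(html)
    parser.close()
    return dict(parser.weights)


if __name__ == "__main__":
    page = ("<html><head><title>Shop News</title></head>"
            "<body><h1>News</h1><p>shop open</p></body></html>")
    print(term_weights(page))
```

In this sketch a term in the title counts three times as much as the same term in body text, which is one simple way structural position can influence a term weight.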

Lee, Kong-Joo;Lim, Chul-Su;Kim, Jae-Hoon
- The KIPS Transactions:PartB / v.11B no.5 / pp.555-562 / 2004
A genre or style is a view of documents that differs from subject or topic, and it is also a criterion for classifying documents. There have been several studies on detecting the style of textual documents, but only a few of them have dealt with web documents. In this paper, we suggest sets of features for detecting the styles of web documents. Web documents differ from textual documents in that they contain URLs and HTML tags within their pages. We introduce features specific to web documents, which are extracted from URLs and HTML tags. Experimental results enable us to evaluate their characteristics and performance.
https://doi.org/10.3745/KIPSTB.2004.11B.5.555 KSCI
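
The abstract above mentions features extracted from URLs and HTML tags without listing them. As a rough illustration only, the sketch below derives a few plausible features using the Python standard library. The feature names, the tracked tag list `TRACKED_TAGS`, and the helper functions are assumptions made for this example, not the feature sets proposed in the paper.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical selection of tags to track; the paper's actual set is not given here.
TRACKED_TAGS = ("a", "img", "form", "table", "p")


class TagCounter(HTMLParser):
    """Counts how often each opening tag occurs in a page."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def url_features(url):
    """Simple URL-derived features: path depth, query flag, TLD, file extension."""
    parts = urlparse(url)
    segments = [s for s in parts.path.split("/") if s]
    host = parts.hostname or ""
    last = segments[-1] if segments else ""
    return {
        "url_depth": len(segments),
        "url_has_query": int(bool(parts.query)),
        "url_tld": host.rsplit(".", 1)[-1] if "." in host else "",
        "url_file_ext": last.rsplit(".", 1)[-1].lower() if "." in last else "",
    }


def html_tag_features(html):
    """Tag-derived features: counts of tracked tags, total tags, link ratio."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    total = sum(parser.counts.values())
    feats = {f"tag_{t}": parser.counts[t] for t in TRACKED_TAGS}
    feats["tag_total"] = total
    feats["link_ratio"] = parser.counts["a"] / total if total else 0.0
    return feats


def extract_features(url, html):
    """Merge URL and HTML tag features into one feature dict."""
    return {**url_features(url), **html_tag_features(html)}


if __name__ == "__main__":
    html = ("<html><body><p>Hi</p><a href='#'>x</a>"
            "<a href='#'>y</a><img src='a.png'></body></html>")
    print(extract_features("http://example.com/shop/items/list.html?page=2", html))
```

A link-heavy page with a shallow URL and a text-heavy page with a deep URL would yield clearly different feature dicts, which is the kind of signal a style classifier could use.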