Integrated Search | Korea Science

Representation of Texts into String Vectors for Text Categorization

Jo, Tae-Ho
- Journal of Computing Science and Engineering, Vol. 4, No. 2, pp. 110-127, 2010
In this study, we propose a method for encoding documents into string vectors, instead of numerical vectors. A traditional approach to text categorization usually requires encoding documents into numerical vectors. The usual method of encoding documents therefore causes two main problems: huge dimensionality and sparse distribution. In this study, we modify or create machine learning-based approaches to text categorization, where string vectors are received as input vectors, instead of numerical vectors. As a result, we can improve text categorization performance by avoiding these two problems.
https://doi.org/10.5626/JCSE.2010.4.2.110
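
The two problems named in this abstract, huge dimensionality and sparse distribution, can be seen on a toy corpus. The sketch below is only an illustration under stated assumptions: the three documents are made up, whitespace tokenization stands in for real preprocessing, and `to_string_vector` is a hypothetical stand-in that keeps the most frequent words. The listing does not say how the paper actually builds its string vectors.

```python
from collections import Counter


def build_vocabulary(documents):
    """Collect every distinct lowercase token across the corpus, sorted."""
    return sorted({token for doc in documents for token in doc.lower().split()})


def to_numerical_vector(document, vocabulary):
    """Encode one document as term counts over the whole vocabulary."""
    counts = Counter(document.lower().split())
    return [counts[term] for term in vocabulary]


def sparsity(vector):
    """Return the fraction of entries that are zero."""
    return sum(1 for value in vector if value == 0) / len(vector)


def to_string_vector(document, size):
    """Hypothetical string vector: the `size` most frequent words (an assumption)."""
    counts = Counter(document.lower().split())
    return [word for word, _ in counts.most_common(size)]


# Made-up toy corpus, not data from the paper.
documents = [
    "stocks fell as markets reacted to rate news",
    "the team won the final match",
    "new vaccine trial shows promising results",
]

vocabulary = build_vocabulary(documents)
vectors = [to_numerical_vector(doc, vocabulary) for doc in documents]

for doc, vector in zip(documents, vectors):
    # The numerical vector grows with the vocabulary and is mostly zeros.
    print(len(vector), round(sparsity(vector), 2), to_string_vector(doc, 3))
```

Each numerical vector has one entry per vocabulary term, so its length grows with the corpus while most entries stay zero. The string vector keeps a fixed, small number of words regardless of vocabulary size.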

Impact of Instance Selection on kNN-Based Text Categorization

Barigou, Fatiha
- Journal of Information Processing Systems, Vol. 14, No. 2, pp. 418-434, 2018
With the increasing use of the Internet and electronic documents, automatic text categorization becomes imperative. Several machine learning algorithms have been proposed for text categorization. The k-nearest neighbor algorithm (kNN) is known to be one of the best state-of-the-art classifiers when used for text categorization. However, kNN suffers from limitations such as high computation cost when classifying new instances. Instance selection techniques have emerged as highly competitive methods to improve kNN through data reduction. However, previous works have evaluated those approaches only on structured datasets. In addition, their performance has not been examined in the text categorization domain, where the dimensionality and size of the dataset are very high. Motivated by these observations, this paper investigates and analyzes the impact of instance selection on kNN-based text categorization in terms of classification accuracy, classification efficiency, and data reduction.
https://doi.org/10.3745/JIPS.02.0080

Table based Matching Algorithm for Soft Categorization of News Articles in Reuter 21578

Jo, Tae-Ho
- Journal of Korea Multimedia Society (한국멀티미디어학회논문지), Vol. 11, No. 6, pp. 875-882, 2008
This research proposes an alternative to machine learning-based approaches to text categorization. To use machine learning-based approaches for any text mining task, documents must be encoded into numerical vectors, which causes two problems: huge dimensionality and sparse distribution. Although text mining covers various tasks such as text categorization, text clustering, and text summarization, the scope of this research is restricted to text categorization. The idea of this research is to avoid the two problems by encoding a document or documents into a table, instead of numerical vectors. The goal of this research is therefore to improve the performance of text categorization by proposing approaches that are free from the two problems.

Text Categorization for Authorship based on the Features of Lingual Conceptual Expression

Zhang, Quan;Zhang, Yun-liang;Yuan, Yi
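- Proceedings of the Korean Society for Language and Information Conference (한국언어정보학회:학술대회논문집), 2007 Regular Conference of the Korean Society for Language and Information (한국언어정보학회 2007년도 정기학술대회), pp. 515-521, 2007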
Text categorization is an important field in automatic text information processing, and authorship identification of a text can be treated as a special case of text categorization. This paper adopts the conceptual primitive expressions of the Hierarchical Network of Concepts (HNC) theory, which describes word meanings with hierarchical symbols, in order to avoid the data sparseness caused by the surface features of natural language in text categorization. The KNN algorithm is used as the classifier. Experiments on Chinese text authorship identification show that the proposed approach achieves a high rate of correct identification, so it is feasible for text authorship identification.

Neural Text Categorizer for Exclusive Text Categorization

Jo, Tae-Ho
- Journal of Information Processing Systems, Vol. 4, No. 2, pp. 77-86, 2008
This research proposes a new neural network for text categorization that uses representations of documents other than numerical vectors. Since the proposed neural network is originally intended only for text categorization, it is called NTC (Neural Text Categorizer) in this research. Numerical vectors representing documents for text mining tasks inherently have two main problems: huge dimensionality and sparse distribution. Although various feature selection methods have been developed to address the first problem, the reduced dimension still remains large. If the dimension is reduced excessively by a feature selection method, the robustness of text categorization is degraded. Even though SVM (Support Vector Machine) is tolerant of huge dimensionality, it is not tolerant of the second problem. The goal of this research is to address the two problems at the same time by proposing a new representation of documents and a new neural network that takes this representation as its input vector.
https://doi.org/10.3745/JIPS.2008.4.2.077

A Novel Statistical Feature Selection Approach for Text Categorization

Fattah, Mohamed Abdel
- Journal of Information Processing Systems, Vol. 13, No. 5, pp. 1397-1409, 2017
For the text categorization task, selecting distinctive text features is important because of the high dimensionality of the feature space. It is important to reduce the feature space dimension in order to decrease processing time and increase accuracy. In the current study, we introduce a novel statistical feature selection approach for text categorization. This approach measures the term distribution across all documents in the collection, the term distribution within a given category, and the term distribution in a given class relative to other classes. The results show that the proposed method outperforms traditional feature selection methods.
https://doi.org/10.3745/JIPS.02.0076

An Automatic Text Categorization Theories and Techniques for Text Management

고영중;서정연
- Journal of Information Management (정보관리연구), Vol. 33, No. 2, pp. 19-32, 2002
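With the recent emergence of digital libraries and the widespread adoption of the Internet, the amount of text information available online has grown rapidly, creating a demand for efficient information management and retrieval. Automatic text categorization is the task of automatically assigning documents to predefined categories based on their content. Its purpose is to enable efficient information management and retrieval while reducing a vast amount of manual work. To classify documents, the features that best represent them are chosen, and the documents to be classified are represented with these features through an indexing process. A document classifier then classifies the documents according to the intended purpose. This paper introduces each step of automatic text categorization and the various techniques used at each step.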
https://doi.org/10.1633/JIM.2002.33.2.019

Improving the Performance of Statistical Automatic Text Categorization by using Phrasal Patterns and Keyword Sets

한정기;박민규;조광제;김준태
- The Transactions of the Korea Information Processing Society (한국정보처리학회논문지), Vol. 7, No. 4, pp. 1150-1159, 2000
This paper presents an automatic text categorization model that improves accuracy by combining statistical and knowledge-based categorization methods. In our model, we apply the knowledge-based method first, and then apply the statistical method to the texts that are not categorized by the knowledge-based method. By using this combined method, we can improve the accuracy of categorization while categorizing all the texts without failure. For statistical categorization, the vector model with Inverted Category Frequency (ICF) weighting is used. For knowledge-based categorization, Phrasal Patterns and Keyword Sets are introduced to represent sentence patterns, and then pattern matching is performed. Experimental results on new articles show that the accuracy of categorization can be improved by combining the two different categorization methods.
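
The two-stage scheme in this abstract (knowledge-based first, statistical fallback for whatever remains) can be sketched roughly as follows. This is an illustration under assumptions, not the paper's implementation. Phrasal patterns are simplified to plain keyword sets, ICF is assumed to be `log(N / cf)` for `N` categories of which `cf` contain the term, and the categories and word lists are made up.

```python
import math
from collections import Counter


def inverted_category_frequency(term, category_terms):
    """ICF, assumed here to be log(N / cf); 0.0 if no category has the term."""
    cf = sum(1 for terms in category_terms.values() if term in terms)
    return math.log(len(category_terms) / cf) if cf else 0.0


def knowledge_based(tokens, keyword_sets):
    """Return a category when one of its keyword sets fully occurs, else None."""
    present = set(tokens)
    for category, sets in keyword_sets.items():
        if any(keywords <= present for keywords in sets):
            return category
    return None


def statistical(tokens, category_terms):
    """Score each category by summed term frequency * ICF and pick the best."""
    counts = Counter(tokens)
    scores = {
        category: sum(
            counts[term] * inverted_category_frequency(term, category_terms)
            for term in terms
        )
        for category, terms in category_terms.items()
    }
    return max(scores, key=scores.get)


def categorize(text, keyword_sets, category_terms):
    """Apply the knowledge-based step first; fall back to the statistical step."""
    tokens = text.lower().split()
    return knowledge_based(tokens, keyword_sets) or statistical(tokens, category_terms)


# Made-up knowledge base and category vocabularies (hypothetical).
keyword_sets = {
    "sports": [{"world", "cup"}],
    "economy": [{"interest", "rate"}],
}
category_terms = {
    "sports": {"match", "team", "goal", "market"},
    "economy": {"market", "bank", "price"},
}

print(categorize("the world cup starts today", keyword_sets, category_terms))
print(categorize("the bank raised the price on the market", keyword_sets, category_terms))
```

Because the statistical step always returns some category, every text receives a label even when no keyword set matches. In this toy setup, a term shared by all categories, such as "market", gets an ICF of zero and does not influence the score.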

Modified Version of SVM for Text Categorization

Jo, Tae-Ho
- International Journal of Fuzzy Logic and Intelligent Systems, Vol. 8, No. 1, pp. 52-60, 2008
This research proposes a new strategy in which documents are encoded into string vectors for text categorization, together with modified versions of SVM that are adaptable to string vectors. When the traditional version of SVM is used for pattern classification, raw data must be encoded into numerical vectors. This encoding may be difficult, depending on the application area of pattern classification. For example, in text categorization, encoding full texts given as raw data into numerical vectors leads to two main problems: huge dimensionality and sparse distribution. In this research, we encode full texts into string vectors and apply the modified version of SVM adaptable to string vectors for text categorization.
https://doi.org/10.5391/IJFIS.2008.8.1.052

Inverted Index based Modified Version of KNN for Text Categorization

Jo, Tae-Ho
- Journal of Information Processing Systems, Vol. 4, No. 1, pp. 17-26, 2008
This research proposes a new strategy in which documents are encoded into string vectors for text categorization, together with a modified version of KNN that is adaptable to string vectors. Traditionally, when KNN is used for pattern classification, raw data must be encoded into numerical vectors. This encoding may be difficult, depending on the application area of pattern classification. For example, in text categorization, encoding full texts given as raw data into numerical vectors leads to two main problems: huge dimensionality and sparse distribution. In this research, we encode full texts into string vectors and modify the supervised learning algorithms to be adaptable to string vectors for text categorization.
https://doi.org/10.3745/JIPS.2008.4.1.017

