• Title/Summary/Keyword: Text Classification

Search Result 715, Processing Time 0.033 seconds

Machine Learning Based Automatic Categorization Model for Text Lines in Invoice Documents

  • Shin, Hyun-Kyung
    • Journal of Korea Multimedia Society
    • /
    • v.13 no.12
    • /
    • pp.1786-1797
    • /
    • 2010
  • Automatic understanding of contents in document image is a very hard problem due to involvement with mathematically challenging problems originated mainly from the over-determined system induced by document segmentation process. In both academic and industrial areas, there have been incessant and various efforts to improve core parts of content retrieval technologies by the means of separating out segmentation related issues using semi-structured document, e.g., invoice,. In this paper we proposed classification models for text lines on invoice document in which text lines were clustered into the five categories in accordance with their contents: purchase order header, invoice header, summary header, surcharge header, purchase items. Our investigation was concentrated on the performance of machine learning based models in aspect of linear-discriminant-analysis (LDA) and non-LDA (logic based). In the group of LDA, na$\"{\i}$ve baysian, k-nearest neighbor, and SVM were used, in the group of non LDA, decision tree, random forest, and boost were used. We described the details of feature vector construction and the selection processes of the model and the parameter including training and validation. We also presented the experimental results of comparison on training/classification error levels for the models employed.

A Study on Improvement of Image Classification Accuracy Using Image-Text Pairs (이미지-텍스트 쌍을 활용한 이미지 분류 정확도 향상에 관한 연구)

  • Mi-Hui Kim;Ju-Hyeok Lee
    • Journal of IKEEE
    • /
    • v.27 no.4
    • /
    • pp.561-566
    • /
    • 2023
  • With the development of deep learning, it is possible to solve various computer non-specialized problems such as image processing. However, most image processing methods use only the visual information of the image to process the image. Text data such as descriptions and annotations related to images may provide additional tactile and visual information that is difficult to obtain from the image itself. In this paper, we intend to improve image classification accuracy through a deep learning model that analyzes images and texts using image-text pairs. The proposed model showed an approximately 11% classification accuracy improvement over the deep learning model using only image information.

Reinforcement Method for Automated Text Classification using Post-processing and Training with Definition Criteria (학습방법개선과 후처리 분석을 이용한 자동문서분류의 성능향상 방법)

  • Choi, Yun-Jeong;Park, Seung-Soo
    • The KIPS Transactions:PartB
    • /
    • v.12B no.7 s.103
    • /
    • pp.811-822
    • /
    • 2005
  • Automated text categorization is to classify free text documents into predefined categories automatically and whose main goals is to reduce considerable manual process required to the task. The researches to improving the text categorization performance(efficiency) in recent years, focused on enhancing existing classification models and algorithms itself, but, whose range had been limited by feature based statistical methodology. In this paper, we propose RTPost system of different style from i.ny traditional method, which takes fault tolerant system approach and data mining strategy. The 2 important parts of RTPost system are reinforcement training and post-processing part. First, the main point of training method deals with the problem of defining category to be classified before selecting training sample documents. And post-processing method deals with the problem of assigning category, not performance of classification algorithms. In experiments, we applied our system to documents getting low classification accuracy which were laid on a decision boundary nearby. Through the experiments, we shows that our system has high accuracy and stability in actual conditions. It wholly did not depend on some variables which are important influence to classification power such as number of training documents, selection problem and performance of classification algorithms. In addition, we can expect self learning effect which decrease the training cost and increase the training power with employing active learning advantage.

Short Text Classification for Job Placement Chatbot by T-EBOW (T-EBOW를 이용한 취업알선 챗봇용 단문 분류 연구)

  • Kim, Jeongrae;Kim, Han-joon;Jeong, Kyoung Hee
    • Journal of Internet Computing and Services
    • /
    • v.20 no.2
    • /
    • pp.93-100
    • /
    • 2019
  • Recently, in various business fields, companies are concentrating on providing chatbot services to various environments by adding artificial intelligence to existing messenger platforms. Organizations in the field of job placement also require chatbot services to improve the quality of employment counseling services and to solve the problem of agent management. A text-based general chatbot classifies input user sentences into learned sentences and provides appropriate answers to users. Recently, user sentences inputted to chatbots are inputted as short texts due to the activation of social network services. Therefore, performance improvement of short text classification can contribute to improvement of chatbot service performance. In this paper, we propose T-EBOW (Translation-Extended Bag Of Words), which is a method to add translation information as well as concept information of existing researches in order to strengthen the short text classification for employment chatbot. The performance evaluation results of the T-EBOW applied to the machine learning classification model are superior to those of the conventional method.

A Hierarchical Text Rating System for Objectionable Documents

  • Jeong, Chi-Yoon;Han, Seung-Wan;Nam, Taek-Yong
    • Journal of Information Processing Systems
    • /
    • v.1 no.1 s.1
    • /
    • pp.22-26
    • /
    • 2005
  • In this paper, we classified the objectionable texts into four rates according to their harmfulness and proposed the hierarchical text rating system for objectionable documents. Since the documents in the same category have similarities in used words, expressions and structure of the document, the text rating system, which uses a single classification model, has low accuracy. To solve this problem, we separate objectionable documents into several subsets by using their properties, and then classify the subsets hierarchically. The proposed system consists of three layers. In each layer, we select features using the chi-square statistics, and then the weight of the features, which is calculated by using the TF-IDF weighting scheme, is used as an input of the non-linear SVM classifier. By means of a hierarchical scheme using the different features and the different number of features in each layer, we can characterize the objectionability of documents more effectively and expect to improve the performance of the rating system. We compared the performance of the proposed system and performance of several text rating systems and experimental results show that the proposed system can archive an excellent classification performance.

Using Ontologies for Semantic Text Mining (시맨틱 텍스트 마이닝을 위한 온톨로지 활용 방안)

  • Yu, Eun-Ji;Kim, Jung-Chul;Lee, Choon-Youl;Kim, Nam-Gyu
    • The Journal of Information Systems
    • /
    • v.21 no.3
    • /
    • pp.137-161
    • /
    • 2012
  • The increasing interest in big data analysis using various data mining techniques indicates that many commercial data mining tools now need to be equipped with fundamental text analysis modules. The most essential prerequisite for accurate analysis of text documents is an understanding of the exact semantics of each term in a document. The main difficulties in understanding the exact semantics of terms are mainly attributable to homonym and synonym problems, which is a traditional problem in the natural language processing field. Some major text mining tools provide a thesaurus to solve these problems, but a thesaurus cannot be used to resolve complex synonym problems. Furthermore, the use of a thesaurus is irrelevant to the issue of homonym problems and hence cannot solve them. In this paper, we propose a semantic text mining methodology that uses ontologies to improve the quality of text mining results by resolving the semantic ambiguity caused by homonym and synonym problems. We evaluate the practical applicability of the proposed methodology by performing a classification analysis to predict customer churn using real transactional data and Q&A articles from the "S" online shopping mall in Korea. The experiments revealed that the prediction model produced by our proposed semantic text mining method outperformed the model produced by traditional text mining in terms of prediction accuracy such as the response, captured response, and lift.

Comparison of term weighting schemes for document classification (문서 분류를 위한 용어 가중치 기법 비교)

  • Jeong, Ho Young;Shin, Sang Min;Choi, Yong-Seok
    • The Korean Journal of Applied Statistics
    • /
    • v.32 no.2
    • /
    • pp.265-276
    • /
    • 2019
  • The document-term frequency matrix is a general data of objects in text mining. In this study, we introduce a traditional term weighting scheme TF-IDF (term frequency-inverse document frequency) which is applied in the document-term frequency matrix and used for text classifications. In addition, we introduce and compare TF-IDF-ICSDF and TF-IGM schemes which are well known recently. This study also provides a method to extract keyword enhancing the quality of text classifications. Based on the keywords extracted, we applied support vector machine for the text classification. In this study, to compare the performance term weighting schemes, we used some performance metrics such as precision, recall, and F1-score. Therefore, we know that TF-IGM scheme provided high performance metrics and was optimal for text classification.

Vocabulary Expansion Technique for Advertisement Classification

  • Jung, Jin-Yong;Lee, Jung-Hyun;Ha, Jong-Woo;Lee, Sang-Keun
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.6 no.5
    • /
    • pp.1373-1387
    • /
    • 2012
  • Contextual advertising is an important revenue source for major service providers on the Web. Ads classification is one of main tasks in contextual advertising, and it is used to retrieve semantically relevant ads with respect to the content of web pages. However, it is difficult for traditional text classification methods to achieve satisfactory performance in ads classification due to scarce term features in ads. In this paper, we propose a novel ads classification method that handles the lack of term features for classifying ads with short text. The proposed method utilizes a vocabulary expansion technique using semantic associations among terms learned from large-scale search query logs. The evaluation results show that our methodology achieves 4.0% ~ 9.7% improvements in terms of the hierarchical f-measure over the baseline classifiers without vocabulary expansion.

Development of e-Mail Classifiers for e-Mail Response Management Systems (전자메일 자동관리 시스템을 위한 전자메일 분류기의 개발)

  • Kim, Kuk-Pyo;Kwon, Young-S.
    • Journal of Information Technology Services
    • /
    • v.2 no.2
    • /
    • pp.87-95
    • /
    • 2003
  • With the increasing proliferation of World Wide Web, electronic mail systems have become very widely used communication tools. Researches on e-mail classification have been very important in that e-mail classification system is a major engine for e-mail response management systems which mine unstructured e-mail messages and automatically categorize them. in this research we develop e-mail classifiers for e-mail Response Management Systems (ERMS) using naive bayesian learning and centroid-based classification. We analyze which method performs better under which conditions, comparing classification accuracies which may depend on the structure, the size of training data set and number of classes, using the different data set of an on-line shopping mall and a credit card company. The developed e-mail classifiers have been successfully implemented in practice. The experimental results show that naive bayesian learning performs better, while centroid-based classification is more robust in terms of classification accuracy.

Modified Version of SVM for Text Categorization

  • Jo, Tae-Ho
    • International Journal of Fuzzy Logic and Intelligent Systems
    • /
    • v.8 no.1
    • /
    • pp.52-60
    • /
    • 2008
  • This research proposes a new strategy where documents are encoded into string vectors for text categorization and modified versions of SVM to be adaptable to string vectors. Traditionally, when the traditional version of SVM is used for pattern classification, raw data should be encoded into numerical vectors. This encoding may be difficult, depending on a given application area of pattern classification. For example, in text categorization, encoding full texts given as raw data into numerical vectors leads to two main problems: huge dimensionality and sparse distribution. In this research, we encode full texts into string vectors, and apply the modified version of SVM adaptable to string vectors for text categorization.