• 제목/요약/키워드: Text-based classification

검색결과 452건 처리시간 0.023초

Intention Classification for Retrieval of Health Questions

  • Liu, Rey-Long
    • International Journal of Knowledge Content Development & Technology
    • /
    • 제7권1호
    • /
    • pp.101-120
    • /
    • 2017
  • Healthcare professionals have edited many health questions (HQs) and their answers for healthcare consumers on the Internet. The HQs provide both readable and reliable health information, and hence retrieval of those HQs that are relevant to a given question is essential for health education and promotion through the Internet. However, retrieval of relevant HQs needs to be based on the recognition of the intention of each HQ, which is difficult to be done by predefining syntactic and semantic rules. We thus model the intention recognition problem as a text classification problem, and develop two techniques to improve a learning-based text classifier for the problem. The two techniques improve the classifier by location-based and area-based feature weightings, respectively. Experimental results show that, the two techniques can work together to significantly improve a Support Vector Machine classifier in both the recognition of HQ intentions and the retrieval of relevant HQs.

An Improved Text Classification Method for Sentiment Classification

  • Wang, Guangxing;Shin, Seong Yoon
    • Journal of information and communication convergence engineering
    • /
    • 제17권1호
    • /
    • pp.41-48
    • /
    • 2019
  • In recent years, sentiment analysis research has become popular. The research results of sentiment analysis have achieved remarkable results in practical applications, such as in Amazon's book recommendation system and the North American movie box office evaluation system. Analyzing big data based on user preferences and evaluations and recommending hot-selling books and hot-rated movies to users in a targeted manner greatly improve book sales and attendance rate in movies [1, 2]. However, traditional machine learning-based sentiment analysis methods such as the Classification and Regression Tree (CART), Support Vector Machine (SVM), and k-nearest neighbor classification (kNN) had performed poorly in accuracy. In this paper, an improved kNN classification method is proposed. Through the improved method and normalizing of data, the purpose of improving accuracy is achieved. Subsequently, the three classification algorithms and the improved algorithm were compared based on experimental data. Experiments show that the improved method performs best in the kNN classification method, with an accuracy rate of 11.5% and a precision rate of 20.3%.

빅데이터 환경에서 텍스트마이닝 기법을 활용한 공공문서 분류체계의 적용사례 연구 (Case Study on Public Document Classification System That Utilizes Text-Mining Technique in BigData Environment)

  • 심장섭;이강욱
    • 한국정보통신학회:학술대회논문집
    • /
    • 한국정보통신학회 2015년도 추계학술대회
    • /
    • pp.1085-1089
    • /
    • 2015
  • 과거의 텍스트마이닝기법은 텍스트 자체의 복잡성과 텍스트 내에 산재한 변수의 자유도 때문에 분석 알고리즘을 구현하는데 어려움이 있었다. 의미 있는 정보를 얻기 위하여 어렵게 알고리즘을 구현했다고 하더라도, 기계적으로 텍스트 분석에 소요되는 시간이 텍스트를 사람이 직접 읽어 분석 하는 것보다 많은 시간이 요구 되었다. 그러나 최근 하드웨어와 분석 알고리즘의 발전과 함께 빅데이터라는 기술이 등장하였으며, 앞에서 설명한 제약사항을 극복할 수 있게 되었고, 텍스트마이닝을 통한 분석이 현실세계에서 그 가치를 충분히 인정받고 있다. 만약, 텍스트의 탐색 수준에서 벗어나 마이닝을 통하여 분석이 가능하다면 텍스트 분석에 소비되는 인적, 물적 자원의 비용을 절감할 수 있기 때문에 공공분야에서 절실히 요구되는 창조적인 일에 더 많은 자원을 효과적으로 활용할 수 있을 것이다. 이에 본 논문에서는 인적 자원이 수작업으로 하는 공공분야 문서 분류의 결과값과 빅데이터 환경에서 텍스트마이닝기반의 문서내 단어 빈도수(TF-IDF)와 문서간 코사인 유사도(Cosine Similarity)를 활용한 공공분야 문서분류의 결과값을 비교하여 평가한다.

  • PDF

Text Classification with Heterogeneous Data Using Multiple Self-Training Classifiers

  • William Xiu Shun Wong;Donghoon Lee;Namgyu Kim
    • Asia pacific journal of information systems
    • /
    • 제29권4호
    • /
    • pp.789-816
    • /
    • 2019
  • Text classification is a challenging task, especially when dealing with a huge amount of text data. The performance of a classification model can be varied depending on what type of words contained in the document corpus and what type of features generated for classification. Aside from proposing a new modified version of the existing algorithm or creating a new algorithm, we attempt to modify the use of data. The classifier performance is usually affected by the quality of learning data as the classifier is built based on these training data. We assume that the data from different domains might have different characteristics of noise, which can be utilized in the process of learning the classifier. Therefore, we attempt to enhance the robustness of the classifier by injecting the heterogeneous data artificially into the learning process in order to improve the classification accuracy. Semi-supervised approach was applied for utilizing the heterogeneous data in the process of learning the document classifier. However, the performance of document classifier might be degraded by the unlabeled data. Therefore, we further proposed an algorithm to extract only the documents that contribute to the accuracy improvement of the classifier.

Biomedical Ontologies and Text Mining for Biomedicine and Healthcare: A Survey

  • Yoo, Ill-Hoi;Song, Min
    • Journal of Computing Science and Engineering
    • /
    • 제2권2호
    • /
    • pp.109-136
    • /
    • 2008
  • In this survey paper, we discuss biomedical ontologies and major text mining techniques applied to biomedicine and healthcare. Biomedical ontologies such as UMLS are currently being adopted in text mining approaches because they provide domain knowledge for text mining approaches. In addition, biomedical ontologies enable us to resolve many linguistic problems when text mining approaches handle biomedical literature. As the first example of text mining, document clustering is surveyed. Because a document set is normally multiple topic, text mining approaches use document clustering as a preprocessing step to group similar documents. Additionally, document clustering is able to inform the biomedical literature searches required for the practice of evidence-based medicine. We introduce Swanson's UnDiscovered Public Knowledge (UDPK) model to generate biomedical hypotheses from biomedical literature such as MEDLINE by discovering novel connections among logically-related biomedical concepts. Another important area of text mining is document classification. Document classification is a valuable tool for biomedical tasks that involve large amounts of text. We survey well-known classification techniques in biomedicine. As the last example of text mining in biomedicine and healthcare, we survey information extraction. Information extraction is the process of scanning text for information relevant to some interest, including extracting entities, relations, and events. We also address techniques and issues of evaluating text mining applications in biomedicine and healthcare.

Machine Learning Based Automatic Categorization Model for Text Lines in Invoice Documents

  • Shin, Hyun-Kyung
    • 한국멀티미디어학회논문지
    • /
    • 제13권12호
    • /
    • pp.1786-1797
    • /
    • 2010
  • Automatic understanding of contents in document image is a very hard problem due to involvement with mathematically challenging problems originated mainly from the over-determined system induced by document segmentation process. In both academic and industrial areas, there have been incessant and various efforts to improve core parts of content retrieval technologies by the means of separating out segmentation related issues using semi-structured document, e.g., invoice,. In this paper we proposed classification models for text lines on invoice document in which text lines were clustered into the five categories in accordance with their contents: purchase order header, invoice header, summary header, surcharge header, purchase items. Our investigation was concentrated on the performance of machine learning based models in aspect of linear-discriminant-analysis (LDA) and non-LDA (logic based). In the group of LDA, na$\"{\i}$ve baysian, k-nearest neighbor, and SVM were used, in the group of non LDA, decision tree, random forest, and boost were used. We described the details of feature vector construction and the selection processes of the model and the parameter including training and validation. We also presented the experimental results of comparison on training/classification error levels for the models employed.

SNS 특징정보를 활용한 마르코프 논리 네트워크 기반의 단문 텍스트 분류 방법 (A Method for Short Text Classification using SNS Feature Information based on Markov Logic Networks)

  • 이은지;김판구
    • 한국멀티미디어학회논문지
    • /
    • 제20권7호
    • /
    • pp.1065-1072
    • /
    • 2017
  • As smart devices and social network services (SNSs) become increasingly pervasive, individuals produce large amounts of data in real time. Accordingly, studies on unstructured data analysis are actively being conducted to solve the resultant problem of information overload and to facilitate effective data processing. Many such studies are conducted for filtering inappropriate information. In this paper, a feature-weighting method considering SNS-message features is proposed for the classification of short text messages generated on SNSs, using Markov logic networks for category inference. The performance of the proposed method is verified through a comparison with an existing frequency-based classification methods.

Improved Spam Filter via Handling of Text Embedded Image E-mail

  • Youn, Seongwook;Cho, Hyun-Chong
    • Journal of Electrical Engineering and Technology
    • /
    • 제10권1호
    • /
    • pp.401-407
    • /
    • 2015
  • The increase of image spam, a kind of spam in which the text message is embedded into attached image to defeat spam filtering technique, is a major problem of the current e-mail system. For nearly a decade, content based filtering using text classification or machine learning has been a major trend of anti-spam filtering system. Recently, spammers try to defeat anti-spam filter by many techniques. Text embedding into attached image is one of them. We proposed an ontology spam filters. However, the proposed system handles only text e-mail and the percentage of attached images is increasing sharply. The contribution of the paper is that we add image e-mail handling capability into the anti-spam filtering system keeping the advantages of the previous text based spam e-mail filtering system. Also, the proposed system gives a low false negative value, which means that user's valuable e-mail is rarely regarded as a spam e-mail.

중국어 텍스트 분류 작업의 개선을 위한 WWMBERT 기반 방식 (A WWMBERT-based Method for Improving Chinese Text Classification Task)

  • 왕흠원;조인휘
    • 한국정보처리학회:학술대회논문집
    • /
    • 한국정보처리학회 2021년도 춘계학술발표대회
    • /
    • pp.408-410
    • /
    • 2021
  • In the NLP field, the pre-training model BERT launched by the Google team in 2018 has shown amazing results in various tasks in the NLP field. Subsequently, many variant models have been derived based on the original BERT, such as RoBERTa, ERNIEBERT and so on. In this paper, the WWMBERT (Whole Word Masking BERT) model suitable for Chinese text tasks was used as the baseline model of our experiment. The experiment is mainly for "Text-level Chinese text classification tasks" are improved, which mainly combines Tapt (Task-Adaptive Pretraining) and "Multi-Sample Dropout method" to improve the model, and compare the experimental results, experimental data sets and model scoring standards Both are consistent with the official WWMBERT model using Accuracy as the scoring standard. The official WWMBERT model uses the maximum and average values of multiple experimental results as the experimental scores. The development set was 97.70% (97.50%) on the "text-level Chinese text classification task". and 97.70% (97.50%) of the test set. After comparing the results of the experiments in this paper, the development set increased by 0.35% (0.5%) and the test set increased by 0.31% (0.48%). The original baseline model has been significantly improved.

Research on Chinese Microblog Sentiment Classification Based on TextCNN-BiLSTM Model

  • Haiqin Tang;Ruirui Zhang
    • Journal of Information Processing Systems
    • /
    • 제19권6호
    • /
    • pp.842-857
    • /
    • 2023
  • Currently, most sentiment classification models on microblogging platforms analyze sentence parts of speech and emoticons without comprehending users' emotional inclinations and grasping moral nuances. This study proposes a hybrid sentiment analysis model. Given the distinct nature of microblog comments, the model employs a combined stop-word list and word2vec for word vectorization. To mitigate local information loss, the TextCNN model, devoid of pooling layers, is employed for local feature extraction, while BiLSTM is utilized for contextual feature extraction in deep learning. Subsequently, microblog comment sentiments are categorized using a classification layer. Given the binary classification task at the output layer and the numerous hidden layers within BiLSTM, the Tanh activation function is adopted in this model. Experimental findings demonstrate that the enhanced TextCNN-BiLSTM model attains a precision of 94.75%. This represents a 1.21%, 1.25%, and 1.25% enhancement in precision, recall, and F1 values, respectively, in comparison to the individual deep learning models TextCNN. Furthermore, it outperforms BiLSTM by 0.78%, 0.9%, and 0.9% in precision, recall, and F1 values.