• Title/Summary/Keyword: size of training documents

Document Classification Model Using Web Documents for Balancing Training Corpus Size per Category

  • Park, So-Young;Chang, Juno;Kihl, Taesuk
    • Journal of information and communication convergence engineering
    • /
    • v.11 no.4
    • /
    • pp.268-273
    • /
    • 2013
  • In this paper, we propose a document classification model using Web documents as a part of the training corpus in order to resolve the imbalance of the training corpus size per category. For the purpose of retrieving the Web documents closely related to each category, the proposed document classification model calculates the matching score between word features and each category, and generates a Web search query by combining the higher-ranked word features and the category title. Then, the proposed document classification model sends each combined query to the open application programming interface of the Web search engine, and receives the snippet results retrieved from the Web search engine. Finally, the proposed document classification model adds these snippet results as Web documents to the training corpus. Experimental results show that the method that considers the balance of the training corpus size per category exhibits better performance in some categories with small training sets.
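
The query-generation step described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not define the matching score, so the score used here (a word's count in the category weighted by the share of its occurrences that fall in that category) is an assumption, and the function names `matching_scores` and `build_queries` are hypothetical. The call to the Web search API is omitted because the abstract does not name the API.

```python
from collections import Counter


def matching_scores(docs_by_category, category):
    """Score each word for `category`.

    ASSUMPTION: score = count in category * (count in category / count overall).
    The abstract does not specify the actual matching score.
    """
    in_cat = Counter(
        word for doc in docs_by_category[category] for word in doc.split()
    )
    overall = Counter(
        word
        for docs in docs_by_category.values()
        for doc in docs
        for word in doc.split()
    )
    return {word: in_cat[word] * in_cat[word] / overall[word] for word in in_cat}


def build_queries(docs_by_category, category, top_k=2):
    """Combine the category title with each of the top_k highest-scoring words."""
    scores = matching_scores(docs_by_category, category)
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda word: (-scores[word], word))
    return [f"{category} {word}" for word in ranked[:top_k]]


if __name__ == "__main__":
    corpus = {
        "sports": ["soccer goal match", "soccer league goal"],
        "finance": ["stock market match", "stock bond"],
    }
    for query in build_queries(corpus, "sports"):
        print(query)
```

Each returned string would then be sent to a search engine, and the returned snippets appended to the training documents of the under-represented category.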

Optimization of Number of Training Documents in Text Categorization (문헌범주화에서 학습문헌수 최적화에 관한 연구)

  • Shim, Kyung
    • Journal of the Korean Society for information Management
    • /
    • v.23 no.4 s.62
    • /
    • pp.277-294
    • /
    • 2006
  • This paper examines the level of categorization performance on a real-life collection of article abstracts in science and technology, and tests the optimal number of training documents per category using a kNN classifier. The corpus is built by first choosing categories that hold more than 2,556 documents and then randomly selecting 2,556 documents per category. It is further divided into eight subsets with different numbers of training documents: each subset is randomly selected, with training sets ranging from 20 documents (Tr-20) to 2,000 documents (Tr-2000) per category. The categorization performance of the eight subsets is compared. The average performance of the eight subsets is 30% in $F_1$ measure, which is relatively poor compared to the findings of previous studies. The experimental results suggest that, among the eight subsets, Tr-100 is the optimal size for training a kNN classifier. In addition, the correctness of the subject categories assigned to the training sets is probed by manually reclassifying the training sets, in order to support the above conclusion by establishing a relation between that correctness and categorization performance.
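
The random construction of per-category training subsets described in this abstract might look like the sketch below. It is an illustration under assumptions: the function name `sample_training_subsets` is hypothetical, each subset is drawn independently (the abstract does not say whether the subsets are nested), and only the three sizes named in the abstract (20, 100, 2,000) are used because the other five sizes are not stated.

```python
import random


def sample_training_subsets(docs_by_category, sizes, seed=0):
    """Draw, for every size, that many random documents from each category.

    Returns a dict such as {"Tr-20": {"physics": [...], ...}, ...}.
    ASSUMPTION: subsets are sampled independently of one another.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    subsets = {}
    for size in sizes:
        subsets[f"Tr-{size}"] = {
            category: rng.sample(docs, size)  # sampling without replacement
            for category, docs in docs_by_category.items()
        }
    return subsets


if __name__ == "__main__":
    # Stand-in corpus: 2,556 document ids per category, as in the abstract.
    pool = {
        "physics": list(range(2556)),
        "chemistry": list(range(2556)),
    }
    subsets = sample_training_subsets(pool, sizes=(20, 100, 2000))
    for name, per_category in subsets.items():
        print(name, {cat: len(docs) for cat, docs in per_category.items()})
```

Each subset would then be used to train a separate classifier, and the resulting $F_1$ scores compared across sizes.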

Design and Implementation of Text Classification System based on ETOM+RPost (ETOM+RPost기반의 문서분류시스템의 설계 및 구현)

  • Choi, Yun-Jeong
    • Journal of the Korea Academia-Industrial cooperation Society
    • /
    • v.11 no.2
    • /
    • pp.517-524
    • /
    • 2010
  • Recently, the volume of online texts and textual information has been increasing explosively, and automated classification has great potential for handling data such as news materials and images. Text classification systems are based on supervised learning, which requires laborious work by human experts. The main goal of this paper is to reduce the manual intervention required for the task. The other goal is to increase accuracy. Most documents have highly complex content and are highly similar in their descriptive style, so classification results are not satisfactory. This paper presents the implementation of a classification system based on the ETOM+RPost algorithm and the classification process using SPAM data. In experiments, we verified our system with right-training documents and wrong-training documents. The experimental results show that our system has high accuracy and stability in all situations, with a 16% improvement in accuracy.

An implementation of the mixed type character recognition system using combNET (CombNET 신경망을 이용한 혼용 문서 인식 시스템의 구현)

  • 최재혁;손영우;남궁재찬
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.21 no.12
    • /
    • pp.3265-3276
    • /
    • 1996
  • Studies of document recognition have focused mainly on Korean documents. However, most documents are composed of Korean and other characters. In this paper, we therefore propose a document recognition system that can recognize multi-size, multi-font, and mixed-type characters. We use a large-scale network model, "CombNET", which consists of a four-layer network with a comb structure, and we propose a recognition method that can recognize characters without discriminating the character type. The first layer is a Kohonen SOFM network that quantizes the input feature vector space into several sub-spaces, and layers 2-4 are BP network modules that classify the input data in each sub-space into specified categories. Experimental results demonstrate the usefulness of this approach, with a recognition rate of 95.6% for the training data. For mixed-type character documents, we obtained a recognition rate of 92.6% and a recognition speed of 10.3 characters per second.

A Study of Research on Methods of Automated Biomedical Document Classification using Topic Modeling and Deep Learning (토픽모델링과 딥 러닝을 활용한 생의학 문헌 자동 분류 기법 연구)

  • Yuk, JeeHee;Song, Min
    • Journal of the Korean Society for information Management
    • /
    • v.35 no.2
    • /
    • pp.63-88
    • /
    • 2018
  • This research evaluated differences in classification performance across feature selection methods (the LDA topic model and Doc2Vec, which is based on word embedding using deep learning), feature corpus sizes, and classification algorithms. In addition, to find the feature corpus with the highest classification performance, an experiment was conducted using feature corpora composed differently according to location within the document and with adjusted sizes. Finally, the deep learning experiments evaluated the training frequency and specifically considered information for context inference. This study constructed a biomedical document dataset, Disease-35083, which consists of biomedical scholarly documents provided by PMC and categorized by disease category. This research verifies which type and size of feature corpus produces the highest performance, and also suggests feature corpora that can be extended to specific features because they are efficient in training time. Additionally, this research compares deep learning with existing methods and suggests an appropriate method for each classification environment.

A Study on Recognition of Citation Metadata using Bidirectional GRU-CRF Model based on Pre-trained Language Model (사전학습 된 언어 모델 기반의 양방향 게이트 순환 유닛 모델과 조건부 랜덤 필드 모델을 이용한 참고문헌 메타데이터 인식 연구)

  • Ji, Seon-yeong;Choi, Sung-pil
    • Journal of the Korean Society for information Management
    • /
    • v.38 no.1
    • /
    • pp.221-242
    • /
    • 2021
  • This study performed reference metadata recognition using a bidirectional GRU-CRF model based on a pre-trained language model. The experimental data consist of 161,315 references extracted, using rules, from 53,562 academic documents in PDF format collected from 40 journals published in 2018. To construct the experiment set, this study automatically extracted the references from academic literature in PDF format. Through this study, the language model with the highest performance was identified, and additional experiments were conducted on that model to compare recognition performance according to the size of the training set. Finally, the performance for each metadata field was confirmed.

Understanding the Current State of Deep Learning Application to Water-related Disaster Management in Developing Countries

  • Yusuff, Kareem Kola;Shiksa, Bastola;Park, Kidoo;Jung, Younghun
    • Proceedings of the Korea Water Resources Association Conference
    • /
    • 2022.05a
    • /
    • pp.145-145
    • /
    • 2022
  • The limited availability of water resources data in developing countries is a great concern that has hindered the adoption of deep learning (DL) techniques for disaster prevention and mitigation. In contrast, over the last two decades, a sizeable number of DL publications on disaster management have emanated from developed countries with efficient data management systems. To understand the current state of DL adoption for water-related disaster management in developing countries, an extensive bibliometric review coupled with a theory-based analysis of related research documents from 2003 to 2022 is conducted using Web of Science, Scopus, VOSviewer software and the PRISMA model. Results show that four major disasters (pluvial/fluvial flooding, land subsidence, drought and snow avalanche) are the most prevalent. Also, recurrent flash floods and landslides caused by irregular rainfall patterns, abundant freshwater and mountainous terrain made India the only developing country with an impressive DL adoption rate, at 50% of the publication count, thereby setting the pace for other developing countries. Further analysis indicates that economically disadvantaged countries will experience a delay in DL implementation based on their Human Development Index (HDI) because DL implementation is capital-intensive. COVID-19, among other factors, is identified as a driver of DL. Although the Long Short-Term Memory (LSTM) model is the most frequently used, optimal model performance is not limited to a certain model; each DL model performs according to the defined modelling objectives. Furthermore, input data size shows no clear relationship with model performance, while final model deployment for solving disaster problems in real-life scenarios is lacking. Therefore, data augmentation and transfer learning are recommended to solve data management problems. Intensive research, training, innovation, deployment using cheap web-based servers, APIs and nature-based solutions are encouraged to enhance disaster preparedness.

Research on ITB Contract Terms Classification Model for Risk Management in EPC Projects: Deep Learning-Based PLM Ensemble Techniques (EPC 프로젝트의 위험 관리를 위한 ITB 문서 조항 분류 모델 연구: 딥러닝 기반 PLM 앙상블 기법 활용)

  • Hyunsang Lee;Wonseok Lee;Bogeun Jo;Heejun Lee;Sangjin Oh;Sangwoo You;Maru Nam;Hyunsik Lee
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.12 no.11
    • /
    • pp.471-480
    • /
    • 2023
  • Construction order volume in South Korea grew significantly from 91.3 trillion won in public orders in 2013 to a total of 212 trillion won in 2021, particularly in the private sector. As the size of the domestic and overseas markets grew, the scale and complexity of EPC (Engineering, Procurement, Construction) projects increased, and risk management in project management and ITB (Invitation to Bid) documents became a critical issue. The time granted to construction companies in the bidding process following the EPC project award is limited, and reviewing all the risk terms in the ITB document is extremely challenging due to manpower and cost issues. Previous research attempted to categorize the risk terms in EPC contract documents and detect them using AI, but practical use was limited by data-related problems, such as the limited use of labeled data and class imbalance. Therefore, this study aims to develop an AI model that can categorize contract terms in detail based on the FIDIC Yellow 2017 (Federation Internationale Des Ingenieurs-Conseils contract terms) standard, rather than defining and classifying risk terms as in previous research. A multi-text classification function is necessary because the contract terms that need to be reviewed in detail may vary depending on the scale and type of the project. To enhance the performance of the multi-text classification model, we developed an ELECTRA PLM (Pre-trained Language Model) capable of efficiently learning the context of text data from the pre-training stage, and conducted a four-step experiment to validate the performance of the model. As a result, the ensemble of the self-developed ITB-ELECTRA model and Legal-BERT achieved the best performance, with a weighted average F1-Score of 76% in the classification of 57 contract terms.
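
For readers unfamiliar with the metric reported in this abstract, a weighted average F1-Score is commonly computed as the per-class F1 averaged with weights proportional to each class's number of true instances. The sketch below illustrates that standard definition on toy labels; it assumes the paper uses this usual definition, and the function name `weighted_f1` and the labels are hypothetical, not taken from the paper.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    support = Counter(y_true)  # number of true instances per class
    total = len(y_true)
    score = 0.0
    for label, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0  # F1 = 2TP / (2TP + FP + FN)
        score += (count / total) * f1
    return score


if __name__ == "__main__":
    # Toy clause labels, imbalanced on purpose.
    y_true = ["payment", "payment", "payment", "liability"]
    y_pred = ["payment", "payment", "liability", "liability"]
    print(round(weighted_f1(y_true, y_pred), 4))
```

Weighting by support means frequent clause types dominate the score, which is worth keeping in mind when classes are imbalanced.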

A Study of Pre-trained Language Models for Korean Language Generation (한국어 자연어생성에 적합한 사전훈련 언어모델 특성 연구)

  • Song, Minchae;Shin, Kyung-shik
    • Journal of Intelligence and Information Systems
    • /
    • v.28 no.4
    • /
    • pp.309-328
    • /
    • 2022
  • This study empirically analyzed Korean pre-trained language models (PLMs) designed for natural language generation. The performance of two PLMs, BART and GPT, on the task of abstractive text summarization was compared. To investigate how performance depends on the characteristics of the inference data, ten different document types, containing six types of informational content and creation content, were considered. It was found that BART (which can both generate and understand natural language) performed better than GPT (which can only generate). Upon more detailed examination of the effect of inference data characteristics, the performance of GPT was found to be proportional to the length of the input text. However, even for the longest documents (with optimal GPT performance), BART still outperformed GPT, suggesting that the greatest influence on downstream performance is not the size of the training data or the number of PLM parameters but the structural suitability of the PLM for the applied downstream task. The performance of the different PLMs was also compared by analyzing part-of-speech (POS) shares. BART's performance was inversely related to the proportion of prefixes, adjectives, adverbs and verbs but positively related to that of nouns. This result emphasizes the importance of taking the inference data's characteristics into account when fine-tuning a PLM for its intended downstream task.

Effect of Tongue Exercise on Stroke Patients With Dysphagia : A Systematic Review (혀 운동(tongue exercise)이 연하장애를 가진 뇌졸중 환자에게 미치는 효과 : 체계적 고찰)

  • Son, Yeong Soo;Choi, Yoo Im
    • Therapeutic Science for Rehabilitation
    • /
    • v.11 no.3
    • /
    • pp.7-22
    • /
    • 2022
  • Objectives: This study was a systematic review of tongue movements in stroke patients with dysphagia. It aimed to provide a basis for verifying the effects of tongue movement and identifying trends in tongue movement interventions. Methods: A systematic review was conducted using the Preferred Reporting Items for Systematic Reviews and Meta-Analyses checklist and flow chart. The PubMed, MEDLINE, CINAHL, RISS, and e-articles databases were searched. A total of six documents were investigated, and the PEDro scale was used to evaluate the quality of the papers. Results: Three intervention methods were included in the six papers analyzed. Regarding the type of tongue exercise, three studies applied TPRT (Tongue to Palate Resistance Training) and two applied TSAT (Tongue Strength and Accuracy Training), both mediated through the IOPI (Iowa Oral Performance Instrument), and only one study applied TSE (Tongue Stretching Exercise). Each intervention implemented in the literature was confirmed to be effective. However, generalizing the findings is difficult because of the small sample size. Further, no significant difference was found between the experimental and control groups. Conclusions: This study can help occupational therapists provide efficient swallowing rehabilitation treatment by applying tongue exercises to stroke patients with dysphagia. More research should be conducted to determine the effects of tongue exercise.