• Title/Summary/Keyword: Semi-Supervised learning


Analysis of Extraction Performance according to the Expanding of Applied Character in Hangul Stroke Element Extraction (한글 획요소 추출 학습에서 적용 글자의 확장에 따른 추출 성능 분석)

  • Jeon, Ja-Yeon;Lim, Soon-Bum
    • Journal of Korea Multimedia Society / v.23 no.11 / pp.1361-1371 / 2020
  • Fonts have developed as a visual element, and their influence has grown rapidly around the world. Research on font automation has been conducted mainly for English, because Hangul characters are composed by combining components and their structure is complicated. A previous study addressed this problem by automatically extracting the stroke elements of characters with component-wise object detection. However, that study focused only on similarity: it was tested on various print-style fonts but not on other characters. In order to extract the stroke elements of all characters and fonts, we analyze extraction performance according to how the set of applied characters is expanded in Hangul stroke element extraction training. The results were high overall. In particular, for font expansion, the extraction success rate was high regardless of whether the font had been included in training. For character expansion, the extraction success rate of trained characters was slightly higher than that of untrained characters. In conclusion, to build a complete Hangul stroke element extraction model, we plan to introduce semi-supervised learning to increase the amount of training data and strengthen the model.
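
The closing plan above, growing the training set with semi-supervised learning, is commonly realized with pseudo-labeling on an object detector. The sketch below illustrates that general idea only; the detector interface, the 0.9 confidence threshold, and the number of rounds are assumptions, not the authors' method.

```python
# Hypothetical pseudo-labeling loop for expanding stroke-element training data.
# The detector object (.train / .predict) and the threshold are illustrative assumptions.
def pseudo_label_expansion(detector, labeled, unlabeled_images, threshold=0.9, rounds=3):
    """labeled: list of (image, stroke_element_boxes); unlabeled_images: list of images."""
    train_set = list(labeled)
    for _ in range(rounds):
        detector.train(train_set)
        remaining = []
        for image in unlabeled_images:
            boxes = detector.predict(image)              # stroke-element boxes with scores
            confident = [b for b in boxes if b.score >= threshold]
            if confident:
                train_set.append((image, confident))     # treat confident boxes as labels
            else:
                remaining.append(image)                  # revisit in the next round
        unlabeled_images = remaining
    detector.train(train_set)
    return detector
```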

Tri-training algorithm based on cross entropy and K-nearest neighbors for network intrusion detection

  • Zhao, Jia;Li, Song;Wu, Runxiu;Zhang, Yiying;Zhang, Bo;Han, Longzhe
    • KSII Transactions on Internet and Information Systems (TIIS) / v.16 no.12 / pp.3889-3903 / 2022
  • To address the low detection accuracy caused by the training noise that mislabeling introduces when Tri-training is used for network intrusion detection (NID), we propose a Tri-training algorithm based on cross entropy and K-nearest neighbors (TCK). The proposed algorithm replaces the classification error rate with cross-entropy to better capture the difference between the practical and predicted distributions of the model and to reduce the prediction bias that mislabeled data passes on to unlabeled data; K-nearest neighbors is then used to remove mislabeled samples and reduce their number. To verify the effectiveness of the proposed algorithm, experiments were conducted on 12 UCI datasets and the NSL-KDD network intrusion dataset, using four indexes: accuracy, recall, F-measure and precision. The experimental results show that TCK outperforms the conventional Tri-training algorithm as well as Tri-training variants that use only the cross-entropy or only the K-nearest-neighbor strategy.
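
For orientation, the sketch below shows one round of plain Tri-training with a KNN filter on the pseudo-labels, which is the general idea the abstract builds on. It omits the paper's cross-entropy criterion, and the base classifiers, k=5, and the peer-agreement rule are assumptions rather than the TCK algorithm itself.

```python
# One illustrative Tri-training round with KNN filtering of pseudo-labels (not TCK itself).
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

def tri_training_round(X_lab, y_lab, X_unlab, k=5):
    clfs = [DecisionTreeClassifier(), GaussianNB(), LogisticRegression(max_iter=1000)]
    for clf in clfs:
        clf.fit(X_lab, y_lab)

    for i, target in enumerate(clfs):
        peers = [c for j, c in enumerate(clfs) if j != i]
        p1, p2 = peers[0].predict(X_unlab), peers[1].predict(X_unlab)
        agree = p1 == p2                                 # both peers give the same label
        if not agree.any():
            continue
        X_cand, y_cand = X_unlab[agree], p1[agree]

        # KNN filter: drop pseudo-labeled samples whose labeled neighbors disagree,
        # i.e., samples that are likely mislabeled.
        knn = KNeighborsClassifier(n_neighbors=k).fit(X_lab, y_lab)
        keep = knn.predict(X_cand) == y_cand

        target.fit(np.vstack([X_lab, X_cand[keep]]),
                   np.concatenate([y_lab, y_cand[keep]]))
    return clfs
```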

Semi-supervised GPT2 for News Article Recommendation with Curriculum Learning (준 지도 학습과 커리큘럼 학습을 이용한 유사 기사 추천 모델)

  • Seo, Jaehyung;Oh, Dongsuk;Eo, Sugyeong;Park, Sungjin;Lim, Heuiseok
    • Annual Conference on Human and Language Technology / 2020.10a / pp.495-500 / 2020
  • News articles do not necessarily deliver information from an objective, broad perspective. It is therefore undesirable to recommend news articles selectively on the basis of personal interests or private information, as conventional recommender systems do. This paper presents a similarity-based article recommendation model that helps readers judge similar events and people from diverse perspectives as objectively as possible. The GPT2 [1] language model is used to measure the similarity between long documents. In this process, the weaknesses of GPT2 [1] as a unidirectional decoder model are mitigated through additional training, and the BM25 [2] function is used for storage efficiency and key-paragraph extraction. In addition, semi-supervised learning [3] is applied so that the model also performs self-training on recent news articles that have no similarity labels, and a curriculum learning [4] scheme divided into three stages by sentence length is adopted so that long paragraphs can be learned effectively as well.
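
The combination of a three-stage, length-based curriculum with self-training on unlabeled articles can be pictured as below. This is a rough sketch under assumptions: the similarity_model interface, the length cut-offs, and the 0.9 confidence margin are illustrative, not the authors' GPT2/BM25 pipeline.

```python
# Sketch of a length-based three-stage curriculum followed by one self-training pass.
def curriculum_self_training(similarity_model, labeled_pairs, unlabeled_pairs,
                             cutoffs=(20, 60), confidence=0.9):
    """labeled_pairs: list of (text_a, text_b, similarity_label); unlabeled_pairs: (text_a, text_b)."""
    def stage(pair):
        length = max(len(pair[0].split()), len(pair[1].split()))
        if length <= cutoffs[0]:
            return 0                      # shortest pairs first
        return 1 if length <= cutoffs[1] else 2

    # Curriculum: train on progressively longer pairs in three stages.
    for s in range(3):
        similarity_model.fit([p for p in labeled_pairs if stage(p) == s])

    # Self-training: pseudo-label unlabeled article pairs the model is confident about.
    pseudo = []
    for a, b in unlabeled_pairs:
        score = similarity_model.predict_score(a, b)      # assumed to be in [0, 1]
        if score >= confidence or score <= 1 - confidence:
            pseudo.append((a, b, float(score >= confidence)))
    similarity_model.fit(labeled_pairs + pseudo)
    return similarity_model
```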


A Co-training Method based on Classification Using Unlabeled Data (비분류표시 데이타를 이용하는 분류 기반 Co-training 방법)

  • 윤혜성;이상호;박승수;용환승;김주한
    • Journal of KIISE:Software and Applications / v.31 no.8 / pp.991-998 / 2004
  • In many practical learning problems, including the bioinformatics area, there is a small amount of labeled data along with a large pool of unlabeled data. Labeled examples are fairly expensive to obtain because they require human effort; in contrast, unlabeled examples can be gathered inexpensively without an expert. A common method of using unlabeled data for data classification and analysis is co-training. This method uses a small set of labeled examples to learn a classifier in each of two views. Each classifier is then applied to all unlabeled examples, and co-training selects the examples on which each classifier makes the most confident predictions. After several iterations, new classifiers are learned from the enlarged training data and the number of labeled examples increases. In this paper, we propose a new co-training strategy that uses unlabeled data, and we evaluate the method with two classifiers and two experimental datasets: WebKB and BIND XML data. Our experiments show that the proposed co-training technique effectively improves classification accuracy when the number of labeled examples is very small.
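
The standard two-view co-training loop the abstract describes looks roughly like the sketch below; the Naive Bayes base learners, the per-class growth size, and the iteration count are assumptions, and the paper's own selection strategy differs in detail.

```python
# Illustrative two-view co-training loop. X_view1, X_view2, y are numpy arrays; y holds
# true labels at labeled positions and placeholder values elsewhere, which get
# overwritten with pseudo-labels as the loop runs.
import numpy as np
from sklearn.naive_bayes import GaussianNB

def co_training(X_view1, X_view2, y, labeled_idx, unlabeled_idx, n_iter=10, per_class=2):
    c1, c2 = GaussianNB(), GaussianNB()
    labeled, unlabeled = list(labeled_idx), list(unlabeled_idx)
    for _ in range(n_iter):
        c1.fit(X_view1[labeled], y[labeled])
        c2.fit(X_view2[labeled], y[labeled])
        for clf, X in ((c1, X_view1), (c2, X_view2)):
            if not unlabeled:
                return c1, c2
            proba = clf.predict_proba(X[unlabeled])
            for cls in range(proba.shape[1]):
                # move the most confidently predicted examples of this class
                for t in np.argsort(proba[:, cls])[-per_class:]:
                    idx = unlabeled[t]
                    y[idx] = cls
                    if idx not in labeled:
                        labeled.append(idx)
            unlabeled = [i for i in unlabeled if i not in labeled]
    return c1, c2
```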

Semi-supervised learning for sentiment analysis in mass social media (대용량 소셜 미디어 감성분석을 위한 반감독 학습 기법)

  • Hong, Sola;Chung, Yeounoh;Lee, Jee-Hyong
    • Journal of the Korean Institute of Intelligent Systems / v.24 no.5 / pp.482-488 / 2014
  • This paper aims to analyze users' emotions automatically by analyzing Twitter, a representative social network service (SNS). To build sentiment analysis models with machine learning techniques, sentiment labels that represent positive/negative emotions are required, but obtaining sentiment labels for tweets is very expensive. We therefore propose a sentiment analysis model that uses the self-training technique in order to utilize "data without sentiment labels" as well as "data with sentiment labels". In self-training, the labels of "data without sentiment labels" are determined by a model learned from "data with sentiment labels", and the model is then updated using both the originally labeled data and the newly labeled data; this gradually improves sentiment analysis performance. However, self-training has the problem that misclassifications of unlabeled data in an early stage affect model updating throughout the whole learning process, because the labels assigned to unlabeled data never change once they are determined. The labels of "data without sentiment labels" therefore need to be determined carefully. To achieve high performance with self-training, we propose three policies for updating "data with sentiment labels" and conduct a comparative analysis. The first policy selects, among the newly labeled data, only the data whose confidence is higher than a given threshold. The second policy chooses the same number of positive and negative examples from the newly labeled data in order to avoid the imbalanced-class learning problem. The third policy limits the newly labeled data to a given maximum number in order to avoid updating with a large amount of data at once and to keep the model updates gradual. Experiments are conducted using the Stanford dataset, with the data classified into positive and negative. As a result, the learned model achieves higher performance than models learned only from "data with sentiment labels" and than self-training with a regular model update policy.
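
The three update policies lend themselves to a direct sketch, shown below for a single self-training round; the logistic-regression classifier, the 0.8 threshold, and the cap of 200 examples are illustrative assumptions rather than the paper's settings.

```python
# One self-training round applying the three pseudo-label selection policies.
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_training_round(X_lab, y_lab, X_unlab, threshold=0.8, max_new=200):
    clf = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
    proba = clf.predict_proba(X_unlab)
    pred, conf = proba.argmax(axis=1), proba.max(axis=1)

    # Policy 1: keep only tweets labeled with confidence above the threshold.
    keep = np.where(conf >= threshold)[0]

    # Policy 2: balance positive and negative pseudo-labels to avoid class imbalance.
    pos, neg = keep[pred[keep] == 1], keep[pred[keep] == 0]
    # Policy 3: cap how many new examples are added per round (gradual model updates).
    per_class = min(len(pos), len(neg), max_new // 2)
    pos = pos[np.argsort(-conf[pos])[:per_class]]        # most confident first
    neg = neg[np.argsort(-conf[neg])[:per_class]]
    chosen = np.concatenate([pos, neg])

    X_aug = np.vstack([X_lab, X_unlab[chosen]])
    y_aug = np.concatenate([y_lab, pred[chosen]])
    return LogisticRegression(max_iter=1000).fit(X_aug, y_aug), chosen
```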

Improving Human Activity Recognition Model with Limited Labeled Data using Multitask Semi-Supervised Learning (제한된 라벨 데이터 상에서 다중-태스크 반 지도학습을 사용한 동작 인지 모델의 성능 향상)

  • Prabono, Aria Ghora;Yahya, Bernardo Nugroho;Lee, Seok-Lyong
    • Database Research / v.34 no.3 / pp.137-147 / 2018
  • A key to a well-performing human activity recognition (HAR) system based on machine learning is the availability of a substantial amount of labeled data, but collecting sufficient labeled data is an expensive and time-consuming task. To build a HAR system in a new environment (the target domain) with very limited labeled data, it is unfavorable to naively reuse the data or the trained classifier from an existing environment (the source domain) as-is, because of the domain difference. While traditional machine learning approaches cannot address such a distribution mismatch, transfer learning leverages knowledge from well-established source domains to help build an accurate classifier in the target domain. In this work, we propose a transfer learning approach that creates an accurate HAR classifier from very limited data through a multitask neural network. Minimizing the classifier loss functions for the source and the target domain is treated as two different tasks, and knowledge transfer is performed by minimizing both loss functions simultaneously with a single neural network model. Furthermore, we use the unlabeled data in an unsupervised manner to help train the model. The experimental results show that the proposed approach consistently outperforms existing approaches.
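
A shared encoder with one classification head per domain, trained on a joint loss, is one way to picture the multitask transfer described above; the sketch below uses that layout in PyTorch, and the layer sizes, the reconstruction term for unlabeled data, and the weighting alpha are assumptions, not the authors' architecture.

```python
# Shared-encoder multitask sketch: one network, two classification heads, joint loss.
import torch
import torch.nn as nn

class MultitaskHAR(nn.Module):
    def __init__(self, in_dim, n_src_classes, n_tgt_classes, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, hidden), nn.ReLU())
        self.src_head = nn.Linear(hidden, n_src_classes)   # task 1: source domain
        self.tgt_head = nn.Linear(hidden, n_tgt_classes)   # task 2: target domain
        self.decoder = nn.Linear(hidden, in_dim)           # unsupervised reconstruction

    def forward(self, x):
        z = self.encoder(x)
        return self.src_head(z), self.tgt_head(z), self.decoder(z)

def joint_loss(model, x_src, y_src, x_tgt, y_tgt, x_unlab, alpha=0.1):
    ce, mse = nn.CrossEntropyLoss(), nn.MSELoss()
    src_logits, _, _ = model(x_src)
    _, tgt_logits, _ = model(x_tgt)
    _, _, recon = model(x_unlab)
    # Both task losses flow through the same encoder (knowledge transfer); the
    # reconstruction term lets unlabeled target data shape the representation.
    return ce(src_logits, y_src) + ce(tgt_logits, y_tgt) + alpha * mse(recon, x_unlab)
```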

Vision-Based Vehicle Detection and Tracking Using Online Learning (온라인 학습을 이용한 비전 기반의 차량 검출 및 추적)

  • Gil, Sung-Ho;Kim, Gyeong-Hwan
    • The Journal of Korean Institute of Communications and Information Sciences / v.39A no.1 / pp.1-11 / 2014
  • In this paper we propose a vehicle detection and tracking system that can learn appearance changes of the tracked vehicles online. The proposed system uses feature-based tracking to estimate the motion of newly detected vehicles between consecutive frames rapidly and robustly. At the same time, the system trains an online vehicle detector for the tracked vehicles; if the tracker fails, it is re-initialized by a detection from the online detector. An improved update rule for the vehicle appearance model is presented to increase the tracking performance and the speed of the proposed system. The performance of the system is evaluated on a dataset acquired in various driving environments. In particular, the experimental results show that vehicle tracking performance is significantly improved under difficult conditions such as entering a tunnel or driving in rain.
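
The interplay between the fast feature-based tracker and the online detector can be summarized as the loop below; the tracker and detector interfaces are assumptions used for illustration, not the paper's implementation.

```python
# Sketch of the track-then-redetect loop: the tracker runs every frame, the online
# detector is updated with the tracker's output, and the detector re-initializes the
# tracker when tracking fails.
def track_with_online_learning(frames, tracker, detector, initial_box):
    box = initial_box
    for frame in frames:
        ok, box = tracker.update(frame)                   # feature-based motion estimate
        if ok:
            detector.update(frame, box, positive=True)    # learn the current appearance
        else:
            candidates = detector.detect(frame)           # fall back to the online detector
            if candidates:
                box = candidates[0]
                tracker.init(frame, box)                  # re-initialize the tracker
        yield box
```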

An Emerging Technology Trend Identifier Based on the Citation and the Change of Academic and Industrial Popularity (학계와 산업계의 정보 대중성 변동과 인용 정보에 기반한 최신 기술 동향 식별 시스템)

  • Kim, Seonho;Lee, Junkyu;Rasheed, Waqas;Yeo, Woondong
    • Journal of Korea Technology Innovation Society / v.14 no.spc / pp.1171-1186 / 2011
  • Identifying emerging technology trends is crucial for decision makers of nations and organizations in order to use limited resources, such as time and money, efficiently. Many researchers have proposed emerging trend detection systems based on a popularity analysis of documents, but this approach still needs to be improved. In this paper, an emerging trend detection classifier is proposed which uses both academic and industrial data, SCOPUS and PATSTAT. Unlike most previous research, our emerging technology trend classifier utilizes supervised, semi-automatic machine learning techniques to improve the precision of the results. In addition, the citation information within the SCOPUS data is analyzed to identify early signals of emerging technology trends.


An Experimental Study on AutoEncoder to Detect Botnet Traffic Using NetFlow-Timewindow Scheme: Revisited (넷플로우-타임윈도우 기반 봇넷 검출을 위한 오토엔코더 실험적 재고찰)

  • Koohong Kang
    • Journal of the Korea Institute of Information Security & Cryptology / v.33 no.4 / pp.687-697 / 2023
  • Botnets, whose attack patterns are becoming more sophisticated and diverse, are recognized as one of the most serious cybersecurity threats today. This paper revisits the experimental results of botnet detection using an autoencoder, a semi-supervised deep learning model, on the UGR and CTU-13 datasets. To prepare the input vectors of the autoencoder, we create data points by grouping NetFlow records into sliding windows based on source IP address and aggregating them to form features. In particular, we discover a simple power law: the number of data points having a given flow-degree is proportional to the number of NetFlow records aggregated in them. Moreover, we show that this power law fits the real data very well, with correlation coefficients of 97% or higher. We also show that the power law has an impact on the learning of the autoencoder and, as a result, influences the performance of botnet detection. Furthermore, we evaluate the performance of the autoencoder using the area under the Receiver Operating Characteristic (ROC) curve.
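
As a reference point for the semi-supervised setup the abstract revisits, the sketch below trains an autoencoder on (assumed) benign NetFlow-timewindow feature vectors and flags windows whose reconstruction error exceeds a percentile threshold; the layer sizes, training loop, and 99th-percentile cut-off are assumptions, not the paper's configuration.

```python
# Autoencoder anomaly detection sketch: learn to reconstruct benign feature vectors,
# flag time windows that reconstruct poorly.
import torch
import torch.nn as nn

def train_autoencoder(benign, in_dim, hidden=8, epochs=50, lr=1e-3):
    model = nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU(),
                          nn.Linear(32, hidden), nn.ReLU(),
                          nn.Linear(hidden, 32), nn.ReLU(),
                          nn.Linear(32, in_dim))
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(benign), benign)   # reconstruction loss
        loss.backward()
        opt.step()
    return model

def flag_botnet(model, train_benign, windows, quantile=0.99):
    with torch.no_grad():
        err_train = ((model(train_benign) - train_benign) ** 2).mean(dim=1)
        threshold = torch.quantile(err_train, quantile)   # threshold from benign errors
        err = ((model(windows) - windows) ** 2).mean(dim=1)
    return err > threshold   # True = suspected botnet time window
```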

Tokamak plasma disruption precursor onset time study based on semi-supervised anomaly detection

  • X.K. Ai;W. Zheng;M. Zhang;D.L. Chen;C.S. Shen;B.H. Guo;B.J. Xiao;Y. Zhong;N.C. Wang;Z.J. Yang;Z.P. Chen;Z.Y. Chen;Y.H. Ding;Y. Pan
    • Nuclear Engineering and Technology / v.56 no.4 / pp.1501-1512 / 2024
  • Plasma disruption in tokamak experiments is a challenging issue that causes damage to the device. Reliable prediction methods are needed, but the lack of a full understanding of plasma disruption limits the effectiveness of physics-driven methods. Data-driven methods based on supervised learning are commonly used, and they rely on labelled training data. However, manual labelling of disruption precursors is a time-consuming and challenging task, as some precursors are difficult to identify accurately. Mainstream labelling methods assume that the precursor onset occurs at a fixed time before disruption, which leads to mislabeled samples and suboptimal prediction performance. In this paper, we present disruption prediction methods based on anomaly detection to address these issues, demonstrating good prediction performance on J-TEXT and EAST. By evaluating precursor onset times with different anomaly detection algorithms, we find that the labelling methods can be improved, since the onset times of different shots are not necessarily the same. The study optimizes precursor labelling using the onset times inferred by the anomaly detection predictor and tests the optimized labels on supervised-learning disruption predictors. The results on J-TEXT and EAST show that models trained on the optimized labels outperform those trained on fixed-onset-time labels.
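
The relabeling step, inferring a per-shot onset time from an anomaly score trace rather than assuming a fixed time before disruption, might look like the sketch below; the score threshold and the run-length requirement are assumptions, not values from the paper.

```python
# Per-shot precursor relabeling from anomaly scores: the onset is taken as the start of
# the first sustained run of anomalous points, and samples after it are labeled precursor.
import numpy as np

def infer_onset_time(times, anomaly_scores, threshold=0.5, min_consecutive=5):
    """Return the start time of the first run of >= min_consecutive anomalous points, else None."""
    above = np.asarray(anomaly_scores) > threshold
    run_start = None
    for i, flag in enumerate(above):
        if flag:
            if run_start is None:
                run_start = i
            if i - run_start + 1 >= min_consecutive:
                return times[run_start]
        else:
            run_start = None
    return None

def relabel_shot(times, anomaly_scores, **kwargs):
    onset = infer_onset_time(times, anomaly_scores, **kwargs)
    if onset is None:                                     # no precursor found in this shot
        return np.zeros(len(times), dtype=int)
    return (np.asarray(times) >= onset).astype(int)       # 1 = precursor phase
```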