• Title/Summary/Keyword: Unlabeled

Search Result 154, Processing Time 0.038 seconds

Probabilistic based Web Contents Mining (확률 기반 웹 콘텐츠 마이닝)

  • Yun, Bo-Hyun;Cho, Kwang-Moon
    • Proceedings of the Korea Contents Association Conference
    • /
    • 2006.11a
    • /
    • pp.16-20
    • /
    • 2006
  • In Web contents mining, it is important to recognize the unlabeled entities and to integrate the sub-linked information and the extracted results. This paper presents the probabilistic based method which can recognize the unlabeled entity by using the Baysien model. Moreover, we propose the method that can use the information of the sub-linked web pages and integrate the extracted results. In the experimental results, we can see that the probabilistic based entity and information integration show the most significant precision.

  • PDF

A Hybrid Selection Method of Helpful Unlabeled Data Applicable for Semi-Supervised Learning Algorithm

  • Le, Thanh-Binh;Kim, Sang-Woon
    • IEIE Transactions on Smart Processing and Computing
    • /
    • v.3 no.4
    • /
    • pp.234-239
    • /
    • 2014
  • This paper presents an empirical study on selecting a small amount of useful unlabeled data to improve the classification accuracy of semi-supervised learning algorithms. In particular, a hybrid method of unifying the simply recycled selection method and the incrementally-reinforced selection method was considered and evaluated empirically. The experimental results, which were obtained from well-known benchmark data sets using semi-supervised support vector machines, demonstrated that the hybrid method works better than the traditional ones in terms of the classification accuracy.

A Study on Incremental Learning Model for Naive Bayes Text Classifier (Naive Bayes 문서 분류기를 위한 점진적 학습 모델 연구)

  • 김제욱;김한준;이상구
    • The Journal of Information Technology and Database
    • /
    • v.8 no.1
    • /
    • pp.95-104
    • /
    • 2001
  • In the text classification domain, labeling the training documents is an expensive process because it requires human expertise and is a tedious, time-consuming task. Therefore, it is important to reduce the manual labeling of training documents while improving the text classifier. Selective sampling, a form of active learning, reduces the number of training documents that needs to be labeled by examining the unlabeled documents and selecting the most informative ones for manual labeling. We apply this methodology to Naive Bayes, a text classifier renowned as a successful method in text classification. One of the most important issues in selective sampling is to determine the criterion when selecting the training documents from the large pool of unlabeled documents. In this paper, we propose two measures that would determine this criterion : the Mean Absolute Deviation (MAD) and the entropy measure. The experimental results, using Renters 21578 corpus, show that this proposed learning method improves Naive Bayes text classifier more than the existing ones.

  • PDF

Implementing a Branch-and-bound Algorithm for Transductive Support Vector Machines

  • Park, Chan-Kyoo
    • Management Science and Financial Engineering
    • /
    • v.16 no.1
    • /
    • pp.81-117
    • /
    • 2010
  • Semi-supervised learning incorporates unlabeled examples, whose labels are unknown, as well as labeled examples into learning process. Although transductive support vector machine (TSVM), one of semi-supervised learning models, was proposed about a decade ago, its application to large-scaled data has still been limited due to its high computational complexity. Our previous research addressed this limitation by introducing a branch-and-bound algorithm for finding an optimal solution to TSVM. In this paper, we propose three new techniques to enhance the performance of the branch-and-bound algorithm. The first one tightens min-cut bound, one of two bounding strategies. Another technique exploits a graph-based approximation to a support vector machine problem to avoid the most time-consuming step. The last one tries to fix the labels of unlabeled examples whose labels can be obviously predicted based on labeled examples. Experimental results are presented which demonstrate that the proposed techniques can reduce drastically the number of subproblems and eventually computational time.

Automatic Text Classification Method Using Keywords and Unlabeled Text (주제어와 미분류 문서들을 이용한 문서의 자동 분류 방법)

  • Lee Kang-Il;Lee Chang-Hwan
    • Proceedings of the Korean Information Science Society Conference
    • /
    • 2005.07b
    • /
    • pp.592-594
    • /
    • 2005
  • 문서를 분류하기 위해서는 분류주제에 맞춰 미리 분류가 된 자료(labeled data)가 필요하다. 하지만 미리 분류가 된 자료를 만들기 위해서는 사람이 직접 그 문서의 의미를 해석하고 일일이 분류를 해야 하기 때문에 시간이 많이 소모가 된다. 본 논문에서는 비록 사랑이 직접 분류한 자료를 이용하는 것에 비해서 분류 정확도는 조금 떨어지지만, 대신 주제어와 미분류 문서(unlabeled data)를 이용해서 문서를 분류하는 방법을 제시하려고 한다. 이와 같은 주제어와 미분류 문서의 경우에는 구하기가 쉽고, 사랑이 일일이 분류하는 작업이 필요로 하지 않기 때문에 비용과 시간이 크게 절약이 된다는 장정이 있다.

  • PDF

Attentive Transfer Learning via Self-supervised Learning for Cervical Dysplasia Diagnosis

  • Chae, Jinyeong;Zimmermann, Roger;Kim, Dongho;Kim, Jihie
    • Journal of Information Processing Systems
    • /
    • v.17 no.3
    • /
    • pp.453-461
    • /
    • 2021
  • Many deep learning approaches have been studied for image classification in computer vision. However, there are not enough data to generate accurate models in medical fields, and many datasets are not annotated. This study presents a new method that can use both unlabeled and labeled data. The proposed method is applied to classify cervix images into normal versus cancerous, and we demonstrate the results. First, we use a patch self-supervised learning for training the global context of the image using an unlabeled image dataset. Second, we generate a classifier model by using the transferred knowledge from self-supervised learning. We also apply attention learning to capture the local features of the image. The combined method provides better performance than state-of-the-art approaches in accuracy and sensitivity.

Unlabeled Wi-Fi RSSI Indoor Positioning by Using IMU

  • Chanyeong, Ju;Jaehyun, Yoo
    • Journal of Positioning, Navigation, and Timing
    • /
    • v.12 no.1
    • /
    • pp.37-42
    • /
    • 2023
  • Wi-Fi Received Signal Strength Indicator (RSSI) is considered one of the most important sensor data types for indoor localization. However, collecting a RSSI fingerprint, which consists of pairs of a RSSI measurement set and a corresponding location, is costly and time-consuming. In this paper, we propose a Wi-Fi RSSI learning technique without true location data to overcome the limitations of static database construction. Instead of the true reference positions, inertial measurement unit (IMU) data are used to generate pseudo locations, which enable a trainer to move during data collection. This improves the efficiency of data collection dramatically. From an experiment it is seen that the proposed algorithm successfully learns the unsupervised Wi-Fi RSSI positioning model, resulting in 2 m accuracy when the cumulative distribution function (CDF) is 0.8.

A Study on Feature Extraction Performance of Naive Convolutional Auto Encoder to Natural Images (자연 영상에 대한 Naive Convolutional Auto Encoder의 특징 추출 성능에 관한 연구)

  • Lee, Sung Ju;Cho, Nam Ik
    • Proceedings of the Korean Society of Broadcast Engineers Conference
    • /
    • 2022.06a
    • /
    • pp.1286-1289
    • /
    • 2022
  • 최근 영상 군집화 분야는 딥러닝 모델에게 Self-supervision을 주거나 unlabeled 영상에 유사-레이블을 주는 방식으로 연구되고 있다. 또한, 고차원 컬러 자연 영상에 대해 잘 압축된 특징 벡터를 추출하는 것은 군집화에 있어 중요한 기준이 된다. 본 연구에서는 자연 영상에 대한 Convolutional Auto Encoder의 특징 추출 성능을 평가하기 위해 설계한 실험 방법을 소개한다. 특히 모델의 특징 추출 능력을 순수하게 확인하기 위하여 Self-supervision 및 유사-레이블을 제공하지 않은 채 Naive한 모델의 결과를 분석할 것이다. 먼저 실험을 위해 설계된 4가지 비지도학습 모델의 복원 결과를 통해 모델별 학습 정도를 확인한다. 그리고 비지도 모델이 다량의 unlabeled 영상으로 학습되어도 더 적은 labeled 데이터로 학습된 지도학습 모델의 특징 추출 성능에 못 미침을 특징 벡터의 군집화 및 분류 실험 결과를 통해 확인한다. 또한, 지도학습 모델에 데이터셋 간 교차 학습을 수행하여 출력된 특징 벡터의 군집화 및 분류 성능도 확인한다.

  • PDF

The use of support vector machines in semi-supervised classification

  • Bae, Hyunjoo;Kim, Hyungwoo;Shin, Seung Jun
    • Communications for Statistical Applications and Methods
    • /
    • v.29 no.2
    • /
    • pp.193-202
    • /
    • 2022
  • Semi-supervised learning has gained significant attention in recent applications. In this article, we provide a selective overview of popular semi-supervised methods and then propose a simple but effective algorithm for semi-supervised classification using support vector machines (SVM), one of the most popular binary classifiers in a machine learning community. The idea is simple as follows. First, we apply the dimension reduction to the unlabeled observations and cluster them to assign labels on the reduced space. SVM is then employed to the combined set of labeled and unlabeled observations to construct a classification rule. The use of SVM enables us to extend it to the nonlinear counterpart via kernel trick. Our numerical experiments under various scenarios demonstrate that the proposed method is promising in semi-supervised classification.

Research on supplementing unlabeled data through pseudo-labeling. (의사 레이블링을 통한 레이블이 없는 데이터 보완 연구)

  • Min-Hee Yoo;Heon-Chang Yu
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • 2023.11a
    • /
    • pp.410-413
    • /
    • 2023
  • 레이블링 작업은 데이터 분석 시 필요한 사전 작업중 하나이다. 모든 데이터들에 대해 레이블링 작업은 시간/인적 자원을 필요로 하기에, 해당 작업을 보완할 방법이 존재한다면 요구되는 리소스를 줄여 효율성을 크게 향상시킬 수 있다. 본 논문에서는 통신회사에서 적재된 데이터 셋에 대하여 레이블이 없는 데이터(Unlabeled-data)에 대해 의사 레이블링(Pseudo-labeling), SMOTE 를 통한 데이터 증강을 활용하여 기존에 활용되지 못한 데이터를 추가하여 모델에 학습시킨다. 실험을 통해 의사 레이블을 통한 모델 학습 방법이 기존 도메인 지식의 레이블 방법보다 효율적이고 성능이 우수함을 확인하였다.