• Title/Summary/Keyword: 오분류 (misclassification)

Evaluations of predicted models fitted for data mining - comparisons of classification accuracy and training time for 4 algorithms (데이터마이닝기법상에서 적합된 예측모형의 평가 -4개분류예측모형의 오분류율 및 훈련시간 비교평가 중심으로)

  • Lee, Sang-Bock
    • Journal of the Korean Data and Information Science Society / v.12 no.2 / pp.113-124 / 2001
  • CHAID, logistic regression, bagging trees, and bagging trees are compared on HMEQ, an artificial data set from SAS, in terms of classification accuracy and training time. Bagging trees achieve the best error rate, although they run slower than the other models. Logistic regression has the shortest run time among the given models, but no single model is uniformly best on both criteria.

Automatic Retrieval of SNS Opinion Document Using Machine Learning Technique (기계학습을 이용한 SNS 오피니언 문서의 자동추출기법)

  • Chang, Jae-Young
    • The Journal of the Institute of Internet, Broadcasting and Communication / v.13 no.5 / pp.27-35 / 2013
  • As Social Network Services (SNS) have become more popular, much research has been done on analyzing public opinion from SNS. One of the most important tasks in this area is separating opinion (subjective) documents from others (e.g., objective documents) in SNS. In this paper, we propose a new method for retrieving opinion documents from Twitter. Searching for or classifying opinion documents in Twitter is difficult because few Twitter documents are publicly available for training. To tackle this problem, we first build a machine-learned model for sentiment classification using external documents similar to Twitter posts, and then modify the model to separate opinion documents in Twitter. Experimental results show that the proposed method can be applied successfully to opinion classification.

Cancer driver gene using multi-omics data and biological network information (멀티 오믹스 데이터 및 생물학적 네트워크 정보를 이용한 드라이버 유전자 분류)

  • Jeong-Ho Park;Kyuri Jo
    • Proceedings of the Korea Information Processing Society Conference / 2023.05a / pp.490-492 / 2023
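  • With advances in sequencing technology, the accumulation of diverse omics data, and progress in artificial intelligence, a variety of driver gene classification methods have been proposed. Recently, as cancer data has accumulated at large scale, many machine learning-based methods have been proposed. In particular, there have been active attempts to achieve high accuracy on high-dimensional data that combines multiple types of omics data. In this paper, we present a deep learning model that classifies driver genes, which play an important role in the proliferation and development of cancer, based on multi-omics and network-related features. After training the model on The Cancer Genome Atlas (TCGA) data, we confirmed that its performance improved over existing statistical and machine learning-based methods.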

Efficient Retrieval of Short Opinion Documents Using Learning to Rank (기계학습을 이용한 단문 오피니언 문서의 효율적 검색 기법)

  • Chang, Jae-Young
    • The Journal of the Institute of Internet, Broadcasting and Communication / v.13 no.4 / pp.117-126 / 2013
  • As Social Network Services (SNS) such as Twitter and Facebook have become more popular, much research has been done on opinion mining. However, current research focuses mostly on sentiment classification or feature selection, and there have been few studies on opinion document retrieval. In this paper, we propose a new method for retrieving short opinion documents. The proposed method builds on existing sentiment classification methodology and uses several document features to evaluate the quality of opinion documents. To generate the retrieval model, we adopt a learning-to-rank technique and integrate the sentiment classification model into it. Experimental results show that the proposed method can be applied successfully to opinion search.

An Empirical Comparative Study of Classification Algorithms (분류 알고리즘에 대한 경험적 비교연구)

  • 전홍석;이주영
    • Proceedings of the Safety Management and Science Conference / 2000.05a / pp.411-422 / 2000
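  • This study examines each classification algorithm in the decision tree field, compares discriminant analysis from statistics with classification algorithms from the machine learning field, and analyzes how the misclassification rate varies with the data.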

Theoretical Considerations for the Agresti-Coull Type Confidence Interval in Misclassified Binary Data (오분류된 이진자료에서 Agresti-Coull유형의 신뢰구간에 대한 이론적 고찰)

  • Lee, Seung-Chun
    • Communications for Statistical Applications and Methods / v.18 no.4 / pp.445-455 / 2011
  • Although misclassified binary data occur frequently in practice, the statistical methodology available for such data is rather limited. In particular, interval estimation of the population proportion has relied on the classical Wald method. Recently, Lee and Choi (2009) developed a new confidence interval by applying the Agresti-Coull approach and demonstrated its efficiency numerically, but a theoretical justification has not yet been explored. Therefore, a Bayesian model for misclassified binary data is developed to examine the Agresti-Coull confidence interval from a theoretical point of view. It is shown that the Agresti-Coull confidence interval is essentially a Bayesian confidence interval.
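
For orientation, the sketch below computes the classical Agresti-Coull interval for an ordinary binomial proportion, the interval that the abstract contrasts with the Wald method. It does not implement the misclassified-data extension of Lee and Choi (2009) or the Bayesian model, neither of which is specified in the abstract; the function name `agresti_coull_interval` and its parameters are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist


def agresti_coull_interval(successes, n, confidence=0.95):
    """Classical Agresti-Coull interval for a binomial proportion.

    Assumption: plain binomial data with no misclassification.
    """
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    # Two-sided standard normal quantile for the requested confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Add z^2 pseudo-observations, half of them successes.
    n_adj = n + z * z
    p_adj = (successes + z * z / 2) / n_adj
    half_width = z * sqrt(p_adj * (1 - p_adj) / n_adj)
    # Clip to [0, 1] because a proportion cannot leave that range.
    return max(0.0, p_adj - half_width), min(1.0, p_adj + half_width)


lower, upper = agresti_coull_interval(10, 50)
print(f"95% interval for 10/50: ({lower:.4f}, {upper:.4f})")
```

Unlike the Wald interval, which is centered on the raw proportion, this interval is centered on the adjusted estimate `p_adj`, which pulls the center toward 0.5.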

Aggregating Prediction Outputs of Multiple Classification Techniques Using Mixed Integer Programming (다수의 분류 기법의 예측 결과를 결합하기 위한 혼합 정수 계획법의 사용)

  • Jo, Hongkyu;Han, Ingoo
    • Journal of Intelligence and Information Systems / v.9 no.1 / pp.71-89 / 2003
  • Although many studies demonstrate that one technique outperforms the others for a given data set, there is often no way to tell a priori which technique will be most effective for a given classification problem. It has therefore been suggested that a better approach to classification might be to integrate several different forecasting techniques. This study proposes a methodology for linearly combining different classification techniques. The methodology finds the optimal combining weights and computes the weighted average of the techniques' outputs. It is formulated as a mixed integer program whose objective function minimizes the total misclassification cost, defined as the weighted sum of the two types of misclassification. To simplify the solution process, the cutoff value is fixed and the threshold function is removed. The mixed integer program is solved with the branch and bound method. The results show that the proposed methodology classifies more accurately than any individual technique, and that it predicts significantly better than the individual techniques and the other combining methods.
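
The idea of weighting classifier outputs to minimize a weighted sum of the two misclassification types can be illustrated with a small sketch. Note the swap: instead of the mixed integer program and branch and bound described in the abstract, this sketch uses an exhaustive search over a coarse grid of weights. All names (`misclassification_cost`, `combine`, `best_weights`), the cost values, and the toy scores are hypothetical.

```python
from itertools import product


def misclassification_cost(scores, labels, cutoff=0.5, cost_fp=1.0, cost_fn=1.0):
    """Weighted sum of false positives and false negatives at a fixed cutoff."""
    fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < cutoff and y == 1)
    return cost_fp * fp + cost_fn * fn


def combine(outputs, weights):
    """Weighted average of per-classifier score lists (weights sum to 1)."""
    n_samples = len(outputs[0])
    return [sum(w * out[i] for w, out in zip(weights, outputs))
            for i in range(n_samples)]


def best_weights(outputs, labels, steps=10, **cost_kwargs):
    """Grid search (not MIP) over non-negative weights that sum to 1."""
    best = None
    for combo in product(range(steps + 1), repeat=len(outputs)):
        if sum(combo) != steps:
            continue  # keep only weight vectors on the simplex
        weights = [c / steps for c in combo]
        cost = misclassification_cost(combine(outputs, weights), labels,
                                      **cost_kwargs)
        if best is None or cost < best[0]:
            best = (cost, weights)
    return best


# Toy data: each classifier alone makes one error at the 0.5 cutoff.
labels = [1, 0, 1, 0]
clf_a = [0.9, 0.6, 0.7, 0.1]  # false positive on the second sample
clf_b = [0.8, 0.2, 0.4, 0.3]  # false negative on the third sample
cost, weights = best_weights([clf_a, clf_b], labels)
print(cost, weights)
```

The grid search grows exponentially with the number of classifiers, which is one reason an exact formulation such as the one in the abstract may be preferable beyond a handful of techniques.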

Analysis and Alternative of Classification Errors in Public Libraries (공공도서관 분류오류의 실증적 분석과 대안)

  • 윤희윤
    • Journal of Korean Library and Information Science Society / v.34 no.1 / pp.43-65 / 2003
  • Libraries have long experience of applying classification schemes to resources, chiefly books. The ultimate goals of classification are the systematic shelving of books and ease of access for users. To achieve these goals, books on a particular field of knowledge should be shelved together or near each other. When they are not, a classification error has occurred. This study therefore analyzes classification errors in Korean public libraries and suggests some alternatives.

Undecided inference using logistic regression for credit evaluation (신용평가에서 로지스틱 회귀를 이용한 미결정자 추론)

  • Hong, Chong-Sun;Jung, Min-Sub
    • Journal of the Korean Data and Information Science Society / v.22 no.2 / pp.149-157 / 2011
  • Undecided inference can be regarded as a missing data problem of the MAR (missing at random) or MNAR (missing not at random) type. Under the MAR assumption, undecided inference uses a logistic regression model: the probability of default for the undecided group is obtained with the regression coefficient vector of the decided group and compared with the probability of default for the decided group. Under the MNAR assumption, undecided inference uses a logistic regression model with an additional feature random vector. Simulation results based on two real data sets are obtained and compared. Under the MAR assumption, the misclassification rates differ little from the rate for the raw data. Under the MNAR assumption, however, the misclassification rates are lower than those under MAR, and they decrease as the proportion of the undecided group increases.

Misclassified Area Detection Algorithm for Aerial LiDAR Digital Terrain Data (항공 라이다 수치지면자료의 오분류 영역 탐지 알고리즘)

  • Kim, Min-Chul;Noh, Myoung-Jong;Cho, Woo-Sug;Bang, Ki-In;Park, Jun-Ku
    • Journal of Korean Society for Geospatial Information Science / v.19 no.1 / pp.79-86 / 2011
  • Recently, aerial laser scanning technology has received considerable attention for constructing DEMs (Digital Elevation Models). It is well known that the quality of a DEM is mostly determined by the accuracy of the DTD (Digital Terrain Data) extracted from LiDAR (Light Detection And Ranging) raw data. However, the DTD generated by an automatic filtering process always contains misclassified data, owing to the limitations of automatic filtering algorithms and the intrinsic properties of LiDAR raw data. To eliminate the misclassified data, a manual filtering process is performed right after automatic filtering. This study proposes an algorithm that automatically detects potentially misclassified data in the DTD produced by automatic filtering, which reduces the load of the manual filtering process. The algorithm runs on a 2D grid data structure and uses several parameters such as 'Slope Angle', 'Slope DeltaH' and 'NNMaxDH (Nearest Neighbor Max Delta Height)'. The experimental results show that the proposed algorithm detects the misclassified data well regardless of terrain type and LiDAR point density.
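
As a rough illustration of a grid-based check in the spirit of the 'NNMaxDH' parameter, the sketch below flags cells of a 2D height grid whose largest height difference to any of their eight neighbors exceeds a threshold. The abstract does not define the parameters or how they are combined, so the definition used here, the threshold value, and the function names `nn_max_delta_height` and `flag_suspect_cells` are all assumptions.

```python
def nn_max_delta_height(grid, row, col):
    """Largest absolute height difference between a cell and its 8 neighbors.

    Assumed reading of 'Nearest Neighbor Max Delta Height'.
    """
    rows, cols = len(grid), len(grid[0])
    center = grid[row][col]
    deltas = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue  # skip the cell itself
            r, c = row + dr, col + dc
            if 0 <= r < rows and 0 <= c < cols:
                deltas.append(abs(center - grid[r][c]))
    return max(deltas) if deltas else 0.0


def flag_suspect_cells(grid, max_delta_h):
    """Return (row, col) of cells whose neighbor height jump exceeds the threshold."""
    return [(r, c)
            for r in range(len(grid))
            for c in range(len(grid[0]))
            if nn_max_delta_height(grid, r, c) > max_delta_h]


# Flat 5x5 terrain with one 5 m spike, e.g. a leftover non-ground point.
terrain = [[0.0] * 5 for _ in range(5)]
terrain[2][2] = 5.0
suspects = flag_suspect_cells(terrain, max_delta_h=2.0)
print(suspects)
```

Because the height jump is symmetric, this check flags the spike together with its immediate neighbors, so it marks a candidate area for manual review instead of isolating a single point.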