• Title/Summary/Keyword: 통계학분류 (statistical classification)

Random projection ensemble adaptive nearest neighbor classification (랜덤 투영 앙상블 기법을 활용한 적응 최근접 이웃 판별분류기법)

  • Kang, Jongkyeong;Jhun, Myoungshic
    • The Korean Journal of Applied Statistics / v.34 no.3 / pp.401-410 / 2021
  • Although k-nearest neighbor classification is popular in discriminant analysis, it uses a fixed number of neighbors and therefore does not reflect the local characteristics of the data. The adaptive nearest neighbor method was developed to select the number of neighbors according to the local structure of the data. In the analysis of high-dimensional data, it is common to perform dimension reduction, such as random projection, before applying k-nearest neighbor classification. Recently, an ensemble technique has been developed that carefully combines the results of such random classifiers and makes final assignments by voting. In this paper, we propose a novel discriminant classification technique that combines adaptive nearest neighbor methods with random projection ensemble techniques for the analysis of high-dimensional data. Through simulation and real-world data analyses, we confirm that the proposed method outperforms previously developed methods in terms of classification accuracy.
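
The idea of projecting, classifying, and voting can be illustrated with a minimal sketch. It is not the authors' method: it uses a fixed k rather than an adaptive neighbor rule, it votes over all projections without any selection step, and the function names, Gaussian projection matrix, and default parameters are assumptions made for illustration.

```python
import math
import random
from collections import Counter


def random_projection(dim_in, dim_out, rng):
    """Draw a dim_out x dim_in Gaussian matrix (hypothetical helper)."""
    scale = 1.0 / math.sqrt(dim_out)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(dim_in)]
            for _ in range(dim_out)]


def project(matrix, x):
    """Map a point x into the lower-dimensional space."""
    return [sum(w * v for w, v in zip(row, x)) for row in matrix]


def knn_predict(train_x, train_y, query, k):
    """Plain k-NN with a fixed k; an adaptive rule would choose k per query."""
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def rp_ensemble_predict(train_x, train_y, query,
                        k=3, dim_out=2, n_proj=25, seed=0):
    """Classify `query` by majority vote over k-NN on random projections."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_proj):
        matrix = random_projection(len(query), dim_out, rng)
        low_x = [project(matrix, x) for x in train_x]
        label = knn_predict(low_x, train_y, project(matrix, query), k)
        votes[label] += 1
    return votes.most_common(1)[0][0]
```

Each projection gives one base classifier, and the final label is the most frequent vote. Replacing `knn_predict` with a rule that picks k from the local structure around the query would move the sketch closer to the adaptive setting the abstract describes.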

Wafer bin map failure pattern recognition using hierarchical clustering (계층적 군집분석을 이용한 반도체 웨이퍼의 불량 및 불량 패턴 탐지)

  • Jeong, Joowon;Jung, Yoonsuh
    • The Korean Journal of Applied Statistics / v.35 no.3 / pp.407-419 / 2022
  • The semiconductor fabrication process is complex and time-consuming. Errors sometimes occur in the process and result in defective dies on the wafer bin map (WBM). A faulty WBM can be detected by finding patterns formed by the defective dies. Searching for failures manually takes a long time because of the enormous number of WBMs. In this paper, we suggest a two-step approach to discover the probable pattern on the WBMs. The first step separates the normal WBMs from the defective WBMs. We adapt hierarchical clustering for de-noising, which performs this task well when the minimum number of points and the cutting height are tuned carefully. A WBM declared faulty moves to the next step. In the second step, we classify the patterns among the defective WBMs. For this purpose, we extract features from the WBM, and a machine learning algorithm then classifies the pattern. We use a real WBM data set (WM-811K) released by Taiwan Semiconductor Manufacturing Company.
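
A rough sketch of how a cutting height and a minimum number of points can drive de-noising is given below. The abstract does not state the linkage or the distance, so single linkage with Euclidean distance on die coordinates is assumed here, and the function names are hypothetical.

```python
import math


def single_linkage_clusters(points, cut_height):
    """Group points as a single-linkage dendrogram cut at `cut_height` would.

    Cutting a single-linkage tree at height h yields the connected components
    of the graph that links every pair of points at distance <= h.
    """
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        start = unvisited.pop()
        component, frontier = [start], [start]
        while frontier:
            i = frontier.pop()
            near = [j for j in unvisited
                    if math.dist(points[i], points[j]) <= cut_height]
            for j in near:
                unvisited.remove(j)
            component.extend(near)
            frontier.extend(near)
        clusters.append(sorted(component))
    return clusters


def denoise(points, cut_height, min_points):
    """Keep defective dies that sit in a cluster of at least `min_points`."""
    kept = []
    for cluster in single_linkage_clusters(points, cut_height):
        if len(cluster) >= min_points:
            kept.extend(points[i] for i in cluster)
    return sorted(kept)
```

Under these assumptions, a WBM whose defective dies all vanish after `denoise` would be treated as normal, and one that retains a cluster would go on to the pattern classification step.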

Advancing Societal Statistics Processing Methodology through Artificial Intelligence: A Case Study on Household Trend Survey and Time Use Survey (인공지능 기반 사회 통계 생산 방법론 고도화 방안: 가계동향조사와 생활시간조사 사례)

  • Kyo-Joong Oh;Ho-Jin Choi;Ilgu Kim;Seungwoo Han;Kunsoo Kim
    • Annual Conference on Human and Language Technology / 2023.10a / pp.563-567 / 2023
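  • This study attempts to innovate the data processing procedures and methods of the Household Trend Survey and the Time Use Survey conducted by Statistics Korea. It aims at artificial intelligence-based statistics production that overcomes the limitations of existing statistics production methodology and enables effective management and analysis of large-scale data. Conducted at the intersection of data science and statistics, the study uses artificial intelligence techniques, in particular natural language processing and deep learning, to verify the performance of unstructured text classification methods, and it explores the scalability of AI-based statistical classification methodology and the possibility of extending it to additional surveys. The results contribute to improving the quality and reliability of statistical data and provide a deeper and more accurate understanding of people's life patterns and behavior.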

A study on the estimation of rock mass classes using the information off a tunnel center line (터널 중심선으로부터 이격된 자료를 활용한 미시추구간의 암반등급 산정에 관한 연구)

  • You, Kwang-Ho;Lee, Sang-Ho;Choo, Suk-Yeon;Jue, Kwang-Sue
    • Journal of Korean Tunnelling and Underground Space Association / v.6 no.2 / pp.101-111 / 2004
  • To guarantee the stability of a tunnel and its optimum design, it is very important to obtain enough ground investigation data. In reality, however, this is often not possible because of the limits on measuring spatially distributed data and for economic reasons. In particular, there are regions where drilling is impossible because of civil complaints and mountainous topography, and it is also difficult to estimate rock mass classes quantitatively from geophysical exploration data alone. In this study, therefore, three-dimensional multiple indicator kriging (3D-MI kriging), which can incorporate geophysical exploration data and drill core data located off a tunnel center line, is proposed to cope with such problems. To this end, two-dimensional multiple indicator kriging, one of the geostatistical techniques, is extended to three-dimensional analysis. The proposed 3D-MI kriging was also applied to determine rock mass classes by the RMR system for the design of a Kyungbu express railway tunnel.

Construction of Onion Sentiment Dictionary using Cluster Analysis (군집분석을 이용한 양파 감성사전 구축)

  • Oh, Seungwon;Kim, Min Soo
    • Journal of the Korean Data Analysis Society / v.20 no.6 / pp.2917-2932 / 2018
  • Much research has sought to develop production forecasting models to resolve the supply imbalance of onions, a vegetable closely associated with Korean food. However, because onions can be stored, it is very difficult to resolve the supply imbalance by predicting production alone. This paper therefore aims to build a sentiment dictionary for predicting onion prices from internet articles, which are easy to access in daily life and contain information about onion production and various price factors. Articles about onions from 2012 to 2016 were used, and four kinds of TF-IDF were compared through document classification based on onion wholesale prices. Positive and negative words for price were classified by k-means clustering, DBSCAN (density-based spatial clustering of applications with noise) clustering, and GMM (Gaussian mixture model) clustering, which are partitional clustering methods; GMM clustering produced three meaningful dictionaries. To assess the validity of the constructed dictionaries, logistic regression was applied to articles classified by price rise and drop, and it showed 85.7% accuracy.
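
For readers unfamiliar with TF-IDF weighting, a minimal sketch of one common variant follows. The abstract does not specify which four variants were compared, so raw term counts multiplied by log(N / df) are assumed here, and the function name and toy tokens are illustrative only.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: raw term count times idf = log(N / df).
    """
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({term: count * math.log(n_docs / doc_freq[term])
                        for term, count in counts.items()})
    return weights
```

A term that appears in every document receives weight zero under this variant, so words such as "onion" in an onion-only corpus would not help separate price-rise articles from price-drop articles.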

Hierarchical grouping recommendation system based on the attributes of contents: a case study of 'The Movie Dataset' (콘텐츠 속성에 따른 계층적 그룹화 추천시스템: 'The Movie Dataset' 분석사례연구)

  • Kim, Yoon Kyoung;Yeo, In-Kwon
    • The Korean Journal of Applied Statistics / v.33 no.6 / pp.833-842 / 2020
  • Global platforms such as Netflix, Amazon, and YouTube have developed precise recommendation systems based on various information from large sets of customers, and many of the recommended items lead to actual purchases. In this paper, a cluster analysis was conducted according to the attributes of the content, on the expectation that user preferences differ according to the attributes of the recommended content. Gower distance was used so that the analysis applies regardless of the type of variables. Using data from the movie rating site 'The Movie Dataset', users were grouped hierarchically and movies were recommended based on genre, director, and actor variables. To evaluate the proposed recommendation systems, the user group was divided into a training set and a test set and the precision was examined. The results showed that the proposed algorithms have far higher precision than UBCF (user-based collaborative filtering).
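
Gower distance handles mixed variable types by averaging per-variable dissimilarities. The sketch below is a simplified version under stated assumptions: no missing values, equal variable weights, and a caller-supplied `ranges` argument (a hypothetical name) that gives the observed range of each numeric variable and `None` for each categorical one.

```python
def gower_distance(a, b, ranges):
    """Gower distance between two records given as equal-length tuples.

    ranges[i] is the observed range (max - min) of numeric variable i,
    or None when variable i is categorical.
    """
    total = 0.0
    for x, y, r in zip(a, b, ranges):
        if r is None:
            total += 0.0 if x == y else 1.0   # categorical: simple mismatch
        elif r > 0:
            total += abs(x - y) / r           # numeric: range-scaled difference
        # A numeric variable with zero range adds no dissimilarity here.
    return total / len(ranges)
```

For example, two movies that differ in genre, differ by half the observed runtime range, and share the same rating are at distance (1 + 0.5 + 0) / 3 = 0.5.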

Comparison of Clustering Techniques in Flight Approach Phase using ADS-B Track Data (공항 근처 ADS-B 항적 자료에서의 클러스터링 기법 비교)

  • Jong-Chan Park;Heon Jin Park
    • The Journal of Bigdata / v.6 no.2 / pp.29-38 / 2021
  • In aviation safety management, route deviation is a hazard that can lead to serious accidents. In this study, tracks are classified through clustering, and an anomaly score is calculated as the distance from the cluster center. The study extracted tracks within 100 km of the airport from ADS-B track data received over one year. The tracks were vectorized using linear interpolation, with three-dimensional coordinates of latitude, longitude, and altitude. Through PCA, the dimension was reduced to axes representing more than 90% of the overall data distribution, and k-means clustering, hierarchical clustering, and PAM techniques were applied. The number of clusters was selected using the silhouette measure. We compare the number of clusters obtained by each clustering technique and evaluate the clustering results through the silhouette measure.
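
The scoring step can be sketched in a few lines. This assumes that each vectorized track is scored against its nearest cluster center using Euclidean distance, which the abstract does not state explicitly; the helper names are hypothetical.

```python
import math


def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def anomaly_score(track, centroids):
    """Distance from a vectorized track to its nearest cluster center."""
    return min(math.dist(track, c) for c in centroids)
```

A track that follows a typical approach path lies close to some center and scores near zero, while a deviating track scores higher; any threshold for flagging would have to be chosen from the data.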

A Feature Selection Method Based on Fuzzy Cluster Analysis (퍼지 클러스터 분석 기반 특징 선택 방법)

  • Rhee, Hyun-Sook
    • The KIPS Transactions:PartB / v.14B no.2 / pp.135-140 / 2007
  • Feature selection is a preprocessing technique commonly used on high-dimensional data. It studies how to select a subset or list of attributes that are used to construct models describing data. Feature selection methods attempt to explore the intrinsic properties of data by employing statistics or information theory. Recent developments have involved approaches such as correlation methods, dimensionality reduction, and mutual information techniques. Feature selection has become the focus of much research in application areas with massive and complex data sets. In this paper, we provide a feature selection method that considers data characteristics and generalization capability. It provides a computational approach to feature selection based on fuzzy cluster analysis of attribute values and its performance measures. We apply it to a system for classifying computer viruses and compare it with a heuristic method using the contrast concept. Experimental results show that the proposed approach can give a feature ranking, select the features, and improve the system performance.

Comparison of evaluation measures for classification models on binary data (이진자료 분류모형에 대한 평가측도의 특성 비교)

  • Kim, Byungsoo;Kwon, Soyoung
    • The Korean Journal of Applied Statistics / v.32 no.2 / pp.291-300 / 2019
  • This study investigates the characteristics of evaluation measures for classification models on a binary response variable in order to evaluate their suitability for use. Six measures are considered: Accuracy, Sensitivity, Specificity, Precision, F-measure, and Heidke's skill score (HSS). The evaluation measures are reformulated from the confusion matrix using x (the ratio of actual 1s), y (the ratio of predicted 1s), and z (the ratio of cases that are both actual and predicted 1s). We suggest two necessary conditions to assess the suitability of the evaluation measures. The first condition is that the measure function is constant in x and y in the case of a random model. The second condition is that the measure function is increasing in z and decreasing in x and y. Since only HSS satisfies the two conditions, it is always appropriate as an evaluation measure for a classification model on a binary response variable, and the other measures should be used within a limited range.
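
The reformulation in terms of x, y, and z can be written out as a short sketch. The formulas below follow from the standard definitions of the six measures; the function name is an assumption, and degenerate tables (for example, no predicted 1s) are not handled.

```python
def binary_measures(tp, fp, fn, tn):
    """Six evaluation measures from the counts of a 2x2 confusion matrix."""
    n = tp + fp + fn + tn
    x = (tp + fn) / n   # ratio of actual 1s
    y = (tp + fp) / n   # ratio of predicted 1s
    z = tp / n          # ratio of cases that are both actual and predicted 1s
    precision = z / y
    sensitivity = z / x
    return {
        "accuracy": 1 - x - y + 2 * z,
        "sensitivity": sensitivity,
        "specificity": (1 - x - y + z) / (1 - x),
        "precision": precision,
        "f_measure": 2 * precision * sensitivity / (precision + sensitivity),
        # HSS = (accuracy - chance accuracy) / (1 - chance accuracy)
        "hss": 2 * (z - x * y) / (x + y - 2 * x * y),
    }
```

For a random model, predictions are independent of the actual values, so z = xy and HSS equals 0 whatever x and y are, whereas accuracy under the same model still varies with x and y. This matches the first condition described in the abstract.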

Quantitative Prediction of Landslide Probability in Gyeonggi Province, Korea (경기지역 산사태 발생가능성의 정량적 예측)

  • 김원영;김경수;채병곤;조용찬
    • Proceedings of the KSEG Conference / 2001.03a / pp.33-44 / 2001
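  • About 1,600 landslides that occurred in the Gyeonggi region were investigated in detail using 1:50,000 and 1:5,000 topographic maps. Most of the landslides are classified as debris flows, but their initiation zones have the character of translational slides. To identify geological factors other than rainfall that trigger landslides, six areas that received 250-300 mm of rainfall in a single day were selected as detailed study areas. After detailed field surveys and laboratory tests on 198 landslides in these areas, geostatistical techniques were used to select the causes of landslide occurrence and to determine a quantitative weight for each. According to the analysis, seven factors were selected as causes of landslides, and a formula was derived for calculating landslide probability with a quantitative weight assigned to each cause. Landslide probabilities computed for some areas were compared with actual occurrence records, showing an accuracy of 90.74%.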
