• Title/Abstract/Keywords: Evaluation Data Set

1,079 search results (processing time: 0.023 s)

The Effect of Bias in Data Set for Conceptual Clustering Algorithms

  • Lee, Gye Sung
    • International journal of advanced smart convergence
    • /
    • Vol. 8, No. 3
    • /
    • pp.46-53
    • /
    • 2019
  • When a partitioned structure is derived from a data set using a clustering algorithm, it is not unusual to obtain a different set of outcomes when the algorithm runs on the data in a different order. This problem is known as the order bias problem. Many algorithms in machine learning try to achieve an optimized result from the available training and test data. Optimization is determined by an evaluation function, which itself has a tendency toward a certain goal. Such a tendency in the evaluation function is inevitable, both for efficiency and for consistency of the result. But its preference for a specific goal may sometimes lead to unfavorable consequences in the final clustering result. To overcome these bias problems, a first clustering pass constructs an initial partition. The initial partition is expected to imply the possible range of the number of final clusters. We apply data-centric sorting to the data objects in the clusters of the partition to rearrange them in a new order. The same clustering procedure is then reapplied to the newly arranged data set to build a new partition. We have developed an algorithm that reduces the bias effect resulting from how data is fed into the algorithm. Experimental results show that the algorithm helps minimize order bias effects. We also show that the current evaluation measure used for the clustering algorithm is biased toward fewer clusters and, as a result, larger clusters.
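
The order bias problem described in this abstract can be illustrated with a small sketch. The code below is not the paper's algorithm; it assumes a simple "leader" clustering rule (a hypothetical stand-in for any incremental clustering method) and shows only that feeding the same data in a different order can produce a different partition.

```python
def leader_clustering(points, threshold):
    """Incremental clustering: each point joins the first cluster whose
    leader (the cluster's first point) lies within `threshold`; otherwise
    the point starts a new cluster. The result depends on input order."""
    leaders = []
    clusters = []
    for p in points:
        for i, leader in enumerate(leaders):
            if abs(p - leader) <= threshold:
                clusters[i].append(p)
                break
        else:
            # No existing leader is close enough: open a new cluster.
            leaders.append(p)
            clusters.append([p])
    return clusters


# Same four values, two different presentation orders (toy data).
forward = leader_clustering([1.0, 2.0, 3.0, 4.0], threshold=1.5)
reordered = leader_clustering([2.0, 1.0, 3.0, 4.0], threshold=1.5)

print(forward)    # [[1.0, 2.0], [3.0, 4.0]]
print(reordered)  # [[2.0, 1.0, 3.0], [4.0]]
```

The two runs disagree on both cluster membership and cluster sizes, which is the kind of sensitivity the abstract's reordering step is meant to reduce.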

An Application of the Rough Set Approach to Credit Rating

  • Kim, Jae-Kyeong;Cho, Sung-Sik
    • Korea Intelligent Information Systems Society: Conference Proceedings
    • /
    • Korea Intelligent Information Systems Society 1999 Fall Conference: Intelligent Information Technology and Future Organization
    • /
    • pp.347-354
    • /
    • 1999
  • The credit rating represents an assessment of the relative level of risk associated with the timely payments required by the debt obligation. In this paper, we present a new approach to the credit rating of customers based on rough set theory. The concept of a rough set has proved to be an effective tool for the analysis of customer information systems representing knowledge gained by experience. The customer information system describes a set of customers by a set of multi-valued attributes, called condition attributes. The customers are classified into risk groups according to an expert's opinion, called the decision attribute. A natural problem of knowledge analysis then consists in discovering relationships, in terms of decision rules, between the description of customers by condition attributes and particular decisions. The rough set approach enables one to discover minimal subsets of condition attributes that ensure an acceptable quality of classification of the customers analyzed, and to derive decision rules from the customer information system that can be used to support decisions about rating new customers. The rough set approach analyzes only facts hidden in the data; it does not need any additional information about the data and does not correct inconsistencies manifested in the data. Instead, the rules produced are categorized into certain and possible. A real problem of the evaluation of credit rating by a department store is studied using the rough set approach.
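
The distinction between certain and possible rules in this abstract follows from the lower and upper approximations of rough set theory. A minimal sketch is given below; the customer table, attribute names, and values are invented for illustration and are not taken from the paper.

```python
from collections import defaultdict


def approximations(table, condition_attrs, decision_attr, target):
    """Return the (lower, upper) approximations of the set of objects whose
    decision attribute equals `target`, using the indiscernibility classes
    induced by `condition_attrs`."""
    # Group objects that are indistinguishable on the condition attributes.
    blocks = defaultdict(set)
    for name, row in table.items():
        key = tuple(row[a] for a in condition_attrs)
        blocks[key].add(name)

    target_set = {n for n, row in table.items() if row[decision_attr] == target}
    lower, upper = set(), set()
    for block in blocks.values():
        if block <= target_set:   # block lies entirely inside the target: certain
            lower |= block
        if block & target_set:    # block overlaps the target: possible
            upper |= block
    return lower, upper


# Hypothetical customer information system (illustrative values only).
customers = {
    "c1": {"income": "high", "arrears": "no", "risk": "low"},
    "c2": {"income": "high", "arrears": "no", "risk": "low"},
    "c3": {"income": "low", "arrears": "yes", "risk": "high"},
    "c4": {"income": "low", "arrears": "no", "risk": "low"},
    "c5": {"income": "low", "arrears": "no", "risk": "high"},
}

lower, upper = approximations(customers, ["income", "arrears"], "risk", "low")
print(sorted(lower))          # ['c1', 'c2']
print(sorted(upper))          # ['c1', 'c2', 'c4', 'c5']
print(sorted(upper - lower))  # boundary region: ['c4', 'c5']
```

Customers c4 and c5 share the same condition values but carry different decisions, so the inconsistency is kept rather than corrected: they yield only a possible rule, while c1 and c2 support a certain rule.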

Korean Goral Potential Habitat Suitability Model at Soraksan National Park Using Fuzzy Set and Multi-Criteria Evaluation

  • 최태영;박종화
    • Journal of the Korean Institute of Landscape Architecture
    • /
    • Vol. 32, No. 4
    • /
    • pp.28-38
    • /
    • 2004
  • The Korean goral (Nemorhaedus caudatus raddeanus) is one of the endangered species in Korea, and the rugged terrain of Soraksan National Park (373㎢) is a critical habitat for the species. However, the goral population is threatened by habitat fragmentation caused by roads and hiking trails. The objective of this study was to develop a potential habitat suitability model for the Korean goral in the park, based on the concepts of fuzzy set theory and multi-criteria evaluation. The suitability modeling process was divided into three steps. First, data for the modeling were collected through field work and a literature survey. The collected data included 204 GPS points obtained through a goral trace survey and the number of daily visitors to each hiking trail during the peak season of the park. Second, fuzzy set theory was employed to build a GIS database of environmental factors affecting the suitability of the goral habitat. Finally, a multi-criteria evaluation was performed as the final step towards a goral habitat suitability model. The results of the study were as follows. First, suitable habitats were characterized by proximity to rock cliffs, scattered pine (Pinus densiflora) patches, ridges, an elevation of 700∼800 m, and a south or southeast aspect. Second, the habitat suitability model had a high classification accuracy of 93.9% for the analysis site and 95.7% for the validation site at a cut-off value of 0.5. Finally, 11.7% of the habitat with a habitat suitability index above 0.5 was affected by roads and hiking trails in the park.
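
The combination of fuzzy membership and multi-criteria evaluation can be sketched as a weighted linear combination of membership scores. This is an illustrative assumption, not the model from the paper: the trapezoid breakpoints outside the reported 700∼800 m band, the criterion names, and the weights are all hypothetical.

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal fuzzy membership: 0 outside (a, d), 1 on [b, c],
    and linear on the two shoulders."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)


def suitability(memberships, weights):
    """Weighted linear combination of membership scores, normalized by
    the total weight so the index stays in [0, 1]."""
    total = sum(weights.values())
    return sum(memberships[k] * weights[k] for k in weights) / total


# Full membership on 700-800 m follows the abstract; the 500 m and 1000 m
# outer breakpoints are assumed for illustration.
elevation_score = trapezoid(600, 500, 700, 800, 1000)  # 0.5

memberships = {"elevation": elevation_score, "near_cliff": 1.0}
weights = {"elevation": 0.5, "near_cliff": 0.5}  # hypothetical weights

index = suitability(memberships, weights)
print(index)          # 0.75
print(index >= 0.5)   # True: suitable at the 0.5 cut-off used in the abstract
```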

Development of Personal-Credit Evaluation System Using Real-Time Neural Learning Mechanism

  • Park, Jong U.;Park, Hong Y.;Yoon Chung
    • Journal of Information Technology and Database
    • /
    • Vol. 2, No. 2
    • /
    • pp.71-85
    • /
    • 1995
  • Many results reported by neural network researchers claim that the classification accuracy of neural networks is superior to, or at least equal to, that of conventional methods. However, in a series of neural network classifications, it was found that the classification accuracy strongly depends on the characteristics of the training data set. Although many research reports note that the classification accuracy of neural networks can differ depending on the composition and architecture of the networks, the training algorithm, and the test data set, very few studies have addressed the problem of classification accuracy when the basic assumption of data monotonicity is violated. In this research, a development project for an automated credit evaluation system is described. The finding was that the arrangement of training data, so as to maintain the monotonicity of the data set, is critical to the successful implementation of neural training and to enhancing the classification accuracy of neural networks.

Evaluation of Machine Learning Algorithm Utilization for Lung Cancer Classification Based on Gene Expression Levels

  • Podolsky, Maxim D;Barchuk, Anton A;Kuznetcov, Vladimir I;Gusarova, Natalia F;Gaidukov, Vadim S;Tarakanov, Segrey A
    • Asian Pacific Journal of Cancer Prevention
    • /
    • Vol. 17, No. 2
    • /
    • pp.835-838
    • /
    • 2016
  • Background: Lung cancer remains one of the most common cancers in the world, both in terms of new cases (about 13% of total per year) and deaths (nearly one cancer death in five), because of the high case fatality. Errors in determining lung cancer type or malignant growth degrade treatment efficacy, because anticancer strategy depends on tumor morphology. Materials and Methods: We attempted to evaluate the effectiveness of machine learning algorithms in the task of lung cancer classification based on gene expression levels. We processed four publicly available data sets. The Dana-Farber Cancer Institute data set contains 203 samples, and the task was to classify four cancer types and sound tissue samples. With the University of Michigan data set of 96 samples, the task was a binary classification of adenocarcinoma and non-neoplastic tissues. The University of Toronto data set contains 39 samples, and the task was to detect recurrence, while with the Brigham and Women's Hospital data set of 181 samples the task was a binary classification of malignant pleural mesothelioma and adenocarcinoma. We used the k-nearest neighbor algorithm (k=1, k=5, k=10), a naive Bayes classifier assuming either a normal distribution of attributes or a distribution estimated through histograms, a support vector machine, and a C4.5 decision tree. The effectiveness of the machine learning algorithms was evaluated with the Matthews correlation coefficient. Results: The support vector machine showed the best results on the data sets from the Dana-Farber Cancer Institute and Brigham and Women's Hospital. All algorithms except the C4.5 decision tree showed maximum potential effectiveness on the University of Michigan data set. However, the C4.5 decision tree showed the best results for the University of Toronto data set. Conclusions: Machine learning algorithms can be used for lung cancer morphology classification and similar tasks based on gene expression level evaluation.
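
For reference, the Matthews correlation coefficient used as the evaluation measure here can be computed from a binary confusion matrix as sketched below. The sketch covers only the binary case; the multi-class tasks in the abstract would need the generalized form, and returning 0 when the denominator vanishes is a common convention assumed here, not something stated in the source.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for a binary confusion matrix.
    Ranges from -1 (total disagreement) to +1 (perfect prediction)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumed convention: a degenerate matrix (an empty row or column)
        # is scored as 0, i.e. no better than chance.
        return 0.0
    return (tp * tn - fp * fn) / denom


print(mcc(tp=5, tn=5, fp=0, fn=0))  # 1.0  (perfect classifier)
print(mcc(tp=0, tn=0, fp=5, fn=5))  # -1.0 (always wrong)
print(mcc(tp=5, tn=0, fp=5, fn=0))  # 0.0  (predicts one class only)
```

Unlike plain accuracy, the coefficient stays near zero for a classifier that always predicts the majority class, which matters for small, imbalanced data sets such as the ones listed above.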

Generation of Efficient Fuzzy Classification Rules Using Evolutionary Algorithm with Data Partition Evaluation

  • 류정우;김성은;김명원
    • Journal of the Korean Institute of Intelligent Systems
    • /
    • Vol. 18, No. 1
    • /
    • pp.32-40
    • /
    • 2008
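  • When data attribute values are continuous and vague, expressing classification rules as fuzzy rules is both useful and effective. However, it is difficult to determine the membership functions needed to generate effective fuzzy classification rules. In this paper, we propose a method that automatically generates effective fuzzy classification rules using an evolutionary algorithm. The proposed method generates initial membership functions according to the class distribution by supervised clustering, and then evolves them so that accurate and compact rules can be generated. We also propose an evolution method with data partition evaluation to improve the time efficiency of the evolutionary algorithm. In data partition evaluation, the entire training data set is divided into several training subsets, and each individual is evaluated on a randomly selected subset instead of the entire training data. Comparative experiments with existing methods on UCI benchmark data showed that the proposed method is effective on average. In addition, on the KDD'99 Cup intrusion detection data, the method achieved a recognition rate 1.54% higher and a detection cost 20.8% lower than those of the KDD'99 Cup winner, and data partition evaluation reduced the individual evaluation time by about 70%.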
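
The data partition evaluation idea in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the individuals here are plain thresholds rather than fuzzy rule sets, and the function names, the round-robin split, and the toy data are all hypothetical.

```python
import random


def partition(data, k):
    """Split the training data into k roughly equal parts (round-robin)."""
    return [data[i::k] for i in range(k)]


def accuracy(threshold, samples):
    """Hypothetical fitness: accuracy of the rule 'x > threshold means class 1'."""
    hits = sum((x > threshold) == bool(label) for x, label in samples)
    return hits / len(samples)


def evaluate_population(population, parts, fitness, rng):
    """Score each individual on one randomly chosen part instead of on
    the entire training set, which cuts the per-individual cost."""
    return [fitness(individual, rng.choice(parts)) for individual in population]


rng = random.Random(0)
# Toy labelled data: values 0.0 .. 9.9, class 1 from 5.0 upward.
training_data = [(x / 10, int(x >= 50)) for x in range(100)]
parts = partition(training_data, k=5)

population = [2.0, 4.9, 7.5]  # candidate thresholds standing in for individuals
scores = evaluate_population(population, parts, accuracy, rng)
print(scores)
```

Each fitness call touches 20 samples rather than 100, so evaluation cost per individual drops roughly in proportion to the number of parts, at the price of a noisier fitness estimate.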

The application of a fuzzy inference system and analytical hierarchy process based online evaluation framework to the Donghai Bridge Health Monitoring System

  • Dan, Danhui;Sun, Limin;Yang, Zhifang;Xie, Daqi
    • Smart Structures and Systems
    • /
    • Vol. 14, No. 2
    • /
    • pp.129-144
    • /
    • 2014
  • In this paper, a fuzzy inference system and an analytical hierarchy process-based online evaluation technique is developed to monitor the condition of the 32-km Donghai Bridge in Shanghai. The system has 478 sensors distributed along eight segments selected from the whole bridge. An online evaluation subsystem is realized, which uses raw data and extracted features or indices to give a set of hierarchically organized condition evaluations. The thresholds of each index were set to an initial value obtained from a structure damage and performance evolution analysis of the bridge. After one year of baseline monitoring, the initial threshold system was updated from the collected data. The results show that the techniques described are valid and reliable. The online method fulfills long-term infrastructure health monitoring requirements for the Donghai Bridge.

Deep Learning Model Validation Method Based on Image Data Feature Coverage

  • 임창남;박예슬;이정원
    • KIPS Transactions on Software and Data Engineering
    • /
    • Vol. 10, No. 9
    • /
    • pp.375-384
    • /
    • 2021
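  • Deep learning techniques have demonstrated high performance in image processing and are being applied in various fields. The most widely used methods for validating such deep learning models include holdout validation, k-fold cross-validation, and the bootstrap method. These existing techniques consider the balance of class ratios when splitting a data set, but they do not consider the ratios of the various features that exist even within the same class. If these features are not considered, the validation results may be biased toward some features. Therefore, in this paper, we improve on existing validation methods and propose a deep learning model validation technique for image classification based on data feature coverage. The proposed technique introduces data feature coverage, a numerical measure of how well the training data set and the evaluation data set used to train and validate a deep learning model reflect the features of the entire data set. With this approach, the data set can be split while guaranteeing coverage so that all features of the entire data set are included, and the model's evaluation results can be analyzed per generated feature cluster. The validation results confirmed that when the data feature coverage of the training data set decreases, the model learns in a manner biased toward specific features and its performance drops, with accuracy differing by up to 8.9% in the case of Fashion-MNIST.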
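
The abstract does not give the formula for data feature coverage, so the sketch below uses an assumed definition purely for illustration: the fraction of feature clusters found in the whole data set that also appear in a given split. The function name and the cluster labels are hypothetical, and the paper's actual measure may differ.

```python
def feature_coverage(subset_clusters, all_clusters):
    """Assumed definition (not from the paper): share of the feature
    clusters present in the full data set that the subset also contains."""
    all_ids = set(all_clusters)
    if not all_ids:
        return 0.0
    return len(set(subset_clusters) & all_ids) / len(all_ids)


# Hypothetical cluster label for each image, e.g. from clustering
# feature vectors within each class.
full_set = [0, 0, 1, 1, 2, 2, 3, 3]
biased_split = [0, 0, 1, 1]      # misses clusters 2 and 3
balanced_split = [0, 1, 2, 3]    # one image from every cluster

print(feature_coverage(biased_split, full_set))    # 0.5
print(feature_coverage(balanced_split, full_set))  # 1.0
```

Both splits have the same size, but only the second one represents every feature cluster, which is the property the abstract says class-balanced splitting alone does not guarantee.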

TMY2 Weather Data for Korea

  • 신기식;윤창렬;박상동
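    • Korean Society for New and Renewable Energy: Conference Proceedings
    • /
    • Proceedings of the 2009 Spring Conference of the Korean Society for New and Renewable Energy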
    • /
    • pp.243-246
    • /
    • 2009
  • To evaluate building energy performance, many building simulation programs are used, and their capabilities continue to develop. Despite these increased capabilities, the weather data used in building energy performance evaluation is still drawn from the same limited set of data. This often forces users to find or calculate weather data such as illuminance, solar radiation, and ground temperature from other sources. In addition, proper selection of the right weather data set is considered one of the important factors for a successful building energy simulation. In this paper, we describe TMY2 data, a generalized weather data format, apply it to the Seoul region, and examine the differences from existing weather data. A raw weather database covering 23 years has been developed to provide the weather data file for building energy analysis in Seoul.

Stability evaluation model for loess deposits based on PCA-PNN

  • Li, Guangkun;Su, Maoxin;Xue, Yiguo;Song, Qian;Qiu, Daohong;Fu, Kang;Wang, Peng
    • Geomechanics and Engineering
    • /
    • Vol. 27, No. 6
    • /
    • pp.551-560
    • /
    • 2021
  • Because of their low strength and high compressibility, tunnels in loess deposits are prone to large deformations and collapse. An accurate stability evaluation for loess deposits is of considerable significance for deformation control and safety during tunnel construction. Thirty-seven groups of representative data from real loess deposit cases were adopted to establish a stability evaluation model for a tunnel project in Yan'an, China. Physical and mechanical indices, including water content, cohesion, internal friction angle, elastic modulus, and Poisson's ratio, were selected as the index system for the stability level of loess. The data set was randomly divided into a training set (80%) and a test set (20%). First, principal component analysis (PCA) was used to convert the five-index system into three linearly independent principal components X1, X2, and X3. Then, the principal components were used as input vectors for a probabilistic neural network (PNN) to map the nonlinear relationship between the index system and the stability level of loess. Furthermore, leave-one-out cross-validation was applied to the training set to find a suitable smoothing factor. Finally, the established model with the target smoothing factor of 0.04 was applied to the test set, and a prediction accuracy of 100% was obtained. This intelligent classification method for loess deposits is easy to apply and has wide potential applications in evaluating loess deposits.
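
The smoothing-factor search described in this abstract can be sketched with a small probabilistic neural network and leave-one-out cross-validation. This is a simplified illustration, not the authors' model: the PCA step is omitted, the toy points and labels are invented, and the candidate smoothing factors are arbitrary.

```python
import math


def pnn_predict(train_X, train_y, x, sigma):
    """Probabilistic neural network: average a Gaussian kernel over the
    training points of each class and return the class with the highest
    average activation. `sigma` is the smoothing factor."""
    sums, counts = {}, {}
    for xi, yi in zip(train_X, train_y):
        d2 = sum((a - b) ** 2 for a, b in zip(xi, x))
        sums[yi] = sums.get(yi, 0.0) + math.exp(-d2 / (2 * sigma ** 2))
        counts[yi] = counts.get(yi, 0) + 1
    return max(sums, key=lambda c: sums[c] / counts[c])


def loo_accuracy(X, y, sigma):
    """Leave-one-out accuracy: predict each point from all the others."""
    hits = 0
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        hits += pnn_predict(rest_X, rest_y, X[i], sigma) == y[i]
    return hits / len(X)


# Toy stand-in for principal-component inputs (illustrative values only).
X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y = ["stable", "stable", "stable", "unstable", "unstable", "unstable"]

candidates = [0.5, 1.0, 2.0]  # arbitrary smoothing factors to compare
best_sigma = max(candidates, key=lambda s: loo_accuracy(X, y, s))
print(best_sigma, loo_accuracy(X, y, best_sigma))
```

With only 37 cases in the source study, leave-one-out is a natural choice because it uses nearly all training data for every fold; the same small sample size also means a 100% test accuracy rests on very few test points.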