• Title/Summary/Keyword: ratio of random variables

The Effect of Meta-Features of Multiclass Datasets on the Performance of Classification Algorithms (다중 클래스 데이터셋의 메타특징이 판별 알고리즘의 성능에 미치는 영향 연구)

  • Kim, Jeonghun;Kim, Min Yong;Kwon, Ohbyung
    • Journal of Intelligence and Information Systems / v.26 no.1 / pp.23-45 / 2020
  • Big data is being created in a wide variety of fields such as medical care, manufacturing, logistics, sales sites, and SNS, and dataset characteristics are correspondingly diverse. To secure their competitiveness, companies need to improve their decision-making capacity using classification algorithms. However, most companies do not have sufficient knowledge of which classification algorithm is appropriate for a specific problem area. In other words, determining which classification algorithm suits the characteristics of a dataset has been a task that requires expertise and effort. This is because the relationship between the characteristics of datasets (called meta-features) and the performance of classification algorithms has not been fully understood. Moreover, there has been little research on meta-features that reflect the characteristics of multi-class data. Therefore, the purpose of this study is to empirically analyze whether meta-features of multi-class datasets have a significant effect on the performance of classification algorithms. In this study, meta-features of multi-class datasets were grouped into two factors (data structure and data complexity), and seven representative meta-features were selected. Among these, we included the Herfindahl-Hirschman Index (HHI), originally a measure of market concentration, to replace the IR (Imbalanced Ratio). We also developed a new index called the Reverse ReLU Silhouette Score and added it to the meta-feature set. From the UCI Machine Learning Repository, six representative datasets (Balance Scale, PageBlocks, Car Evaluation, User Knowledge-Modeling, Wine Quality (red), Contraceptive Method Choice) were selected. Each dataset was classified using the classification algorithms selected in the study (KNN, Logistic Regression, Naïve Bayes, Random Forest, and SVM). For each dataset, we applied 10-fold cross validation. Oversampling of 10% to 100% was applied to each fold, and the meta-features of the dataset were measured. The meta-features selected are HHI, Number of Classes, Number of Features, Entropy, Reverse ReLU Silhouette Score, Nonlinearity of Linear Classifier, and Hub Score. F1-score was selected as the dependent variable. The results showed that six meta-features, including the Reverse ReLU Silhouette Score and HHI proposed in this study, have a significant effect on classification performance. (1) The meta-feature HHI proposed in this study had a significant effect on classification performance. (2) The number of variables has a significant effect on classification performance and, unlike the number of classes, the effect is positive. (3) The number of classes has a negative effect on classification performance. (4) Entropy has a significant effect on classification performance. (5) The Reverse ReLU Silhouette Score also significantly affects classification performance at a significance level of 0.01. (6) The nonlinearity of linear classifiers has a significant negative effect on classification performance. In addition, the results of the analysis by classification algorithm were consistent. In the regression analysis by classification algorithm, the number of variables does not have a significant effect for the Naïve Bayes algorithm, unlike the other classification algorithms. This study makes two theoretical contributions: (1) two new meta-features (HHI and the Reverse ReLU Silhouette Score) were shown to be significant, and (2) the effects of data characteristics on classification performance were investigated using meta-features. The practical contributions are as follows. (1) The results can be utilized in developing a system that recommends classification algorithms according to the characteristics of datasets. (2) Because data characteristics differ, data scientists often search for the optimal algorithm for a situation by repeatedly adjusting algorithm parameters, a process that wastes hardware, cost, time, and manpower. This study is expected to be useful for machine learning and data mining researchers, practitioners, and developers of machine learning-based systems. The paper consists of an introduction, related research, research model, experiment, and conclusion and discussion.
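
As a rough illustration of how a concentration index can describe class imbalance, the sketch below computes the HHI (sum of squared class shares) and the Shannon entropy of a label list. The function names and the toy label lists are hypothetical, and the entropy here is taken over class labels, which may differ from the paper's exact definition; the Reverse ReLU Silhouette Score is not reproduced because the abstract does not define it.

```python
import math
from collections import Counter


def class_shares(labels):
    """Return the fraction of samples that belongs to each class."""
    counts = Counter(labels)
    total = len(labels)
    return [count / total for count in counts.values()]


def hhi(labels):
    """Herfindahl-Hirschman Index: sum of squared class shares.

    Equals 1/k for k perfectly balanced classes and approaches 1
    as a single class dominates.
    """
    return sum(share * share for share in class_shares(labels))


def label_entropy(labels):
    """Shannon entropy (in bits) of the class distribution."""
    return -sum(share * math.log2(share) for share in class_shares(labels))


# Hypothetical toy data: four balanced classes vs. one dominant class.
balanced = ["a", "b", "c", "d"] * 25
skewed = ["a"] * 85 + ["b"] * 5 + ["c"] * 5 + ["d"] * 5

print("balanced:", hhi(balanced), label_entropy(balanced))
print("skewed:  ", hhi(skewed), label_entropy(skewed))
```

A higher HHI and a lower entropy both point to a more imbalanced dataset, which is why either can serve as a single-number summary of a multi-class label distribution.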

A Three-Dimensional Slope Stability Analysis in Probabilistic Solution (3차원(次元) 사면(斜面) 안정해석(安定解析)에 관한 확률론적(確率論的) 연구(研究))

  • Kim, Young Su
    • KSCE Journal of Civil and Environmental Engineering Research / v.4 no.3 / pp.75-83 / 1984
  • The probability of failure, instead of the conventional factor of safety, is used to analyze the reliability of three-dimensional slope failure. The strength parameters are assumed to follow normal and beta distributions. Their intervals are estimated at a specified confidence level using maximum likelihood estimation. Pseudonormal and beta random variables are generated from uniform random numbers using the central limit theorem and the rejection method. In the Monte Carlo simulation, the probability of failure is defined as $P_f=M/N$, where $N$ is the total number of trials and $M$ is the total number of failures. Conclusions derived from the case study include: 1. Three-dimensional factors of safety are generally much higher than 2-D factors of safety. However, situations appear to exist where the 3-D factor of safety can be lower than the 2-D factor of safety. 2. The $F_3/F_2$ ratio appears to be quite sensitive to $c$ and ${\phi}$ and to the shape of the 3-D shear surface and the slope, but not to the unit weight of the soil. 3. Of the two models (normal, beta) considered for the distribution of the factor of safety, the beta distribution generally gives larger values than the normal distribution. 4. Results obtained using the beta and normal models are presented in a nomograph relating slope height and slope angle to the probability of failure.
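
The estimator $P_f=M/N$ can be sketched in a few lines. The example below is only a stand-in: it uses a dry infinite-slope factor of safety rather than the paper's three-dimensional model, it samples only normally distributed strength parameters (the beta case via rejection sampling is omitted), and all numeric values (unit weight, depth, slope angle, means, standard deviations) are hypothetical. Generating the pseudonormal variate as a sum of 12 uniform numbers is one common central-limit-theorem construction and is assumed here, not taken from the paper.

```python
import math
import random


def pseudo_normal(rng, mean, std):
    """Approximate normal variate from the central limit theorem.

    The sum of 12 U(0, 1) numbers has mean 6 and variance 1,
    so subtracting 6 gives an approximately standard normal value.
    """
    z = sum(rng.random() for _ in range(12)) - 6.0
    return mean + std * z


def factor_of_safety(c, phi_deg, gamma=18.0, depth=3.0, beta_deg=30.0):
    """Dry infinite-slope factor of safety (stand-in for the 3-D model).

    c: cohesion (kPa), phi_deg: friction angle (degrees),
    gamma: unit weight (kN/m^3), depth: slip depth (m),
    beta_deg: slope angle (degrees). All defaults are hypothetical.
    """
    beta = math.radians(beta_deg)
    phi = math.radians(phi_deg)
    normal_stress = gamma * depth * math.cos(beta) ** 2
    shear_stress = gamma * depth * math.sin(beta) * math.cos(beta)
    return (c + normal_stress * math.tan(phi)) / shear_stress


def probability_of_failure(n_trials, seed=0):
    """Estimate P_f = M / N, where a trial fails when FS < 1."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(n_trials):
        # Hypothetical strength statistics; negative draws are clipped to 0.
        c = max(0.0, pseudo_normal(rng, mean=5.0, std=2.0))
        phi = max(0.0, pseudo_normal(rng, mean=28.0, std=4.0))
        if factor_of_safety(c, phi) < 1.0:
            failures += 1
    return failures / n_trials


print("estimated P_f:", probability_of_failure(10_000))
```

Fixing the seed makes the estimate reproducible; increasing `n_trials` narrows the sampling error of the estimate.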

Reliability Assessment of Fatigue Crack Propagation using Response Surface Method (응답면기법을 활용한 피로균열진전 신뢰성 평가)

  • Cho, Tae Jun;Kim, Lee Hyeon;Kyung, Kab Soo;Choi, Eun Soo
    • Journal of Korean Society of Steel Construction / v.20 no.6 / pp.723-730 / 2008
  • Because the ratio of live load to total load is higher in railway bridges, the damage accumulated by cyclic fatigue is significant. Moreover, an initiated crack is highly likely to grow faster than in highway bridges. Therefore, the safety against accumulated damage needs to be assessed analytically. The initiation and growth of fatigue cracks are related to the stress range, the number of cycles, and the stiffness of the structural system. The stiffness of the structural system involves uncertainties in planning, design, construction, and maintenance, and varies over time. In this study, the authors developed design and risk assessment techniques based on reliability theory that consider the uncertainties in load and resistance. For the probabilistic risk assessment of crack growth and the remaining life of structures under the cyclic loads of railway and subway bridges, the response surface method (RSM) combined with the first-order second-moment method was used. To compose the limit state function, the stress range, the stress intensity factor, and the remaining life were selected as the important input random variables for the RSM program. The probabilities of failure and the reliability indices of fatigue life for the considered specimen under cyclic loads were evaluated and discussed.
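
To make the terms "reliability index" and "probability of failure" concrete, the sketch below applies the first-order second-moment idea to the simplest possible case: a linear limit state $g = R - S$ with independent, normally distributed resistance $R$ and load effect $S$. This is an assumption made for illustration; the paper's limit state is built from the stress range, stress intensity factor, and remaining life through a response surface, which is not reproduced here. The function name and the input numbers are hypothetical.

```python
import math


def fosm_reliability(mu_r, sigma_r, mu_s, sigma_s):
    """Reliability index and failure probability for g = R - S.

    Assumes R and S are independent and normally distributed, so that
    beta = mu_g / sigma_g and P_f = Phi(-beta).
    """
    mu_g = mu_r - mu_s
    sigma_g = math.sqrt(sigma_r ** 2 + sigma_s ** 2)
    beta = mu_g / sigma_g
    # Standard normal CDF evaluated at -beta, written with erfc.
    p_f = 0.5 * math.erfc(beta / math.sqrt(2.0))
    return beta, p_f


# Hypothetical resistance and load statistics in consistent units.
beta, p_f = fosm_reliability(mu_r=10.0, sigma_r=3.0, mu_s=6.0, sigma_s=4.0)
print("reliability index:", beta)
print("probability of failure:", p_f)
```

A larger reliability index corresponds to a smaller probability of failure, which is why the two quantities are usually reported together.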

Effects of Process Variables on the Microstructure and Gas Sensing Characteristics of Magnetron Sputtered $\textrm{SnO}_2$ Thin Films (마그네트론 스퍼터링 증착 조건에 따른 $\textrm{SnO}_2$ 박막의 미세구조와 가스검지특성 변화)

  • Kim, Jong-Min;Moon, Jong-Ha;Lee, Byung-Teak
    • Korean Journal of Materials Research / v.9 no.11 / pp.1083-1087 / 1999
  • The microstructures and gas-sensing characteristics of $\textrm{SnO}_2$ thin films deposited under various conditions (rf power, sample temperature, $\textrm{O}_2$/Ar ratio) by rf magnetron sputtering were studied. Six typical microstructures were identified: amorphous (A), amorphous mixed with polycrystalline grains (A+P), polycrystalline with random crystallographic orientation (P), fine columnar (FC), coarse columnar (CC), and Zone T (T) with a dense fibrous structure. Typically, the A, A+P, and P structures formed when no $\textrm{O}_2$ was added to the sputter gas, whereas the FC, CC, and T structures were obtained when $\textrm{O}_2$ was added. The A structure formed at low rf power and low temperature, the A+P at high rf power and low temperature, and the P at high rf power and high temperature. The FC structure was obtained at low rf power and low temperature, the CC at low rf power and high temperature, and the T at high rf power and low temperature. Gas-sensing tests of sensor chips fabricated from the typical films indicated that the fine columnar microstructure shows the highest sensitivity at both $300^{\circ}C$ and $400^{\circ}C$. It was proposed that this is due to the high specific surface area of the micro-columns.

Suggestion of Urban Regeneration Type Recommendation System Based on Local Characteristics Using Text Mining (텍스트 마이닝을 활용한 지역 특성 기반 도시재생 유형 추천 시스템 제안)

  • Kim, Ikjun;Lee, Junho;Kim, Hyomin;Kang, Juyoung
    • Journal of Intelligence and Information Systems / v.26 no.3 / pp.149-169 / 2020
  • The "Urban Renewal New Deal project", one of the government's major national projects, aims to develop underdeveloped areas by investing 50 trillion won in 100 locations in the first year and 500 over the next four years. This project is drawing keen attention from the media and local governments. However, the project model fails to reflect the original characteristics of each area because it divides project areas into five categories: "Our Neighborhood Restoration, Housing Maintenance Support Type, General Neighborhood Type, Central Urban Type, and Economic Base Type." The keywords for successful urban regeneration in Korea ("resident participation," "regional specialization," "ministerial cooperation," and "public-private cooperation") suggest that when local governments propose urban regeneration projects to the government, it is most important to accurately understand the characteristics of the city and to push ahead with the projects in a way that suits those characteristics with the help of local residents and private companies. In addition, considering the gentrification problem, which is one of the side effects of urban regeneration projects, it is important to select and implement urban regeneration types suitable for the characteristics of the area. To supplement the limitations of the 'Urban Regeneration New Deal Project' methodology, this study proposes a system that recommends urban regeneration types suitable for urban regeneration sites by utilizing various machine learning algorithms, referring to the urban regeneration types of the '2025 Seoul Metropolitan Government Urban Regeneration Strategy Plan', which was promoted based on regional characteristics. There are four types of urban regeneration in Seoul: "Low-use Low-Level Development, Abandonment, Deteriorated Housing, and Specialization of Historical and Cultural Resources" (Shon and Park, 2017). To identify regional characteristics, approximately 100,000 text records were collected for 22 regions where projects of the four urban regeneration types were carried out. Using the collected data, we extracted key keywords for each region according to the type of urban regeneration and conducted topic modeling to explore whether there were differences between types. As a result, it was confirmed that a number of topics related to real estate and the economy appeared in old residential areas, and that in declining and underdeveloped areas, topics appeared that reflect the characteristics of areas where industrial activities were active in the past. In the historical and cultural resource areas, which contain traces of the past, many keywords related to the government appeared, so political and cultural topics resulting from various events could be confirmed. Finally, in low-use and under-developed areas, many topics on real estate and accessibility emerged; these areas have good accessibility and are mainly regions where development is planned or likely. Furthermore, a model was implemented that proposes urban regeneration types tailored to regional characteristics for regions other than Seoul. Machine learning was used to implement the model, and training data and test data were randomly split at an 8:2 ratio. To compare performance across models, the input variables were represented in two ways, Count Vector and TF-IDF Vector, and five classifiers were applied: SVM (Support Vector Machine), Decision Tree, Random Forest, Logistic Regression, and Gradient Boosting. In total, the performance of 10 models was compared. The model with the highest performance was Gradient Boosting with TF-IDF Vector input, with an accuracy of 97%. Therefore, the recommendation system proposed in this study is expected to recommend urban regeneration types based on the regional characteristics of new project sites in the process of carrying out urban regeneration projects.
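
The two input representations and the 8:2 split mentioned above can be sketched with the standard library alone. This is a minimal illustration, not the study's pipeline: the documents are hypothetical English stand-ins for the collected text, the IDF is the plain unsmoothed form $\log(N/df)$ (libraries often use smoothed variants), and the classifier step (SVM, Gradient Boosting, and so on) is left out.

```python
import math
import random
from collections import Counter


def count_vectors(docs):
    """Count Vector: raw term counts per whitespace-tokenized document."""
    return [Counter(doc.split()) for doc in docs]


def tfidf_vectors(docs):
    """TF-IDF Vector: term count weighted by log(N / document frequency)."""
    counts = count_vectors(docs)
    n_docs = len(docs)
    # Iterating a Counter yields each distinct term once per document.
    doc_freq = Counter(term for doc_counts in counts for term in doc_counts)
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
    return [
        {term: tf * idf[term] for term, tf in doc_counts.items()}
        for doc_counts in counts
    ]


def split_80_20(items, seed=0):
    """Randomly split items into 80% training and 20% test data."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical documents, one per region.
docs = [
    "old housing real estate price",
    "factory industry decline area",
    "palace heritage culture event",
    "station access real estate plan",
    "housing redevelopment real estate",
    "industry complex factory closure",
    "heritage museum culture festival",
    "subway access development plan",
    "apartment housing price economy",
    "market industry decline economy",
]

train, test = split_80_20(docs)
print(len(train), "training documents,", len(test), "test documents")
print(tfidf_vectors(train)[0])
```

Terms that occur in every document receive a weight of zero under this IDF, which is the property that lets TF-IDF emphasize region-specific vocabulary over words common to all regions.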