• Title/Summary/Keyword: categorical data analysis

Analysis of the relationships between topographic factors and landslide occurrence and their application to landslide susceptibility mapping: a case study of Mingchukur, Uzbekistan

  • Kadirhodjaev, Azam;Kadavi, Prima Riza;Lee, Chang-Wook;Lee, Saro
    • Geosciences Journal / v.22 no.6 / pp.1053-1067 / 2018
  • This paper uses a probability-based approach to study the spatial relationships between landslides and their causative factors in the Mingchukur area, Bostanlik district of Tashkent, Uzbekistan. The approach is based on digital databases and incorporates methods including probability analysis, spatial pattern analysis, and interactive mapping. First, an object-oriented conceptual model for describing landslide events is proposed, and a combined database of landslides and environmental factors is constructed by integrating various databases within a unifying conceptual framework. The frequency ratio probability model and landslide occurrence data are linked for interactive, spatial evaluation of the relationships between landslides and their causative factors. In total, 15 factors were analyzed, divided into topography, hydrology, and geology categories. All analyzed factors were also divided into numerical and categorical types. Numerical factors are continuous and were evaluated according to their $R^2$ values. A landslide susceptibility map was constructed based on conditioning factors and landslide occurrence data using the frequency ratio model. Finally, the map was validated, with a satisfactory accuracy of 83.3%.
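
The frequency ratio mentioned in this abstract can be illustrated with a minimal sketch. It assumes the commonly used definition (the share of landslide cells falling in a factor class divided by the share of all cells in that class); the function name and the toy grid are hypothetical and are not taken from the paper.

```python
from collections import Counter


def frequency_ratio(factor_classes, landslide_flags):
    """Return {class: FR}, where FR is the share of landslide cells in the
    class divided by the share of all cells in the class."""
    if len(factor_classes) != len(landslide_flags):
        raise ValueError("inputs must have the same length")
    total_cells = len(factor_classes)
    total_slides = sum(1 for flag in landslide_flags if flag)
    if total_cells == 0 or total_slides == 0:
        raise ValueError("need at least one cell and one landslide cell")
    cells_per_class = Counter(factor_classes)
    slides_per_class = Counter(
        cls for cls, flag in zip(factor_classes, landslide_flags) if flag
    )
    return {
        cls: (slides_per_class[cls] / total_slides) / (count / total_cells)
        for cls, count in cells_per_class.items()
    }


# Hypothetical toy grid: 10 cells, one categorical factor, 4 landslide cells.
slope_class = ["gentle"] * 6 + ["steep"] * 4
landslide = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]
fr = frequency_ratio(slope_class, landslide)
```

A ratio above 1 indicates that the class holds more landslide cells than its area share would suggest; a ratio below 1 indicates fewer.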

An introductory study on the urban functions using CHAID technique (CHAID 技法에 의한 都市機能의 試論的 硏究)

  • ;Yang, Soon-Jeong
    • Journal of the Korean Geographical Society / v.29 no.3 / pp.360-368 / 1994
  • To this day, a number of quantitative analytical methods have been employed to clarify regional characteristics in the discipline of geography. As one application of such quantitative analyses, this paper attempts to clarify urban functions, and consequently urban characteristics, statistically by adopting the newly introduced CHAID, a kind of discriminant analysis technique. The data were processed in two phases. First, the urban functions were classified by designating twenty cities, each with a population of 250,000 or more, as the predictor variable and four major urban functions (administration, marketing, finance and production) as the response variable. Then, the preeminent functions of individual regions were discriminated and classified by treating the remaining traffic, education, medicare, culture and transportation functions as the predictor variable and the following five regions as the response variable: the Metropolitan Seoul area, Pusan region, Taegu region, Kwangju region and Chungcheong region. According to the results of the first analysis, marketing and administration emerged as meaningful functions in Seoul and Taegu, respectively. For the finance function, only Pusan and Pucheon can be discriminated. Seoul, Pusan and Seongnam show dominance in the production function. In the second analysis, the Metropolitan Seoul area shows, among other functions, strong traffic and finance functions. In the Pusan region, administration, education and finance functions are recorded as the leading ones, and the Taegu region is strong in education, medicare and transportation functions. In the Kwangju region, administration, production and education functions are discriminated from the other functions. The Chungcheong region shows a similar pattern, with only the traffic function replacing the production function of the Kwangju region.
Based on the aforementioned analysis, it can be said that the CHAID technique, which can process large amounts of categorical data and facilitates interpretation by presenting its outcome in the form of a dendrogram, is an effective and meaningful means of classifying and discriminating geographical regions and their characteristics.

A study for improving data mining methods for continuous response variables (연속형 반응변수를 위한 데이터마이닝 방법 성능 향상 연구)

  • Choi, Jin-Soo;Lee, Seok-Hyung;Cho, Hyung-Jun
    • Journal of the Korean Data and Information Science Society / v.21 no.5 / pp.917-926 / 2010
  • Bagging and boosting techniques are known to improve performance in classification problems. A number of researchers have demonstrated the high performance of bagging and boosting through experiments for categorical responses, but not for continuous responses. We study whether bagging and boosting improve data mining methods for continuous responses, such as linear regression, decision trees and neural networks. The analysis of eight real data sets empirically demonstrates the high performance of bagging and boosting.
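
To make the idea of bagging for a continuous response concrete, here is a minimal standard-library sketch. It is not the authors' experimental setup: the one-split regression stump, the function names, and the synthetic data are all assumptions chosen for illustration. Each model is fitted on a bootstrap resample and the predictions are averaged.

```python
import random
from statistics import mean


def fit_stump(xs, ys):
    """Fit a one-split regression stump on 1-D data by minimizing squared error."""
    best = None
    # Candidate thresholds: every distinct x except the largest,
    # so that the right-hand side of the split is never empty.
    for t in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        left_mean, right_mean = mean(left), mean(right)
        sse = sum((y - left_mean) ** 2 for y in left)
        sse += sum((y - right_mean) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    if best is None:  # all x values identical: fall back to the overall mean
        overall = mean(ys)
        return lambda x: overall
    _, t, left_mean, right_mean = best
    return lambda x: left_mean if x <= t else right_mean


def bagged_predictor(xs, ys, n_models=25, seed=0):
    """Average the predictions of stumps fitted on bootstrap resamples."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: mean(m(x) for m in models)


# Synthetic data (hypothetical): y = x^2 on a small grid.
xs = [i / 10 for i in range(40)]
ys = [x * x for x in xs]
predict = bagged_predictor(xs, ys)
```

Because each stump splits at a slightly different threshold, the averaged prediction is smoother than any single stump, which is the variance-reduction effect bagging relies on.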

The Study on the Different Moderation Effect of Contingency Variable (Focused on SPSS statistics and AMOS program) (상황변수의 조절효과 차이에 관한 연구 (SPSS와 AMOS프로그램을 중심으로))

  • Choi, Chang-Ho;You, Yen-Yoo
    • Journal of Digital Convergence / v.15 no.2 / pp.89-98 / 2017
  • This study empirically analyzed the same data with SPSS Statistics (regression analysis) and the AMOS program (structural equation model), both of which are used for cause-and-effect analysis. The results of the empirical analysis of the moderation effect were as follows. SPSS Statistics (regression analysis) did not show a moderation effect for either the categorical data (sex) or the continuous data (satisfaction with consulting), whereas the AMOS program (structural equation model) showed a partial moderation effect on the influence of the consultant's capability and attitude on consulting repurchase at the 10% significance level. In conclusion, this study showed that the AMOS program and SPSS Statistics use different methodologies for the moderation effect, so different outcomes appeared even though the same data were used.

Data Mining-Based Performance Prediction Technology of Geothermal Heat Pump System (지열 히트펌프 시스템의 데이터 마이닝 기반 성능 예측 기술)

  • Hwang, Min Hye;Park, Myung Kyu;Jun, In Ki;Sohn, Byonghu
    • Transactions of the KSME C: Technology and Education / v.4 no.1 / pp.27-34 / 2016
  • This preliminary study investigated data mining-based methods to assess and predict the performance of a geothermal heat pump (GHP) system. Data mining is a key process of knowledge discovery in databases (KDD), which includes five steps: 1) Selection; 2) Pre-processing; 3) Transformation; 4) Analysis (data mining); and 5) Interpretation/Evaluation. We used two analysis models, categorical and numerical decision tree models, to ascertain the patterns of performance (COP) and electrical consumption of the GHP system. Prior to applying the decision tree models, we statistically analyzed the measurement database to determine the effect of sampling intervals on the system performance. Analysis results showed that 10-min sampling data for the performance analysis had the highest accuracy, 97.7%, relative to the actual dataset of the GHP system.

Performance Comparison of Clustering using Discretization Algorithm (이산화 알고리즘을 이용한 계층적 클러스터링의 실험적 성능 평가)

  • Won, Jae Kang;Lee, Jeong Chan;Jung, Yong Gyu;Lee, Young Ho
    • Journal of Service Research and Studies / v.3 no.2 / pp.53-60 / 2013
  • Various data mining techniques have been developed for obtaining information from large data sets. In pattern recognition and machine learning, which have been among the most sought-after areas in recent years, most existing learning algorithms build a rule or decision model based on categorical attributes. However, real-world data often consist of numeric attributes, or contain numeric attributes alongside ordinary categorical ones. In such cases, the data must be processed into values of an appropriate attribute type before they can be used for learning. In this paper, the domain of each numeric attribute is divided into several segments using discretization techniques. Clustering is described together with other data mining techniques. Clustering splits a large number of records into smaller groups of records with similar characteristics; a cluster is a set of patterns in the pattern space that lie close to one another. It extracts patterns from the given data without any category being specified in advance. Clustering techniques that group similar data are described for classifying the data.
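
Dividing the domain of a numeric attribute into segments can be sketched as follows. The abstract does not say which discretization algorithm the paper uses, so equal-width binning is assumed here purely for illustration; the function name and sample values are hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Map each numeric value to a bin index in 0..n_bins-1 using equal-width cuts."""
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    lo, hi = min(values), max(values)
    if hi == lo:  # constant attribute: everything falls in one bin
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical numeric attribute turned into three categorical levels.
ages = [18, 22, 25, 31, 40, 47, 58, 60]
age_bins = equal_width_bins(ages, 3)
```

The resulting bin indices can then be passed to any learner that expects categorical attributes.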

Collapsibility Using Raindrop Plot (RAINDROP PLOT을 이용한 차원축소)

  • Hong C. S.;Kim B. J.;Park J. Y.
    • The Korean Journal of Applied Statistics / v.18 no.2 / pp.471-485 / 2005
  • In categorical data analysis, collapsibility has been explained in terms of the odds ratio (cross-product ratio). When the theory based on these odds ratios is applied to real $2 \times 2 \times K$ contingency tables, it is impossible to decide whether the data are collapsible. Among graphical methods for representing odds ratios, the contour plot developed by Doi, Nakamura and Yamamoto (2001) can explain the structure of such data but cannot decide on collapsibility. In this paper, using the raindrop plot proposed by Barrowman and Myers (2003), we suggest an alternative method that can not only explain the structure of the data but also decide on collapsibility.
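
The odds ratios behind the collapsibility question can be computed directly. The sketch below is a plain numeric comparison of the stratum-specific (conditional) odds ratios with the odds ratio of the table collapsed over the $K$ strata; it is not the raindrop plot method, and the counts are hypothetical.

```python
def odds_ratio(table):
    """Cross-product ratio (a*d)/(b*c) of a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    return (a * d) / (b * c)


def marginal_table(strata):
    """Collapse a list of K 2x2 tables into one 2x2 table by summing cells."""
    return [[sum(t[i][j] for t in strata) for j in range(2)] for i in range(2)]


# Hypothetical 2x2x2 data: each stratum shows no association,
# yet the collapsed table does.
strata = [
    [[10, 20], [20, 40]],
    [[40, 10], [8, 2]],
]
conditional = [odds_ratio(t) for t in strata]
marginal = odds_ratio(marginal_table(strata))
```

In this toy example both conditional odds ratios are 1.0 while the marginal odds ratio is 2.5, so summing over the third variable would change the apparent association.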

A pooled Bayes test of independence using restricted pooling model for contingency tables from small areas

  • Jo, Aejeong;Kim, Dal Ho
    • Communications for Statistical Applications and Methods / v.29 no.5 / pp.547-559 / 2022
  • For a chi-squared test, which is a statistical method used to test the independence of a contingency table of two factors, the expected frequency of each cell must be greater than 5. The percentage of cells with an expected frequency below 5 must be less than 20% of all cells. However, there are many cases in which the regional expected frequency is below 5 in general small area studies. Even in large-scale surveys, it is difficult to forecast the expected frequency to be greater than 5 when there is small area estimation with subgroup analysis. Another statistical method to test independence is to use the Bayes factor, but since there is a high ratio of data dependency due to the nature of the Bayesian approach, the low expected frequency tends to decrease the precision of the test results. To overcome these limitations, we will borrow information from areas with similar characteristics and pool the data statistically to propose a pooled Bayes test of independence in target areas. Jo et al. (2021) suggested hierarchical Bayesian pooling models for small area estimation of categorical data, and we will introduce the pooled Bayes factors calculated by expanding their restricted pooling model. We applied the pooled Bayes factors using bone mineral density and body mass index data from the Third National Health and Nutrition Examination Survey conducted in the United States and compared them with chi-squared tests often used in tests of independence.
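
The expected-frequency condition described at the start of this abstract can be checked with a short sketch. It assumes the usual expected count under independence (row total times column total divided by the grand total); the function names and the observed counts are hypothetical and are not the survey data used in the paper.

```python
def expected_counts(table):
    """Expected cell counts under independence: row_total * col_total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def share_below(expected, threshold=5.0):
    """Fraction of cells whose expected count is below the threshold."""
    cells = [e for row in expected for e in row]
    return sum(e < threshold for e in cells) / len(cells)


# Hypothetical small-area table: 3 areas by 2 outcome categories.
observed = [[12, 3], [9, 2], [20, 4]]
expected = expected_counts(observed)
low_share = share_below(expected)
```

Here half of the cells have an expected count below 5, well above the 20% limit, which is the situation in which the abstract argues a plain chi-squared test becomes unreliable.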

Decision Analysis System for Job Guidance using Rough Set (러프집합을 통한 취업의사결정 분석시스템)

  • Lee, Heui-Tae;Park, In-Kyoo
    • Journal of Digital Convergence / v.11 no.10 / pp.387-394 / 2013
  • Data mining is the process of discovering hidden, non-trivial patterns in large amounts of data records so that they can be used effectively for analysis and forecasting. Because hundreds of variables give rise to a high level of redundancy and dimensionality, along with time complexity, they are more likely to have spurious relationships, and even the weakest relationships will be highly significant by any statistical test. Cluster analysis, a main task of data mining, groups a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups. This paper implements a system that introduces a new definition based on information-theoretic entropy and analyzes the analogous behaviors of the objects at hand in order to measure uncertainty in the classification of categorical data. The data were taken from a job guidance survey of high school students in Pyeongtaek. We show how a variable precision, information-entropy-based rough set can be used to group the students in each section. The proposed method is shown to classify more exactly than the conventional method when there are more than 10 attributes, and to be more effective for the job guidance of students.
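
As a rough illustration of measuring uncertainty in categorical data with information-theoretic entropy, the sketch below computes the ordinary Shannon entropy of a categorical attribute. This is an assumption made for illustration only: the paper's own entropy definition and its variable precision rough set are not reproduced here, and the sample labels are hypothetical.

```python
from collections import Counter
from math import log2


def shannon_entropy(labels):
    """Shannon entropy, in bits, of the empirical distribution of the labels."""
    n = len(labels)
    return -sum((k / n) * log2(k / n) for k in Counter(labels).values())


# Hypothetical survey answers for one categorical attribute.
tracks = ["engineering", "engineering", "business", "arts"]
h = shannon_entropy(tracks)
```

Lower entropy means the answers concentrate on fewer categories, so grouping on that attribute carries less uncertainty.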

Influence of Global Competitive Capability on Global Performance of Distribution Industry in South Korea

  • KIM, Boine;KIM, Byoung-Goo
    • Journal of Distribution Science / v.19 no.12 / pp.83-89 / 2021
  • Purpose: The purpose of this study is to empirically analyze the influence of global competitive capability on the global performance of the distribution industry in South Korea and, based on the empirical results, to offer managerial implications to the distribution industry and contribute to management scholarship. Research design, data and methodology: This study focuses on the relationship between global competitive capability and global performance. Global competitive capability was measured with three concepts: human capability, network capability and product/service capability. Global performance was measured with export performance. To empirically analyze the relationships between variables, this study used 2,316 records from the GCL Test by KOTRA and Kdata. The analysis used SPSS 26 and included frequency, reliability, correlation and stepwise regression analyses. Results: Among the control variables, business period and business field have a significant positive influence on export performance. Among the antecedents, human capability and network capability have a significant positive influence on export performance, whereas product/service capability was not significant. Because business field, a categorical variable, has a significant influence, this study additionally analyzes the relationships by business field group to confirm whether they differ or are similar across groups. Conclusions: Based on the results, this study offers implications for distribution industry management and contributes to the academic literature.