• Title/Summary/Keyword: Diverse data sets

Macroscopic Biclustering of Gene Expression Data (유전자 발현 데이터에 적용한 거시적인 바이클러스터링 기법)

  • Ahn, Jae-Gyoon; Yoon, Young-Mi; Park, Sang-Hyun
    • The KIPS Transactions: Part D / v.16D no.3 / pp.327-338 / 2009
  • A microarray dataset is a two-dimensional dataset over a set of genes and a set of conditions. A bicluster is a subset of genes that show similar behavior within a subset of conditions; genes that behave similarly can be assumed to share cellular functions. A biclustering algorithm is therefore a useful tool for uncovering groups of genes involved in the same cellular process and the groups of conditions under which that process takes place. We propose a polynomial-time algorithm to identify functionally highly correlated biclusters. Our algorithm identifies 1) gene sets with hidden patterns even when the noise level is high, 2) multiple, possibly overlapping, and diverse gene sets, 3) gene sets with strong functional association, and 4) deterministic biclustering results. We validated the level of functional association achieved by our method and compared it with current methods using GO.
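
A minimal biclustering sketch may help make the idea concrete. It uses scikit-learn's SpectralCoclustering on a synthetic expression-like matrix for illustration only; it is not the polynomial-time algorithm proposed in the paper above, and the matrix dimensions are placeholders.

```python
# Illustrative biclustering on a synthetic gene-expression-like matrix.
# Uses scikit-learn's SpectralCoclustering, NOT the paper's algorithm.
import numpy as np
from sklearn.datasets import make_biclusters
from sklearn.cluster import SpectralCoclustering

# Synthetic matrix: 300 "genes" x 50 "conditions" with 4 hidden biclusters plus noise
data, rows, cols = make_biclusters(shape=(300, 50), n_clusters=4,
                                   noise=5.0, random_state=0)

model = SpectralCoclustering(n_clusters=4, random_state=0)
model.fit(data)

for k in range(4):
    gene_idx = np.where(model.rows_[k])[0]      # genes assigned to bicluster k
    cond_idx = np.where(model.columns_[k])[0]   # conditions assigned to bicluster k
    print(f"bicluster {k}: {len(gene_idx)} genes x {len(cond_idx)} conditions")
```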

Network Anomaly Detection Technologies Using Unsupervised Learning AutoEncoders (비지도학습 오토 엔코더를 활용한 네트워크 이상 검출 기술)

  • Kang, Koohong
    • Journal of the Korea Institute of Information Security & Cryptology / v.30 no.4 / pp.617-629 / 2020
  • In order to overcome the limitations of rule-based intrusion detection systems caused by changing Internet computing environments, the emergence of new services, and the creativity of attackers, network anomaly detection (NAD) using machine learning and deep learning technologies has received much attention. Most existing machine learning and deep learning approaches to NAD use supervised methods that learn from training data labeled 'normal' and 'attack'. This paper demonstrates the feasibility of applying an unsupervised AutoEncoder (AE) to NAD using data sets of secured network traffic collected without labeled responses. To verify the performance of the proposed AE model, we present experimental results in terms of accuracy, precision, recall, F1-score, and ROC AUC on the NSL-KDD training and test data sets. In particular, we derive a reference AE through a deep analysis of diverse AEs, varying hyper-parameters such as the number of layers and considering regularization and denoising effects. The reference model achieves binary-classification F1-scores of 90.4% and 89% on the KDDTest+ and KDDTest-21 test data sets, respectively, using a threshold set at the 82nd percentile of the AE reconstruction error on the training data set.
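
A minimal PyTorch sketch of the reconstruction-error thresholding idea follows. The layer sizes, epoch count, feature count, and the random stand-in data are assumptions for illustration; they are not the reference AE tuned in the paper.

```python
# Sketch of unsupervised anomaly detection with an autoencoder (PyTorch).
# Layer sizes, epochs, and the synthetic data are illustrative placeholders.
import numpy as np
import torch
import torch.nn as nn

n_features = 41                          # NSL-KDD-style feature count (assumption)
x_train = torch.rand(2000, n_features)   # stand-in for preprocessed normal traffic

model = nn.Sequential(
    nn.Linear(n_features, 16), nn.ReLU(),
    nn.Linear(16, 8), nn.ReLU(),         # bottleneck
    nn.Linear(8, 16), nn.ReLU(),
    nn.Linear(16, n_features),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(50):
    opt.zero_grad()
    loss = loss_fn(model(x_train), x_train)
    loss.backward()
    opt.step()

# Threshold at the 82nd percentile of training reconstruction error, as in the paper
with torch.no_grad():
    err_train = ((model(x_train) - x_train) ** 2).mean(dim=1).numpy()
threshold = np.percentile(err_train, 82)

def is_anomaly(x):
    """Flag samples whose reconstruction error exceeds the threshold."""
    with torch.no_grad():
        err = ((model(x) - x) ** 2).mean(dim=1).numpy()
    return err > threshold
```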

Optimal Growth Model of the Cochlodinium Polykrikoides (Cochlodinium Polykrikoides 최적 성장모형)

  • Cho, Hong-Yeon; Cho, Beom Jun
    • Journal of Korean Society of Coastal and Ocean Engineers / v.26 no.4 / pp.217-224 / 2014
  • Cochlodinium polykrikoides is a typical harmful algal species that generates red tides in the coastal zone of southern Korea. Because red-tide occurrence is directly related to rapid algal bloom, an accurate algal growth model can be established, and red-tide occurrence predicted with it, if information on the optimal growth-model parameters is available. However, the factors limiting algal growth, such as light intensity, water temperature, salinity, and nutrient concentrations, are diverse, and so are the forms of the limitation functions; studies that develop growth models from the available laboratory data sets on growth-rate changes due to these limitation factors are therefore relatively scarce from a modeling perspective. In this study, a growth model for C. polykrikoides is developed and suggested as an optimal model that can serve as an element model in red-tide or ecological models. Optimal parameter estimation and an error analysis are carried out using previously published research results and data sets. Because it is optimized for laboratory conditions, the model can be used to analyze differences between laboratory conditions and the in-situ state, and the parameter values and ranges can be used for model calibration and validation with in-situ environmental monitoring and algal bloom data sets.
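
A minimal sketch of fitting a single limitation factor follows: a bell-shaped temperature-limitation growth curve fitted to laboratory growth-rate data with SciPy. The functional form, the data points, and the initial guesses are assumptions for illustration; the paper estimates its own parameters from published C. polykrikoides data sets.

```python
# Sketch: fitting a temperature-limitation growth function to lab data.
# The Gaussian-like functional form and the synthetic data are assumptions.
import numpy as np
from scipy.optimize import curve_fit

def growth_rate(T, mu_max, T_opt, sigma):
    """Specific growth rate (1/day) as a bell-shaped function of temperature."""
    return mu_max * np.exp(-((T - T_opt) ** 2) / (2.0 * sigma ** 2))

# Hypothetical lab measurements: temperature (deg C) vs. observed growth rate
T_obs  = np.array([15, 18, 21, 24, 27, 30])
mu_obs = np.array([0.05, 0.18, 0.35, 0.42, 0.30, 0.10])

params, cov = curve_fit(growth_rate, T_obs, mu_obs, p0=[0.4, 25.0, 4.0])
mu_max, T_opt, sigma = params
print(f"mu_max={mu_max:.2f}/day, T_opt={T_opt:.1f} C, sigma={sigma:.1f} C")
```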

Reconstruction of Neural Circuits Using Serial Block-Face Scanning Electron Microscopy

  • Kim, Gyu Hyun; Lee, Sang-Hoon; Lee, Kea Joo
    • Applied Microscopy / v.46 no.2 / pp.100-104 / 2016
  • Electron microscopy is currently the only available technique with a spatial resolution sufficient to identify fine neuronal processes and synaptic structures in densely packed neuropil. For large-scale volume reconstruction of neuronal connectivity, serial block-face scanning electron microscopy allows us to acquire thousands of serial images in an automated fashion and to reconstruct neural circuits faster by reducing the alignment task. Here we introduce the whole procedure for reconstructing the synaptic network in the rat hippocampal CA1 area and discuss technical issues that must be resolved to improve image quality and segmentation. Compared with serial-section transmission electron microscopy, serial block-face scanning electron microscopy produced much more reliable three-dimensional data sets and accelerated reconstruction by reducing the need for alignment and distortion adjustment. This approach will generate invaluable information on organizational features of our connectomes as well as on diverse neurological disorders caused by synaptic impairments.

Cluster-Based Selection of Diverse Query Examples for Active Learning (능동적 학습을 위한 군집화 기반의 다양한 복수 문의 예제 선정 방법)

  • Kang, Jae-Ho; Ryu, Kwang-Ryel; Kwon, Hyuk-Chul
    • Journal of Intelligence and Information Systems / v.11 no.1 / pp.169-189 / 2005
  • In order to derive a better classifier from a limited number of training examples, active learning alternately repeats a querying stage, for category labeling, and a subsequent learning stage, for rebuilding the classifier with the newly expanded training set. To relieve the user of the burden of labeling, especially in an on-line environment, it is important to minimize the number of querying steps as well as the total number of query examples. A good classifier can be derived in a small number of querying steps, using only a small number of examples, if multiple diverse, representative, and ambiguous examples are presented to the user at each querying step. In this paper, we propose a cluster-based batch query selection method that selects diverse, representative, and highly ambiguous examples for efficient active learning. Experiments with various text data sets show that our method derives a better classifier than other methods that consider only ambiguity as the criterion for selecting multiple query examples.
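
A minimal sketch of the general idea follows: cluster the unlabeled pool so the batch is diverse, then pick the most ambiguous example from each cluster. The clustering algorithm, margin-based ambiguity score, and toy data are assumptions for illustration, not the exact selection criteria of the paper.

```python
# Sketch of cluster-based batch query selection for active learning:
# cluster the unlabeled pool, then query the most ambiguous example per cluster.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def select_batch(clf, X_unlabeled, batch_size):
    """Return indices of a diverse, ambiguous batch to present to the user."""
    proba = clf.predict_proba(X_unlabeled)
    sorted_p = np.sort(proba, axis=1)
    margin = sorted_p[:, -1] - sorted_p[:, -2]     # small margin = ambiguous

    clusters = KMeans(n_clusters=batch_size, n_init=10,
                      random_state=0).fit_predict(X_unlabeled)
    picks = []
    for k in range(batch_size):
        members = np.where(clusters == k)[0]
        picks.append(members[np.argmin(margin[members])])  # most ambiguous in cluster k
    return np.array(picks)

# Toy usage: train on a few labeled points, then choose 5 diverse query examples
rng = np.random.default_rng(0)
X_lab, y_lab = rng.normal(size=(20, 10)), rng.integers(0, 2, 20)
X_pool = rng.normal(size=(500, 10))
clf = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
print(select_batch(clf, X_pool, batch_size=5))
```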

A Study on Experiential Space Consumption Patterns in Urban Parks through Blog Text Analysis - Focusing on Ttukseom Hangang Park - (블로그 텍스트 분석을 통해 살펴본 도시공원의 경험적 공간 소비 양상 - 뚝섬한강공원을 중심으로 -)

  • Kim, Shinsung
    • Journal of the Korean Institute of Landscape Architecture / v.51 no.2 / pp.68-80 / 2023
  • With the recent changes in society and the introduction of new technologies, the usage patterns of parks have become diverse, leading to increased complexity in park management. As a result, there is a growing demand for flexible and diverse park management that can adapt to these new requirements. However, there is inadequate discussion on these new demands and whether urban park management policies can respond. Therefore, empirical research on how park usage patterns are evolving is critical. To address this, blog data, in which individuals share their experiences, was used to examine the spatial consumption patterns through semantic network and topic analysis. This study also explored whether these spatial consumption patterns exhibit experiential consumption characteristics according to the experience economy theory. The results showed that consumption behaviors, such as renting picnic sets and having food and drinks delivered, were prominent and that emotional experiences were pursued. Furthermore, these findings were consistent with the experiential consumption characteristics of the experience economy theory. This suggests that park planning and maintenance methods need to become more flexible and diverse in response to the changing demands for park usage.
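
A generic topic-analysis sketch follows, using scikit-learn's LDA over a bag-of-words model. The toy English "posts" are placeholders; the study above works on Korean blog text and also builds a semantic network, neither of which is reproduced here.

```python
# Generic topic-analysis sketch (LDA over a bag-of-words model).
# The toy posts stand in for preprocessed blog text; not the paper's pipeline.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

posts = [
    "rented a picnic set and watched the sunset by the river",
    "ordered chicken delivery to the park lawn with friends",
    "evening bike ride along the hangang path",
    "picnic blanket, delivery food, and night view of the bridge",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(posts)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
terms = vec.get_feature_names_out()
for t, comp in enumerate(lda.components_):
    top = [terms[i] for i in comp.argsort()[-5:][::-1]]   # top 5 terms per topic
    print(f"topic {t}: {', '.join(top)}")
```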

Support Vector Machine for Interval Regression

  • Hong, Dug Hun; Hwang, Changha
    • Proceedings of the Korean Statistical Society Conference / 2004.11a / pp.67-72 / 2004
  • Support vector machines (SVM) have been very successful in pattern recognition and function estimation problems for crisp data. This paper proposes a new method for evaluating interval linear and nonlinear regression models by combining the possibility and necessity estimation formulation with the principle of SVM. For data sets with crisp inputs and interval outputs, possibility and necessity models have recently been utilized; they are based on a quadratic programming approach, which gives more diverse spread coefficients than a linear programming one. SVM also uses a quadratic programming approach, a further advantage of which in interval regression analysis is the ability to integrate both the central-tendency property of least squares and the possibilistic property of fuzzy regression; nevertheless, it is not a computationally expensive approach. SVM allows us to perform interval nonlinear regression analysis by constructing an interval linear regression function in a high-dimensional feature space; in particular, SVM is a very attractive approach to modeling nonlinear interval data. The proposed algorithm is a model-free method in the sense that we do not have to assume the underlying model function for an interval nonlinear regression model with crisp inputs and interval outputs. Experimental results are presented which indicate the performance of this algorithm.
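
A simplified sketch of crisp-input/interval-output regression follows: one SVR is fitted to the interval centers and another to the interval radii, and predicted intervals are reconstructed from the two. This is not the possibility/necessity quadratic-programming formulation of the paper; the kernel, hyper-parameters, and synthetic data are assumptions.

```python
# Simplified interval-regression sketch with two SVR models (center and radius).
# NOT the possibility/necessity QP formulation of the paper above.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
center = np.sin(X).ravel() + 0.05 * rng.normal(size=100)   # interval midpoints
radius = 0.2 + 0.1 * np.abs(X).ravel()                     # interval half-widths
y_low, y_high = center - radius, center + radius

svr_c = SVR(kernel="rbf", C=10.0, epsilon=0.05).fit(X, (y_low + y_high) / 2)
svr_r = SVR(kernel="rbf", C=10.0, epsilon=0.05).fit(X, (y_high - y_low) / 2)

x_new = np.array([[1.5]])
c, r = svr_c.predict(x_new)[0], svr_r.predict(x_new)[0]
print(f"predicted interval at x=1.5: [{c - r:.2f}, {c + r:.2f}]")
```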

Analysis of Drug Utilization after the Mandatory Application of the DRG Payment System in Korea (포괄수가제 당연적용 후 의약품 사용현황 분석)

  • Kang, Hee-Jeong; Kim, Ji Man; Lim, Jae-Young; Lee, Sang Gyu; Shin, Euichul
    • Korea Journal of Hospital Management / v.23 no.2 / pp.18-27 / 2018
  • Purposes: This study aims to investigate the policy effect of the mandatory application of the DRG payment system for seven disease groups in general and tertiary hospitals. Methodology: As the DRG system was fully implemented in July 2013, this study compares the periods before and after the change (July 2012 to June 2013 versus July 2013 to June 2014). The benefit claim data of the National Health Insurance Service were used for the comparison; the target patients were those who visited general or tertiary hospitals between July 2012 and June 2014. For pharmaceutical consumption, interrupted time series (ITS) analysis was used to assess the effect of the mandatory DRG application. Findings: Both the number of drugs prescribed per patient and pharmaceutical expenditure showed significant reductions compared with before the DRG implementation. Practical Implications: This study used two one-year data sets, from before and after the full implementation of the DRG system, to analyze pharmaceutical consumption. As more comparison data accumulate, more diverse analyses will become possible to assess the policy effect and to provide a way forward.
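
A minimal interrupted-time-series (segmented regression) sketch with statsmodels follows: a level-change term and a slope-change term are introduced at the July 2013 cut point. The 24 monthly values are synthetic stand-ins, not the NHIS claims data used in the study.

```python
# Minimal interrupted-time-series (segmented regression) sketch with statsmodels.
# The monthly series is synthetic; the study analyzes NHIS claims data for the
# 12 months before and after the July 2013 mandatory DRG application.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

n = 24                                      # Jul 2012 .. Jun 2014, monthly
time = np.arange(1, n + 1)
post = (time > 12).astype(int)              # 1 after the intervention (Jul 2013)
time_after = np.where(post == 1, time - 12, 0)

rng = np.random.default_rng(0)
drugs_per_patient = (8.0 + 0.02 * time - 1.0 * post - 0.05 * time_after
                     + rng.normal(0, 0.2, n))

df = pd.DataFrame({"y": drugs_per_patient, "time": time,
                   "post": post, "time_after": time_after})
model = smf.ols("y ~ time + post + time_after", data=df).fit()
print(model.params)        # 'post' = level change, 'time_after' = slope change
```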

Developing an Ensemble Classifier for Bankruptcy Prediction (부도 예측을 위한 앙상블 분류기 개발)

  • Min, Sung-Hwan
    • Journal of Korea Society of Industrial Information Systems / v.17 no.7 / pp.139-148 / 2012
  • An ensemble of classifiers employs a set of individually trained classifiers and combines their predictions, and in most cases ensembles have been found to produce more accurate predictions than the base classifiers. Combining the outputs of multiple classifiers, known as ensemble learning, is one of the standard and most important techniques for improving classification accuracy in machine learning. An ensemble of classifiers is effective only if the individual classifiers make decisions that are as diverse as possible. Bagging is the most popular ensemble learning method for generating a diverse set of classifiers; diversity in bagging is obtained by using different training sets, with the training subsets drawn randomly, with replacement, from the entire training dataset. The random subspace method is an ensemble construction technique that uses different attribute subsets: the training dataset is modified as in bagging, but the modification is performed in the feature space. Bagging and random subspace are well-known and popular ensemble algorithms, yet few studies have dealt with integrating bagging and random subspace using SVM classifiers, although there is great potential for useful applications in this area. The focus of this paper is to propose methods for improving SVM performance using a hybrid ensemble strategy for bankruptcy prediction. The proposed ensemble model is applied to the bankruptcy prediction problem using a real data set from Korean companies.
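
A minimal scikit-learn sketch of combining bagging with the random subspace method around an SVM base learner follows. The sample and feature fractions, kernel settings, and synthetic imbalanced data are illustrative assumptions, not the configuration tuned in the paper.

```python
# Sketch of a hybrid bagging + random-subspace ensemble with SVM base learners.
# bootstrap=True resamples training examples (bagging); max_features < 1.0 draws
# a random attribute subset per member (random subspace). Settings are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=30, weights=[0.9, 0.1],
                           random_state=0)   # imbalanced stand-in for bankruptcy data

ensemble = BaggingClassifier(
    estimator=SVC(kernel="rbf", C=1.0, gamma="scale"),  # older sklearn: base_estimator=
    n_estimators=30,
    max_samples=0.8,        # bagging: bootstrap samples of the training set
    max_features=0.5,       # random subspace: half of the attributes per member
    bootstrap=True,
    bootstrap_features=False,
    random_state=0,
)
print(cross_val_score(ensemble, X, y, cv=5, scoring="roc_auc").mean())
```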

An Investigation on Core Competencies of Data Curator (데이터 큐레이터의 핵심 직무 요건 고찰에 관한 연구)

  • Lee, You-Kyoung; Chung, EunKyung
    • Journal of the Korean BIBLIA Society for Library and Information Science / v.26 no.3 / pp.129-150 / 2015
  • As digital technologies and the Internet have advanced, data have become central to meaningful scientific findings and policy making in a wide variety of fields. Data curators, who are in charge of managing data, play a significant role in improving the effectiveness and efficiency of data management and re-use. The purpose of this study is to identify the core competencies of data curators. Two sets of data were collected: first, a total of 255 job descriptions gathered from web sites including ARL, Digital Curation Exchange, Code4lib, and ASIS&T JobLine for the period 2011-2014; second, in-depth interviews with five data curators from four diverse organizations. The two data sets were analyzed against seven categories identified from related studies. The findings show that the core competencies of data curators fall into four categories: communication skills, data management techniques, knowledge and strategies for data management, and instruction and service provision for users. The implications of this study can inform integrated, professional curriculum development for data curators with these core competencies.