Title/Summary/Keyword: Dataset Generation

Incremental Generation of A Decision Tree Using Global Discretization For Large Data (대용량 데이터를 위한 전역적 범주화를 이용한 결정 트리의 순차적 생성)

  • Han, Kyong-Sik;Lee, Soo-Won
    • The KIPS Transactions:PartB
    • /
    • v.12B no.4 s.100
    • /
    • pp.487-498
    • /
    • 2005
  • Recently, attention has focused on decision tree algorithms that can handle large datasets. However, because most of these algorithms process data in batch mode, the tree must be rebuilt from scratch whenever new data arrives. A more efficient way to reduce this rebuilding cost is to build the tree incrementally. Representative incremental construction algorithms are BOAT and ITI, and most of them use local discretization to handle numeric attributes. However, because discretization requires sorted numeric data, when processing large datasets a global discretization method that sorts all the data only once is more suitable than a local method that re-sorts at every node. This paper proposes an incremental tree construction method that efficiently rebuilds a tree using global discretization for numeric attributes. When new data is added, the categories affected by that data must be recreated, and the tree structure must then be adjusted to match the category changes. The paper proposes extracting sample points and discretizing from those sample points to recreate categories efficiently, and it uses confidence intervals and a tree restructuring method to adapt the tree to the category changes; the single-sort idea is sketched below. Finally, an experiment on the people database compares the proposed method with an existing method that uses local discretization.
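
The single-sort idea the abstract relies on can be made concrete with a small sketch. The Python snippet below (NumPy assumed) shows a generic equal-frequency global discretization built from sample points; it illustrates the general technique, not the paper's specific algorithm. Because the cut points are global, every tree node reuses them and the data never has to be re-sorted per node.

```python
import numpy as np

def global_discretize(values, n_bins):
    """Sort the numeric values once and derive global cut points
    by equal-frequency binning (one possible global discretization)."""
    sorted_vals = np.sort(np.asarray(values, dtype=float))
    quantiles = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    return np.unique(np.quantile(sorted_vals, quantiles))

def assign_category(x, cuts):
    """Map a numeric value to the index of its global category."""
    return int(np.searchsorted(cuts, x, side="right"))

# Categories stay fixed across all nodes, so the single sort is amortized
# over the whole tree; only new data can force the cuts to be recomputed.
cuts = global_discretize([3.1, 7.4, 1.2, 9.9, 4.4, 6.0, 2.8, 8.3], n_bins=4)
print(cuts, assign_category(5.0, cuts))
```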

Generating Training Dataset of Machine Learning Model for Context-Awareness in a Health Status Notification Service (사용자 건강 상태알림 서비스의 상황인지를 위한 기계학습 모델의 학습 데이터 생성 방법)

  • Mun, Jong Hyeok;Choi, Jong Sun;Choi, Jae Young
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.9 no.1
    • /
    • pp.25-32
    • /
    • 2020
  • In context-aware systems, rule-based techniques have traditionally been used in the abstraction process that derives context information. However, the rules become complicated as user requirements for the service diversify and data usage grows, so rule-based models face technical limits in maintenance and in processing unstructured data. To overcome these limits, many studies have applied machine learning to context-aware systems. Using such machine-learning models in a context-aware system requires a management process that periodically injects training data. A previous study on machine-learning-based context awareness considered a series of management processes, such as generating and supplying training data for operating several models, but its method was limited to the system it was applied to. This paper proposes a training data generation method that extends machine-learning-based context-aware systems: it defines a training data generating model that reflects the requirements of the individual machine learning models and generates training data for each of them, as sketched below. In the experiment, the generating model is defined from the training data schema of a cardiac status analysis model for older adults in a health status notification service, and training data is generated by applying the model in a real software environment. The validity of the generated data is then verified by training the machine learning model on it and comparing the resulting accuracy.
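
As a rough illustration of such a per-model generating scheme, the hypothetical sketch below defines a tiny schema object and projects raw context records onto one model's feature/label layout. Every name in it (`heart_rate`, `age`, the labeling rule) is invented for illustration and is not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class TrainingDataSchema:
    """Hypothetical per-model schema: which context fields the model
    needs and how a label is derived from a raw record."""
    features: List[str]
    label_fn: Callable[[Dict], str]

def generate_training_data(records: List[Dict], schema: TrainingDataSchema):
    """Project raw context records onto one model's feature/label layout."""
    rows = []
    for rec in records:
        if all(f in rec for f in schema.features):  # skip incomplete records
            rows.append(([rec[f] for f in schema.features], schema.label_fn(rec)))
    return rows

# Illustrative cardiac-status schema for a health notification service.
schema = TrainingDataSchema(
    features=["heart_rate", "age"],
    label_fn=lambda r: "abnormal" if r["heart_rate"] > 100 else "normal",
)
print(generate_training_data(
    [{"heart_rate": 72, "age": 70}, {"heart_rate": 120, "age": 81}], schema))
```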

Material Image Classification using Normal Map Generation (Normal map 생성을 이용한 물질 이미지 분류)

  • Nam, Hyeongil;Kim, Tae Hyun;Park, Jong-Il
    • Journal of Broadcast Engineering
    • /
    • v.27 no.1
    • /
    • pp.69-79
    • /
    • 2022
  • This study proposes generating a normal map image, which represents the surface characteristics of the material shown in an image, and using it to improve the classification accuracy of the original material image. First, (1) to generate a normal map that reflects a material's surface properties, a U-Net with attention-R2 gates is used as the generator in a Pix2Pix-based framework whose reconstruction loss measures the similarity between the generated normal map and the original (ground-truth) normal map; one possible form of such a loss is sketched below. Next, (2) a network is proposed that improves classification accuracy by feeding the generated normal map into the attention gate of the classification network. For normal maps generated on the Pixar Dataset, similarity to the ground-truth normal maps is evaluated, and the results of reconstruction loss functions built from different similarity metrics are compared. For material image classification, comparative experiments on the MINC-2500 and FMD datasets confirm that the proposed method distinguishes materials more accurately than previous studies. The method is expected to serve as a basis for image processing and network designs that identify materials within an image.
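
A reconstruction loss of the kind described, scoring a generated normal map against the ground-truth map, might mix a pixel-wise term with an angular similarity term. The PyTorch sketch below is one plausible form under that assumption, not the paper's exact loss or weighting.

```python
import torch
import torch.nn.functional as F

def normal_map_recon_loss(pred, target, alpha=0.5):
    """Mix pixel-wise L1 with angular (cosine) similarity between
    predicted and ground-truth normals.
    pred, target: (B, 3, H, W) maps with components in [-1, 1]."""
    l1 = F.l1_loss(pred, target)
    cos = F.cosine_similarity(pred, target, dim=1)  # per-pixel, over xyz
    angular = (1.0 - cos).mean()                    # 0 when normals align
    return alpha * l1 + (1.0 - alpha) * angular

pred = torch.randn(2, 3, 64, 64)
target = torch.randn(2, 3, 64, 64)
print(normal_map_recon_loss(pred, target).item())
```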

Big Data Analytics in RNA-sequencing (RNA 시퀀싱 기법으로 생성된 빅데이터 분석)

  • Sung-Hun WOO;Byung Chul JUNG
    • Korean Journal of Clinical Laboratory Science
    • /
    • v.55 no.4
    • /
    • pp.235-243
    • /
    • 2023
  • As next-generation sequencing has developed and come into wide use, RNA-sequencing (RNA-seq) has rapidly become the tool of first choice for global transcriptome profiling. With the significant advances in RNA-seq, various types of RNA-seq have evolved alongside progress in bioinformatic tools. On the other hand, it is difficult to extract the biological meaning from such complex data without a general understanding of the types of RNA-seq and the bioinformatic approaches. This paper therefore covers two main topics. First, two major variants of RNA-seq are described and compared with standard RNA-seq, giving insight into which RNA-seq method is most appropriate for a given study. Second, the most widely used RNA-seq data analyses are discussed: (1) exploratory data analysis and (2) pathway enrichment analysis. The paper introduces the exploratory analyses most widely used for RNA-seq, such as principal component analysis, heatmaps, and volcano plots, which reveal the overall trends in a dataset (a generic example is sketched below), and then introduces three generations of pathway enrichment analysis and how each derives enriched pathways from an RNA-seq dataset.
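
As a generic illustration of one of these exploratory views, the sketch below draws a volcano plot from a synthetic differential-expression table with pandas and matplotlib. The column names and significance thresholds are common conventions, not values from the paper.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for a differential-expression result table.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "log2FC": rng.normal(0, 2, 500),
    "pvalue": rng.uniform(1e-6, 1, 500),
})

# Volcano plot: effect size on x, statistical significance on y.
sig = (df["pvalue"] < 0.05) & (df["log2FC"].abs() > 1)
plt.scatter(df["log2FC"], -np.log10(df["pvalue"]),
            c=np.where(sig, "red", "grey"), s=8)
plt.xlabel("log2 fold change")
plt.ylabel("-log10(p-value)")
plt.title("Volcano plot of differential expression")
plt.show()
```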

Construction of a full-length cDNA library from Pinus koraiensis and analysis of EST dataset (잣나무(Pinus koraiensis)의 cDNA library 제작 및 EST 분석)

  • Kim, Joon-Ki;Im, Su-Bin;Choi, Sun-Hee;Lee, Jong-Suk;Roh, Mark S.;Lim, Yong-Pyo
    • Korean Journal of Agricultural Science
    • /
    • v.38 no.1
    • /
    • pp.11-16
    • /
    • 2011
  • In this study, we report the generation and analysis of a total of 1,211 expressed sequence tags (ESTs) from Pinus koraiensis. A cDNA library was generated from young leaf tissue, and 1,211 cDNA clones were partially sequenced. EST and unigene sequence quality was determined by computational filtering, manual review, and BLAST analysis. In all, 857 ESTs remained after removal of vector sequences and filtering at a minimum length of 50 nucleotides. Assembly identified a total of 411 unigenes, consisting of 89 contigs and 322 singletons. We also identified 77 new microsatellite-containing sequences among the unigenes and classified their structures by repeat unit; a simple repeat-finding approach is sketched below. In a BLASTX homology search against the NCBI database, 63.1% of the ESTs matched proteins of known function and 22.2% matched proteins of putative or unknown function; the remaining 14.6% showed no significant similarity to any protein sequence in the public database. Gene ontology (GO) classification showed that the most abundant GO terms were transport, nucleotide binding, and plastid for the biological process, molecular function, and cellular component categories, respectively. The sequence data will be used to characterize the roles of new genes in Pinus and will provide a useful genetic resource.
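
Microsatellite (simple sequence repeat, SSR) mining of the kind described is commonly done with a tandem-repeat scan. The Python sketch below shows the general regex approach; the unit lengths and repeat-count threshold are illustrative defaults, not the study's parameters.

```python
import re

def find_microsatellites(seq, min_unit=2, max_unit=6, min_repeats=4):
    """Scan a sequence for SSRs: a 2-6 bp unit occurring at least
    `min_repeats` times in tandem. Returns (start, unit, copies)."""
    hits = []
    for unit_len in range(min_unit, max_unit + 1):
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_repeats - 1))
        for m in pattern.finditer(seq):
            hits.append((m.start(), m.group(1), len(m.group(0)) // unit_len))
    return hits

# Finds the (GA)n and (CA)n tracts in this toy sequence.
print(find_microsatellites("GGAGAGAGAGAGATTCACACACACGG"))
```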

Classification of HDAC8 Inhibitors and Non-Inhibitors Using Support Vector Machines

  • Cao, Guang Ping;Thangapandian, Sundarapandian;John, Shalini;Lee, Keun-Woo
    • Interdisciplinary Bio Central
    • /
    • v.4 no.1
    • /
    • pp.2.1-2.7
    • /
    • 2012
  • Introduction: Histone deacetylases (HDACs) are a class of enzymes that remove acetyl groups from ε-N-acetyl-lysine residues of histone proteins; their action is the opposite of histone acetyltransferases, which add acetyl groups to these lysines. Only a few HDAC inhibitors are approved and used as anti-cancer therapeutics, so the discovery of new potential HDAC inhibitors is necessary for effective cancer treatment. Materials and Methods: This study proposes a support vector machine (SVM) method to classify HDAC8 inhibitors and non-inhibitors for early-phase virtual compound filtering and screening. A set of 100 experimentally characterized compounds, comprising 52 HDAC8 inhibitors and 48 non-inhibitors, was used. Molecular descriptors were calculated for every compound in the dataset with ADRIANA.Code from Molecular Networks. Different kernel functions from freely available SVM software, together with training and test sets of varying size, were used in model generation and validation; a comparable kernel comparison is sketched below. Results and Conclusion: The best model achieved 75% accuracy in test-set prediction, and the other models also predicted the test-set compounds well. The results of this study can serve as simple and effective filters in the drug discovery process.
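
A comparable kernel comparison takes only a few lines with scikit-learn. The sketch below substitutes synthetic descriptors for the ADRIANA.Code descriptor table, so the accuracies it prints are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the 100-compound table (52 inhibitors, 48 not).
X, y = make_classification(n_samples=100, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Try several kernel functions, as the study did with varying splits.
for kernel in ("linear", "rbf", "poly", "sigmoid"):
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    print(kernel, round(clf.score(X_test, y_test), 3))
```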

A Study on Improvement of Dynamic Object Detection using Dense Grid Model and Anchor Model (고밀도 그리드 모델과 앵커모델을 이용한 동적 객체검지 향상에 관한 연구)

  • Yun, Borin;Lee, Sun Woo;Choi, Ho Kyung;Lee, Sangmin;Kwon, Jang Woo
    • The Journal of The Korea Institute of Intelligent Transport Systems
    • /
    • v.17 no.3
    • /
    • pp.98-110
    • /
    • 2018
  • In this paper, we propose a Dense grid model and an Anchor model to improve the recognition rate of dynamic objects. Two experiments study the performance of the two proposed CNN models for dynamic object detection. In the first experiment, a YOLO-v2 network is adapted and fine-tuned on the KITTI dataset, and the Dense grid and Anchor models are compared with it: in the evaluation, the two models outperform YOLO-v2 by 6.26% to 10.99% on car detection across the difficulty levels. In the second experiment, the models receive further training on a new dataset, where they outperform YOLO-v2 by up to 22.40% on car detection across the difficulty levels. A sketch of how anchor priors are commonly derived for such models follows below.
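
Anchor priors for YOLO-v2-style detectors are commonly derived by clustering ground-truth box shapes with an IoU-based k-means. The NumPy sketch below shows that standard recipe on synthetic boxes; it is background for the Anchor model, not the paper's specific network.

```python
import numpy as np

def iou_wh(box, anchors):
    """IoU between one (w, h) box and k anchors, ignoring position."""
    inter = np.minimum(box[0], anchors[:, 0]) * np.minimum(box[1], anchors[:, 1])
    return inter / (box[0] * box[1] + anchors[:, 0] * anchors[:, 1] - inter)

def kmeans_anchors(boxes, k, iters=100, seed=0):
    """Cluster (w, h) pairs with 1 - IoU as distance to pick k anchors."""
    rng = np.random.default_rng(seed)
    anchors = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        assign = np.array([np.argmax(iou_wh(b, anchors)) for b in boxes])
        for i in range(k):
            if np.any(assign == i):       # guard against empty clusters
                anchors[i] = boxes[assign == i].mean(axis=0)
    return anchors

# Synthetic normalized box shapes in place of KITTI annotations.
boxes = np.abs(np.random.default_rng(1).normal(0.3, 0.15, (200, 2))) + 0.05
print(kmeans_anchors(boxes, k=5))
```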

SuperDepthTransfer: Depth Extraction from Image Using Instance-Based Learning with Superpixels

  • Zhu, Yuesheng;Jiang, Yifeng;Huang, Zhuandi;Luo, Guibo
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.11 no.10
    • /
    • pp.4968-4986
    • /
    • 2017
  • In this paper, we address the difficulty of automatically generating a plausible depth map from a single image of an unstructured environment. The aim is to extrapolate a depth map with a more correct, rich, and distinct depth order that is both quantitatively accurate and visually pleasing. Our technique, fundamentally based on the existing DepthTransfer algorithm, transfers depth information at the level of superpixels within an instance-based learning framework that replaces the pixel basis; the idea is caricatured in the sketch below. A key superpixel feature that enhances matching precision is the posterior incorporation of predicted semantic labels into the depth extraction procedure. Finally, a modified Cross Bilateral Filter is leveraged to refine the final depth field. Experiments on the Make3D Range Image Dataset demonstrate that this depth estimation method outperforms state-of-the-art methods on the correlation coefficient, mean log10 error, and root mean squared error metrics, and achieves comparable average relative error, in both efficacy and computational efficiency. The approach can be used to automatically convert 2D images into stereo for 3D visualization, producing anaglyph images that are more realistic and immersive.
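
The instance-based, superpixel-level transfer can be caricatured in a few lines: segment the image into superpixels, describe each one, and copy depth from the nearest training instances. The sketch below (scikit-image and scikit-learn assumed) uses mean colour as a toy descriptor and omits the semantic labels and the Cross Bilateral Filter entirely.

```python
import numpy as np
from skimage.segmentation import slic
from sklearn.neighbors import KNeighborsRegressor

def superpixel_depth_transfer(image, train_feats, train_depths, n_segments=200):
    """Give each superpixel the depth of its nearest training instances:
    instance-based learning at the superpixel rather than pixel level."""
    segments = slic(image, n_segments=n_segments, start_label=0)
    knn = KNeighborsRegressor(n_neighbors=3).fit(train_feats, train_depths)
    depth = np.zeros(image.shape[:2])
    for s in np.unique(segments):
        mask = segments == s
        feat = image[mask].mean(axis=0)   # mean colour as a toy descriptor
        depth[mask] = knn.predict(feat[None, :])[0]
    return depth

rng = np.random.default_rng(0)
img = rng.random((64, 64, 3))
print(superpixel_depth_transfer(img, rng.random((50, 3)), rng.random(50)).shape)
```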

Data Cleaning and Integration of Multi-year Dietary Survey in the Korea National Health and Nutrition Examination Survey (KNHANES) using Database Normalization Theory (데이터베이스 정규화 이론을 이용한 국민건강영양조사 중 다년도 식이조사 자료 정제 및 통합)

  • Kwon, Namji;Suh, Jihye;Lee, Hunjoo
    • Journal of Environmental Health Sciences
    • /
    • v.43 no.4
    • /
    • pp.298-306
    • /
    • 2017
  • Objectives: Since 1998, the Korea National Health and Nutrition Examination Survey (KNHANES) has been conducted to investigate the health and nutritional status of Koreans. The individual food intake data in KNHANES is also used as a source dataset for risk assessment of chemicals ingested via food. To improve the reliability of intake estimation and prevent missing data for less frequently reported foods, the structure of the integrated long-standing dataset is significant. However, merging multi-year survey datasets is difficult without an effective cleaning process for handling the extensive number of codes for each food item and the changes in dietary habits over time. This study therefore aims at 1) cleaning abnormal data, 2) generating integrated long-standing raw data, and 3) contributing to the production of consistent dietary exposure factors. Methods: Codebooks, the guideline book, and raw intake data from KNHANES V and VI were used for analysis. The codebooks were tested for violations of the primary key constraint, and the structure of the raw data was tested against the first through third normal forms (1NF-3NF) of relational database theory; a minimal version of the key test is sketched below. The raw data was then cleaned using the resulting integrated codes. Results: Duplicated key records and abnormal table structures were observed. After adjusting them with the suggested method, the codes were corrected, new integrated codes were created, and the raw data provided by KNHANES respondents could finally be cleaned. Conclusion: The results of this study will contribute to the integration of the multi-year datasets and help improve the data production system by clarifying, testing, and verifying the primary key, the integrity of the codes, and the underlying data structure according to database normalization theory.
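
A minimal version of the primary-key test can be written with pandas. The sketch below flags duplicated and null keys in an illustrative food-code table; the column names are invented, not actual KNHANES codebook fields.

```python
import pandas as pd

def check_primary_key(df, key_cols):
    """Flag primary-key violations: duplicated keys or keys with nulls."""
    dup = df[df.duplicated(subset=key_cols, keep=False)]
    null = df[df[key_cols].isnull().any(axis=1)]
    return dup, null

# Illustrative codebook with one duplicated food code.
codebook = pd.DataFrame({
    "food_code": ["A001", "A002", "A002", "B010"],
    "food_name": ["rice", "barley", "barley (old)", "kimchi"],
})
dup, null = check_primary_key(codebook, ["food_code"])
print(dup)  # the two 'A002' rows break the key constraint
```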

Prediction Model of Software Size for 4GL and Database Projects

  • Yoon, Myoung-Young
    • Journal of Korea Society of Industrial Information Systems
    • /
    • v.4 no.3
    • /
    • pp.1-7
    • /
    • 1999
  • An important task for any software project manager is to predict and control project size. Unfortunately, comparatively little work addresses building size prediction methods for fourth-generation language (4GL) and database projects. This paper proposes a new method for estimating software size based on a minimum relative error (MRE) criterion. The proposed method is insensitive to extreme values among the observed measures, which can be collected early in the development life cycle; a toy comparison against least squares is sketched below. To verify the performance of the proposed method in terms of both quality of fit and predictive quality, experiments were conducted on datasets I and II. On both datasets, the proposed method proved superior to the traditional LS and RLS methods in quality of fit and predictive quality when applied to data from actual software development projects.
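
The contrast between an MRE-style criterion and ordinary least squares shows up in a toy fit. The sketch below (NumPy and SciPy assumed) minimizes the summed relative error of a linear size model next to a least-squares fit; the data points are invented, and the relative residuals are what damp the influence of the extreme observation.

```python
import numpy as np
from scipy.optimize import minimize

# Toy projects: x = early-lifecycle measure, y = delivered size.
x = np.array([10.0, 25.0, 40.0, 60.0, 90.0, 300.0])  # one extreme value
y = np.array([12.0, 30.0, 44.0, 65.0, 95.0, 310.0])

def sum_relative_error(params):
    a, b = params
    return np.sum(np.abs((y - (a + b * x)) / y))  # relative, not absolute

ls = np.polyfit(x, y, 1)                          # least squares: [slope, intercept]
mre = minimize(sum_relative_error, x0=[0.0, 1.0], method="Nelder-Mead").x
print("LS  slope/intercept:", ls[0], ls[1])
print("MRE slope/intercept:", mre[1], mre[0])
```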
