• Title/Summary/Keyword: dataset


A Noise-Tolerant Hierarchical Image Classification System based on Autoencoder Models (오토인코더 기반의 잡음에 강인한 계층적 이미지 분류 시스템)

  • Lee, Jong-kwan
    • Journal of Internet Computing and Services / v.22 no.1 / pp.23-30 / 2021
  • This paper proposes a noise-tolerant image classification system using multiple autoencoders. The development of deep learning technology has dramatically improved the performance of image classifiers. However, if the images are contaminated by noise, the performance degrades rapidly. Noise added to the image is inevitably generated in the process of obtaining and transmitting the image. Therefore, in order to use the classifier in a real environment, we have to deal with the noise. On the other hand, the autoencoder is an artificial neural network model that is trained to have similar input and output values. If the input data is similar to the training data, the error between the input data and output data of the autoencoder will be small. However, if the input data is not similar to the training data, the error will be large. The proposed system uses the relationship between the input data and the output data of the autoencoder, and it has two phases to classify the images. In the first phase, the classes with the highest likelihood of classification are selected and subject to the procedure again in the second phase. For the performance analysis of the proposed system, classification accuracy was tested on a Gaussian noise-contaminated MNIST dataset. As a result of the experiment, it was confirmed that the proposed system in the noisy environment has higher accuracy than the CNN-based classification technique.
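The reconstruction-error idea in this abstract can be illustrated with a minimal sketch. The abstract does not give the network architecture or say how the second phase differs from the first, so everything here is an assumption: a PCA-based linear autoencoder stands in for a trained neural autoencoder, the second phase simply uses a higher-capacity autoencoder per class, and `LinearAutoencoder`, `two_phase_classify`, `top_k`, and the synthetic data are hypothetical names and choices, not the paper's.

```python
import numpy as np


class LinearAutoencoder:
    """PCA-based linear autoencoder (an assumed stand-in for a neural autoencoder)."""

    def __init__(self, n_components):
        self.n_components = n_components

    def fit(self, X):
        self.mean_ = X.mean(axis=0)
        # Rows of vt are principal directions; the leading ones act as the code layer.
        _, _, vt = np.linalg.svd(X - self.mean_, full_matrices=False)
        self.components_ = vt[: self.n_components]
        return self

    def reconstruction_error(self, X):
        centered = X - self.mean_
        recon = centered @ self.components_.T @ self.components_
        return np.mean((centered - recon) ** 2, axis=1)


def two_phase_classify(x, phase1, phase2, top_k=2):
    """Phase 1 keeps the top_k classes with the smallest error; phase 2 re-ranks them."""
    x = x[None, :]
    err1 = {c: ae.reconstruction_error(x)[0] for c, ae in phase1.items()}
    candidates = sorted(err1, key=err1.get)[:top_k]
    err2 = {c: phase2[c].reconstruction_error(x)[0] for c in candidates}
    return min(err2, key=err2.get)


# Synthetic data (assumption): each class lies near its own 3-D subspace of R^20.
rng = np.random.default_rng(0)
n_classes, dim, latent = 3, 20, 3
bases = [rng.normal(size=(latent, dim)) for _ in range(n_classes)]


def sample(c, n, noise_std):
    clean = rng.normal(size=(n, latent)) @ bases[c]
    return clean + noise_std * rng.normal(size=(n, dim))  # additive Gaussian noise


train = {c: sample(c, 200, 0.05) for c in range(n_classes)}
phase1 = {c: LinearAutoencoder(2).fit(X) for c, X in train.items()}
phase2 = {c: LinearAutoencoder(3).fit(X) for c, X in train.items()}

labels = rng.integers(0, n_classes, size=300)
preds = [two_phase_classify(sample(c, 1, 0.2)[0], phase1, phase2) for c in labels]
accuracy = float(np.mean(np.array(preds) == labels))
print(f"accuracy on noisy samples: {accuracy:.2f}")
```

Each class gets its own autoencoder, so a sample is assigned to the class whose autoencoder reconstructs it with the smallest error, which is the relationship between input and output that the abstract describes.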

Extraction of Important Areas Using Feature Feedback Based on PCA (PCA 기반 특징 되먹임을 이용한 중요 영역 추출)

  • Lee, Seung-Hyeon;Kim, Do-Yun;Choi, Sang-Il;Jeong, Gu-Min
    • The Journal of Korea Institute of Information, Electronics, and Communication Technology / v.13 no.6 / pp.461-469 / 2020
  • In this paper, we propose a PCA-based feature feedback method for extracting important areas of handwritten digit datasets and face datasets. The method extends the previous LDA-based feature feedback method. The data are first reduced to the important feature dimensions by applying PCA, a dimensionality reduction technique. The weights derived during dimensionality reduction identify the important points of the data along each reduced axis. Each axis contributes differently to the total data according to the size of its eigenvalue. Accordingly, each axis is given a weight proportional to its eigenvalue, and the important points along all axes are summed. The important area is then obtained by applying a threshold to the result. Finally, this area is mapped back to the original data, which selects the important area in the original data space. Experiments on the MNIST dataset verify the effectiveness and potential of pattern recognition based on PCA-based feature feedback by comparing the results with the existing LDA-based feature feedback method.
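The steps described in this abstract (PCA, eigenvalue-proportional weighting, summation, thresholding, mapping back to the input space) can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not specify how "important points" are scored or how the threshold is chosen, so the sketch uses absolute PCA loadings and a quantile threshold, and `pca_importance_mask`, `keep_fraction`, and the synthetic 8×8 "images" are hypothetical.

```python
import numpy as np


def pca_importance_mask(X, n_components, keep_fraction=0.2):
    """Score each input dimension by eigenvalue-weighted absolute PCA loadings."""
    centered = X - X.mean(axis=0)
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    eigvals = s[:n_components] ** 2 / (X.shape[0] - 1)  # variance along each axis
    weights = eigvals / eigvals.sum()                   # proportional to eigenvalue
    importance = weights @ np.abs(vt[:n_components])    # weighted sum over axes
    threshold = np.quantile(importance, 1.0 - keep_fraction)  # assumed threshold rule
    return importance, importance >= threshold


# Synthetic 8x8 "images" (assumption): only the central 4x4 block varies strongly.
rng = np.random.default_rng(0)
side = 8
scale = np.full((side, side), 0.01)
scale[2:6, 2:6] = 1.0
X = rng.normal(size=(500, side * side)) * scale.ravel()

importance, mask = pca_importance_mask(X, n_components=8, keep_fraction=0.25)
print(mask.reshape(side, side).astype(int))  # mask lives in the original pixel space
```

Because the loadings are already expressed in the original input coordinates, the thresholded mask directly marks the important area in the original data space.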

Do Not Just Talk, Show Me in Action: Investigating the Effect of OSSD Activities on Job Change of IT Professional (오픈소스 소프트웨어 개발 플랫폼 활동이 IT 전문직 취업에 미치는 영향)

  • Jang, Moonkyoung;Lee, Saerom;Baek, Hyunmi;Jung, Yoonhyuk
    • The Journal of Society for e-Business Studies / v.26 no.1 / pp.43-65 / 2021
  • With the advancement of information and communications technology, the means of recruiting IT professionals have fundamentally changed. Nowadays recruiters search for candidate information on the Web as well as in traditional sources such as résumés and interviews. In particular, open-source software development (OSSD) platforms have become an opportunity for developers to demonstrate their IT capabilities, and thus a way for recruiters to find the candidates they need. This study therefore investigates the impact of developers' profiles on an OSSD platform on their finding a job. It examines four antecedents in developer information that can accelerate a job search: job-seeking status, personal-information posting, learning activities, and knowledge contribution activities. For the empirical analysis, we developed a Web crawler and gathered a dataset on 4,005 developers from GitHub, a well-known OSSD platform. Proportional hazards regression was used for data analysis because a shorter job-seeking period implies a more successful job change. Our results indicate that developers who explicitly posted their job-seeking status had shorter job-seeking periods than those who did not. The other antecedents (i.e., personal-information posting, learning, and knowledge contribution activities) also contributed to reducing the job-seeking period. These findings demonstrate the value of OSSD platforms for recruiters seeking suitable candidates and for developers seeking a job.
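The abstract does not state the model equation. Assuming the standard Cox form of proportional hazards regression, the hazard of a job change for developer $i$ at time $t$ could be written as below; the covariate names are hypothetical labels for the four antecedents, not the paper's variable names.

```latex
h(t \mid x_i) = h_0(t)\,\exp\!\left(
    \beta_1\,\mathrm{Status}_i
  + \beta_2\,\mathrm{Info}_i
  + \beta_3\,\mathrm{Learning}_i
  + \beta_4\,\mathrm{Contribution}_i
\right)
```

Here $h_0(t)$ is the baseline hazard. Under this form, a positive $\beta_k$ raises the hazard of a job change, which corresponds to a shorter job-seeking period.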

Automatic Classification and Vocabulary Analysis of Political Bias in News Articles by Using Subword Tokenization (부분 단어 토큰화 기법을 이용한 뉴스 기사 정치적 편향성 자동 분류 및 어휘 분석)

  • Cho, Dan Bi;Lee, Hyun Young;Jung, Won Sup;Kang, Seung Shik
    • KIPS Transactions on Software and Data Engineering / v.10 no.1 / pp.1-8 / 2021
  • News articles in the political domain show polarized and biased characteristics, such as conservative and liberal leanings, which is called political bias. We constructed a keyword-based dataset to classify the bias of news articles. Most embedding research represents a sentence as a sequence of morphemes. In our work, we expect the number of unknown tokens to be reduced if sentences are composed of subwords segmented by a language model. We propose a document embedding model with subword tokenization and apply it to an SVM and a feedforward neural network to classify political bias. Compared with the document embedding model based on morphological analysis, the document embedding model with subwords showed the highest accuracy, at 78.22%. It was also confirmed that subword tokenization reduced the number of unknown tokens. Using the best-performing embedding model in our bias classification task, we extract keywords based on politicians. The bias of each keyword was verified by its average similarity with the vectors of politicians from each political tendency.
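The last step, scoring a keyword by its average similarity to politicians' vectors from each tendency, can be sketched as follows. The abstract does not name the similarity measure, so cosine similarity is an assumption here, and `keyword_bias` and the toy 3-dimensional vectors are hypothetical.

```python
import numpy as np


def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def keyword_bias(keyword_vec, politician_vecs_by_side):
    """Average the keyword's similarity to each side's politician vectors."""
    scores = {
        side: float(np.mean([cosine(keyword_vec, p) for p in vecs]))
        for side, vecs in politician_vecs_by_side.items()
    }
    return max(scores, key=scores.get), scores


# Toy embeddings (assumption): real vectors would come from the trained embedding model.
politicians = {
    "conservative": [np.array([1.0, 0.1, 0.0]), np.array([0.9, 0.2, 0.1])],
    "liberal": [np.array([0.0, 1.0, 0.1]), np.array([0.1, 0.9, 0.0])],
}
keyword = np.array([0.8, 0.3, 0.0])

label, scores = keyword_bias(keyword, politicians)
print(label, scores)
```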

A Study on Prediction of EPB shield TBM Advance Rate using Machine Learning Technique and TBM Construction Information (머신러닝 기법과 TBM 시공정보를 활용한 토압식 쉴드TBM 굴진율 예측 연구)

  • Kang, Tae-Ho;Choi, Soon-Wook;Lee, Chulho;Chang, Soo-Ho
    • Tunnel and Underground Space / v.30 no.6 / pp.540-550 / 2020
  • Machine learning has been actively used in the field of automation owing to the development and establishment of AI technology. An important point in using machine learning is that the appropriate algorithm depends on the characteristics of the data, so the dataset must be analyzed before machine learning techniques are applied. In this study, the advance rate is predicted using geotechnical and machine data from a TBM tunnel section passing through soil ground below a stream. Although the linear regression model raised no problems in terms of statistical applicability, its coefficient of determination was 0.76. In contrast, the ensemble model and the support vector machine showed a prediction performance of 0.88 or higher, indicating that, for the analyzed dataset, the model suited to predicting the advance rate of the EPB shield TBM was the support vector machine. These results suggest that a prediction model using both machine data and ground information is highly suitable. Further research is needed to increase the diversity of ground conditions and the amount of data.
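The models in this abstract are compared by the coefficient of determination. As a reminder of how that score is computed, here is a minimal sketch using the usual definition $R^2 = 1 - SS_{res}/SS_{tot}$; the function name `r_squared` and the numbers are illustrative only and are not the study's advance-rate data.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Illustrative values only (assumption): measured vs. predicted advance rates.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
r2 = r_squared(measured, predicted)
print(f"R^2 = {r2:.2f}")
```

A score of 0.76 therefore means that the linear model leaves 24% of the variance in the advance rate unexplained, against 12% or less for the ensemble and support vector machine models.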

Analysis of Future Demand and Utilization of the Urban Meteorological Data for the Smart City (스마트시티를 위한 도시기상자료의 미래수요 및 활용가치 분석)

  • Kim, Seong-Gon;Kim, Seung Hee;Lim, Chul-Hee;Na, Seong-Kyun;Park, Sang Seo;Kim, Jaemin;Lee, Yun Gon
    • Atmosphere / v.31 no.2 / pp.241-249 / 2021
  • A smart city uses data collected from various sensors through the Internet of Things (IoT) to improve city operations across the urban area. Substantial research is under way to examine all aspects of the data required for smart city operation. Atmospheric data are an essential component of successful smart city implementation, including Urban Air Mobility (UAM), infrastructure planning, safety and convenience, and traffic management. Unfortunately, conventional atmospheric data do not currently meet the needs of the new city concept. New and innovative approaches to developing observational and modeling data with high spatiotemporal resolution, resolving the complex urban structure, are expected to support future needs. The geographic information system (GIS) integrates the atmospheric data with the urban structure and enhances the information system. In this study, we present the necessity and applicability of a high-resolution urban meteorological dataset based on heavy fog cases in smart city regions in Korea (e.g., Sejong and Pusan).

An Analysis of Movements in the Labor Share of Income in the Korean Manufacturing Industries (한국 제조업에서의 노동소득분배율 변동요인 분석)

  • Hong, Jang-Pyo
    • Korean Journal of Labor Studies / v.19 no.1 / pp.1-34 / 2013
  • The labor share of income in Korea fell from 90% in 1996 to 79% in 2010. This paper explores the factors driving movements in the labor share of income, based on a panel dataset containing 19 years of data on 18 Korean manufacturing industries. The effects of technical progress, globalization, and the bargaining power of labor and capital on the labor share of income are tested for the period 1991-2009. The main empirical results are as follows. (1) Capital-augmenting technical progress, measured by the capital-labor ratio and R&D intensity, has a negative effect on the labor share. (2) Market openness, measured by the value of exports and imports as a ratio to value-added production, is found to have a positive impact. (3) Globalization of production, measured by inward FDI and outward FDI as a ratio to total domestic fixed capital, is found to have a negative impact on the labor share. (4) Union density is found to have had a statistically significant effect in 1991-1998. This finding is consistent with the efficient bargain model, in which firms and workers bargain over both wages and employment. However, union density is insignificant in 2000-2009. This implies that since the financial crisis of 1997, the bargaining institution in Korea has been approaching the right-to-manage model, in which firms and unions bargain over wages and firms then set employment unilaterally. (5) Variables for domestic financialization, measured by the dividend-income ratio and the financial-fixed assets ratio, have an insignificant effect on the labor share.

Identification of a Single Nucleotide Polymorphism (SNP) Marker for the Detection of Enhanced Honey Production in Honeybee (수밀력 우수 꿀벌 계통 판별을 위한 계통 특이 분자마커 개발)

  • Kim, Hye-Kyung;Lee, Myeong-Lyeol;Lee, Man-Young;Choi, Yong-Soo;Kim, Dongwon;Kang, Ah Rang
    • Journal of Apiculture / v.32 no.3 / pp.147-154 / 2017
  • Honeybees (Apis mellifera) are common pollinators and important insects studied in agriculture, ecology, and basic research. Recently, the RDA (Rural Development Administration) and YIRI (Yecheon-gun Industrial Insect Research Institute) have been breeding a triple crossbred honeybee named Jangwon, which has the ability to produce superior-quality honey. In this study, we identified a single nucleotide polymorphism (SNP) marker in the genome of the Jangwon honeybee, specifically in the paternal line (D line). Initially, we performed Sequence-Based Genotyping (SBG) using the Illumina HiSeq 2500 on 5 honeybee inbred lines (A, C, D, E, and F) and obtained 1,029 SNPs. Seventeen SNPs for each inbred line were selected after further filtering of the SNP dataset. The 17 SNP markers were validated by TaqMan probe-based real-time PCR, and genotyping analysis was conducted. Genotyping analysis of the 5 honeybee inbred lines and one hybrid line, $D \times F$, revealed that one SNP marker, AmD9, precisely discriminated inbred line D from the others. Our results suggest that the identified SNP marker, AmD9, successfully distinguishes inbred honeybee line D and can be used directly for genotyping and breeding applications.

Privacy Preserving Data Publication of Dynamic Datasets (프라이버시를 보호하는 동적 데이터의 재배포 기법)

  • Lee, Joo-Chang;Ahn, Sung-Joon;Won, Dong-Ho;Kim, Ung-Mo
    • Journal of the Korea Institute of Information Security & Cryptology / v.18 no.6A / pp.139-149 / 2008
  • The amount of personal information collected by organizations and government agencies is continuously increasing. When a data collector publishes personal information for research and other purposes, individuals' sensitive information should not be revealed. On the other hand, published data must also provide accurate statistical information for analysis. The k-anonymity and $\ell$-diversity models are popular approaches to privacy-preserving data publication. However, they are limited to static data release. After a dataset is updated with insertions and deletions, a data collector cannot safely release up-to-date information. Recently, the m-invariance model has been proposed to support re-publication of dynamic datasets. However, m-invariant generalization can cause high information loss. In addition, if the adversary has already obtained the sensitive values of some individuals before accessing the released information, m-invariance leads to severe privacy disclosure. In this paper, we propose a novel technique for safely releasing dynamic datasets. The proposed technique offers a simple and effective method for handling inserted and deleted records without generalization. It also provides a degree of privacy preservation equivalent to that of the m-invariance model.
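For readers unfamiliar with the two static models mentioned in this abstract, the following minimal sketch checks k-anonymity and (distinct) $\ell$-diversity on a small released table. It illustrates only those baseline definitions, not the paper's proposed technique or m-invariance; the function names, column names, and records are hypothetical.

```python
from collections import defaultdict


def group_by_quasi_identifiers(records, quasi_ids):
    """Group records that share the same quasi-identifier values."""
    groups = defaultdict(list)
    for record in records:
        groups[tuple(record[q] for q in quasi_ids)].append(record)
    return groups


def is_k_anonymous(records, quasi_ids, k):
    """Every quasi-identifier group must contain at least k records."""
    groups = group_by_quasi_identifiers(records, quasi_ids)
    return all(len(g) >= k for g in groups.values())


def is_l_diverse(records, quasi_ids, sensitive, l):
    """Every group must contain at least l distinct sensitive values."""
    groups = group_by_quasi_identifiers(records, quasi_ids)
    return all(len({r[sensitive] for r in g}) >= l for g in groups.values())


# Hypothetical generalized table: "age" and "zip" are quasi-identifiers.
table = [
    {"age": "20-29", "zip": "130**", "disease": "flu"},
    {"age": "20-29", "zip": "130**", "disease": "cancer"},
    {"age": "30-39", "zip": "148**", "disease": "flu"},
    {"age": "30-39", "zip": "148**", "disease": "flu"},
]

two_anonymous = is_k_anonymous(table, ["age", "zip"], k=2)
two_diverse = is_l_diverse(table, ["age", "zip"], "disease", l=2)
print(two_anonymous, two_diverse)
```

The second group is 2-anonymous but holds a single sensitive value, so an adversary who places an individual in that group learns the disease. This is the kind of disclosure that $\ell$-diversity is meant to rule out in a single static release.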

An Auto-Labeling based Smart Image Annotation System (자동-레이블링 기반 영상 학습데이터 제작 시스템)

  • Lee, Ryong;Jang, Rae-young;Park, Min-woo;Lee, Gunwoo;Choi, Myung-Seok
    • The Journal of the Korea Contents Association / v.21 no.6 / pp.701-715 / 2021
  • The rapid advance of recent deep learning technologies depends heavily on training datasets, which are essential for training models with less human effort. Compared with designing deep learning models, preparing datasets is a long haul; at present, in the domain of vision intelligence, datasets are still made by hand, which requires a lot of time and effort because workers must label each image directly, usually with GUI-based labeling tools. In this paper, we give an overview of the current status of vision datasets, focusing on which datasets are being shared and how they are prepared with various labeling tools. In particular, to relieve the repetitive and tiring labeling work, we present an interactive smart image annotation system with which the annotation work changes from direct, human-only manual labeling to correction after checking, supported by automatic labeling. In an experiment, we show that automatic labeling can greatly improve the productivity of dataset creation, especially by reducing the time and effort needed to specify the regions of objects found in images. Finally, we discuss critical issues that we faced in the experiment with our annotation system and describe future work to raise the productivity of image dataset creation for accelerating AI technology.