• Title/Summary/Keyword: small data set

Search Results: 664

Study on the Emerging Technology-Product Portfolio Generation Based on Firm's Technology Capability (기업 보유역량 기반의 잠재 유망 기술-제품 포트폴리오 도출에 관한 연구)

  • Lee, Yong-Ho;Kwon, Oh-Jin;Coh, Byoung-Youl
    • Journal of Korea Technology Innovation Society
    • /
    • v.14 no.spc
    • /
    • pp.1187-1208
    • /
    • 2011
  • This research proposes a systematic approach for identifying an emerging technology-product portfolio for small and medium-sized enterprises (SMEs). First, an operational definition of emerging technology for SMEs is presented. Second, a research framework is suggested, and a case study is analyzed to show the usefulness of the newly proposed framework. In detail, a reference patent set representing the company's capabilities and business area is constructed, and a patent data set for bibliometric analysis is built from the reference patent set and its citing patents up to the second level. Clustering (with expert judgment) and a keyword-based bibliometric approach are applied, and a cluster activity index (AI) and relevance index (RI) are estimated against the reference patent set. With an emerging technology-product portfolio based on AI and RI, a firm can identify emerging technology-product areas and areas to monitor.

  • PDF
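The AI/RI screening described in this abstract can be sketched as follows. The activity index follows the standard bibliometric definition (the firm's share of a cluster relative to the cluster's overall share); the cosine-similarity relevance index and all numbers are illustrative assumptions, since the abstract does not give exact formulas.

```python
from collections import Counter
import math

def activity_index(firm_in_cluster, firm_total, all_in_cluster, all_total):
    """Standard bibliometric activity index: the firm's share of patents
    in a cluster relative to the cluster's share of all patents.
    AI > 1 means above-average activity in that cluster."""
    return (firm_in_cluster / firm_total) / (all_in_cluster / all_total)

def relevance_index(cluster_keywords, reference_keywords):
    """Cosine similarity between keyword-frequency vectors, a plausible
    stand-in for the paper's relevance index against the reference set."""
    a, b = Counter(cluster_keywords), Counter(reference_keywords)
    dot = sum(a[k] * b[k] for k in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Portfolio screening: a cluster is a candidate emerging area
# when both AI and RI are high.
ai = activity_index(firm_in_cluster=8, firm_total=40,
                    all_in_cluster=50, all_total=1000)
ri = relevance_index(["sensor", "battery", "bms"],
                     ["battery", "bms", "charger"])
print(round(ai, 2), round(ri, 2))  # 4.0 0.67
```

Clusters with high AI but low RI would fall into the monitoring area of the portfolio rather than the emerging technology-product area.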

An Experimental Study on Smoothness Regularized LDA in Hyperspectral Data Classification (하이퍼스펙트럴 데이터 분류에서의 평탄도 LDA 규칙화 기법의 실험적 분석)

  • Park, Lae-Jeong
    • Journal of the Korean Institute of Intelligent Systems
    • /
    • v.20 no.4
    • /
    • pp.534-540
    • /
    • 2010
  • High dimensionality and highly correlated features are the major characteristics of hyperspectral data. Linear projections such as LDA and its variants have been used to extract low-dimensional features from high-dimensional spectral data. Regularization of LDA has been introduced to alleviate the overfitting that often occurs with small training data sets and leads to poor generalization. Among these variants, smoothness-regularized LDA appears effective for hyperspectral feature extraction because it can exploit the high correlation between neighboring bands. This paper experimentally studies the performance of regularized LDA in hyperspectral data classification under varying training-data conditions. In addition, a new dual smoothness-regularized LDA is proposed and evaluated that makes use of both the spectral-domain and spatial-domain correlations between neighboring pixels.
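A minimal sketch of the smoothness-regularization idea for a two-class Fisher discriminant, assuming a second-difference roughness penalty added to the within-class scatter; the paper's exact formulation and the toy data are not from the abstract.

```python
import numpy as np

def smooth_lda_direction(X0, X1, lam=1.0):
    """Two-class Fisher discriminant with a smoothness penalty.
    A second-difference operator D along the spectral axis gives a
    roughness penalty D^T D, added to the within-class scatter so the
    projection weights vary smoothly across neighboring, correlated bands."""
    d = X0.shape[1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    Sw = (np.cov(X0, rowvar=False) * (len(X0) - 1)
          + np.cov(X1, rowvar=False) * (len(X1) - 1))
    D = np.zeros((d - 2, d))
    for i in range(d - 2):
        D[i, i:i + 3] = [1.0, -2.0, 1.0]   # discrete second difference
    w = np.linalg.solve(Sw + lam * D.T @ D, m1 - m0)
    return w / np.linalg.norm(w)

# Toy "spectra": 30 highly correlated bands, classes offset by 0.5.
rng = np.random.default_rng(0)
X0 = rng.normal(size=(100, 1)) @ np.ones((1, 30)) + 0.1 * rng.normal(size=(100, 30))
X1 = rng.normal(size=(100, 1)) @ np.ones((1, 30)) + 0.1 * rng.normal(size=(100, 30)) + 0.5
w = smooth_lda_direction(X0, X1, lam=10.0)
print(w.shape)  # (30,)
```

Increasing `lam` trades discriminative power for smoother weights, which is what helps when the training set is small relative to the number of bands.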

Experimental Analysis of Equilibrization in Binary Classification for Non-Image Imbalanced Data Using Wasserstein GAN

  • Wang, Zhi-Yong;Kang, Dae-Ki
    • International Journal of Internet, Broadcasting and Communication
    • /
    • v.11 no.4
    • /
    • pp.37-42
    • /
    • 2019
  • In this paper, we examine three classic data augmentation methods and two generative-model-based oversampling methods. The three classic methods are random oversampling (RANDOM), the Synthetic Minority Over-sampling Technique (SMOTE), and Adaptive Synthetic Sampling (ADASYN). The two generative-model-based methods are the Conditional Generative Adversarial Network (CGAN) and the Wasserstein Generative Adversarial Network (WGAN). In imbalanced data, the instances are divided into a majority class, which occupies most of the training set, and a minority class, which includes only a few instances. Generative models have an advantage here because they can generate more plausible samples by referring to the distribution of the minority class. We also adopt CGAN to compare its data augmentation performance with the other methods. The experimental results show that WGAN-based oversampling is more stable than the other approaches (RANDOM, SMOTE, ADASYN, and CGAN), even with very limited training data. However, when the imbalance ratio is too small, the generative-model-based approaches cannot achieve better performance than the conventional data augmentation techniques. These results suggest a direction for future research.
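Of the classic methods compared above, SMOTE is easy to sketch: each synthetic minority sample is a random interpolation between a minority instance and one of its k nearest minority neighbors. This is a minimal NumPy version of the standard algorithm, not the paper's implementation.

```python
import numpy as np

def smote(X_min, n_new, k=5, rng=None):
    """Classic SMOTE: interpolate between a minority instance and one of
    its k nearest minority neighbors, so synthetic points stay inside
    the minority region rather than duplicating existing instances."""
    rng = rng or np.random.default_rng(0)
    samples = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        dists = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbors = np.argsort(dists)[1:k + 1]   # skip the point itself
        j = rng.choice(neighbors)
        gap = rng.random()                       # interpolation fraction
        samples.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(samples)

rng = np.random.default_rng(1)
X_min = rng.normal(size=(10, 4))          # a tiny minority class
X_syn = smote(X_min, n_new=30, rng=rng)
print(X_syn.shape)  # (30, 4)
```

ADASYN differs mainly in drawing more samples near hard-to-learn minority instances; the GAN-based methods replace the interpolation step with a learned generator.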

Image-based rainfall prediction from a novel deep learning method

  • Byun, Jongyun;Kim, Jinwon;Jun, Changhyun
    • Proceedings of the Korea Water Resources Association Conference
    • /
    • 2021.06a
    • /
    • pp.183-183
    • /
    • 2021
  • Deep learning methods and their applications have become an essential part of prediction and modeling in water-related research areas, including hydrological processes and climate change. Deep learning broadens the usable data sources in hydrology and has proven useful in analyses of precipitation, runoff, groundwater level, evapotranspiration, and so on. However, microclimate analysis and prediction with deep learning remain limited by the scarcity of gauge-based data and the shortcomings of existing technologies. In this study, a real-time rainfall prediction model was developed from a sky-image data set using convolutional neural networks (CNNs). Daily image data were collected at Chung-Ang University and Korea University. To improve accuracy, the proposed model incorporates data classification, image processing, and ratio adjustment of no-rain data. Rainfall predictions were compared with minutely rainfall data from rain gauge stations close to the image sensors. The results indicate that the proposed model could complement the current rainfall observation system and has large potential to fill observation gaps. Information from small-scale areas can advance accurate weather forecasting and hydrological modeling at the micro scale.

  • PDF
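The abstract does not describe the CNN architecture, so the following is only an illustrative forward pass for an image-to-rain-probability classifier (one convolution layer, ReLU, global average pooling, sigmoid output), with untrained random filters standing in for learned ones.

```python
import numpy as np

def conv2d(img, kern):
    """Valid-mode 2-D cross-correlation for a single channel."""
    h, w = kern.shape
    H, W = img.shape
    out = np.empty((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + h, j:j + w] * kern)
    return out

def tiny_rain_cnn(img, kernels, w_out, b_out):
    """Conv layer -> ReLU -> global average pooling -> sigmoid,
    returning P(rain) for one grayscale sky image."""
    feats = np.array([np.maximum(conv2d(img, k), 0.0).mean() for k in kernels])
    z = feats @ w_out + b_out
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
img = rng.random((16, 16))                # stand-in for a sky image
kernels = rng.normal(size=(4, 3, 3))      # 4 untrained 3x3 filters
p = tiny_rain_cnn(img, kernels, w_out=rng.normal(size=4), b_out=0.0)
print(0.0 <= p <= 1.0)  # True
```

The no-rain ratio adjustment the abstract mentions would correspond to rebalancing the rain/no-rain images in the training set before fitting such a model.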

Anomaly Detection of Machining Process based on Power Load Analysis (전력 부하 분석을 통한 절삭 공정 이상탐지)

  • Jun Hong Yook;Sungmoon Bae
    • Journal of Korean Society of Industrial and Systems Engineering
    • /
    • v.46 no.4
    • /
    • pp.173-180
    • /
    • 2023
  • Smart factory companies are installing various sensors in production facilities and collecting field data. However, relatively few companies actively utilize the collected data, although academic research using field data is actively underway. This study develops a model that detects anomalies in the machining process by analyzing spindle power data from a company that machines shafts used in automobile throttle valves. Since the data collected during machining are time series, the model was developed through unsupervised learning, applying the Holt-Winters technique and deep learning algorithms such as RNN, LSTM, GRU, BiRNN, BiLSTM, and BiGRU. To evaluate each model, the differences between predicted and actual values were compared using MSE and RMSE; the BiLSTM model showed the best results based on RMSE. To diagnose abnormalities with the developed model, a critical point was set using statistical techniques in consultation with field experts and then verified. By collecting and preprocessing real-world data and developing a model, this study serves as a case study of utilizing time-series data in small and medium-sized enterprises.
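The forecast-then-threshold scheme described above can be sketched with a pared-down forecaster. Simple exponential smoothing stands in for the Holt-Winters and BiLSTM models, and a mean-plus-three-sigma residual threshold stands in for the expert-set critical point; the power values are made up.

```python
import math

def ses_forecast(series, alpha=0.5):
    """One-step-ahead simple exponential smoothing forecasts
    (a minimal stand-in for the Holt-Winters / BiLSTM forecasters)."""
    level, preds = series[0], []
    for x in series:
        preds.append(level)                 # forecast made before seeing x
        level = alpha * x + (1 - alpha) * level
    return preds

def detect_anomalies(series, alpha=0.5, n_sigma=3.0):
    """Flag points whose absolute forecast residual exceeds a
    mean + n_sigma * std critical point."""
    preds = ses_forecast(series, alpha)
    resid = [abs(x, ) if False else abs(x - p) for x, p in zip(series, preds)]
    mu = sum(resid) / len(resid)
    sigma = math.sqrt(sum((r - mu) ** 2 for r in resid) / len(resid))
    threshold = mu + n_sigma * sigma
    return [i for i, r in enumerate(resid) if r > threshold], threshold

# Synthetic spindle power trace with one spike at index 22.
power = [10.0] * 20 + [10.2, 9.9, 25.0] + [10.1] * 10
anoms, thr = detect_anomalies(power)
print(anoms)  # [22]
```

In the study's setting, the residual statistics would come from a validation run on normal machining cycles rather than from the monitored series itself.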

An Active Co-Training Algorithm for Biomedical Named-Entity Recognition

  • Munkhdalai, Tsendsuren;Li, Meijing;Yun, Unil;Namsrai, Oyun-Erdene;Ryu, Keun Ho
    • Journal of Information Processing Systems
    • /
    • v.8 no.4
    • /
    • pp.575-588
    • /
    • 2012
  • Exploiting unlabeled text data with a relatively small labeled corpus has been an active and challenging research topic in text mining, owing to the recent growth in the amount of biomedical literature. Biomedical named-entity recognition is an essential prerequisite for effective text mining of biomedical literature. This paper proposes an Active Co-Training (ACT) algorithm for biomedical named-entity recognition. ACT is a semi-supervised learning method in which two classifiers, based on two different feature sets, iteratively learn from informative examples queried from the unlabeled data. We design a new classification problem to measure the informativeness of an example in the unlabeled data: examples are classified, based on a joint view of the feature sets, as informative or non-informative to both classifiers. To form the training data for this classification problem, we adopt a query-by-committee method, in which the two classifiers form one committee that assigns an informativeness label to each labeled example. The ACT method outperforms the traditional co-training algorithm in terms of F-measure as well as the number of training iterations needed to build a good classification model. The proposed method efficiently exploits a large amount of unlabeled data by selecting a small number of examples that carry not only useful information but also comprehensive patterns.
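The committee idea can be sketched with two toy view-classifiers: an unlabeled example is treated as informative when the two views disagree. This is a simplification of the paper's informativeness classifier, and the nearest-centroid classifiers and toy data are assumptions for illustration.

```python
import math

def centroid_classifier(labeled, view):
    """Fit per-class centroids on one feature view; return a predictor."""
    groups = {}
    for x, y in labeled:
        groups.setdefault(y, []).append(x[view])
    means = {y: [sum(c) / len(c) for c in zip(*vs)] for y, vs in groups.items()}
    def predict(x):
        return min(means, key=lambda y: math.dist(x[view], means[y]))
    return predict

def query_informative(labeled, unlabeled):
    """Query-by-committee sketch: the two view classifiers form the
    committee; an example is queried when their predictions disagree."""
    c1 = centroid_classifier(labeled, 0)
    c2 = centroid_classifier(labeled, 1)
    return [x for x in unlabeled if c1(x) != c2(x)]

# Toy data: each example is a pair of 1-D feature views.
labeled = [(([0.0], [0.0]), "O"), (([1.0], [1.0]), "GENE")]
unlabeled = [([0.1], [0.9]),   # views disagree -> informative
             ([0.9], [0.8])]   # views agree    -> not queried
informative = query_informative(labeled, unlabeled)
print(len(informative))  # 1
```

In full co-training, each queried example would then be labeled and added to both classifiers' training sets for the next iteration.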

Soft Information and Government Loan Approval (연성정보와 정책자금 대출결정 요인 분석)

  • Yoo, Shi-Yong
    • Journal of the Korea Academia-Industrial cooperation Society
    • /
    • v.10 no.12
    • /
    • pp.3768-3774
    • /
    • 2009
  • This paper explores how soft and hard information were used when the SBC (Small Business Corporation, Korea) reviewed government loan applications. The data set consists of financial and non-financial data on small-business firms since 2004; the non-financial data are treated as soft information. The relative importance of three kinds of information (credit, soft, and financial) is compared using a logit model. As a result, credit information is most critical to loan approval, followed by soft information, while financial information has the smallest effect. This is because credit information is a non-linear combination of soft and financial information. Comparing only soft and financial information, soft information is relatively more critical to loan approval than financial information, because the financial ratios provided by small-business firms are not sufficiently reliable.
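A logit model of this kind can be sketched as follows. The data are synthetic, with the true effect sizes chosen to mimic the paper's reported ranking (credit > soft > financial); the fitted coefficients then recover that ordering. This is not the paper's data or estimator, only an illustration of comparing feature-group importance with logistic regression.

```python
import math, random

def fit_logit(X, y, lr=0.5, epochs=300):
    """Batch gradient-ascent logistic regression (no intercept, for brevity)."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        grad = [0.0] * d
        for xi, yi in zip(X, y):
            p = 1.0 / (1.0 + math.exp(-sum(wj * xj for wj, xj in zip(w, xi))))
            for j in range(d):
                grad[j] += (yi - p) * xi[j]
        w = [wj + lr * g / n for wj, g in zip(w, grad)]
    return w

# Synthetic applications: [credit, soft, financial] standardized scores.
random.seed(0)
X, y = [], []
for _ in range(500):
    credit, soft, fin = (random.gauss(0, 1) for _ in range(3))
    logit = 2.0 * credit + 1.0 * soft + 0.3 * fin   # assumed true effects
    y.append(1 if random.random() < 1.0 / (1.0 + math.exp(-logit)) else 0)
    X.append([credit, soft, fin])

w = fit_logit(X, y)
print(w[0] > w[1] > w[2])   # coefficient ordering: credit > soft > financial
```

With standardized inputs, comparing coefficient magnitudes is a reasonable first look at relative importance; the paper's comparison would additionally account for the non-linear dependence of credit information on the other two groups.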

Risk Prediction Model of Legal Contract Based on Korean Machine Reading Comprehension (한국어 기계독해 기반 법률계약서 리스크 예측 모델)

  • Lee, Chi Hoon;Woo, Noh Ji;Jeong, Jae Hoon;Joo, Kyung Sik;Lee, Dong Hee
    • Journal of Information Technology Services
    • /
    • v.20 no.1
    • /
    • pp.131-143
    • /
    • 2021
  • Commercial transactions, one of the pillars of the capitalist economy, occur countless times every day, especially among small and medium-sized businesses. However, small and medium-sized enterprises tend to be the legal underdogs in commercial contracts and do not receive legal support for fair and legitimate transactions. When subcontracting contracts are concluded among small and medium-sized enterprises, 58.2% of them do not apply standard contracts and are signed without legal review. If the various risks in a contract can be analyzed in advance, with toxic clauses and omitted clauses identified and flagged, small and medium-sized enterprises can be protected from legal threats. We propose a risk prediction model for legal contracts based on machine reading comprehension to minimize legal damage to small and medium-sized business owners in legal blind spots. We built our own set of legal questions and answers from publicly disclosed legal data in order to construct a model specialized for legal contracts. Quantitative verification was carried out with indicators such as EM and F1 score by applying fine-tuning and adversarial learning to pre-trained machine reading comprehension models. The highest F1 score was 87.93, with an EM of 72.41.
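The EM and F1 indicators reported above are the standard extractive-QA metrics: exact string match and token-overlap F1 between the predicted and gold answer spans. The Korean example strings below are illustrative, not from the paper's data set.

```python
from collections import Counter

def exact_match(pred, gold):
    """1 if the stripped strings match exactly, else 0 (the EM metric)."""
    return int(pred.strip() == gold.strip())

def token_f1(pred, gold):
    """Token-overlap F1 between prediction and gold answer (SQuAD-style)."""
    p, g = pred.split(), gold.split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

pred = "지체상금은 계약금액의 0.1%"
gold = "지체상금은 계약금액의 0.1% 이다"
print(exact_match(pred, gold), round(token_f1(pred, gold), 2))  # 0 0.86
```

Corpus-level EM and F1, like the paper's 72.41 and 87.93, are averages of these per-question scores.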

A Study on the Quantitative Evaluation Method of Small-Scale Environmental Impact Assessment

  • Dong-Myung CHO;Ju-Yeon LEE;Woo-Taeg KWON
    • Journal of Wellbeing Management and Applied Psychology
    • /
    • v.6 no.2
    • /
    • pp.39-46
    • /
    • 2023
  • Purpose: The small-scale environmental impact assessment system in Korea was introduced in August 2000, but it cannot guarantee implementation because qualitative reduction measures account for a large proportion of each evaluation item. This study therefore examined how, when preparing a small-scale environmental impact assessment, the existing simple listing-type reduction measures and qualitative evaluation standards can be improved into quantitative measures and standards that reflect regional characteristics. Research design, data and methodology: Reduction measures from six small-scale environmental impact assessment projects were classified into qualitative and quantitative factors and analyzed by evaluation item. Results: The qualitative factors across the six projects total 160, accounting for 80% of all reduction measures, while the quantitative factors total 40, accounting for 20%. Qualitative reduction measures reached 97.4% for animal and plant items, and more than 90% for air quality, noise and vibration, and eco-friendly resource circulation items. Conclusions: Therefore, qualitative reduction measures should be avoided and quantitative measures set as the basis, specifying the specifications, size, and installation location of each reduction measure and calculating its numerical reduction efficiency.

New Optimization Algorithm for Data Clustering (최적화에 기반 한 데이터 클러스터링 알고리즘)

  • Kim, Ju-Mi
    • Journal of Intelligence and Information Systems
    • /
    • v.13 no.3
    • /
    • pp.31-45
    • /
    • 2007
  • Handling large data sets is one of the critical issues facing the data mining community, particularly for computationally intense tasks such as data clustering. Random sampling of instances is one possible means of handling large data, but a pervasive problem with this approach is how to deal with the noise it introduces into the evaluation of the learning algorithm. This paper develops a new optimization-based clustering approach using an algorithm specifically designed for noisy performance evaluation. Numerical results show that this algorithm outperforms alternatives such as PAM and CLARA. The algorithm also achieves substantial savings in computational time, without sacrificing solution quality, by using partial data.

  • PDF
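The PAM and CLARA baselines named above combine neatly: PAM is swap-based k-medoids, and CLARA scales it to large data by running PAM on random subsamples and keeping the medoid set that is cheapest on the full data. This is a minimal sketch of those baselines, not the paper's proposed algorithm.

```python
import math, random

def total_cost(data, medoids):
    """Sum of distances from each point to its nearest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in data)

def pam(data, k, rng):
    """Greedy PAM-style k-medoids: swap a medoid for a non-medoid
    whenever the swap lowers total cost, until no swap helps."""
    medoids = rng.sample(data, k)
    improved = True
    while improved:
        improved = False
        for m in list(medoids):
            for p in data:
                if p in medoids:
                    continue
                cand = [p if x == m else x for x in medoids]
                if total_cost(data, cand) < total_cost(data, medoids):
                    medoids, improved = cand, True
    return medoids

def clara(data, k, n_samples=5, sample_size=40, rng=None):
    """CLARA: run PAM on random subsamples, keep the medoids that are
    cheapest on the FULL data -- clustering large data via partial data."""
    rng = rng or random.Random(0)
    best = None
    for _ in range(n_samples):
        sample = rng.sample(data, min(sample_size, len(data)))
        meds = pam(sample, k, rng)
        if best is None or total_cost(data, meds) < total_cost(data, best):
            best = meds
    return best

# Two well-separated 2-D clusters around (0, 0) and (5, 5).
rng = random.Random(0)
data = [(rng.gauss(0, 0.3), rng.gauss(0, 0.3)) for _ in range(100)] + \
       [(rng.gauss(5, 0.3), rng.gauss(5, 0.3)) for _ in range(100)]
meds = clara(data, k=2, rng=rng)
print(sorted(round(m[0]) for m in meds))
```

The paper's contribution is, in effect, a principled replacement for CLARA's "best of a few samples" rule: an optimization procedure designed to handle the noise that subsampling injects into the cost evaluation.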