• Title/Summary/Keyword: Supervised learning

Semi-supervised GPT2 for News Article Recommendation with Curriculum Learning (준 지도 학습과 커리큘럼 학습을 이용한 유사 기사 추천 모델)

  • Seo, Jaehyung;Oh, Dongsuk;Eo, Sugyeong;Park, Sungjin;Lim, Heuiseok
    • Annual Conference on Human and Language Technology / 2020.10a / pp.495-500 / 2020
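  • News articles do not always convey information objectively or from a broad perspective. It is therefore undesirable to recommend news articles selectively on the basis of personal interests or private information, as conventional recommender systems do. This paper presents a similarity-based article recommendation model so that readers can judge similar events and people as objectively as possible and from diverse perspectives. To measure similarity between long documents, we used the GPT2 [1] language model. In doing so, we mitigated the shortcomings of GPT2 [1], a unidirectional decoder model, through additional training, and used the BM25 [2] function for storage efficiency and key-paragraph extraction. Through semi-supervised learning [3], we also performed self-training on recent news articles that have no similarity labels, and we applied a curriculum learning [4] scheme divided into three stages by sentence length so that the model could learn effectively even from long paragraphs.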
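The three-stage, length-based curriculum mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not give the stage boundaries, so the equal-sized buckets, the whitespace token count used as "length", and the function name `curriculum_stages` are all assumptions made for illustration, not the authors' procedure.

```python
from typing import List


def curriculum_stages(paragraphs: List[str], n_stages: int = 3) -> List[List[str]]:
    """Split paragraphs into n_stages buckets ordered from short to long.

    Assumption: length is the whitespace token count and the buckets are
    (nearly) equal in size; the paper's real thresholds are not stated.
    """
    ordered = sorted(paragraphs, key=lambda p: len(p.split()))
    size, extra = divmod(len(ordered), n_stages)
    stages, start = [], 0
    for i in range(n_stages):
        # The first `extra` buckets take one additional item each.
        end = start + size + (1 if i < extra else 0)
        stages.append(ordered[start:end])
        start = end
    return stages


docs = ["a b", "a b c d e f", "a", "a b c", "a b c d", "a b c d e"]
stages = curriculum_stages(docs)
# Training would then visit stages[0], stages[1], stages[2] in that order.
```

Training on the shortest bucket first and the longest last is the general idea of curriculum learning; how the stages are mixed or repeated is a separate design choice that this sketch leaves open.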

Leveraging Analytics for Talent Acquisition: Case of IT Sector in India

  • Avik Ghosh;Bhaskar Basu
    • Asia pacific journal of information systems / v.30 no.4 / pp.879-918 / 2020
  • One of the challenges faced by Talent Acquisition teams today is acquiring human resources by matching job descriptions with the desired skillsets. This is especially true in competitive sectors like the Indian IT sector. There are various channels for Talent Acquisition, and the costs and benefits vary accordingly. However, a mismatch degrades the quality of deliverables and leads to high recruitment expenses and loss of revenue for the organization. With the increased and diverse sources of data available to organizations today, there is ample opportunity to apply analytics for informed decision making in this field. This paper reveals useful insights that help streamline the Talent Acquisition process in the Indian IT industry. The paper adopts a data-centric approach to examine the critical determinants of an efficient and effective Talent Acquisition process in IT organizations. Selected supervised machine learning algorithms are applied to analyze the dataset. The study is likely to help organizations reassess their talent acquisition strategy with respect to key parameters like expected cost to company (CTC), candidate sourcing channels and optimal joining period.

Dual Dictionary Learning for Cell Segmentation in Bright-field Microscopy Images (명시야 현미경 영상에서의 세포 분할을 위한 이중 사전 학습 기법)

  • Lee, Gyuhyun;Quan, Tran Minh;Jeong, Won-Ki
    • Journal of the Korea Computer Graphics Society / v.22 no.3 / pp.21-29 / 2016
  • Cell segmentation is an important but time-consuming and laborious task in biological image analysis. An automated, robust, and fast method is required to overcome this burden. Meeting these needs is challenging, however, because of varying cell shapes and intensities and incomplete boundaries. Precise cell segmentation enables pathological diagnosis of tissue samples. A vast body of literature exists on cell segmentation in microscopy images [1]. The majority of existing work is based only on input images and predefined feature models, for example, using a deformable model to extract edge boundaries in the image. Only a handful of recent methods employ data-driven approaches, such as supervised learning. In this paper, we propose a novel data-driven cell segmentation algorithm for bright-field microscopy images. The proposed method minimizes an energy function defined by two dictionaries, one for input images and the other for their manual segmentation results, together with a common sparse code; the learned dictionaries are then applied to new images to obtain a pixel-level classification. In contrast to deformable models, our method does not require prior knowledge of the objects. We also employ convolutional sparse coding and the Alternating Direction Method of Multipliers (ADMM) for fast dictionary learning and energy minimization. Unlike an existing method [1], our method trains both dictionaries concurrently and is implemented on the GPU for faster performance.

A Novel of Data Clustering Architecture for Outlier Detection to Electric Power Data Analysis (전력데이터 분석에서 이상점 추출을 위한 데이터 클러스터링 아키텍처에 관한 연구)

  • Jung, Se Hoon;Shin, Chang Sun;Cho, Young Yun;Park, Jang Woo;Park, Myung Hye;Kim, Young Hyun;Lee, Seung Bae;Sim, Chun Bo
    • KIPS Transactions on Software and Data Engineering / v.6 no.10 / pp.465-472 / 2017
  • In the past, researchers mainly used supervised machine learning techniques to analyze power data and investigated pattern identification through data mining. These older classification and analysis techniques, however, reach their limits now that the volume of electric power data has grown with the real-time provision of data. This study therefore proposes a clustering architecture for analyzing large-scale electric power data. The proposed clustering process addresses shortcomings of the K-means algorithm, an unsupervised learning technique, and can automate the entire process from the collection of electric power data to their analysis. In the present study, power data are categorized and analyzed at three levels: the raw data level, the clustering level, and the user interface level. In addition, we identify K, the ideal number of clusters, based on principal component analysis and the normal distribution, and propose a modified K-means algorithm that reduces the data that would be categorized as outliers in order to increase the efficiency of clustering.

Predicting Program Code Changes Using a CNN Model (CNN 모델을 이용한 프로그램 코드 변경 예측)

  • Kim, Dong Kwan
    • Journal of the Korea Convergence Society / v.12 no.9 / pp.11-19 / 2021
  • A software system is required to change during its life cycle due to various requirements such as adding functionality, fixing bugs, and adjusting to new computing environments. Such program code modification should be considered as carefully as new system development because unexpected software errors could be introduced. In addition, when reusing open source programs, we can expect higher quality software if code changes of the open source program are predicted in advance. This paper proposes a Convolutional Neural Network (CNN)-based deep learning model to predict source code changes. The prediction of code changes is treated as a binary classification problem in deep learning, and labeled datasets are used for supervised learning. Java projects and code change logs are collected from GitHub for the training and testing datasets. Software metrics are computed from the collected Java source code and used as input data for the proposed model to detect code changes. The performance of the proposed model is measured using evaluation metrics such as precision, recall, F1-score, and accuracy. The experimental results show that the proposed CNN model achieved an F1-score of 95% and outperformed the multilayer perceptron-based DNN model, whose F1-score is 92%.
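The four evaluation metrics named in this abstract follow directly from the counts of a binary confusion matrix. A minimal sketch is given below; the label convention (1 = "code changed", 0 = "unchanged") and the function name `classification_metrics` are assumptions for illustration and are not taken from the paper.

```python
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute precision, recall, F1-score, and accuracy for binary labels.

    Assumption: label 1 is the positive class (e.g. "code changed").
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


metrics = classification_metrics(
    y_true=[1, 1, 1, 0, 0, 0, 1, 0],
    y_pred=[1, 1, 0, 0, 0, 1, 1, 0],
)
```

Because F1 combines precision and recall, it is less sensitive than accuracy to class imbalance.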

Estimation of KOSPI200 Index option volatility using Artificial Intelligence (이기종 머신러닝기법을 활용한 KOSPI200 옵션변동성 예측)

  • Shin, Sohee;Oh, Hayoung;Kim, Jang Hyun
    • Journal of the Korea Institute of Information and Communication Engineering / v.26 no.10 / pp.1423-1431 / 2022
  • Volatility is one of the variables that the Black-Scholes model requires for option pricing. It cannot be observed directly at the present time. However, since the option price can be observed in the market, implied volatility can be derived from the price of an option at any given point in time and can represent the market's expectation of future volatility. Although volatility is constant in the Black-Scholes model, implied volatility commonly exhibits a volatility smile, meaning that it differs across strike prices. We apply supervised learning with implied volatility as the target, adding V-KOSPI to mitigate the volatility smile. We examine the estimation performance for the implied volatility of KOSPI200 index options using various machine learning algorithms such as Linear Regression, Tree, Support Vector Machine, KNN and Deep Neural Network. The training accuracy was highest (99.9%) for the Decision Tree model, and the test accuracy was highest (96.9%) for the Random Forest model.
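Deriving implied volatility from an observed option price, as described in this abstract, amounts to inverting the Black-Scholes formula numerically. The sketch below assumes a European call with no dividends and uses plain bisection; the function names, the search interval, and the sample inputs are illustrative assumptions, not details from the paper.

```python
import math


def norm_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bs_call(S: float, K: float, T: float, r: float, sigma: float) -> float:
    """Black-Scholes price of a European call (no dividends assumed)."""
    d1 = (math.log(S / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    return S * norm_cdf(d1) - K * math.exp(-r * T) * norm_cdf(d2)


def implied_vol(price: float, S: float, K: float, T: float, r: float,
                lo: float = 1e-6, hi: float = 5.0, tol: float = 1e-8) -> float:
    """Find sigma such that bs_call(...) matches price, by bisection.

    Works because the call price increases monotonically with sigma.
    """
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if bs_call(S, K, T, r, mid) > price:
            hi = mid
        else:
            lo = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)


# Round trip: price an option at a known volatility, then recover it.
market_price = bs_call(S=100.0, K=105.0, T=0.5, r=0.02, sigma=0.25)
recovered = implied_vol(market_price, S=100.0, K=105.0, T=0.5, r=0.02)
```

Repeating this inversion across strike prices for the same expiry is what reveals the volatility smile.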

Building robust Korean speech recognition model by fine-tuning large pretrained model (대형 사전훈련 모델의 파인튜닝을 통한 강건한 한국어 음성인식 모델 구축)

  • Changhan Oh;Cheongbin Kim;Kiyoung Park
    • Phonetics and Speech Sciences / v.15 no.3 / pp.75-82 / 2023
  • Automatic speech recognition (ASR) has been revolutionized by deep learning-based approaches, among which self-supervised learning methods have proven particularly effective. In this study, we aim to enhance the performance of OpenAI's Whisper model, a multilingual ASR system, on the Korean language. Whisper was pretrained on a large corpus (around 680,000 hours) of web speech data and has demonstrated strong recognition performance for major languages. However, it faces challenges in recognizing languages such as Korean, which was not a major language in its training data. We address this issue by fine-tuning the Whisper model with an additional dataset comprising about 1,000 hours of Korean speech. We also compare its performance against a Transformer model trained from scratch on the same dataset. Our results indicate that fine-tuning the Whisper model significantly improved its Korean speech recognition capabilities in terms of character error rate (CER). Specifically, the performance improved with increasing model size. However, the Whisper model's performance on English deteriorated after fine-tuning, emphasizing the need for further research to develop robust multilingual models. Our study demonstrates the potential of a fine-tuned Whisper model for Korean ASR applications. Future work will focus on multilingual recognition and optimization for real-time inference.
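Character error rate (CER), the metric used in this abstract, is the character-level edit distance between the reference transcript and the recognizer output, divided by the reference length. The sketch below is a plain Levenshtein implementation. Whether spaces and punctuation are counted, and how text is normalized first, varies between evaluations and is not specified in the abstract, so this version simply compares the strings as given.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance (substitutions, insertions, deletions) over characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference character
                cur[j - 1] + 1,          # insertion of a hypothesis character
                prev[j - 1] + (r != h),  # substitution, or match at zero cost
            ))
        prev = cur
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(ref, hyp) / len(ref)


example = cer("kitten", "sitting")  # 3 edits over 6 reference characters
```

CER can exceed 1.0 when the hypothesis is much longer than the reference, since insertions are counted against the reference length.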

Performance of Investment Strategy using Investor-specific Transaction Information and Machine Learning (투자자별 거래정보와 머신러닝을 활용한 투자전략의 성과)

  • Kim, Kyung Mock;Kim, Sun Woong;Choi, Heung Sik
    • Journal of Intelligence and Information Systems / v.27 no.1 / pp.65-82 / 2021
  • Stock market investors are generally divided into foreign investors, institutional investors, and individual investors. Compared with individual investors, professional investor groups such as foreign investors have an advantage in information and financial power; as a result, foreign investors are known to show good investment performance among market participants. The purpose of this study is to propose an investment strategy that combines investor-specific transaction information with machine learning, and to analyze the portfolio performance of the proposed model using actual stock price and investor-specific transaction data. The Korea Exchange provides securities firms with daily information on the purchase and sale volume of each investor group. We developed a data collection program in C# using an API provided by Daishin Securities Cybosplus, and collected daily opening prices, closing prices, and investor-specific net purchase data for 151 of 200 KOSPI stocks from January 2, 2007 to July 31, 2017. The self-organizing map model is an artificial neural network that performs clustering by unsupervised learning; it was introduced by Teuvo Kohonen in 1984. Its artificial neurons compete within a layer, and all connections are non-recurrent, running from bottom to top. It can be expanded to multiple layers, although a single layer is commonly used. The artificial neurons use linear activation functions, and the learning rules include the Instar rule as well as general competitive learning. The backpropagation model is an artificial neural network that performs classification by supervised learning. We grouped and transformed the investor-specific transaction volume data with the self-organizing map model and used the result to train the backpropagation model. Based on the trained model's predictions on the verification data, the portfolios were rebalanced monthly. For performance analysis, a passive portfolio was designated, and the KOSPI 200 and KOSPI index returns were obtained as proxies for market returns. Performance was analyzed using the equally weighted portfolio return, compound interest rate, annual return, Maximum Drawdown (MDD), standard deviation, and Sharpe Ratio. Buy-and-hold returns of the top 10 stocks by market capitalization were designated as the benchmark, since buy-and-hold is the best strategy under the efficient market hypothesis. The prediction rate of the backpropagation model was very high on the learning data, at 96.61%, and was also relatively high on the verification data, at 57.1%. The performance of the self-organizing map grouping can be assessed through the results of the backpropagation model, because if the grouping had been poor, the learning results of the backpropagation model would also have been poor. On this basis, the machine learning performance is judged to be better than in previous studies. Our portfolio earned double the benchmark return and outperformed the market returns of the KOSPI and KOSPI 200 indexes. The portfolio risk indicators, MDD and standard deviation, were also better than those of the benchmark. The Sharpe Ratio was higher than those of the benchmark and the stock market indexes. We thereby presented a direction for a portfolio composition program that uses machine learning and investor-specific transaction information, and showed that it can be used to develop programs for real stock investment. The return is the result of monthly portfolio composition and asset rebalancing to equal proportions. Better outcomes are expected if, when forming the monthly portfolio, stocks that continue to be suggested are kept and rebalanced rather than sold and re-bought. The approach therefore appears applicable to real transactions.
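Two of the performance measures named in this abstract, Maximum Drawdown and the Sharpe Ratio, can be computed from a series of periodic returns as sketched below. The monthly frequency, the zero risk-free rate default, the use of the sample standard deviation, and the function names are assumptions for illustration; the abstract does not state the exact conventions used in the paper.

```python
import math
from statistics import mean, stdev
from typing import Sequence


def max_drawdown(returns: Sequence[float]) -> float:
    """Largest peak-to-trough loss of the compounded wealth curve, as a fraction."""
    wealth, peak, mdd = 1.0, 1.0, 0.0
    for r in returns:
        wealth *= 1.0 + r
        peak = max(peak, wealth)
        mdd = max(mdd, (peak - wealth) / peak)
    return mdd


def sharpe_ratio(returns: Sequence[float], risk_free: float = 0.0,
                 periods_per_year: int = 12) -> float:
    """Annualized Sharpe Ratio from periodic returns.

    Assumptions: risk_free is a per-period rate and the sample standard
    deviation is used; at least two returns are required.
    """
    excess = [r - risk_free for r in returns]
    return mean(excess) / stdev(excess) * math.sqrt(periods_per_year)


monthly_returns = [0.10, -0.20, 0.05]
mdd = max_drawdown(monthly_returns)      # wealth path: 1.10 -> 0.88 -> 0.924
sharpe = sharpe_ratio([0.01, 0.03], periods_per_year=1)
```

A lower Maximum Drawdown and a higher Sharpe Ratio than the benchmark are what "better" means for these two indicators.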

Generation of Efficient Fuzzy Classification Rules Using Evolutionary Algorithm with Data Partition Evaluation (데이터 분할 평가 진화알고리즘을 이용한 효율적인 퍼지 분류규칙의 생성)

  • Ryu, Joung-Woo;Kim, Sung-Eun;Kim, Myung-Won
    • Journal of the Korean Institute of Intelligent Systems / v.18 no.1 / pp.32-40 / 2008
  • Fuzzy rules are very useful and efficient for describing classification rules, especially when the attribute values are continuous and fuzzy in nature. However, it is generally difficult to determine membership functions for generating efficient fuzzy classification rules. In this paper, we propose a method for automatically generating efficient fuzzy classification rules using an evolutionary algorithm. In our method, we generate a set of initial membership functions for the evolutionary algorithm by supervised clustering of the training data set, and we evolve this set to generate fuzzy classification rules that take into account both classification accuracy and rule comprehensibility. To reduce the time needed to evaluate an individual, we also propose an evolutionary algorithm with data partition evaluation, in which the training data set is partitioned into a number of subsets and individuals are evaluated on one randomly selected subset at a time instead of the whole training data set. We tested our algorithm on the UCI learning data sets, and the experimental results showed that our method was on average more efficient than the existing algorithms. For the evolutionary algorithm with data partition evaluation, we ran experiments on the intrusion detection data of the KDD'99 Cup and confirmed that evaluation time was reduced by about 70%. Compared with the KDD'99 Cup winner, the accuracy increased by 1.54% while the cost was reduced by 20.8%.
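The data partition evaluation idea in this abstract, scoring individuals on one randomly chosen subset of the training data instead of the whole set, can be sketched as follows. To keep the example short, a toy threshold rule stands in for a fuzzy rule set, and the names `partition`, `evaluate_population`, and `threshold_accuracy` are hypothetical; none of this reproduces the authors' algorithm.

```python
import random
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[float, int]  # (feature value, class label)


def partition(data: Sequence[Sample], n_parts: int, rng: random.Random) -> List[List[Sample]]:
    """Shuffle the training data and split it into n_parts disjoint subsets."""
    shuffled = list(data)
    rng.shuffle(shuffled)
    return [shuffled[i::n_parts] for i in range(n_parts)]


def evaluate_population(individuals: Sequence[float],
                        partitions: Sequence[Sequence[Sample]],
                        fitness: Callable[[float, Sequence[Sample]], float],
                        rng: random.Random) -> List[float]:
    """Score every individual on one randomly selected subset only."""
    subset = rng.choice(partitions)
    return [fitness(ind, subset) for ind in individuals]


def threshold_accuracy(threshold: float, subset: Sequence[Sample]) -> float:
    """Toy fitness: accuracy of the rule 'predict 1 if x > threshold'."""
    correct = sum(1 for x, label in subset if int(x > threshold) == label)
    return correct / len(subset)


rng = random.Random(0)
data = [(i / 20, int(i / 20 > 0.5)) for i in range(20)]
parts = partition(data, 4, rng)
# One generation: three candidate thresholds scored on a single subset.
scores = evaluate_population([0.2, 0.5, 0.8], parts, threshold_accuracy, rng)
```

Each evaluation touches only a fraction of the data, at the price of a noisier fitness estimate; drawing a fresh subset each generation averages that noise out over the run.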

Performance Comparison of Anomaly Detection Algorithms: in terms of Anomaly Type and Data Properties (이상탐지 알고리즘 성능 비교: 이상치 유형과 데이터 속성 관점에서)

  • Jaeung Kim;Seung Ryul Jeong;Namgyu Kim
    • Journal of Intelligence and Information Systems / v.29 no.3 / pp.229-247 / 2023
  • With the increasing emphasis on anomaly detection across various fields, diverse anomaly detection algorithms have been developed for various data types and anomaly patterns. However, the performance of anomaly detection algorithms is generally evaluated on publicly available datasets, and the performance of each algorithm on particular types of anomalies remains unexplored. Consequently, selecting an appropriate anomaly detection algorithm for a specific analytical context is difficult. In this paper, we therefore investigate the types of anomalies and various attributes of data, and on that basis propose approaches that can assist in selecting appropriate anomaly detection algorithms. Specifically, this study compares the performance of anomaly detection algorithms on four types of anomalies: local, global, contextual, and clustered. Further analysis examines the impact of label availability, data quantity, and dimensionality on algorithm performance. Experimental results demonstrate that the most effective algorithm varies depending on the type of anomaly, and that certain algorithms exhibit stable performance even in the absence of anomaly-specific information. Furthermore, for some types of anomalies, unsupervised anomaly detection algorithms performed worse than supervised and semi-supervised learning algorithms. Lastly, we found that the performance of most algorithms is more strongly influenced by the type of anomaly when the data quantity is relatively scarce or abundant. Additionally, at higher dimensionality, performance was excellent in detecting local and global anomalies but lower for clustered anomalies.