• Title/Summary/Keyword: neural network algorithm (신경망 알고리즘)


Study on the development of automatic translation service system for Korean astronomical classics by artificial intelligence - Focused on development results and test operation (천문 고문헌 특화 인공지능 자동번역 서비스 시스템 개발 연구 - 개발 결과 및 시험 운영 위주)

  • Seo, Yoon Kyung;Kim, Sang Hyuk;Ahn, Young Sook;Choi, Go-Eun;Choi, Young Sil;Baik, Hangi;Sun, Bo Min;Kim, Hyun Jin;Choi, Byung Sook;Lee, Sahng Woon;Park, Raejin
    • The Bulletin of The Korean Astronomical Society / v.45 no.1 / pp.56.1-56.1 / 2020
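  • Korea's old documents contain a variety of ancient astronomical records written in classical Chinese, and using them for scholarly purposes requires considerable cost and time because professional translators must be employed. We therefore introduce a scholarly tool, an artificial intelligence translator built with artificial neural network machine learning, that automatically translates classical Chinese sentences into Korean, even if only at the level of a first-draft translation. This automatic translator was developed by the Korea Astronomy and Space Science Institute (KASI), which participated jointly with the Institute for the Translation of Korean Classics in the 2019 Information and Communication Technology (ICT)-based public service promotion project run by the National Information Society Agency. The goal of this study is to build a corpus of astronomical classics, that is, machine learning data specialized for the ancient astronomy domain, and on that basis to develop an automatic translation model specialized for astronomical classics and to provide a translation service. Three systems are built for this purpose. First, a cloud-based public automatic translation service for old documents that anyone can use through the web without logging in. Second, a cloud-based service platform for institutions, on which each participating institution can create and manage its own corpus and domain-specific translation models. Third, the AITHA system, which uses the developed automatic translation application programming interface (API) to provide an in-house service within KASI. As for results, a sampling inspection of the 60,760 entries in the constructed astronomical classics corpus showed a quality purity of 99.9% or higher. In addition, the representative model among the 20 astronomy-specific translation models that were derived scored 40.02 on Bilingual Evaluation Understudy (BLEU), an algorithm for evaluating the quality of machine-translated text, and 4.05 out of 5.0 in a human evaluation by experts. This is sufficient for the first-draft translation level originally set as the research goal, and the developed systems are currently in internal test operation. This study is meaningful in that it lowers the translation barrier for ancient astronomical records, a special class of old documents, and can thereby support researchers' scholarly access and a variety of studies. In addition, because ancient astronomy is the first pilot case of the platform for spreading AI automatic translation, synergies can be expected when other academic fields join later. The automatic translator for old documents will evolve into a better scholarly tool as more training data and training accumulate.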
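The abstract reports a BLEU score without describing how it was computed. As a rough illustration only, the following sketch implements a single-reference, unsmoothed sentence-level BLEU on a 0 to 100 scale. Whitespace tokenization and the function names are assumptions made for this example, not the evaluation setup used in the study.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=4):
    """Single-reference BLEU (0-100), no smoothing, whitespace tokens (assumption)."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, one empty precision zeroes the score
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty punishes candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return 100.0 * bp * math.exp(sum(log_precisions) / max_n)
```

Corpus-level BLEU, which published scores normally use, aggregates the clipped counts over all sentences before taking the geometric mean, so sentence scores from this sketch are not directly comparable with the reported 40.02.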


A Study on the Effect of Network Centralities on Recommendation Performance (네트워크 중심성 척도가 추천 성능에 미치는 영향에 대한 연구)

  • Lee, Dongwon
    • Journal of Intelligence and Information Systems / v.27 no.1 / pp.23-46 / 2021
  • Collaborative filtering, which is widely used for personalized recommendation, is recognized as a very useful technique for finding similar customers and recommending products to them based on their purchase history. However, because traditional collaborative filtering calculates similarity from direct connections and common features among customers, it has difficulty calculating similarity for new customers or products. For this reason, hybrid techniques were designed that also use content-based filtering. Separately, efforts have been made to solve these problems by applying the structural characteristics of social networks: the similarity between two customers is calculated indirectly through the similar customers placed between them. That is, a customer network is built from purchase data, and the similarity between two customers is calculated from the features of the part of the network that indirectly connects them. Such similarity can be used as a measure to predict whether the target customer will accept a recommendation. The centrality metrics of networks can be used to calculate these similarities. Different centrality metrics matter because they may affect recommendation performance differently. Furthermore, this study considers that the effect of these centrality metrics on recommendation performance may vary depending on the recommender algorithm. In addition, recommendation techniques based on network analysis can be expected to improve recommendation performance not only for new customers or products but also for all customers or products. By treating a customer's purchase of an item as a link created between the customer and the item in the network, predicting whether a user accepts a recommendation becomes predicting whether a new link will be created between them. Because classification models suit the binary problem of whether a link is formed or not, decision tree, k-nearest neighbors (KNN), logistic regression, artificial neural network, and support vector machine (SVM) were selected for this research. The performance evaluation used order data collected from an online shopping mall over four years and two months. The first three years and eight months of records were used to construct the social network, and the following four months of records were used to train and evaluate the recommender models. Experiments applying the centrality metrics to each model show that the recommendation acceptance rates of the centrality metrics differ across algorithms at a meaningful level. This work analyzed only four commonly used centrality metrics: degree centrality, betweenness centrality, closeness centrality, and eigenvector centrality. Eigenvector centrality records the lowest performance in all models except the support vector machine. Closeness centrality and betweenness centrality show similar performance across all models. Degree centrality ranks in the middle across the models, while betweenness centrality always ranks higher than degree centrality. Finally, closeness centrality shows distinct differences in performance depending on the model: it ranks first, with numerically high performance, in logistic regression, artificial neural network, and decision tree, but records very low rankings and low performance in support vector machine and k-nearest neighbors. As the experimental results reveal, in a classification model, network centrality metrics over a subnetwork that connects two nodes can effectively predict the connectivity between those nodes in a social network. Furthermore, each metric performs differently depending on the type of classification model. This result implies that choosing appropriate metrics for each algorithm can lead to higher recommendation performance. In general, betweenness centrality can guarantee a high level of performance in any model, and introducing closeness centrality could yield higher performance for certain models.
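To make the idea of centrality on a purchase-based customer network concrete, here is a minimal sketch using only the standard library. The rule for linking customers (sharing at least one purchased item), the sample data, and all function names are assumptions for illustration. The paper's actual network construction and its subnetwork-based features are not specified in the abstract.

```python
from collections import deque
from itertools import combinations


def co_purchase_network(purchases):
    """Link two customers when they bought at least one common item (assumed rule)."""
    graph = {customer: set() for customer in purchases}
    for a, b in combinations(purchases, 2):
        if purchases[a] & purchases[b]:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def degree_centrality(graph):
    """Fraction of the other nodes that each node is directly linked to."""
    n = len(graph)
    return {v: len(nbrs) / (n - 1) for v, nbrs in graph.items()}


def closeness_centrality(graph):
    """Reachable node count divided by total shortest-path distance (BFS)."""
    result = {}
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        total = sum(dist.values())
        result[source] = (len(dist) - 1) / total if total else 0.0
    return result


# Hypothetical purchase histories: customer -> set of items bought.
purchases = {
    "ann": {"tea", "mug"},
    "bob": {"mug", "pen"},
    "cho": {"pen", "ink"},
    "dee": {"ink"},
}
graph = co_purchase_network(purchases)
degree = degree_centrality(graph)
closeness = closeness_centrality(graph)
```

In this toy data the customers form a chain (ann, bob, cho, dee), so the two middle customers score higher on both metrics. Values like these could then serve as input features for the classifiers named in the abstract.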

A Study on Training Dataset Configuration for Deep Learning Based Image Matching of Multi-sensor VHR Satellite Images (다중센서 고해상도 위성영상의 딥러닝 기반 영상매칭을 위한 학습자료 구성에 관한 연구)

  • Kang, Wonbin;Jung, Minyoung;Kim, Yongil
    • Korean Journal of Remote Sensing / v.38 no.6_1 / pp.1505-1514 / 2022
  • Image matching is a crucial preprocessing step for the effective use of multi-temporal and multi-sensor very high resolution (VHR) satellite images. Deep learning (DL), which is attracting widespread interest, has proven to be an efficient approach for measuring the similarity between image pairs quickly and accurately by extracting complex and detailed features from satellite images. However, image matching of VHR satellite images remains challenging because the results of DL models depend on the quantity and quality of the training dataset and because training datasets are difficult to create from VHR satellite images. Therefore, this study examines the feasibility of a DL-based method for matching pair extraction, which is the most time-consuming process in image registration. This paper also analyzes the factors that affect accuracy depending on the configuration of the training dataset when the dataset is developed from an existing, biased multi-sensor VHR image database for DL-based image matching. For this purpose, the training dataset was composed of correct and incorrect matching pairs, created by assigning true and false labels to image pairs extracted with a grid-based Scale Invariant Feature Transform (SIFT) algorithm from a total of 12 multi-temporal and multi-sensor VHR images. The Siamese convolutional neural network (SCNN) proposed for matching pair extraction is trained on the constructed dataset and measures similarity by passing the two images in parallel through two identical convolutional neural network structures. The results confirm that data acquired from a VHR satellite image database can be used as a DL training dataset and indicate the potential to improve the efficiency of the matching process through appropriate configuration of multi-sensor images. Given their stable performance, DL-based image matching techniques using multi-sensor VHR satellite images are expected to replace existing manual feature extraction methods and to develop further into an integrated DL-based image registration framework.

Detection of Wildfire Burned Areas in California Using Deep Learning and Landsat 8 Images (딥러닝과 Landsat 8 영상을 이용한 캘리포니아 산불 피해지 탐지)

  • Youngmin Seo;Youjeong Youn;Seoyeon Kim;Jonggu Kang;Yemin Jeong;Soyeon Choi;Yungyo Im;Yangwon Lee
    • Korean Journal of Remote Sensing / v.39 no.6_1 / pp.1413-1425 / 2023
  • The increasing frequency of wildfires due to climate change is causing extreme loss of life and property. They cause loss of vegetation and affect ecosystem changes depending on their intensity and occurrence. Ecosystem changes, in turn, affect wildfire occurrence, causing secondary damage. Thus, accurate estimation of the areas affected by wildfires is fundamental. Satellite remote sensing is used for forest fire detection because it can rapidly acquire topographic and meteorological information about the affected area after forest fires. In addition, deep learning algorithms such as convolutional neural networks (CNN) and transformer models show high performance for more accurate monitoring of fire-burnt regions. To date, the application of deep learning models has been limited, and there is a scarcity of reports providing quantitative performance evaluations for practical field utilization. Hence, this study emphasizes a comparative analysis, exploring performance enhancements achieved through both model selection and data design. This study examined deep learning models for detecting wildfire-damaged areas using Landsat 8 satellite images in California. Also, we conducted a comprehensive comparison and analysis of the detection performance of multiple models, such as U-Net and High-Resolution Network-Object Contextual Representation (HRNet-OCR). Wildfire-related spectral indices such as normalized difference vegetation index (NDVI) and normalized burn ratio (NBR) were used as input channels for the deep learning models to reflect the degree of vegetation cover and surface moisture content. As a result, the mean intersection over union (mIoU) was 0.831 for U-Net and 0.848 for HRNet-OCR, showing high segmentation performance. The inclusion of spectral indices alongside the base wavelength bands resulted in increased metric values for all combinations, affirming that the augmentation of input data with spectral indices contributes to the refinement of pixels. 
This study can be applied to other satellite images to build a recovery strategy for fire-burnt areas.
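The spectral indices and the mIoU metric named in this abstract have standard definitions, sketched below in plain Python. For Landsat 8, the red, near-infrared, and second shortwave-infrared bands are bands 4, 5, and 7. The function names, the flat-list representation of label maps, and the sample values are assumptions for illustration, not the study's pipeline.

```python
def normalized_difference(a, b):
    """(a - b) / (a + b), returning 0.0 when the denominator is zero."""
    return (a - b) / (a + b) if (a + b) != 0 else 0.0


def ndvi(nir, red):
    """Normalized difference vegetation index from NIR and red reflectance."""
    return normalized_difference(nir, red)


def nbr(nir, swir2):
    """Normalized burn ratio from NIR and shortwave-infrared (SWIR2) reflectance."""
    return normalized_difference(nir, swir2)


def iou(pred, truth, cls):
    """Intersection over union for one class over flattened label maps."""
    inter = sum(1 for p, t in zip(pred, truth) if p == cls and t == cls)
    union = sum(1 for p, t in zip(pred, truth) if p == cls or t == cls)
    return inter / union if union else float("nan")


def mean_iou(pred, truth, classes=(0, 1)):
    """Average IoU over the classes that occur in prediction or ground truth."""
    scores = [iou(pred, truth, c) for c in classes]
    valid = [s for s in scores if s == s]  # drop NaN for absent classes
    return sum(valid) / len(valid)


# Hypothetical flattened masks: 1 = burned, 0 = unburned.
predicted = [1, 1, 0, 0]
reference = [1, 0, 0, 0]
score = mean_iou(predicted, reference)
```

Healthy vegetation reflects strongly in the near-infrared, so NDVI and NBR both drop after a fire, which is why they are useful as extra input channels.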

Analysis of the Impact of Satellite Remote Sensing Information on the Prediction Performance of Ungauged Basin Stream Flow Using Data-driven Models (인공위성 원격 탐사 정보가 자료 기반 모형의 미계측 유역 하천유출 예측성능에 미치는 영향 분석)

  • Seo, Jiyu;Jung, Haeun;Won, Jeongeun;Choi, Sijung;Kim, Sangdan
    • Journal of Wetlands Research / v.26 no.2 / pp.147-159 / 2024
  • Lack of streamflow observations makes model calibration difficult and limits model performance improvement. Satellite-based remote sensing products offer a new alternative as they can be actively utilized to obtain hydrological data. Recently, several studies have shown that artificial intelligence-based solutions are more appropriate than traditional conceptual and physical models. In this study, a data-driven approach combining various recurrent neural networks and decision tree-based algorithms is proposed, and the utilization of satellite remote sensing information for AI training is investigated. The satellite imagery used in this study is from MODIS and SMAP. The proposed approach is validated using publicly available data from 25 watersheds. Inspired by the traditional regionalization approach, a strategy is adopted to learn one data-driven model by integrating data from all basins, and the potential of the proposed approach is evaluated by using a leave-one-out cross-validation regionalization setting to predict streamflow from different basins with one model. The GRU + Light GBM model was found to be a suitable model combination for target basins and showed good streamflow prediction performance in ungauged basins (The average model efficiency coefficient for predicting daily streamflow in 25 ungauged basins is 0.7187) except for the period when streamflow is very small. The influence of satellite remote sensing information was found to be up to 10%, with the additional application of satellite information having a greater impact on streamflow prediction during low or dry seasons than during wet or normal seasons.
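The leave-one-out regionalization setting can be sketched as follows: each basin is held out in turn and treated as ungauged, while a single model is trained on all the others. The abstract does not name its "model efficiency coefficient". This sketch assumes the Nash-Sutcliffe efficiency, a common choice in hydrology, and the function names are hypothetical.

```python
def nash_sutcliffe(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 matches the observed mean."""
    mean_obs = sum(observed) / len(observed)
    residual = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    variance = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - residual / variance


def leave_one_basin_out(basin_ids):
    """Yield (training basins, held-out basin), treating each basin as ungauged once."""
    for held_out in basin_ids:
        train = [b for b in basin_ids if b != held_out]
        yield train, held_out


# Hypothetical basin identifiers; the study uses 25 watersheds.
splits = list(leave_one_basin_out(["basin_a", "basin_b", "basin_c"]))
```

Averaging the efficiency over all held-out basins gives one summary number of the kind the abstract reports (0.7187 over 25 basins).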

Optimization of Multiclass Support Vector Machine using Genetic Algorithm: Application to the Prediction of Corporate Credit Rating (유전자 알고리즘을 이용한 다분류 SVM의 최적화: 기업신용등급 예측에의 응용)

  • Ahn, Hyunchul
    • Information Systems Review / v.16 no.3 / pp.161-177 / 2014
  • Corporate credit rating assessment consists of complicated processes in which various factors describing a company are taken into consideration. Such assessment is known to be very expensive because domain experts must be employed to assess the ratings. As a result, data-driven corporate credit rating prediction using statistical and artificial intelligence (AI) techniques has received considerable attention from researchers and practitioners. In particular, statistical methods such as multiple discriminant analysis (MDA) and multinomial logistic regression analysis (MLOGIT), and AI methods including case-based reasoning (CBR), artificial neural network (ANN), and multiclass support vector machine (MSVM), have been applied to corporate credit rating. Among them, MSVM has recently become popular because of its robustness and high prediction accuracy. In this study, we propose a novel optimized MSVM model and apply it to corporate credit rating prediction in order to enhance accuracy. Our model, named 'GAMSVM (Genetic Algorithm-optimized Multiclass Support Vector Machine),' is designed to simultaneously optimize the kernel parameters and the feature subset selection. Prior studies such as Lorena and de Carvalho (2008) and Chatterjee (2013) show that proper kernel parameters may improve the performance of MSVMs. Also, the results of studies such as Shieh and Yang (2008) and Chatterjee (2013) imply that appropriate feature selection may lead to higher prediction accuracy. Based on these prior studies, we propose to apply GAMSVM to corporate credit rating prediction. As the tool for optimizing the kernel parameters and the feature subset selection, we suggest the genetic algorithm (GA). GA is known as an efficient and effective search method that attempts to simulate biological evolution. By applying genetic operations such as selection, crossover, and mutation, it is designed to gradually improve the search results. In particular, the mutation operator helps prevent GA from falling into local optima, so it can find a globally optimal or near-optimal solution. GA has been widely applied to search for optimal parameters or feature subsets of AI techniques including MSVM. For these reasons, we also adopt GA as the optimization tool. To empirically validate the usefulness of GAMSVM, we applied it to a real-world case of credit rating in Korea. Our application is in bond rating, which is the most frequently studied area of credit rating for specific debt issues or other financial obligations. The experimental dataset was collected from a large credit rating company in South Korea. It contained 39 financial ratios of 1,295 companies in the manufacturing industry, together with their credit ratings. Using various statistical methods including one-way ANOVA and stepwise MDA, we selected 14 financial ratios as the candidate independent variables. The dependent variable, credit rating, was labeled with four classes: 1 (A1), 2 (A2), 3 (A3), and 4 (B and C). For each class, 80 percent of the data was used for training and the remaining 20 percent for validation. To overcome the small sample size, we applied five-fold cross validation to the dataset. To examine the competitiveness of the proposed model, we also tested several comparative models including MDA, MLOGIT, CBR, ANN, and MSVM. For MSVM, we adopted the One-Against-One (OAO) and DAGSVM (Directed Acyclic Graph SVM) approaches because they are known to be the most accurate among the various MSVM approaches. GAMSVM was implemented using LIBSVM, an open-source software package, and Evolver 5.5, a commercial software package that provides GA. The other comparative models were tested using various statistical and AI packages such as SPSS for Windows, Neuroshell, and Microsoft Excel VBA (Visual Basic for Applications). Experimental results showed that the proposed model, GAMSVM, outperformed all the comparative models. In addition, the model was found to use fewer independent variables while showing higher accuracy. In our experiments, five variables, X7 (total debt), X9 (sales per employee), X13 (years after founded), X15 (accumulated earning to total asset), and X39 (the index related to the cash flows from operating activity), were found to be the most important factors in predicting corporate credit ratings. However, the values of the finally selected kernel parameters were found to be almost the same among the data subsets. To examine whether the predictive performance of GAMSVM was significantly greater than that of the other models, we used the McNemar test. As a result, we found that GAMSVM was better than MDA, MLOGIT, CBR, and ANN at the 1% significance level, and better than OAO and DAGSVM at the 5% significance level.
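The selection, crossover, and mutation loop described above can be sketched on a bit mask of selected features. This is only an illustration under stated assumptions. The paper uses Evolver 5.5 with LIBSVM and also encodes kernel parameters, whereas this sketch uses a hypothetical toy fitness function in place of cross-validated MSVM accuracy. Tournament selection, one-point crossover, and elitism are choices made for the example.

```python
import random


def run_ga(fitness, n_bits, pop_size=20, generations=30,
           crossover_rate=0.8, mutation_rate=0.05, seed=0):
    """Evolve bit masks; returns the best mask and the best fitness per generation."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    history = [fitness(best)]

    def tournament():
        # Selection: the fitter of two randomly drawn individuals becomes a parent.
        a, b = rng.sample(population, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        children = [best[:]]  # elitism keeps the best mask found so far
        while len(children) < pop_size:
            p1, p2 = tournament(), tournament()
            if rng.random() < crossover_rate:
                cut = rng.randint(1, n_bits - 1)  # one-point crossover
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # Mutation: flip each bit with a small probability.
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]
            children.append(child)
        population = children
        best = max(population, key=fitness)
        history.append(fitness(best))
    return best, history


# Hypothetical stand-in for classifier accuracy: reward three "useful"
# features and penalize every extra feature that is switched on.
USEFUL = {0, 3, 4}


def toy_fitness(mask):
    hits = sum(1 for i in USEFUL if mask[i])
    extras = sum(mask) - hits
    return hits - 0.5 * extras


best_mask, history = run_ga(toy_fitness, n_bits=8)
```

Because of elitism, the best fitness never decreases from one generation to the next. In a real setting, each fitness call would train and validate a classifier, which makes population size and generation count the main cost drivers.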

A study on the optimization of tunnel support patterns using ANN and SVR algorithms (ANN 및 SVR 알고리즘을 활용한 최적 터널지보패턴 선정에 관한 연구)

  • Lee, Je-Kyum;Kim, YangKyun;Lee, Sean Seungwon
    • Journal of Korean Tunnelling and Underground Space Association / v.24 no.6 / pp.617-628 / 2022
  • When constructing a tunnel, a ground support pattern should be designed by properly integrating various support materials in accordance with the rock mass grade, and this process requires technical decisions by professionals with extensive construction experience. However, designing supports at an early stage of tunnel design, such as the feasibility study or basic design, can be very challenging because of the short timeline, insufficient budget, and lack of field data. Meanwhile, with the rapid increase in tunnel construction in South Korea, design data have accumulated, and the support pattern can be designed more quickly and reliably by applying machine learning to these data. Therefore, in this study, the design data and ground exploration data of 48 road tunnels in South Korea were inspected, and data on 19 items were collected to automatically determine the rock mass class and the support pattern: eight input items (rock type, resistivity, depth, tunnel length, safety index by tunnel length, safety index by risk index, tunnel type, tunnel area) and 11 output items (rock mass grade, two items for shotcrete, three items for rock bolt, three items for steel support, two items for concrete lining). Three machine learning models (S1, A1, A2) were developed using two machine learning algorithms (SVR, ANN) and the organized data. As a result, the A2 model, which applied different loss functions according to the output data format, showed the best performance. This study confirms the potential of support pattern design using machine learning, and the design model is expected to improve as it is used continuously in actual design, its shortcomings are compensated for, and its usability is enhanced.

A Study on Market Size Estimation Method by Product Group Using Word2Vec Algorithm (Word2Vec을 활용한 제품군별 시장규모 추정 방법에 관한 연구)

  • Jung, Ye Lim;Kim, Ji Hui;Yoo, Hyoung Sun
    • Journal of Intelligence and Information Systems / v.26 no.1 / pp.1-21 / 2020
  • With the rapid development of artificial intelligence technology, various techniques have been developed to extract meaningful information from unstructured text data which constitutes a large portion of big data. Over the past decades, text mining technologies have been utilized in various industries for practical applications. In the field of business intelligence, it has been employed to discover new market and/or technology opportunities and support rational decision making of business participants. The market information such as market size, market growth rate, and market share is essential for setting companies' business strategies. There has been a continuous demand in various fields for specific product level-market information. However, the information has been generally provided at industry level or broad categories based on classification standards, making it difficult to obtain specific and proper information. In this regard, we propose a new methodology that can estimate the market sizes of product groups at more detailed levels than that of previously offered. We applied Word2Vec algorithm, a neural network based semantic word embedding model, to enable automatic market size estimation from individual companies' product information in a bottom-up manner. The overall process is as follows: First, the data related to product information is collected, refined, and restructured into suitable form for applying Word2Vec model. Next, the preprocessed data is embedded into vector space by Word2Vec and then the product groups are derived by extracting similar products names based on cosine similarity calculation. Finally, the sales data on the extracted products is summated to estimate the market size of the product groups. As an experimental data, text data of product names from Statistics Korea's microdata (345,103 cases) were mapped in multidimensional vector space by Word2Vec training. 
We performed parameters optimization for training and then applied vector dimension of 300 and window size of 15 as optimized parameters for further experiments. We employed index words of Korean Standard Industry Classification (KSIC) as a product name dataset to more efficiently cluster product groups. The product names which are similar to KSIC indexes were extracted based on cosine similarity. The market size of extracted products as one product category was calculated from individual companies' sales data. The market sizes of 11,654 specific product lines were automatically estimated by the proposed model. For the performance verification, the results were compared with actual market size of some items. The Pearson's correlation coefficient was 0.513. Our approach has several advantages differing from the previous studies. First, text mining and machine learning techniques were applied for the first time on market size estimation, overcoming the limitations of traditional sampling based- or multiple assumption required-methods. In addition, the level of market category can be easily and efficiently adjusted according to the purpose of information use by changing cosine similarity threshold. Furthermore, it has a high potential of practical applications since it can resolve unmet needs for detailed market size information in public and private sectors. Specifically, it can be utilized in technology evaluation and technology commercialization support program conducted by governmental institutions, as well as business strategies consulting and market analysis report publishing by private firms. The limitation of our study is that the presented model needs to be improved in terms of accuracy and reliability. The semantic-based word embedding module can be advanced by giving a proper order in the preprocessed dataset or by combining another algorithm such as Jaccard similarity with Word2Vec. 
Also, the methods of product group clustering can be changed to other types of unsupervised machine learning algorithm. Our group is currently working on subsequent studies and we expect that it can further improve the performance of the conceptually proposed basic model in this study.
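The grouping and summation steps described in this abstract can be sketched as follows: product names whose embedding is close enough to an index word's embedding form one product group, and their sales are summed. The three-dimensional vectors, sales figures, threshold value, and function names are hypothetical. The study trained 300-dimensional Word2Vec vectors on Statistics Korea microdata.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors, 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def estimate_market_size(index_vector, products, threshold=0.8):
    """Group products similar to the index word and sum their sales."""
    members = [name for name, (vector, _sales) in products.items()
               if cosine_similarity(index_vector, vector) >= threshold]
    total = sum(products[name][1] for name in members)
    return members, total


# Hypothetical embeddings and sales: name -> (vector, sales).
products = {
    "green tea": ([0.9, 0.1, 0.0], 120.0),
    "black tea": ([0.8, 0.2, 0.1], 80.0),
    "steel pipe": ([0.0, 0.1, 0.9], 500.0),
}
members, total = estimate_market_size([1.0, 0.1, 0.0], products)
```

Raising the threshold narrows the product group and lowering it widens the group, which is how the abstract describes adjusting the level of the market category.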

A Study on the Prediction Model of Stock Price Index Trend based on GA-MSVM that Simultaneously Optimizes Feature and Instance Selection (입력변수 및 학습사례 선정을 동시에 최적화하는 GA-MSVM 기반 주가지수 추세 예측 모형에 관한 연구)

  • Lee, Jong-sik;Ahn, Hyunchul
    • Journal of Intelligence and Information Systems / v.23 no.4 / pp.147-168 / 2017
  • Accurate stock market forecasting has long been studied in academia, and many forecasting models based on various techniques now exist. Recently, many attempts have been made to predict the stock index using machine learning methods, including deep learning. Although both fundamental analysis and technical analysis are used in traditional stock investment, technical analysis is more useful for short-term trading prediction and for applying statistical and mathematical techniques. Most studies using technical indicators have built models that predict stock prices through binary classification, rising or falling, of future market fluctuations (usually for the next trading day). However, binary classification has many drawbacks for predicting trends, identifying trading signals, or signaling portfolio rebalancing. In this study, we predict the stock index by extending the existing binary scheme to a multi-class system of stock index trends (upward trend, boxed, downward trend). To solve this multi-class problem, rather than techniques such as Multinomial Logistic Regression Analysis (MLOGIT), Multiple Discriminant Analysis (MDA), or Artificial Neural Networks (ANN), we use Multi-classification Support Vector Machines (MSVM), which have proved superior in prediction performance, and propose an optimization model that uses a Genetic Algorithm as a wrapper to improve the performance of the MSVM. In particular, the proposed model, named GA-MSVM, is designed to maximize model performance by optimizing not only the kernel function parameters of MSVM but also the selection of input variables (feature selection) and the selection of training cases (instance selection). To verify the performance of the proposed model, we applied it to real data. The results show that the proposed method predicts better than the conventional multiclass SVM, which has been known to show the best prediction performance so far, as well as existing artificial intelligence and data mining techniques such as MDA, MLOGIT, and CBR. In particular, instance selection was confirmed to play a very important role in predicting the stock index trend and to contribute more to model improvement than the other factors. To verify the usefulness of GA-MSVM, we applied it to trend forecasting of Korea's KOSPI200 stock index. Our research is primarily aimed at predicting trend segments in order to capture trading signals or short-term trend transition points. The experimental dataset includes technical indicators such as the price and volatility index of the KOSPI200 stock index in Korea (2004 ~ 2017) and macroeconomic data (interest rate, exchange rate, S&P 500, etc.). Using a variety of statistical methods including one-way ANOVA and stepwise MDA, 15 indicators were selected as candidate independent variables. The dependent variable, trend classification, took three states: 1 (upward trend), 0 (boxed), and -1 (downward trend). For each class, 70% of the data was used for training and the remaining 30% for validation. To verify the performance of the proposed model, comparative experiments were conducted with several models, including MDA, MLOGIT, CBR, ANN, and MSVM. For MSVM, we adopted the One-Against-One (OAO) approach, which is known as the most accurate among the various MSVM approaches. Although there are some limitations, the final experimental results demonstrate that the proposed model, GA-MSVM, performs at a significantly higher level than all comparative models.
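One way to optimize kernel parameters, feature selection, and instance selection together is to pack all three into a single chromosome. The abstract does not describe the encoding, so the layout below is purely an assumption: two genes in [0, 1] mapped on a log scale to the SVM parameters C and gamma, followed by one gene per candidate feature and one gene per training instance. The parameter ranges and the function name are also hypothetical.

```python
def decode_chromosome(genes, n_features, n_instances,
                      c_range=(0.1, 100.0), gamma_range=(0.001, 10.0)):
    """Split one chromosome into kernel parameters and two selection lists."""
    expected = 2 + n_features + n_instances
    if len(genes) != expected:
        raise ValueError(f"expected {expected} genes, got {len(genes)}")

    def log_scale(x, low, high):
        # Map x in [0, 1] onto [low, high] with equal steps per order of magnitude.
        return low * (high / low) ** x

    feature_genes = genes[2:2 + n_features]
    instance_genes = genes[2 + n_features:]
    return {
        "C": log_scale(genes[0], *c_range),
        "gamma": log_scale(genes[1], *gamma_range),
        # A gene of 0.5 or more switches the feature or instance on.
        "features": [i for i, g in enumerate(feature_genes) if g >= 0.5],
        "instances": [i for i, g in enumerate(instance_genes) if g >= 0.5],
    }


# Hypothetical chromosome: 2 kernel genes, 3 feature genes, 4 instance genes.
decoded = decode_chromosome([0.0, 1.0, 1, 0, 1, 0, 1, 1, 0],
                            n_features=3, n_instances=4)
```

A wrapper GA would train the MSVM on the selected instances and features with the decoded parameters, then use validation accuracy as the fitness of the chromosome.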

Improving the Accuracy of Document Classification by Learning Heterogeneity (이질성 학습을 통한 문서 분류의 정확성 향상 기법)

  • Wong, William Xiu Shun;Hyun, Yoonjin;Kim, Namgyu
    • Journal of Intelligence and Information Systems / v.24 no.3 / pp.21-44 / 2018
  • In recent years, the rapid development of internet technology and the popularization of smart devices have resulted in massive amounts of text data. These text data are produced and distributed through various media platforms such as the World Wide Web, Internet news feeds, microblogs, and social media. However, this enormous amount of easily obtained information lacks organization. This problem has drawn the interest of many researchers who seek to manage such huge amounts of information. It also requires professionals capable of classifying relevant information, and hence text classification was introduced. Text classification is a challenging task in modern data analysis, in which a text document must be assigned to one or more predefined categories or classes. Various techniques are available for text classification, such as K-Nearest Neighbor, Naïve Bayes, Support Vector Machine, Decision Tree, and Artificial Neural Network. However, when dealing with huge amounts of text data, model performance and accuracy become a challenge. The performance of a text classification model can vary with the type of words used in the corpus and the type of features created for classification. Most previous attempts have proposed a new algorithm or modified an existing one, and this line of research can be said to have reached its limits for further improvement. In this study, instead of proposing a new algorithm or modifying an existing one, we focus on changing how the data are used. It is widely known that classifier performance is influenced by the quality of the training data on which the classifier is built. Real-world datasets usually contain noise, and this noisy data can affect the decisions made by classifiers built from them. In this study, we consider that data from different domains, that is, heterogeneous data, may have noise-like characteristics that can be utilized in the classification process. Machine learning algorithms build classifiers on the assumption that the characteristics of the training data and the target data are the same or very similar. However, in the case of unstructured data such as text, the features are determined by the vocabulary included in the documents. If the viewpoints of the training data and the target data differ, the features may also differ between the two. In this study, we attempt to improve classification accuracy by strengthening the robustness of the document classifier through artificially injecting noise into the process of constructing it. Data coming from various kinds of sources are likely to be formatted differently. This causes difficulties for traditional machine learning algorithms, because they were not developed to recognize different types of data representation at one time and to combine them in the same generalization. Therefore, to utilize heterogeneous data in the learning process of the document classifier, we apply semi-supervised learning. However, unlabeled data may degrade the performance of the document classifier. Therefore, we further propose a method called Rule Selection-Based Ensemble Semi-Supervised Learning Algorithm (RSESLA), which selects only the documents that contribute to improving the accuracy of the classifier. RSESLA creates multiple views by manipulating the features using different types of classification models and different types of heterogeneous data. The most confident classification rules are selected and applied for the final decision making. In this paper, three different types of real-world data sources were used: news, Twitter, and blogs.