• Title/Summary/Keyword: kernel regression

Search Result 240, Processing Time 0.025 seconds

Ensemble Machine Learning Model Based YouTube Spam Comment Detection (앙상블 머신러닝 모델 기반 유튜브 스팸 댓글 탐지)

  • Jeong, Min Chul;Lee, Jihyeon;Oh, Hayoung
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.24 no.5
    • /
    • pp.576-583
    • /
    • 2020
  • This paper proposes a technique to determine the spam comments on YouTube, which have recently seen tremendous growth. On YouTube, the spammers appeared to promote their channels or videos in popular videos or leave comments unrelated to the video, as it is possible to monetize through advertising. YouTube is running and operating its own spam blocking system, but still has failed to block them properly and efficiently. Therefore, we examined related studies on YouTube spam comment screening and conducted classification experiments with six different machine learning techniques (Decision tree, Logistic regression, Bernoulli Naive Bayes, Random Forest, Support vector machine with linear kernel, Support vector machine with Gaussian kernel) and ensemble model combining these techniques in the comment data from popular music videos - Psy, Katy Perry, LMFAO, Eminem and Shakira.

Optimization of Support Vector Machines for Financial Forecasting (재무예측을 위한 Support Vector Machine의 최적화)

  • Kim, Kyoung-Jae;Ahn, Hyun-Chul
    • Journal of Intelligence and Information Systems
    • /
    • v.17 no.4
    • /
    • pp.241-254
    • /
    • 2011
  • Financial time-series forecasting is one of the most important issues because it is essential for the risk management of financial institutions. Therefore, researchers have tried to forecast financial time-series using various data mining techniques such as regression, artificial neural networks, decision trees, k-nearest neighbor etc. Recently, support vector machines (SVMs) are popularly applied to this research area because they have advantages that they don't require huge training data and have low possibility of overfitting. However, a user must determine several design factors by heuristics in order to use SVM. For example, the selection of appropriate kernel function and its parameters and proper feature subset selection are major design factors of SVM. Other than these factors, the proper selection of instance subset may also improve the forecasting performance of SVM by eliminating irrelevant and distorting training instances. Nonetheless, there have been few studies that have applied instance selection to SVM, especially in the domain of stock market prediction. Instance selection tries to choose proper instance subsets from original training data. It may be considered as a method of knowledge refinement and it maintains the instance-base. This study proposes the novel instance selection algorithm for SVMs. The proposed technique in this study uses genetic algorithm (GA) to optimize instance selection process with parameter optimization simultaneously. We call the model as ISVM (SVM with Instance selection) in this study. Experiments on stock market data are implemented using ISVM. In this study, the GA searches for optimal or near-optimal values of kernel parameters and relevant instances for SVMs. This study needs two sets of parameters in chromosomes in GA setting : The codes for kernel parameters and for instance selection. For the controlling parameters of the GA search, the population size is set at 50 organisms and the value of the crossover rate is set at 0.7 while the mutation rate is 0.1. As the stopping condition, 50 generations are permitted. The application data used in this study consists of technical indicators and the direction of change in the daily Korea stock price index (KOSPI). The total number of samples is 2218 trading days. We separate the whole data into three subsets as training, test, hold-out data set. The number of data in each subset is 1056, 581, 581 respectively. This study compares ISVM to several comparative models including logistic regression (logit), backpropagation neural networks (ANN), nearest neighbor (1-NN), conventional SVM (SVM) and SVM with the optimized parameters (PSVM). In especial, PSVM uses optimized kernel parameters by the genetic algorithm. The experimental results show that ISVM outperforms 1-NN by 15.32%, ANN by 6.89%, Logit and SVM by 5.34%, and PSVM by 4.82% for the holdout data. For ISVM, only 556 data from 1056 original training data are used to produce the result. In addition, the two-sample test for proportions is used to examine whether ISVM significantly outperforms other comparative models. The results indicate that ISVM outperforms ANN and 1-NN at the 1% statistical significance level. In addition, ISVM performs better than Logit, SVM and PSVM at the 5% statistical significance level.

Predicting Interesting Web Pages by SVM and Logit-regression (SVM과 로짓회귀분석을 이용한 흥미있는 웹페이지 예측)

  • Jeon, Dohong;Kim, Hyoungrae
    • Journal of the Korea Society of Computer and Information
    • /
    • v.20 no.3
    • /
    • pp.47-56
    • /
    • 2015
  • Automated detection of interesting web pages could be used in many different application domains. Determining a user's interesting web pages can be performed implicitly by observing the user's behavior. The task of distinguishing interesting web pages belongs to a classification problem, and we choose white box learning methods (fixed effect logit regression and support vector machine) to test empirically. The result indicated that (1) fixed effect logit regression, fixed effect SVMs with both polynomial and radial basis kernels showed higher performance than the linear kernel model, (2) a personalization is a critical issue for improving the performance of a model, (3) when asking a user explicit grading of web pages, the scale could be as simple as yes/no answer, (4) every second the duration in a web page increases, the ratio of the probability to be interesting increased 1.004 times, but the number of scrollbar clicks (p=0.56) and the number of mouse clicks (p=0.36) did not have statistically significant relations with the interest.

An Energy Consumption Prediction Model for Smart Factory Using Data Mining Algorithms (데이터 마이닝 기반 스마트 공장 에너지 소모 예측 모델)

  • Sathishkumar, VE;Lee, Myeongbae;Lim, Jonghyun;Kim, Yubin;Shin, Changsun;Park, Jangwoo;Cho, Yongyun
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.9 no.5
    • /
    • pp.153-160
    • /
    • 2020
  • Energy Consumption Predictions for Industries has a prominent role to play in the energy management and control system as dynamic and seasonal changes are occurring in energy demand and supply. This paper introduces and explores the steel industry's predictive models of energy consumption. The data used includes lagging and leading reactive power lagging and leading current variable, emission of carbon dioxide (tCO2) and load type. Four statistical models are trained and tested in the test set: (a) Linear Regression (LR), (b) Radial Kernel Support Vector Machine (SVM RBF), (c) Gradient Boosting Machine (GBM), and (d) Random Forest (RF). Root Mean Squared Error (RMSE), Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE) are used for calculating regression model predictive performance. When using all the predictors, the best model RF can provide RMSE value 7.33 in the test set.

A Robust Depth Map Upsampling Against Camera Calibration Errors (카메라 보정 오류에 강건한 깊이맵 업샘플링 기술)

  • Kim, Jae-Kwang;Lee, Jae-Ho;Kim, Chang-Ick
    • Journal of the Institute of Electronics Engineers of Korea SP
    • /
    • v.48 no.6
    • /
    • pp.8-17
    • /
    • 2011
  • Recently, fusion camera systems that consist of depth sensors and color cameras have been widely developed with the advent of a new type of sensor, time-of-flight (TOF) depth sensor. The physical limitation of depth sensors usually generates low resolution images compared to corresponding color images. Therefore, the pre-processing module, such as camera calibration, three dimensional warping, and hole filling, is necessary to generate the high resolution depth map that is placed in the image plane of the color image. However, the result of the pre-processing step is usually inaccurate due to errors from the camera calibration and the depth measurement. Therefore, in this paper, we present a depth map upsampling method robust these errors. First, the confidence of the measured depth value is estimated by the interrelation between the color image and the pre-upsampled depth map. Then, the detailed depth map can be generated by the modified kernel regression method which exclude depth values having low confidence. Our proposed algorithm guarantees the high quality result in the presence of the camera calibration errors. Experimental comparison with other data fusion techniques shows the superiority of our proposed method.

Major Characteristics Affecting Popping Volume of Popcorn (튀김옥수수의 튀김부피에 영향을 미치는 주요특성)

  • 김선림;박승의;차선우;서종호
    • KOREAN JOURNAL OF CROP SCIENCE
    • /
    • v.40 no.2
    • /
    • pp.167-174
    • /
    • 1995
  • This experiment was carried out to investigate the major characters affecting the popping volume of popcorn. Tuygimok 1 (Kp1 ${\times}$ Kp2) and 8 popcorn hybrids' agronomic characters were tested to evaluate a certain extent how much they affect on the popping volume. Moisture con-tent was considered as the most important factor, but failed to evaluate the optimum moisture con-tent level in this experiment moisture range (12.2-14.4%) because popping volume increased as moisture content of kernels increased. The maximum popping volume was obtained at 55-60kg of kernel hardness, 80-90,um of pericarp thickness and 45-50% of S/H (Soft/Hard starch). But the Em/En(Embryo/Endosperm) ratio was negatively associated with the popping volume. Therefore the minimum popping volume was observed at the 10-11 % of Em/En ratio. Moisture content, hardness, pericarp thickness, Em/En and S/H ratio were selected as the appropriate variables for the maxi-mum popping volume using the stepwise forward regression method and the expecting popping volume was estimated by the multiple linear regression formular. The mean popping volume of ninepopcorn hybrids was about 29.2cm3/g.

  • PDF

Online Information Retrieval and Changes in the Restaurant Location: The Case Study of Seoul (온라인 정보검색과 음식점 입지에 나타나는 변화: 서울시를 사례로)

  • Lee, Keumsook;Park, Sohyun;Shin, Hyeyoung
    • Journal of the Economic Geographical Society of Korea
    • /
    • v.23 no.1
    • /
    • pp.56-70
    • /
    • 2020
  • This study identifies the impact of social network service (SNS) on the spatial characteristics of retail stores locations in the hyper-connected society, which have been closely related to the everyday lives of urban residents. In particular, we focus on the changes in the spatial distribution of restaurants since the information retrieval process was added to the decision-making process of a consumer's restaurant selection. Empirically, we analyze restaurants in Seoul, Korea since the smart-phone was introduced. By applying the kernel density estimation and Moran's I index, we examine the changes in the spatial distribution pattern of restaurants during the last ten years for running, newly-open and closed restaurants as well as SNS popular ones. Finally, we develop a spatial regression model to identify geographic features affecting their locations. As the results, we identified geographical variables and online factors that influence the location of restaurants. The results of this study could provide important groundwork for food and beverage location planning and policy formulation.

Optimizing Clustering and Predictive Modelling for 3-D Road Network Analysis Using Explainable AI

  • Rotsnarani Sethy;Soumya Ranjan Mahanta;Mrutyunjaya Panda
    • International Journal of Computer Science & Network Security
    • /
    • v.24 no.9
    • /
    • pp.30-40
    • /
    • 2024
  • Building an accurate 3-D spatial road network model has become an active area of research now-a-days that profess to be a new paradigm in developing Smart roads and intelligent transportation system (ITS) which will help the public and private road impresario for better road mobility and eco-routing so that better road traffic, less carbon emission and road safety may be ensured. Dealing with such a large scale 3-D road network data poses challenges in getting accurate elevation information of a road network to better estimate the CO2 emission and accurate routing for the vehicles in Internet of Vehicle (IoV) scenario. Clustering and regression techniques are found suitable in discovering the missing elevation information in 3-D spatial road network dataset for some points in the road network which is envisaged of helping the public a better eco-routing experience. Further, recently Explainable Artificial Intelligence (xAI) draws attention of the researchers to better interprete, transparent and comprehensible, thus enabling to design efficient choice based models choices depending upon users requirements. The 3-D road network dataset, comprising of spatial attributes (longitude, latitude, altitude) of North Jutland, Denmark, collected from publicly available UCI repositories is preprocessed through feature engineering and scaling to ensure optimal accuracy for clustering and regression tasks. K-Means clustering and regression using Support Vector Machine (SVM) with radial basis function (RBF) kernel are employed for 3-D road network analysis. Silhouette scores and number of clusters are chosen for measuring cluster quality whereas error metric such as MAE ( Mean Absolute Error) and RMSE (Root Mean Square Error) are considered for evaluating the regression method. To have better interpretability of the Clustering and regression models, SHAP (Shapley Additive Explanations), a powerful xAI technique is employed in this research. From extensive experiments , it is observed that SHAP analysis validated the importance of latitude and altitude in predicting longitude, particularly in the four-cluster setup, providing critical insights into model behavior and feature contributions SHAP analysis validated the importance of latitude and altitude in predicting longitude, particularly in the four-cluster setup, providing critical insights into model behavior and feature contributions with an accuracy of 97.22% and strong performance metrics across all classes having MAE of 0.0346, and MSE of 0.0018. On the other hand, the ten-cluster setup, while faster in SHAP analysis, presented challenges in interpretability due to increased clustering complexity. Hence, K-Means clustering with K=4 and SVM hybrid models demonstrated superior performance and interpretability, highlighting the importance of careful cluster selection to balance model complexity and predictive accuracy.

Optimization of Multiclass Support Vector Machine using Genetic Algorithm: Application to the Prediction of Corporate Credit Rating (유전자 알고리즘을 이용한 다분류 SVM의 최적화: 기업신용등급 예측에의 응용)

  • Ahn, Hyunchul
    • Information Systems Review
    • /
    • v.16 no.3
    • /
    • pp.161-177
    • /
    • 2014
  • Corporate credit rating assessment consists of complicated processes in which various factors describing a company are taken into consideration. Such assessment is known to be very expensive since domain experts should be employed to assess the ratings. As a result, the data-driven corporate credit rating prediction using statistical and artificial intelligence (AI) techniques has received considerable attention from researchers and practitioners. In particular, statistical methods such as multiple discriminant analysis (MDA) and multinomial logistic regression analysis (MLOGIT), and AI methods including case-based reasoning (CBR), artificial neural network (ANN), and multiclass support vector machine (MSVM) have been applied to corporate credit rating.2) Among them, MSVM has recently become popular because of its robustness and high prediction accuracy. In this study, we propose a novel optimized MSVM model, and appy it to corporate credit rating prediction in order to enhance the accuracy. Our model, named 'GAMSVM (Genetic Algorithm-optimized Multiclass Support Vector Machine),' is designed to simultaneously optimize the kernel parameters and the feature subset selection. Prior studies like Lorena and de Carvalho (2008), and Chatterjee (2013) show that proper kernel parameters may improve the performance of MSVMs. Also, the results from the studies such as Shieh and Yang (2008) and Chatterjee (2013) imply that appropriate feature selection may lead to higher prediction accuracy. Based on these prior studies, we propose to apply GAMSVM to corporate credit rating prediction. As a tool for optimizing the kernel parameters and the feature subset selection, we suggest genetic algorithm (GA). GA is known as an efficient and effective search method that attempts to simulate the biological evolution phenomenon. By applying genetic operations such as selection, crossover, and mutation, it is designed to gradually improve the search results. Especially, mutation operator prevents GA from falling into the local optima, thus we can find the globally optimal or near-optimal solution using it. GA has popularly been applied to search optimal parameters or feature subset selections of AI techniques including MSVM. With these reasons, we also adopt GA as an optimization tool. To empirically validate the usefulness of GAMSVM, we applied it to a real-world case of credit rating in Korea. Our application is in bond rating, which is the most frequently studied area of credit rating for specific debt issues or other financial obligations. The experimental dataset was collected from a large credit rating company in South Korea. It contained 39 financial ratios of 1,295 companies in the manufacturing industry, and their credit ratings. Using various statistical methods including the one-way ANOVA and the stepwise MDA, we selected 14 financial ratios as the candidate independent variables. The dependent variable, i.e. credit rating, was labeled as four classes: 1(A1); 2(A2); 3(A3); 4(B and C). 80 percent of total data for each class was used for training, and remaining 20 percent was used for validation. And, to overcome small sample size, we applied five-fold cross validation to our dataset. In order to examine the competitiveness of the proposed model, we also experimented several comparative models including MDA, MLOGIT, CBR, ANN and MSVM. In case of MSVM, we adopted One-Against-One (OAO) and DAGSVM (Directed Acyclic Graph SVM) approaches because they are known to be the most accurate approaches among various MSVM approaches. GAMSVM was implemented using LIBSVM-an open-source software, and Evolver 5.5-a commercial software enables GA. Other comparative models were experimented using various statistical and AI packages such as SPSS for Windows, Neuroshell, and Microsoft Excel VBA (Visual Basic for Applications). Experimental results showed that the proposed model-GAMSVM-outperformed all the competitive models. In addition, the model was found to use less independent variables, but to show higher accuracy. In our experiments, five variables such as X7 (total debt), X9 (sales per employee), X13 (years after founded), X15 (accumulated earning to total asset), and X39 (the index related to the cash flows from operating activity) were found to be the most important factors in predicting the corporate credit ratings. However, the values of the finally selected kernel parameters were found to be almost same among the data subsets. To examine whether the predictive performance of GAMSVM was significantly greater than those of other models, we used the McNemar test. As a result, we found that GAMSVM was better than MDA, MLOGIT, CBR, and ANN at the 1% significance level, and better than OAO and DAGSVM at the 5% significance level.

Constructing Negative Links from Multi-facet of Social Media

  • Li, Lin;Yan, YunYi;Jia, LiBin;Ma, Jun
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.11 no.5
    • /
    • pp.2484-2498
    • /
    • 2017
  • Various types of social media make the people share their personal experience in different ways. In some social networking sites. Some users post their reviews, some users can support these reviews with comments, and some users just rate the reviews as kind of support or not. Unfortunately, there is rare explicit negative comments towards other reviews. This means if there is a link between two users, it must be positive link. Apparently, the negative link is invisible in these social network. Or in other word, the negative links are redundant to positive links. In this work, we first discuss the feature extraction from social media data and propose new method to compute the distance between each pair of comments or reviews on social media. Then we investigate whether we can predict negative links via regression analysis when only positive links are manifested from social media data. In particular, we provide a principled way to mathematically incorporate multi-facet data in a novel framework, Constructing Negative Links, CsNL to predict negative links for discovering the hidden information. Additionally, we investigate the ways of solution to general negative link predication problems with CsNL and its extension. Experiments are performed on real-world data and results show that negative links is predictable with multi-facet of social media data by the proposed framework CsNL. Essentially, high prediction accuracy suggests that negative links are redundant to positive links. Further experiments are performed to evaluate coefficients on different kernels. The results show that user generated content dominates the prediction performance of CsNL.