• Title/Summary/Keyword: Feature Variables


Prediction model of hypercholesterolemia using body fat mass based on machine learning (머신러닝 기반 체지방 측정정보를 이용한 고콜레스테롤혈증 예측모델)

  • Lee, Bum Ju
    • The Journal of the Convergence on Culture Technology
    • /
    • v.5 no.4
    • /
    • pp.413-420
    • /
    • 2019
  • The purpose of the present study is to develop a model for predicting hypercholesterolemia using an integrated set of body fat mass variables based on machine learning techniques, going beyond studies of the association between body fat mass and hypercholesterolemia. A total of six models were created by combining two variable subset selection methods with machine learning algorithms on Korea National Health and Nutrition Examination Survey (KNHANES) data. Among the various body fat mass variables, trunk fat mass was found to be the best predictor of hypercholesterolemia. The model combining correlation-based feature subset selection with the naive Bayes algorithm achieved an area under the receiver operating characteristic curve of 0.739 and a Matthews correlation coefficient of 0.36. These findings are expected to serve as important information for disease prediction in large-scale screening and public health research.
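The pipeline described above (correlation-based feature selection, then naive Bayes evaluated with AUC and MCC) can be sketched as follows. This is a minimal illustration on synthetic data, not the KNHANES data; the feature count, the label-generating rule, and the "select the two most correlated features" rule are all assumptions for the example.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, matthews_corrcoef

rng = np.random.default_rng(0)

# Synthetic stand-ins for body fat mass variables (trunk, arm, leg fat, ...).
n = 1000
X = rng.normal(size=(n, 6))
# The label depends mostly on feature 0 (a "trunk fat mass" stand-in).
y = (X[:, 0] + 0.3 * X[:, 1] + rng.normal(size=n) > 0).astype(int)

# Correlation-based selection: keep the features most correlated with the label.
corrs = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
top = np.argsort(corrs)[::-1][:2]          # indices of the two strongest features

X_tr, X_te, y_tr, y_te = train_test_split(X[:, top], y, random_state=0)
clf = GaussianNB().fit(X_tr, y_tr)

auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
mcc = matthews_corrcoef(y_te, clf.predict(X_te))
print(top, round(auc, 3), round(mcc, 3))
```

On this toy data the selection correctly ranks feature 0 first, mirroring how trunk fat mass emerged as the dominant variable in the study.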

Fault Diagnosis of Bearing Based on Convolutional Neural Network Using Multi-Domain Features

  • Shao, Xiaorui;Wang, Lijiang;Kim, Chang Soo;Ra, Ilkyeun
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.15 no.5
    • /
    • pp.1610-1629
    • /
    • 2021
  • Failures frequently occur in manufacturing machines due to complex and changeable manufacturing environments, increasing downtime and maintenance costs. This manuscript develops a novel deep learning-based method, the Multi-Domain Convolutional Neural Network (MDCNN), to address this challenging task using vibration signals. The proposed MDCNN consists of time-domain, frequency-domain, and statistical-domain feature channels. The time-domain channel models the hidden patterns of signals in the time domain. The frequency-domain channel uses the Discrete Wavelet Transform (DWT) to obtain rich feature representations of signals in the frequency domain. The statistical-domain channel contains six statistical variables that reflect the signals' macro statistical features. First, the time-domain and frequency-domain channels are processed individually by CNNs with various filters. Second, the CNN-extracted features from the time and frequency domains are merged into time-frequency features. Lastly, the time-frequency features are fused with the six statistical variables to form the comprehensive features used to identify the fault. The proposed method thereby makes full use of all three domains' features for fault diagnosis while maintaining high discriminability thanks to the CNNs. The authors designed extensive experiments with 10-fold cross-validation to validate the method's effectiveness on the CWRU bearing data set. The experimental results, reported as ten-run averaged accuracy, confirm that the proposed MDCNN can intelligently, accurately, and promptly detect faults under complex manufacturing environments, with an accuracy of nearly 100%.
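The three input channels described above can be sketched for a single vibration signal. This is a minimal sketch under assumptions: a one-level Haar DWT stands in for the paper's DWT stage, and the six statistics chosen here (mean, standard deviation, min, max, mean absolute value, RMS) are plausible guesses, since the abstract does not name the exact six; the CNN channels themselves are omitted.

```python
import numpy as np

def haar_dwt(x):
    """One-level Haar DWT: returns (approximation, detail) coefficients."""
    x = x[: len(x) // 2 * 2]                      # truncate to even length
    a = (x[0::2] + x[1::2]) / np.sqrt(2)          # low-pass (approximation)
    d = (x[0::2] - x[1::2]) / np.sqrt(2)          # high-pass (detail)
    return a, d

def statistical_features(x):
    """Six macro statistics; a hypothetical stand-in for the statistical channel."""
    return np.array([x.mean(), x.std(), x.min(), x.max(),
                     np.mean(np.abs(x)),           # mean absolute value
                     np.sqrt(np.mean(x ** 2))])    # root mean square

rng = np.random.default_rng(1)
# Toy "vibration" signal: a sinusoid plus noise.
signal = np.sin(np.linspace(0, 20 * np.pi, 1024)) + 0.1 * rng.normal(size=1024)

time_channel = signal                    # raw waveform for the time-domain CNN
approx, detail = haar_dwt(signal)        # frequency-domain channel input
stats = statistical_features(signal)     # statistical-domain channel input
print(time_channel.shape, approx.shape, detail.shape, stats.shape)
```

In the full MDCNN, the time and DWT channels would each pass through convolutional layers, and the resulting feature maps would be concatenated with the six statistics before the classification head.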

A Study on the Remaining Useful Life Prediction Performance Variation based on Identification and Selection by using SHAP (SHAP를 활용한 중요변수 파악 및 선택에 따른 잔여유효수명 예측 성능 변동에 대한 연구)

  • Yoon, Yeon Ah;Lee, Seung Hoon;Kim, Yong Soo
    • Journal of Korean Society of Industrial and Systems Engineering
    • /
    • v.44 no.4
    • /
    • pp.1-11
    • /
    • 2021
  • Recently, the importance of preventive maintenance has been growing, since failures in complex systems can now be detected automatically thanks to developments in artificial intelligence techniques and sensor technology. Prognostics and health management (PHM) is therefore being actively studied, and predicting the remaining useful life (RUL) of a system is one of its most important tasks. Much research has been conducted on RUL prediction, and deep learning models have been developed to improve prediction performance, but studies identifying the importance of features have not been carried out. While improving the predictive accuracy of RUL is important, it is also very meaningful to extract and interpret the features that affect failures. In this paper, a total of six popular deep learning models were employed to predict the RUL, and important variables were identified for each model through SHAP (SHapley Additive exPlanations), one of the explainable artificial intelligence (XAI) techniques. Moreover, the fluctuations and trends of prediction performance according to the number of variables were identified. This paper demonstrates the explainability of various deep learning models through the application of XAI, and the proposed method suggests the possibility of utilizing SHAP as a feature selection method.
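The idea of ranking features by importance and then selecting among them can be sketched as follows. Note the hedges: the paper uses SHAP (typically via the `shap` package) on deep learning models, whereas this sketch substitutes scikit-learn's permutation importance on a random forest as a related, model-agnostic importance measure; the synthetic "sensor" data and coefficients are assumptions for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic run-to-failure style data: RUL driven mainly by sensors 0 and 1.
n = 800
X = rng.normal(size=(n, 5))
rul = 100 - 20 * X[:, 0] - 10 * X[:, 1] + rng.normal(scale=2.0, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, rul, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Permutation importance as a proxy for SHAP-style global importance:
# shuffle one feature at a time and measure the drop in test performance.
imp = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
ranking = np.argsort(imp.importances_mean)[::-1]
print(ranking)
```

As in the paper's workflow, the ranking could then drive feature selection by retraining with only the top-k variables and tracking how prediction performance fluctuates with k.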

Large Language Models-based Feature Extraction for Short-Term Load Forecasting (거대언어모델 기반 특징 추출을 이용한 단기 전력 수요량 예측 기법)

  • Jaeseung Lee;Jehyeok Rew
    • Journal of Korea Society of Industrial Information Systems
    • /
    • v.29 no.3
    • /
    • pp.51-65
    • /
    • 2024
  • Accurate electrical load forecasting is important for the effective operation of power systems in smart grids. With recent developments in machine learning, artificial intelligence-based models for predicting power demand are being actively researched. However, since existing models take input variables as numerical features, forecasting accuracy may suffer because the semantic relationships between these features are not reflected. In this paper, we propose a scheme for short-term load forecasting that uses features extracted from the input data by large language models. We first convert the input variables into a sentence-like prompt format. Then, we use a large language model with frozen weights to derive embedding vectors that represent the features of the prompt, and these vectors are used to train the forecasting model. Experimental results show that the proposed scheme outperformed models based on numerical data, and by visualizing the attention weights of the large language model over the prompts, we identified the information that significantly influences predictions.
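The two-step pipeline above (serialize numeric inputs into a prompt, then embed with a frozen model) can be sketched as below. This is heavily hedged: `frozen_embed` is a deterministic placeholder, not a real LLM encoder; a real system would take hidden states from a pretrained language model. The prompt template and field names (`date`, `temp`, `prev_load`) are likewise hypothetical.

```python
import hashlib
import numpy as np

def make_prompt(record):
    """Serialize numeric input variables into a sentence-like prompt."""
    return (f"On {record['date']}, the temperature was {record['temp']} degrees "
            f"and the previous day's load was {record['prev_load']} MWh.")

def frozen_embed(text, dim=16):
    """Stand-in for a frozen LLM encoder: a deterministic pseudo-embedding.
    In the actual scheme this would be a pretrained model's embedding output."""
    seed = int(hashlib.sha256(text.encode()).hexdigest(), 16) % (2 ** 32)
    return np.random.default_rng(seed).normal(size=dim)

record = {"date": "2024-07-01", "temp": 29.5, "prev_load": 1234.0}
prompt = make_prompt(record)
vec = frozen_embed(prompt)      # feature vector fed to the downstream forecaster
print(prompt)
print(vec.shape)
```

Because the encoder is frozen, the same prompt always maps to the same vector, so the embedding step behaves as a fixed feature extractor ahead of the trainable forecasting model.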

A Two-Phase Hybrid Stock Price Forecasting Model : Cointegration Tests and Artificial Neural Networks (2단계 하이브리드 주가 예측 모델 : 공적분 검정과 인공 신경망)

  • Oh, Yu-Jin;Kim, Yu-Seop
    • The KIPS Transactions:PartB
    • /
    • v.14B no.7
    • /
    • pp.531-540
    • /
    • 2007
  • In this research, we propose a two-phase hybrid stock price forecasting model combining cointegration tests and artificial neural networks. By using not only stocks related to the target stock but also past information as input features for the neural networks, the new model showed improved forecasting performance over ordinary neural networks. First, to extract stocks that have long-run relationships with the target stock, we used Johansen's cointegration test. In the stock market, some stocks tend to vary similarly, and this phenomenon can be very informative for forecasting the target stock; Johansen's cointegration test indicates whether variables are related and whether the relationship is statistically significant. Second, we trained a model that includes lagged variables of the target and related stocks in addition to their other characteristics. Although earlier research usually did not incorporate such variables, it is well known that most economic time series depend on their past values, and it is common in the econometric literature to include lagged values as explanatory variables. We implemented a price direction forecasting system for the KOSPI index to examine the performance of the proposed model. As a result, our model had 11.29% higher forecasting accuracy on average than the model trained without the cointegration test, and 10.59% higher on average than a model that randomly selected stocks to make its feature set the same size as that of the proposed model.
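The second phase's lagged-variable construction can be sketched as follows. This is a minimal sketch on a toy AR(1) series, not real stock data; the first phase (Johansen's cointegration test, available e.g. as `coint_johansen` in statsmodels) is out of scope here, and linear least squares stands in for the neural network purely to show that the lags carry signal.

```python
import numpy as np

def lagged_matrix(series, lags):
    """Build a lagged feature matrix: row t of X holds [x_{t-1}, ..., x_{t-lags}]
    and y holds x_t, so past values become explanatory variables."""
    X = np.column_stack([series[lags - k : len(series) - k]
                         for k in range(1, lags + 1)])
    y = series[lags:]
    return X, y

rng = np.random.default_rng(0)
# Toy AR(1)-like "price" series: today's value depends on yesterday's.
x = np.zeros(300)
for t in range(1, 300):
    x[t] = 0.8 * x[t - 1] + rng.normal()

X, y = lagged_matrix(x, lags=2)
# Fit a linear model (intercept appended) as a stand-in for the neural network.
coef, *_ = np.linalg.lstsq(np.column_stack([X, np.ones(len(X))]), y, rcond=None)
print(X.shape, round(coef[0], 2))
```

The recovered lag-1 coefficient is close to the true 0.8, illustrating why including lagged values of the target and cointegrated stocks improves the forecaster's inputs.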

Factors Influencing the Activities of Collecting Data for Program Development in the Social Welfare Centers (종합사회복지관의 프로그램개발을 위한 정보수집에 영향을 미치는 요인에 관한 연구 - 청소년복지 프로그램 담당자들을 중심으로 -)

  • Seo, In-Hae
    • Korean Journal of Social Welfare
    • /
    • v.54
    • /
    • pp.245-272
    • /
    • 2003
  • Despite the importance of program development activities and the need to systematically investigate their features in social service agencies, recent social work studies have neglected this area, including the activities of collecting data in the process of program development. This study was therefore undertaken to investigate the salient features of data-collecting activities for program development in social welfare centers in Korea. A questionnaire was constructed in three parts, covering one dependent variable and six independent variables, and 201 questionnaires were collected from 353 agencies over two months. The descriptive analyses revealed five noticeable features: (1) data-collecting activities for program development in the agencies are very active; (2) staff in their twenties are in charge of program development; (3) diverse data are collected in the process of program development; (4) hard data are collected more than soft data; and (5) the respondents depend more on knowledge learned from individual study than on knowledge learned from academic institutes. Multiple regression was applied to analyze the relationships between the independent variables and three kinds of dependent variables: the overall level of data collecting, collecting hard data, and collecting soft data. The results showed that the overall level of data collecting was critically influenced by social workers' autonomy, openness, knowledge learned from academic institutes, and workload. Collecting hard data was influenced by the same variables except workload, and collecting soft data was influenced by autonomy, openness, knowledge learned from academic institutes, and workload.
Major findings are discussed and several suggestions are made for future research and for improving program development in social welfare centers.
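The multiple regression design above can be sketched numerically. This is a toy illustration only: the predictor names (autonomy, openness, academic knowledge, workload), the coefficients, and the simulated responses are all hypothetical stand-ins for the survey data, with the sample size matched to the study's 201 questionnaires.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical predictor scores: autonomy, openness, academic knowledge, workload.
n = 201                                     # matches the study's sample size
X = rng.normal(size=(n, 4))
# Simulated data-collecting activity score with assumed effect sizes.
y = (0.5 * X[:, 0] + 0.4 * X[:, 1] + 0.3 * X[:, 2] + 0.2 * X[:, 3]
     + rng.normal(scale=0.5, size=n))

# Ordinary least squares with an intercept column appended.
A = np.column_stack([X, np.ones(n)])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.round(beta[:4], 2))
```

Each coefficient in `beta` estimates one predictor's influence on the data-collecting score while holding the other predictors fixed, which is how the study attributes influence to autonomy, openness, knowledge, and workload.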


Prediction of Key Variables Affecting NBA Playoffs Advancement: Focusing on 3 Points and Turnover Features (미국 프로농구(NBA)의 플레이오프 진출에 영향을 미치는 주요 변수 예측: 3점과 턴오버 속성을 중심으로)

  • An, Sehwan;Kim, Youngmin
    • Journal of Intelligence and Information Systems
    • /
    • v.28 no.1
    • /
    • pp.263-286
    • /
    • 2022
  • This study acquired NBA statistical information for a total of 32 years, from 1990 to 2022, using web crawling, observed the variables of interest through exploratory data analysis, and generated related derived variables. Unused variables were removed through a purification process on the input data, and correlation analysis, t-tests, and ANOVA were performed on the remaining variables. For the variables of interest, the difference in means between the groups that did and did not advance to the playoffs was tested; to complement this, the mean differences among three groups (upper/middle/lower) based on ranking were also confirmed. Of the input data, only the current season's data was used as the test set, and 5-fold cross-validation was performed by dividing the remainder into training and validation sets for model training. Overfitting was checked by comparing the cross-validation results with the final results on the test set and confirming that there was no difference in the performance metrics. Because the quality of the raw data is high and the statistical assumptions are satisfied, most of the models showed good results despite the small data set. This study not only predicts NBA game results and classifies playoff advancement using machine learning, but also examines whether the variables of interest rank among the major variables by assessing the importance of the input attributes. Visualizing SHAP values made it possible to overcome the limitation that feature importance alone cannot be interpreted, and to compensate for the inconsistency of importance calculations as variables are entered and removed. A number of variables related to three-pointers and turnovers, classified as the subjects of interest in this study, were found among the major variables affecting playoff advancement in the NBA.
Although this study resembles existing sports data analysis work in covering topics such as match results, playoffs, and championship predictions, and in comparatively analyzing several machine learning models, it differs in that the features of interest were set in advance and statistically verified before being compared with the machine learning results. It is also differentiated from existing studies by presenting explanatory visualization results using SHAP, one of the XAI models.
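The validation scheme above (hold out "this season" as a test set, run 5-fold cross-validation on the rest, and compare the two scores for signs of overfitting) can be sketched as follows. Synthetic data and a gradient boosting classifier are assumed stand-ins; the study's actual features and models are not reproduced here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import accuracy_score

# Synthetic stand-in for season statistics (e.g., 3-point rates, turnovers)
# labelled by playoff advancement.
X, y = make_classification(n_samples=600, n_features=10, n_informative=4,
                           random_state=0)
# Hold out "this season" as the test set, as in the study.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = GradientBoostingClassifier(random_state=0)
cv_acc = cross_val_score(clf, X_tr, y_tr, cv=5).mean()   # 5-fold CV estimate
test_acc = accuracy_score(y_te, clf.fit(X_tr, y_tr).predict(X_te))

# A small gap between CV and test accuracy suggests no serious overfitting.
print(round(cv_acc, 3), round(test_acc, 3))
```

When the two numbers agree, as the study reports for its models, the cross-validated performance can be trusted as an estimate of performance on unseen seasons.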

Health Risk Management using Feature Extraction and Cluster Analysis considering Time Flow (시간흐름을 고려한 특징 추출과 군집 분석을 이용한 헬스 리스크 관리)

  • Kang, Ji-Soo;Chung, Kyungyong;Jung, Hoill
    • Journal of the Korea Convergence Society
    • /
    • v.12 no.1
    • /
    • pp.99-104
    • /
    • 2021
  • In this paper, we propose health risk management using feature extraction and cluster analysis that considers the flow of time. The proposed method proceeds in three steps. The first is pre-processing and feature extraction: the user's lifelog is collected with a wearable device; incomplete data, errors, noise, and contradictory data are removed; and missing values are processed. For feature extraction, important variables are selected through principal component analysis, and related data are grouped using correlation coefficients and covariance. Second, to analyze the features extracted from the lifelog, dynamic clustering is performed with the K-means algorithm in consideration of the passage of time, and new data are clustered using a similarity distance measure based on the increment of the sum of squared errors. Third, information about each cluster is extracted over time. A health decision-making system built on these feature clusters can then manage risks through factors such as physical characteristics, lifestyle habits, disease status, the risk of health care events, and their predictability. For performance evaluation, the proposed method was compared with fuzzy and kernel-based clustering using precision, recall, and F-measure, and it was evaluated favorably. The proposed method therefore makes it possible to accurately predict and appropriately manage a user's potential health risks based on similarity to existing patients.
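The clustering step above, including the assignment of newly arriving data by the smallest increase in the sum of squared errors (SSE), can be sketched as follows. The 2-D toy "lifelog features" are assumptions for the example, and the SSE-increment rule is approximated by the squared distance to each centroid, which is what the increment is proportional to at first order.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy lifelog features after pre-processing (e.g., two principal components),
# drawn around three distinct behavior patterns.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in ([0, 0], [4, 0], [0, 4])])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

def assign_by_sse_increment(point, centers):
    """Assign a new sample to the cluster whose SSE would grow the least,
    i.e., the centroid with the smallest squared Euclidean distance."""
    increments = ((centers - point) ** 2).sum(axis=1)
    return int(np.argmin(increments))

new_point = np.array([3.8, 0.2])          # a newly observed lifelog sample
label = assign_by_sse_increment(new_point, km.cluster_centers_)
print(label, int(km.predict(new_point.reshape(1, -1))[0]))
```

Re-running this assignment as data streams in, and periodically refitting, gives the "dynamic clustering over time" behavior the method describes.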

Feature Analysis on Industrial Accidents of Manufacturing Businesses Using QUEST Algorithm

  • Leem, Young-Moon;Rogers, K.J.;Hwang, Young-Seob
    • International Journal of Safety
    • /
    • v.5 no.1
    • /
    • pp.37-41
    • /
    • 2006
  • The major objective of statistical analysis of industrial accidents is to determine safety factors so that future accidents can be prevented or reduced by educating those who work in a given industrial field in safety management. So far, however, no quantitative method exists for evaluating the danger associated with industrial accidents. As a step toward such a quantitative evaluation technique, this study presents a feature analysis of industrial accidents in the manufacturing field using the QUEST algorithm. A retrospective analysis was performed on 10,536 subjects (10,313 injured people, 223 deaths), sampled from data on manufacturing businesses over a three-year period (2002-2004) in Korea. The study used AnswerTree from SPSS, and the analysis identified the most important variables affecting the injured, such as the occurrence type, the company size, and the time of occurrence. The classification system adopted in the present study using the QUEST algorithm was also found to be quite reliable.
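The tree-based feature analysis above can be sketched as follows. A caveat: QUEST itself (as implemented in SPSS AnswerTree) is not available in scikit-learn, so a CART-style decision tree is substituted here as a related classification-tree method, and the accident records are entirely synthetic with an assumed severity rule.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic accident records: occurrence type (0-4), company size (0-2),
# and hour of occurrence; label 1 marks a severe outcome. Purely illustrative.
n = 2000
X = np.column_stack([rng.integers(0, 5, n),      # occurrence type
                     rng.integers(0, 3, n),      # company size band
                     rng.integers(0, 24, n)])    # hour of occurrence
# Assumed rule: one occurrence type at the smallest company size is severe.
y = ((X[:, 0] == 2) & (X[:, 1] == 0)).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)
print(round(tree.score(X_te, y_te), 3), int(tree.feature_importances_.argmax()))
```

As in the study, inspecting the tree's splits and feature importances reveals which variables (here, occurrence type and company size) drive the classification.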

Combining Feature Variables for Improving the Accuracy of Naïve Bayes Classifiers (나이브베이즈분류기의 정확도 향상을 위한 자질변수통합)

  • Heo Min-Oh;Kim Byoung-Hee;Hwang Kyu-Baek;Zhang Byoung-Tak
    • Proceedings of the Korean Information Science Society Conference
    • /
    • 2005.07b
    • /
    • pp.727-729
    • /
    • 2005
  • The naïve Bayes classifier is a highly efficient model in terms of learning, inference, and use of computational resources, and a variety of experiments have shown that its classification performance does not fall far behind that of other techniques. It is most effective when it can exactly represent the true probability distribution that generated the data. However, when the conditional independence present in the true distribution does not match the structure of the naïve Bayes classifier, performance can degrade, and the degradation worsens when probabilistic dependencies exist among the feature variables. This paper presents a feature-variable merging technique that efficiently addresses this weakness. Merging feature variables explicitly represents the relationships among them, and experiments show that selecting the variables to merge based on mutual information contributes substantially to the performance improvement.
