• Title/Summary/Keyword: Random Forest Classification


A study on variable selection and classification in dynamic analysis data for ransomware detection (랜섬웨어 탐지를 위한 동적 분석 자료에서의 변수 선택 및 분류에 관한 연구)

  • Lee, Seunghwan;Hwang, Jinsoo
    • The Korean Journal of Applied Statistics / v.31 no.4 / pp.497-505 / 2018
  • Ransomware attacks on computer systems are common worldwide. As antivirus and detection methods are continually improved to detect and mitigate ransomware, ransomware itself evolves to evade detection. Several new methods have been implemented and tested to optimize protection against ransomware. In our work, sample data from 582 ransomware and 942 normalware programs, together with 30,967 dynamic action sequence variables, are used to detect ransomware efficiently. Several variable selection techniques are combined with various machine learning based classification techniques to protect systems from ransomware. Among the combinations tried, chi-square variable selection with random forest gives the best detection rate and accuracy.
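The abstract above names chi-square variable selection but does not describe an implementation. A minimal sketch of the idea, assuming binary (0/1) features and binary labels, might look like the following; the function names `chi_square_score` and `select_top_k` are hypothetical, and this is not the authors' code.

```python
def chi_square_score(feature, labels):
    """Pearson chi-square statistic of a binary feature against a binary label."""
    n = len(labels)
    # Observed counts of the 2x2 contingency table, keyed by (feature value, label).
    observed = {(f, y): 0 for f in (0, 1) for y in (0, 1)}
    for f, y in zip(feature, labels):
        observed[(f, y)] += 1
    score = 0.0
    for f in (0, 1):
        for y in (0, 1):
            row_total = observed[(f, 0)] + observed[(f, 1)]
            col_total = observed[(0, y)] + observed[(1, y)]
            expected = row_total * col_total / n
            if expected > 0:  # skip empty rows/columns (constant features)
                score += (observed[(f, y)] - expected) ** 2 / expected
    return score


def select_top_k(columns, labels, k):
    """Return the indices of the k feature columns with the highest chi-square score."""
    scores = [chi_square_score(col, labels) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return ranked[:k]


# Toy data (made up for illustration): 1 = ransomware, 0 = normalware.
labels = [1, 1, 1, 0, 0, 0]
columns = [
    [1, 0, 1, 0, 1, 0],  # weakly associated with the label
    [1, 1, 1, 0, 0, 0],  # perfectly associated with the label
    [1, 1, 1, 1, 1, 1],  # constant, carries no information
]
top = select_top_k(columns, labels, k=1)
```

The selected columns would then be passed to a classifier such as a random forest; with tens of thousands of variables, this kind of filter step reduces the input before training.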

Opponent Move Prediction of a Real-time Strategy Game Using a Multi-label Classification Based on Machine Learning (기계학습 기반 다중 레이블 분류를 이용한 실시간 전략 게임에서의 상대 행동 예측)

  • Shin, Seung-Soo;Cho, Dong-Hee;Kim, Yong-Hyuk
    • Journal of the Korea Convergence Society / v.11 no.10 / pp.45-51 / 2020
  • Many games now provide data on users' game play, and a few studies have predicted opponent moves by combining machine learning methods. This study predicts opponent moves using match data from the real-time strategy game ClashRoyale and multi-label classification based on machine learning. In the initial experiment, binary card properties, binary card coordinates, and normalized time information are used as input, and card type and card coordinates are predicted using random forest and multi-layer perceptron. Subsequent experiments applied the following three data preprocessing methods in sequence. First, some property information in the input data was transformed. Next, the input data were converted to a nested form to reflect the consecutive card input system. Finally, the input data were divided into early and late parts according to the normalized time information and predicted separately. As a result, the best preprocessing step, nested data restricted to the early part, showed an improvement of about 2.6% for card type and about 1.8% for card coordinates.

Development of The Irregular Radial Pulse Detection Algorithm Based on Statistical Learning Model (통계적 학습 모형에 기반한 불규칙 맥파 검출 알고리즘 개발)

  • Bae, Jang-Han;Jang, Jun-Su;Ku, Boncho
    • Journal of Biomedical Engineering Research / v.41 no.5 / pp.185-194 / 2020
  • Arrhythmia is usually diagnosed from the electrocardiogram (ECG) signal; however, the ECG is difficult to measure and requires expert help to analyze. The radial pulse, on the other hand, can be measured easily in daily life, and could be a suitable bio-signal for the recent contactless ("untact") paradigm and an extensible signal for pulse-pattern-based diagnosis in Korean medicine. In this study, we developed an irregular radial pulse detection algorithm based on a learning model and considered its applicability to arrhythmia screening. A total of 1432 pulse waves, including irregular pulse data, were used in the experiment. Three data sets were prepared with minimal preprocessing to avoid heuristic feature extraction. Elastic net logistic regression, random forest, and extreme gradient boosting were applied as classification algorithms to each data set, and irregular pulse detection performance was estimated using the area under the receiver operating characteristic curve based on 10-fold cross-validation. The extreme gradient boosting method outperformed the others, with a classification accuracy of 99.7%. The results confirmed that the proposed algorithm could be used for arrhythmia screening. To build a fusion technology integrating Western and Korean medicine, future research will need to address arrhythmia subtype classification from the perspective of Korean medicine.
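The evaluation described above (area under the ROC curve, estimated with 10-fold cross-validation) can be sketched in plain Python. This is an illustrative sketch only: `roc_auc` and `k_fold_indices` are hypothetical helper names, the folds are contiguous and unshuffled for simplicity, and the abstract does not say how the authors built their folds.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    correct = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                correct += 1.0
            elif p == q:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    folds = []
    start = 0
    for i in range(k):
        size = n_samples // k + (1 if i < n_samples % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


# Toy scores (made up): 1 = irregular pulse, 0 = regular pulse.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])

# 1432 pulse waves, as in the abstract, split into 10 validation folds.
folds = k_fold_indices(1432, 10)
```

In a cross-validated evaluation, each fold is held out once, the model is trained on the remaining folds, and `roc_auc` is computed on the held-out scores; the per-fold values are then averaged.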

Detection of Depression Trends in Literary Cyber Writers Using Sentiment Analysis and Machine Learning

  • Faiza Nasir;Haseeb Ahmad;CM Nadeem Faisal;Qaisar Abbas;Mubarak Albathan;Ayyaz Hussain
    • International Journal of Computer Science & Network Security / v.23 no.3 / pp.67-80 / 2023
  • Nowadays, psychologists consider social media an important tool for examining mental disorders. Among these disorders, depression is one of the most common yet least cured diseases. Since many writers with extensive followings express their feelings on social media and depression is increasing significantly, exploring the literary text shared on social media may reveal multidimensional features of depressive behavior. (1) Background: Several studies have observed that depressive data contain certain language styles and self-expressing pronouns; the current study provides evidence that posts with self-expressing pronouns and depressive language styles carry high emotional temperatures. The main objective of this study is therefore to examine literary cyber writers' posts for symptomatic signs of depression. For this purpose, our research focuses on extracting data from writers' public social media pages, blogs, and communities. (3) Results: To examine emotional temperatures and sentence usage in the depressive and non-depressive groups, we employed the SentiStrength algorithm as a psycholinguistic method, TF-IDF and N-Gram for ranked phrase extraction, and Latent Dirichlet Allocation for topic modelling of the extracted phrases. The results reveal a strong connection between depression and negative emotional temperatures in writers' posts. Moreover, we used Naïve Bayes, Support Vector Machines, Random Forest, and Decision Tree algorithms to validate the classification of depressive and non-depressive content in terms of sentences, phrases, and topics. The results show that, compared with the others, the Support Vector Machines algorithm validates the classification best, attaining the highest F-score of 79%. (4) Conclusions: Experimental results show that the proposed system performs well in detecting depression trends in literary cyber writers using sentiment analysis.

Development of System for Enhancing the Quality of Power Generation Facilities Failure History Data Based on Explainable AI (XAI) (XAI 기반 발전설비 고장 기록 데이터 품질 향상 시스템 개발)

  • Kim Yu Rim;Park Jeong In;Park Dong Hyun;Kang Sung Woo
    • Journal of Korean Society for Quality Management / v.52 no.3 / pp.479-493 / 2024
  • Purpose: Failure history data at power plants deteriorate in quality because workers interpret failures differently and record them inconsistently, which negatively impacts efficient plant operation. The purpose of this study is to propose a system that classifies power generation facility failures consistently based on the failure history text data created by the workers. Methods: This study uses data collected from three coal unloaders operated by Korea Midland Power Co., LTD, from 2012 to 2023. It classifies failures based on the results of Soft Voting, which combines the prediction probabilities obtained by applying predict_proba to four machine learning models (Random Forest, Logistic Regression, XGBoost, and SVM) with scores obtained by constructing word dictionaries for each type of failure using LIME, one of the XAI (Explainable Artificial Intelligence) methods. Through this, a failure classification system is proposed to improve the quality of power generation facility failure history data. Results: When the failure classification system was applied to the failure history data of the Continuous Ship Unloader, XGBoost showed the best performance, with a Macro_F1 Score of 93%. With the proposed system, the Macro_F1 Score for Logistic Regression increased by up to 0.17 compared with the model applied alone. All four models showed equal or higher Accuracy and Macro_F1 Score with the system than the single model alone. Conclusion: This study proposes a failure classification system for power generation facilities to improve the quality of failure history data. This will contribute to cost reduction and to the stability of power generation facilities, as well as to further improvement of power plant operating efficiency.
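The soft voting step mentioned in the abstract above averages class probabilities across models and picks the class with the highest average. A minimal sketch follows, assuming each model has already produced a probability vector for one sample (as `predict_proba` does in scikit-learn-style APIs). The helper `soft_vote` and the numbers are hypothetical, and the LIME-based dictionary scores that the paper adds are not reproduced here.

```python
def soft_vote(model_probabilities, weights=None):
    """Average per-class probabilities across models and return the winning class.

    model_probabilities: one list of class probabilities per model, all for the
    same sample and in the same class order.
    weights: optional per-model weights; equal weighting is assumed by default.
    """
    n_models = len(model_probabilities)
    n_classes = len(model_probabilities[0])
    if weights is None:
        weights = [1.0] * n_models
    total_weight = sum(weights)
    averaged = [
        sum(w * probs[c] for w, probs in zip(weights, model_probabilities)) / total_weight
        for c in range(n_classes)
    ]
    predicted = max(range(n_classes), key=lambda c: averaged[c])
    return predicted, averaged


# Made-up probabilities for one failure record over three failure types,
# one row per model (e.g. Random Forest, Logistic Regression, XGBoost, SVM).
probabilities = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.7, 0.2],
    [0.4, 0.4, 0.2],
]
predicted_class, averaged = soft_vote(probabilities)
```

Unlike hard (majority) voting, soft voting lets a confident model outweigh several uncertain ones, which is one reason an ensemble can match or exceed each single model.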

Detection of Wildfire Smoke Plumes Using GEMS Images and Machine Learning (GEMS 영상과 기계학습을 이용한 산불 연기 탐지)

  • Jeong, Yemin;Kim, Seoyeon;Kim, Seung-Yeon;Yu, Jeong-Ah;Lee, Dong-Won;Lee, Yangwon
    • Korean Journal of Remote Sensing / v.38 no.5_3 / pp.967-977 / 2022
  • The occurrence and intensity of wildfires are increasing with climate change. Emissions from forest fire smoke are recognized as a major factor affecting air quality and the greenhouse effect. Satellite products and machine learning are essential for detecting forest fire smoke. Until now, research on forest fire smoke detection has been hampered by difficulties in cloud identification and by vague criteria for smoke boundaries. The purpose of this study is to detect forest fire smoke using Level 1 and Level 2 data from the Geostationary Environment Monitoring Spectrometer (GEMS), a Korean environmental satellite sensor, and machine learning. The forest fire in Gangwon-do in March 2022 was selected as a case. Smoke pixel classification modeling was performed by producing wildfire smoke label images and feeding GEMS Level 1 and Level 2 data to a random forest model. In the trained model, the input variables ranked by importance were, in order: Aerosol Optical Depth (AOD), the 380 nm and 340 nm radiance difference, Ultra-Violet Aerosol Index (UVAI), Visible Aerosol Index (VisAI), Single Scattering Albedo (SSA), formaldehyde (HCHO), nitrogen dioxide (NO2), 380 nm radiance, and 340 nm radiance. In addition, in the estimation of the forest fire smoke probability (0 ≤ p ≤ 1) for 2,704 pixels, the Mean Bias Error (MBE) was -0.002, the Mean Absolute Error (MAE) was 0.026, the Root Mean Square Error (RMSE) was 0.087, and the Correlation Coefficient (CC) was 0.981.
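The four accuracy measures reported above have standard definitions, sketched below in plain Python. The sign convention for MBE (predicted minus observed) is an assumption, since the abstract does not state it, and `regression_errors` is a hypothetical helper name with made-up sample values.

```python
import math


def regression_errors(predicted, observed):
    """Return MBE, MAE, RMSE, and the Pearson correlation coefficient (CC)."""
    n = len(observed)
    # Assumed convention: error = predicted - observed.
    diffs = [p - o for p, o in zip(predicted, observed)]
    mbe = sum(diffs) / n
    mae = sum(abs(d) for d in diffs) / n
    rmse = math.sqrt(sum(d * d for d in diffs) / n)

    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    cc = cov / math.sqrt(var_p * var_o)
    return {"MBE": mbe, "MAE": mae, "RMSE": rmse, "CC": cc}


# Made-up smoke probabilities for four pixels (not data from the study).
predicted = [0.1, 0.9, 0.5, 0.0]
observed = [0.0, 1.0, 0.5, 0.0]
errors = regression_errors(predicted, observed)
```

MBE near zero indicates no systematic over- or under-estimation, while MAE and RMSE measure the typical error size; RMSE penalizes large errors more heavily, which is why it is never smaller than MAE.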

Classification Algorithm-based Prediction Performance of Order Imbalance Information on Short-Term Stock Price (분류 알고리즘 기반 주문 불균형 정보의 단기 주가 예측 성과)

  • Kim, S.W.
    • Journal of Intelligence and Information Systems / v.28 no.4 / pp.157-177 / 2022
  • Investors trade stocks while closely watching, in real time, the orders submitted by domestic and foreign investors through Limit Order Book information, the so-called price current provided by securities firms. Is the order information released in the Limit Order Book useful for stock price prediction? This study analyzes whether order imbalance, which appears when investors' buy and sell orders are concentrated on one side during intra-day trading, is a significant predictor of future stock price movement. Using classification algorithms, this study improved the accuracy with which order imbalance information predicts the short-term price trend, that is, whether the day's closing price rises or falls. Day trading strategies are proposed using the price trends predicted by the classification algorithms, and their trading performance is analyzed empirically. The 5-minute KOSPI200 Index Futures data were analyzed for 4,564 days from January 19, 2004 to June 30, 2022. The results of the empirical analysis are as follows. First, order imbalance information has a significant impact on current stock prices. Second, the order imbalance information observed in the early morning has significant forecasting power for the price trend from the early morning to the market close. Third, the Support Vector Machines algorithm showed the highest accuracy, 54.1%, in predicting the day's closing price trend from order imbalance information. Fourth, order imbalance information measured early in the day had higher prediction accuracy than that measured later in the day. Fifth, the day trading strategies that used the classification algorithms' predictions of price trends performed better than the benchmark trading strategy. Sixth, except for the K-Nearest Neighbor algorithm, all strategies using the classification algorithms showed higher average total profits than the benchmark strategy. Seventh, the strategies using the predictions of the Logistic Regression, Random Forest, Support Vector Machines, and XGBoost algorithms outperformed the benchmark strategy in the Sharpe Ratio, which evaluates both profitability and risk. This study differs academically from existing studies in that it documents the economic value of the total buy and sell order volume information within the Limit Order Book. The empirical results are also valuable to market participants from a trading perspective. Future studies should improve the performance of the trading strategy with more accurate price predictions by extending to deep learning models, which are being actively studied for stock price prediction.

A Study on the Classification of Unstructured Data through Morpheme Analysis

  • Kim, SungJin;Choi, NakJin;Lee, JunDong
    • Journal of the Korea Society of Computer and Information / v.26 no.4 / pp.105-112 / 2021
  • In the era of big data, interest in data is exploding. In particular, the development of the Internet and social media has led to the creation of new data, enabling the era of big data and artificial intelligence and opening a new chapter in convergence technology. There is also growing demand for analysis of data that programs could not handle in the past. In this paper, an analysis model was designed and verified for the classification of unstructured data, which is often required in the era of big data. Thesis summaries, main words, and sub-keywords were crawled from DBPia, a database was created using KoNLP's data dictionary, and words were tokenized through morpheme analysis. In addition, nouns were extracted using KAIST's 9 part-of-speech classification system, TF-IDF values were generated, and an analysis dataset was created by combining training data and Y values. Finally, the adequacy of classification was measured by applying three analysis algorithms (random forest, SVM, decision tree) to the generated analysis dataset. The classification model technique proposed in this paper can be applied in various fields, such as civil complaint classification and text-related analysis, in addition to thesis classification.

A Comparative Study on Mapping and Filtering Radii of Local Climate Zone in Changwon city using WUDAPT Protocol (WUDAPT 절차를 활용한 창원시의 국지기후대 제작과 필터링 반경에 따른 비교 연구)

  • Tae-Gyeong KIM;Kyung-Hun PARK;Bong-Geun SONG;Seoung-Hyeon KIM;Da-Eun JEONG;Geon-Ung PARK
    • Journal of the Korean Association of Geographic Information Studies / v.27 no.2 / pp.78-95 / 2024
  • To establish and compare environmental plans across various domains while considering climate change and urban issues, it is crucial to build regional-scale spatial data classified with consistent criteria. This study maps the Local Climate Zone (LCZ) of Changwon City, where active climate and environmental research is being conducted, using the protocol suggested by the World Urban Database and Access Portal Tools (WUDAPT). Additionally, to address the fragmentation issue, in which some grids are classified with different climate characteristics despite lying in regions with homogeneous climate traits, a filtering technique was applied, and the LCZ classification characteristics were compared according to the filtering radius. Using satellite images, ground reference data, and the supervised machine learning classifier Random Forest, classification maps without filtering and with filtering radii of 1, 2, and 3 were produced, and their accuracies were compared. Furthermore, to compare the LCZ classification characteristics according to building types in urban areas, an urban form index used in GIS-based classification methodology was created and compared with the ranges suggested in previous studies. As a result, the overall accuracy was highest when the filtering radius was 1. In the comparison of the urban form index, the differences between LCZ types were minimal, and most values fell within the ranges of previous studies. However, the study identified a limitation in reflecting building height information, and adding data to complement this is expected to yield more accurate results. The findings of this study can serve as reference material for creating fundamental spatial data for environmental research on urban climates in South Korea.

Predicting the mortality of pneumonia patients visiting the emergency department through machine learning (기계학습모델을 통한 응급실 폐렴환자의 사망예측 모델과 기존 예측 모델의 비교)

  • Bae, Yeol;Moon, Hyung Ki;Kim, Soo Hyun
    • Journal of The Korean Society of Emergency Medicine / v.29 no.5 / pp.455-464 / 2018
  • Objective: Machine learning is not yet widely used in the medical field. Therefore, this study was conducted to compare the performance of preexisting severity prediction models and machine learning based models (random forest [RF], gradient boosting [GB]) for mortality prediction in pneumonia patients. Methods: We retrospectively collected data from patients who visited the emergency department of a tertiary training hospital in Seoul, Korea from January to March of 2015. The Pneumonia Severity Index (PSI) and Sequential Organ Failure Assessment (SOFA) scores were calculated for both groups and the area under the curve (AUC) for mortality prediction was computed. For the RF and GB models, data were divided into a training set and a validation set by the random split method. The RF and GB models were trained on the training set and the AUC was obtained from the validation set. The mean AUC was compared with the other two AUCs. Results: Of the 536 investigated patients, 395 were enrolled and 41 of them died. The AUC values of PSI and SOFA scores were 0.799 (0.737-0.862) and 0.865 (0.811-0.918), respectively. The mean AUC values obtained by the RF and GB models were 0.928 (0.899-0.957) and 0.919 (0.886-0.952), respectively. There were significant differences between preexisting severity prediction models and machine learning based models (P<0.001). Conclusion: Classification through machine learning may help predict the mortality of pneumonia patients visiting the emergency department.