• Title/Summary/Keyword: WEKA

Social Network Spam Detection using Recursive Structure Features (소셜 네트워크 상에서의 재귀적 네트워크 구조 특성을 활용한 스팸탐지 기법)

  • Jang, Boyeon;Jeong, Sihyun;Kim, Chongkwon
    • Journal of KIISE / v.44 no.11 / pp.1231-1235 / 2017
  • In online social networks, it is important to find a way to distinguish spam accounts using network features. Service providers attempt to detect social spamming in order to maintain service quality, but spammer groups change their strategies to avoid detection. Even when spammers try to act like legitimate users, certain distinguishing structural features are not easily changed. In this paper, we investigate a way to generate meaningful network structure features and propose a spammer detection method that uses recursive structural features. In an experiment on a real-world dataset, the proposed algorithm improved classification performance by about 8%.
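
The abstract above does not say how its recursive structural features are built, so the following is only a minimal sketch of one common approach: start from a local feature (here, node degree) and repeatedly append the sum and mean of each feature over a node's neighbours. The function name, the toy graph, and the choice of degree, sum, and mean are all assumptions made for illustration, not the authors' method.

```python
from statistics import mean


def recursive_features(adjacency, rounds=2):
    """Return {node: [features]} built by recursive neighbour aggregation.

    adjacency maps each node to a list of its neighbours (hypothetical input
    format). Each round appends, for every existing feature, the sum and the
    mean of that feature over the node's neighbours.
    """
    # Round 0: one local feature per node, its degree.
    features = {node: [float(len(nbrs))] for node, nbrs in adjacency.items()}
    for _ in range(rounds):
        updated = {}
        for node, nbrs in adjacency.items():
            current = features[node]
            if nbrs:
                # Transpose so each column holds one feature across neighbours.
                columns = list(zip(*(features[n] for n in nbrs)))
                sums = [sum(col) for col in columns]
                means = [mean(col) for col in columns]
            else:
                # Isolated node: nothing to aggregate.
                sums = [0.0] * len(current)
                means = [0.0] * len(current)
            updated[node] = current + sums + means
        features = updated
    return features


# Hypothetical toy graph: "s" is a hub account linked to three leaf accounts.
graph = {"s": ["a", "b", "c"], "a": ["s"], "b": ["s"], "c": ["s"]}
feats = recursive_features(graph, rounds=1)
print(feats["s"], feats["a"])
```

Each round triples the feature vector, so in practice the number of rounds is kept small and redundant features are pruned before the vectors are passed to a classifier.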

A Development of Suicidal Ideation Prediction Model and Decision Rules for the Elderly: Decision Tree Approach (의사결정나무 기법을 이용한 노인들의 자살생각 예측모형 및 의사결정 규칙 개발)

  • Kim, Deok Hyun;Yoo, Dong Hee;Jeong, Dae Yul
    • The Journal of Information Systems / v.28 no.3 / pp.249-276 / 2019
  • Purpose: This study develops a prediction model and decision rules for suicidal ideation among the elderly, based on Korean Welfare Panel survey data. From this data we obtained many decision rules for predicting suicidal ideation in the elderly. Design/methodology/approach: This study used classification analysis based on the decision tree technique to derive the decision rules. Weka 3.8 was used as the data mining tool, with the J48 decision tree algorithm (Weka's implementation of C4.5). The data were divided into learning data and verification data using a 66.6% split. We considered all candidate variables identified in previous studies on predicting suicidal ideation in the elderly; in total, 99 variables, including the target variable, were used. Classification analysis was performed using backward elimination and a sampling technique for data balancing. Findings: The results differed significantly between the data sets, and the selected data sets produced different decision trees and several rules. Based on the decision tree method, we derived rules for suicide prevention. The decision tree yields rules for suicidal ideation not only in the depressed group but also in the non-depressed group. In addition, while developing the predictive model, we directly identified the over-fitting problem caused by data imbalance by applying data balancing. We conclude that the target variable must be balanced in order to perform a correct classification analysis without over-fitting. Moreover, even with data balancing applied, the prediction rate was not inferior to that of the biased prediction model.
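
As a rough illustration of the two data-preparation steps this abstract mentions (balancing the target variable, then a 66.6% learning/verification split), here is a small standard-library sketch. It does not reproduce Weka or J48. Random undersampling is an assumed balancing method, since the abstract only says "data balancing", and the order of the two steps, the field name `ideation`, and the row counts are hypothetical.

```python
import random


def undersample(rows, label_of, seed=1):
    """Randomly drop rows so every class is as large as the rarest class.

    Undersampling is an assumption; the abstract does not name the method.
    """
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    smallest = min(len(group) for group in by_class.values())
    rng = random.Random(seed)
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, smallest))
    return balanced


def percentage_split(rows, train_fraction=0.666, seed=1):
    """Shuffle the rows and split them into learning and verification sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical imbalanced data: 90 rows without ideation, 10 rows with it.
data = [{"id": i, "ideation": "no"} for i in range(90)]
data += [{"id": 90 + i, "ideation": "yes"} for i in range(10)]

balanced = undersample(data, label_of=lambda row: row["ideation"])
learning, verification = percentage_split(balanced)
print(len(balanced), len(learning), len(verification))
```

A classifier trained on the imbalanced rows could reach 90% accuracy by always predicting "no", which is the kind of misleading result that balancing the target variable is meant to expose.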

Exploring Support Vector Machine Learning for Cloud Computing Workload Prediction

  • ALOUFI, OMAR
    • International Journal of Computer Science & Network Security / v.22 no.10 / pp.374-388 / 2022
  • Cloud computing has been one of the most critical technologies of the last few decades. It was developed for several purposes, for example to meet user requirements in simple ways. Since its invention, cloud computing has followed traditional approaches to elasticity, its key characteristic. Elasticity is the feature of cloud computing that seeks to meet users' needs without interruption at run time. Traditional approaches to elasticity have been used for several years and rely on different kinds of mathematical modelling. Although mathematical modelling has been a step forward in meeting users' needs, the optimisation of elasticity is still lacking. To optimise elasticity in the cloud, machine learning algorithms could be used to predict upcoming workloads and pass the predictions to the scheduling algorithm, which would improve the provisioning of cloud services, improve the Quality of Service (QoS), and reduce power consumption. This paper therefore investigates the use of machine learning techniques to predict the workload of Physical Hosts (PH) in the cloud and their energy consumption. The experiments are hosted on the school of computing cloud testbed (SoC) and use real applications with different behaviours, with workloads that change over time. The results demonstrate that the machine learning techniques used in the scheduling algorithm are able to predict the workload of physical hosts (CPU utilisation), which would help reduce power consumption by scheduling upcoming virtual machines to the physical host with the lowest CPU utilisation. Additionally, a number of tools are used and explored in this paper, such as the WEKA tool, used to train machine learning algorithms on the real data, and the Zabbix tool, used to monitor power consumption before and after scheduling the virtual machines to physical hosts. The paper follows an agile approach, which helped in achieving the solution and managing the work effectively.
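
The scheduling idea in this abstract, placing each upcoming virtual machine on the physical host with the lowest predicted CPU utilisation, can be sketched in a few lines. The predictor itself (the abstract's SVM trained in WEKA) is not reproduced here; the host names, the predicted values, and the fixed `load_per_vm` increment are hypothetical stand-ins.

```python
def pick_host(predicted_cpu):
    """Return the host name with the lowest predicted CPU utilisation."""
    if not predicted_cpu:
        raise ValueError("no hosts to choose from")
    return min(predicted_cpu, key=predicted_cpu.get)


def schedule(vm_names, predicted_cpu, load_per_vm=10.0):
    """Place VMs one by one, raising the chosen host's load after each one.

    load_per_vm is an assumed, fixed CPU cost per VM (in percentage points).
    """
    load = dict(predicted_cpu)  # copy so the caller's predictions are untouched
    placement = {}
    for vm in vm_names:
        host = pick_host(load)
        placement[vm] = host
        load[host] += load_per_vm
    return placement


# Hypothetical predicted CPU utilisation (%) for three physical hosts.
predictions = {"host-a": 55.0, "host-b": 20.0, "host-c": 35.0}
plan = schedule(["vm-1", "vm-2", "vm-3"], predictions)
print(plan)
```

Updating the load after every placement keeps a single lightly loaded host from receiving every new VM in a batch.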

Development of Prediction Model for Prevalence of Metabolic Syndrome Using Data Mining: Korea National Health and Nutrition Examination Study (국민건강영양조사를 활용한 대사증후군 유병 예측모형 개발을 위한 융복합 연구: 데이터마이닝을 활용하여)

  • Kim, Han-Kyoul;Choi, Keun-Ho;Lim, Sung-Won;Rhee, Hyun-Sill
    • Journal of Digital Convergence / v.14 no.2 / pp.325-332 / 2016
  • The purpose of this study is to investigate the attributes that influence the prevalence of metabolic syndrome and to develop a prediction model for metabolic syndrome in people aged over 40, using the Korea Health and Nutrition Examination Study 2012. The attributes for the prediction model were chosen through a literature review. We applied three data mining algorithms, decision tree, logistic regression, and artificial neural network, using Weka 3.6. In the results, socioeconomic status factors ranked higher among the input attributes than health-related factors, and the prediction model using the decision tree algorithm showed the highest accuracy. This study suggests, first, that prevention and management of metabolic syndrome should be approached from the perspective of both socioeconomic status and health-related factors. Also, decision tree algorithms, known from other research, are useful in the field of public health because their results are easy to interpret.

Data mining Algorithms for the Development of Sasang Type Diagnosis (사상체질 진단검사를 위한 데이터마이닝 알고리즘 연구)

  • Hong, Jin-Woo;Kim, Young-In;Park, So-Jung;Kim, Byoung-Chul;Eom, Il-Kyu;Hwang, Min-Woo;Shin, Sang-Woo;Kim, Byung-Joo;Kwon, Young-Kyu;Chae, Han
    • Journal of Physiology & Pathology in Korean Medicine / v.23 no.6 / pp.1234-1240 / 2009
  • This study compared the effectiveness and validity of various data-mining algorithms for the Sasang type diagnostic test. We compared the sensitivity and specificity indices of nine attribute selection algorithms and eleven classification algorithms on 31 data sets characterizing Sasang typology, using the 10-fold validation methods provided in the Waikato Environment for Knowledge Analysis (WEKA). The highest classification validity scores were as follows: a Percentage Correctly Predicted (PCP) index of 69.9 with the Naive Bayes Classifier, a sensitivity index of 80 with LWL for the Tae-Eum type, and a specificity index of 93.5 with the Naive Bayes Classifier for the So-Eum type. After attribute selection, the classification algorithm with the highest PCP index, 69.62, was the Naive Bayes Classifier. We found that the best-fit algorithm for traditional medicine depends on the case, and that the characteristics of the clinical circumstances, the data-mining algorithms, and the study purpose should all be considered to obtain the highest validity, even with well-defined data sets. The results also confirm that there is no one-size-fits-all algorithm and that many studies involving trial and error are needed. This study will serve as a pivotal foundation for the development of medical instruments for Pattern Identification and Sasang type diagnosis based on traditional Korean Medicine.
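
The three indices reported in this abstract can be computed from actual and predicted labels as sketched below. Sensitivity and specificity are taken one type against the rest, and the Percentage Correctly Predicted (PCP) is taken over all types. The label lists are invented for illustration and are not the study's data; "TE", "SE", and "SY" are hypothetical abbreviations for Sasang types.

```python
def one_vs_rest_metrics(actual, predicted, positive):
    """Return (sensitivity, specificity) in percent for one class vs the rest."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    # Guard against classes that never occur (or always occur) in `actual`.
    sensitivity = 100.0 * tp / (tp + fn) if tp + fn else 0.0
    specificity = 100.0 * tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity


def percent_correctly_predicted(actual, predicted):
    """Return the share of cases whose predicted type matches the actual type."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return 100.0 * correct / len(actual)


# Hypothetical labels for ten cases.
actual = ["TE", "TE", "TE", "TE", "SE", "SE", "SY", "SY", "SY", "SY"]
predicted = ["TE", "TE", "TE", "SY", "SE", "TE", "SY", "SY", "SE", "SY"]

sens, spec = one_vs_rest_metrics(actual, predicted, positive="TE")
pcp = percent_correctly_predicted(actual, predicted)
print(round(sens, 2), round(spec, 2), round(pcp, 2))
```

Because sensitivity and specificity are defined per type while PCP is a single overall figure, the best algorithm can differ by index, as it does in the abstract (LWL for Tae-Eum sensitivity, Naive Bayes for PCP).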

Classification of Natural and Artificial Forests from KOMPSAT-3/3A/5 Images Using Artificial Neural Network (인공신경망을 이용한 KOMPSAT-3/3A/5 영상으로부터 자연림과 인공림의 분류)

  • Lee, Yong-Suk;Park, Sung-Hwan;Jung, Hyung-Sup;Baek, Won-Kyung
    • Korean Journal of Remote Sensing / v.34 no.6_3 / pp.1399-1414 / 2018
  • Natural forests are unmanaged forests formed without human intervention. Artificial forests, by contrast, are managed by people for purposes such as producing wood, preventing natural disasters, and providing protection from wind. Because they are well maintained for wood production, artificial forests yield more wood per unit area and therefore greater economic benefit. Because management methods differ by forest type, surveys to distinguish the types have been carried out. Such surveys are traditionally performed via airborne remote sensing or in-situ surveys. In this study, we propose a method for classifying forest types from satellite imagery to reduce the time and cost of in-situ surveying. A classification map of natural and artificial forest was generated from KOMPSAT-3, 3A, and 5 data using an artificial neural network (ANN). To validate classification accuracy, we used reference data from a 1/5,000 stock map. The overall accuracy of the classification was 77.03% when compared with the 1/5,000 stock map. We confirmed that the image acquisition time and other factors, such as whether trees are needleleaf or broadleaf, affect the distinction between artificial and natural forests made by artificial neural networks.

A Study on Ransomware Detection Methods in Actual Cases of Public Institutions (공공기관 실제 사례로 보는 랜섬웨어 탐지 방안에 대한 연구)

  • Yong Ju Park;Huy Kang Kim
    • Journal of the Korea Institute of Information Security & Cryptology / v.33 no.3 / pp.499-510 / 2023
  • Recently, intelligent and advanced cyber attacks have been targeting the computer networks of public institutions with files containing malicious code, or leaking information, and the damage is increasing. Even public institutions with various information protection systems can detect known attacks but remain vulnerable to unknown dynamic and encryption attacks when they rely on existing signature-based or static-analysis-based methods for detecting malware and ransomware files. The detection method proposed in this study extracts detection result data from those systems, among the information protection systems actually used by public institutions, that can detect malicious code and ransomware, derives various attributes by combining the data, and applies machine learning classification algorithms. Experiments show how the derived attributes are classified and which attributes have a significant effect on the classification result and on accuracy improvement. In the experimental results, although the effect of including or excluding a specific attribute differs between algorithms, learning with specific attributes increases accuracy. The results are expected to be useful for attribute selection when building algorithms that later detect malicious code, ransomware files, and abnormal behavior in information protection systems.