• Title/Summary/Keyword: XGboost

Scalable Prediction Models for Airbnb Listing in Spark Big Data Cluster using GPU-accelerated RAPIDS

  • Muralidharan, Samyuktha;Yadav, Savita;Huh, Jungwoo;Lee, Sanghoon;Woo, Jongwook
    • Journal of information and communication convergence engineering
    • /
    • v.20 no.2
    • /
    • pp.96-102
    • /
    • 2022
  • We aim to build predictive models for Airbnb prices using GPU-accelerated RAPIDS in a big data cluster. The Airbnb Listings datasets are used for the predictive analysis. Several machine-learning algorithms were adopted to build models that predict the price of Airbnb listings. We compare the results of traditional and big data approaches to machine learning for price prediction and discuss the performance of the models. We built big data models using a Databricks Spark cluster, a distributed parallel computing system. Furthermore, we implemented models on multiple GPUs using RAPIDS in the Spark cluster. The GPU-accelerated model was developed using the XGBoost algorithm, whereas the other models were developed using traditional central processing unit (CPU)-based algorithms. This study compared all models in terms of accuracy metrics and computing time. We observed that the XGBoost model with RAPIDS using GPUs had the highest accuracy and computing time.

ANN-XGB based predictions of dissolved oxygen (ANN-XGB를 이용한 수중 산소 농도 예측)

  • Jo, Gwanghyun;Lee, Keun Young
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference
    • /
    • 2022.10a
    • /
    • pp.457-458
    • /
    • 2022
  • Dissolved oxygen (DO) is one of the ecosystem factors that affect the survival of aquatic life. An artificial neural network combined with XGBoost (ANN-XGB), trained on water quality and weather data obtained at Anyang-stream, was employed to forecast DO one hour ahead. We document the performance of ANN-XGB.

Determination presence of people in accommodation using feature extraction and XGBoost method of energy data (전력 데이터의 특징 추출 및 XGBoost를 이용한 숙박 업소 재실 여부 판단)

  • Kim, Eden;Ko, Seok-Gap;Son, Seung-Chul;Lee, Hyung-Ok;Lee, Byung-Tak
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • 2020.05a
    • /
    • pp.458-460
    • /
    • 2020
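  • As advances in smart meter technology and its wider adoption have made power data easier to collect, research on power data analysis techniques for providing efficient, customized services to each system is being actively pursued. In this context, this paper measures and collects the power consumption of each room in an accommodation facility, analyzes the power consumption patterns, and introduces a method for determining whether each room is occupied using feature extraction and XGBoost-based machine-learning analysis. The results of this research can later be applied to provide customized services for accommodation businesses or for the consumers who use them.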

A study on data scaling and feature selection techniques for XGBoost-based intrusion detection model (XGBoost 기반 침입탐지모델을 위한 데이터 스케일링 및 특성선택 기법 연구)

  • Kim, Young-Won;Lee, Soo-Jin
    • Proceedings of the Korean Society of Computer Information Conference
    • /
    • 2022.07a
    • /
    • pp.251-254
    • /
    • 2022
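  • This paper proposes scaling and feature selection techniques to improve the performance of an XGBoost-based intrusion detection model. Performing scaling and feature selection during the preprocessing stage of machine-learning model development reduces the condition number of the dataset and can thereby improve model performance. Although various techniques exist for each step, previous studies did not compare and analyze the results of applying them; they only listed the results of applying specific techniques and did not present an optimal combination of scaling and feature selection. This paper therefore compares the results of applying various preprocessing techniques and proposes an optimal combination. In addition, whereas previous studies proposed preprocessing techniques applicable only to specific datasets, this paper proposes preprocessing techniques that apply across various datasets, thereby demonstrating the generality and real-world applicability of the proposed techniques.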

Enhancing E-commerce Security: A Comprehensive Approach to Real-Time Fraud Detection

  • Sara Alqethami;Badriah Almutanni;Walla Aleidarousr
    • International Journal of Computer Science & Network Security
    • /
    • v.24 no.4
    • /
    • pp.1-10
    • /
    • 2024
  • In the era of big data, the growth of e-commerce transactions brings forth both opportunities and risks, including the threat of data theft and fraud. To address these challenges, an automated real-time fraud detection system leveraging machine learning was developed. Four algorithms (Decision Tree, Naïve Bayes, XGBoost, and Neural Network) underwent comparison using a dataset from a clothing website that encompassed both legitimate and fraudulent transactions. The dataset exhibited an imbalance, with 9.3% representing fraud and 90.07% legitimate transactions. Performance evaluation metrics, including Recall, Precision, F1 Score, and AUC ROC, were employed to assess the effectiveness of each algorithm. XGBoost emerged as the top-performing model, achieving an impressive accuracy score of 95.85%. The proposed system proves to be a robust defense mechanism against fraudulent activities in e-commerce, thereby enhancing security and instilling trust in online transactions.
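
The abstract above evaluates models on an imbalanced dataset with Recall, Precision, and F1 Score. As a minimal sketch of why those metrics matter alongside accuracy, the following computes them for hypothetical labels (10 fraudulent and 90 legitimate transactions, not the paper's data). The function name `classification_metrics` and both sets of predictions are assumptions made for illustration.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels: 1 = fraud, 0 = legitimate.
y_true = [1] * 10 + [0] * 90

# A degenerate model that labels every transaction as legitimate.
baseline = classification_metrics(y_true, [0] * 100)

# A hypothetical model that catches 8 of 10 frauds with 4 false alarms.
y_pred = [1] * 8 + [0] * 2 + [1] * 4 + [0] * 86
model = classification_metrics(y_true, y_pred)
```

In this toy setting the degenerate baseline still scores high on accuracy while its recall is zero, which is why an accuracy figure on imbalanced data is usually read together with recall, precision, and F1.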

Research on predicting changes in crop cultivation areas due to climate change: Focusing on Hallabong (기후변화에 따른 과수작물 재배지 변화 예측 연구: 한라봉을 중심으로)

  • Park, Hye Eun;Lee, Jong Tae
    • The Journal of Information Systems
    • /
    • v.33 no.1
    • /
    • pp.31-44
    • /
    • 2024
  • Purpose: This study uses climate data to find the algorithm that best predicts Hallabong production and to predict future Hallabong production in areas where Hallabong cultivation is expected to become possible. Design/methodology/approach: The research is conducted in two stages. In the first stage, the algorithm with the highest predictive power is identified among XGBoost, Random Forest, SVM, and LSTM. In the second stage, the algorithm identified in the first stage is applied to predict future Hallabong production in three regions where Hallabong production is expected to be possible. Findings: As in many prediction studies, XGBoost showed the highest predictive power. Among the areas where Hallabong production is expected to be possible, production was predicted to be highest in Hongcheon, Gangwon-do, which has the highest latitude.

Prediction of Dissolved Oxygen at Anyang-stream using XG-Boost and Artificial Neural Networks

  • Keun Young Lee;Bomchul Kim;Gwanghyun Jo
    • Journal of information and communication convergence engineering
    • /
    • v.22 no.2
    • /
    • pp.133-138
    • /
    • 2024
  • Dissolved oxygen (DO) is an important factor in ecosystems. However, the analysis of DO is often complicated by the nonlinear behavior of the river system. Therefore, a convenient model-free algorithm for predicting DO is required. In this study, a data-driven algorithm for predicting DO, called ANN-XGB, was developed by combining XGBoost and an artificial neural network (ANN). To train the model, two years of ecosystem data were collected in Anyang, Seoul using the Troll 9500 model. One advantage of the proposed algorithm is its ability to capture abrupt changes in climate-related features that arise from sudden events. Moreover, our algorithm can provide a feature importance analysis owing to the use of XGBoost. The results obtained using the ANN-XGB algorithm were compared with those obtained using the ANN algorithm in the Results Section. The predictions made by ANN-XGB were mostly in closer agreement with the measured DO values in the river than those made by the ANN.

Anomaly CAN Message Detection Using Heuristics and XGBoost (휴리스틱과 XGBoost 를 활용한 비정상 CAN 메시지 탐지)

  • Se-Rin Kim;Beom-Heon Youn;Hark-Su Cho
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • 2024.05a
    • /
    • pp.362-363
    • /
    • 2024
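  • As vehicles have recently become more networked and connected, design vulnerabilities of the CAN (Controller Area Network) bus have emerged as a security threat. In response, research on machine-learning-based intrusion detection systems is needed to overcome these vulnerabilities and strengthen security. This paper proposes an anomaly classification methodology using XGBoost. In experiments based on a dataset developed by the Hacking and Countermeasure Research Lab at Korea University, the initial model achieved an accuracy of 96%. Adding TimeDiff (the interval between occurrences) and DataDiff (the difference values of the bytes) to the model raised the accuracy by 3%. In future work, we plan to apply more sophisticated machine-learning algorithms and data preprocessing techniques to develop a finer-grained model, and to use manufacturers' CAN databases to analyze the data more accurately. We expect this to enable a more reliable automotive network security system.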
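
The TimeDiff and DataDiff features named in the abstract above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes both differences are taken against the previous message with the same CAN ID, that the first message of each ID gets zeros, and that messages arrive sorted by timestamp. The function name `add_diff_features`, the dictionary keys, and the sample frames are all hypothetical.

```python
def add_diff_features(messages):
    """Attach TimeDiff and DataDiff to each CAN message.

    messages: list of dicts with 'timestamp' (seconds), 'can_id' (int)
    and 'data' (bytes), assumed to be sorted by timestamp.
    """
    last_seen = {}  # can_id -> (timestamp, data) of the previous message
    rows = []
    for msg in messages:
        prev = last_seen.get(msg["can_id"])
        if prev is None:
            # Assumption: no history for this ID yet, so use zeros.
            time_diff = 0.0
            data_diff = [0] * len(msg["data"])
        else:
            prev_ts, prev_data = prev
            time_diff = msg["timestamp"] - prev_ts
            # Byte-wise difference; zip stops at the shorter payload.
            data_diff = [cur - old for cur, old in zip(msg["data"], prev_data)]
        rows.append({**msg, "time_diff": time_diff, "data_diff": data_diff})
        last_seen[msg["can_id"]] = (msg["timestamp"], msg["data"])
    return rows


# Hypothetical frames: two messages with ID 0x316 and one with ID 0x18F.
frames = [
    {"timestamp": 0.000, "can_id": 0x316, "data": bytes([0x05, 0x20, 0x00])},
    {"timestamp": 0.004, "can_id": 0x18F, "data": bytes([0xFE, 0x3B, 0x00])},
    {"timestamp": 0.010, "can_id": 0x316, "data": bytes([0x05, 0x22, 0x01])},
]
features = add_diff_features(frames)
```

The resulting `time_diff` and `data_diff` values could then be appended to each message's feature vector before training a classifier such as XGBoost.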

Factors influencing metabolic syndrome perception and exercising behaviors in Korean adults: Data mining approach (대사증후군의 인지와 신체활동 실천에 영향을 미치는 요인: 데이터 마이닝 접근)

  • Lee, Soo-Kyoung;Moon, Mikyung
    • Journal of the Korea Academia-Industrial cooperation Society
    • /
    • v.18 no.12
    • /
    • pp.581-588
    • /
    • 2017
  • This study, conducted from July 2014 to December 2015, sought to determine which factors predict metabolic syndrome (MetS) perception and exercise by applying a machine learning classifier, the Extreme Gradient Boosting algorithm (XGBoost). Data were obtained from the Korean Community Health Survey (KCHS), which represents community-dwelling Korean adults aged 19 years and older, from 2009 to 2013. The dataset includes 370,430 adults. Outcomes were categorized as follows based on the perception of MetS and physical activity (PA): Stage 1 (no perception, no PA), Stage 2 (perception, no PA), and Stage 3 (perception, PA). Features common to all questionnaires for the last 5 years were selected for modeling. Overall, there were 161 features, all categorical except for age and the visual analogue scale (EQ-VAS). We used the Extreme Gradient Boosting algorithm in R to build a model that predicts the factors and achieved a prediction accuracy of 0.735. The top 10 predictive factors in Stage 3 were: age, education level, attempt to control weight, EQ mobility, nutrition label checks, private health insurance, EQ-5D usual activities, anti-smoking advertising, EQ-VAS, education in health centers for diabetes, and dental care. In conclusion, the results showed that XGBoost can be used to identify factors influencing disease prevention and management using healthcare big data.

Detecting Errors in POS-Tagged Corpus on XGBoost and Cross Validation (XGBoost와 교차검증을 이용한 품사부착말뭉치에서의 오류 탐지)

  • Choi, Min-Seok;Kim, Chang-Hyun;Park, Ho-Min;Cheon, Min-Ah;Yoon, Ho;Namgoong, Young;Kim, Jae-Kyun;Kim, Jae-Hoon
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.9 no.7
    • /
    • pp.221-228
    • /
    • 2020
  • A Part-of-Speech (POS) tagged corpus is a collection of electronic text in which each word is annotated with its corresponding POS tag, and it is widely used as training data for natural language processing. Training data is generally assumed to be error-free, but in reality it contains various types of errors, which degrade the performance of systems trained on it. To alleviate this problem, we propose a novel method for detecting errors in an existing POS-tagged corpus using an XGBoost classifier and cross-validation. We first train a POS-tagging classifier on the POS-tagged corpus, which contains some errors, and then detect errors in the same corpus using cross-validation. However, the classifier cannot detect errors directly because there is no training data for detecting POS-tagging errors. We therefore detect errors by comparing the outputs (POS probabilities) of the classifier while adjusting hyperparameters. The hyperparameters are estimated on a small error-tagged corpus, whose text is sampled from the POS-tagged corpus and whose POS errors are marked up by experts. We use recall and precision, which are widely used in information retrieval, as evaluation metrics. Because not all detected errors can be checked, we show that the proposed method is valid by comparing the distributions of the sample (the error-tagged corpus) and the population (the POS-tagged corpus). In the near future, we will apply the proposed method to a dependency tree-tagged corpus and a semantic role-tagged corpus.
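
The cross-validation idea in the abstract above can be sketched roughly as follows. This is one possible reading, not the authors' method: a simple word-to-tag frequency model stands in for the XGBoost classifier so the example stays dependency-free, and a token is flagged when the held-out probability of its annotated tag falls below a threshold. The names `train`, `tag_probability`, and `detect_errors`, the threshold value, and the toy corpus are all assumptions.

```python
from collections import Counter, defaultdict


def train(tagged):
    """Count how often each word carries each tag (stand-in classifier)."""
    counts = defaultdict(Counter)
    for word, tag in tagged:
        counts[word][tag] += 1
    return counts


def tag_probability(counts, word, tag):
    """Relative frequency of `tag` for `word`; None if the word is unseen."""
    total = sum(counts[word].values()) if word in counts else 0
    return counts[word][tag] / total if total else None


def detect_errors(corpus, k=5, threshold=0.3):
    """Flag tokens whose annotated tag looks improbable to a model
    trained on the other k-1 folds."""
    suspects = []
    for fold in range(k):
        training = [corpus[i] for i in range(len(corpus)) if i % k != fold]
        counts = train(training)
        for i in range(fold, len(corpus), k):  # held-out token indices
            word, tag = corpus[i]
            p = tag_probability(counts, word, tag)
            if p is not None and p < threshold:
                suspects.append((i, word, tag, p))
    return sorted(suspects)


# Toy corpus with one injected annotation error at index 6.
corpus = [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")] * 5
corpus[6] = ("the", "NOUN")
suspects = detect_errors(corpus)
```

Because each token is scored by a model that never saw its own annotation, an isolated wrong tag tends to receive a low probability. In the setting the abstract describes, the threshold would be one of the values tuned on the small expert-marked corpus.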