• Title/Summary/Keyword: Random forest algorithm


Mining Intellectual History Using Unstructured Data Analytics to Classify Thoughts for Digital Humanities (디지털 인문학에서 비정형 데이터 분석을 이용한 사조 분류 방법)

  • Seo, Hansol;Kwon, Ohbyung
    • Journal of Intelligence and Information Systems / v.24 no.1 / pp.141-166 / 2018
  • Information technology improves the efficiency of humanities research. It can be used to analyze a given topic or document automatically, facilitate connections to other ideas, and increase our understanding of intellectual history. We suggest a method to identify and automatically analyze the relationships between arguments contained in unstructured data collected from humanities writings such as books, papers, and articles. Our method, which we call history mining, reveals relationships of influence between arguments and the philosophers who present them. We utilize several classification algorithms, including a deep learning method. To verify the performance of the proposed methodology, we selected philosophers associated with empiricism and rationalism and collected their related writings or articles accessible on the internet. The performance of the classification algorithms was measured by recall, precision, F-score, and elapsed time. DNN, Random Forest, and Ensemble showed better performance than the other algorithms. Using the selected classification algorithm, we classified the writings of specific philosophers as rationalist or empiricist and generated a history map that takes each philosopher's years of activity into account.
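The recall, precision, and F-score measures named in this abstract can be sketched as follows. This is a minimal illustration, not the authors' evaluation code; the function name `precision_recall_f1` and the example labels are hypothetical.

```python
from collections import Counter


def precision_recall_f1(y_true, y_pred, positive):
    """Return (precision, recall, F1) with one class treated as positive."""
    counts = Counter()
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            counts["tp"] += 1  # correctly predicted positive
        elif pred == positive:
            counts["fp"] += 1  # predicted positive, actually negative
        elif truth == positive:
            counts["fn"] += 1  # missed positive
    tp, fp, fn = counts["tp"], counts["fp"], counts["fn"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical labels, invented for illustration only.
y_true = ["rationalism", "rationalism", "empiricism", "empiricism", "rationalism"]
y_pred = ["rationalism", "empiricism", "empiricism", "rationalism", "rationalism"]
p, r, f = precision_recall_f1(y_true, y_pred, positive="rationalism")
print(p, r, f)
```

For a two-class task such as rationalism versus empiricism, the same function can be called once per class and the results averaged if a single summary figure is wanted.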

Determination of Survival of Gastric Cancer Patients With Distant Lymph Node Metastasis Using Prealbumin Level and Prothrombin Time: Contour Plots Based on Random Survival Forest Algorithm on High-Dimensionality Clinical and Laboratory Datasets

  • Zhang, Cheng;Xie, Minmin;Zhang, Yi;Zhang, Xiaopeng;Feng, Chong;Wu, Zhijun;Feng, Ying;Yang, Yahui;Xu, Hui;Ma, Tai
    • Journal of Gastric Cancer / v.22 no.2 / pp.120-134 / 2022
  • Purpose: This study aimed to identify prognostic factors for patients with distant lymph node-involved gastric cancer (GC) using a machine learning algorithm, a method that offers considerable advantages and new prospects for high-dimensional biomedical data exploration. Materials and Methods: This study employed 79 features of clinical pathology, laboratory tests, and therapeutic details from 289 GC patients whose distant lymphadenopathy was presented as the first episode of recurrence or metastasis. Outcomes were measured as any-cause death events and survival months after distant lymph node metastasis. A prediction model was built based on possible outcome predictors using a random survival forest algorithm and confirmed by 5×5 nested cross-validation. The effects of single variables were interpreted using partial dependence plots. A contour plot was used to visually represent survival prediction based on 2 predictive features. Results: The median survival time of patients with GC with distant nodal metastasis was 9.2 months. The optimal model incorporated the prealbumin level and the prothrombin time (PT), and yielded a prediction error of 0.353. The inclusion of other variables resulted in poorer model performance. Patients with higher serum prealbumin levels or shorter PTs had a significantly better prognosis. The predicted one-year survival rate was stratified and illustrated as a contour plot based on the combined effect of the prealbumin level and the PT. Conclusions: Machine learning is useful for identifying the important determinants of cancer survival using high-dimensional datasets. The prealbumin level and the PT at distant lymph node metastasis are the 2 most crucial factors in predicting the subsequent survival time of advanced GC.
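The nested cross-validation mentioned in this abstract can be sketched as an index-splitting scheme. The sketch assumes "5×5" means 5 outer folds, each with 5 inner folds for model selection; the abstract does not spell this out, and the helper names `k_fold_indices` and `nested_cv_splits` are hypothetical. Shuffling and stratification are omitted for brevity.

```python
def k_fold_indices(indices, k):
    """Split a list of indices into k (train, test) pairs without shuffling."""
    folds = [indices[i::k] for i in range(k)]
    pairs = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        pairs.append((train, test))
    return pairs


def nested_cv_splits(n_samples, outer_k=5, inner_k=5):
    """Yield (outer_train, outer_test, inner_pairs) for each outer fold.

    The inner pairs are built only from the outer training indices, so the
    outer test fold never influences model selection.
    """
    all_idx = list(range(n_samples))
    for outer_train, outer_test in k_fold_indices(all_idx, outer_k):
        yield outer_train, outer_test, k_fold_indices(outer_train, inner_k)


# 289 is the cohort size reported in the abstract.
splits = list(nested_cv_splits(289))
print(len(splits), len(splits[0][2]))
```

In use, a model would be tuned on the inner pairs and then scored once on each outer test fold, giving an error estimate that is not biased by the tuning step.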

Prediction of Soil Moisture with Open Source Weather Data and Machine Learning Algorithms (공공 기상데이터와 기계학습 모델을 이용한 토양수분 예측)

  • Jang, Young-bin;Jang, Ik-hoon;Choe, Young-chan
    • Korean Journal of Agricultural and Forest Meteorology / v.22 no.1 / pp.1-12 / 2020
  • As one of the essential resources in the agricultural process, soil moisture has been carefully managed by predicting future changes and deficits. In recent years, approaches based on statistics and machine learning have been preferred in academia for predicting soil moisture because of their generalizability and ease of use in the field. However, little is known about whether machine learning based soil moisture prediction is applicable to conditions in South Korea. This paper therefore examines 1) whether publicly available weather data generated in South Korea is of sufficient quality to predict soil moisture, 2) which machine learning algorithm performs best under South Korean conditions, and 3) whether a single machine learning model can be applied generally across regions. We used several machine learning methods, namely Support Vector Machines (SVM), Random Forest (RF), Extremely Randomized Trees (ET), Gradient Boosting Machines (GBM), and Deep Feedforward Network (DFN), to predict future soil moisture in the Andong, Boseong, Cheolwon, and Suncheon regions with open source weather data. The GBM model showed the lowest prediction error in every data set we used (R squared: 0.96, RMSE: 1.8). Furthermore, GBM showed the lowest variance of prediction error between regions, which indicates that it has the highest generalizability.
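The two error measures reported in this abstract, RMSE and R squared, can be computed as in the following minimal sketch. The observed and predicted values are invented for illustration and are not taken from the paper.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical soil moisture readings and model predictions.
observed = [20.0, 22.0, 24.0, 26.0]
predicted = [21.0, 21.0, 25.0, 25.0]
print(rmse(observed, predicted), r_squared(observed, predicted))
```

RMSE is expressed in the units of the target variable, while R squared is unitless, which is why the two are usually reported together.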

Multiple imputation for competing risks survival data via pseudo-observations

  • Han, Seungbong;Andrei, Adin-Cristian;Tsui, Kam-Wah
    • Communications for Statistical Applications and Methods / v.25 no.4 / pp.385-396 / 2018
  • Competing risks are commonly encountered in biomedical research. Regression models for competing risks data can be developed based on data routinely collected in hospitals or general practices. However, these data sets usually contain missing covariate values. To overcome this problem, multiple imputation is often used to fit regression models under a missing at random (MAR) assumption. Here, we introduce a multivariate imputation by chained equations algorithm to deal with competing risks survival data. Using pseudo-observations, we make use of the available outcome information by accommodating the competing risk structure. Lastly, we illustrate the practical advantages of our approach using simulations and two data examples, one from coronary artery disease data and one from hepatocellular carcinoma data.
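For reference, the standard jackknife definition of a pseudo-observation is given below. The abstract does not state which estimator or time points the authors use, so the choice of the cumulative incidence function for cause $k$ at time $t$ is an assumption, as are the symbols.

```latex
% n: number of subjects
% \hat{F}_k(t): estimated cumulative incidence of cause k at time t, from all n subjects
% \hat{F}_k^{(-i)}(t): the same estimate with subject i left out
\hat{\theta}_{ik}(t) = n\,\hat{F}_k(t) - (n-1)\,\hat{F}_k^{(-i)}(t),
\qquad i = 1, \dots, n
```

Each subject thus receives a value $\hat{\theta}_{ik}(t)$ even when its event time is censored, which is what allows the outcome to enter a regression or imputation model as an ordinary variable.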

Analysis on Review Data of Restaurants in Google Maps through Text Mining: Focusing on Sentiment Analysis

  • Shin, Bee;Ryu, Sohee;Kim, Yongjun;Kim, Dongwhan
    • Journal of Multimedia Information System / v.9 no.1 / pp.61-68 / 2022
  • Online reviews are increasingly important as more people look up goods or places online before deciding to visit or purchase. However, such reviews are generally provided as short sentences or mere star ratings, failing to provide a general overview of customer preferences and decision factors. This study explored and broke down restaurant reviews found on Google Maps. After collecting and analyzing 5,427 reviews, we vectorized the importance of words using TF-IDF. We used a random forest machine learning algorithm to calculate the coefficient of positivity and negativity of words used in reviews. As a result, we were able to build a dictionary of words for positive and negative sentiment using each word's coefficient. We classified words into four major evaluation categories and derived insights into sentiment in each category. We believe that the dictionary of review words and the analysis of the major evaluation categories can help prospective restaurant visitors read between the lines of restaurant reviews found on the Web.
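The TF-IDF weighting named in this abstract can be sketched in plain Python. This uses the basic unsmoothed form (relative term frequency times the log of inverse document frequency); the abstract does not say which variant or library the authors used, and the example reviews are invented.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical review snippets, already lowercased and split on whitespace.
reviews = [
    "great food great service".split(),
    "slow service".split(),
    "great view".split(),
]
vectors = tf_idf(reviews)
print(vectors[1])
```

Words that occur in few reviews, such as "slow" here, receive higher weights than words spread across many reviews, such as "service". The resulting vectors are the kind of input a classifier like a random forest can be trained on.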

Applying Machine Learning approaches to predict High-school Student Assessment scores based on high school transcript records

  • Nguyen Ba Tien;Hoai-Nam Nguyen;Hoang-Ha Le;Tran Thu Trang;Chau Van Dinh;Ha-Nam Nguyen;Gyoo Seok Choi
    • International Journal of Internet, Broadcasting and Communication / v.15 no.2 / pp.261-267 / 2023
  • A common approach to the problem of predicting student test scores is based on the student's previous educational history. In this study, high school transcripts of about two thousand candidates who took the High-school Student Assessment (HSA) were collected. A Random Forest regression model was built to predict the HSA scores, and its parameters were optimized using a Genetic Algorithm (GA). The Root Mean Square Error (RMSE) was used to evaluate the performance of the predictive models.
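Parameter tuning with a genetic algorithm, as described in this abstract, can be sketched as below. This is a generic illustration and not the authors' procedure: the parameter names `n_estimators` and `max_depth`, their bounds, the GA settings, and the `toy_rmse` objective (a stand-in for the cross-validated RMSE of a real model) are all assumptions.

```python
import random


def genetic_search(fitness, bounds, pop_size=20, generations=30,
                   mutation_rate=0.2, seed=0):
    """Minimize fitness over integer parameters; bounds maps name -> (low, high)."""
    rng = random.Random(seed)
    names = list(bounds)

    def random_individual():
        return {n: rng.randint(*bounds[n]) for n in names}

    def crossover(a, b):
        # Uniform crossover: each parameter comes from either parent.
        return {n: (a[n] if rng.random() < 0.5 else b[n]) for n in names}

    def mutate(ind):
        # Resample each parameter with probability mutation_rate.
        return {n: (rng.randint(*bounds[n]) if rng.random() < mutation_rate else v)
                for n, v in ind.items()}

    population = [random_individual() for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness)
        parents = population[: pop_size // 2]  # the fitter half survives
        children = [mutate(crossover(rng.choice(parents), rng.choice(parents)))
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    return min(population, key=fitness)


def toy_rmse(params):
    """Hypothetical objective standing in for a model's validation RMSE."""
    return ((params["n_estimators"] - 300) ** 2 / 1e4
            + (params["max_depth"] - 12) ** 2) ** 0.5


bounds = {"n_estimators": (50, 500), "max_depth": (2, 30)}
best = genetic_search(toy_rmse, bounds)
print(best, toy_rmse(best))
```

In a real setting, `toy_rmse` would be replaced by a function that trains the regression model with the given parameters and returns its validation RMSE, which makes each fitness evaluation far more expensive than in this sketch.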

Prediction of the Shelter Dog Outcome using Machine Learning Models (머신러닝을 이용한 유기견 안락사 예측)

  • Lee, Ye-Seol;Lee, Se-Hoon;Keane, John
    • Proceedings of the Korean Society of Computer Information Conference / 2020.07a / pp.301-302 / 2020
  • The number of abandoned dogs has been increasing every year in South Korea. However, many dogs are euthanized in shelters because of a lack of budget. This project predicts the euthanasia of abandoned dogs using machine learning algorithms. It collects data from the public data portal, where the Korean government provides a public dataset through an open API. The project uses three years of recent data, from 2017 to 2019, in which 263,371 cases were found. It implements random forest and logistic regression models and attained an average prediction accuracy of 72%.

A Study on Smoker Prediction Using Machine Learning Algorithm (기계학습 알고리즘을 이용한 흡연자 예측 연구)

  • Jongwoo Baek;Joonil Bang;Joowon Lee;Hwajong Kim
    • Proceedings of the Korean Society of Computer Information Conference / 2023.07a / pp.537-538 / 2023
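  • In this paper, we used two machine learning algorithms, random forest and gradient boosting trees, to analyze the correlation between human biometric characteristics and smoking status. The data used in the study were health examination records provided by the National Health Insurance Service and compiled on Kaggle. We expected serum information to show a strong relationship when training the classification model, but the actual results showed that sex had the greatest influence.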

Classification of Network Traffic using Machine Learning for Software Defined Networks

  • Muhammad Shahzad Haroon;Husnain Mansoor
    • International Journal of Computer Science & Network Security / v.23 no.12 / pp.91-100 / 2023
  • As SDN devices and systems hit the market, security in SDN must be raised on the agenda. SDN has become an area of interest in both academia and industry. It promises many benefits, which attract IT managers and leading IT companies and motivate them to switch to SDN. Over the last three decades, network attacks have become more sophisticated and harder to detect. The goal is to study how traffic information can be extracted from an SDN controller and open virtual switches (OVS) using SDN mechanisms. The testbed environment is created using the RYU controller and Mininet. The extracted information is further used to detect these attacks efficiently using a machine learning approach. A machine learning approach requires a dataset, but a public SDN-based dataset is not currently available. In this paper, an SDN-based dataset is created that includes legitimate and non-legitimate traffic. Classification is divided into two categories: binary and multiclass. Traffic is classified both with and without dimension reduction techniques such as PCA and LDA. Our approach achieves 98.58% accuracy using a random forest algorithm.

Machine Learning based Prediction of The Value of Buildings

  • Lee, Woosik;Kim, Namgi;Choi, Yoon-Ho;Kim, Yong Soo;Lee, Byoung-Dai
    • KSII Transactions on Internet and Information Systems (TIIS) / v.12 no.8 / pp.3966-3991 / 2018
  • Due to the lack of visualization services and organic combinations between public and private building data, the usability of the basic map has remained low. To address this issue, this paper reports on a solution that organically combines public and private data while providing visualization services to general users. For this purpose, factors that can affect building prices were first examined in order to define the related data attributes. To extract the relevant data attributes, this paper presents a method of acquiring public information data and real estate-related information provided by private real estate portal sites. The paper also proposes a preprocessing step required for intelligent machine learning. It goes on to suggest an intelligent machine learning algorithm that predicts buildings' value pricing and future value by using big data on buildings' spatial information, acquired from a database containing building value attributes. The algorithm's applicability was tested by establishing a prototype targeting pilot areas, including Suwon, Anyang, and Gunpo in South Korea. Finally, a prototype visualization solution was developed to allow general users to make effective use of buildings' value ranking and value pricing, as predicted by intelligent machine learning.