• Title/Summary/Keyword: Kaggle Dataset

Developing of New a Tensorflow Tutorial Model on Machine Learning : Focusing on the Kaggle Titanic Dataset (텐서플로우 튜토리얼 방식의 머신러닝 신규 모델 개발 : 캐글 타이타닉 데이터 셋을 중심으로)

  • Kim, Dong Gil;Park, Yong-Soon;Park, Lae-Jeong;Chung, Tae-Yun
    • IEMEK Journal of Embedded Systems and Applications / v.14 no.4 / pp.207-218 / 2019
  • The purpose of this study is to develop a model with which the whole learning process of machine learning can be studied systematically. Because the existing model describes the learning process with minimal coding, the new model lets learners follow the machine learning process step by step and visualize each step using TensorFlow. The new model used all of the algorithms of the existing model and confirmed the importance of the variables that affect the target variable, survival. The training data were split into training and validation sets, and the performance of the model was evaluated with test data. In the final analysis, ensemble techniques showed high performance in all tutorial models, and the maximum performance of the model improved by up to 5.2% compared with the existing model. Future research should construct an environment in which machine learning can be studied regardless of the data preprocessing method and OS, and in which a model that outperforms the existing one can be trained.

Detecting Fake Job Recruitment with a Machine Learning Approach (머신 러닝 접근 방식을 통한 가짜 채용 탐지)

  • Taghiyev Ilkin;Jae Heung Lee
    • Smart Media Journal / v.12 no.2 / pp.36-41 / 2023
  • With the advent of applicant tracking systems, online recruitment has become more popular, and recruitment fraud has become a serious problem. This research aims to develop a reliable model to detect recruitment fraud in online recruitment environments to reduce cost losses and enhance privacy. The main contribution of this paper is to provide an automated methodology that leverages insights gained from exploratory analysis of data to distinguish which job postings are fraudulent and which are legitimate. Using EMSCAD, a recruitment fraud dataset provided by Kaggle, we trained and evaluated various single-classifier and ensemble-classifier-based machine learning models, and found that the ensemble classifier, the random forest classifier, performed best with an accuracy of 98.67% and an F1 score of 0.81.
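The gap between the reported accuracy (98.67%) and F1 score (0.81) is typical of imbalanced data, where legitimate postings far outnumber fraudulent ones. The sketch below illustrates the effect with hypothetical confusion-matrix counts; the counts and the function name `classification_metrics` are assumptions for illustration and are not taken from the paper or from EMSCAD.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, f1) from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1


# Hypothetical example: 1,000 postings, of which 50 are fraudulent.
acc, prec, rec, f1 = classification_metrics(tp=35, fp=5, fn=15, tn=945)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")
```

Because true negatives dominate the total, accuracy stays high even when a noticeable share of the fraudulent postings is missed, which is why F1 on the minority class is usually reported alongside it.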

Detection of Bacteria in Blood in Darkfield Microscopy Image (암시야 현미경 영상에서 혈액 내 박테리아 검출 방법)

  • Park, Hyun-jun
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2021.10a / pp.183-185 / 2021
  • Detecting bacteria in blood could be an important research area in medicine and computer vision. In this paper, we propose a method for detecting bacteria in blood using 366 darkfield microscopy images obtained from Kaggle. We generate a training dataset through preprocessing and data augmentation using image processing techniques, and define a deep learning model to learn from it. The experiment confirmed that the proposed deep learning model effectively detects red blood cells and bacteria in darkfield microscopy images. Although we trained a relatively simple model in this paper, more accurate results could likely be obtained with a deeper model.

Exploring the Impact of Pesticide Usage on Crop Condition: A Causal Analysis of Agricultural Factors

  • Mee Qi Siow;Yang Sok Kim;Mi Jin Noh;Mu Moung Cho Han
    • Smart Media Journal / v.12 no.10 / pp.29-37 / 2023
  • Human lifestyles have been shaped by agricultural development over the last 12,000 years. The development of agriculture is one of the reasons that the global population surged. To ensure sufficient food production to support human life, pesticides, as effective and economical tools, are used extensively to enhance yield quality and boost crop production. This study investigated the factors that affect crop production, and whether the factors of pesticide usage are the most important among them, using a dataset from Kaggle that provides information on crops harvested by various farmers. Logistic regression is used to investigate the relationship between various factors and crop production. However, logistic regression cannot handle predictors that are related to each other and cannot identify the factor with the greatest impact. Therefore, causal discovery is applied to address these limitations. The result of causal discovery showed that crop condition is greatly impacted by the estimated insect count, which in turn is affected by the factors of pesticide usage. This study enhances our understanding of the influence of pesticide usage on crop production and contributes to the progress of agricultural practices.

Comparative analysis of deep learning performance for Python and C# using Keras (Keras를 이용한 Python과 C#의 딥러닝 성능 비교 분석)

  • Lee, Sung-jin;Moon, Sang-Ho
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2022.10a / pp.360-363 / 2022
  • According to the 2018 Kaggle ML & DS Survey, among frameworks for machine learning and data science, TensorFlow and Keras account for 41.82% and 34.09%, respectively, and about 82% of respondents use Python as their development language. A significant number of machine learning and deep learning systems use the Keras framework and Python, but because Python is a scripting language, distribution and execution are limited to the Python script environment, which makes it difficult to operate in various environments. This paper implemented a machine learning and deep learning system using C# and Keras running in Visual Studio 2019. Using the MNIST dataset, 100 tests were performed in Python 3.8.2 and C# .NET 5.0 environments. For Python, the minimum time was 1.86 seconds, the maximum time was 2.38 seconds, and the average time was 1.98 seconds. For C#, the minimum time was 1.78 seconds, the maximum time was 2.11 seconds, the average time was 1.85 seconds, and the total time was 37.02 seconds. In the experiment, the performance of C# improved by about 6% compared to Python, and C# is expected to see high utilization because executable files can be produced.
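One plausible reading of the "about 6%" figure is the relative reduction in average run time from Python (1.98 seconds) to C# (1.85 seconds). The abstract does not state the formula, so the calculation below is an assumption, and `relative_improvement` is a hypothetical helper name.

```python
def relative_improvement(baseline_s, candidate_s):
    """Fractional reduction in run time of candidate relative to baseline."""
    return (baseline_s - candidate_s) / baseline_s


python_avg_s = 1.98  # average time reported for Python
csharp_avg_s = 1.85  # average time reported for C#

improvement = relative_improvement(python_avg_s, csharp_avg_s)
print(f"C# average time is {improvement:.1%} lower than Python's")
```

Under this reading the reduction is roughly 6.6%, which is consistent with the abstract's rounded figure.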

Heart Disease Prediction Using Decision Tree With Kaggle Dataset

  • Noh, Young-Dan;Cho, Kyu-Cheol
    • Journal of the Korea Society of Computer and Information / v.27 no.5 / pp.21-28 / 2022
  • Cardiovascular illness refers to all health problems that occur in the circulatory system, such as heart and vascular diseases. Deaths from cardiovascular disorders accounted for one third of all deaths worldwide in 2019, and the number of deaths continues to rise. Therefore, if diseases with a high mortality rate could be predicted from patient data with an AI system, they could be detected and treated in advance. In this study, models are built to predict heart disease, one of the cardiovascular diseases, and their performance is compared in terms of Accuracy, Precision, and Recall, along with a description of how to improve the performance of the Decision Tree. Decision Tree, KNN (K-Nearest Neighbor), SVM (Support Vector Machine), and DNN (Deep Neural Network) are used in this study. Experiments were conducted in Python with the scikit-learn, Keras, and TensorFlow libraries, in Jupyter Notebook on macOS Big Sur. In the comparison of the models, the Decision Tree demonstrates the highest performance, and this study therefore recommends using the Decision Tree.

Ensuring Data Confidentiality and Privacy in the Cloud using Non-Deterministic Cryptographic Scheme

  • John Kwao Dawson;Frimpong Twum;James Benjamin Hayfron Acquah;Yaw Missah
    • International Journal of Computer Science & Network Security / v.23 no.7 / pp.49-60 / 2023
  • The amount of data generated by electronic systems through e-commerce, social networks, and data computation has risen. However, the security of data has always been a challenge. The problem is not the quantity of data but how to secure the data by ensuring its confidentiality and privacy. Although there are several studies on cloud data security, this study proposes a security scheme with the lowest execution time. The approach employs a non-linear time complexity to achieve data confidentiality and privacy. A symmetric algorithm dubbed the Non-Deterministic Cryptographic Scheme (NCS) is proposed to address the increased execution time of existing cryptographic schemes. NCS has linear time complexity with a low and unpredictable trend of execution times. It achieves confidentiality and privacy of data on the cloud by converting the plaintext into ciphertext with a small number of iterations, thereby decreasing the execution time while keeping security high. The algorithm is based on Good Prime Numbers, a Linear Congruential Generator (LCG), the Sliding Window Algorithm (SWA), and an XOR gate. For the implementation in C, thirty executions were timed and their average was taken. A comparative analysis of the NCS was performed against the AES, DES, and RSA algorithms based on key sizes of 128kb, 256kb, and 512kb using a dataset from Kaggle. The results showed that the execution times of the proposed NCS were lower than those of AES, which in turn had a better execution time than DES, with RSA having the longest. Contrary to the existing understanding that execution time scales with data size, the results of the experiment indicated otherwise for the proposed NCS algorithm. With data sizes of 128kb, 256kb, and 512kb, the execution times in milliseconds were 38, 711, and 378, respectively. This validates the NCS as a Non-Deterministic Cryptographic Algorithm. The study findings hence support the argument that data size does not determine the execution time.
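Two of the building blocks named in the abstract, a linear congruential generator and XOR, can be combined into a simple symmetric stream transform. The sketch below is only an illustration of those two primitives: it is not the NCS algorithm, it omits the Good Prime Numbers and Sliding Window components, and it is not cryptographically secure. The function names and the LCG constants (a classic textbook parameter set) are assumptions chosen for the example.

```python
def lcg_keystream(seed, length, a=1103515245, c=12345, m=2**31):
    """Yield `length` key bytes from the recurrence x_{n+1} = (a*x_n + c) mod m."""
    state = seed % m
    out = bytearray()
    for _ in range(length):
        state = (a * state + c) % m
        # Take bits 16..23 of the state; the low bits of an LCG are weak.
        out.append((state >> 16) & 0xFF)
    return bytes(out)


def xor_transform(data, seed):
    """XOR each byte of `data` with the LCG keystream derived from `seed`.

    Applying the function twice with the same seed restores the input,
    so the same routine serves for both encryption and decryption.
    """
    keystream = lcg_keystream(seed, len(data))
    return bytes(b ^ k for b, k in zip(data, keystream))


plaintext = b"record 42: confidential"
ciphertext = xor_transform(plaintext, seed=2023)
recovered = xor_transform(ciphertext, seed=2023)
print(ciphertext.hex())
print(recovered)
```

The work done here is one LCG step and one XOR per input byte, so the run time of this sketch grows linearly with data size.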