• Title/Summary/Keyword: Research dataset

Search Result 1,350, Processing Time 0.025 seconds

Classification of Malware Families Using Hybrid Datasets (하이브리드 데이터셋을 이용한 악성코드 패밀리 분류)

  • Seo-Woo Choi;Myeong-Jin Han;Yeon-Ji Lee;Il-Gu Lee
    • Journal of the Korea Institute of Information Security & Cryptology
    • /
    • v.33 no.6
    • /
    • pp.1067-1076
    • /
    • 2023
  • Recently, as variant malware has increased, the scale of cyber hacking incidents is expanding. To respond to intelligent cyberhacking attack, machine learning-based research is actively underway to effectively classify malware families. However, existing classification models have problems where performance deteriorates when the dataset is obfuscated or sparse. In this paper, we propose a hybrid dataset that combines features extracted from ASM files and BYTES files, and evaluate classification performance using FNN. As a result of the experiment, the proposed method showed performance improvement of about 4% compared to a single dataset, and in particular, performance improvement of about 30% for rare families.

Utilizing Deep Learning for Early Diagnosis of Autism: Detecting Self-Stimulatory Behavior

  • Seongwoo Park;Sukbeom Chang;JooHee Oh
    • International Journal of Advanced Culture Technology
    • /
    • v.12 no.3
    • /
    • pp.148-158
    • /
    • 2024
  • We investigate Autism Spectrum Disorder (ASD), which is typified by deficits in social interaction, repetitive behaviors, limited vocabulary, and cognitive delays. Traditional diagnostic methodologies, reliant on expert evaluations, frequently result in deferred detection and intervention, particularly in South Korea, where there is a dearth of qualified professionals and limited public awareness. In this study, we employ advanced deep learning algorithms to enhance early ASD screening through automated video analysis. Utilizing architectures such as Convolutional Long Short-Term Memory (ConvLSTM), Long-term Recurrent Convolutional Network (LRCN), and Convolutional Neural Networks with Gated Recurrent Units (CNN+GRU), we analyze video data from platforms like YouTube and TikTok to identify stereotypic behaviors (arm flapping, head banging, spinning). Our results indicate that the LRCN model exhibited superior performance with 79.61% accuracy on the augmented platform video dataset and 79.37% on the original SSBD dataset. The ConvLSTM and CNN+GRU models also achieved higher accuracy than the original SSBD dataset. Through this research, we underscore AI's potential in early ASD detection by automating the identification of stereotypic behaviors, thereby enabling timely intervention. We also emphasize the significance of utilizing expanded datasets from social media platform videos in augmenting model accuracy and robustness, thus paving the way for more accessible diagnostic methods.

The development of food image detection and recognition model of Korean food for mobile dietary management

  • Park, Seon-Joo;Palvanov, Akmaljon;Lee, Chang-Ho;Jeong, Nanoom;Cho, Young-Im;Lee, Hae-Jeung
    • Nutrition Research and Practice
    • /
    • v.13 no.6
    • /
    • pp.521-528
    • /
    • 2019
  • BACKGROUND/OBJECTIVES: The aim of this study was to develop Korean food image detection and recognition model for use in mobile devices for accurate estimation of dietary intake. MATERIALS/METHODS: We collected food images by taking pictures or by searching web images and built an image dataset for use in training a complex recognition model for Korean food. Augmentation techniques were performed in order to increase the dataset size. The dataset for training contained more than 92,000 images categorized into 23 groups of Korean food. All images were down-sampled to a fixed resolution of $150{\times}150$ and then randomly divided into training and testing groups at a ratio of 3:1, resulting in 69,000 training images and 23,000 test images. We used a Deep Convolutional Neural Network (DCNN) for the complex recognition model and compared the results with those of other networks: AlexNet, GoogLeNet, Very Deep Convolutional Neural Network, VGG and ResNet, for large-scale image recognition. RESULTS: Our complex food recognition model, K-foodNet, had higher test accuracy (91.3%) and faster recognition time (0.4 ms) than those of the other networks. CONCLUSION: The results showed that K-foodNet achieved better performance in detecting and recognizing Korean food compared to other state-of-the-art models.

Comparative Analysis of Machine Learning Techniques for IoT Anomaly Detection Using the NSL-KDD Dataset

  • Zaryn, Good;Waleed, Farag;Xin-Wen, Wu;Soundararajan, Ezekiel;Maria, Balega;Franklin, May;Alicia, Deak
    • International Journal of Computer Science & Network Security
    • /
    • v.23 no.1
    • /
    • pp.46-52
    • /
    • 2023
  • With billions of IoT (Internet of Things) devices populating various emerging applications across the world, detecting anomalies on these devices has become incredibly important. Advanced Intrusion Detection Systems (IDS) are trained to detect abnormal network traffic, and Machine Learning (ML) algorithms are used to create detection models. In this paper, the NSL-KDD dataset was adopted to comparatively study the performance and efficiency of IoT anomaly detection models. The dataset was developed for various research purposes and is especially useful for anomaly detection. This data was used with typical machine learning algorithms including eXtreme Gradient Boosting (XGBoost), Support Vector Machines (SVM), and Deep Convolutional Neural Networks (DCNN) to identify and classify any anomalies present within the IoT applications. Our research results show that the XGBoost algorithm outperformed both the SVM and DCNN algorithms achieving the highest accuracy. In our research, each algorithm was assessed based on accuracy, precision, recall, and F1 score. Furthermore, we obtained interesting results on the execution time taken for each algorithm when running the anomaly detection. Precisely, the XGBoost algorithm was 425.53% faster when compared to the SVM algorithm and 2,075.49% faster than the DCNN algorithm. According to our experimental testing, XGBoost is the most accurate and efficient method.

A Comparative Study on Artificial in Intelligence Model Performance between Image and Video Recognition in the Fire Detection Area (화재 탐지 영역의 이미지와 동영상 인식 사이 인공지능 모델 성능 비교 연구)

  • Jeong Rok Lee;Dae Woong Lee;Sae Hyun Jeong;Sang Jeong
    • Journal of the Society of Disaster Information
    • /
    • v.19 no.4
    • /
    • pp.968-975
    • /
    • 2023
  • Purpose: We would like to confirm that the false positive rate of flames/smoke is high when detecting fires. Propose a method and dataset to recognize and classify fire situations to reduce the false detection rate. Method: Using the video as learning data, the characteristics of the fire situation were extracted and applied to the classification model. For evaluation, the model performance of Yolov8 and Slowfast were compared and analyzed using the fire dataset conducted by the National Information Society Agency (NIA). Result: YOLO's detection performance varies sensitively depending on the influence of the background, and it was unable to properly detect fires even when the fire scale was too large or too small. Since SlowFast learns the time axis of the video, we confirmed that detects fire excellently even in situations where the shape of an atypical object cannot be clearly inferred because the surrounding area is blurry or bright. Conclusion: It was confirmed that the fire detection rate was more appropriate when using a video-based artificial intelligence detection model rather than using image data.

Shanghai Containerised Freight Index Forecasting Based on Deep Learning Methods: Evidence from Chinese Futures Markets

  • Liang Chen;Jiankun Li;Rongyu Pei;Zhenqing Su;Ziyang Liu
    • East Asian Economic Review
    • /
    • v.28 no.3
    • /
    • pp.359-388
    • /
    • 2024
  • With the escalation of global trade, the Chinese commodity futures market has ascended to a pivotal role within the international shipping landscape. The Shanghai Containerized Freight Index (SCFI), a leading indicator of the shipping industry's health, is particularly sensitive to the vicissitudes of the Chinese commodity futures sector. Nevertheless, a significant research gap exists regarding the application of Chinese commodity futures prices as predictive tools for the SCFI. To address this gap, the present study employs a comprehensive dataset spanning daily observations from March 24, 2017, to May 27, 2022, encompassing a total of 29,308 data points. We have crafted an innovative deep learning model that synergistically combines Long Short-Term Memory (LSTM) and Convolutional Neural Network (CNN) architectures. The outcomes show that the CNN-LSTM model does a great job of finding the nonlinear dynamics in the SCFI dataset and accurately capturing its long-term temporal dependencies. The model can handle changes in random sample selection, data frequency, and structural shifts within the dataset. It achieved an impressive R2 of 96.6% and did better than the LSTM and CNN models that were used alone. This research underscores the predictive prowess of the Chinese futures market in influencing the Shipping Cost Index, deepening our understanding of the intricate relationship between the shipping industry and the financial sphere. Furthermore, it broadens the scope of machine learning applications in maritime transportation management, paving the way for SCFI forecasting research. The study's findings offer potent decision-support tools and risk management solutions for logistics enterprises, shipping corporations, and governmental entities.

Empirical Verification of Conversion and Restoration of Preservation Format for Dataset: Application of Dataset with Disaster Safety Information to SIARD (데이터세트 보존포맷 검증방안에 관한 연구: 재난안전정보 데이터세트의 SIARD 적용을 통해)

  • Han, Hui-Jeong;Yoon, Sung-Ho;Oh, Hyo-Jung;Yang, Dongmin
    • Journal of the Korean Society for information Management
    • /
    • v.37 no.2
    • /
    • pp.251-284
    • /
    • 2020
  • As the use of information has emerged as the core of national competitiveness, major developed countries and the Korean government have realized the importance of data. They have pursued technical research and standard establishment for long-term preservation and continuously strived for systematic management and preservation of data. However, although various types of data are specified for the purpose of record management in the law, there is no specific method on how to collect, manage and preserve them, except standard electronic documents. In particular, management and preservation of huge datasets from the administrative information system have been strongly demanded above all. Any guidelines for datasets do not have been properly provided. After the framework for selecting preservation format must be prepared, the system can be supplemented and built. The framework considering the characteristics of the dataset should be specified more concretely, and empirical verification of the conversion and restoration for the dataset preservation format derived according to the selection criteria is necessary. Therefore, this study intends to propose a method for long-term preservation through empirical verification of the preservation format after deriving an evaluation the framework for the preservation format selection criteria considering the characteristics of the dataset.

Verification of Cardiac Electrophysiological Features as a Predictive Indicator of Drug-Induced Torsades de pointes (약물의 염전성 부정맥 유발 예측 지표로서 심장의 전기생리학적 특징 값들의 검증)

  • Yoo, Yedam;Jeong, Da Un;Marcellinus, Aroli;Lim, Ki Moo
    • Journal of Biomedical Engineering Research
    • /
    • v.43 no.1
    • /
    • pp.19-26
    • /
    • 2022
  • The Comprehensive in vitro Proarrhythmic Assay(CiPA) project was launched for solving the hERG assay problem of being classified as high-risk groups even though they are low-risk drugs due to their high sensitivity. CiPA presented a protocol to predict drug toxicity using physiological data calculated based on the in-silico model. in this study, features calculated through the in-silico model are analyzed for correlation of changing action potential in the near future, and features are verified through predictive performance according to drug datasets. Using the O'Hara Rudy model modified by Dutta et al., Pearson correlation analysis was performed between 13 features(dVm/dtmax, APpeak, APresting, APD90, APD50, APDtri, Capeak, Caresting, CaD90, CaD50, CaDtri, qNet, qInward) calculated at 100 pacing, and between dVm/dtmax_repol calculated at 1,000 pacing, and linear regression analysis was performed on each of the 12 training drugs, 16 verification drugs, and 28 drugs. Indicators showing high coefficient of determination(R2) in the training drug dataset were qNet 0.93, AP resting 0.83, APDtri 0.78, Ca resting 0.76, dVm/dtmax 0.63, and APD90 0.61. The indicators showing high determinants in the validated drug dataset were APDtri 0.94, APD90 0.92, APD50 0.85, CaD50 0.84, qNet 0.76, and CaD90 0.64. Indicators with high coefficients of determination for all 28 drugs are qNet 0.78, APD90 0.74, and qInward 0.59. The indicators vary in predictive performance depending on the drug dataset, and qNet showed the same high performance of 0.7 or more on the training drug dataset, the verified drug dataset, and the entire drug dataset.

Time-Series based Dataset Selection Method for Effective Text Classification (효율적인 문헌 분류를 위한 시계열 기반 데이터 집합 선정 기법)

  • Chae, Yeonghun;Jeong, Do-Heon
    • The Journal of the Korea Contents Association
    • /
    • v.17 no.1
    • /
    • pp.39-49
    • /
    • 2017
  • As the Internet technology advances, data on the web is increasing sharply. Many research study about incremental learning for classifying effectively in data increasing. Web document contains the time-series data such as published date. If we reflect time-series data to classification, it will be an effective classification. In this study, we analyze the time-series variation of the words. We propose an efficient classification through dividing the dataset based on the analysis of time-series information. For experiment, we corrected 1 million online news articles including time-series information. We divide the dataset and classify the dataset using SVM and $Na{\ddot{i}}ve$ Bayes. In each model, we show that classification performance is increasing. Through this study, we showed that reflecting time-series information can improve the classification performance.

A Study on the Semiautomatic Construction of Domain-Specific Relation Extraction Datasets from Biomedical Abstracts - Mainly Focusing on a Genic Interaction Dataset in Alzheimer's Disease Domain - (바이오 분야 학술 문헌에서의 분야별 관계 추출 데이터셋 반자동 구축에 관한 연구 - 알츠하이머병 유관 유전자 간 상호 작용 중심으로 -)

  • Choi, Sung-Pil;Yoo, Suk-Jong;Cho, Hyun-Yang
    • Journal of Korean Library and Information Science Society
    • /
    • v.47 no.4
    • /
    • pp.289-307
    • /
    • 2016
  • This paper introduces a software system and process model for constructing domain-specific relation extraction datasets semi-automatically. The system uses a set of terms such as genes, proteins diseases and so forth as inputs and then by exploiting massive biological interaction database, generates a set of term pairs which are utilized as queries for retrieving sentences containing the pairs from scientific databases. To assess the usefulness of the proposed system, this paper applies it into constructing a genic interaction dataset related to Alzheimer's disease domain, which extracts 3,510 interaction-related sentences by using 140 gene names in the area. In conclusion, the resulting outputs of the case study performed in this paper indicate the fact that the system and process could highly boost the efficiency of the dataset construction in various subfields of biomedical research.