• Title/Summary/Keyword: Preprocessed data

Comparative Study of Anomaly Detection Accuracy of Intrusion Detection Systems Based on Various Data Preprocessing Techniques (다양한 데이터 전처리 기법 기반 침입탐지 시스템의 이상탐지 정확도 비교 연구)

  • Park, Kyungseon;Kim, Kangseok
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.10 no.11
    • /
    • pp.449-456
    • /
    • 2021
  • An intrusion detection system is a technology that detects abnormal behaviors and operations that violate security and prevents attacks on systems. Existing intrusion detection systems have been designed using statistical analysis or anomaly detection techniques for traffic patterns, but rapidly evolving technologies cause modern systems to generate traffic that differs from that of earlier systems, so the existing methods have limitations. To overcome these limitations, research on intrusion detection methods that apply various machine learning techniques is being actively conducted. In this study, data preprocessing techniques that can improve the accuracy of anomaly detection were compared using NGIDS-DS (Next Generation IDS Dataset), which was generated by simulation equipment for traffic in various network environments. Padding and sliding windows were used for data preprocessing, and an oversampling technique based on an Adversarial Auto-Encoder (AAE) was applied to address the imbalance between the proportions of normal and abnormal data. In addition, an improvement in detection accuracy was confirmed when Skip-gram, one of the Word2Vec techniques, was used to extract feature vectors from the preprocessed sequence data. PCA-SVM and GRU were used as models for the comparative experiments, and the experimental results showed better performance when sliding window, Skip-gram, AAE, and GRU were applied.
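
The padding and sliding-window preprocessing named in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the window size, stride, padding token `PAD`, and the toy sequence `calls` are all assumptions, since the abstract does not state them.

```python
from typing import List, Sequence

PAD = 0  # hypothetical padding token id; the abstract does not specify one


def pad_sequence(seq: Sequence[int], length: int, pad: int = PAD) -> List[int]:
    """Truncate or right-pad a sequence to exactly `length` items."""
    clipped = list(seq[:length])
    return clipped + [pad] * (length - len(clipped))


def sliding_windows(seq: Sequence[int], size: int, stride: int = 1,
                    pad: int = PAD) -> List[List[int]]:
    """Cut a sequence into overlapping fixed-size windows.

    A sequence no longer than `size` yields one padded window.
    With stride > 1, trailing items that do not fill a window are dropped.
    """
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    if len(seq) <= size:
        return [pad_sequence(seq, size, pad)]
    return [list(seq[i:i + size])
            for i in range(0, len(seq) - size + 1, stride)]


calls = [3, 7, 7, 2, 9, 4]              # toy event ids (assumed data)
windows = sliding_windows(calls, size=4)  # three overlapping windows
padded = pad_sequence(calls, 8)           # one fixed-length padded sample
```

Padding turns each trace into a single fixed-length sample, whereas the sliding window produces several shorter samples per trace, which is one plausible reason the two could behave differently in a comparison like the one reported.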

A Study on Market Size Estimation Method by Product Group Using Word2Vec Algorithm (Word2Vec을 활용한 제품군별 시장규모 추정 방법에 관한 연구)

  • Jung, Ye Lim;Kim, Ji Hui;Yoo, Hyoung Sun
    • Journal of Intelligence and Information Systems
    • /
    • v.26 no.1
    • /
    • pp.1-21
    • /
    • 2020
  • With the rapid development of artificial intelligence technology, various techniques have been developed to extract meaningful information from unstructured text data, which constitutes a large portion of big data. Over the past decades, text mining technologies have been utilized in various industries for practical applications. In the field of business intelligence, text mining has been employed to discover new market and/or technology opportunities and to support rational decision making by business participants. Market information such as market size, market growth rate, and market share is essential for setting companies' business strategies. There has been continuous demand in various fields for market information at the level of specific products. However, this information has generally been provided at the industry level or for broad categories based on classification standards, making it difficult to obtain specific and relevant information. In this regard, we propose a new methodology that can estimate the market sizes of product groups at more detailed levels than those previously offered. We applied the Word2Vec algorithm, a neural network based semantic word embedding model, to enable automatic market size estimation from individual companies' product information in a bottom-up manner. The overall process is as follows. First, data related to product information are collected, refined, and restructured into a form suitable for applying the Word2Vec model. Next, the preprocessed data are embedded into a vector space by Word2Vec, and product groups are derived by extracting similar product names based on cosine similarity. Finally, the sales data for the extracted products are summed to estimate the market size of each product group. As experimental data, product names from Statistics Korea's microdata (345,103 cases) were mapped into a multidimensional vector space by Word2Vec training. We performed parameter optimization for training and then applied a vector dimension of 300 and a window size of 15 as the optimized parameters in further experiments. We employed the index words of the Korean Standard Industry Classification (KSIC) as a product name dataset to cluster product groups more efficiently. Product names similar to the KSIC index words were extracted based on cosine similarity. The market size of the extracted products, treated as one product category, was calculated from individual companies' sales data. The market sizes of 11,654 specific product lines were automatically estimated by the proposed model. For performance verification, the results were compared with the actual market sizes of some items, and the Pearson correlation coefficient was 0.513. Our approach has several advantages over previous studies. First, text mining and machine learning techniques were applied to market size estimation for the first time, overcoming the limitations of traditional methods that rely on sampling or require multiple assumptions. In addition, the level of the market category can be easily and efficiently adjusted according to the purpose of the information by changing the cosine similarity threshold. Furthermore, the approach has high potential for practical application since it can meet unmet needs for detailed market size information in the public and private sectors. Specifically, it can be utilized in technology evaluation and technology commercialization support programs conducted by governmental institutions, as well as in business strategy consulting and market analysis report publishing by private firms. The limitation of our study is that the presented model needs to be improved in terms of accuracy and reliability. The semantic word embedding module can be advanced by imposing a proper order on the preprocessed dataset or by combining another measure, such as Jaccard similarity, with Word2Vec. Also, the product group clustering method can be replaced with other types of unsupervised machine learning algorithms. Our group is currently working on follow-up studies, and we expect that they can further improve the performance of the basic model conceptually proposed in this study.
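
The last two steps of the described process (grouping product names by cosine similarity to an index word, then summing their sales) could look roughly like the sketch below. The two-dimensional vectors, product names, sales figures, and the function name `estimate_market_size` are illustrative assumptions only; the study itself used 300-dimensional Word2Vec embeddings trained on Statistics Korea microdata.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def estimate_market_size(index_vec: Sequence[float],
                         product_vecs: Dict[str, Sequence[float]],
                         sales: Dict[str, float],
                         threshold: float) -> Tuple[List[str], float]:
    """Return the product names at or above the similarity threshold
    and the sum of their sales."""
    members = [name for name, vec in product_vecs.items()
               if cosine(index_vec, vec) >= threshold]
    return members, sum(sales.get(name, 0.0) for name in members)


# Toy embeddings and sales figures (assumed, for illustration only).
index_word = [1.0, 0.0]
product_vecs = {
    "led lamp": [0.9, 0.1],
    "led bulb": [0.8, 0.3],
    "rice cooker": [0.1, 0.9],
}
sales = {"led lamp": 120.0, "led bulb": 80.0, "rice cooker": 300.0}

broad_group, broad_size = estimate_market_size(index_word, product_vecs, sales, 0.90)
narrow_group, narrow_size = estimate_market_size(index_word, product_vecs, sales, 0.95)
```

Raising the threshold from 0.90 to 0.95 narrows the group from two products to one, which mirrors the abstract's point that the granularity of the market category can be tuned through the cosine similarity threshold.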

Analysis of Global Entrepreneurship Trends Due to COVID-19: Focusing on Crunchbase (Covid-19에 따른 글로벌 창업 트렌드 분석: Crunchbase를 중심으로)

  • Shinho Kim;Youngjung Geum
    • Asia-Pacific Journal of Business Venturing and Entrepreneurship
    • /
    • v.18 no.3
    • /
    • pp.141-156
    • /
    • 2023
  • Due to the unprecedented worldwide Covid-19 pandemic, the business trends of companies have changed significantly. It is therefore necessary to monitor rapid changes in innovation trends in order to design and plan future businesses. Since the pandemic, many studies have attempted to analyze business changes, but they are limited to specific industries and are insufficient in terms of data objectivity. In response, this study aims to analyze business trends after Covid-19 using Crunchbase, a global startup database. The data is collected and preprocessed every two years from 2018 to 2021 to compare the business trends. To capture the major trends, a network analysis based on co-occurrence is conducted for the industry groups and industry information. To analyze the minor trends, LDA-based topic modelling and word2vec-based clustering are used. The results indicate that the e-commerce, education, delivery, game, and entertainment industries are promising because of their technological advances, showing extension and diversification of industry boundaries as well as digitalization and servitization of business content. This study is expected to help venture capitalists and entrepreneurs understand the rapid changes under the impact of Covid-19 and make the right decisions for the future.
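
A co-occurrence network of the kind mentioned here is typically built by counting how often two industry labels appear on the same company record. The sketch below shows that counting step only; the record layout and the label values are assumptions, not the actual Crunchbase schema.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, Sequence


def cooccurrence_edges(records: Iterable[Sequence[str]]) -> Counter:
    """Count unordered label pairs that appear together in a record.

    Each key is an alphabetically sorted (label_a, label_b) tuple and
    each value is the number of records containing both labels.
    """
    edges: Counter = Counter()
    for labels in records:
        for pair in combinations(sorted(set(labels)), 2):
            edges[pair] += 1
    return edges


# Toy company records, each listing its industry labels (assumed data).
records = [
    ["E-Commerce", "Delivery"],
    ["Education", "Software"],
    ["Delivery", "E-Commerce", "Software"],
]
edges = cooccurrence_edges(records)
strongest = edges.most_common(1)[0]
```

The resulting pair counts can serve as edge weights of a network whose nodes are industry labels; comparing such networks across periods is one way to see which combinations gained weight.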

Support Vector Machine Based Arrhythmia Classification Using Reduced Features

  • Song, Mi-Hye;Lee, Jeon;Cho, Sung-Pil;Lee, Kyoung-Joung;Yoo, Sun-Kook
    • International Journal of Control, Automation, and Systems
    • /
    • v.3 no.4
    • /
    • pp.571-579
    • /
    • 2005
  • In this paper, we propose an algorithm for arrhythmia classification that combines feature dimension reduction by linear discriminant analysis (LDA) with a support vector machine (SVM) based classifier. Seventeen original input features were extracted from preprocessed signals by wavelet transform, and LDA was then used to reduce these to 4 features, each a linear combination of the original features. The SVM classifier performed better with the LDA-reduced features than with features reduced by principal component analysis (PCA), and even better than with the original features. In a cross-validation procedure, this SVM classifier was compared with Multilayer Perceptron (MLP) and Fuzzy Inference System (FIS) classifiers. When all classifiers used the same reduced features, the overall performance of the SVM classifier was superior to that of the others. In particular, the accuracies of discrimination of normal sinus rhythm (NSR), atrial premature contraction (APC), supraventricular tachycardia (SVT), premature ventricular contraction (PVC), ventricular tachycardia (VT), and ventricular fibrillation (VF) were 99.307%, 99.274%, 99.854%, 98.344%, 99.441%, and 99.883%, respectively. Even with less training data, the SVM classifier offered better performance than the MLP classifier.

A Study on the Research Trends in the Area of Geospatial-Information Using Text-mining Technique Focused on National R&D Reports and Theses (텍스트마이닝 기술을 이용한 공간정보 분야의 연구 동향에 관한 고찰 -국가연구개발사업 보고서 및 논문을 중심으로-)

  • Lim, Si Yeong;Yi, Mi Sook;Jin, Gi Ho;Shin, Dong Bin
    • Spatial Information Research
    • /
    • v.22 no.4
    • /
    • pp.11-20
    • /
    • 2014
  • This study aims to provide information about research trends in the area of geospatial information using text-mining methods. We collected national R&D reports and papers from the NDSL (National Discovery for Science Leaders) site, preprocessed their keywords, and classified them into distinct sectors. We then investigated the appearance rates of the keywords and their changes over time for the R&D reports and papers. As a result, we confirmed that research concerning applications is increasing, while research dealing with systems is decreasing. In particular, among the keywords, '3D-GIS', 'sensor', and 'service' (except ITS) are emerging. These results could help in identifying future research topics.

Evaluation of Firmness and Sweetness Index of Tomatoes using Hyperspectral Imaging

  • Rahman, Anisur;Faqeerzada, Mohammad Akbar;Joshi, Rahul;Cho, Byoung-Kwan
    • Proceedings of the Korean Society for Agricultural Machinery Conference
    • /
    • 2017.04a
    • /
    • pp.44-44
    • /
    • 2017
  • The objective of this study was to evaluate the firmness and sweetness index (SI) of tomatoes (Lycopersicum esculentum) using hyperspectral imaging (HSI) in the range of 1000-1400 nm. The mean spectra of 95 mature tomato samples were extracted from the hyperspectral images, and the reference firmness and sweetness index of the same samples were measured and calibrated against their corresponding spectral data by partial least squares (PLS) regression with different preprocessing methods. The results showed that the PLS regression model based on Savitzky-Golay (S-G) second-derivative preprocessed spectra performed better for the firmness and SI of tomatoes than models developed with other preprocessing methods, with correlation coefficients ($r_{pred}$) of 0.82 and 0.74 and standard errors of prediction (SEP) of 0.86 N and 0.63, respectively. Then, the feature wavelengths were identified from the PLS regression analyses using a model-based variable selection method, variable importance in projection (VIP), and finally chemical images were derived by applying the respective regression coefficients to the spectral image in a pixel-wise manner. The resulting chemical images provided detailed information on the firmness and sweetness index (SI) of tomatoes. Therefore, this research demonstrated that the HSI technique has potential for rapid and non-destructive evaluation of the firmness and sweetness index of tomatoes.
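
To make the Savitzky-Golay second-derivative preprocessing concrete, here is a small sketch using the standard 5-point convolution weights for a quadratic fit, (2, -1, -2, -1, 2) / 7. The window length and polynomial order are assumptions; the abstract does not report the settings that were actually used.

```python
from typing import List, Sequence

# Savitzky-Golay weights for the second derivative with a 5-point
# window and a quadratic (or cubic) polynomial fit.
SG_D2_WEIGHTS = (2.0, -1.0, -2.0, -1.0, 2.0)
SG_D2_NORM = 7.0


def sg_second_derivative(y: Sequence[float], spacing: float = 1.0) -> List[float]:
    """Second derivative of an evenly sampled spectrum.

    Only interior points are returned, so the output is 4 samples
    shorter than the input. `spacing` is the sampling interval
    (for example, the wavelength step in nm).
    """
    width = len(SG_D2_WEIGHTS)
    half = width // 2
    if len(y) < width:
        raise ValueError("need at least 5 points")
    out = []
    for i in range(half, len(y) - half):
        acc = sum(w * y[i + k - half] for k, w in enumerate(SG_D2_WEIGHTS))
        out.append(acc / (SG_D2_NORM * spacing ** 2))
    return out


# A linear baseline has zero second derivative, so it is removed.
baseline = [3.0 * i + 5.0 for i in range(7)]
flat = sg_second_derivative(baseline)

# A pure parabola y = x^2 has a constant second derivative of 2.
parabola = [float(i * i) for i in range(7)]
curvature = sg_second_derivative(parabola)
```

The baseline example hints at why derivative preprocessing is common for spectra: constant offsets and linear slopes vanish, while curvature that carries band information is kept.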

Study on Rapid Measurement of Wood Powder Concentration of Wood-Plastic Composites using FT-NIR and FT-IR Spectroscopy Techniques

  • Cho, Byoung-kwan;Lohoumi, Santosh;Choi, Chul;Yang, Seong-min;Kang, Seog-goo
    • Journal of the Korean Wood Science and Technology
    • /
    • v.44 no.6
    • /
    • pp.852-863
    • /
    • 2016
  • Wood-plastic composite (WPC) is a promising and sustainable material that combines wood and plastic along with some binding (adhesive) materials. In comparison to pure wood, WPCs generally have the advantages of cost effectiveness, high durability, moisture resistance, and microbial resistance. The properties of WPCs come directly from the concentrations of the different components in the composite; for example, the wood flour concentration directly affects the mechanical and physical properties of WPCs. In this study, the wood powder concentration in WPC was determined by Fourier transform near-infrared (FT-NIR) and Fourier transform infrared (FT-IR) spectroscopy. Reflectance spectra from WPC in both powdered and tableted form with five different concentrations of wood powder were collected and preprocessed to remove noise caused by several factors. To correlate the collected spectra with wood powder concentration, the multivariate calibration method of partial least squares (PLS) was applied. During validation with an independent set of samples, good correlations with reference values were demonstrated for both the FT-NIR and FT-IR data sets. In addition, tableted WPC yielded a higher coefficient of determination ($R^2_p$) and a lower standard error of prediction (SEP) than powdered WPC. The combination of the FT-NIR and FT-IR spectral regions was also studied. The results showed that using both regions improved the determination accuracy for powdered WPC; however, no improvement in prediction was achieved for tableted WPC. The results suggest that these spectroscopic techniques are useful tools for fast and nondestructive determination of wood concentration in WPCs and have the potential to replace conventional methods.

INLINE NEAR INFRARED (NIR) SPECTROSCOPY FOR PROCESS CONTROL IN POLYMER EXTRUSION

  • Rohe, Thomas;Koelle, Sabine;Becker, Wolfgang;Eisenreich, Norbert;Eyerer, Peter
    • Proceedings of the Korean Society of Near Infrared Spectroscopy Conference
    • /
    • 2001.06a
    • /
    • pp.1082-1082
    • /
    • 2001
  • Extrusion is one of the most important processes in the polymer industry. Characterizing the polymer melt during processing will improve this process noticeably. One possibility for characterizing the polymer melt as it is processed is inline near infrared (NIR) spectroscopy. With this method several polymer properties can be observed during processing, e.g. composition, moisture, or mechanical properties of the melt. For this purpose, probes for transmission and reflection measurements have been developed that withstand the high temperatures and pressures occurring during the extrusion process (tested up to 300 °C and 10 MPa). For the transmission system an optical bypass was developed to eliminate disturbing spectral influences and hence increase the long-term stability, which is the prerequisite for an industrial application. Measurements in transmission and reflection produced comparable results for blending processes, where the prediction error was less than 1%. An optimum RMSEP of only 0.24% was found for preprocessed polymer blends measured in transmission on a laboratory extruder. A transflection measurement allowed, for the first time, the recording of relevant NIR spectra in the screw area of an extruder. The application to a (PE+PP) blending process delivered promising results. This new measurement mode allows the observation of the ongoing processes within the screw area, which is of maximum interest for reactive extrusion processes. For economic reasons, calibration transfer between different extrusion systems is also of high importance. Investigations on simulated and real-world spectra showed that a calibration transfer is possible. A new method, an alternative to the well-known direct standardization procedures, was developed based on automatic data pretreatment. This procedure delivers comparable results for the calibration transfer. Overall, this paper presents concepts, components, and algorithms for inline near infrared (NIR) spectroscopy in polymer extrusion that allow its use in a real industrial extrusion process.

A Study on Detection of Malicious Android Apps based on LSTM and Information Gain (LSTM 및 정보이득 기반의 악성 안드로이드 앱 탐지연구)

  • Ahn, Yulim;Hong, Seungah;Kim, Jiyeon;Choi, Eunjung
    • Journal of Korea Multimedia Society
    • /
    • v.23 no.5
    • /
    • pp.641-649
    • /
    • 2020
  • As the use of mobile devices increases sharply, malicious mobile apps (applications) that target mobile users are also increasing. It is challenging to detect these malicious apps using traditional malware detection techniques because today's attack mechanisms are increasingly sophisticated. Deep learning (DL) is an alternative to traditional signature-based and rule-based anomaly detection techniques and has thus been actively used in numerous recent studies on malware detection. In order to develop DL-based defense mechanisms against intelligent malicious apps, feeding recent datasets into DL models is important. In this paper, we develop a DL-based model for detecting intelligent malicious apps using KU-CISC 2018-Android, the most up-to-date dataset consisting of benign and malicious Android apps. This dataset has rarely been used in other studies so far. We extract opcode sequences from the Android apps and preprocess them using an N-gram model. We then feed the preprocessed data into an LSTM and apply the concept of Information Gain to improve the performance of malicious app detection. Furthermore, we evaluate our model under numerous scenarios in order to verify the model's design and performance.
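
The two preprocessing ideas named in this abstract, opcode N-grams and Information Gain, can be illustrated with the sketch below. It scores a single bigram by how much knowing its presence reduces label entropy. The opcode names, the four toy apps, and the use of a presence/absence feature are assumptions; the abstract does not say how Information Gain is combined with the LSTM.

```python
import math
from typing import List, Sequence, Tuple


def ngrams(opcodes: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """All contiguous n-grams of an opcode sequence, in order."""
    return [tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)]


def entropy(labels: List[int]) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    h = 0.0
    for value in set(labels):
        p = labels.count(value) / total
        h -= p * math.log2(p)
    return h


def information_gain(has_feature: Sequence[bool], labels: Sequence[int]) -> float:
    """Entropy of the labels minus their entropy conditioned on a binary feature."""
    with_f = [y for f, y in zip(has_feature, labels) if f]
    without_f = [y for f, y in zip(has_feature, labels) if not f]
    n = len(labels)
    conditional = (len(with_f) / n) * entropy(with_f) \
        + (len(without_f) / n) * entropy(without_f)
    return entropy(list(labels)) - conditional


# Toy apps: opcode sequences with labels 1 = malicious, 0 = benign (assumed data).
apps = [
    ["const", "invoke", "move", "return"],
    ["invoke", "move", "goto"],
    ["const", "add", "return"],
    ["move", "add", "return"],
]
labels = [1, 1, 0, 0]


def gain_for(bigram: Tuple[str, str]) -> float:
    present = [bigram in ngrams(seq, 2) for seq in apps]
    return information_gain(present, labels)


separating_gain = gain_for(("invoke", "move"))  # occurs only in malicious apps
weak_gain = gain_for(("const", "invoke"))       # occurs in one malicious app
```

Ranking N-grams by a score like this and keeping only the top ones is a common way to shrink the vocabulary before training a sequence model.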

Automatic Leather Quality Inspection and Grading System by Leather Texture Analysis (텍스쳐 분석에 의한 피혁 등급 판정 및 자동 선별시스템에의 응용)

  • 권장우;김명재;길경석
    • Journal of Korea Multimedia Society
    • /
    • v.7 no.4
    • /
    • pp.451-458
    • /
    • 2004
  • Leather quality inspection by the naked eye is known to be unreliable because of biological characteristics such as accumulated fatigue caused by optical illusions and biological phenomena. Therefore, it is necessary to automate leather quality inspection with computer vision techniques. In this paper, we present an automatic leather quality classification system that obtains information from the leather surface. Leather is usually graded by information such as texture density and the types and distribution of defects. The presented algorithm explains how we analyze leather information such as texture density and defects from gray-level images obtained by a digital camera. The density data are computed from the ratio of the distribution area, width, and height of the Fourier spectrum magnitude. The defect information of the leather surface can be obtained from the histogram distribution of pixels in windows taken from the preprocessed images. The information for the entire leather could serve as a standard for grading leather quality. The proposed leather inspection system using machine vision can also be applied in other fields to replace inspection by the human eye.
