• Title/Summary/Keyword: Preprocessing Process

A study on rethinking EDA in digital transformation era (DX 전환 환경에서 EDA에 대한 재고찰)

  • Seoung-gon Ko
    • The Korean Journal of Applied Statistics
    • /
    • v.37 no.1
    • /
    • pp.87-102
    • /
    • 2024
  • Digital transformation refers to the process by which a company or organization changes or innovates its existing business model or sales activities using digital technology. It requires the use of various digital technologies - cloud computing, IoT, artificial intelligence, etc. - to strengthen competitiveness in the market, improve the customer experience, and discover new businesses. In addition, to derive knowledge and insight about the market, customers, and production environment, it is necessary to select the right data, preprocess it into an analyzable state, and establish a systematic analysis process suited to the purpose. The usefulness of such digital data depends on proper pre-processing and on the correct application of exploratory data analysis (EDA), which is valuable for exploring information and hypotheses and for visualizing knowledge and insights. In this paper, we reexamine the philosophy and basic concepts of EDA and, for effective visualization, discuss key visualization information, information expression methods based on the grammar of graphics, and the ACCENT principle, a final review standard for visualizations.

Prediction of Composition Ratio of DNA Solution from Measurement Data with White Noise Using Neural Network (잡음이 포함된 측정 자료에 대한 신경망의 DNA 용액 조성비 예측)

  • Gyeonghee Kang;Minji Kim;Hyomin Lee
    • Korean Chemical Engineering Research
    • /
    • v.62 no.1
    • /
    • pp.118-124
    • /
    • 2024
  • Neural networks are used for de-noising preprocessing of electrocardiogram signals, retinal images, seismic waves, etc. However, the de-noising step can increase computational time and distort the original signals. In this study, we investigated a neural network architecture that analyzes measurement data without an additional de-noising step. From the dynamical behavior of DNA in aqueous solution, our neural network model aimed to predict the mole fraction of each DNA species in the solution. By artificially adding white noise to the DNA dynamics data, we investigated the effect of the noise on the network's predictions. As a result, our model was able to predict the DNA mole fraction with an error of O(0.01) when the signal-to-noise ratio was O(1). This work can be applied as an efficient artificial intelligence methodology for analyzing DNA related to genetic diseases or cancer cells, which would otherwise be sensitive to background measurement noise.
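
The noise-injection setup the abstract describes, adding white noise at a chosen signal-to-noise ratio, can be sketched as below. The sine signal is a hypothetical stand-in; the paper's DNA dynamics data and network are not reproduced.

```python
import math
import random

def add_white_noise(signal, snr, seed=0):
    """Add Gaussian white noise so that mean signal power / noise power ~= snr."""
    rng = random.Random(seed)
    power = sum(x * x for x in signal) / len(signal)   # mean signal power
    sigma = math.sqrt(power / snr)                     # noise std for the target SNR
    return [x + rng.gauss(0.0, sigma) for x in signal]

# Hypothetical stand-in for a measured dynamics signal.
clean = [math.sin(0.1 * t) for t in range(200)]
noisy = add_white_noise(clean, snr=1.0)   # SNR of O(1), as in the abstract
```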

Analysis on Domestic Franchise Food Tech Interest by using Big Data

  • Hyun Seok Kim;Yang-Ja Bae;Munyeong Yun;Gi-Hwan Ryu
    • International Journal of Internet, Broadcasting and Communication
    • /
    • v.16 no.2
    • /
    • pp.179-184
    • /
    • 2024
  • Franchises are now a red ocean in the food industry and must find other ways to make their products appealing; food tech is the rising answer. Franchisors are investing in R&D to help franchisees with their operations. In this paper, we analyze franchise interest in food tech to establish the need for development on behalf of franchisees who need a hand, not of humans, but of technology. Using Textom, a big data analysis tool, "franchise" and "food tech" were selected as keywords, and search frequency information from Naver and Daum was collected for one year, from 1 January 2023 to 31 December 2023, on which data preprocessing was conducted. For the suitability of the study and more accurate data, entries unrelated to "food tech" were removed in the refining process, and similar keywords were grouped into the same keyword for analysis. The word refining process yielded a total of 10,049 words, from which the top 50 keywords with the highest relevance and search frequency were selected for this study. These top 50 keywords were subjected to TF-IDF analysis, visualization analysis using the Ucinet6 and NetDraw programs, network analysis between keywords, and cluster analysis between keywords through Concor analysis. The big data analysis showed that franchises do have an interest in food tech: "technology", "franchise", and "robots" drew much interest, and the keyword "R&D" showed that franchises are keen on developing food tech to seize competitiveness in the franchise industry.
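
The TF-IDF step applied to the refined keywords can be sketched as follows. The keyword documents here are hypothetical stand-ins, not the Textom data.

```python
import math
from collections import Counter

# Hypothetical keyword documents standing in for the refined keyword data.
docs = [
    "franchise food tech robot".split(),
    "franchise R&D technology".split(),
    "food tech robot kiosk".split(),
]

def tf_idf(docs):
    """Per-document TF-IDF: term frequency times log inverse document frequency."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [{t: (c / len(doc)) * math.log(n / df[t])
             for t, c in Counter(doc).items()}
            for doc in docs]

scores = tf_idf(docs)
```

Terms concentrated in few documents (here "kiosk") score higher than terms spread across the collection.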

Research on convergence data pre-processing technology for indoor positioning - based on crowdsourcing - (실내 측위를 위한 융합데이터 전처리기술 연구 - 크라우드 소싱 기반 -)

  • Seungyeob Lee;Byunghoon Jeon
    • Journal of Platform Technology
    • /
    • v.11 no.5
    • /
    • pp.97-103
    • /
    • 2023
  • Unlike GPS, the outdoor positioning technology used universally and uniformly all over the world, indoor positioning is a field in which various technologies are still being developed. To acquire accurate indoor location information, a standard, representative indoor positioning technology is required. Recently, indoor positioning has been expanding into the Real Time Location Service (RTLS) area based on high-precision location data, and new types of indoor positioning technology are being proposed accordingly. Thanks to advances in artificial intelligence, AI-based indoor positioning using the wireless signal data of smartphones is developing rapidly. However, in the process of collecting the data needed for AI training, distorted or otherwise unsuitable data may be included, lowering indoor positioning accuracy. In this study, we propose a data preprocessing technology for AI training that obtains improved indoor positioning results by refining the collected data.

Prediction on the amount of river water use using support vector machine with time series decomposition (TDSVM을 이용한 하천수 취수량 예측)

  • Choi, Seo Hye;Kwon, Hyun-Han;Park, Moonhyung
    • Journal of Korea Water Resources Association
    • /
    • v.52 no.12
    • /
    • pp.1075-1086
    • /
    • 2019
  • Recently, as climate warming and abnormal weather events increase, forecasting hydrological factors such as precipitation and river flow is becoming more complicated, and the risk of water shortage is also increasing. Therefore, this study aims to develop a model for mid-term prediction of the amount of water intake. To this end, the correlation between water intake and meteorological factors, including temperature and precipitation, was used to select input factors. In addition, the water intake showed a clear increasing trend over time and clear seasonal characteristics. Thus, preprocessing was performed using a time series decomposition method, and a support vector machine (SVM) was applied to the residual to develop the river intake prediction model. This model has an average error of 4.1%, a higher accuracy than an SVM model without preprocessing, and is particularly advantageous for mid-term prediction one to two months ahead. The water intake forecasting model developed in this study is expected to be useful for water allocation computation in permitting river water use, water quality management, and drought measures for the sustainable and efficient management of water resources.
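
The time series decomposition used as preprocessing can be sketched as an additive trend/seasonal/residual split. This is a simplified centered-moving-average version with synthetic data; the paper's actual decomposition and the SVM fit on the residual are not reproduced.

```python
def decompose(series, period):
    """Additive decomposition: centered moving-average trend, seasonal means, residual."""
    n = len(series)
    half = period // 2
    trend = [None] * n
    for i in range(half, n - half):
        window = series[i - half:i + half + 1]
        trend[i] = sum(window) / len(window)
    detrended = [series[i] - trend[i] for i in range(n) if trend[i] is not None]
    seasonal = [0.0] * period
    counts = [0] * period
    for j, v in enumerate(detrended):
        k = (j + half) % period            # align back to the series index mod period
        seasonal[k] += v
        counts[k] += 1
    seasonal = [s / c if c else 0.0 for s, c in zip(seasonal, counts)]
    residual = [series[i] - trend[i] - seasonal[i % period]
                for i in range(n) if trend[i] is not None]
    return trend, seasonal, residual

# Synthetic series: linear trend plus a zero-mean seasonal pattern of period 5.
pattern = [2.0, -1.0, 0.0, -1.0, 0.0]
series = [0.1 * i + pattern[i % 5] for i in range(40)]
trend, seasonal, residual = decompose(series, period=5)
```

An SVM regressor (e.g. scikit-learn's SVR) would then be trained on the residual component.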

A Research on Network Intrusion Detection based on Discrete Preprocessing Method and Convolution Neural Network (이산화 전처리 방식 및 컨볼루션 신경망을 활용한 네트워크 침입 탐지에 대한 연구)

  • Yoo, JiHoon;Min, Byeongjun;Kim, Sangsoo;Shin, Dongil;Shin, Dongkyoo
    • Journal of Internet Computing and Services
    • /
    • v.22 no.2
    • /
    • pp.29-39
    • /
    • 2021
  • As damage to individuals, the private sector, and businesses increases due to newly emerging cyber attacks, the underlying network security problem has become a major issue in computer systems. Therefore, NIDS using machine learning and deep learning are being studied to overcome the limitations of existing Network Intrusion Detection Systems. In this study, a deep learning-based NIDS model is developed using the Convolutional Neural Network (CNN) algorithm. For image classification-based CNN learning, a discretization algorithm for continuous variables was added to the previously used preprocessing stage, and the predictor variables were expressed in a linear relationship and converted into easy-to-interpret data. Finally, each network packet processed through the above steps is mapped to a square matrix structure and converted into a pixel image. For the performance evaluation of the proposed model, NSL-KDD, a representative network packet data set, was used, with accuracy, precision, recall, and F1-score as performance indicators. In the experiment, the proposed model showed the highest performance with an accuracy of 85%, and the harmonic mean (F1-score) of the R2L class, which has a small number of training samples, was 71%, a very good result compared to other models.
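
The final preprocessing step, mapping a processed packet record onto a square matrix of pixel values, might look like this minimal min-max sketch. The feature vector and image side are hypothetical, not actual NSL-KDD fields.

```python
def to_pixel_image(features, side):
    """Min-max discretize features to 0..255 and zero-pad into a side x side matrix."""
    lo, hi = min(features), max(features)
    span = (hi - lo) or 1.0
    pixels = [int(255 * (x - lo) / span) for x in features]
    pixels += [0] * (side * side - len(pixels))        # zero-pad to fill the square
    return [pixels[r * side:(r + 1) * side] for r in range(side)]

# Hypothetical 5-feature packet record mapped onto a 3x3 pixel image.
img = to_pixel_image([0.0, 0.5, 1.0, 0.25, 0.75], side=3)
```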

A Method for Determining Face Recognition Suitability of Face Image (얼굴영상의 얼굴인식 적합성 판정 방법)

  • Lee, Seung Ho
    • Journal of the Korea Academia-Industrial cooperation Society
    • /
    • v.19 no.11
    • /
    • pp.295-302
    • /
    • 2018
  • Face recognition (FR) has been widely used in various applications, such as smart surveillance systems, immigration control in airports, user authentication in smart devices, and so on. FR in well-controlled conditions has been extensively studied and is relatively mature. However, in unconstrained conditions, FR performance could degrade due to undesired characteristics of the input face image (such as irregular facial pose variations). To overcome this problem, this paper proposes a new method for determining if an input image is suitable for FR. In the proposed method, for an input face image, reconstruction error is computed by using a predefined set of reference face images. Then, suitability can be determined by comparing the reconstruction error with a threshold value. In order to reduce the effect of illumination changes on the determination of suitability, a preprocessing algorithm is applied to the input and reference face images before the reconstruction. Experimental results show that the proposed method is able to accurately discriminate non-frontal and/or incorrectly aligned face images from correctly aligned frontal face images. In addition, only 3 ms is required to process a face image of 64×64 pixels, which further demonstrates the efficiency of the proposed method.
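
A much-simplified version of the suitability test could look like this, taking the distance to the best-matching reference image as the reconstruction error and comparing it to a threshold. The paper's actual reconstruction from the reference set is more elaborate, and the tiny 3-dimensional "images" here are purely illustrative.

```python
import math

def is_suitable(face, references, threshold):
    """Simplified suitability test: reconstruction error taken as the distance
    to the best-matching reference image, compared against a threshold."""
    err = min(math.dist(face, ref) for ref in references)
    return err <= threshold, err

# Hypothetical references and a near-frontal input close to the second reference.
refs = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
ok, err = is_suitable([0.9, 0.1, 0.0], refs, threshold=0.5)
```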

An Improved Skyline Query Scheme for Recommending Real-Time User Preference Data Based on Big Data Preprocessing (빅데이터 전처리 기반의 실시간 사용자 선호 데이터 추천을 위한 개선된 스카이라인 질의 기법)

  • Kim, JiHyun;Kim, Jongwan
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.11 no.5
    • /
    • pp.189-196
    • /
    • 2022
  • A skyline query is a scheme for finding objects that suit user preferences based on multiple attributes of the objects. Existing skyline queries return search results as a batch process, but the need for real-time results has grown with the advent of interactive apps and mobile environments. The online algorithm for skyline queries improves the speed at which objects are returned, so that preferred objects can be explored in real time, but its navigation process wastes time on repeated comparison operations. This paper proposes a Pre-processing Online Algorithm for Skyline Query (POA) that eliminates this unnecessary search time and provides skyline query results in real time. The proposed technique applies the concept of range limiting to the existing online algorithm as a preprocessing step, first eliminating the regions that would otherwise be searched repeatedly. POA showed improvement over the online algorithm on discrete data sets with standard distributions, biased distributions, positive correlations, and negative correlations. By minimizing the comparison targets of the online algorithm, POA improves navigation performance and can serve as a new criterion for rapid service to users as mobile device use continues to grow.
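
The dominance test underlying a skyline query can be sketched as follows. This is a plain batch version with hypothetical (price, distance) objects; the POA's range-limiting preprocessing is not reproduced.

```python
def dominates(a, b):
    """a dominates b: no worse in every attribute, strictly better in one (lower is better)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def skyline(points):
    """Batch skyline: keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

# Hypothetical (price, distance) objects; both attributes are minimized.
objects = [(50, 2.0), (80, 0.5), (60, 1.0), (90, 2.5)]
best = skyline(objects)
```

Here (90, 2.5) is dominated by (50, 2.0) and drops out; the other three are mutually incomparable and form the skyline.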

Development of Machine Learning Model Use Cases for Intelligent Internet of Things Technology Education (지능형 사물인터넷 기술 교육을 위한 머신러닝 모델 활용 사례 개발)

  • Kyeong Hur
    • Journal of Practical Engineering Education
    • /
    • v.16 no.4
    • /
    • pp.449-457
    • /
    • 2024
  • AIoT, the intelligent Internet of Things, refers to technology that collects data measured by IoT devices and applies machine learning to create and use predictive models. Existing research on AIoT technology education has focused on building educational AIoT platforms and teaching how to use them; case studies teaching the process of automatically creating and using machine learning models from data measured by IoT devices have been lacking. In this paper, we developed such a machine learning model use case for AIoT technology education. The case consists of the following steps: data collection from AIoT devices, data preprocessing, automatic creation of machine learning models, calculation of the accuracy of each model, determination of valid models, and data prediction using the valid models. Because the sensors in AIoT devices measure values over different ranges, we also present an example of data preprocessing that accounts for this. In addition, we developed a case in which AIoT devices determine what information they can predict by automatically generating several machine learning models and selecting the effective, high-accuracy models among them. By applying the developed cases, a variety of AIoT-based educational content, such as prediction-based object control, can be developed.
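
The preprocessing step for sensors with different value ranges can be illustrated with a simple min-max scaling sketch; the sensor columns below are hypothetical.

```python
def min_max_scale(columns):
    """Rescale each sensor column to [0, 1] so that differing ranges become comparable."""
    scaled = []
    for col in columns:
        lo, hi = min(col), max(col)
        span = (hi - lo) or 1.0                 # guard against a constant column
        scaled.append([(v - lo) / span for v in col])
    return scaled

# Hypothetical sensor readings with very different ranges.
temp = [18.0, 22.0, 26.0]        # e.g. degrees Celsius
co2 = [400.0, 800.0, 1200.0]     # e.g. ppm
scaled_temp, scaled_co2 = min_max_scale([temp, co2])
```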

Hierarchical Overlapping Clustering to Detect Complex Concepts (중복을 허용한 계층적 클러스터링에 의한 복합 개념 탐지 방법)

  • Hong, Su-Jeong;Choi, Joong-Min
    • Journal of Intelligence and Information Systems
    • /
    • v.17 no.1
    • /
    • pp.111-125
    • /
    • 2011
  • Clustering is a process of grouping similar or relevant documents into a cluster and assigning a meaningful concept to the cluster. By this process, clustering facilitates fast and correct search for the relevant documents by narrowing down the range of searching only to the collection of documents belonging to related clusters. For effective clustering, techniques are required for identifying similar documents and grouping them into a cluster, and discovering a concept that is most relevant to the cluster. One of the problems often appearing in this context is the detection of a complex concept that overlaps with several simple concepts at the same hierarchical level. Previous clustering methods were unable to identify and represent a complex concept that belongs to several different clusters at the same level in the concept hierarchy, and also could not validate the semantic hierarchical relationship between a complex concept and each of simple concepts. In order to solve these problems, this paper proposes a new clustering method that identifies and represents complex concepts efficiently. We developed the Hierarchical Overlapping Clustering (HOC) algorithm that modified the traditional Agglomerative Hierarchical Clustering algorithm to allow overlapped clusters at the same level in the concept hierarchy. The HOC algorithm represents the clustering result not by a tree but by a lattice to detect complex concepts. We developed a system that employs the HOC algorithm to carry out the goal of complex concept detection. This system operates in three phases; 1) the preprocessing of documents, 2) the clustering using the HOC algorithm, and 3) the validation of semantic hierarchical relationships among the concepts in the lattice obtained as a result of clustering. The preprocessing phase represents the documents as x-y coordinate values in a 2-dimensional space by considering the weights of terms appearing in the documents. 
First, it goes through a refinement process, applying stopword removal and stemming to extract index terms. Then each index term is assigned a TF-IDF weight, and the x-y coordinate value for each document is determined by combining the TF-IDF values of the terms in it. The clustering phase uses the HOC algorithm, in which the similarity between documents is calculated by the Euclidean distance method. Initially, a cluster is generated for each document by grouping the documents closest to it. Then the distance between any two clusters is measured, and the closest clusters are grouped into a new cluster. This process is repeated until the root cluster is generated. In the validation phase, feature selection is applied to validate the appropriateness of the cluster concepts built by the HOC algorithm, checking whether they have meaningful hierarchical relationships. Feature selection extracts key features from a document by identifying and weighting its important and representative terms. To select key features correctly, a method is needed to determine how much each term contributes to the class of the document. Among several methods achieving this goal, this paper adopted the chi-square (χ²) statistic, which measures the degree of dependency of a term t on a class c and represents the relationship between t and c as a numerical value. To demonstrate the effectiveness of the HOC algorithm, a series of performance evaluations was carried out using the well-known Reuters-21578 news collection. The results showed that the HOC algorithm greatly contributes to detecting and producing complex concepts by generating the concept hierarchy in a lattice structure.
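
The chi-square statistic used in the validation phase can be computed from a 2x2 term-class contingency table; below is a minimal sketch with hypothetical document counts.

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square dependence of term t and class c from a 2x2 contingency table.
    n11: docs in c with t, n10: docs outside c with t,
    n01: docs in c without t, n00: docs outside c without t."""
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
    return num / den

# Hypothetical counts: the term occurs mostly inside the class, so dependence is high.
score = chi_square(n11=40, n10=10, n01=10, n00=40)
```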