• Title/Summary/Keyword: Large Data Set

Density Adaptive Grid-based k-Nearest Neighbor Regression Model for Large Dataset (대용량 자료에 대한 밀도 적응 격자 기반의 k-NN 회귀 모형)

  • Liu, Yiqi;Uk, Jung
    • Journal of Korean Society for Quality Management, v.49 no.2, pp.201-211, 2021
  • Purpose: This paper proposes a density adaptive grid algorithm for the k-NN regression model to reduce computation time for large datasets without significant loss of prediction accuracy. Methods: The proposed method uses the concept of a grid with centroids to reduce the number of reference data points, so that the required computation time is greatly reduced. Since the grid generation process in this paper is based on quantiles of the original variables, the proposed method fully reflects the density information of the original reference data set. Results: Using five real-life datasets, the proposed k-NN regression model is compared with the original k-NN regression model. The results show that the proposed density adaptive grid-based k-NN regression model is superior to the original k-NN regression in terms of data reduction ratio and time efficiency ratio, and provides a similar prediction error if an appropriate number of grids is selected. Conclusion: The proposed density adaptive grid algorithm for the k-NN regression model is a simple and effective model which helps avoid a large loss of prediction accuracy while offering faster execution and lower memory requirements during the testing phase.
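
A minimal sketch of the quantile-grid idea described above, assuming NumPy and scikit-learn's KNeighborsRegressor as the base model; the bin count, per-cell centroid/mean-response reduction, and function name are illustrative, as the abstract does not give implementation details:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def quantile_grid_knn(X, y, n_bins=10, k=5):
    """Compress the reference set to quantile-grid cell centroids,
    then fit k-NN on the centroids and per-cell mean responses."""
    # Equal-frequency (quantile) bin edges per variable, so dense
    # regions of the data get finer cells -- the "density adaptive" idea.
    edges = [np.quantile(X[:, j], np.linspace(0, 1, n_bins + 1)[1:-1])
             for j in range(X.shape[1])]
    # Grid-cell code of every reference point
    codes = np.stack([np.searchsorted(edges[j], X[:, j])
                      for j in range(X.shape[1])], axis=1)
    _, inv = np.unique(codes, axis=0, return_inverse=True)
    n_cells = inv.max() + 1
    counts = np.bincount(inv, minlength=n_cells).astype(float)
    # Centroid of X and mean of y for each occupied cell
    centroids = np.column_stack(
        [np.bincount(inv, weights=X[:, j], minlength=n_cells) / counts
         for j in range(X.shape[1])])
    mean_y = np.bincount(inv, weights=y, minlength=n_cells) / counts
    return KNeighborsRegressor(n_neighbors=min(k, n_cells)).fit(centroids, mean_y)
```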

Sequential Pattern Mining for Intrusion Detection System with Feature Selection on Big Data

  • Fidalcastro, A;Baburaj, E
    • KSII Transactions on Internet and Information Systems (TIIS), v.11 no.10, pp.5023-5038, 2017
  • Big data is an emerging technology that deals with data sets whose size is beyond the ability of commonly used software tools to process. When we consider a huge network, we have to process a large amount of generated network information, consisting of both normal and abnormal activity logs in large volumes of multi-dimensional data. An Intrusion Detection System (IDS) is required to monitor the network and to detect malicious nodes and activities. The massive amount of data makes it difficult to detect threats and attacks. Sequential pattern mining, which has become popular because it considers the quantities, profits, and time order of items, can be used to identify patterns of malicious activity. Here we propose a sequential pattern mining algorithm with fuzzy-logic feature selection and fuzzy weighted support for huge volumes of network logs, implemented on Apache Hadoop YARN, which addresses the speed and time constraints. Fuzzy-logic feature selection selects important features from the feature set. Fuzzy weighted support assigns weights to the inputs and avoids multiple scans. In our simulation we use attack logs from an NS-2 MANET environment and compare the proposed algorithm with the state-of-the-art sequential pattern mining algorithm SPADE and with a Support Vector Machine in a Hadoop environment.
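
The abstract does not spell out the fuzzy weighted support formula; one plausible reading, in which ordinary subsequence support is scaled by the mean fuzzy weight of the pattern's items, is sketched below (the item names and weights are purely illustrative):

```python
def weighted_support(pattern, sequences, weights):
    """Fuzzy-weighted support of a sequential pattern: the fraction of
    log sequences containing `pattern` as a subsequence, scaled by the
    mean fuzzy weight of the pattern's items."""
    def is_subsequence(seq, pat):
        it = iter(seq)
        return all(item in it for item in pat)   # order-preserving scan
    mean_w = sum(weights.get(item, 0.0) for item in pattern) / len(pattern)
    hits = sum(is_subsequence(seq, pattern) for seq in sequences)
    return mean_w * hits / len(sequences)

# Illustrative logs; the weights would come from fuzzy-logic feature selection.
logs = [["login", "scan", "drop"], ["login", "drop"], ["scan", "login"]]
w = {"login": 0.4, "scan": 0.9, "drop": 1.0}
print(weighted_support(["scan", "drop"], logs, w))  # 0.95 * 1/3
```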

New Optimization Algorithm for Data Clustering (최적화에 기반 한 데이터 클러스터링 알고리즘)

  • Kim, Ju-Mi
    • Journal of Intelligence and Information Systems, v.13 no.3, pp.31-45, 2007
  • Handling large data is one of the critical issues the data mining community faces. This is particularly true for computationally intense tasks such as data clustering. Random sampling of instances is one possible means of handling large data, but a pervasive problem with this approach is how to deal with noise in the evaluation of the learning algorithm. This paper develops a new optimization-based clustering approach using an algorithm specifically designed for noisy performance evaluation. Numerical results show that this algorithm performs better than other algorithms such as PAM and CLARA. Moreover, with this algorithm, substantial savings in computational time can be achieved using partial data without sacrificing solution quality.
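
The abstract leaves the optimization algorithm unspecified; as a hedged illustration of clustering under a noisy, sample-based objective, a CLARA-like medoid search might look like the following sketch (all names and parameter choices are illustrative assumptions):

```python
import numpy as np

def noisy_medoid_search(X, k, n_iter=300, sample_size=200, seed=0):
    """Stochastic medoid search where each candidate is scored on a random
    subsample, so the objective is deliberately noisy; incumbent and
    challenger are compared on the same sample to keep the comparison fair."""
    rng = np.random.default_rng(seed)

    def cost(medoids, idx):
        d = np.linalg.norm(X[idx][:, None, :] - X[medoids][None, :, :], axis=2)
        return d.min(axis=1).mean()   # mean distance to nearest medoid

    best = rng.choice(len(X), size=k, replace=False)
    for _ in range(n_iter):
        cand = best.copy()
        cand[rng.integers(k)] = rng.integers(len(X))   # swap one medoid
        idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
        if cost(cand, idx) < cost(best, idx):
            best = cand
    return best
```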

An Efficient Bulk Loading for High Dimensional Index Structures (고차원 색인 구조를 위한 효율적인 벌크 로딩)

  • Bok, Kyoung-Soo;Lee, Seok-Hee;Cho, Ki-Hyung;Yoo, Jae-Soo
    • The Transactions of the Korea Information Processing Society, v.7 no.8, pp.2327-2340, 2000
  • Existing bulk loading algorithms for multi-dimensional index structures fail to satisfy both index construction time and retrieval performance. In this paper, we propose an efficient bulk loading algorithm for constructing high dimensional index structures over large data sets that overcomes this problem. Although several bulk loading algorithms have been proposed for this purpose, none of them improve both construction time and search performance. To improve construction time, we do not sort the whole data set; instead we use a bisection algorithm that divides the whole data set or a subset into two partitions according to a specific pivot value. We also improve search performance by selecting split positions according to the distribution properties of the data set. We show through various experiments that the proposed algorithm is superior to existing algorithms in terms of construction time and search performance.
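
The bisection step can be illustrated with a short sketch, assuming NumPy; the spread-based choice of split dimension below is a simplification standing in for the paper's distribution-based split selection:

```python
import numpy as np

def bulk_load(X, leaf_size=32):
    """Top-down bulk loading by bisection: instead of fully sorting the
    data, partition it around the median of one dimension in O(n) with
    argpartition, then recurse on the two halves."""
    if len(X) <= leaf_size:
        return {"leaf": X}
    dim = int(np.argmax(X.max(axis=0) - X.min(axis=0)))   # widest extent
    mid = len(X) // 2
    order = np.argpartition(X[:, dim], mid)               # bisection, no sort
    return {"dim": dim,
            "split": float(X[order[mid], dim]),
            "left": bulk_load(X[order[:mid]], leaf_size),
            "right": bulk_load(X[order[mid:]], leaf_size)}
```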

Extraction Method of Significant Clinical Tests Based on Data Discretization and Rough Set Approximation Techniques: Application to Differential Diagnosis of Cholecystitis and Cholelithiasis Diseases (데이터 이산화와 러프 근사화 기술에 기반한 중요 임상검사항목의 추출방법: 담낭 및 담석증 질환의 감별진단에의 응용)

  • Son, Chang-Sik;Kim, Min-Soo;Seo, Suk-Tae;Cho, Yun-Kyeong;Kim, Yoon-Nyun
    • Journal of Biomedical Engineering Research, v.32 no.2, pp.134-143, 2011
  • Selecting meaningful clinical tests and their reference values from high-dimensional clinical data with an imbalanced class distribution, in which one class is represented by a large number of examples while the other is represented by only a few, is an important but difficult issue for differential diagnosis between similar diseases. For this purpose, this study introduces methods based on the concepts of the discernibility matrix and discernibility function in rough set theory (RST), together with two discretization approaches, equal-width and equal-frequency discretization. The discretization approaches are used to define the reference values for clinical tests, and the discernibility matrix and function are used to extract a subset of significant clinical tests from the translated nominal attribute values. To show its applicability to the differential diagnosis problem, we applied the method to extract the significant clinical tests and their reference values between a normal group (N = 351) and an abnormal group (N = 101) with either cholecystitis or cholelithiasis. In addition, we investigated not only the selected significant clinical tests and the variations of their reference values, but also the average predictive accuracies on four evaluation criteria, i.e., accuracy, sensitivity, specificity, and geometric mean, during 10-fold cross validation. From the experimental results, we confirmed that the rough set approximation methods based on the two discretization approaches give better results with relative frequency than with absolute frequency on the evaluation criteria (i.e., average geometric mean). This shows that the prediction model using relative frequency can be used effectively in classification and prediction problems on clinical data with an imbalanced class distribution.
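
As a rough sketch of the two building blocks named above, equal-frequency discretization for reference values and the discernibility matrix for attribute selection, consider the following; it computes only the rough set core (singleton discernibility entries), a simplification of a full discernibility-function reduction:

```python
import numpy as np

def equal_frequency_edges(x, n_bins=3):
    """Candidate reference values as quantile cut points."""
    return np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])

def core_attributes(X_disc, y):
    """Attributes appearing as singleton entries of the discernibility
    matrix: for some pair of objects with different class labels, the
    attribute is the only one whose (discretized) values differ."""
    core = set()
    for i in range(len(y)):
        for j in range(i + 1, len(y)):
            if y[i] != y[j]:
                diff = np.flatnonzero(X_disc[i] != X_disc[j])
                if diff.size == 1:
                    core.add(int(diff[0]))
    return core
```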

Analysis of IT Service Quality Elements Using Text Sentiment Analysis (텍스트 감정분석을 이용한 IT 서비스 품질요소 분석)

  • Kim, Hong Sam;Kim, Chong Su
    • Journal of Korean Society of Industrial and Systems Engineering, v.43 no.4, pp.33-40, 2020
  • In order to satisfy customers, it is important to identify the quality elements that affect customer satisfaction. The Kano model has been widely used for identifying multi-dimensional quality attributes for this purpose. However, the model suffers from various shortcomings and limitations, especially those related to survey practices such as data volume, reply attitude, and cost. In this research, a model based on text sentiment analysis is proposed, which aims to substitute sentiment analysis for the survey-based data gathering process of Kano models. In this model, quality elements for the research are extracted from a set of opinion texts using morpheme analysis. The opinions' polarity attributes are evaluated using text sentiment analysis, and the polarity text items are transformed into equivalent Kano survey questions. Replies for the transformed survey questions are generated based on the total score of the original data. Then, the question-reply set is analyzed using both the original Kano evaluation method and the satisfaction index method. The proposed research model has been tested using a large amount of data from public IT service project evaluations. The results show that it can replace the existing practice and promises advantages in terms of the quality and cost of data gathering. The authors hope that the proposed model may serve as a new quality analysis model for a wide range of areas.
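
The satisfaction index step can be illustrated with Berger's customer-satisfaction coefficients, a standard choice, though the abstract does not specify the exact index used; the category counts below are illustrative stand-ins for counts derived from the sentiment-based Kano replies:

```python
def kano_indices(counts):
    """Berger's customer-satisfaction coefficients from Kano category
    counts: A = attractive, O = one-dimensional, M = must-be,
    I = indifferent. Better lies in [0, 1], Worse in [-1, 0]."""
    A, O, M, I = (counts.get(c, 0) for c in "AOMI")
    total = A + O + M + I
    return (A + O) / total, -(O + M) / total

# Illustrative counts for one quality element
print(kano_indices({"A": 18, "O": 40, "M": 22, "I": 10}))  # ≈ (0.644, -0.689)
```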

An RDBMS-based Inverted Index Technique for Path Queries Processing on XML Documents with Different Structures (상이한 구조의 XML문서들에서 경로 질의 처리를 위한 RDBMS기반 역 인덱스 기법)

  • 민경섭;김형주
    • Journal of KIISE:Databases, v.30 no.4, pp.420-428, 2003
  • XML is a data-oriented language for representing all types of documents, including web documents. With the advent of XML-based document generation tools, the growth of proprietary XML documents created with those tools, and the accelerating translation of legacy data into XML, we have obtained a large number of differently structured XML documents. It is therefore increasingly important to retrieve the right documents from such a document set. However, previous work on XML has mainly focused on storage and retrieval methods for a single large XML document or for XML documents sharing the same DTD, and research that supported structural differences did not efficiently process path queries on such document sets. To resolve this problem, we suggest a new inverted index mechanism using an RDBMS and show that it outperforms previous work. In particular, as it shows higher efficiency for indirect containment relationships, we argue that the index structure is well suited to sets of differently structured XML documents.
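
One common way to realize such an RDBMS-based inverted index is region numbering, under which an indirect containment query becomes a self-join over tag-indexed element lists; the sketch below uses SQLite and an illustrative schema, since the paper's exact table layout is not given in the abstract:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE elem(docid INTEGER, tag TEXT, start INTEGER, stop INTEGER);
CREATE INDEX elem_tag ON elem(tag, docid, start);
""")
# Region numbering: an element's (start, stop) interval encloses the
# intervals of all its descendants, whatever the document structure.
con.executemany("INSERT INTO elem VALUES (?,?,?,?)", [
    (1, "book",   1, 10),
    (1, "author", 2, 3),
    (2, "book",   1, 8),
    (2, "editor", 2, 3),
])
# //book//author: indirect containment becomes an interval join over
# the tag-indexed (inverted) element lists -- no document traversal.
rows = con.execute("""
    SELECT a.docid FROM elem b JOIN elem a
      ON a.docid = b.docid AND b.start < a.start AND a.stop < b.stop
    WHERE b.tag = 'book' AND a.tag = 'author'
""").fetchall()
print(rows)  # [(1,)]
```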

A Study on Defect Prediction through Real-time Monitoring of Die-Casting Process Equipment (주조공정 설비에 대한 실시간 모니터링을 통한 불량예측에 대한 연구)

  • Chulsoon Park;Heungseob Kim
    • Journal of Korean Society of Industrial and Systems Engineering, v.45 no.4, pp.157-166, 2022
  • In a die-casting process, defects that are difficult to confirm by visual inspection, such as shrinkage bubbles, may occur due to an error in maintaining a vacuum state. Since these casting defects are discovered during post-processing operations such as heat treatment or finishing work, they cannot be addressed at casting time, which can lead to a large number of defects. In this study, we propose an approach that can predict the occurrence of casting defects by defect type using machine learning technology, based on casting parameter data collected in real time from equipment in the die-casting process. Die-casting parameter data can basically be collected through the casting equipment controller. In order to perform classification analysis for predicting defects by defect type, the casting parameters must be labeled. In this study, first, the defective data set is separated by performing primary clustering based on the total defect rate obtained during post-processing. Second, secondary cluster analysis is performed using the defect rate by type for the separated defective data set, and labeling is performed by defect type using the cluster analysis result. Finally, a classification learning model is created from the entire labeled data set, and a real-time monitoring system for defect prediction was implemented using LabVIEW and Python. When a defect is predicted, the operator is notified, for example via the monitoring screen and an alarm, so that they can respond.
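
A compact sketch of the two-stage labeling pipeline described above, assuming scikit-learn's KMeans and RandomForestClassifier as stand-ins, since the abstract does not name the clustering or classification algorithms used:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

def label_and_train(params, total_rate, rate_by_type, n_types=3):
    """Two-stage labeling sketch: primary clustering on the overall defect
    rate separates good from defective shots; secondary clustering on the
    per-type defect rates assigns a defect-type label to the defective
    group; a classifier is then fit on the casting parameters (all arrays)."""
    stage1 = KMeans(n_clusters=2, n_init=10).fit_predict(total_rate.reshape(-1, 1))
    defective = max(range(2), key=lambda c: total_rate[stage1 == c].mean())
    bad = stage1 == defective
    labels = np.zeros(len(params), dtype=int)              # 0 = good
    labels[bad] = 1 + KMeans(n_clusters=n_types, n_init=10).fit_predict(
        rate_by_type[bad])                                 # 1..n_types = defect types
    return RandomForestClassifier(random_state=0).fit(params, labels)
```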

An Active Candidate Set Management Model on Association Rule Discovery using Database Trigger and Incremental Update Technique (트리거와 점진적 갱신기법을 이용한 연관규칙 탐사의 능동적 후보항목 관리 모델)

  • Hwang, Jeong-Hui;Sin, Ye-Ho;Ryu, Geun-Ho
    • Journal of KIISE:Databases, v.29 no.1, pp.1-14, 2002
  • Association rule discovery is a method of mining associated item sets from large databases based on support and confidence thresholds. The discovered association rules can be applied to marketing pattern analysis in e-commerce, large shopping malls, and so on. Association rule discovery makes multiple scans over a database storing large amounts of transaction data, so an algorithm with such high overhead is not useful for real-time association rule discovery in a dynamic environment. Therefore, this paper proposes an active candidate set management model based on triggers and an incremental update mechanism to overcome the non-real-time limitation of association rule discovery. To implement the proposed model, we not only describe an implementation model for the incremental update operation but also evaluate the performance characteristics of the model through experiments.
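
The trigger-based incremental maintenance can be sketched with SQLite: a toy version that keeps only 1-itemset counts current on every transaction insert, whereas the proposed model manages the full candidate item set (table and trigger names are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE tx(tid INTEGER, item TEXT);
CREATE TABLE item_count(item TEXT PRIMARY KEY, cnt INTEGER);
-- The trigger keeps 1-itemset supports current on every insert, so the
-- miner reads up-to-date counts instead of rescanning all transactions.
CREATE TRIGGER tx_insert AFTER INSERT ON tx
BEGIN
    INSERT OR IGNORE INTO item_count VALUES (NEW.item, 0);
    UPDATE item_count SET cnt = cnt + 1 WHERE item = NEW.item;
END;
""")
con.executemany("INSERT INTO tx VALUES (?,?)",
                [(1, "a"), (1, "b"), (2, "a"), (3, "b"), (3, "c")])
print(con.execute("SELECT item, cnt FROM item_count ORDER BY item").fetchall())
# [('a', 2), ('b', 2), ('c', 1)]
```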

The Effects of Selection Attributes for HMR on Satisfaction and Repurchase Intention: Comparative Analysis of Convenience Store and Large Market (HMR 선택속성이 만족과 재구매의도에 미치는 영향: 편의점과 대형마트의 비교 분석)

  • Yang, Dong-Hwi
    • Culinary Science and Hospitality Research, v.24 no.3, pp.204-214, 2018
  • This study set up research models and hypotheses to examine the influence of HMR selection attributes on satisfaction and repurchase intention by distribution channel (convenience store/large market) and verified the research hypotheses through empirical analysis. Respondents were recruited by convenience sampling among HMR purchasers at convenience stores and large markets in the Seoul and Gyeonggi areas. The survey was conducted from January 8, 2018 to January 26, 2018; 300 questionnaires were distributed and 289 of them were used as valid data. SPSS 20.0 was used for the empirical analysis. The results are as follows. First, among the HMR selection attributes at convenience stores, only product quality has a significant effect on satisfaction; product safety and convenience have no significant effect. Second, among the HMR selection attributes at large markets, only convenience has a significant effect on satisfaction; product safety and product quality have no significant effect. Third, HMR satisfaction at both convenience stores and large markets has a significant effect on repurchase intention. The purpose of this study is to investigate the relationships among HMR selection attributes, satisfaction, and repurchase intention, which are important in existing HMR research, by distribution channel (convenience store/large market). It is meaningful in helping establish an effective sales strategy for each segment.