A Novel of Data Clustering Architecture for Outlier Detection to Electric Power Data Analysis (전력데이터 분석에서 이상점 추출을 위한 데이터 클러스터링 아키텍처에 관한 연구)
-
- KIPS Transactions on Software and Data Engineering
- /
- v.6 no.10
- /
- pp.465-472
- /
- 2017
In the past, researchers mainly used the supervised learning technique of machine learning to analyze power data and investigated the identification of patterns through the data mining technique. Data analysis research, however, faces its limitations with the old data classification and analysis techniques today when the size of electric power data has increased with the possible real-time provision of data. This study thus set out to propose a clustering architecture to analyze large-sized electric power data. The clustering process proposed in the study supplements the K-means algorithm, an unsupervised learning technique, for its problems and is capable of automating the entire process from the collection of electric power data to their analysis. In the present study, power data were categorized and analyzed in total three levels, which include the row data level, clustering level, and user interface level. In addition, the investigator identified K, the ideal number of clusters, based on principal component analysis and normal distribution and proposed an altered K-means algorithm to reduce data that would be categorized as ideal points in order to increase the efficiency of clustering.
The increased transport of marine hazardous and noxious substances (HNS) has resulted in frequent HNS spill accidents domestically and internationally. There are about 6,000 species of HNS internationally, and most of them have toxic properties. When an accidental HNS spill occurs, it can destroys the marine ecosystem and can damage life and property due to explosion and fire. Constructing a spectral library of HNS according to wavelength and developing a detection algorithm would help prepare for accidents. In this study, a ground HNS spill experiment was conducted in France. The toluene spectrum was determined through hyperspectral sensor measurements. HNS present in the hyperspectral images were detected by applying the spectral mixture algorithm. Preprocessing principal component analysis (PCA) removed noise and performed dimensional compression. The endmember spectra of toluene and seawater were extracted through the N-FINDR technique. By calculating the abundance fraction of toluene and seawater based on the spectrum, the detection accuracy of HNS in all pixels was presented as a probability. The probability was compared with radiance images at a wavelength of 418.15 nm to select abundance fractions with maximum detection accuracy. The accuracy exceeded 99% at a ratio of approximately 42%. Response to marine spills of HNS are presently impeded by the restricted access to the site because of high risk of exposure to toxic compounds. The present experimental and detection results could help estimate the area of contamination with HNS based on hyperspectral remote sensing.
In a nuclear power plant (NPP), periodic sensor calibrations are required to assure sensors are operating correctly. However, only a few faulty sensors are found to be calibrated. For the safe operation of an NPP and the reduction of unnecessary calibration, on-line calibration monitoring is needed. In this paper, principal component-based Auto-Associative support vector regression (PCSVR) was proposed for the sensor signal validation of the NPP. It utilizes the attractive merits of principal component analysis (PCA) for extracting predominant feature vectors and AASVR because it easily represents complicated processes that are difficult to model with analytical and mechanistic models. With the use of real plant startup data from the Kori Nuclear Power Plant Unit 3, SVR hyperparameters were optimized by the response surface methodology (RSM). Moreover the statistical techniques are integrated with PCSVR for the failure detection. The residuals between the estimated signals and the measured signals are tested by the Shewhart Control Chart, Exponentially Weighted Moving Average (EWMA), Cumulative Sum (CUSUM) and generalized likelihood ratio test (GLRT) to detect whether the sensors are failed or not. This study shows the GLRT can be a candidate for the detection of sensor drift.
From January 2020 to October 2021, more than 500,000 academic studies related to COVID-19 (Coronavirus-2, a fatal respiratory syndrome) have been published. The rapid increase in the number of papers related to COVID-19 is putting time and technical constraints on healthcare professionals and policy makers to quickly find important research. Therefore, in this study, we propose a method of extracting useful information from text data of extensive literature using LDA and Word2vec algorithm. Papers related to keywords to be searched were extracted from papers related to COVID-19, and detailed topics were identified. The data used the CORD-19 data set on Kaggle, a free academic resource prepared by major research groups and the White House to respond to the COVID-19 pandemic, updated weekly. The research methods are divided into two main categories. First, 41,062 articles were collected through data filtering and pre-processing of the abstracts of 47,110 academic papers including full text. For this purpose, the number of publications related to COVID-19 by year was analyzed through exploratory data analysis using a Python program, and the top 10 journals under active research were identified. LDA and Word2vec algorithm were used to derive research topics related to COVID-19, and after analyzing related words, similarity was measured. Second, papers containing 'vaccine' and 'treatment' were extracted from among the topics derived from all papers, and a total of 4,555 papers related to 'vaccine' and 5,971 papers related to 'treatment' were extracted. did For each collected paper, detailed topics were analyzed using LDA and Word2vec algorithms, and a clustering method through PCA dimension reduction was applied to visualize groups of papers with similar themes using the t-SNE algorithm. A noteworthy point from the results of this study is that the topics that were not derived from the topics derived for all papers being researched in relation to COVID-19 (