• Title/Summary/Keyword: Outlier analysis

A Data Mining Tool for Massive Trajectory Data (대규모 궤적 데이타를 위한 데이타 마이닝 툴)

  • Lee, Jae-Gil
    • Journal of KIISE: Computing Practices and Letters / v.15 no.3 / pp.145-153 / 2009
  • Trajectory data are ubiquitous in the real world. Recent progress in satellite, sensor, RFID, video, and wireless technologies has made it possible to systematically track object movements and collect huge amounts of trajectory data. Accordingly, there is ever-increasing interest in performing data analysis over trajectory data. In this paper, we develop a data mining tool for massive trajectory data. This mining tool supports the three most widely used operations: clustering, classification, and outlier detection. Trajectory clustering discovers common movement patterns, trajectory classification predicts the class labels of moving objects based on their trajectories, and trajectory outlier detection finds trajectories that are grossly different from or inconsistent with the remaining set of trajectories. The primary advantage of the mining tool is that it exploits the information in partial trajectories during data mining. The effectiveness of the mining tool is shown using various real trajectory data sets. We believe we have provided practical software for trajectory data mining that can be used in many real applications.
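
As a rough illustration of the partition-and-detect idea behind trajectory outlier detection, the sketch below flags a trajectory when too many of its segments have few close segments among the other trajectories. The midpoint segment distance, the thresholds, and the demo trajectories are all simplified assumptions, not the paper's actual TRAOD measures:

```python
import numpy as np

def segments(traj):
    """Partition a trajectory (an n x 2 array of points) into line segments."""
    return [(traj[i], traj[i + 1]) for i in range(len(traj) - 1)]

def seg_dist(s, t):
    # Simplified segment distance: Euclidean distance between midpoints.
    # (TRAOD combines perpendicular, parallel, and angular components;
    # midpoints keep this sketch short.)
    return np.linalg.norm((s[0] + s[1]) / 2 - (t[0] + t[1]) / 2)

def outlier_trajectories(trajs, d=1.5, frac=0.5, theta=0.4):
    """Flag trajectory i if more than a fraction theta of its segments
    have fewer than a fraction frac of all other segments within d."""
    all_segs = [(j, s) for j, tr in enumerate(trajs) for s in segments(tr)]
    flagged = []
    for i, tr in enumerate(trajs):
        others = [s for j, s in all_segs if j != i]
        n_out = sum(
            1 for s in segments(tr)
            if np.mean([seg_dist(s, o) <= d for o in others]) < frac
        )
        if n_out / len(segments(tr)) > theta:
            flagged.append(i)
    return flagged

# Tiny demo: two similar trajectories and one that strays.
t1 = np.array([[0, 0], [1, 0], [2, 0], [3, 0]], float)
t2 = np.array([[0, 0.2], [1, 0.2], [2, 0.2], [3, 0.2]], float)
t3 = np.array([[0, 0], [1, 2], [2, 4], [3, 6]], float)
print(outlier_trajectories([t1, t2, t3]))  # expected: [2]
```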

A Study on Developing a CER Using Production Cost Data in Korean Maneuver Weapon System (한국형 기동무기체계 양산비 비용추정관계식 개발에 관한 연구)

  • Lee, Doo-Hyun;Kim, Gak-Gyu
    • Journal of the Korean Operations Research and Management Science Society / v.39 no.3 / pp.51-61 / 2014
  • In this paper, we deal with developing a cost estimation relationship (CER) for Korean maneuverable weapons systems using historical production costs. To develop the CER, we collected historical production cost data for four tanks and five armored vehicles. We also analyzed the Required Operational Capability (ROC) of the weapons systems and chose cost drivers that allow their operational capabilities to be compared. We used forward selection, backward selection, stepwise regression, and $R^2$ selection to identify the cost drivers with the greatest influence on the dependent variable. We then used principal component regression, robust regression, and weighted regression to deal with multicollinearity and outliers among the data and develop a more appropriate CER. As a result, we were able to develop a production cost CER for Korean maneuverable weapons systems with the lowest cost errors. This research is therefore meaningful in that it develops a CER based on original Korean cost data without foreign data, and these methods will contribute to developing a Korean cost analysis program in the future.
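
A minimal sketch of the robust-regression step, using statsmodels' Huber M-estimator on synthetic data; the cost drivers (weight, power) and all figures below are invented for illustration, and the paper's variable-selection pipeline is not reproduced:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Hypothetical cost drivers: combat weight (tons) and engine power (hp)
# vs. unit production cost (synthetic, not the paper's data).
weight = rng.uniform(20, 60, 30)
power = rng.uniform(500, 1500, 30)
cost = 0.8 * weight + 0.05 * power + rng.normal(0, 2, 30)
cost[5] += 40  # inject one gross outlier

X = sm.add_constant(np.column_stack([weight, power]))

# Ordinary least squares is pulled toward the outlier ...
ols = sm.OLS(cost, X).fit()
# ... while Huber M-estimation (one robust choice) downweights it.
rlm = sm.RLM(cost, X, M=sm.robust.norms.HuberT()).fit()

print("OLS coefficients:   ", ols.params.round(3))
print("Robust coefficients:", rlm.params.round(3))
```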

Application of Statistical Geo-Spatial Information Technology to Soil Stratification (통계적 지반 공간 정보 기법을 이용한 지층구조 분석)

  • Kim, Han-Saem;Kim, Hyun-Ki;Shin, Si-Yeol;Chung, Choong-Ki
    • Journal of the Korean Geotechnical Society / v.27 no.7 / pp.59-68 / 2011
  • Subsurface investigation results always reflect a level of soil uncertainty, which sometimes requires statistical correction of the data for appropriate engineering decisions. This study suggests a closed-form framework for extracting outlying data points from testing results using statistical geo-spatial information analyses, combining outlier analysis with kriging-based cross-validation. The suggested analysis method is applied to soil stratification using borehole data from Yeouido.
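
Since ordinary kriging is mathematically close to Gaussian-process regression, a leave-one-out cross-validation screen of the kind the abstract describes can be sketched with scikit-learn's GaussianProcessRegressor; the borehole coordinates, depths, kernel, and 3-sigma rule here are illustrative assumptions:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(1)

# Hypothetical borehole data: (x, y) locations and bedrock depth (m).
xy = rng.uniform(0, 100, (25, 2))
depth = 10 + 0.05 * xy[:, 0] + 0.03 * xy[:, 1] + rng.normal(0, 0.3, 25)
depth[7] += 5.0  # one suspicious boring log

kernel = 1.0 * RBF(length_scale=30.0) + WhiteKernel(noise_level=0.1)

outliers = []
for i in range(len(depth)):  # leave-one-out cross-validation
    mask = np.arange(len(depth)) != i
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
    gp.fit(xy[mask], depth[mask])
    mu, sd = gp.predict(xy[i:i + 1], return_std=True)
    if abs(depth[i] - mu[0]) > 3 * sd[0]:  # 3-sigma screening rule
        outliers.append(i)

print("Flagged boreholes:", outliers)  # the injected anomaly (7) should appear
```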

Improving the Quality of Response Surface Analysis of an Experiment for Coffee-Supplemented Milk Beverage: I. Data Screening at the Center Point and Maximum Possible R-Square

  • Rheem, Sungsue;Oh, Sejong
    • Food Science of Animal Resources / v.39 no.1 / pp.114-120 / 2019
  • Response surface methodology (RSM) is a useful set of statistical techniques for modeling and optimizing responses in research studies of food science. As a design for a response surface experiment, a central composite design (CCD) with multiple runs at the center point is frequently used. However, sometimes some of the responses at the center point are outliers, and these outliers are overlooked. Since the responses from center runs come from the same experimental conditions, there should be no outliers at the center point. Outliers at the center point ruin statistical analysis. Thus, the responses at the center point need to be inspected, and if outliers are observed, they have to be examined. If the reasons for the outliers are not errors in measuring or typing, such outliers need to be deleted; if the outliers are due to such errors, they have to be corrected. Through a re-analysis of a dataset published in the Korean Journal for Food Science of Animal Resources, we have shown that outlier elimination increased the maximum possible R-square that modeling of the data can attain, which enables us to improve the quality of response surface analysis.
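
A minimal sketch of the center-point screening idea, assuming a robust median/MAD rule rather than the authors' exact criterion; the replicate values are invented:

```python
import numpy as np

def screen_center_runs(y, k=3.0):
    """Flag center-point replicates lying more than k robust SDs from
    the median. Center runs repeat one condition, so they should scatter
    only by pure error; a far-off replicate signals a recording mistake."""
    y = np.asarray(y, float)
    med = np.median(y)
    mad = np.median(np.abs(y - med))
    robust_sd = 1.4826 * mad  # MAD scaled to the normal distribution
    return np.where(np.abs(y - med) > k * robust_sd)[0]

# Hypothetical center-point responses (e.g., replicate sensory scores).
center_runs = [7.2, 7.4, 7.3, 9.9, 7.1, 7.3]
print(screen_center_runs(center_runs))  # flags the 9.9 at index 3
```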

Study on the applicability of the principal component analysis for detecting leaks in water pipe networks (상수관망의 누수감지를 위한 주성분 분석의 적용 가능성에 대한 연구)

  • Kim, Kimin;Park, Suwan
    • Journal of Korean Society of Water and Wastewater / v.33 no.2 / pp.159-167 / 2019
  • In this paper, the potential of principal component analysis (PCA) for detecting leaks in water pipe networks was evaluated. For this purpose, PCA was conducted on the recorded pipe flows of a case-study water distribution system, and the relevance between the outliers calculated by the PCA model and the recorded pipe leak incidents was analyzed. The PCA technique was enhanced by computational algorithms developed in this study, which extract a partial set of flow data from the original 24-hour flow data so that the effective outlier detection rate is maximized. The developed algorithm may be applied in prioritizing further leak detection field work for water distribution blocks with an effective outlier detection rate above 70%. However, the analysis suggested that further development of the algorithm is needed to enhance the applicability of PCA to leak detection by considering series of leak reports occurring within a relatively short period.
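
One common way to turn PCA into an outlier detector, sketched here on synthetic flow data, is Hotelling's T² on the component scores; the data shape, component count, and control limit are assumptions, not the paper's calibrated model:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)

# Hypothetical district flow records: 60 days x 24 hourly readings.
base = 50 + 20 * np.sin(np.linspace(0, 2 * np.pi, 24))
flows = base + rng.normal(0, 2, (60, 24))
flows[45] += 15  # a day with sustained extra flow, as a leak might cause

pca = PCA(n_components=3).fit(flows)
scores = pca.transform(flows)

# Hotelling's T^2: squared scores weighted by each component's variance.
t2 = np.sum(scores**2 / pca.explained_variance_, axis=1)
limit = t2.mean() + 3 * t2.std()  # simple control limit for this sketch

print("Days flagged as outliers:", np.where(t2 > limit)[0])  # day 45 expected
```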

Development of the Financial Account Pre-screening System for Corporate Credit Evaluation (분식 적발을 위한 재무이상치 분석시스템 개발)

  • Roh, Tae-Hyup
    • The Journal of Information Systems / v.18 no.4 / pp.41-57 / 2009
  • Although financial information greatly influences the decisions of the groups that use it, detecting management fraud and earnings manipulation is a difficult task under normal audit procedures and corporate credit evaluation processes, due to the shortage of knowledge about the characteristics of management fraud and the limitations of time and cost. These limitations suggest the need for a systematic process for effectively assessing the risk of earnings manipulation for credit evaluators, external auditors, financial analysts, and regulators. Most research on management fraud has examined how various characteristics of a company's management affect the occurrence of corporate fraud. This study examines the financial characteristics of companies engaged in fraudulent financial reporting and suggests a model and system for detecting GAAP violations to improve the reliability of accounting information and the transparency of management. Since the detection of management fraud has little proven theory, this study used outlier detection (upper and lower bounds) on financial ratios as a real-field application. The strength of the outlier detection method is its ease of use and understandability. In the suggested model, 14 variables from the 7 useful variable categories among 76 financial ratio variables are selected through distribution analysis as possible indicators of fraudulent financial statement accounts. The model developed from these variables shows an 80.82% hit ratio for the holdout sample. This model was developed as a financial outlier detecting system for a financial institution. External auditors, financial analysts, regulators, and other users of financial statements might use this model to pre-screen potential earnings manipulators in the credit evaluation system. In particular, this model will help loan evaluators at financial institutions decide more objective and effective credit ratings and improve the quality of financial statements.
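
A minimal sketch of upper/lower-bound outlier screening on financial ratios, here using Tukey's IQR fences on an invented two-ratio table; the paper's 14 variables and distribution-analysis bounds are not reproduced:

```python
import pandas as pd

# Hypothetical financial-ratio table; two ratios stand in for the
# paper's 14 screening variables.
df = pd.DataFrame({
    "firm": ["A", "B", "C", "D", "E", "F"],
    "receivables_to_sales": [0.12, 0.15, 0.11, 0.95, 0.13, 0.14],
    "inventory_to_assets":  [0.20, 0.22, 0.19, 0.21, 0.78, 0.23],
})

def iqr_bounds(s, k=1.5):
    """Classic Tukey fences: values outside [Q1 - k*IQR, Q3 + k*IQR]
    are treated as lower/upper-bound outliers."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

for col in ["receivables_to_sales", "inventory_to_assets"]:
    lo, hi = iqr_bounds(df[col])
    flagged = df.loc[(df[col] < lo) | (df[col] > hi), "firm"].tolist()
    print(f"{col}: pre-screened firms {flagged}")
```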

A Study of the Robust Degradation Model by Analyzing the Filament Lamp Degradation Data (헤드램프용 필라멘트 램프 가속열화데이터 분석을 통한 로버스트 열화모형 연구)

  • Sung, Ki-Woo
    • Transactions of the Korean Society of Automotive Engineers / v.20 no.6 / pp.132-139 / 2012
  • Durability and lifetime generally need to be tested when parts are developed with new technology. In this paper, accelerated degradation analysis methods are developed for such testing. This study presents a robust model estimation method that is less affected by outliers in regression model estimation. In addition, the lifetime can be predicted from the degradation-stress relationship at each stress level.
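
A sketch of a robust degradation fit, assuming a linear degradation path and scikit-learn's Huber loss as the outlier-resistant estimator; the lamp data and 70% failure threshold are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import HuberRegressor

rng = np.random.default_rng(3)

# Hypothetical degradation data: luminous flux (%) of lamps over hours.
hours = np.linspace(0, 1000, 20)
flux = 100 - 0.02 * hours + rng.normal(0, 0.5, 20)
flux[12] -= 8  # one contaminated measurement

X = hours.reshape(-1, 1)

# Huber loss downweights the outlier instead of letting it tilt the fit.
model = HuberRegressor().fit(X, flux)
slope, intercept = model.coef_[0], model.intercept_

# Pseudo-lifetime: hours until flux crosses a 70% failure threshold.
lifetime = (70 - intercept) / slope
print(f"Degradation rate: {slope:.4f} %/h, lifetime ~ {lifetime:.0f} h")
```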

Frequency Analysis of Extreme Rainfall Using 3 Parameter Probability Distributions (3변수 확률분포형에 의한 극치강우의 빈도분석)

  • Kim, Byeong-Jun;Maeng, Sung-Jin;Ryoo, Kyong-Sik;Lee, Soon-Hyuk
    • Journal of The Korean Society of Agricultural Engineers / v.46 no.3 / pp.31-42 / 2004
  • This research seeks to derive design rainfalls through the L-moment method, applying tests of homogeneity, independence, and outliers to annual maximum daily rainfall data at 38 rainfall stations in Korea. To select the appropriate distribution of the annual maximum daily rainfall data for each station, the Generalized Extreme Value (GEV), Generalized Logistic (GLO), Generalized Pareto (GPA), Generalized Normal (GNO), and Pearson Type 3 (PT3) probability distributions were applied, and their aptness was judged using an L-moment ratio diagram and the Kolmogorov-Smirnov (K-S) test. Parameters of the appropriate distributions were estimated from the observed and simulated annual maximum daily rainfall using Monte Carlo techniques. Design rainfalls were finally derived from the GEV distribution, which proved more appropriate than the other distributions.
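
A small example of fitting a GEV distribution and deriving design rainfalls as return levels; note that scipy fits by maximum likelihood rather than the paper's L-moments, and the rainfall series below is simulated:

```python
import numpy as np
from scipy.stats import genextreme

rng = np.random.default_rng(4)

# Hypothetical annual maximum daily rainfall series (mm), 40 years.
annual_max = genextreme.rvs(c=-0.1, loc=150, scale=40, size=40,
                            random_state=rng)

# Fit the GEV distribution (scipy's c is the negated shape parameter xi).
c, loc, scale = genextreme.fit(annual_max)

# Design rainfall = return level: the quantile with exceedance 1/T.
for T in (10, 50, 100):
    level = genextreme.ppf(1 - 1 / T, c, loc=loc, scale=scale)
    print(f"{T:>3}-year design rainfall: {level:.1f} mm")
```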

Analysis and Performance enhancement of angle-based outlier detection (각도 기반 이상치 탐지 방법의 분석과 성능 개선)

  • Sin, Yong-Joon;Park, Cheong-Hee
    • Proceedings of the Korean Information Science Society Conference / 2010.06c / pp.452-457 / 2010
  • Angle-based outlier detection (ABOD), which was proposed as an effective outlier detection method for high-dimensional spaces, uses angles as the measure for comparing objects and therefore achieves better detection performance in high-dimensional spaces than typical distance-based outlier measures. However, when an outlier is surrounded by other outliers, it is difficult to distinguish from normal objects. This paper proposes an improvement of the existing outlier detection method and, through experiments comparing the existing method with the proposed one, demonstrates the improved performance.
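
A compact version of the angle-based outlier factor (ABOF) can be sketched as the variance, over all pairs of other points, of distance-weighted cosines; this drops the extra pair weighting of the original formulation, and the toy data are invented:

```python
import numpy as np
from itertools import combinations

def abof(a, others):
    """Simplified ABOF: variance of the distance-weighted angle spectrum
    at point a. A point inside a cluster sees others in all directions
    (high variance); an outlier sees them in a narrow cone (low variance)."""
    vals = []
    for b, c in combinations(others, 2):
        ab, ac = b - a, c - a
        vals.append(ab @ ac / (ab @ ab * (ac @ ac)))
    return np.var(vals)

def abod_rank(points):
    """Indices sorted by ABOF ascending: strongest outlier candidates first."""
    points = np.asarray(points, float)
    factors = [abof(p, np.delete(points, i, axis=0))
               for i, p in enumerate(points)]
    return np.argsort(factors)

# A small cluster plus one point far off to the side.
pts = [[0, 0], [1, 0], [0, 1], [1, 1], [0.5, 0.5], [8, 8]]
print(abod_rank(pts))  # the outlier (index 5) should rank first
```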

Automatic Cleaning Algorithm of Asset Data for Transmission Cable (지중 송전케이블 자산데이터의 자동 정제 알고리즘 개발연구)

  • Hwang, Jae-Sang;Mun, Sung-Duk;Kim, Tae-Joon;Kim, Kang-Sik
    • KEPCO Journal on Electric Power and Energy / v.7 no.1 / pp.79-84 / 2021
  • The fundamental element to be maintained for big data analysis, artificial intelligence technologies, and asset management systems is data quality, which can directly affect the reliability of the entire system. For this reason, the momentum of data cleaning work has recently increased, and data cleaning methods have been investigated around the world. In the field of electric power, however, asset data cleaning methods have not been fully established; therefore, an automatic cleaning algorithm for transmission cable asset data is studied in this paper. The cleaning algorithm is composed of missing data treatment and outlier data treatment. Rule-based and expert-opinion-based cleaning methods are combined and utilized for these dirty data.
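
A minimal sketch of the two cleaning stages (missing data treatment and outlier treatment) using pandas; the column names, domain bounds, and median imputation are illustrative assumptions, not the actual KEPCO rules:

```python
import numpy as np
import pandas as pd

# Hypothetical cable asset records; column names are illustrative only.
df = pd.DataFrame({
    "cable_id": ["C1", "C2", "C3", "C4", "C5"],
    "install_year": [1995, 2003, np.nan, 2010, 1875],  # one missing, one impossible
    "length_km": [1.2, np.nan, 2.4, 250.0, 0.8],       # one missing, one outlier
})

# Stage 1 (rule-based): domain bounds turn impossible values into missing.
df.loc[~df["install_year"].between(1960, 2025), "install_year"] = np.nan
df.loc[~df["length_km"].between(0.01, 100), "length_km"] = np.nan

# Stage 2 (statistical): values beyond 3 robust SDs from the median also
# become missing before imputation.
for col in ["install_year", "length_km"]:
    med = df[col].median()
    mad = (df[col] - med).abs().median()
    if mad > 0:
        df.loc[(df[col] - med).abs() > 3 * 1.4826 * mad, col] = np.nan

# Stage 3 (imputation): fill what remains with the column median; in the
# paper this step would defer to expert opinion where rules fall short.
df = df.fillna(df.median(numeric_only=True))
print(df)
```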