• Title/Summary/Keyword: Skewed Data

Fixed-accuracy confidence interval estimation of P(X > c) for a two-parameter gamma population

  • Zhuang, Yan;Hu, Jun;Zou, Yixuan
    • Communications for Statistical Applications and Methods
    • /
    • v.27 no.6
    • /
    • pp.625-639
    • /
    • 2020
  • The gamma distribution is a flexible right-skewed distribution widely used in many areas, and it is of great interest to estimate the probability of a random variable exceeding a specified value in survival and reliability analysis. Therefore, the study develops a fixed-accuracy confidence interval for P(X > c), where X follows a gamma distribution, Γ(α, β), and c is a preassigned positive constant, through: 1) a purely sequential procedure with known shape parameter α and unknown rate parameter β; and 2) a nonparametric purely sequential procedure with both shape and rate parameters unknown. Both procedures enjoy appealing asymptotic first-order efficiency and asymptotic consistency properties. Extensive simulations validate the theoretical findings. Three real-life data examples from health studies and a steel manufacturing study are discussed to illustrate the practical applicability of both procedures.
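
As a minimal illustration of the target quantity (not the paper's sequential stopping rules), the sketch below computes the two plug-in estimates of P(X > c) that the procedures are built around: a parametric estimate with known shape α and estimated rate β, and a nonparametric estimate with both parameters unknown. The shape, rate, threshold, and sample size are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, beta, c = 2.0, 0.5, 6.0                   # hypothetical shape, rate, threshold
x = rng.gamma(shape=alpha, scale=1 / beta, size=200)

# Parametric plug-in estimate with known shape alpha:
# the MLE of the rate is alpha / sample mean.
beta_hat = alpha / x.mean()
p_param = stats.gamma.sf(c, a=alpha, scale=1 / beta_hat)

# Nonparametric plug-in estimate (both parameters unknown).
p_nonpar = np.mean(x > c)

print(f"parametric estimate of P(X > {c}): {p_param:.3f}")
print(f"nonparametric estimate of P(X > {c}): {p_nonpar:.3f}")
```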

Anomaly Detection in Sensor Data

  • Kim, Jong-Min;Baik, Jaiwook
    • Journal of Applied Reliability
    • /
    • v.18 no.1
    • /
    • pp.20-32
    • /
    • 2018
  • Purpose: The purpose of this study is to set up anomaly detection criteria for sensor data coming from a motorcycle. Methods: Five sensor values for accelerator pedal, engine rpm, transmission rpm, gear, and speed are obtained every 0.02 seconds from a motorcycle. Exploratory data analysis is used to find any pattern in the data. Traditional process control methods such as the X control chart and time series models are fitted to find any anomalous behavior in the data. Finally, an unsupervised learning algorithm, k-means clustering, is used to find any anomalous spots in the sensor data. Results: According to the exploratory data analysis, the distribution of accelerator pedal sensor values is heavily skewed to the left. The motorcycle appears to have been driven in a city at speeds below 45 kilometers per hour. Traditional process control charts such as the X control chart fail due to severe autocorrelation in each sensor series. However, an ARIMA model found three abnormal points lying beyond the 2-sigma limits of the control chart. We also applied a copula-based Markov chain to perform statistical process control for correlated observations; the copula-based Markov model found anomalous behavior in similar places to the ARIMA model. In the unsupervised learning algorithm, large sensor values are subdivided into two, three, and four disjoint regions, so the extreme sensor values are the ones that need to be tracked for any sign of anomalous behavior. Conclusion: Exploratory data analysis is useful for finding patterns in the sensor data. Process control charts using ARIMA and Joe's copula-based Markov model also give warnings near similar places in the data. The unsupervised learning algorithm shows that the extreme sensor values are the ones that need to be tracked for any sign of anomalous behavior.
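
The residual-based control idea above can be sketched as follows: fit an ARIMA model to one autocorrelated sensor series and flag observations whose residuals fall outside the 2-sigma limits. The toy series, the (1, 0, 1) order, and the limits are assumptions for illustration, not the paper's exact settings.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Autocorrelated toy signal standing in for one sensor series sampled every 0.02 s.
rng = np.random.default_rng(1)
e = rng.normal(size=500)
y = np.empty(500)
y[0] = e[0]
for t in range(1, 500):
    y[t] = 0.8 * y[t - 1] + e[t]          # AR(1)-style dependence between readings

# Fit an ARIMA model (the (1, 0, 1) order is an assumption) and flag
# observations whose residuals lie beyond the 2-sigma limits.
fit = ARIMA(y, order=(1, 0, 1)).fit()
resid = fit.resid
limit = 2 * resid.std()
anomalies = np.where(np.abs(resid) > limit)[0]
print("points beyond the 2-sigma residual limits:", anomalies)
```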

Statistically Optimized Asynchronous Barrel Shifters for Variable Length Codecs (통계적으로 최적화된 비동기식 가변길이코덱용 배럴 쉬프트)

  • Peter A. Beerel;Kim, Kyeoun-Soo
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.28 no.11A
    • /
    • pp.891-901
    • /
    • 2003
  • This paper presents low-power asynchronous barrel shifters for variable length encoders and decoders useful in portable applications based on multimedia standards. Our approach is to create multi-level asynchronous barrel shifters optimized for the skewed shift-control statistics often found in these codecs. For common shifts, data passes through one level, whereas for rare shifts, data passes through multiple levels. We compare our optimized designs with straightforward asynchronous and synchronous designs. Both pre- and post-layout HSPICE simulation results indicate that, compared to their synchronous counterparts, our designs provide over a 40% savings in average energy consumption for a given average performance.
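
A behavioral sketch (software only, not the paper's asynchronous circuit) of the multi-level idea: a cheap first level handles the statistically common small shifts, and a second level switches only for the rarer large shifts. The 32-bit width and the 0-3 "common" range are assumptions for illustration.

```python
def two_level_shift(word: int, shift: int, width: int = 32) -> int:
    """Logical left shift split into a fine level (common case) and a coarse level (rare case)."""
    mask = (1 << width) - 1
    fine = shift & 0x3              # level 1: shifts of 0..3, assumed to be the common case
    coarse = shift & ~0x3           # level 2: the multiple-of-4 remainder, assumed rare
    out = (word << fine) & mask     # every word passes through the cheap first level
    if coarse:                      # the second level is engaged only for rare large shifts
        out = (out << coarse) & mask
    return out

print(hex(two_level_shift(0x0000ABCD, 3)), hex(two_level_shift(0x0000ABCD, 12)))
```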

Statistical Methods to Control Response Bias in Nursing Activity Surveys (간호활동시간 조사 시 응답편이 통제를 위한 통계적 접근 방안)

  • Lim, Ji-Young;Park, Chang-Gi
    • Journal of Korean Academy of Nursing
    • /
    • v.42 no.1
    • /
    • pp.48-55
    • /
    • 2012
  • Purpose: The aim of this study was to compare statistical methods to control response bias in nursing activity surveys. Methods: Data were collected at a medical unit of a general hospital. The number of nursing activities and the time consumed by each activity were measured using self-report questionnaires. Descriptive statistics were used to identify the general characteristics of the unit. Averaging, Z-standardization, gamma regression, the finite mixture model, and the stochastic frontier model were adopted to estimate true activity time while controlling for response bias. Results: The nursing activity time data were highly skewed and had non-normal distributions. Among the methods compared, only gamma regression and the stochastic frontier model controlled response bias effectively, and the estimated total nursing activity time did not exceed total work time. However, with gamma regression, the estimated total nursing activity time was too small to use in real clinical settings. Thus, the stochastic frontier model was the most appropriate method to control response bias when compared with the other methods. Conclusion: According to these results, we recommend the use of a stochastic frontier model to estimate true nursing activity time when using self-report surveys.
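
Of the models compared above, the gamma regression step is easy to sketch with standard tools. The sketch below fits a log-link gamma GLM to hypothetical right-skewed self-reported activity times; the covariate, sample size, and coefficients are invented, and the stochastic frontier model the authors ultimately recommend would require a dedicated frontier-analysis routine and is not shown.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical right-skewed self-reported activity times (minutes) with one
# covariate (years of experience); not the study's data.
rng = np.random.default_rng(2)
experience = rng.uniform(1, 20, size=100)
true_mean = np.exp(2.5 + 0.03 * experience)
time = rng.gamma(shape=2.0, scale=true_mean / 2.0)

# Log-link gamma GLM, one of the bias-control models compared in the study.
X = sm.add_constant(experience)
fit = sm.GLM(time, X, family=sm.families.Gamma(link=sm.families.links.Log())).fit()
print("coefficients:", fit.params)
print("estimated total activity time:", round(fit.fittedvalues.sum(), 1))
```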

A Study on a Measure for Non-Normal Process Capability (비정규 공정능력 측도에 관한 연구)

  • 김홍준;김진수;조남호
    • Proceedings of the Korean Reliability Society Conference
    • /
    • 2001.06a
    • /
    • pp.311-319
    • /
    • 2001
  • All indices now in use assume normally distributed data, and any use of these indices on non-normal data results in inaccurate capability measurements. Therefore, the index $C_{s}$ was proposed, which extends the most useful index to date, the Pearn-Kotz-Johnson $C_{pmk}$, by not only taking into account that the process mean may not lie midway between the specification limits and incorporating a penalty when the mean deviates from its target, but also incorporating a penalty for skewness. We therefore propose a new process capability index $C_{psk}$(WV) applying the weighted variance control charting method for non-normally distributed processes. The main idea of the weighted variance method (WVM) is to divide a skewed or asymmetric distribution at its mean into two normal distributions which have the same mean but different standard deviations. In this paper we use an example, a distribution generated from the Johnson family of distributions, to demonstrate how the weighted variance-based process capability indices perform in comparison with two other non-normal methods, namely the Clements and Wright methods. This example shows that the weighted variance-based indices are more consistent than the other two methods in terms of sensitivity to departure of the process mean/median from the target value for non-normal processes.
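
A minimal sketch of the weighted variance decomposition described above, assuming the common convention σ_U = σ√(2Pₓ) and σ_L = σ√(2(1 − Pₓ)) with Pₓ = P(X ≤ μ); it returns a plain capability value from the upper and lower spreads and omits the target-deviation and skewness penalties that the paper's $C_{psk}$(WV) incorporates. The data and specification limits are hypothetical.

```python
import numpy as np

def wv_capability(x, lsl, usl):
    """Weighted-variance (WV) capability sketch: split the skewed distribution at its
    mean into two halves with the same mean but different spreads."""
    mu, sigma = x.mean(), x.std(ddof=1)
    p_x = np.mean(x <= mu)                      # weight of the lower half
    sigma_u = sigma * np.sqrt(2 * p_x)          # upper-side spread
    sigma_l = sigma * np.sqrt(2 * (1 - p_x))    # lower-side spread
    cpu = (usl - mu) / (3 * sigma_u)
    cpl = (mu - lsl) / (3 * sigma_l)
    return min(cpu, cpl)

rng = np.random.default_rng(3)
x = rng.lognormal(mean=0.0, sigma=0.4, size=500)   # hypothetical right-skewed process
print("WV-based capability:", round(wv_capability(x, lsl=0.3, usl=3.0), 3))
```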

A Comparison of Ensemble Methods Combining Resampling Techniques for Class Imbalanced Data (데이터 전처리와 앙상블 기법을 통한 불균형 데이터의 분류모형 비교 연구)

  • Lee, Hee-Jae;Lee, Sungim
    • The Korean Journal of Applied Statistics
    • /
    • v.27 no.3
    • /
    • pp.357-371
    • /
    • 2014
  • There are many studies related to imbalanced data, in which the class distribution is highly skewed. To address the problem of imbalanced data, previous studies use resampling techniques that correct the skewness of the class distribution in each sampled subset by under-sampling, over-sampling, or hybrid sampling such as SMOTE. Ensemble methods have also been used to alleviate the problem of class-imbalanced data. In this paper, we compare around a dozen algorithms that combine ensemble methods and resampling techniques, based on simulated data sets generated by the Backbone model, which allows the imbalance rate to be controlled. Results on various real imbalanced data sets are also presented to compare the effectiveness of the algorithms. As a result, we highly recommend resampling techniques combined with ensemble methods for imbalanced data in which the proportion of the minority class is less than 10%. We also find that each ensemble method has a well-matched sampling technique: algorithms that combine bagging or random forest ensembles with random undersampling tend to perform well, whereas boosting ensembles appear to perform better with over-sampling. All ensemble methods combined with SMOTE perform well in most situations.
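
The two pairings recommended above can be sketched with scikit-learn and the imbalanced-learn resamplers (assuming imbalanced-learn is installed); the synthetic data set, 5% minority rate, and default hyperparameters are illustrative choices, not the paper's simulation design.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score
from imblearn.under_sampling import RandomUnderSampler   # requires imbalanced-learn
from imblearn.over_sampling import SMOTE

# Hypothetical imbalanced data with roughly a 5% minority class.
X, y = make_classification(n_samples=2000, weights=[0.95], flip_y=0.01, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Pairing 1: random forest with random undersampling.
X_ru, y_ru = RandomUnderSampler(random_state=0).fit_resample(X_tr, y_tr)
rf = RandomForestClassifier(random_state=0).fit(X_ru, y_ru)

# Pairing 2: boosting with over-sampling (SMOTE).
X_sm, y_sm = SMOTE(random_state=0).fit_resample(X_tr, y_tr)
boost = AdaBoostClassifier(random_state=0).fit(X_sm, y_sm)

for name, model in [("RF + undersampling", rf), ("AdaBoost + SMOTE", boost)]:
    print(name, balanced_accuracy_score(y_te, model.predict(X_te)))
```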

DGR-Tree : An Efficient Index Structure for POI Search in Ubiquitous Location Based Services (DGR-Tree : u-LBS에서 POI의 검색을 위한 효율적인 인덱스 구조)

  • Lee, Deuk-Woo;Kang, Hong-Koo;Lee, Ki-Young;Han, Ki-Joon
    • Journal of Korea Spatial Information System Society
    • /
    • v.11 no.3
    • /
    • pp.55-62
    • /
    • 2009
  • Location-based services in the ubiquitous computing environment, namely u-LBS, use very large and skewed sets of spatial objects that are closely related to locational information. It is especially essential to achieve fast search for POIs (Points of Interest) related to the user's location. This paper examines how to search large and skewed POI data efficiently in the u-LBS environment. We propose the Dynamic-level Grid based R-Tree (DGR-Tree), an index for point data that can reduce the cost of stationary POI search. DGR-Tree uses an R-Tree as the primary index and a Dynamic-level Grid as the secondary index. DGR-Tree is optimized for point data and solves the overlapping problem among leaf nodes. The Dynamic-level Grid of DGR-Tree is created dynamically according to the density of POIs. Each cell in the Dynamic-level Grid holds a leaf-node pointer for direct access to the corresponding leaf node of the primary index. Therefore, index access performance is improved greatly by accessing leaf nodes directly through the Dynamic-level Grid. We also propose a K-Nearest Neighbor (KNN) algorithm for DGR-Tree, which utilizes the Dynamic-level Grid for fast access to candidate cells. The KNN algorithm for DGR-Tree provides a mechanism that directly accesses the cell enclosing a given query point and its adjacent cells without tree traversal. The KNN algorithm minimizes the cost of sorting the candidate list by minimum distance and provides an NEB (Non-Extensible Boundary), so that the extension of candidate nodes need not be considered during KNN search.
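
A toy sketch of the grid-to-bucket idea (a fixed, single-level grid in pure Python, not the paper's dynamic-level grid coupled to R-Tree leaf pointers): each cell maps directly to its bucket of points, so a KNN query inspects only the cell enclosing the query point and its adjacent cells, with no tree traversal. A real implementation would expand the search ring until k neighbors are guaranteed; the cell size and points are arbitrary.

```python
import math
from collections import defaultdict

CELL = 10.0                      # assumed cell size
grid = defaultdict(list)         # cell key -> bucket of points (stands in for a leaf pointer)

def insert(pt):
    grid[(int(pt[0] // CELL), int(pt[1] // CELL))].append(pt)

def knn(q, k):
    cx, cy = int(q[0] // CELL), int(q[1] // CELL)
    # Direct access to the enclosing cell and its adjacent cells, no tree traversal.
    cand = [p for dx in (-1, 0, 1) for dy in (-1, 0, 1)
            for p in grid[(cx + dx, cy + dy)]]
    return sorted(cand, key=lambda p: math.dist(p, q))[:k]

for pt in [(3, 4), (12, 15), (14, 2), (55, 60), (9, 9)]:
    insert(pt)
print(knn((10, 10), k=3))
```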

Bivariate skewness, kurtosis and surface plot (이변량 왜도, 첨도 그리고 표면그림)

  • Hong, Chong Sun;Sung, Jae Hyun
    • Journal of the Korean Data and Information Science Society
    • /
    • v.28 no.5
    • /
    • pp.959-970
    • /
    • 2017
  • In this study, we propose bivariate skewness and kurtosis statistics and suggest a surface plot that can visually represent bivariate data, incorporating the correlation coefficient. The skewness statistic is expressed as a pair of real values because it represents the skewed direction and degree of the bivariate random sample. The kurtosis takes a positive value that indicates how heavy the tails of the data are compared to the bivariate normal distribution. Moreover, the surface plot represents bivariate data based on quantile vectors. Skewness and kurtosis are obtained and surface plots are explored for various types of bivariate data. With these results, it is found that the values of the skewness and kurtosis reflect the characteristics of the bivariate data shown by the surface plots. Therefore, the skewness, kurtosis, and surface plot proposed in this paper could be used as valuable descriptive statistical methods for analyzing bivariate distributions.
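
The paper defines its own paired skewness statistic and kurtosis; as a runnable stand-in, the sketch below computes Mardia's classical multivariate skewness and kurtosis, which play an analogous descriptive role but are not the statistics proposed in the paper. The simulated bivariate normal sample is illustrative.

```python
import numpy as np

def mardia(x):
    """Mardia's multivariate skewness and kurtosis (a standard counterpart,
    not the statistics proposed in the paper)."""
    n, p = x.shape
    z = x - x.mean(axis=0)
    s_inv = np.linalg.inv(np.cov(x, rowvar=False, bias=True))
    d = z @ s_inv @ z.T                      # pairwise Mahalanobis inner products
    skew = (d ** 3).sum() / n ** 2
    kurt = (np.diag(d) ** 2).mean()          # ~ p(p + 2) = 8 under bivariate normality
    return skew, kurt

rng = np.random.default_rng(4)
x = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], size=1000)
print(mardia(x))
```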

Ensemble Learning for Solving Data Imbalance in Bankruptcy Prediction (기업부실 예측 데이터의 불균형 문제 해결을 위한 앙상블 학습)

  • Kim, Myoung-Jong
    • Journal of Intelligence and Information Systems
    • /
    • v.15 no.3
    • /
    • pp.1-15
    • /
    • 2009
  • In a classification problem, data imbalance occurs when the number of instances in one class greatly outnumbers that in the other class. Such data sets often produce a classifier that defaults to the majority class because of the skewed boundary, and thus reduce classification accuracy. This paper proposes Geometric Mean-based Boosting (GM-Boost) to resolve the problem of data imbalance. Since GM-Boost introduces the notion of the geometric mean, it can perform the learning process considering both the majority and minority classes and reinforce learning on misclassified data. An empirical study of bankruptcy prediction for Korean companies shows that GM-Boost achieves higher classification accuracy than previous methods used for imbalanced data, including under-sampling, over-sampling, and AdaBoost, as well as robust learning performance regardless of the degree of data imbalance.
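
GM-Boost itself modifies the boosting weight updates, which the abstract does not spell out; as a minimal illustration of the geometric-mean notion it rests on, the sketch below computes the geometric mean of majority-class and minority-class accuracy and contrasts it with plain accuracy on a toy imbalanced prediction.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def g_mean(y_true, y_pred):
    """Geometric mean of minority-class and majority-class accuracy; the notion
    GM-Boost builds on (the boosting procedure itself is not shown)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sensitivity = tp / (tp + fn)   # accuracy on the minority (positive) class
    specificity = tn / (tn + fp)   # accuracy on the majority (negative) class
    return np.sqrt(sensitivity * specificity)

y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.array([0] * 95 + [1, 0, 0, 1, 1])   # classifier biased toward the majority
print("accuracy:", (y_true == y_pred).mean(), "g-mean:", round(g_mean(y_true, y_pred), 3))
```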

An Exploratory Analysis on the User Response Pattern and Quality Characteristics of Marketing Contents in the SNS of Regional Government (지역마케팅 콘텐츠의 사용자 반응패턴과 품질특성에 관한 탐색적 분석: 지방자치단체가 운영하는 SNS를 중심으로)

  • Jeong, Yeon-Su;Jeong, Dae-Yul
    • The Journal of Information Systems
    • /
    • v.26 no.4
    • /
    • pp.419-442
    • /
    • 2017
  • Purpose: The purpose of this study is to explore the pattern of user responses and their duration through social media content response analysis. We also analyze the characteristics of the content quality factors that are associated with the user response patterns. The analysis results provide implications for developing strategies and schematic plans for operators of regional marketing on SNS. Design/methodology/approach: This study used mixed methods to verify the effects of social media contents and the responses of users who are interested in regional events such as local festivals, cultural events, and city tours. Big data analysis was conducted with quantitative data from regional government SNSs; the data were collected through web crawling in order to analyze the social media contents, and we especially analyzed the content duration time and the peak-level time. This study also analyzed the characteristics of content quality factors using expert evaluation data on the social media contents. Finally, we verify the relationship between the content quality factors and the user response types by cross-correlation analysis. Findings: According to the big data analysis, we found a content life cycle that can be described by an empirical distribution with a peak-time pattern and a left-skewed long tail. The user response patterns depend on time and content quality. In addition, this study confirms that the quality level of social media content is closely related to user interaction and response patterns. Based on the content response pattern analysis, it is necessary to develop a high-quality content design strategy together with content posting and propagation tactics. SNS operators need to develop high-quality contents using rich-media technology and active-response contents that induce opinion leaders on the SNS.
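
As a small, hypothetical sketch of the duration and peak-time measures discussed above (the decay curve, 10%-of-peak cutoff, and hourly resolution are assumptions, not the paper's definitions), the snippet below derives a peak hour and a response lifetime from one post's response-count series.

```python
import numpy as np

# Hypothetical hourly response counts for one SNS post.
rng = np.random.default_rng(5)
hours = np.arange(72)
responses = rng.poisson(lam=40 * np.exp(-hours / 10.0) + 1)

peak_hour = int(hours[responses.argmax()])
threshold = 0.1 * responses.max()                 # "alive" while above 10% of the peak
duration = int((responses >= threshold).sum())    # hours with meaningful response

print(f"peak at hour {peak_hour}, response lifetime ~{duration} hours")
```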