• Title/Summary/Keyword: data skew

A Pipelined Hash Join Method for Load Balancing (부하 균형 유지를 고려한 파이프라인 해시 조인 방법)

  • Moon, Jin-Gue;Park, No-Sang;Kim, Pyeong-Jung;Jin, Seong-Il
    • The KIPS Transactions:PartD / v.9D no.5 / pp.755-768 / 2002
  • We investigate how data skew in join attributes affects the performance of a pipelined multi-way hash join, and propose two new hash join methods with load-balancing capabilities. The first method allocates buckets statically in round-robin fashion; the second allocates buckets adaptively according to a frequency distribution. With hash-based joins, multiple joins can be pipelined so that early results from one join are passed to the next join before the first has completed, without being written to disk. Unless the pipelined execution of multiple hash joins includes a load-balancing mechanism, skew can severely degrade system performance. In this paper, we derive an execution model of the pipeline segment and a cost model, and develop a simulator for the study. Our simulations over a wide range of parameters show that join selectivities and relation sizes degrade system performance more strongly as the degree of data skew increases. However, the proposed method, using a large number of buckets and a tuning technique, offers substantial robustness against a wide range of skew conditions.
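
The two bucket-allocation policies named in the abstract above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names, the toy key set, and the greedy largest-first rule used for the frequency-based policy are all assumptions made for illustration.

```python
from collections import Counter
import heapq


def round_robin_allocation(num_buckets, num_nodes):
    """Static policy: bucket b always goes to node b mod num_nodes."""
    return {b: b % num_nodes for b in range(num_buckets)}


def frequency_allocation(bucket_sizes, num_nodes):
    """Adaptive policy (assumed greedy rule): place the largest buckets
    first, each on the node that currently carries the least load."""
    heap = [(0, node) for node in range(num_nodes)]  # (load, node id)
    heapq.heapify(heap)
    assignment = {}
    for bucket, size in sorted(bucket_sizes.items(), key=lambda kv: -kv[1]):
        load, node = heapq.heappop(heap)
        assignment[bucket] = node
        heapq.heappush(heap, (load + size, node))
    return assignment


def node_loads(assignment, bucket_sizes, num_nodes):
    """Total number of tuples each node must process."""
    loads = [0] * num_nodes
    for bucket, node in assignment.items():
        loads[node] += bucket_sizes.get(bucket, 0)
    return loads


# Hypothetical skewed join keys: two values dominate the relation.
keys = [0] * 60 + [2] * 30 + list(range(1, 11))
num_buckets, num_nodes = 8, 2
bucket_sizes = Counter(k % num_buckets for k in keys)

rr_loads = node_loads(round_robin_allocation(num_buckets, num_nodes),
                      bucket_sizes, num_nodes)
freq_loads = node_loads(frequency_allocation(bucket_sizes, num_nodes),
                        bucket_sizes, num_nodes)
print("round-robin loads:", rr_loads)
print("frequency-based loads:", freq_loads)
```

On this toy input the static policy happens to place both heavy buckets on the same node, while the frequency-based policy separates them. Using more buckets than nodes gives the adaptive policy finer-grained units to move, which is consistent with the abstract's remark about a large number of buckets.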

New composite distributions for insurance claim sizes (보험 청구액에 대한 새로운 복합분포)

  • Jung, Daehyeon;Lee, Jiyeon
    • The Korean Journal of Applied Statistics / v.30 no.3 / pp.363-376 / 2017
  • The insurance market is saturated and its growth engine is exhausted; consequently, the insurance industry is in a low-growth period and insurance companies face fierce competition. In this situation, it is important to find probability distributions that can explain the behavior of insurance claims, which are the basis of the actuarial calculations for insurance products. Insurance claims are generally known to be well fitted by lognormal or Pareto distributions, which concentrate their mass on the left and have a thick right tail. In recent years, skew normal and skew t distributions have also been considered reasonable for describing insurance claims. Cooray and Ananda (2005) proposed a composite lognormal-Pareto distribution that combines the advantages of the lognormal and Pareto distributions, and showed that the composite distribution fits better than either single distribution. In this paper, we introduce new composite distributions based on skew normal or skew t distributions and apply them to Danish fire insurance claim data and US indemnity loss data to compare their performance with other composite distributions and with single distributions.

An Advanced Parallel Join Algorithm for Managing Data Skew on Hypercube Systems (하이퍼큐브 시스템에서 데이타 비대칭성을 고려한 향상된 병렬 결합 알고리즘)

  • 원영선;홍만표
    • Journal of KIISE:Computer Systems and Theory / v.30 no.3_4 / pp.117-129 / 2003
  • In this paper, we propose an advanced parallel join algorithm that efficiently processes join operations on hypercube systems. The algorithm processes relation R with a broadcasting method that is well suited to the hypercube structure, which lets us present a parallel join algorithm optimized for that structure. The proposed algorithm fully resolves two essential problems in parallelizing the join operation: load balancing and data skew. To solve these problems, the algorithm exploits the characteristics of the clustering effect. As a result, overall system performance improves over existing algorithms. Moreover, the new algorithm can easily implement non-equijoin operations, which are difficult to implement in hash-based algorithms. Finally, according to the cost model analysis, this algorithm performs better than existing parallel join algorithms.

Saddlepoint approximations for the risk measures of portfolios based on skew-normal risk factors (왜정규 위험요인 기반 포트폴리오 위험측도에 대한 안장점근사)

  • Yu, Hye-Kyung;Na, Jong-Hwa
    • Journal of the Korean Data and Information Science Society / v.25 no.6 / pp.1171-1180 / 2014
  • We consider saddlepoint approximations to VaR (value at risk) and ES (expected shortfall), which are frequently encountered in finance and insurance as measures for risk management. In this paper, we assume univariate and multivariate skew-normal distributions, instead of the traditional normal class of distributions, as the underlying distribution of linear portfolios. Simulation results show that the suggested saddlepoint approximations are much more accurate than normal approximations.
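
For readers unfamiliar with the two risk measures named in the abstract above, the following minimal Python sketch computes plain empirical estimates of VaR and ES from a sample of losses. It is not the saddlepoint approximation the paper studies; the function name, the order-statistic convention for VaR, and the sample data are assumptions chosen for illustration.

```python
import math


def empirical_var_es(losses, alpha=0.95):
    """Empirical VaR and ES at confidence level alpha.

    Assumed conventions: VaR is the ceil(alpha * n)-th smallest loss,
    and ES is the mean of all losses at or beyond that order statistic.
    """
    if not losses:
        raise ValueError("losses must be non-empty")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    ordered = sorted(losses)
    k = math.ceil(alpha * len(ordered))  # 1-based rank of the VaR loss
    var = ordered[k - 1]
    tail = ordered[k - 1:]               # losses at or beyond VaR
    es = sum(tail) / len(tail)
    return var, es


# Hypothetical loss sample: the integers 1..100.
var_95, es_95 = empirical_var_es(list(range(1, 101)), alpha=0.95)
print(var_95, es_95)
```

ES is never smaller than VaR at the same level, because it averages only the losses in the tail that VaR marks off.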

Skew Normal Boxplot and Outliers

  • Huh, Myung-Hoe;Lee, Yong-Goo
    • Communications for Statistical Applications and Methods / v.19 no.4 / pp.591-595 / 2012
  • We frequently use Tukey's boxplot to identify outliers in a batch of observations of a continuous variable. In doing so, we implicitly assume that the underlying distribution belongs to the family of normal distributions. This practice is often superficial and improper, since in reality many variables exhibit skewness. In this short paper, we build a modified boxplot and an outlier identification procedure by assuming that the observations are generated from the skew normal distribution (Azzalini, 1985), which is an extension of the normal distribution. The statistical performance of the proposed procedure is examined with simulated datasets.
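
As a baseline for the abstract above, here is a minimal Python sketch of the standard Tukey boxplot rule, which flags points lying more than 1.5 interquartile ranges outside the quartiles. It does not implement the paper's skew-normal modification, whose details are not given here; the function names and the sample data are assumptions for illustration.

```python
import statistics


def tukey_fences(values, k=1.5):
    """Lower and upper fences of the standard Tukey boxplot rule."""
    # Quartiles by linear interpolation over the sample (inclusive method).
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def tukey_outliers(values, k=1.5):
    """Observations falling strictly outside the fences."""
    lower, upper = tukey_fences(values, k)
    return [v for v in values if v < lower or v > upper]


# Hypothetical batch with one extreme value on the right.
batch = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]
print(tukey_fences(batch))
print(tukey_outliers(batch))
```

Because the fences are symmetric around the quartiles, a right-skewed sample tends to have many legitimate observations flagged on its long side, which is the problem the abstract raises.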

Estimation of Regional Skew Coefficient with Weighted Least Squares Regression (가중회귀분석에 의한 지역화왜곡계수의 추정)

  • 조국광;권순국
    • Magazine of the Korean Society of Agricultural Engineers / v.32 no.1 / pp.103-109 / 1990
  • Applying the Log-Pearson Type III distribution, recommended by the Water Resources Council, U.S.A., for flood frequency analysis, requires an estimate of the regionalized skew coefficient. In this study, regionalized skew coefficients are estimated with a weighted regression model that relates at-site skews, based on the logarithms of observed annual flood peak series, to both basin characteristics and precipitation data in the Han river and Nakdong river basins. The model is developed with the weighted least squares method, in which the weights are determined by separating the residual variance into a part due to model error and a part due to sampling error. As a result of the analysis, the regionalized skews are estimated as -0.732 and -0.575 in the Han river and Nakdong river basins, respectively.
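
The weighting idea in the abstract above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the study's model: it uses a single predictor instead of several basin and precipitation variables, it assumes each weight is the reciprocal of the model-error variance plus that site's sampling-error variance, and all names and numbers are hypothetical.

```python
def skew_weights(model_error_var, sampling_vars):
    """Assumed weighting: w_i = 1 / (model error variance + sampling variance_i)."""
    return [1.0 / (model_error_var + s) for s in sampling_vars]


def weighted_least_squares(x, y, w):
    """Fit y = intercept + slope * x by minimizing sum of w_i * residual_i^2."""
    total_w = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / total_w
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / total_w
    s_xx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    s_xy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical data: one basin characteristic per site, at-site skews,
# and a sampling variance that is larger for sites with shorter records.
basin_feature = [1.0, 2.0, 3.0, 4.0, 5.0]
at_site_skew = [-0.9, -0.7, -0.8, -0.5, -0.6]
sampling_var = [0.30, 0.10, 0.25, 0.05, 0.15]

weights = skew_weights(model_error_var=0.05, sampling_vars=sampling_var)
intercept, slope = weighted_least_squares(basin_feature, at_site_skew, weights)
print(intercept, slope)
```

Sites with long records have small sampling variance and therefore pull the fitted line more strongly, while the model-error term keeps any single site from dominating.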

ECM Algorithm for Fitting of Mixtures of Multivariate Skew t-Distribution

  • Kim, Seung-Gu
    • Communications for Statistical Applications and Methods / v.19 no.5 / pp.673-683 / 2012
  • Cabral et al. (2012) defined a mixture model of multivariate skew t-distributions (STMM) and proposed using an ECME algorithm (a variation of the standard EM algorithm) to fit the model. Their estimation by the ECME algorithm is closely tied to the estimation of the degrees of freedom in the STMM. Their purpose in using the ECME is to avoid calculating a conditional expectation that has no closed form; however, their estimates are quite unstable over the course of the ECME algorithm. In this paper, we provide the conditional expectation in closed form so that it can be easily calculated; in addition, we propose using the ECM algorithm to fit the STMM stably.

ON BAYESIAN ESTIMATION AND PROPERTIES OF THE MARGINAL DISTRIBUTION OF A TRUNCATED BIVARIATE t-DISTRIBUTION

  • KIM HEA-JUNG;KIM Ju SUNG
    • Journal of the Korean Statistical Society / v.34 no.3 / pp.245-261 / 2005
  • The marginal distribution of X is considered when (X, Y) has a truncated bivariate t-distribution. This paper focuses mainly on the marginal nontruncated distribution of X where Y is truncated below at its mean and its observations are not available. Several properties and applications of this distribution, including its relationship with Azzalini's skew-normal distribution, are obtained. To circumvent the inferential problems that arise from adopting the frequentist approach, a Bayesian method utilizing data augmentation is suggested. Illustrative examples demonstrate the performance of the method.

Modeling Circular Data with Uniformly Dispersed Noise

  • Yu, Hye-Kyung;Jun, Kyoung-Ho;Na, Jong-Hwa
    • The Korean Journal of Applied Statistics / v.25 no.4 / pp.651-659 / 2012
  • In this paper, we develop a statistical model for circular data with noise. In this case, fitting a single circular model suffers from a lack-of-fit problem. To overcome this problem, we consider mixture models that include a circular uniform distribution and apply an EM algorithm to estimate the parameters. Both von Mises and wrapped skew normal distributions are considered. Simulation studies are carried out to assess the suggested EM algorithms. Finally, we apply the suggested method to fit the 2008 EHFRS (Epidemic Hemorrhagic Fever with Renal Syndrome) data provided by the KCDC (Korea Centers for Disease Control and Prevention).

An approximate fitting for mixture of multivariate skew normal distribution via EM algorithm (EM 알고리즘에 의한 다변량 치우친 정규분포 혼합모형의 근사적 적합)

  • Kim, Seung-Gu
    • The Korean Journal of Applied Statistics / v.29 no.3 / pp.513-523 / 2016
  • Fitting a mixture of multivariate skew normal distributions (MSNMix) with multiple skewness parameter vectors via the EM algorithm often incurs a very high computational cost, because the E-step must calculate the moments and probabilities of a multivariate truncated normal distribution. Consequently, it is common to fit an asymmetric data set using an MSNMix with a simple skewness parameter vector, since the E-step quantities can then be computed in a univariate manner at low computational cost. However, adopting a simple skewness parameter is unrealistic in many situations. This paper proposes an approximate estimation method for the MSNMix with multiple skewness parameter vectors that also allows these quantities to be treated in a univariate manner. We additionally provide some experiments to show its effectiveness.