Abstract
Protection against disclosure of survey respondents' identifiable and/or sensitive information is a prerequisite for statistical agencies that release microdata files from their sample surveys. Coarsening is one of popular methods for protecting the confidentiality of the data. Grouped data can be released in the form of microdata or tabular data. Instead of releasing the data in a tabular form only, having microdata available to the public with interval codes with their representative values greatly enhances the utility of the data. It allows the researchers to compute covariance between the variables and build statistical models or to run a variety of statistical tests on the data. It may be conjectured that the variance of the interval data is lower that of the ungrouped data in the sense that the coarsened data do not have the within interval variance. This conjecture will be investigated using the uniform and triangular distributions. Traditionally, midpoint is used to represent all the values in an interval. This approach implicitly assumes that the data is uniformly distributed within each interval. However, this assumption may not hold, especially in the last interval of the economic data. In this paper, we will use three distributional assumptions - uniform, Pareto and lognormal distribution - in the last interval and use either midpoint or median for other intervals for wage and food costs of the Statistics Korea's 2006 Household Income and Expenditure Survey(HIES) data and compare these approaches in terms of the first two moments.