• Title/Summary/Keyword: Data partitioning

ROBUST REGRESSION ESTIMATION BASED ON DATA PARTITIONING

  • Lee, Dong-Hee;Park, You-Sung
    • Journal of the Korean Statistical Society / v.36 no.2 / pp.299-320 / 2007
  • We introduce a high-breakdown-point estimator referred to as the data partitioning robust regression estimator (DPR). Because the DPR is obtained by partitioning the observations into a finite number of subsets, it avoids the computational burden of previous robust regression estimators. Empirical results and extensive simulation studies show that the DPR outperforms previous robust estimators, particularly in large samples.
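
The abstract describes the DPR only at a high level, so the following is a minimal illustrative sketch of partition-based robust regression under assumptions of my own, not the authors' exact estimator: observations are split into random subsets, ordinary least squares is fit on each, and the coordinate-wise median of the per-subset coefficients is returned so that subsets contaminated by outliers cannot dominate. Function and parameter names are illustrative.

```python
import numpy as np

def partition_robust_regression(X, y, n_subsets=10, seed=0):
    """Illustrative partition-based robust regression (not the exact DPR).

    Observations are randomly split into subsets, OLS is fit on each
    subset, and the coordinate-wise median of the fitted coefficients
    is returned, so a few outlier-laden subsets cannot dominate.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    coefs = []
    for part in np.array_split(idx, n_subsets):
        Xp = np.column_stack([np.ones(len(part)), X[part]])  # add intercept
        beta, *_ = np.linalg.lstsq(Xp, y[part], rcond=None)  # OLS on the subset
        coefs.append(beta)
    return np.median(np.vstack(coefs), axis=0)               # robust combination
```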

A transaction-based vertical partitioning algorithm (트랜잭션 중심의 발견적 파일 수직 분할 방법)

  • 박기택;김재련
    • Journal of the Military Operations Research Society of Korea / v.22 no.1 / pp.81-96 / 1996
  • In a relational database environment, data partitioning directly determines how much data must be accessed by a query or transaction. In this paper, we consider non-overlapping vertical partitioning. The vertical partitioning algorithm presented here consists of two phases. In phase 1, we cluster the attributes with a zero-one integer program that maximizes the affinity among attributes; the result is called the 'Initial Fragments'. In phase 2, we modify the Initial Fragments, which do not directly account for cost factors, using a transaction-based partitioning method, that is, a method that partitions attributes according to a set of transactions. In this phase, the logical accesses required by each transaction are used as the comparison criterion. Because phase 2 considers only a small number of modifications to the Initial Fragments, the algorithm is insensitive to the number of transactions and attributes and can easily be applied to relatively large problems.
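
The attribute affinity used in phase 1 can be computed directly from a transaction-attribute usage matrix and transaction frequencies. The sketch below does that, then substitutes a simple greedy threshold grouping for the paper's zero-one integer program; the data values, the threshold, and the grouping rule are illustrative assumptions.

```python
import numpy as np

# Transaction-attribute usage: use[t][a] = 1 if transaction t accesses attribute a.
# freq[t] = access frequency of transaction t. All values are illustrative.
use = np.array([[1, 1, 0, 0],
                [0, 0, 1, 1],
                [1, 0, 1, 0]])
freq = np.array([25, 50, 10])

# Attribute affinity: how often two attributes are accessed together.
affinity = (use * freq[:, None]).T @ use

# Greedy grouping stand-in for the paper's zero-one integer program:
# attributes whose pairwise affinity exceeds a threshold share a fragment.
def greedy_fragments(aff, threshold):
    n = aff.shape[0]
    frag_of = list(range(n))                      # start with singleton fragments
    for i in range(n):
        for j in range(i + 1, n):
            if aff[i, j] >= threshold:
                old, new = frag_of[j], frag_of[i]
                frag_of = [new if f == old else f for f in frag_of]
    groups = {}
    for a, f in enumerate(frag_of):
        groups.setdefault(f, []).append(a)
    return list(groups.values())

print(greedy_fragments(affinity, threshold=20))   # -> [[0, 1], [2, 3]]
```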

Adaptive Partitioning for Efficient Query Support

  • Yun, Hong-Won
    • Journal of Information and Communication Convergence Engineering / v.5 no.4 / pp.369-373 / 2007
  • RFID systems generate large volumes of data, which can lead to slower queries. To achieve better query performance, the data can be partitioned into active and non-active sets. In this paper, we propose two partitioning approaches for efficient query support: average period plus delta partitioning and adaptive average period partitioning. We also present a system architecture for managing active and non-active data, together with a logical database schema. The data manager checks the active partition and moves objects from the active store to an archive store according to the average period plus delta and the adaptive average period. Our experiments demonstrate the performance of the proposed partitioning methods.
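
A minimal sketch of the active/non-active split described above, assuming the archiving rule is "last read longer ago than the tag's average inter-read period plus a delta"; the record fields, the exact form of the rule, and the function names are assumptions rather than the paper's schema.

```python
from dataclasses import dataclass, field
from statistics import mean
import time

@dataclass
class TagRecord:
    tag_id: str
    access_times: list = field(default_factory=list)  # read timestamps (seconds)

def is_active(rec, delta, now=None):
    """Keep a tag in the active store if it was read within its average
    inter-read period plus delta; otherwise it is a candidate for archiving."""
    now = time.time() if now is None else now
    if len(rec.access_times) < 2:
        return True                                   # too little history to archive
    gaps = [b - a for a, b in zip(rec.access_times, rec.access_times[1:])]
    return (now - rec.access_times[-1]) <= mean(gaps) + delta

def repartition(records, delta):
    """Split records into the active partition and the archive partition."""
    active, archive = [], []
    for rec in records:
        (active if is_active(rec, delta) else archive).append(rec)
    return active, archive
```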

A Compact Divide-and-conquer Algorithm for Delaunay Triangulation with an Array-based Data Structure (배열기반 데이터 구조를 이용한 간략한 divide-and-conquer 삼각화 알고리즘)

  • Yang, Sang-Wook;Choi, Young
    • Korean Journal of Computational Design and Engineering / v.14 no.4 / pp.217-224 / 2009
  • Most divide-and-conquer implementations of Delaunay triangulation use a quad-edge or winged-edge data structure because triangles are frequently deleted and created during the merge process. However, the proposed divide-and-conquer algorithm uses an array-based data structure that is much simpler than the quad-edge structure and requires less memory allocation. The algorithm has two important features. First, the space-partitioning information is represented as a permutation vector sequence in the vertex array, so no additional data is required for the space partitioning; the permutation vector represents adaptively divided regions in two dimensions, and two-dimensional partitioning of the space is more efficient than one-dimensional partitioning during the merge process. Second, no edges are deleted during the merge, so no bookkeeping of complex intermediate topology changes is necessary. The algorithm is described compactly with the proposed data structures and operators so that it can be implemented easily and efficiently.
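
The permutation-vector idea is only summarized in the abstract, so this is one plausible reading, stated as an assumption: the vertex array is reordered in place by alternating x/y median splits, so every contiguous slice of the array corresponds to one adaptively divided 2D region and no extra partition structure is stored. The merge step and the authors' operators are not reproduced here.

```python
import numpy as np

def partition_vertices(pts, lo=0, hi=None, axis=0, leaf=3):
    """Reorder the vertex array in place so that each contiguous slice
    corresponds to one adaptively split 2D region (alternating x/y
    median splits). The reordering itself encodes the space partition."""
    hi = len(pts) if hi is None else hi
    if hi - lo <= leaf:
        return
    mid = (lo + hi) // 2
    order = np.argsort(pts[lo:hi, axis], kind="stable")
    pts[lo:hi] = pts[lo:hi][order]               # median split along the current axis
    partition_vertices(pts, lo, mid, 1 - axis, leaf)
    partition_vertices(pts, mid, hi, 1 - axis, leaf)

pts = np.random.default_rng(1).random((16, 2))
partition_vertices(pts)                           # pts is now ordered region by region
```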

A Cyclic Sliced Partitioning Method for Packing High-dimensional Data (고차원 데이타 패킹을 위한 주기적 편중 분할 방법)

  • 김태완;이기준
    • Journal of KIISE: Databases / v.31 no.2 / pp.122-131 / 2004
  • Traditional indexing work has targeted low-dimensional data in dynamic environments, but recent database applications require efficient processing of huge amounts of high-dimensional data in static environments, and many existing indexing strategies, especially partitioning-based ones, do not adapt to these new environments. In this study, we point out these facts and propose a new partitioning strategy that meets the requirements of such applications and is derived from analysis. As a preliminary step, we apply a packing technique and exploit observations on the Minkowski-sum cost model under a uniform data distribution. These observations predict that an unbalanced partitioning strategy may be more query-efficient than a balanced one for high-dimensional data. We therefore propose CSP (the Cyclic Sliced Partitioning method). Analysis of the method explicitly suggests metrics for how to partition high-dimensional data. Using the cost model, simulations, and experiments, we show that our method clearly outperforms the balanced strategy, and experimental comparisons with other indices and packing methods also show its superiority.
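
Only the high-level idea of CSP is given in the abstract, so the following is a rough sketch under my own assumptions: points are packed into fixed-size pages by repeatedly slicing off a deliberately unbalanced fraction of the remaining data along the current dimension, cycling through the dimensions. The slice fraction here is arbitrary, whereas the paper derives how to partition from its Minkowski-sum cost model.

```python
import numpy as np

def cyclic_sliced_pack(points, page_size, slice_fraction=0.25, axis=0):
    """Rough sketch of cyclic sliced packing: slice off an (unbalanced)
    fraction of the remaining points along the current dimension, cycle
    through dimensions, and fill roughly page-sized pages from each slab."""
    pages, rest = [], points
    dims = points.shape[1]
    while len(rest) > page_size:
        k = max(page_size, int(len(rest) * slice_fraction))
        order = np.argsort(rest[:, axis])
        slab, rest = rest[order[:k]], rest[order[k:]]
        # split the slab into roughly page-sized chunks
        pages.extend(np.array_split(slab, max(1, len(slab) // page_size)))
        axis = (axis + 1) % dims
    if len(rest):
        pages.append(rest)
    return pages
```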

A Vertical Partitioning Algorithm based on Fuzzy Graph (퍼지 그래프 기반의 수직 분할 알고리즘)

  • Son, Jin-Hyun;Choi, Kyung-Hoon;Kim, Myoung-Ho
    • Journal of KIISE: Databases / v.28 no.3 / pp.315-323 / 2001
  • The concept of vertical partitioning has been studied with the objective of improving query execution performance and system throughput. It can be applied wherever the match between data and queries affects performance, including the partitioning of individual files in centralized environments, data distribution in distributed databases, and the division of data among different levels of the memory hierarchy. In general, a vertical partitioning algorithm should support n-ary partitioning as well as a globally optimal solution for generating all meaningful fragments, but most previous methods have limitations in supporting both efficiently. Because the vertical partitioning problem is inherently fuzzy, this fuzziness must be handled properly. In this paper we propose an efficient vertical α-partitioning algorithm based on fuzzy theory. The method not only generates all meaningful fragments but also supports n-ary partitioning without any complex mathematical computation.
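
One plausible reading of α-partitioning on a fuzzy attribute graph, offered as an illustration rather than the paper's algorithm: edges carry membership degrees (for instance, normalized co-access affinities), an α-cut keeps only edges with degree at least α, and the connected components of the cut graph become the fragments. Raising α then yields finer n-ary partitions. All names and values below are assumptions.

```python
def alpha_partition(membership, alpha):
    """Fragment attributes by taking the connected components of the
    alpha-cut of a fuzzy attribute graph.

    membership: dict mapping attribute pairs (a, b) to a degree in [0, 1].
    alpha:      cut level; only edges with degree >= alpha are kept.
    """
    nodes = {n for edge in membership for n in edge}
    adj = {n: set() for n in nodes}
    for (a, b), mu in membership.items():
        if mu >= alpha:                      # alpha-cut of the fuzzy graph
            adj[a].add(b)
            adj[b].add(a)
    fragments, seen = [], set()
    for start in nodes:                      # connected components = fragments
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(adj[n] - comp)
        seen |= comp
        fragments.append(sorted(comp))
    return sorted(fragments)

mu = {("name", "addr"): 0.9, ("addr", "zip"): 0.8, ("salary", "dept"): 0.7,
      ("zip", "salary"): 0.2}
print(alpha_partition(mu, alpha=0.5))   # [['addr', 'name', 'zip'], ['dept', 'salary']]
```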

Spatial Statistic Data Release Based on Differential Privacy

  • Cai, Sujin;Lyu, Xin;Ban, Duohan
    • KSII Transactions on Internet and Information Systems (TIIS) / v.13 no.10 / pp.5244-5259 / 2019
  • With the continuous development of LBS (Location Based Service) applications, privacy protection has become an urgent problem. Differential privacy rests on rigorous mathematical theory and provides strong guarantees even when the attacker is assumed to hold worst-case background knowledge, and it has been applied to research directions such as data querying, release, and mining. The central difficulty is how to preserve data utility while protecting privacy. Spatial multidimensional data are usually released by partitioning the domain into disjoint subsets and then generating a hierarchical index. Traditional data-dependent partition methods must allocate part of the privacy budget to the partitioning process and split the budget among all the steps, which is inefficient. To address this, a novel two-step partition algorithm is proposed. First, we partition the original dataset into fixed grids, inject noise, and synthesize a dataset according to the noisy counts. Second, we perform an IH-Tree (Improved H-Tree) partition on the synthetic dataset and use the resulting partition keys to split the original dataset. The algorithm saves the privacy budget that would otherwise be spent on partitioning and yields a more accurate release. It has been tested on three real-world datasets and its accuracy compared with state-of-the-art algorithms; the experimental results show that the relative errors of range queries are considerably reduced, especially on the large-scale dataset.
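
The IH-Tree second step is specific to the paper and is not reproduced here. The sketch below shows only the generic first step described in the abstract: partition the domain into a fixed uniform grid and perturb each cell count with Laplace noise scaled to the sensitivity of a count query. Function names and example values are assumptions.

```python
import numpy as np

def noisy_grid_counts(points, bounds, cells_per_dim, epsilon, seed=0):
    """Step-one sketch: bucket 2D points into a fixed uniform grid and add
    Laplace(1/epsilon) noise to each cell count (a count query has
    sensitivity 1 when each individual contributes one point)."""
    rng = np.random.default_rng(seed)
    (xmin, xmax), (ymin, ymax) = bounds
    counts, _, _ = np.histogram2d(
        points[:, 0], points[:, 1],
        bins=cells_per_dim,
        range=[[xmin, xmax], [ymin, ymax]],
    )
    noisy = counts + rng.laplace(scale=1.0 / epsilon, size=counts.shape)
    return np.clip(np.round(noisy), 0, None)      # clamp to non-negative, rounded counts

# Example usage with synthetic locations.
pts = np.random.default_rng(42).uniform(0, 100, size=(1000, 2))
grid = noisy_grid_counts(pts, bounds=((0, 100), (0, 100)), cells_per_dim=8, epsilon=0.5)
```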

The Evaluation of Petroleum Contamination in Heterogeneous Media Using Partitioning Tracer Method (분배성 추적자 시험법을 이용한 불균질 지반의 유류 오염도 평가)

  • Kim, Eun-Hyup;Rhee, Sung-Su;Park, Jun-Boum
    • Proceedings of the Korean Geotechnical Society Conference / 2009.09a / pp.1372-1377 / 2009
  • For the remediation of subsurface contaminated by nonaqueous phase liquids (NAPLs), it is important to characterize the NAPL zone properly. Conventional characterization methods provide data only at discrete points. To overcome this weakness, the partitioning tracer method has been developed and studied. The average NAPL saturation (S_n), a representative and continuous saturation value for the contaminated site, can be calculated by comparing the transport of the partitioning tracers with that of the conservative tracer. In this study, the application of the partitioning tracer method in heterogeneous media was investigated. To represent heterogeneous subsurface conditions, a two-dimensional soil box was divided into four layers, each containing soil of a different grain size. The soils in the box were contaminated with a mixture of kerosene and diesel, and partitioning tracer tests were conducted before and after contamination using methanol as the conservative tracer and 4-methyl-2-pentanol, 2-ethyl-1-butanol, and hexanol as the partitioning tracers. The response curves of the partitioning tracers from contaminated soils were separated and retarded compared with those from non-contaminated soils; the contamination of soils by NAPLs can therefore be detected from these retardations. Under our experimental conditions, the average NAPL saturation calculated using methanol as the conservative tracer and hexanol as the partitioning tracer showed the highest accuracy, although all results were underestimated. Further studies are therefore needed to improve the accuracy of the partitioning tracer test in heterogeneous media.
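
The quantitative step behind the abstract is the relation commonly used in partitioning tracer tests: the retardation factor R is the ratio of the mean arrival times (first temporal moments) of the partitioning and conservative tracer breakthrough curves, and the average NAPL saturation follows as S_n = (R - 1) / (R - 1 + K), where K is the tracer's NAPL-water partition coefficient. The sketch below assumes uniformly sampled breakthrough curves; the function and variable names are my own.

```python
import numpy as np

def mean_arrival_time(t, c):
    """First temporal moment of a breakthrough curve c sampled at times t
    (uniform sampling assumed, so simple sums suffice)."""
    t, c = np.asarray(t, float), np.asarray(c, float)
    return np.sum(t * c) / np.sum(c)

def napl_saturation(t, c_partitioning, c_conservative, K):
    """Average NAPL saturation from a partitioning tracer test:
        R   = <t_partitioning> / <t_conservative>   (retardation factor)
        S_n = (R - 1) / (R - 1 + K)
    with K the tracer's NAPL-water partition coefficient."""
    R = mean_arrival_time(t, c_partitioning) / mean_arrival_time(t, c_conservative)
    return (R - 1.0) / (R - 1.0 + K)
```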

A Dynamic Partitioning Scheme for Distributed Storage of Large-Scale RDF Data (대규모 RDF 데이터의 분산 저장을 위한 동적 분할 기법)

  • Kim, Cheon Jung;Kim, Ki Yeon;Yoo, Jong Hyeon;Lim, Jong Tae;Bok, Kyoung Soo;Yoo, Jae Soo
    • Journal of KIISE / v.41 no.12 / pp.1126-1135 / 2014
  • In recent years, RDF partitioning schemes have been studied for the effective distributed storage and management of large-scale RDF data. In this paper, we propose an RDF dynamic partitioning scheme to support load balancing in dynamic environments where RDF data are continuously inserted and updated. The proposed scheme creates clusters and sub-clusters according to how frequently the RDF data are used by queries and uses them to set the graph partitioning criteria. The created clusters and sub-clusters are then partitioned by considering the workloads and data sizes of the servers. In this way, the concentration of data on a particular server caused by continuous insertion and update of RDF data is resolved, and the load is distributed among the servers in dynamic environments. Performance evaluation shows that the proposed scheme significantly improves query processing time over the existing scheme.
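
The abstract names workload and data size as the balancing criteria but not the placement rule, so the sketch below is a generic load-balancing stand-in rather than the paper's method: clusters are ordered by load (query frequency times size) and assigned greedily to the currently least-loaded server. The tuple layout and names are assumptions.

```python
import heapq

def assign_clusters(clusters, n_servers):
    """Greedy balanced placement: sort clusters by load (frequency * size)
    and always give the next cluster to the least-loaded server.

    clusters: list of (cluster_id, query_frequency, size_in_triples)
    returns:  {server_index: [cluster_id, ...]}
    """
    heap = [(0.0, s) for s in range(n_servers)]       # (current load, server index)
    heapq.heapify(heap)
    placement = {s: [] for s in range(n_servers)}
    for cid, freq, size in sorted(clusters, key=lambda c: c[1] * c[2], reverse=True):
        load, server = heapq.heappop(heap)
        placement[server].append(cid)
        heapq.heappush(heap, (load + freq * size, server))
    return placement
```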

Protein Motif Extraction via Feature Interval Selection

  • Sohn, In-Suk;Hwang, Chang-Ha;Ko, Jun-Su;Chiu, David;Hong, Dug-Hun
    • Journal of the Korean Data and Information Science Society / v.17 no.4 / pp.1279-1287 / 2006
  • The purpose of this paper is to present a new algorithm for extracting the consensus pattern, or motif, from sequences belonging to the same family. Two methods of feature interval partitioning are considered: equal-probability and equal-width interval partitioning. C2H2 zinc finger and epidermal growth factor protein sequences are used to demonstrate the effectiveness of the proposed algorithm for motif extraction. For both protein families, equal-width interval partitioning performs better than equal-probability interval partitioning.
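
Equal-width and equal-probability interval partitioning are standard discretization schemes; the sketch below only contrasts them on a skewed synthetic feature to show how differently they place bin boundaries. It is not the motif-extraction algorithm itself, and all data and parameter values are illustrative.

```python
import numpy as np

def equal_width_edges(values, n_bins):
    """Bin edges that divide the value range into equally wide intervals."""
    return np.linspace(values.min(), values.max(), n_bins + 1)

def equal_probability_edges(values, n_bins):
    """Bin edges chosen as quantiles, so each interval holds roughly the
    same number of observations."""
    return np.quantile(values, np.linspace(0.0, 1.0, n_bins + 1))

# Example: discretize a skewed numeric feature both ways and compare occupancy.
rng = np.random.default_rng(0)
feature = rng.exponential(scale=2.0, size=500)
for name, edges in [("equal width", equal_width_edges(feature, 5)),
                    ("equal probability", equal_probability_edges(feature, 5))]:
    counts, _ = np.histogram(feature, bins=edges)
    print(name, counts)
```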
