DOI QR코드

DOI QR Code

A Table Integration Technique Using Query Similarity Analysis

  • Received : 2019.02.08
  • Accepted : 2019.03.25
  • Published : 2019.03.29

Abstract

In this paper, we propose a technique to analyze similarity between SQL queries and to assist integrating similar tables. First, the table information was extracted from the SQL queries through the query structure analyzer, and the similarity between the tables was measured using the Jacquard index technique. Then, similar table clusters are generated through hierarchical cluster analysis method and the co-occurence probability of the table used in the query is calculated. The possibility of integrating similar tables is classified by using the possibility of co-occurence of similarity table and table, and classifying them into an integrable cluster, a cluster requiring expert review, and a cluster with low integration possibility. This technique analyzes the SQL query in practice and analyse the possibility of table integration independent of the existing business, so that the existing schema can be effectively reconstructed without interruption of work or additional cost.

Keywords

CPTSCQ_2019_v24n3_105_f0001.png 이미지

Fig. 1. Conceptual diagram for table integration support model

CPTSCQ_2019_v24n3_105_f0002.png 이미지

Fig. 2. Conceptual diagram of preprocessing step

CPTSCQ_2019_v24n3_105_f0003.png 이미지

Fig. 3. Conceptual diagram of analysis step

CPTSCQ_2019_v24n3_105_f0004.png 이미지

Fig. 4. Flowchart for similarity table clustering

CPTSCQ_2019_v24n3_105_f0005.png 이미지

Fig. 5. Procedure for cluster analysis

CPTSCQ_2019_v24n3_105_f0006.png 이미지

Fig. 6. The number of cluster according to average of Silhouette Coefficient

CPTSCQ_2019_v24n3_105_f0007.png 이미지

Fig. 7. Part of the SQL query where the table in the 72 cluster is used

CPTSCQ_2019_v24n3_105_f0008.png 이미지

Fig. 8. Part of the SQL query where the table in the 91 cluster is used

Table 1. Evaluation criteria using average silhouette coefficient of cluster

CPTSCQ_2019_v24n3_105_t0001.png 이미지

Table 2. Table integration criteria based-on co-occurence of tables in a cluster

CPTSCQ_2019_v24n3_105_t0002.png 이미지

Table 3. Part of top list of similar table

CPTSCQ_2019_v24n3_105_t0003.png 이미지

Table 4. Part of co-occurrence table pair and co-occurence probability

CPTSCQ_2019_v24n3_105_t0004.png 이미지

Table 5. Sample clusters with high possibility of table integration

CPTSCQ_2019_v24n3_105_t0005.png 이미지

Table 6. Table structure of 72 cluster with a high possibility of table integration

CPTSCQ_2019_v24n3_105_t0006.png 이미지

Table 7. Sample clusters that need to be reviewed by expert

CPTSCQ_2019_v24n3_105_t0007.png 이미지

Table 8. Table of 91 cluster that need to be reviewed by manually

CPTSCQ_2019_v24n3_105_t0008.png 이미지

References

  1. Jong Suk Lee, Chang Ho Lee, "Modeling on Data Performance for Very Large Database," Proceedings of the Korea Safety Management & Science, No. 1, pp. 383-391, 2012.
  2. Kang Soo Seo, "The Guide for Data Architecture Professional" Korea Data Agency, pp. 214-553, 2013.
  3. R. Y. Wang, V. C. Storey and C. P. Firth, “A Framework for Analysis of Data Quality Research,” Transactions on Knowledge and Data Engineering, Vol. 7, No. 4, pp. 623-640, Aug. 1995. https://doi.org/10.1109/69.404034
  4. Hae Kyung Rhee, "Harmfulness of Denormalization Adopted for Database for Database Performance Enhancement," Journal of the Institute of Electronics and Information Engineers, Vol. 42, No. 3, May 2005.
  5. Hye Young Seo, Seo Young Kwon, Jae Kwon Ahn, Young Jin Kim, "A Case Study on the Implementation of Master Data Management System for Global Manufacturing Company," Entrue Journal of Information Technology, Vol. 7, No. 2, pp. 91-102, Jul. 2008. https://doi.org/10.3923/itj.2008.91.97
  6. S. Castano, V. Antonellis, M. G. Fugini and B. Pernici, "Conceptual Schema Analysis: Techniques and Applications," ACM Transactions on Database Systems, Vol. 23, No. 3, pp. 286-333, Sep. 1998. https://doi.org/10.1145/293910.293150
  7. H. Kopcke and E. Rahm, "Frameworks for entity matching: A comparison," Data & Knowledge Engineering, Vol. 69, No. 2, pp. 197-210, Feb. 2010. https://doi.org/10.1016/j.datak.2009.10.003
  8. Hong Girl Lee et al., "A Study on the Database Integration Methodology using XML," Journal of Korean Navigation and Port Research, Vol. 29, No. 5, pp. 883-890, Dec. 2005. https://doi.org/10.5394/KINPR.2005.29.10.883
  9. Sanjay Madria et al., "An XML Schema Integration and Query Mechanism System," Data & Knowledge Engineering, Vol. 65, No. 2, pp. 265-303, May 2008.
  10. D. P. Groth, "Visual Representation of Database Queries using Structual similarity," Information Visualization, pp. 102-107, 2003.
  11. Yun Hee Han, "Design and Implementation of Database Cache engine based on Similarity Query Matching" Masterr's Thesis, Korea Polytechnic University, 2008.
  12. P. J. Rousseeuw, "Silhouette: a graphical aid to the interpretation and validation of cluster analysis," Journal of Computational and Applied Mathematics, Vol. 20, pp. 53-65, Nov. 1987. https://doi.org/10.1016/0377-0427(87)90125-7
  13. Soojung Lee, "Performance Analysis of Similarity Reflecting Jaccard Index for Solving Data Sparsity in Collaborative Filtering," The Journal of Korean Association of Computer Education, Vol. 19, No. 4, pp. 59-66, Jul. 2016. https://doi.org/10.32431/KACE.2016.19.4.006
  14. L. Kaufman and P. J. Rousseeuw, "Finding Groups in Data: An Introduction to Cluster Analysis," Wiley, New York, 1990.