Locality-Sensitive Hashing for Data with Categorical and Numerical Attributes Using Dual Hashing

  • Lee, Keon Myung (Department of Computer Science, Chungbuk National University)
  • Received : 2014.06.01
  • Accepted : 2014.06.24
  • Published : 2014.06.25

Abstract

Locality-sensitive hashing techniques have been developed to efficiently handle nearest neighbor search and similar-pair identification problems for large volumes of high-dimensional data. This study proposes a locality-sensitive hashing method for nearest neighbor search on data sets that contain both numerical and categorical attributes. The proposed method uses dual hashing functions: one function is dedicated to the numerical attributes and the other to the categorical attributes. The method consists of building an indexing structure for each of the two hashing functions, gathering and combining the candidate sets they return, and examining the combined candidates exhaustively to determine the nearest ones. The method is evaluated on a few synthetic data sets, and the results show that it improves performance for large data sets with both numerical and categorical attributes.
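The pipeline described in the abstract (two hash functions, one index per function, combined candidate sets, exhaustive check of the candidates) can be illustrated with a minimal Python sketch. The abstract does not specify the hash families, the combination rule, or the distance measure, so everything concrete here is an assumption made for illustration: quantized random projections for the numerical part, sampled attribute values for the categorical part, set union to combine candidates, and Euclidean distance plus mismatch count as the mixed distance. The names `DualLSH` and `mixed_distance` are hypothetical and are not taken from the paper.

```python
import math
import random
from collections import defaultdict


def mixed_distance(p, q):
    """Assumed mixed distance: Euclidean on numbers plus mismatch count on categories."""
    (x1, c1), (x2, c2) = p, q
    return math.dist(x1, x2) + sum(a != b for a, b in zip(c1, c2))


class DualLSH:
    """Illustrative dual-hash index; not the paper's exact construction."""

    def __init__(self, num_dim, cat_dim, k_num=2, k_cat=1, width=4.0, seed=0):
        rng = random.Random(seed)
        # Numerical hash (assumed): k_num random projections, quantized by `width`.
        self.proj = [[rng.gauss(0, 1) for _ in range(num_dim)] for _ in range(k_num)]
        self.offs = [rng.uniform(0, width) for _ in range(k_num)]
        self.width = width
        # Categorical hash (assumed): values at k_cat sampled attribute positions.
        self.cat_idx = rng.sample(range(cat_dim), k_cat)
        # One index structure per hash function.
        self.num_index = defaultdict(list)
        self.cat_index = defaultdict(list)
        self.items = []

    def _num_key(self, x):
        return tuple(
            math.floor((sum(a * v for a, v in zip(p, x)) + b) / self.width)
            for p, b in zip(self.proj, self.offs)
        )

    def _cat_key(self, c):
        return tuple(c[i] for i in self.cat_idx)

    def add(self, x, c):
        """Store a record (numerical part x, categorical part c) in both indexes."""
        i = len(self.items)
        self.items.append((x, c))
        self.num_index[self._num_key(x)].append(i)
        self.cat_index[self._cat_key(c)].append(i)
        return i

    def candidates(self, x, c):
        """Gather candidates from each index and combine them (assumed: union)."""
        from_num = self.num_index.get(self._num_key(x), [])
        from_cat = self.cat_index.get(self._cat_key(c), [])
        return set(from_num) | set(from_cat)

    def query(self, x, c, top=1):
        """Examine every candidate with the full distance and keep the nearest."""
        cand = self.candidates(x, c)
        return sorted(cand, key=lambda i: mixed_distance((x, c), self.items[i]))[:top]


index = DualLSH(num_dim=2, cat_dim=2)
a = index.add((0.0, 0.0), ("red", "small"))
b = index.add((10.0, 10.0), ("blue", "large"))
nearest = index.query((0.0, 0.0), ("red", "small"))
```

Taking the union of the two candidate sets favors recall over candidate-set size; an intersection would shrink the set at the risk of missing true neighbors. Which trade-off the paper adopts is not stated in the abstract.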
