http://dx.doi.org/10.5391/IJFIS.2014.14.2.98

Locality-Sensitive Hashing for Data with Categorical and Numerical Attributes Using Dual Hashing  

Lee, Keon Myung (Department of Computer Science, Chungbuk National University)
Publication Information
International Journal of Fuzzy Logic and Intelligent Systems / v.14, no.2, 2014, pp. 98-104
Abstract
Locality-sensitive hashing techniques have been developed to efficiently handle nearest neighbor search and similar-pair identification problems for large volumes of high-dimensional data. This study proposes a locality-sensitive hashing method that can be applied to nearest neighbor search problems for data sets containing both numerical and categorical attributes. The proposed method uses dual hashing functions, where one function is dedicated to the numerical attributes and the other to the categorical attributes. The method consists of creating an indexing structure for each of the dual hashing functions, gathering and combining the candidate sets retrieved from them, and examining the combined candidates exhaustively to determine the nearest ones. The proposed method is evaluated on a few synthetic data sets, and the results show that it improves performance for large data sets with both numerical and categorical attributes.
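The three steps named in the abstract (one index per hash function, combining the candidate sets, and examining the candidates) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the abstract does not specify the hash families or the distance measure, so the random-projection hash for numerical attributes, the attribute-subset hash for categorical attributes, the `mixed_distance` function, and all names and parameters (`DualLSHIndex`, `n_proj`, `n_cat`, `width`) are hypothetical choices made for the example.

```python
import math
import random
from collections import defaultdict


def mixed_distance(p, q):
    """Assumed distance: Euclidean on numerical part plus mismatch count on categorical part."""
    (x1, c1), (x2, c2) = p, q
    return math.dist(x1, x2) + sum(u != v for u, v in zip(c1, c2))


class DualLSHIndex:
    """Illustrative dual-hash index; the hash families are assumptions, not the paper's."""

    def __init__(self, num_dim, cat_dim, n_proj=4, n_cat=2, width=4.0, seed=0):
        rng = random.Random(seed)
        self.width = width
        # Numerical hash: random Gaussian projections quantised into buckets of size `width`.
        self.projs = [
            ([rng.gauss(0, 1) for _ in range(num_dim)], rng.uniform(0, width))
            for _ in range(n_proj)
        ]
        # Categorical hash: the values of a random subset of the categorical attributes.
        self.cat_idx = sorted(rng.sample(range(cat_dim), n_cat))
        self.num_table = defaultdict(set)  # index for the numerical hash
        self.cat_table = defaultdict(set)  # index for the categorical hash
        self.items = []

    def _num_key(self, x):
        return tuple(
            math.floor((sum(a * v for a, v in zip(vec, x)) + b) / self.width)
            for vec, b in self.projs
        )

    def _cat_key(self, c):
        return tuple(c[i] for i in self.cat_idx)

    def add(self, x, c):
        """Store one record in both indexes and return its id."""
        item_id = len(self.items)
        self.items.append((x, c))
        self.num_table[self._num_key(x)].add(item_id)
        self.cat_table[self._cat_key(c)].add(item_id)
        return item_id

    def query(self, x, c, k=1):
        """Combine candidates from both indexes, then rank them by exact distance."""
        candidates = (
            self.num_table.get(self._num_key(x), set())
            | self.cat_table.get(self._cat_key(c), set())
        )
        ranked = sorted(candidates, key=lambda i: mixed_distance((x, c), self.items[i]))
        return ranked[:k]
```

Taking the union of the two candidate sets is one possible way to combine them; an intersection or a weighted scheme would trade recall for fewer exact distance computations, and the abstract does not say which the author uses.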
Keywords
Locality sensitive hashing; Data analysis; Search; Hashing