Locality-Sensitive Hashing for Data with Categorical and Numerical Attributes Using Dual Hashing

Lee, Keon Myung

doi:10.5391/IJFIS.2014.14.2.98

International Journal of Fuzzy Logic and Intelligent Systems

Volume 14, Issue 2, Pages 98-104, 2014
pISSN 1598-2645, eISSN 2093-744X

Korean Institute of Intelligent Systems (한국지능시스템학회)

Lee, Keon Myung (Department of Computer Science, Chungbuk National University)

Received : 2014.06.01
Accepted : 2014.06.24
Published : 2014.06.25

Abstract

Locality-sensitive hashing techniques have been developed to efficiently handle nearest neighbor search and similar-pair identification for large volumes of high-dimensional data. This study proposes a locality-sensitive hashing method for nearest neighbor search over data sets that contain both numerical and categorical attributes. The proposed method uses dual hashing functions, one dedicated to numerical attributes and the other to categorical attributes. It builds an indexing structure for each of the two hashing functions, gathers and combines the candidate sets that they return, and then examines the combined candidates exhaustively to determine the nearest ones. The method is evaluated on a few synthetic data sets, and the results show that it improves performance for large data sets with both numerical and categorical attributes.
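
The pipeline described in the abstract (one index per hash family, combination of candidate sets, exact examination of the candidates) can be illustrated with a minimal Python sketch. The abstract does not specify the hash functions, the combination rule, or the distance measure, so everything concrete here is an assumption made for illustration: the numerical hash quantizes random Gaussian projections, the categorical hash keys on a random subset of attribute positions, candidates are combined by set union, and the final ranking uses Euclidean distance plus the number of mismatched categorical values. The names DualLSH and mixed_distance and all parameter values are hypothetical.

```python
import math
import random
from collections import defaultdict


def mixed_distance(a, b):
    """Assumed measure: Euclidean distance on the numerical part
    plus the count of mismatched categorical values."""
    (xa, ca), (xb, cb) = a, b
    return math.dist(xa, xb) + sum(u != v for u, v in zip(ca, cb))


class DualLSH:
    """Illustrative dual-hash index; not the paper's exact construction."""

    def __init__(self, num_dim, cat_dim, n_tables=8, n_proj=2,
                 n_sampled=2, width=4.0, seed=0):
        rng = random.Random(seed)
        self.width = width
        # Numerical side (assumed): Gaussian projections with random offsets.
        self.proj = [
            [([rng.gauss(0, 1) for _ in range(num_dim)], rng.uniform(0, width))
             for _ in range(n_proj)]
            for _ in range(n_tables)
        ]
        # Categorical side (assumed): random subsets of attribute positions.
        self.samp = [rng.sample(range(cat_dim), min(n_sampled, cat_dim))
                     for _ in range(n_tables)]
        self.num_tables = [defaultdict(list) for _ in range(n_tables)]
        self.cat_tables = [defaultdict(list) for _ in range(n_tables)]
        self.items = []

    def _num_key(self, t, x):
        # Quantize each projection into a bucket of size `width`.
        return tuple(
            math.floor((sum(a * v for a, v in zip(vec, x)) + b) / self.width)
            for vec, b in self.proj[t]
        )

    def _cat_key(self, t, c):
        # Concatenate the sampled categorical values.
        return tuple(c[i] for i in self.samp[t])

    def add(self, x, c):
        """Index one record in both families of hash tables."""
        idx = len(self.items)
        self.items.append((x, c))
        for t in range(len(self.num_tables)):
            self.num_tables[t][self._num_key(t, x)].append(idx)
            self.cat_tables[t][self._cat_key(t, c)].append(idx)

    def query(self, x, c, k=1):
        """Gather candidates from both indexes, then rank them exactly."""
        cand = set()
        for t in range(len(self.num_tables)):
            cand.update(self.num_tables[t].get(self._num_key(t, x), ()))
            cand.update(self.cat_tables[t].get(self._cat_key(t, c), ()))
        ranked = sorted(cand, key=lambda i: mixed_distance((x, c), self.items[i]))
        return ranked[:k]


index = DualLSH(num_dim=2, cat_dim=2)
index.add((0.0, 0.0), ("a", "x"))
index.add((10.0, 10.0), ("b", "y"))
index.add((0.1, 0.0), ("a", "x"))
print(index.query((0.0, 0.0), ("a", "x"), k=2))
```

Taking the union of the two candidate sets favors recall, since a record that collides with the query under either hash family is examined. An intersection would shrink the candidate set at the cost of missed neighbors. Which rule the paper adopts is not stated in the abstract.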
