[KSCI] Korea Science Citation Index Service

Term Clustering and Duplicate Distribution for Efficient Parallel Information Retrieval

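Jae-Ho Kang (강재호), Dong-A University, Intelligent Integrated Port Management Research Center
Jae-Wan Yang (양재완), Onbit System, Information Technology Research Institute
Sung-Won Jung (정성원), Onbit System, Information Technology Research Institute
Kwang Ryel Ryu (류광렬), Pusan National University, School of Information and Computer Engineering
Hyuk-Chul Kwon (권혁철), Pusan National University, School of Information and Computer Engineering
Sang-Hwa Chung (정상화), Pusan National University, School of Information and Computer Engineering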

Publication Information

Journal of KIISE: Software and Applications / v.30, no.1_2, 2003, pp. 129-139

Abstract

The PC cluster architecture is regarded as a cost-effective alternative to conventional supercomputers for building a high-performance information retrieval (IR) system. To implement an efficient IR system on a PC cluster, the data must be distributed across the local hard disks of the PCs so that disk I/O and the subsequent computation are spread as evenly as possible over all the PCs, thereby maximizing parallelism. If the terms in the inverted index file can be grouped into clusters of closely related terms, parallelism can be maximized by distributing the terms of each cluster across the PCs in an interleaved manner. One goal of this research is to develop methods for automatically clustering terms according to how likely they are to co-occur in the same query. We also propose a method for duplicate distribution of inverted index records among the PCs, which provides fault tolerance as well as dynamic load balancing. Experiments with a large corpus show that the proposed methods are efficient and effective.
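
The abstract does not give the algorithms themselves, so the following Python sketch is only an illustration of the three ideas it names, under stated assumptions. Clustering is approximated by merging terms that co-occur in a query log (a simple union-find stand-in, not the paper's method), interleaving is done round-robin, and the duplicate of each inverted index record is placed on the next PC. The function names `cluster_terms`, `distribute`, and `assign_query`, the `min_cooccurrence` threshold, and the sample queries are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def cluster_terms(queries, min_cooccurrence=1):
    """Group terms that co-occur in queries (assumed stand-in for the paper's clustering)."""
    counts = defaultdict(int)
    terms = set()
    for query in queries:
        unique = sorted(set(query))
        terms.update(unique)
        for a, b in combinations(unique, 2):
            counts[(a, b)] += 1

    parent = {t: t for t in terms}

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path halving
            t = parent[t]
        return t

    # Merge any two terms seen together often enough.
    for (a, b), n in counts.items():
        if n >= min_cooccurrence:
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[max(ra, rb)] = min(ra, rb)

    clusters = defaultdict(list)
    for t in sorted(terms):
        clusters[find(t)].append(t)
    return [clusters[root] for root in sorted(clusters)]


def distribute(clusters, num_pcs):
    """Interleave cluster members over the PCs; keep a duplicate on the next PC.

    Assumes num_pcs >= 2 so that the duplicate lands on a different machine.
    """
    placement = {}
    slot = 0
    for cluster in clusters:
        for term in cluster:
            primary = slot % num_pcs
            placement[term] = (primary, (primary + 1) % num_pcs)
            slot += 1
    return placement


def assign_query(query, placement, num_pcs, failed=frozenset()):
    """For each query term, pick the least-loaded live PC that holds a copy."""
    load = [0] * num_pcs
    plan = {}
    for term in query:
        live = [pc for pc in placement[term] if pc not in failed]
        if not live:
            raise RuntimeError(f"no live copy of {term!r}")
        chosen = min(live, key=lambda pc: load[pc])
        load[chosen] += 1
        plan[term] = chosen
    return plan


# Hypothetical query log, used only to exercise the sketch.
queries = [["parallel", "retrieval"], ["parallel", "cluster"], ["disk", "cache"]]
clusters = cluster_terms(queries)
placement = distribute(clusters, num_pcs=3)
plan = assign_query(["parallel", "retrieval", "cluster"], placement, num_pcs=3)
```

In this toy run the three co-occurring terms end up in one cluster and are served by three different PCs. Passing `failed={0}` to `assign_query` reroutes the affected terms to their duplicates, which is the sense in which duplication supports both fault tolerance and load balancing here.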

Keywords

parallel information retrieval; term clustering; PC cluster; fault tolerance

Citations & Related Records

Times Cited By KSCI: 1

Reference

1	Samanta, R., Zheng, J., Funkhouser, T., Li, K. and Singh, J. P., 'Load Balancing for Multi-Projector Rendering Systems,' SIGGRAPH/Eurographics Workshop on Graphics Hardware, August, 1999
2	Lin, Z. and Zhou, S., 'Parallelizing I/O intensive applications for a workstation cluster: a case study,' Computer Architecture News 21, 5, pp.15-22, 1993
3	Jeong, B. and Omiecinski, E., 'Inverted File Partitioning Schemes in Multiple Disk Systems,' IEEE Transactions on Parallel and Distributed Systems, 6(2):142-153, 1995
4	Sornil, O. and Fox, E. A., 'Hybrid partitioned inverted indices for large-scale digital libraries,' Proceedings of The 4th International Conference of Asian Digital Library, Bangalore, India, Dec. 10-12, 2001
5	Kang, Y.-K., Ryu, K. R. and Chung, S.-H., 'An Efficient Parallel Information Retrieval System Based on Document Clustering,' Journal of KIISE: Software and Applications, Vol. 28, No. 2, pp.157-167, 2001 (in Korean)
6	Stanfill, C. and Thau, R., 'Information Retrieval on the Connection Machine: 1 to 8192 Gigabytes,' Information Processing & Management, pp.285-310, 1991
7	Wolfson, O., Jajodia, S. and Huang, Y., 'An Adaptive Data Replication Algorithm,' ACM Transactions on Database Systems, vol. 22, no.2, pp.255-314, 1997
8	Chung, S-H., Kwon, H-C., Ryu, K. R., Jang, H-K., Kim, J-H. and Choi, C-A., 'Parallel Information Retrieval on an SCI-Based PC-NOW,' Lecture Notes in Computer Science, Vol. 1800, (IPDPS-2000 Workshops, Cancun, Mexico) pp.81-90, 2000
9	Schütze, H. and Silverstein, C., 'Projections for Efficient Document Clustering,' Proceedings of The 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp.74-81, 1997
10	Silverstein, C. and Pedersen, J. O., 'Almost-Constant-Time Clustering of Arbitrary Corpus Subsets,' Proceedings of The 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp.60-66, Philadelphia, Pennsylvania, 1997
11	Salton, G. and Buckley, C., 'Improving retrieval performance by relevance feedback,' Journal of the American Society for Information Science, 41, pp.288-297, 1990
12	Gray, J., Helland, P., O'Neil, P. and Shasha, D., 'The dangers of replication and a solution,' Proceedings of ACM SIGMOD '96, pp.173-182, 1996


Korean title: 효율적인 병렬정보검색을 위한 색인어 군집화 및 분산저장 기법
