
Term Clustering and Duplicate Distribution for Efficient Parallel Information Retrieval  

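강재호 (Intelligent Integrated Port Management Research Center, Dong-A University)
양재완 (Information Technology Research Institute, Onbit System)
정성원 (Information Technology Research Institute, Onbit System)
류광렬 (School of Computer Science and Engineering, Pusan National University)
권혁철 (School of Computer Science and Engineering, Pusan National University)
정상화 (School of Computer Science and Engineering, Pusan National University)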
Abstract
The PC cluster architecture is considered a cost-effective alternative to existing supercomputers for building a high-performance information retrieval (IR) system. To implement an efficient IR system on a PC cluster, it is essential to maximize parallelism by distributing the data across the local hard disks of the PCs so that disk I/O and the subsequent computation are spread as evenly as possible over all the PCs. If the terms in the inverted index file can be grouped into clusters of closely related terms, parallelism can be maximized by distributing the terms of each cluster to the PCs in an interleaved manner. One goal of this research is to develop methods for automatically clustering terms based on how likely they are to co-occur in the same query. We also propose a method for duplicate distribution of inverted index records among the PCs to achieve fault tolerance as well as dynamic load balancing. Experiments with a large corpus demonstrated the efficiency and effectiveness of our method.
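As a rough illustration of the two ideas in the abstract, the sketch below deals clustered terms to nodes in round-robin order and keeps a second copy of each index record on the next node. It is not the authors' algorithm: the function names `distribute_interleaved` and `plan_query`, the "next node" replica rule, and the least-loaded selection policy are all assumptions made for this example, and the term clusters are taken as given rather than learned from query logs.

```python
from collections import defaultdict


def distribute_interleaved(clusters, num_nodes, num_copies=2):
    """Map each term to the list of nodes that hold its index record.

    Terms are dealt round-robin in cluster order, so terms of the same
    cluster land on different nodes. Extra copies go to the following
    nodes (an assumed replica rule, not taken from the paper).
    """
    if not 1 <= num_copies <= num_nodes:
        raise ValueError("num_copies must be between 1 and num_nodes")
    placement = {}
    slot = 0
    for cluster in clusters:
        for term in cluster:
            primary = slot % num_nodes
            placement[term] = [(primary + k) % num_nodes for k in range(num_copies)]
            slot += 1
    return placement


def plan_query(query_terms, placement, failed=frozenset()):
    """For each query term, pick the least-loaded live node holding a copy."""
    load = defaultdict(int)
    plan = {}
    for term in query_terms:
        live = [n for n in placement[term] if n not in failed]
        if not live:
            raise RuntimeError(f"no live replica for {term!r}")
        # Ties are broken by node id so the plan is deterministic.
        node = min(live, key=lambda n: (load[n], n))
        load[node] += 1
        plan[term] = node
    return plan


# Hypothetical clusters of terms that tend to co-occur in queries.
clusters = [["parallel", "cluster", "node"], ["index", "inverted", "posting"]]
placement = distribute_interleaved(clusters, num_nodes=3)

query = ["parallel", "cluster", "node"]
plan = plan_query(query, placement)
plan_after_failure = plan_query(query, placement, failed={0})
```

With three nodes, the three co-occurring query terms are served by three different nodes, and when node 0 is marked as failed the query can still be answered from the remaining copies.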
Keywords
parallel information retrieval; term clustering; PC cluster; fault tolerance