A Hybrid Information Retrieval Model Using Metadata and Text

An Integrated Retrieval Model for Metadata and Text Information

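  • 유정목 (Internet Server Group, Digital Home Research Division, Electronics and Telecommunications Research Institute) ;
  • 맹성현 (School of Engineering, Information and Communications University) ;
  • 김성수 (Project Management Department, Business Division, Korea Telecom) ;
  • 이만호 (Division of Electrical, Information and Communication Engineering, Chungnam National University)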
  • Published : 2007.06.15

Abstract

A metadata IR model has high precision and low recall because its queries are strict: a structured query can express the user's information need exactly. A full-text IR model, by contrast, has low precision and high recall because its queries are simple keyword queries that express the information need only roughly. If users can translate their information needs into well-formed structured queries, retrieval results improve. However, it is hardly possible to construct a suitable query without understanding the characteristics of the metadata. Unfortunately, most users are not interested in metadata, so they cannot construct well-formed structured queries. Moreover, metadata contains less information than the text itself. In this paper, we propose a hybrid IR model using metadata and text that provides users with many relevant documents by searching the metadata fields and the text fields complementarily.

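A retrieval model for metadata has high precision because its queries reflect the user's information need accurately, but it has low recall because it excludes any information that does not satisfy the query conditions. A full-text retrieval model, in contrast, searches all documents for a user query, so its precision is low and its recall is high. The high precision of the metadata retrieval model is attainable when users construct queries that fit the structural characteristics of the metadata, but in general users cannot be expected to construct queries that reflect the structure of the metadata. In addition, because metadata contains less information than the full text, its recall is lower than that of a text search. Reflecting these characteristics, this paper proposes an integrated retrieval model that reformulates the user's varied queries to fit the characteristics of the metadata and searches the text as well as the metadata, thereby combining the strengths of both models.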
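The precision/recall trade-off described in the abstract can be illustrated with a toy example. The following is a minimal sketch and not the model proposed in the paper: the `Doc` class, the function names, the four-document corpus, and the linear `weight` combination of the two scores are all assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: str
    metadata: dict  # field name -> value (hypothetical schema)
    text: str


def metadata_match(doc, query):
    """Strict structured match: every field condition must hold exactly."""
    return all(doc.metadata.get(f, "").lower() == v.lower() for f, v in query.items())


def text_score(doc, terms):
    """Fraction of query terms that occur anywhere in the full text."""
    words = set(doc.text.lower().split())
    return sum(t.lower() in words for t in terms) / len(terms) if terms else 0.0


def hybrid_search(docs, query, weight=0.5):
    """Rank documents by a weighted sum of metadata and full-text evidence.

    The field values of the structured query are reused as plain keywords
    on the text side, so a document with incomplete metadata can still be
    retrieved through its text. The linear combination is an assumption.
    """
    terms = list(query.values())
    scored = []
    for doc in docs:
        score = weight * metadata_match(doc, query) + (1 - weight) * text_score(doc, terms)
        if score > 0:
            scored.append((score, doc.doc_id))
    # Highest score first; ties broken by document id for determinism.
    return [doc_id for _, doc_id in sorted(scored, key=lambda p: (-p[0], p[1]))]


def precision_recall(retrieved, relevant):
    hits = len(set(retrieved) & set(relevant))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy corpus: d2 is relevant but its metadata lacks the "subject" field.
docs = [
    Doc("d1", {"author": "kim", "subject": "retrieval"}, "a study of metadata retrieval by kim"),
    Doc("d2", {"author": "kim"}, "kim describes retrieval with inference networks"),
    Doc("d3", {"author": "lee", "subject": "retrieval"}, "lee surveys retrieval models"),
    Doc("d4", {"author": "park", "subject": "databases"}, "park on storage engines"),
]
query = {"author": "kim", "subject": "retrieval"}
relevant = {"d1", "d2"}

metadata_only = [d.doc_id for d in docs if metadata_match(d, query)]
text_only = [d.doc_id for d in docs if text_score(d, list(query.values())) > 0]
hybrid = hybrid_search(docs, query)

print("metadata only:", metadata_only, precision_recall(metadata_only, relevant))
print("text only:    ", text_only, precision_recall(text_only, relevant))
print("hybrid ranking:", hybrid)
```

In this toy setting the strict metadata query finds only `d1` (precise, but it misses `d2`), the keyword search also returns the non-relevant `d3`, and the combined ranking places both relevant documents ahead of `d3`. How the paper actually combines the two sources of evidence is not stated in the abstract.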
