XML Document Retrieval Models for Heterogeneous Data Set using Independent Regular paths  

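유신재 (Seoul National University, School of Electrical and Computer Engineering)
민경섭 (Seoul National University, Cognitive Science)
김형주 (Seoul National University, School of Computer Science and Engineering)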
Abstract
An XML document may have an irregular structure, which is difficult for end-users to comprehend exactly. For such documents, end-users have difficulty formulating structured queries, so they typically issue either unstructured queries or queries carrying only a little structural information. In this context, we propose new retrieval models that use structural information for ranking and compensate for the difference between the structure of the user query and the structure of the document. To ease querying, we assume independence among the query paths that represent structural constraints. Since this assumption reduces the expressive power of the query language, we also propose a model that overcomes this problem. As there were no test collections for XML documents, we built a small test collection from the TIPSTER collection of TREC and ran experiments on it without structured queries. These experiments showed that our models improve average precision by about 67% over the conventional vector-space model.
Keywords
XML; retrieval model; retrieval engine; information retrieval