http://dx.doi.org/10.13088/jiis.2011.17.3.063

Methods for Integration of Documents using Hierarchical Structure based on the Formal Concept Analysis  

Kim, Tae-Hwan (Department of Computer Science and Engineering, Hanyang University)
Jeon, Ho-Cheol (Department of Computer Science and Engineering, Hanyang University)
Choi, Joong-Min (Department of Computer Science and Engineering, Hanyang University)
Publication Information
Journal of Intelligence and Information Systems / v.17, no.3, 2011, pp. 63-77
Abstract
The World Wide Web is a very large distributed digital information space. From its origins in 1991, the web has grown to encompass diverse information resources such as personal home pages, online digital libraries, and virtual museums. Some estimates suggest that the web currently includes over 500 billion pages in the deep web. The ability to search and retrieve information from the web efficiently and effectively is an enabling technology for realizing its full potential. With powerful workstations and parallel processing technology, efficiency is not a bottleneck. In fact, some existing search tools sift through gigabyte-size precompiled web indexes in a fraction of a second. But retrieval effectiveness is a different matter. Current search tools retrieve too many documents, of which only a small fraction are relevant to the user query. Furthermore, the most relevant documents do not necessarily appear at the top of the query output order. Also, current search tools cannot retrieve, from a gigantic collection of documents, the documents related to a retrieved document. The most important problem for many current search systems is to increase the quality of search, that is, to provide related documents and to keep the number of unrelated documents in the search results as low as possible. To address this problem, CiteSeer proposed ACI (Autonomous Citation Indexing) of articles on the World Wide Web. A "citation index" indexes the links between articles that researchers make when they cite other articles. Citation indexes are very useful for a number of purposes, including literature search and analysis of the academic literature. In more detail, references contained in academic articles are used to give credit to previous work in the literature and provide a link between the "citing" and "cited" articles. A citation index indexes the citations that an article makes, linking the articles with the cited works.
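As a minimal sketch of the citation index idea described above, the citing-to-cited links can be stored together with their inverse. The paper identifiers and the `build_citation_index` helper are hypothetical and are not taken from CiteSeer's implementation.

```python
from collections import defaultdict


def build_citation_index(references):
    """Return (cites, cited_by) for a mapping of paper -> list of cited papers."""
    # Links from each citing paper to the works it cites.
    cites = {paper: set(cited) for paper, cited in references.items()}
    # Inverse links from each cited work to the papers that cite it.
    cited_by = defaultdict(set)
    for paper, cited in cites.items():
        for target in cited:
            cited_by[target].add(paper)
    return cites, dict(cited_by)


# Hypothetical toy data: paper_c cites paper_a and paper_b, and so on.
references = {
    "paper_c": ["paper_a", "paper_b"],
    "paper_b": ["paper_a"],
    "paper_a": [],
}
cites, cited_by = build_citation_index(references)
print(sorted(cites["paper_c"]))     # works that paper_c cites
print(sorted(cited_by["paper_a"]))  # papers that cite paper_a
```

Keeping both mappings means a lookup in either direction is a single dictionary access, at the cost of storing every link twice.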
Citation indexes were originally designed mainly for information retrieval. The citation links allow navigating the literature in unique ways. Papers can be located independent of language and of the words in the title, keywords, or document. A citation index allows navigation backward in time (the list of cited articles) and forward in time (which subsequent articles cite the current article?). But CiteSeer cannot index links between articles that researchers do not make, because it indexes only the links that researchers create when they cite other articles. For the same reason, CiteSeer does not scale easily. These problems motivate us to design a more effective search system.

This paper presents a method that extracts the subject and predicate of each sentence in a document. A document is converted into a tabular form in which each extracted predicate is checked against its possible subjects and objects. We build a hierarchical graph of a document from the table and then integrate the graphs of the documents. From the graph of the entire document collection, we calculate the area of each document relative to the integrated documents, and we mark the relations among documents by comparing their areas. We also propose a method for the structural integration of documents that retrieves documents from the graph, which makes it easier for the user to find information. We compared the performance of the proposed approaches with the Lucene search engine using the formulas for ranking. As a result, the F-measure is about 60%, which is about 15% better than Lucene.
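The tabular form and hierarchical graph can be illustrated with standard Formal Concept Analysis. The sketch below is only an illustration under stated assumptions: the (term, predicate) pairs are hypothetical stand-ins for the output of sentence parsing, the helper names `build_context` and `formal_concepts` are invented here, and the abstract does not specify the authors' exact table layout or integration step.

```python
from collections import defaultdict


def build_context(pairs):
    """Build the cross table: each term maps to the predicates it occurs with."""
    context = defaultdict(set)
    for term, predicate in pairs:
        context[term].add(predicate)
    return dict(context)


def formal_concepts(context):
    """Return every (extent, intent) pair of the formal context."""
    all_attributes = frozenset().union(*context.values())
    # Every intent is an intersection of object rows; the full attribute
    # set plays the role of the empty intersection.
    intents = {all_attributes}
    for attrs in context.values():
        intents |= {intent & attrs for intent in intents}
    concepts = []
    for intent in intents:
        # The extent collects all terms that have every attribute in the intent.
        extent = frozenset(t for t, attrs in context.items() if intent <= attrs)
        concepts.append((extent, intent))
    return concepts


# Hypothetical (term, predicate) pairs, assumed to come from a parser.
pairs = [
    ("engine", "retrieve"),
    ("engine", "rank"),
    ("crawler", "retrieve"),
    ("index", "store"),
]
context = build_context(pairs)
concepts = formal_concepts(context)

# Print from the most general concept (largest extent) to the most specific.
for extent, intent in sorted(concepts, key=lambda c: (-len(c[0]), sorted(c[1]))):
    print(sorted(extent), sorted(intent))
```

Ordering the concepts by extent inclusion gives the hierarchy: a concept whose extent contains another's sits above it in the graph. This brute-force enumeration is adequate for small tables only, since the number of concepts can grow exponentially with the table size.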
Keywords
Related Documents Retrieval; Integration of Documents; Semantic Expansion;
Citations & Related Records
  • Reference
1 Giles, C. L., "CiteSeer : Past, Present, and Future", Computer Science Advances in Web Intelligence, 2004.
2 Boyer, R. S. and J. S. Moore, "A fast string searching algorithm", CACM, Vol.20, No.10 (1977), 762-772.
3 CiteSeer, http://citeseer.ist.psu.edu.
4 van Eijck, J. and J. Zwarts, "Formal Concept Analysis and Prototypes", 2004.
5 Giles, C. L., K. D. Bollacker and S. Lawrence, "CiteSeer : An Automatic Citation Indexing System", ACM Conference, 1998.
6 Mingjun, L., Y. Shui, B. Ruth, and Z. Walei, "A Co-Recommendation Algorithm for Web Searching", Fifth International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP '02). IEEE International Conference, 2002.
7 Cimiano, P., A. Hotho and S. Staab, "Learning Concept Hierarchies from Text Corpora using Formal Concept Analysis", AI Access Foundation, 2005.
8 Salton, G., "Automatic Text Processing", Addison-Wesley Publishing Company, 229-236, 1989.
9 Manning, C. and D. Klein, "Fast Exact Inference with a Factored Model for Natural Language Parsing", Advances in Neural Information Processing Systems, Vol.15 (NIPS 2002).
10 Meadow, C. T., "Text Information Retrieval Systems", Academic Press, Inc, 1992, 201-211.